Analyze Categorical Data
00:00 Analyze Categorical Data. To process bigger chunks of information, the human mind consciously and unconsciously sorts data into categories. This technique is often useful, but it’s far from flawless.
00:15 Sometimes we put things into a category that, on further examination, aren’t all that similar. In this section, you’ll get to know some tools for examining categories and verifying whether a given categorization makes sense.
00:27 Many data sets already contain some explicit or implicit categorization. In the current example, the 173 majors are divided into 16 categories. A basic use of categories is grouping and aggregation.
00:42 You can use .groupby() to determine how popular each of the categories in the college major dataset is. With .groupby(), you create a DataFrameGroupBy object.
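As a minimal sketch of this grouping step, the snippet below uses a tiny, made-up DataFrame instead of the real college majors data. The column names Major_category and Total, along with all the numbers, are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-in for the college majors data; the column names and
# numbers are assumptions for illustration only.
df = pd.DataFrame(
    {
        "Major": ["Accounting", "Finance", "History", "Philosophy"],
        "Major_category": ["Business", "Business", "Humanities", "Humanities"],
        "Total": [200, 150, 80, 40],
    }
)

# Grouping on its own does not aggregate anything yet. It returns an
# object that remembers how the rows are split into groups.
grouped = df.groupby("Major_category")
print(type(grouped).__name__)  # DataFrameGroupBy
print(grouped.ngroups)  # 2
```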
00:56 With the .sum() method, you create a Series.
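The transcript does not show the exact expression used in the lesson, so the following is only one plausible way to get per-category totals. It again relies on made-up data, and the column names Major_category and Total are assumptions.

```python
import pandas as pd

# Made-up stand-in for the college majors data (illustrative values only).
df = pd.DataFrame(
    {
        "Major_category": ["Business", "Business", "Humanities", "Humanities"],
        "Total": [200, 150, 80, 40],
    }
)

# Select one numeric column from the grouped object, then sum it within
# each category. The result is a Series indexed by category name.
cat_totals = df.groupby("Major_category")["Total"].sum().sort_values()
print(type(cat_totals).__name__)  # Series
print(cat_totals)
```

Sorting the totals is optional. It is included here because an ordered Series tends to produce a bar plot that is easier to read.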
01:09 Let’s draw a horizontal bar plot showing all the category totals in cat_totals.
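A hedged sketch of such a plot, assuming matplotlib is installed and using a small made-up Series in place of the real cat_totals (the category values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up totals standing in for the real cat_totals Series.
cat_totals = pd.Series(
    {"Humanities": 120, "Engineering": 200, "Business": 350},
    name="Total",
)

# kind="barh" draws one horizontal bar per index label and returns the
# matplotlib Axes, which can then be customized further.
ax = cat_totals.plot(kind="barh")
ax.set_xlabel("Total")
```

In a Jupyter Notebook, the figure is typically rendered below the cell. In a plain script, you would usually need to show or save the figure explicitly.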
01:20 You should see a plot with one horizontal bar for each category. As your plot shows, business is by far the most popular major category. While humanities and liberal arts is the clear second, the rest of the fields are more similar in popularity. With groups clearly established, in the next section you’ll see the best way to visually compare ratios.
Bartosz Zaczyński RP Team on April 8, 2022
@pnmcdos It’s somewhat similar to a regular Python interpreter. When you run the python command, you’ll start an interactive Python interpreter session known as a REPL (Read-Evaluate-Print Loop). It reads your commands, evaluates them, and prints the corresponding results immediately onto the screen.
It does so by calling repr() on the result of the last evaluated expression, behind the scenes, in order to turn that result into a textual representation. For example, a date object will have the following representation:
>>> from datetime import date
>>> date.today()
datetime.date(2022, 4, 8)
It’s almost the same as if you called repr() on that object yourself:
>>> repr(date.today())
'datetime.date(2022, 4, 8)'
The only difference is that the string will be enclosed in single quotes.
Now, when you call print() on something, it will in turn call str() on that object for you, which might produce a slightly different textual representation:
>>> print(date.today())
2022-04-08
>>> str(date.today())
'2022-04-08'
Jupyter Notebooks take the same idea, but instead of producing textual output, they can leverage much more visually appealing representations thanks to running in a web browser, which often lets you interact with those representations. The specific representation will depend on the type of object at hand. In this case, the variable references a Pandas data frame object, which the notebook can display as tabular data.
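A rough sketch of that mechanism, assuming a tiny made-up data frame: IPython-based frontends such as Jupyter look for optional rich-display hooks like _repr_html_() on an object and fall back to the plain repr() when none is available. Calling the hook by hand is only for illustration, since the notebook normally does it for you.

```python
import pandas as pd

# Tiny made-up data frame (illustrative values only).
df = pd.DataFrame(
    {"Major_category": ["Business", "Humanities"], "Total": [350, 120]}
)

# Plain-text form, as a regular REPL would show it.
text = repr(df)
print(text)

# HTML form that a notebook frontend can render as a table.
html = df._repr_html_()
print("<table" in html)  # True
```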
pnmcdos on April 8, 2022
Why is it that we can print the variables without having to write print? Line 24, cat_totals, for example. Is this a Jupyter functionality?