Preparing and Setting Up Your Environment
00:00 Set Up Your Environment. You can best follow along with the code in this course in a Jupyter Notebook. This way, you’ll immediately see your plots and be able to play around with them.
00:12 You’ll also need a working Python environment including pandas. If you don’t have one yet, then you have several options. If you have more ambitious plans, then download the Anaconda distribution.
00:25 It’s huge, at about 500 megabytes, but you will be equipped for most data science work. If you prefer a minimalist setup, then check out the section on installing Miniconda in this Real Python course, Setting Up Python for Machine Learning on Windows.
00:41 If you want to go bare bones and stick to pip, then install the libraries discussed in this course with pip install pandas matplotlib.
00:54 You can also install Jupyter Notebook with pip install jupyterlab.
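If you'd like to confirm that the installation worked before moving on, a quick check along these lines may help. This is only a sketch: the helper name missing_packages is made up for illustration, and it tests whether each package can be found for import without checking versions.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found for import.

    Hypothetical helper for illustration; it is not part of the course.
    """
    return [name for name in names if find_spec(name) is None]


# The two libraries the course asks you to install with pip.
required = ["pandas", "matplotlib"]

missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as not installed, rerunning the pip command in the same environment is the usual next step.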
01:06 If you don’t want to do any setup or are working on a system where you can’t install these libraries, then follow along in an online Jupyter Notebook trial.
01:16 Once your environment is set up, you’re ready to download a dataset. In this video course, you’re going to analyze data on college majors, originally sourced from the American Community Survey 2010-2012 Public Use Microdata Sample.
01:31 It served as the basis for the Economic Guide To Picking A College Major featured on the website FiveThirtyEight, and you'll download the dataset from their GitHub repository.
01:42 For most of the rest of the course, the work that you see me doing will be done in a Jupyter Notebook, and if you’re not familiar, I’m just going to run through how you start this.
01:51 Once everything is installed, you can run a Notebook by typing jupyter notebook in the terminal. Note that this will start up a web server and open your browser to allow you to see the Notebooks in the current folder.
02:03 You can create a new one by going to New and picking Python 3. Rename it by clicking the title, which starts out as Untitled, and entering the name that you want the notebook to have. It’s important to save your work, and you can do this with File > Save and Checkpoint or with the keyboard shortcut for your operating system, shown in the menu onscreen.
02:26 Help is available by pressing H, and as you can see, there are lots of shortcuts, and it’s useful to learn as many of these as you can. The important concept with Jupyter is that there are two modes: command mode and edit mode.
02:41 In command mode, you control the cells themselves. And in edit mode, you control the contents of those cells, which can be either commands, which are run by the Python interpreter,
02:54 or Markdown, which allows you to enter richly formatted text in cells which aren’t run by Python. With a cell active, such as this one with the flashing cursor here, you can change into command mode by hitting Escape and change the cell’s type by pressing M to switch it to Markdown and Y to switch it back to code. Enter will normally take you back into the cell to enter code, but sometimes you’ll need to click in it with the mouse.
03:22 You can then enter your commands, as seen here.
03:28 If you want to generate a new cell, there are a number of ways of doing so but the easiest way is to run the code in the final cell and create a new one at the same time by pressing Shift and tapping Enter. That runs the current cell, and you can see the number 2 appears next to it to show that’s the second cell that’s been run. And a new cell is ready underneath, where you can enter some more commands and run them!
03:52 You should be able to run all of the code in this course using just these few shortcuts, but if you want to learn more, Real Python has got you covered with this course on using Jupyter Notebooks.
04:05 The first step here is to import pandas with the traditional alias of pd, create the download_url variable,
04:23 and then create a DataFrame with this command here.
04:32 When you call type() on the DataFrame, you can see that it’s a pandas DataFrame. By calling read_csv(), you create a DataFrame, the main data structure used in pandas.
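The steps just described could look roughly like the sketch below. The URL is the one quoted in the comment thread under this transcript. The load_majors helper and the small inline sample are assumptions added for illustration, so that the example also runs without a network connection. The sample reuses four column names and the first three rows of values that appear in that comment's output.

```python
import io

import pandas as pd

# URL as quoted in the comment thread below the transcript.
download_url = (
    "https://raw.githubusercontent.com/fivethirtyeight"
    "/data/master/college-majors/recent-grads.csv"
)


def load_majors(source=download_url):
    """Read CSV data from a URL, a path, or a file-like object.

    Hypothetical helper for illustration; the course calls
    pd.read_csv(download_url) directly.
    """
    return pd.read_csv(source)


# Offline stand-in (assumption): a tiny excerpt with only four of the
# dataset's columns, so nothing is downloaded when this block runs.
sample_csv = io.StringIO(
    "Rank,Major_code,Non_college_jobs,Low_wage_jobs\n"
    "1,2419,364,193\n"
    "2,2416,257,50\n"
    "3,2415,176,0\n"
)

sample_df = load_majors(sample_csv)
print(type(sample_df))
```

In a notebook with internet access, calling load_majors() with no argument should fetch the full dataset from download_url instead.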
04:46 You can follow along with this course even if you aren’t familiar with DataFrames, but if you’re interested in learning more about working with pandas and DataFrames, then you can check out Using Pandas in Python to Explore Your Dataset and The Pandas DataFrame: Make Working With Data Delightful.
05:05 Now that you have a DataFrame, you can take a look at the data. First, you should configure the display.max.columns option to make sure pandas doesn’t hide any columns.
05:14 Then you can view the first few rows of the data with the .head() method. So here, you can see the "display.max.columns" option being set to None, so none of the columns will be hidden.
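As a minimal sketch of that setting: the snippet below uses the underscore spelling display.max_columns, which is the name listed in the pandas documentation, and then reads the option back to confirm the change.

```python
import pandas as pd

# None removes the limit on how many columns pandas will display.
pd.set_option("display.max_columns", None)

# Reading the option back confirms what is currently set.
current_limit = pd.get_option("display.max_columns")
print(current_limit)
```

The setting lasts for the current Python session only, so it needs to be set again after restarting the notebook kernel.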
05:27 And this is a method you’ll get extremely familiar with: .head(). It displays the first five rows of the DataFrame by default, and your output should look something like this.
05:44 By default, .head() displays five rows, but you can specify any number of rows as an argument. Here you can see what happens when you use df.head(10).
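To try both calls without downloading anything, a sketch like the following may be useful. The twelve-row DataFrame is made-up stand-in data, not the college majors dataset, and the Score column is hypothetical.

```python
import pandas as pd

# Stand-in data (assumption): twelve rows of made-up values.
df = pd.DataFrame(
    {
        "Rank": range(1, 13),
        "Score": range(100, 112),  # hypothetical column for illustration
    }
)

first_five = df.head()    # no argument: the first five rows
first_ten = df.head(10)   # explicit argument: the first ten rows

print(len(first_five), len(first_ten))
```

Both calls return a new DataFrame and leave df itself unchanged.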
06:03 Now that you have your environment set up and the data source imported into pandas, you’re ready to create your first plot.
Bartosz Zaczyński RP Team on April 8, 2022
@pnmcdos Correct, you can even skip the intermediate variable if you wanted to:
>>> import pandas as pd
>>> df = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight"
... "/data/master/college-majors/recent-grads.csv")
>>> df
     Rank  Major_code  ...  Non_college_jobs  Low_wage_jobs
0       1        2419  ...               364            193
1       2        2416  ...               257             50
2       3        2415  ...               176              0
3       4        2417  ...               102              0
4       5        2405  ...              4440            972
..    ...         ...  ...               ...            ...
168   169        3609  ...              2947            743
169   170        5201  ...               615             82
170   171        5202  ...               870            622
171   172        5203  ...              1245            308
172   173        3501  ...               338            192

[173 rows x 21 columns]
pnmcdos on April 7, 2022
So
and
is all that is necessary to bring in a website and establish a data frame once Pandas is installed?
Meaning, provided I run:
import pandas as pd
download_url = "anywebsite.csv"
df = pd.read_csv(download_url)
would essentially work for any website?