Getting Started and Compiling the Data Set

The Google Colab notebook is available here:

00:00 I’ve set up a Jupyter Notebook on Google Colab, a free service for editing and hosting Jupyter Notebooks. I’ll be referring to it throughout the course. You can follow along by cloning the Notebook located at the URL at the bottom of the screen. Google Colab is used here purely for your convenience.

00:18 If you’d like, you can still download the Notebook and run it on other services like Microsoft Azure or locally using a stock Jupyter Notebook server. The advantage of Google Colab is that all of the packages needed to complete the demo are pre-installed. To get started, you just need to connect to a runtime hosted on Google servers with a single click.

00:43 Before starting any data science or machine learning project, you should fully understand the data you are working with. This course will use the Sentiment Labelled Sentences Data Set from the Machine Learning Repository, located at the University of California, Irvine.

01:00 You won’t need to labor over the function get_data(). It simply uses the requests package, along with modules from the Python standard library, to download the data set ZIP file and save it to disk, and then calls the extract_data() function to extract it.
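The download-and-extract step can be sketched roughly as follows. This is not the notebook's own code: it is a minimal version that sticks to standard-library modules (urllib.request and zipfile), and the notebook's implementation may differ, for instance by using the requests package. The url parameter, the default file name data.zip, and the function bodies are assumptions made for illustration.

```python
import urllib.request
import zipfile
from pathlib import Path


def extract_data(zip_path, target_dir="."):
    """Unpack every member of the ZIP archive into target_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)


def get_data(url, zip_path="data.zip", target_dir="."):
    """Download the archive at url, save it to disk, then extract it.

    url is a placeholder: pass the data set's actual download link.
    """
    zip_path = Path(zip_path)
    with urllib.request.urlopen(url) as response:
        # Save the raw bytes of the ZIP file to disk first.
        zip_path.write_bytes(response.read())
    extract_data(zip_path, target_dir)
    return zip_path
```

Calling get_data() with the data set's download URL should leave both the ZIP file and the extracted folder in the working directory.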

01:16 At this point, you can open the Files tab in Google Colab and see a folder named sentiment labelled sentences, which holds the contents of the data set ZIP file. Note that the free tier of Google Colab is a shared service. If your session times out and the resources are reclaimed, you will lose any files that you downloaded. In this case, it’s not a big deal because you can just download them again, but if you save any data generated by your Notebook, it’s a good idea to download it.

01:49 The next function, rename_data_folder(), cleans up the folder names to make the code more readable.
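A folder rename like this one might look something like the sketch below. The body is an assumption, not the notebook's code, and the new folder name sentiment_data is a hypothetical choice; only the original folder name comes from the data set.

```python
from pathlib import Path


def rename_data_folder(
    old_name="sentiment labelled sentences",
    new_name="sentiment_data",  # hypothetical name, chosen for illustration
    base_dir=".",
):
    """Rename the extracted folder so its name contains no spaces."""
    old_path = Path(base_dir) / old_name
    new_path = Path(base_dir) / new_name
    # Only rename when the source exists and the target is still free,
    # so that re-running the notebook cell does not raise an error.
    if old_path.exists() and not new_path.exists():
        old_path.rename(new_path)
    return new_path
```

Guarding the rename makes the cell safe to run more than once, which is handy in a notebook where cells are often re-executed.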

01:57 Open the file amazon_cells_labelled.txt and look at the contents.

02:06 Each line has a sentence followed by a sentiment score. Negative sentiment has a score of 0, while positive sentiment has a score of 1. There are also two other files, with data from the Internet Movie Database and Yelp, that have the same structure.
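To make the line format concrete, here is a small sketch that splits one line into its sentence and score. It assumes the two fields are separated by a tab character, and the sample lines are invented for illustration rather than copied from the data set.

```python
def parse_line(line):
    """Split one 'sentence<TAB>score' line into (sentence, score)."""
    # Split on the last tab so that any tab inside the sentence is kept.
    sentence, score = line.rstrip("\n").rsplit("\t", 1)
    return sentence.strip(), int(score)


# Invented sample lines in the same shape as the data set files.
sample_lines = [
    "The battery lasts all day.\t1\n",
    "The screen cracked within a week.\t0\n",
]

for line in sample_lines:
    sentence, score = parse_line(line)
    label = "positive" if score == 1 else "negative"
    print(f"{label}: {sentence}")
```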

02:26 You’re going to concatenate those files using Pandas. First, import pandas and alias it as pd, because data scientists don’t like to type.

02:37 Next, create a mapping between each file in the data set and its source. Iterate over the files in the dictionary and read each one into a Pandas DataFrame.

02:50 Notice that even though you are using the read_csv() function from pandas, you can specify any separator. The files in the data set are tab-separated, so use the sep keyword argument to tell Pandas to use the tab character instead.

03:06 And the names keyword argument contains the names of the columns to be used in the DataFrame. Keep the source of each item by appending a new 'source' column to the DataFrame.

03:19 Then store each DataFrame in a list and join them together. Using the .head() method of the DataFrame will show the first five items.
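The steps described above can be sketched as follows. To keep the example self-contained, it reads from small in-memory stand-ins instead of the real files, so the sentences are invented; in the notebook, the dictionary values would be file paths. The column names sentence and label, and the source keys, are assumptions made for illustration.

```python
import io

import pandas as pd

# Invented stand-ins for the three files, keyed by source.
# In the notebook, each value would be the path to a .txt file.
sources = {
    "amazon": io.StringIO("Great phone.\t1\nBattery died quickly.\t0\n"),
    "imdb": io.StringIO("A dull, lifeless film.\t0\n"),
    "yelp": io.StringIO("Loved the tacos.\t1\n"),
}

frames = []
for source, file in sources.items():
    # sep="\t" switches read_csv() from commas to tabs;
    # names= supplies column names because the files have no header row.
    df = pd.read_csv(file, sep="\t", names=["sentence", "label"])
    df["source"] = source  # remember where each row came from
    frames.append(df)

# Join the per-source DataFrames into one and renumber the rows.
data = pd.concat(frames, ignore_index=True)
print(data.head())
```

Passing ignore_index=True gives the combined DataFrame a single continuous index instead of repeating each file's own row numbers.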

03:31 In the next video, you’ll start to prepare the data for machine learning.

vickaul-ai on May 10, 2021

You may run into issues trying to access the UCI-hosted data set. I did. I’m assuming it’s because the UCI servers were unavailable.

dstricks on Nov. 1, 2021

FYI the requests module is not part of the Python Standard Library and will need to be installed into your environment (minute 1:05 mentioned it was part of the Python Standard Library). Just a quick note to help out new learners that might run into an issue with that.

Like Dan says… Happy Pythoning!

Bartosz Zaczyński RP Team on Nov. 2, 2021

@dstricks Good catch! Thanks for pointing this out.
