In this part of the series, we’re going to scrape the contents of a webpage and then process the text to display word counts.
- 03/22/2016: Upgraded to Python version 3.5.1 as well as the latest versions of requests, BeautifulSoup, and nltk. See below for details.
- 02/22/2015: Added Python 3 support.
Remember: Here’s what we’re building – A Flask app that calculates word-frequency pairs based on the text from a given URL.
- Part One: Set up a local development environment and then deploy both a staging and a production environment on Heroku.
- Part Two: Set up a PostgreSQL database along with SQLAlchemy and Alembic to handle migrations.
- Part Three: Add in the back-end logic to scrape and then process the word counts from a webpage using the requests, BeautifulSoup, and Natural Language Toolkit (NLTK) libraries. (current)
- Part Four: Implement a Redis task queue to handle the text processing.
- Part Five: Set up Angular on the front-end to continuously poll the back-end to see if the request is done processing.
- Part Six: Push to the staging server on Heroku – setting up Redis and detailing how to run two processes (web and worker) on a single Dyno.
- Part Seven: Update the front-end to make it more user-friendly.
Need the code? Grab it from the repo.
- requests (2.9.1) – a library for sending HTTP requests
- BeautifulSoup (4.4.1) – a tool used for scraping and parsing documents from the web
- Natural Language Toolkit (3.2) – a natural language processing library
Navigate into the project directory to activate the virtual environment, via autoenv, and then install the requirements:
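The exact commands aren't shown in this excerpt; assuming the project directory and requirements.txt from the earlier parts of the series, they would look something like:

```sh
$ cd flask-by-example
$ pip install requests==2.9.1 beautifulsoup4==4.4.1 nltk==3.2
$ pip freeze > requirements.txt
```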
Refactor the Index Route
To get started, let’s get rid of the “hello world” part of the index route in our app.py file and set up the route to render a form to accept URLs. First, add a templates folder to hold our templates and add an index.html file to it.
Set up a very basic HTML page:
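The template itself isn't included in this excerpt; a basic version with a Bootstrap-styled form and an error loop might look like this (the heading text and Bootstrap version are placeholders):

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Wordcount</title>
    <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css">
  </head>
  <body>
    <div class="container">
      <h1>Wordcount</h1>
      {% for error in errors %}
        <h4>{{ error }}</h4>
      {% endfor %}
      <form role="form" method="POST" action="/">
        <div class="form-group">
          <input type="text" name="url" class="form-control" placeholder="Enter URL...">
        </div>
        <button type="submit" class="btn btn-default">Submit</button>
      </form>
    </div>
  </body>
</html>
```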
We used Bootstrap to add a bit of style so our page isn't completely hideous. Then we added a form with a text input box for users to enter a URL. Additionally, we utilized a Jinja for loop to iterate through a list of errors, displaying each one.
Update app.py to serve the template:
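The updated file isn't reproduced in this excerpt; a minimal version of app.py at this stage might look like the following (the database and config setup from Part Two is omitted for brevity):

```python
from flask import Flask, render_template

app = Flask(__name__)


@app.route('/', methods=['GET', 'POST'])
def index():
    errors = []
    results = {}
    # for now, just render the form; POST handling comes next
    return render_template('index.html', errors=errors, results=results)


if __name__ == '__main__':
    app.run()
```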
Why both HTTP methods, methods=['GET', 'POST']? Well, we will eventually use that same route for both GET and POST requests – to serve the index.html page and handle form submissions, respectively.
Fire up the app to test it out:
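If you set up the manage.py runner in Part Two, that's:

```sh
$ python manage.py runserver
```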
Navigate to http://localhost:5000/ and you should see the form staring back at you.
Now let’s use the requests library to grab the HTML page from the submitted URL.
Change your index route like so:
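The updated route isn't reproduced here; based on the walkthrough that follows, it might look something like this (the exact error message text is a guess):

```python
import requests
from flask import Flask, render_template, request

app = Flask(__name__)


@app.route('/', methods=['GET', 'POST'])
def index():
    errors = []
    results = {}
    if request.method == 'POST':
        # grab the user-submitted URL from the form
        url = request.form['url']
        try:
            r = requests.get(url)
            print(r.text)  # for now, just dump the raw HTML to the terminal
        except requests.exceptions.RequestException:
            errors.append(
                "Unable to get URL. Please make sure it's valid and try again."
            )
    return render_template('index.html', errors=errors, results=results)
```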
Make sure to update the imports as well:
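The imports at the top of app.py would then be something like:

```python
import requests
from flask import Flask, render_template, request
```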
- Here, we imported the requests library as well as the request object from Flask. The former is used to send external HTTP GET requests to grab the specific user-provided URL, while the latter is used to handle GET and POST requests within the Flask app.
- Next, we added variables to capture both errors and results, which are passed into the template.
Within the view itself, we checked whether the request is a GET or a POST:
- If POST: We grabbed the value (URL) from the form and assigned it to the url variable. Then we added an exception to handle any errors and, if necessary, appended a generic error message to the errors list. Finally, we rendered the template, including the errors list.
- If GET: We simply rendered the template.
Let’s test this out:
You should be able to type in a valid webpage and in the terminal you’ll see the text of that page returned.
With the HTML in hand, let’s now count the frequency of the words that are on the page and display them to the end user. Update your code in app.py to the following and we’ll walk through what’s happening:
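The full file isn't included in this excerpt; pieced together from the walkthrough that follows, app.py might look roughly like this. Names like count_tokens and NON_ALPHA are mine, the database setup from Part Two is left out, and the stop_words fallback is only there to keep the sketch self-contained:

```python
import operator
import re
from collections import Counter

import nltk
import requests
from bs4 import BeautifulSoup
from flask import Flask, render_template, request

try:
    from stop_words import stops  # the list created in stop_words.py
except ImportError:
    # minimal fallback so this sketch runs on its own
    stops = ['i', 'me', 'the', 'a', 'an', 'and', 'of', 'to']

app = Flask(__name__)

# matches any character outside the standard alphabet
NON_ALPHA = re.compile('[^a-zA-Z]')


def count_tokens(tokens):
    """Tally words, dropping punctuation/number tokens and stop words."""
    raw_words = [t.lower() for t in tokens if not NON_ALPHA.search(t)]
    raw_word_count = Counter(raw_words)
    no_stop_words = [w for w in raw_words if w not in stops]
    no_stop_words_count = Counter(no_stop_words)
    return raw_word_count, no_stop_words_count


@app.route('/', methods=['GET', 'POST'])
def index():
    errors = []
    results = {}
    if request.method == 'POST':
        url = request.form['url']
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException:
            errors.append(
                "Unable to get URL. Please make sure it's valid and try again."
            )
            return render_template('index.html', errors=errors)
        # strip the HTML tags, then tokenize the remaining text
        text = BeautifulSoup(r.text, 'html.parser').get_text()
        tokens = nltk.word_tokenize(text)
        raw_count, no_stop_count = count_tokens(tokens)
        # sort so the most frequent words come first
        results = sorted(
            no_stop_count.items(), key=operator.itemgetter(1), reverse=True
        )
        # saving raw_count and results to the database (via the Result
        # model from Part Two) goes here, wrapped in a try/except
    return render_template('index.html', errors=errors, results=results)


if __name__ == '__main__':
    app.run()
```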
Create a new file called stop_words.py and add the following list:
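The list itself isn't reproduced in this excerpt; any standard English stop-word list will do, for example:

```python
stops = [
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
    'you', 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
    'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
    'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
    'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as',
    'of', 'at', 'by', 'for', 'with', 'about', 'to', 'from', 'in', 'on',
]
```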
In our index route we used BeautifulSoup to clean the text that we got back from the URL (by removing the HTML tags), as well as nltk to tokenize that text into individual words.
In order for nltk to work properly, you need to download the correct tokenizers. First, create a new directory, mkdir nltk_data, and then run python -m nltk.downloader.
When the installation window appears, update the ‘Download Directory’ to whatever_the_absolute_path_to_your_app_is/nltk_data/.
Then click the ‘Models’ tab and select ‘punkt’ under the ‘Identifier’ column. Click ‘Download’. Check the official documentation for more information.
Remove Punctuation, Count Raw Words
- Since we don’t want punctuation counted in the final results, we created a regular expression that matched anything not in the standard alphabet.
- Then, using a list comprehension, we created a list of words without punctuation or numbers.
- Finally, we tallied the number of times each word appeared in the list using Counter.
Our current output contains a lot of words that we likely don’t want to count – i.e., “I”, “me”, “the”, and so forth. These are called stop words.
- With the stops list, we again used a list comprehension to create a final list of words that do not include those stop words.
- Next, we created a dictionary with the words (as keys) and their associated counts (as values).
- And finally, we used the built-in sorted function to get a sorted representation of our dictionary. Now we can use the sorted data to display the words with the highest count at the top of the list, which means that we won't have to do that sorting in our Jinja template.
For a more robust stop word list, use the NLTK stopwords corpus.
Save the Results
Finally, we used a try/except to save the results of our search and the subsequent counts to the database.
Let’s update index.html in order to display the results:
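The updated template isn't shown in this excerpt; since each entry in results is a (word, count) pair, the addition inside the container div might look like this:

```html
{% if results %}
  <h2>Frequencies</h2>
  <table class="table table-striped">
    <thead>
      <tr>
        <th>Word</th>
        <th>Count</th>
      </tr>
    </thead>
    <tbody>
      {% for result in results %}
        <tr>
          <td>{{ result.0 }}</td>
          <td>{{ result.1 }}</td>
        </tr>
      {% endfor %}
    </tbody>
  </table>
{% endif %}
```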
Here, we added an if statement to see if our results dictionary has anything in it and then added a for loop to iterate over the results and display them in a table. Run your app and you should be able to enter a URL and get back the count of the words on the page.
What if we wanted to display only the first ten keywords?
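One option is to slice the results list right in the template, since Jinja supports Python-style slicing:

```html
{% for result in results[:10] %}
  <tr>
    <td>{{ result.0 }}</td>
    <td>{{ result.1 }}</td>
  </tr>
{% endfor %}
```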
Test it out.
Okay great. Given a URL we can count the words that are on the page. If you use a site without a massive amount of words, like http://realpython.com, the processing should happen fairly quickly. What happens if the site has a lot of words, though? For example, try out http://gutenberg.ca. You’ll notice that this takes longer to process.
If you have a number of users all hitting your site at once to get word counts, and some of them are trying to count larger pages, this can become a problem. Or perhaps you decide to change the functionality so that when a user inputs a URL, we recursively scrape the entire web site and calculate word frequencies based on each individual page. With enough traffic, this will significantly slow down the site.
What’s the solution?
Instead of counting the words after each user makes a request, we need to use a queue to process this in the back-end – which is exactly where we'll start next time, in Part Four.
For now, commit your code, but before you push to Heroku, you should remove all language tokenizers except for English along with the zip file. This will significantly reduce the size of the commit. Keep in mind though that if you do process a non-English site, it will only process English words.
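The exact commands aren't shown in this excerpt; on a Unix-like shell, something along these lines removes everything but the English tokenizer (paths assume the nltk_data directory created earlier):

```sh
$ cd nltk_data/tokenizers
$ rm punkt.zip
$ cd punkt
$ ls | grep -v english.pickle | xargs rm
# if there is a PY3 subdirectory, run the same command inside it as well
```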
Push it up to the staging environment only since this new text processing feature is only half finished:
Test it out on staging. Comment if you have questions. See you next time!
This is a collaboration piece between Cam Linke, co-founder of Startup Edmonton, and the folks at Real Python