In this part of the series, we’re going to scrape the contents of a webpage and then process the text to display word counts.
Updates:
- 02/10/2020: Upgraded to Python version 3.8.1 as well as the latest versions of requests, BeautifulSoup, and nltk. See below for details.
- 03/22/2016: Upgraded to Python version 3.5.1 as well as the latest versions of requests, BeautifulSoup, and nltk. See below for details.
- 02/22/2015: Added Python 3 support.
Free Bonus: Click here to get access to a free Flask + Python video tutorial that shows you how to build Flask web app, step-by-step.
Remember: Here’s what we’re building - A Flask app that calculates word-frequency pairs based on the text from a given URL.
- Part One: Set up a local development environment and then deploy both a staging and a production environment on Heroku.
- Part Two: Set up a PostgreSQL database along with SQLAlchemy and Alembic to handle migrations.
- Part Three: Add in the back-end logic to scrape and then process the word counts from a webpage using the requests, BeautifulSoup, and Natural Language Toolkit (NLTK) libraries. (current)
- Part Four: Implement a Redis task queue to handle the text processing.
- Part Five: Set up Angular on the front-end to continuously poll the back-end to see if the request is done processing.
- Part Six: Push to the staging server on Heroku - setting up Redis and detailing how to run two processes (web and worker) on a single Dyno.
- Part Seven: Update the front-end to make it more user-friendly.
- Part Eight: Create a custom Angular Directive to display a frequency distribution chart using JavaScript and D3.
Need the code? Grab it from the repo.
Install Requirements
Tools used:
- requests (2.22.0) - a library for sending HTTP requests
- BeautifulSoup (4.8.2) - a tool used for scraping and parsing documents from the web
- Natural Language Toolkit (3.4.5) - a natural language processing library
Navigate into the project directory to activate the virtual environment, via autoenv, and then install the requirements:
$ cd flask-by-example
$ python -m pip install requests==2.22.0 beautifulsoup4==4.8.2 nltk==3.4.5
$ python -m pip freeze > requirements.txt
Refactor the Index Route
To get started, let’s get rid of the “hello world” part of the index route in our app.py file and set up the route to render a form to accept URLs. First, add a templates folder to hold our templates and add an index.html file to it.
$ mkdir templates
$ touch templates/index.html
Set up a very basic HTML page:
<!DOCTYPE html>
<html>
<head>
<title>Wordcount</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="//netdna.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css" rel="stylesheet" media="screen">
<style>
.container {
max-width: 1000px;
}
</style>
</head>
<body>
<div class="container">
<h1>Wordcount 3000</h1>
<form role="form" method='POST' action='/'>
<div class="form-group">
<input type="text" name="url" class="form-control" id="url-box" placeholder="Enter URL..." style="max-width: 300px;" autofocus required>
</div>
<button type="submit" class="btn btn-default">Submit</button>
</form>
<br>
{% for error in errors %}
<h4>{{ error }}</h4>
{% endfor %}
</div>
<script src="//code.jquery.com/jquery-2.2.1.min.js"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
</body>
</html>
We used Bootstrap to add a bit of style so our page isn’t completely hideous. Then we added a form with a text input box for users to enter a URL into. Additionally, we utilized a Jinja for
loop to iterate through a list of errors, displaying each one.
Update app.py to serve the template:
import os
from flask import Flask, render_template
from flask_sqlalchemy import SQLAlchemy
app = Flask(__name__)
app.config.from_object(os.environ['APP_SETTINGS'])
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False
db = SQLAlchemy(app)
from models import Result
@app.route('/', methods=['GET', 'POST'])
def index():
return render_template('index.html')
if __name__ == '__main__':
app.run()
Why both HTTP methods, methods=['GET', 'POST']
? Well, we will eventually use that same route for both GET and POST requests - to serve the index.html page and handle form submissions, respectively.
Fire up the app to test it out:
$ python manage.py runserver
Navigate to http://localhost:5000/ and you should see the form staring back at you.
Requests
Now let’s use the requests library to grab the HTML page from the submitted URL.
Change your index route like so:
@app.route('/', methods=['GET', 'POST'])
def index():
errors = []
results = {}
if request.method == "POST":
# get url that the user has entered
try:
url = request.form['url']
r = requests.get(url)
print(r.text)
except:
errors.append(
"Unable to get URL. Please make sure it's valid and try again."
)
return render_template('index.html', errors=errors, results=results)
Make sure to update the imports as well:
import os
import requests
from flask import Flask, render_template, request
from flask_sqlalchemy import SQLAlchemy
- Here, we imported the
requests
library as well as therequest
object from Flask. The former is used to send external HTTP GET requests to grab the specific user-provided URL, while the latter is used to handle GET and POST requests within the Flask app. - Next, we added variables to capture both errors and results, which are passed into the template.
-
Within the view itself, we checked if the request is a GET or POST-
- If POST: We grabbed the value (URL) from the form and assigned it to the
url
variable. Then we added an exception to handle any errors and, if necessary, appended a generic error message to theerrors
list. Finally, we rendered the template, including theerrors
list andresults
dictionary. - If GET: We simply rendered the template.
- If POST: We grabbed the value (URL) from the form and assigned it to the
Let’s test this out:
$ python manage.py runserver
You should be able to type in a valid webpage and in the terminal you’ll see the text of that page returned.
Note: Make sure, that your URL includes
http://
orhttps://
. Otherwise our application won’t detect, that it’s a valid URL.
Text Processing
With the HTML in hand, let’s now count the frequency of the words that are on the page and display them to the end user. Update your code in app.py to the following and we’ll walk through what’s happening:
import os
import requests
import operator
import re
import nltk
from flask import Flask, render_template, request
from flask_sqlalchemy import SQLAlchemy
from stop_words import stops
from collections import Counter
from bs4 import BeautifulSoup
app = Flask(__name__)
app.config.from_object(os.environ['APP_SETTINGS'])
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = True
db = SQLAlchemy(app)
from models import Result
@app.route('/', methods=['GET', 'POST'])
def index():
errors = []
results = {}
if request.method == "POST":
# get url that the person has entered
try:
url = request.form['url']
r = requests.get(url)
except:
errors.append(
"Unable to get URL. Please make sure it's valid and try again."
)
return render_template('index.html', errors=errors)
if r:
# text processing
raw = BeautifulSoup(r.text, 'html.parser').get_text()
nltk.data.path.append('./nltk_data/') # set the path
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
# remove punctuation, count raw words
nonPunct = re.compile('.*[A-Za-z].*')
raw_words = [w for w in text if nonPunct.match(w)]
raw_word_count = Counter(raw_words)
# stop words
no_stop_words = [w for w in raw_words if w.lower() not in stops]
no_stop_words_count = Counter(no_stop_words)
# save the results
results = sorted(
no_stop_words_count.items(),
key=operator.itemgetter(1),
reverse=True
)
try:
result = Result(
url=url,
result_all=raw_word_count,
result_no_stop_words=no_stop_words_count
)
db.session.add(result)
db.session.commit()
except:
errors.append("Unable to add item to database.")
return render_template('index.html', errors=errors, results=results)
if __name__ == '__main__':
app.run()
Create a new file called stop_words.py and add the following list:
stops = [
'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his',
'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself',
'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during',
'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in',
'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then',
'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any',
'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's',
't', 'can', 'will', 'just', 'don', 'should', 'now', 'id', 'var',
'function', 'js', 'd', 'script', '\'script', 'fjs', 'document', 'r',
'b', 'g', 'e', '\'s', 'c', 'f', 'h', 'l', 'k'
]
What’s happening?
Text Processing
-
In our index route we used beautifulsoup to clean the text, by removing the HTML tags, that we got back from the URL as well as nltk to-
- Tokenize the raw text (break up the text into individual words), and
- Turn the tokens into an nltk text object.
-
In order for nltk to work properly, you need to download the correct tokenizers. First, create a new directory -
mkdir nltk_data
- then run -python -m nltk.downloader
.When the installation window appears, update the ‘Download Directory’ to whatever_the_absolute_path_to_your_app_is/nltk_data/.
Then click the ‘Models’ tab and select ‘punkt’ under the ‘Identifier’ column. Click ‘Download’. Check the official documentation for more information.
Remove Punctuation, Count Raw Words
- Since we don’t want punctuation counted in the final results, we created a regular expression that matched anything not in the standard alphabet.
- Then, using a list comprehension, we created a list of words without punctuation or numbers.
- Finally, we tallied the number of times each word appeared in the list using Counter.
Stop Words
Our current output contains a lot of words that we likely don’t want to count - i.e., “I”, “me”, “the”, and so forth. These are called stop words.
- With the
stops
list, we again used a list comprehension to create a final list of words that do not include those stop words. - Next, we created a dictionary with the words (as keys) and their associated counts (as values).
- And finally we used the sorted method to get a sorted representation of our dictionary. Now we can use the sorted data to display the words with the highest count at the top of the list, which means that we won’t have to do that sorting in our Jinja template.
For a more robust stop word list, use the NLTK stopwords corpus.
Save the Results
Finally, we used a try/except to save the results of our search and the subsequent counts to the database.
Display Results
Let’s update index.html in order to display the results:
<!DOCTYPE html>
<html>
<head>
<title>Wordcount</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="//netdna.bootstrapcdn.com/bootstrap/3.1.1/css/bootstrap.min.css" rel="stylesheet" media="screen">
<style>
.container {
max-width: 1000px;
}
</style>
</head>
<body>
<div class="container">
<div class="row">
<div class="col-sm-5 col-sm-offset-1">
<h1>Wordcount 3000</h1>
<br>
<form role="form" method="POST" action="/">
<div class="form-group">
<input type="text" name="url" class="form-control" id="url-box" placeholder="Enter URL..." style="max-width: 300px;">
</div>
<button type="submit" class="btn btn-default">Submit</button>
</form>
<br>
{% for error in errors %}
<h4>{{ error }}</h4>
{% endfor %}
<br>
</div>
<div class="col-sm-5 col-sm-offset-1">
{% if results %}
<h2>Frequencies</h2>
<br>
<div id="results">
<table class="table table-striped" style="max-width: 300px;">
<thead>
<tr>
<th>Word</th>
<th>Count</th>
</tr>
</thead>
{% for result in results%}
<tr>
<td>{{ result[0] }}</td>
<td>{{ result[1] }}</td>
</tr>
{% endfor %}
</table>
</div>
{% endif %}
</div>
</div>
</div>
<br><br>
<script src="//code.jquery.com/jquery-1.11.0.min.js"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.1.1/js/bootstrap.min.js"></script>
</body>
</html>
Here, we added an if
statement to see if our results
dictionary has anything in it and then added a for
loop to iterate over the results
and display them in a table. Run your app and you should be able to enter a URL and get back the count of the words on the page.
$ python manage.py runserver
What if we wanted to display only the first ten keywords?
results = sorted(
no_stop_words_count.items(),
key=operator.itemgetter(1),
reverse=True
)[:10]
Test it out.
Summary
Okay great. Given a URL we can count the words that are on the page. If you use a site without a massive amount of words, like https://realpython.com, the processing should happen fairly quickly. What happens if the site has a lot of words, though? For example, try out https://gutenberg.ca. You’ll notice that this takes longer to process.
If you have a number of users all hitting your site at once to get word counts, and some of them are trying to count larger pages, this can become a problem. Or perhaps you decide to change the functionality so that when a user inputs a URL, we recursively scrape the entire web site and calculate word frequencies based on each individual page. With enough traffic, this will significantly slow down the site.
What’s the solution?
Instead of counting the words after each user makes a request, we need to use a queue to process this in the backend - which is exactly where will start next time in Part 4.
For now, commit your code, but before you push to Heroku, you should remove all language tokenizers except for English along with the zip file. This will significantly reduce the size of the commit. Keep in mind though that if you do process a non-English site, it will only process English words.
└── nltk_data
└── tokenizers
└── punkt
├── PY3
│ └── english.pickle
└── english.pickle
Push it up to the staging environment only since this new text processing feature is only half finished:
$ git push stage master
Test it out on staging. Comment if you have questions. See you next time!
Free Bonus: Click here to get access to a free Flask + Python video tutorial that shows you how to build Flask web app, step-by-step.
This is a collaboration piece between Cam Linke, co-founder of Startup Edmonton, and the folks at Real Python