Content Aggregator
Here are examples of content aggregators you can use for inspiration:
- HVPER News: One Page Internet
- All Top: Popular News Sites for Any Topic
Here are resources that you can use to build your content aggregator:
- requests: HTTP library for Python, built for human beings
- Beautiful Soup: Python library for quick turnaround projects like screen-scraping
- sqlite3: A self-contained, serverless, transactional SQL database engine
- celery: Distributed task queue
- apscheduler: In-process task scheduler with Cron-like capabilities
00:00 Web Project Ideas. In this section, you’re going to see some projects which lend themselves readily to being created for the web, but that doesn’t mean to say they have to be implemented solely for that platform, and you may find they’re suitable for GUI or even CLI implementations.
00:16 First up, a content aggregator. Then, you’re going to see a regex query tool, a URL shortener, post-it notes, and finally, a quiz application.
00:28 Let’s have a look at a content aggregator. Content is king—it exists everywhere on the web, from blogs to social media platforms. To keep up, you need to search for new information on the internet constantly.
00:40 One way to do this is to check all of the sites manually to see what new posts are present, but this is time-consuming, inefficient, and can be pretty tiring.
00:49 This is where a content aggregator comes in. A content aggregator fetches information from various places online and gathers all of that information in a single site.
00:58 Therefore, you don’t have to visit multiple sites to get the latest information; one site will be enough. With a content aggregator, everything you’re interested in is available from a single site.
01:10 You can see all of the posts that interest you and decide whether to find out more about them without having to traipse all over the internet. Let’s look at a couple of implementations of content aggregators. Here, you can see Hvper, which aggregates a number of news sites, such as Reddit, Google News, and BuzzFeed. And here you can see AllTop, which aggregates a number of popular sites—TechCrunch, Wired, the New York Times Front Page, et cetera—but also allows you to create a customized page with any RSS feed.
01:42 Now, let’s look at some of the technical details that you’ll need to implement to allow you to create a content aggregator. Firstly, you’ll need to access content with libraries such as requests and also a new one, BeautifulSoup.
01:55 So, we’ve already seen requests, which is an excellent way to access web data, but BeautifulSoup is a Python library for pulling data out of that returned HTML, and it allows quick access to the semantic contents of web pages, allowing straightforward scraping of web data from websites.
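As a rough illustration of pulling data out of returned HTML, here is a minimal sketch. The sample markup, the CSS selector, and the function name extract_headlines are all assumptions made for this example; a real site will be structured differently and will need its own selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; a real site's structure will differ.
SAMPLE_HTML = """
<html><body>
  <article><h2><a href="https://example.com/a">First post</a></h2></article>
  <article><h2><a href="https://example.com/b">Second post</a></h2></article>
  <article><h2>No link here</h2></article>
</body></html>
"""


def extract_headlines(html):
    """Return (title, url) pairs for every linked <h2> inside an <article>."""
    # "html.parser" is the parser bundled with Python, so no extra install.
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    # The selector is an assumption about the page layout, not a standard.
    for link in soup.select("article h2 a[href]"):
        headlines.append((link.get_text(strip=True), link["href"]))
    return headlines


if __name__ == "__main__":
    for title, url in extract_headlines(SAMPLE_HTML):
        print(title, url)
```

In a real aggregator the html argument would typically be the text of a response fetched with requests, for example requests.get(url, timeout=10).text, instead of a hard-coded string.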
02:12 Using a library like BeautifulSoup to do this kind of work will save you hours, if not weeks, of programming and allows you to access the content quickly with a minimum of fuss. However, you need to ensure that you’re not breaking a site’s terms of service when scraping information from it in this way. Next up, you’re going to need a database. This could be something simple such as sqlite3, or the ORM that comes as part of the framework you’re using.
02:39 This is for storage and recall of the data that you’ve obtained. This may well be in the ORM that’s part of the framework that you’re using (which is one of the strengths of a framework such as django) or, if you’re using a micro-framework like flask, you may need to implement your own solution.
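If you go the sqlite3 route, storage and recall could look something like the sketch below. The table layout and the function names (open_store, save_articles, latest_articles) are assumptions for illustration, not part of any framework. Using the URL as the primary key means fetching the same story twice does not create a duplicate row.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the article store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               url        TEXT PRIMARY KEY,
               title      TEXT NOT NULL,
               source     TEXT NOT NULL,
               fetched_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_articles(conn, source, items):
    """Store (title, url) pairs, skipping URLs already present.

    Returns the number of rows that were actually added.
    """
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO articles (url, title, source) VALUES (?, ?, ?)",
            [(url, title, source) for title, url in items],
        )
    return conn.total_changes - before


def latest_articles(conn, limit=20):
    """Return the most recently stored (title, url, source) rows."""
    cursor = conn.execute(
        "SELECT title, url, source FROM articles "
        "ORDER BY fetched_at DESC, rowid DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

Passing a file path instead of ":memory:" keeps the data between runs, which is what you would want for looking at older stories.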
02:54 This can have pros and cons. Next up, scheduling. As seen previously, using a library such as celery or apscheduler will allow the data to be regularly updated.
03:06 This will mean you’ll be able to keep track of what has been in your content aggregator, even if you haven’t been visiting, and possibly look at historical data.
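To make the idea of regular updates concrete without installing anything, here is a small sketch that uses the standard library's sched module as a stand-in. It is not celery or apscheduler, and it blocks the calling thread; the function name run_periodically and the max_runs limit are assumptions added so the example terminates.

```python
import sched
import time


def run_periodically(job, interval_seconds, max_runs):
    """Call job() every interval_seconds, max_runs times, then return the count."""
    if max_runs <= 0:
        return 0

    scheduler = sched.scheduler(time.monotonic, time.sleep)
    runs = 0

    def tick():
        nonlocal runs
        job()
        runs += 1
        # Re-queue the next run until the limit is reached.
        if runs < max_runs:
            scheduler.enter(interval_seconds, 1, tick)

    scheduler.enter(0, 1, tick)  # first run happens immediately
    scheduler.run()  # blocks until no events are left in the queue
    return runs


if __name__ == "__main__":
    # Hypothetical job: a real one would fetch and store new stories.
    run_periodically(lambda: print("refreshing feeds"), interval_seconds=2, max_runs=3)
```

A dedicated scheduler library earns its place once you need things like cron-style timing or background execution, which this sketch deliberately leaves out.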
03:15 Now, let’s look at some extra challenges when programming your content aggregator, the first of which will be adding new websites. Adding new websites to an aggregator will mean accessing content which is formatted in a different way, meaning you need to use a different structure to access it, using BeautifulSoup or possibly an API. Secondly, user implementation.
03:37 Adding different users to a site could allow their viewing to be different. Each could mark a story as read or ask for their own updated stories whenever they visit the site. Also, a selection of sites for users: each user could select the sites they want the data to come from, such as from a list of implemented sites.
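One possible way to model per-user site selection and read marks is sketched below with sqlite3. The three tables, their column names, and the helper functions are all assumptions for illustration; a framework ORM would express the same relationships as models.

```python
import sqlite3

# Hypothetical schema: stories, each user's chosen sources, and read marks.
SCHEMA = """
CREATE TABLE articles (
    id     INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    title  TEXT NOT NULL
);
CREATE TABLE subscriptions (
    username TEXT NOT NULL,
    source   TEXT NOT NULL,
    PRIMARY KEY (username, source)
);
CREATE TABLE read_marks (
    username   TEXT NOT NULL,
    article_id INTEGER NOT NULL REFERENCES articles(id),
    PRIMARY KEY (username, article_id)
);
"""


def make_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def mark_read(conn, username, article_id):
    """Record that a user has read a story; repeating the call is harmless."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO read_marks (username, article_id) VALUES (?, ?)",
            (username, article_id),
        )


def unread_for(conn, username):
    """Titles from the user's selected sources that they have not marked read."""
    rows = conn.execute(
        """SELECT a.title
           FROM articles AS a
           JOIN subscriptions AS s
                ON s.source = a.source AND s.username = :username
           LEFT JOIN read_marks AS r
                ON r.article_id = a.id AND r.username = :username
           WHERE r.article_id IS NULL
           ORDER BY a.id""",
        {"username": username},
    )
    return [title for (title,) in rows]
```

Because subscriptions and read marks are both keyed by user, two users looking at the same stored stories can see different lists.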