Working With HTML Files

00:00 Working with HTML files. An HTML file is a plain-text file that uses Hypertext Markup Language (HTML) to help browsers render web pages. The extensions for HTML files are .html and .htm.

00:16 You’ll need to install an HTML parser library like lxml or html5lib to be able to read HTML files with pandas.

00:32 Once you have these libraries installed, you can save the contents of your DataFrame as an HTML file with .to_html().

00:48 This code generates a file data.html. You can see the contents of it onscreen. The file shows the DataFrame contents nicely. However, notice that you’ve not obtained an entire web page.

01:02 It has just output the data that corresponds to the DataFrame in HTML format. In a similar manner to what we’ve seen earlier, if .to_html() doesn’t receive the optional parameter buf, it returns a string instead of creating a file.
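The two behaviors described above can be sketched as follows. This is a minimal illustration, not the code shown onscreen in the lesson: the sample data is hypothetical, and the file is written to a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, not the dataset used in the lesson.
df = pd.DataFrame(
    {"COUNTRY": ["India", "USA"], "IND_DAY": ["1947-08-15", "1776-07-04"]},
    index=["IND", "USA"],
)

# With a target path, .to_html() writes a file and returns None.
with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "data.html"
    result = df.to_html(html_path)
    file_text = html_path.read_text(encoding="utf-8")

# Without buf, .to_html() returns the markup as a string instead.
html_string = df.to_html()

print(result)                                   # None
print(file_text.lstrip().startswith("<table"))  # True: the file is just a table
print("<html" in file_text)                     # False: no full web page
```

If you need a complete page, you would have to wrap the returned string in your own HTML boilerplate.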

01:18 Here are some other optional parameters: header determines whether to save the column names; index determines whether to save row labels; classes assigns CSS classes; render_links specifies whether to convert URLs to HTML links; table_id assigns a CSS id to the table tag; and escape decides whether to convert the special characters <, >, and & to HTML-safe strings.
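As a rough sketch of how several of these parameters combine, the example below uses made-up data. The class name my-table and the id demo are placeholders chosen for illustration.

```python
import pandas as pd

# Hypothetical data chosen to exercise escaping and link rendering.
df = pd.DataFrame(
    {
        "NAME": ["Tom & Jerry", "<b>bold</b>"],
        "URL": ["https://example.com", "https://example.org"],
    }
)

html_string = df.to_html(
    index=False,         # leave out the row labels
    classes="my-table",  # extra CSS class on the <table> tag (placeholder name)
    table_id="demo",     # CSS id on the <table> tag (placeholder name)
    render_links=True,   # turn URL strings into <a> links
    escape=True,         # convert <, >, and & to HTML-safe sequences
)

print(html_string)
```

In the printed markup, the ampersand and the angle brackets in the NAME column appear as HTML entities, while the URL cells are wrapped in anchor tags.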

01:48 You can use parameters like these to specify different aspects of the resulting files or strings. You can create a DataFrame object from a suitable HTML file using read_html(), which returns a list of DataFrame instances, one per table found, even when the file contains only one table.
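Here is a small round-trip sketch of read_html(), using hypothetical data and a temporary file. It assumes a parser such as lxml is installed. If none is available, pandas raises an ImportError, which the sketch catches so that it still runs.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, not the dataset used in the lesson.
df = pd.DataFrame(
    {"COUNTRY": ["India", "USA"], "CONT": ["Asia", "N.America"]},
    index=["IND", "USA"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "data.html"
    df.to_html(html_path)
    try:
        # index_col=0 uses the first column of the table as row labels.
        tables = pd.read_html(str(html_path), index_col=0)
    except ImportError:
        # No HTML parser (lxml, or bs4 with html5lib) is installed.
        tables = None

if tables is not None:
    print(type(tables).__name__, len(tables))  # a list holding one DataFrame
    df_html = tables[0]                        # pick the table you want
    print(df_html)
```

Printing the list itself rather than tables[0] shows the DataFrame surrounded by square brackets, because you're looking at the list's representation.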

02:16 This is very similar to what you did when reading CSV files. You also have parameters that will help you work with dates, missing values, encoding, HTML parsers, and more.
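The sketch below shows two of those CSV-style parameters, parse_dates and na_values, on made-up data. The GDP figure and the "missing" marker are placeholders, not values from the lesson, and the example again assumes an HTML parser such as lxml is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data: "missing" is an invented marker for an absent value.
df = pd.DataFrame(
    {
        "GDP": ["2575.67", "missing"],
        "IND_DAY": ["1947-08-15", "1776-07-04"],
    },
    index=["IND", "USA"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "data.html"
    df.to_html(html_path)
    try:
        tables = pd.read_html(
            str(html_path),
            index_col=0,              # first column holds the row labels
            parse_dates=["IND_DAY"],  # convert this column to datetimes
            na_values=["missing"],    # treat the marker as a missing value
        )
    except ImportError:
        # No HTML parser (lxml, or bs4 with html5lib) is installed.
        tables = None

if tables is not None:
    df_html = tables[0]
    print(df_html.dtypes)
```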

02:31 Now that you’re comfortable working with HTML files, in the next section, you’ll take a deeper look at working with Excel files.

Dean on Sept. 9, 2021

I’m getting an EOL error in my code when I try to do anything with the 'IND_DAY' column. It’s acting like it’s not a column, but I’m looking dead at it.

  File "<ipython-input-156-7b70654dd196>", line 1
    df_html = pd.read_html('data.html', index_col=0, parse_dates=['IND_DAY])
                                                                            ^
SyntaxError: EOL while scanning string literal

Geir Arne Hjelle RP Team on Sept. 10, 2021

Hi Dean, it looks like you’re missing the end-quote in parse_dates=['IND_DAY].

The error message says that the parser reached EOL (end of line) while it was reading a string literal (text between quotes).

toigopaul on Nov. 30, 2024

What’s with the ] at the end of the df_html output (2:17)? It happens to me too, but I have no idea why.

Bartosz Zaczyński RP Team on Dec. 2, 2024

@toigopaul According to the official documentation, pandas.read_html() returns a list of data frames. Because the default string representation of a Python list includes a pair of square brackets around its content, and there’s only one data frame in the result, you get to see these brackets.

toigopaul on Dec. 2, 2024

@Bartosz Zaczyński Thanks! My eye didn’t catch the opening [