Working With HTML Files
00:00
Working with HTML files. An HTML file is a plaintext file that uses HyperText Markup Language to help browsers render web pages. The extensions for HTML files are .html and .htm.
00:16 You’ll need to install an HTML parser library like lxml or html5lib to be able to work with HTML files.
00:32
Once you have these libraries installed, you can save the contents of your DataFrame as an HTML file with .to_html().
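A minimal sketch of such a call, assuming a small made-up DataFrame (the sample values and the name df are illustrative, not the data used in the video):

```python
import pandas as pd

# Hypothetical sample data; the course works with its own DataFrame.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)

# Passing a path as the first argument (buf) writes the table to that file.
df.to_html("data.html")
```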
00:48
This code generates a file data.html. You can see the contents of it onscreen. The file shows the DataFrame contents nicely. However, notice that you’ve not obtained an entire web page.
01:02
It has just output the data that corresponds to the DataFrame in HTML format. As you’ve seen earlier with other formats, if .to_html() doesn’t receive the optional parameter buf, it returns a string instead of creating a file.
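As a rough illustration of the string form, here is a sketch that again assumes a small made-up DataFrame. It also shows that the result is a table fragment rather than a complete page:

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "POP": [1398.72, 1351.16]},
    index=["CHN", "IND"],
)

# With no buf argument, to_html() returns the markup as a string.
html = df.to_html()

print(type(html))
print(html.startswith("<table"))  # the output begins with the table tag
print("<html" in html)            # no surrounding page markup is generated
```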
01:18
Here are some other optional parameters: header determines whether to save the column names, index determines whether to save row labels, classes assigns CSS classes, render_links specifies whether to convert URLs to HTML links, table_id assigns a CSS id to the table tag, and escape decides whether to convert the special characters <, >, and & to HTML-safe strings.
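These parameters can be combined in a single call. The sketch below uses invented data, class names, and an invented table id purely for illustration:

```python
import pandas as pd

# Hypothetical data containing a special character and a URL.
df = pd.DataFrame({"NAME": ["A & B"], "URL": ["https://example.com"]})

html = df.to_html(
    header=True,              # keep the column names
    index=False,              # drop the row labels
    classes="table striped",  # extra CSS classes (made-up names)
    render_links=True,        # wrap URLs in <a> tags
    table_id="countries",     # id attribute on the <table> tag (made-up id)
    escape=True,              # turn &, <, > into HTML-safe entities
)
print(html)
```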
01:48
You can use parameters like these to specify different aspects of the resulting files or strings. You can create DataFrame objects from a suitable HTML file using read_html(), which returns a list of DataFrame instances, one for each table it finds.
02:16 This is very similar to what you did when reading CSV files. You also have parameters that will help you work with dates, missing values, encoding, HTML parsers, and more.
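A sketch of reading a table back, assuming made-up sample data and an in-memory buffer in place of a file on disk. The try/except is only there because read_html() raises ImportError when no HTML parser library is installed:

```python
import io

import pandas as pd

# Hypothetical sample data, turned into an HTML table string first.
df = pd.DataFrame(
    {"COUNTRY": ["China", "India"], "IND_DAY": ["1949-10-01", "1947-08-15"]},
    index=["CHN", "IND"],
)
html = df.to_html()

try:
    # read_html() returns a list with one DataFrame per table found.
    tables = pd.read_html(
        io.StringIO(html), index_col=0, parse_dates=["IND_DAY"]
    )
except ImportError:
    # Neither lxml nor bs4 with html5lib is available.
    tables = None

if tables is not None:
    df_html = tables[0]  # pick the single table out of the list
    print(df_html)
```

Indexing with [0] extracts the DataFrame itself. Printing the list directly shows the table wrapped in square brackets.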
02:31 Now that you’re comfortable working with HTML files, in the next section, you’ll take a deeper look at working with Excel files.
Geir Arne Hjelle RP Team on Sept. 10, 2021
Hi Dean, it looks like you’re missing the end-quote in parse_dates=['IND_DAY].
The error message says that the parser reached EOL (End-of-line) when it was reading a string literal (text between quotes).
toigopaul on Nov. 30, 2024
What’s with the ] at the end of the df_html (2:17)? It happens to me too, but have no idea why.
Bartosz Zaczyński RP Team on Dec. 2, 2024
@toigopaul According to the official documentation, pandas.read_html() returns a list of data frames. Because the default string representation of a Python list includes a pair of square brackets around its content, and there’s only one data frame in the result, you get to see these brackets.
toigopaul on Dec. 2, 2024
@Bartosz Zaczyński Thanks! My eye didn’t catch the opening [
Dean on Sept. 9, 2021
I’m getting an EOL error in my code when I try to do anything with the 'IND_DAY' column. It’s acting like it’s not a column but I’m looking dead at it.
  File "<ipython-input-156-7b70654dd196>", line 1
    df_html = pd.read_html('data.html', index_col=0, parse_dates=['IND_DAY])
    ^
SyntaxError: EOL while scanning string literal