Oftentimes the information you want to track is embedded within a web page. The site may offer an RSS feed that you can monitor, but gives you no control over the contents of that feed. Other times you may have data in a web-based application that is enclosed in an HTML table.
HTML is structured much like XML (XHTML is in fact an XML vocabulary), so the same CSS selectors can be applied to an HTML file as easily as to an XML file. We will walk you through how to create a Klip that can parse an HTML page and extract specific content from an HTML table. The same technique applies to any content in an HTML page.
One common problem in parsing HTML is that most HTML on the web is not well-formed XML. You can see this by saving pretty much any HTML page to your hard disk with the extension .xml, then opening it with your web browser. Your web browser will attempt to parse the file as XML and start displaying errors.
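The same experiment can be run outside the browser. The sketch below is only an illustration in Python (it is not part of Klipfolio Dashboard): it feeds two made-up snippets to a strict XML parser from the standard library. The helper name is_well_formed_xml and both sample strings are assumptions introduced for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical web HTML: unclosed <br> and <p>, and an unquoted attribute value.
sloppy_html = (
    "<html><body><p>Sales<br>"
    "<table border=1><tr><td>12.3</td></tr></table>"
    "</body></html>"
)

# The same content written as well-formed XML.
tidy_html = (
    '<html><body><p>Sales<br/></p>'
    '<table border="1"><tr><td>12.3</td></tr></table>'
    "</body></html>"
)

print(is_well_formed_xml(sloppy_html))  # False
print(is_well_formed_xml(tidy_html))    # True
```

A strict parser rejects the first snippet outright, which is why a forgiving parser is so useful when the source is ordinary web HTML.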
Klipfolio Dashboard's XML parser, however, is non-validating and very forgiving. Unlike an XML parser that uses a complex Document Object Model (DOM) API for extracting content, Klipfolio Dashboard offers a very powerful, yet very simple, Cascading Style Sheet-like syntax for parsing XML. Here is a sample HTML page with an embedded table.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Table Example</title>
</head>
<body>
<h1>Table Example</h1>
<table width="341" border="1">
<tr>
<td width="66"> </td>
<td width="35"><strong>East</strong></td>
<td width="46"><strong>West</strong></td>
<td width="77"><strong>North</strong></td>
<td width="83"><strong>South</strong></td>
</tr>
<tr>
<td><strong>Region 1 </strong></td>
<td><div align="right">12.3</div></td>
<td><div align="right">34.5</div></td>
<td><div align="right">33.2</div></td>
<td><div align="right">901</div></td>
</tr>
<tr>
<td><strong>Region 2 </strong></td>
<td><div align="right">83</div></td>
<td><div align="right">32</div></td>
<td><div align="right">34</div></td>
<td><div align="right">98</div></td>
</tr>
</table>
</body>
</html>
Here is what the table looks like:
You can also load this table in your web browser from http://www.klipfolio.com/static/klips/devguide/ch05_table.html.
Let's build a Klip to monitor changes to this table.
To parse any XML, Klipfolio Dashboard needs at least two CSS instructions. First, it needs to know the XML element that encloses each item (or row) of data. This lets it find the items of data to extract. Second, it needs to know what data you want to extract from each repeating block. This data gets displayed in a column in the Klip and/or a row in an item's tooltip.
Using our example HTML table, let's start with the data. We see that the data for each cell of the table is enclosed in <td> ... </td>, so it's not difficult to pull out (we'll show you how in a moment). The HTML table itself contains two rows of data, each enclosed in <tr> ... </tr>. So far, so good.
However, when parsing HTML, a typical web page usually contains many tables. So we need to be specific about which table we want Klipfolio Dashboard to extract. Looking at our example HTML, we see the <tr>s are enclosed in this table: <table width="341" border="1"> ... </table>. The attribute width="341" is specific to this table, so we can use that attribute to uniquely identify this table.
Here's a first cut at the Klip:
<klip>
<identity>
<title>
Get Stats
</title>
</identity>
<locations>
<contentsource>
http://www.klipfolio.com/static/klips/devguide/ch05_table.html
</contentsource>
<icon>
http://www.klipfolio.com/static/klips/klipfolio/sample_icon.png
</icon>
<banner>
http://www.klipfolio.com/static/klips/klipfolio/sample_banner.png
</banner>
</locations>
<style>
table[width="341"] {
type: item;
}
tr {
itemcol: 1;
noterow: 1;
content: cdata;
}
</style>
</klip>
Here's what it looks like when you run it.
At this point, don't worry that we are not getting the proper data yet. The key is to get a working Klip that finds the proper table and extracts some data. You can see it found the right table. Our Klip has picked out the first row of data in the table, and we can see the raw XML because we specified the property content: cdata in the CSS.
tr {
itemcol: 1;
noterow: 1;
content: cdata;
}
The content: cdata property instructs Klipfolio Dashboard's XML parser not to process the data; rather, just display the raw XML contents. This mode is very helpful in parsing HTML as it lets you see exactly what Klipfolio Dashboard is looking at before it strips out XML elements or processes XML entities.
OK, we're getting some raw data, but the Klip is only extracting one row. Why? The reason is that we didn't get the specification for item quite right. Note the specification of item is as follows:
table[width="341"] {
type: item;
}
This reads as, "Extract the item that is enclosed within <table width="341"> ... </table>". That's not quite specific enough: each item is a <tr> enclosed within <table width="341">, and its data sits in the <td>s inside that <tr>. Here's the correct specification for item.
table[width="341"] tr {
type: item;
}
td {
itemcol: 1;
noterow: 1;
content: cdata;
}
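To see why the extra tr in the selector matters, here is a rough Python analogue, offered as a sketch rather than a description of how Klipfolio Dashboard works internally. The page below is a shortened, made-up variant of the sample with a second table added; the variable names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for a page that contains more than one table.
PAGE = """
<html><body>
<table width="100"><tr><td>Navigation</td></tr></table>
<table width="341" border="1">
  <tr><td width="66"> </td><td width="35"><strong>East</strong></td></tr>
  <tr><td><strong>Region 1 </strong></td><td><div align="right">12.3</div></td></tr>
  <tr><td><strong>Region 2 </strong></td><td><div align="right">83</div></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(PAGE)

# Analogue of `table[width="341"]`: matches the table element itself, once.
tables = root.findall(".//table[@width='341']")

# Analogue of `table[width="341"] tr`: matches every row inside that table.
rows = [tr for table in tables for tr in table.iter("tr")]

print(len(tables))  # 1
print(len(rows))    # 3
```

Matching the table yields a single element, so only one item can come out of it. Matching the rows inside the table yields one item per row, and the unrelated navigation table is ignored.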
Reload the Klip. It now shows the following:
Looking better. We now see three items. The CSS rule for td is extracting the contents of the first td in each row.
But why are there three rows? It's because the table does have three rows: one row with headers (East, West...) and two rows of data (for Region 1 and Region 2). We want to skip the first row and just extract the two rows of data. Let's take a look again at the HTML for the table.
<table width="341" border="1">
<tr>
<td width="66"> </td>
<td width="35"><strong>East</strong></td>
<td width="46"><strong>West</strong></td>
<td width="77"><strong>North</strong></td>
<td width="83"><strong>South</strong></td>
</tr>
<tr>
<td><strong>Region 1 </strong></td>
<td><div align="right">12.3</div></td>
<td><div align="right">34.5</div></td>
<td><div align="right">33.2</div></td>
<td><div align="right">901</div></td>
</tr>
...
Notice that the <td> elements in the first row also specify a width attribute, as in <td width=n>. We want to match only the <td>s that do not have this attribute.
Modify the CSS match for <td> to td:not([width]).
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: cdata;
}
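The effect of the :not([width]) filter can be approximated in a few lines of Python. This is a sketch under the assumption, taken from the behavior described in this section, that the first matching <td> in each row is used and that a row with no match produces no item. The helper name and the shortened table are made up for the example.

```python
import xml.etree.ElementTree as ET

TABLE = """
<table width="341" border="1">
  <tr>
    <td width="66"> </td>
    <td width="35"><strong>East</strong></td>
  </tr>
  <tr>
    <td><strong>Region 1 </strong></td>
    <td><div align="right">12.3</div></td>
  </tr>
  <tr>
    <td><strong>Region 2 </strong></td>
    <td><div align="right">83</div></td>
  </tr>
</table>
"""


def first_cell_without_width(row):
    """Rough analogue of `td:not([width])`: first <td> lacking a width attribute."""
    for td in row.findall("td"):
        if "width" not in td.attrib:
            return td
    return None


table = ET.fromstring(TABLE)
matched = []
for row in table.findall("tr"):
    cell = first_cell_without_width(row)
    if cell is None:
        continue  # header row: every <td> has a width, so nothing matches
    matched.append("".join(cell.itertext()).strip())

print(matched)  # ['Region 1', 'Region 2']
```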
And let's reload the Klip.
We now have just the rows of data. Perfect! But we're only seeing the first column (the first <td>). Let's now extract the second <td>. To extract the second <td> we add the following rule:
td + td:not([width]) {
itemcol: 2;
noterow: 2;
content: cdata;
}
This reads (from right to left) as, "Look for the first <td> that does not have a width and is preceded by a <td>". Reload the Klip, and we now see the following:
Looking good, Houston. Now, since we are getting the right data, let's drop the content: cdata and just view the text.
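The + combinator can be pictured as counting the <td> siblings that come immediately before a cell. The Python sketch below is an approximation built on that reading and on the "first match" behavior described here; it is not Klipfolio Dashboard's implementation. The function first_match and its preceding parameter are hypothetical names.

```python
import xml.etree.ElementTree as ET

TABLE = """
<table width="341" border="1">
  <tr>
    <td width="66"> </td>
    <td width="35"><strong>East</strong></td>
    <td width="46"><strong>West</strong></td>
  </tr>
  <tr>
    <td><strong>Region 1 </strong></td>
    <td><div align="right">12.3</div></td>
    <td><div align="right">34.5</div></td>
  </tr>
  <tr>
    <td><strong>Region 2 </strong></td>
    <td><div align="right">83</div></td>
    <td><div align="right">32</div></td>
  </tr>
</table>
"""


def first_match(row, preceding):
    """First <td> without `width` that has `preceding` <td> siblings directly before it.

    preceding=0 mimics `td:not([width])`, preceding=1 mimics
    `td + td:not([width])`, and so on.
    """
    cells = list(row)  # child elements in document order
    for i, cell in enumerate(cells):
        if cell.tag != "td" or "width" in cell.attrib:
            continue
        before = cells[max(0, i - preceding):i]
        if i >= preceding and all(c.tag == "td" for c in before):
            return cell
    return None


table = ET.fromstring(TABLE)
second_cells = []
for row in table.findall("tr"):
    cell = first_match(row, preceding=1)
    if cell is not None:
        second_cells.append("".join(cell.itertext()).strip())

print(second_cells)  # ['12.3', '83']
```

Each additional td + in the selector corresponds to raising preceding by one, which is how the later rules in this section reach the third, fourth, and fifth cells.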
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
}
Reload the Klip.
Wait! Where did the data go? Why did removing content: cdata cause all the data to disappear? The reason is that the default setting for the content property is content: firstchild. This means Klipfolio Dashboard's XML parser takes only the first child of a matching element. To see why it's empty, look at the contents of the first <td>:
<td><strong>Region 1 </strong></td>
The first child after <td> is an empty node: it sits between the >< in <td><strong>. The same is true of the second <td> node. This empty node causes both the first and second cells to create empty items. When Klipfolio Dashboard tries to add the second empty item, it already finds one in the table, so it overwrites it. The result: a Klip with no data that has one unread item.
Why does Klipfolio Dashboard default to content: firstchild? There are two reasons. First, when parsing XML, the first child usually contains the data. For example, in <td>Region 1</td>, the first child is the text node Region 1. Second, picking out the first node saves processing time because Klipfolio Dashboard does not need to scan for the closing </td>. When parsing large sets of XML, these optimizations save CPU time. But in this case, since the HTML table is very small, and since the HTML contains formatting that we don't care about in the Klip, we need to tell Klipfolio Dashboard to do a bit of extra work.
To tell Klipfolio Dashboard to scan for the enclosing <td> and process everything in between, we specify the content as content: text. Here's the updated CSS.
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: text;
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
align: right;
content: text;
}
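The three content modes discussed so far can be compared side by side with a small Python sketch. The three functions are hypothetical stand-ins that mimic the behavior described in this section using Python's standard XML library; they are not Klipfolio Dashboard's own code, and the exact output of content: cdata in a Klip may differ.

```python
import xml.etree.ElementTree as ET

formatted = ET.fromstring("<td><strong>Region 1 </strong></td>")
plain = ET.fromstring("<td>Region 1</td>")


def first_child_text(td):
    """Like `content: firstchild`: only the text directly after the opening tag."""
    return td.text or ""


def all_text(td):
    """Like `content: text`: every text node up to the closing tag, markup dropped."""
    return "".join(td.itertext())


def raw_markup(td):
    """Like `content: cdata`: the inner markup, left unprocessed."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in td)
    return (td.text or "") + children


print(repr(first_child_text(plain)))      # 'Region 1'
print(repr(first_child_text(formatted)))  # ''
print(repr(all_text(formatted)))          # 'Region 1 '
print(repr(raw_markup(formatted)))        # '<strong>Region 1 </strong>'
```

The second line of output is the empty item seen earlier: with formatting markup inside the cell, there is no text directly after the opening <td>.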
Reload the Klip.
We now see our data. It's looking good. Let's add the CSS entries to extract the contents of the third, fourth, and fifth <td> from the data.
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: text;
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
align: right;
content: text;
}
td + td + td:not([width]) {
itemcol: 3;
noterow: 3;
align: right;
content: text;
}
td + td + td + td:not([width]) {
itemcol: 4;
noterow: 4;
align: right;
content: text;
}
td + td + td + td + td:not([width]) {
itemcol: 5;
noterow: 5;
align: right;
content: text;
}
Reload. Here's the resulting Klip.
At this point the core of the Klip is working. We're finding the correct table and extracting only the rows of the data. Let's take a look at the tooltip.
By default, Klipfolio Dashboard uses the CSS matching criteria as the label for a note. Let's use label and notelabel to make our Klip a bit more expressive.
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: text;
notelabel: false;
emphasis: strong;
label: 'Region';
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
align: right;
content: text;
label: 'East';
}
td + td + td:not([width]) {
itemcol: 3;
noterow: 3;
align: right;
content: text;
label: 'West';
}
td + td + td + td:not([width]) {
itemcol: 4;
noterow: 4;
align: right;
content: text;
label: 'North';
}
td + td + td + td + td:not([width]) {
itemcol: 5;
noterow: 5;
align: right;
content: text;
label: 'South';
}
Reload and we now see the Klip has labels for each row.
We used notelabel: false to hide the label for the first row and set the emphasis: strong to make it a title.
Note: If we are hiding the label for the first row, why do we have label: 'Region' for it? It's because of the brand new feature in Klipfolio Dashboard 5: column header display. The label attribute controls the Klip's column headers as well as the note label. Try it by selecting ->->.
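Putting the pieces together, the finished style block behaves roughly like the following Python sketch, which turns each data row into a labeled record. This is an illustration only: the function extract_rows and the LABELS list are assumptions introduced here, and skipping any row whose cells carry a width attribute is a simplification of the :not([width]) rules.

```python
import xml.etree.ElementTree as ET

TABLE = """
<table width="341" border="1">
  <tr>
    <td width="66"> </td>
    <td width="35"><strong>East</strong></td>
    <td width="46"><strong>West</strong></td>
    <td width="77"><strong>North</strong></td>
    <td width="83"><strong>South</strong></td>
  </tr>
  <tr>
    <td><strong>Region 1 </strong></td>
    <td><div align="right">12.3</div></td>
    <td><div align="right">34.5</div></td>
    <td><div align="right">33.2</div></td>
    <td><div align="right">901</div></td>
  </tr>
  <tr>
    <td><strong>Region 2 </strong></td>
    <td><div align="right">83</div></td>
    <td><div align="right">32</div></td>
    <td><div align="right">34</div></td>
    <td><div align="right">98</div></td>
  </tr>
</table>
"""

# Same labels as the `label:` properties in the style block.
LABELS = ["Region", "East", "West", "North", "South"]


def extract_rows(table_xml):
    """Return one dict per data row, keyed by column label."""
    table = ET.fromstring(table_xml)
    records = []
    for row in table.findall("tr"):
        cells = row.findall("td")
        if any("width" in td.attrib for td in cells):
            continue  # header row: its cells carry a width attribute
        values = ["".join(td.itertext()).strip() for td in cells]
        records.append(dict(zip(LABELS, values)))
    return records


records = extract_rows(TABLE)
for record in records:
    print(record)
```

Each record corresponds to one item in the Klip, with the keys playing the role of the column headers and note labels.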
Since the Klip is going to monitor a table for changes, we want to make it a dashboard Klip. Add the following code above the <style> block:
<setup>
<purge>
permanent
</purge>
<autoremove>
false
</autoremove>
</setup>
The Klip will now show you only the current content of the XML.