HTML -> XML conversion with XSLT and visual tools
Our bunkhouse stretches over a large territory of 16 HTML files with a total size of about 500 KB. Page layout is done with tables, which makes the corresponding HTML structure rather regular. However, the files were maintained by different people at different times, and as a result some peculiarities can be observed at the lower levels of the structure. For example, inside <td> tags the target information is normally placed in an <a><font> sequence of tags, but sometimes the sequence is reversed and we get <font><a>. XSLT provides enough flexibility to write expressions and instructions that match both the "regular" and the "irregular" structures, but investigating such cases by reading raw HTML code would be tedious and boring. Free visual tools supplied by the XML community turn this job into an exciting data hunt.
Step 1. Converting input HTML pages into XHTML with Tidy
To use the full power of XSLT, we have to provide the XSLT processor with well-formed XML as input. Of course, our input files were far from well-formed, so first we needed to "clean" the input HTML files. I did it using Tidy - Dave Raggett's free tool, currently hosted by W3C. It reads an HTML file and converts it into a well-formed XHTML document, which can be processed as XML. Let's take a look at a real example:

Input HTML:

    <TD rowspan="2">
    <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop">
    <IMG height=140 src="images/practical.jpg" width=112 ></A>
    </TD>

Problems:
1. The opening tag <a> is coded in lower case and the closing tag </A> in upper case
2. The IMG tag doesn't have a matching closing tag
3. The values of the height and width attributes are not enclosed in quotes

Tidy output:

    <td rowspan="2">
    <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop">
    <img height="140" src="images/practical.jpg" width="112" /></a>
    </td>

All problems are magically fixed.
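To see why the cleanup matters, here is a minimal sketch that feeds both fragments from the example above to an XML parser. It uses Python's standard xml.etree.ElementTree module purely as a convenient well-formedness check; it is not part of the original toolchain, and the helper name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The fragment as it appeared in the original page (copied from the example).
RAW_HTML = """<TD rowspan="2">
<a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop">
<IMG height=140 src="images/practical.jpg" width=112 ></A>
</TD>"""

# The same fragment after Tidy (copied from the example).
TIDY_OUTPUT = """<td rowspan="2">
<a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop">
<img height="140" src="images/practical.jpg" width="112" /></a>
</td>"""


def is_well_formed(fragment):
    """Return True if the fragment parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for name, fragment in (("raw HTML", RAW_HTML), ("Tidy output", TIDY_OUTPUT)):
        print(name, "well-formed:", is_well_formed(fragment))
```

An XSLT processor would reject the raw fragment for the same reasons an XML parser does, which is why the Tidy pass has to come first.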
Step 2. HTML structure investigation with Merlot
Step 3. Checking XPath expressions with The XPath Visualizer
Another great visual tool is The XPath Visualizer. It is designed specifically for debugging and experimenting with XPath expressions during XSLT stylesheet development. Load your XML file, enter an expression, and iterate through the collection of elements that match it (they are marked with a yellow background). This is a great help in investigating patterns and tuning matching expressions.
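The same "enter an expression, look at what it matches" loop can be roughly imitated without a visual tool. The sketch below is only an approximation: it uses Python's standard xml.etree.ElementTree, which supports a limited subset of XPath, and the two-row SAMPLE table is a hypothetical document assembled from cells shown in this article. The function name matches is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two table cells taken from the article's examples,
# wrapped in a minimal <table> so there is a single root element.
SAMPLE = """<table>
  <tr><td rowspan="2"><a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201616467/electricporkchop"><img height="140" src="images/practical.jpg" width="112" /></a></td></tr>
  <tr><td colspan="3" height="20"><font size="5"><a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201633469/electricporkchop">TCP/IP Illustrated, Volume 1</a></font></td></tr>
</table>"""


def matches(xml_text, expression):
    """Return the list of elements selected by the expression."""
    root = ET.fromstring(xml_text)
    return root.findall(expression)


if __name__ == "__main__":
    # Try a strict expression first, then a more tolerant one.
    for expression in (".//td/a", ".//td//a"):
        found = matches(SAMPLE, expression)
        print(expression, "matched", len(found), "element(s)")
        for element in found:
            print("   ", element.tag, element.get("href"))
```

Comparing the match counts of two candidate expressions is a crude substitute for the yellow highlighting, but it exposes the same thing: which cells an expression silently misses.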
Step 4. Applying XSLT and testing the result
Fighting chaos, case 2: <xsl:choose> element

Book titles presented another problem: the target information was occasionally located in the content of a <font> tag enclosed in an <a> tag (variant 1), and on other occasions it was the content of an <a> tag enclosed in <font> (variant 2). The <xsl:choose> element came to the rescue and accommodated this diversity.

Variant 1:

    <td colspan="3" height="20">
      <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0131103628/electricporkchop">
        <font size="5">The C Programming Language</font><br />
      </a> by Mark Williams Company</td>

The matching XPath expression is: ./td[1]/a/font/text()

Variant 2:

    <td colspan="3" height="20">
      <font size="5">
        <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201325829/electricporkchop">
          Programming and Deploying Java Mobile Agents with Aglets</a></font><br />
      by Danny B. Lange, Mitsuru Oshima</td>

The matching XPath expression is: ./td[1]/font/a/text()

Solution: <xsl:choose>

Here we simply test which variant we have encountered and choose the appropriate XSLT instruction.

    <xsl:choose>
      <xsl:when test="./td[1]/a/font">
        <title>
          <xsl:value-of select="normalize-space(./td[1]/a/font/text())"/>
        </title>
      </xsl:when>
      <xsl:when test="./td[1]/font/a">
        <title>
          <xsl:value-of select="normalize-space(./td[1]/font/a/text())"/>
        </title>
      </xsl:when>
    </xsl:choose>

Fighting chaos, case 3: descendant-or-self (//) axis

Variant 1:

    <td colspan="3" height="20">
      <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0131103628/electricporkchop">
        <font size="5">The C Programming Language</font><br />
      </a> by Mark Williams Company</td>

The matching XPath expression would be: ./td[1]/a/@href

Variant 2:

    <td colspan="3" height="20">
      <font size="5">
        <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201633469/electricporkchop">
          TCP/IP Illustrated, Volume 1</a>
      </font><br />
      by W. Richard Stevens</td>

Here we have a <font> tag between <td> and <a>. We could write ./td[1]/font/a/@href, but there is a better solution: ./td[1]//a/@href

By putting // between <td> and <a> we are saying "select all <a> tags that are descendants of the <td> tag, regardless of how many levels deep they are located". This XPath expression matches both variants above.
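Both tricks can be checked quickly outside an XSLT processor. The following sketch is written under a few assumptions: the context node of the expressions is taken to be the enclosing <tr> (the article does not show it), the helper names normalize_space and extract are invented for the example, and Python's standard xml.etree.ElementTree is used even though it supports only a subset of XPath. Because that subset has no text() or @href steps, the element is located with the path and the text or attribute is then read through the Python API.

```python
import xml.etree.ElementTree as ET

# The two title variants from case 2, wrapped in an assumed <tr> context node.
VARIANT_1 = """<tr><td colspan="3" height="20">
  <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0131103628/electricporkchop">
    <font size="5">The C Programming Language</font><br />
  </a> by Mark Williams Company</td></tr>"""

VARIANT_2 = """<tr><td colspan="3" height="20">
  <font size="5">
    <a target="amazon" href="http://www.amazon.com/exec/obidos/ASIN/0201325829/electricporkchop">
      Programming and Deploying Java Mobile Agents with Aglets</a></font><br />
  by Danny B. Lange, Mitsuru Oshima</td></tr>"""


def normalize_space(text):
    """Rough stand-in for XPath normalize-space(): trim and collapse whitespace."""
    return " ".join((text or "").split())


def extract(row_xml):
    """Return (title, href) for one row, tolerating both tag orders."""
    row = ET.fromstring(row_xml)

    # Counterpart of <xsl:choose>: try variant 1, then fall back to variant 2.
    title_node = row.find("./td[1]/a/font")
    if title_node is None:
        title_node = row.find("./td[1]/font/a")
    title = normalize_space(title_node.text) if title_node is not None else None

    # Counterpart of ./td[1]//a/@href: any <a> below the first <td>.
    link = row.find("./td[1]//a")
    href = link.get("href") if link is not None else None
    return title, href


if __name__ == "__main__":
    for row_xml in (VARIANT_1, VARIANT_2):
        print(extract(row_xml))
```

Note the asymmetry: the title needs an explicit two-way branch because the element holding the text differs between variants, while the link needs none because // absorbs the optional <font> level.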