Html2Xml Web Service

By Fons Sonnemans

As you probably know, HTML is a "markup language" that uses "tags" (such as <br> and <p>) to mark up text for formatting. The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web.

Both HTML and XML use <, >, and & to create element and attribute structures. While HTML browsers accept or ignore mangled markup, XML parsers and applications built on those parsers are less forgiving.
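The difference in strictness is easy to observe with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree purely for illustration (it is not part of Html2Xml), and is_well_formed is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical browser-tolerated HTML: unclosed <br> and mismatched tag case.
print(is_well_formed("<p>line one<br>line two</P>"))    # prints False
# The same content written as well-formed XML.
print(is_well_formed("<p>line one<br />line two</p>"))  # prints True
```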

The Html2Xml webservice takes Html text and converts it into Xml text. Along the way it corrects the Html so that the result is well-formed Xml.

You can use this webservice to screen scrape a web page and convert it to Xml. The Xml can then be used for further processing.

The Html2Xml webservice is also registered in the UDDI registry.

 

Operations

The following operations are supported. For a formal definition, please review the Service Description.

  • HtmlString2XmlNode
    Public: Convert the Html string to an XmlNode.
  • HtmlString2XmlString
    Public: Convert the Html string to an Xml string.
  • Url2XmlNode
    Public: Convert the Html page with the given Url to an XmlNode.
  • Url2XmlString
    Public: Convert the Html page with the given Url to an Xml string.
  • VersionInfo
    Public: Returns the web service name, current version, date and copyright information.
  • ReportStatistics
    Private: Returns usage statistics; this operation requires a Key.

Buy the Html2Xml Library

Given the success of the Html2Xml on the web, Reflection IT released the 'Html2Xml Library' product. More info...

Corrections

The Html2Xml webservice corrects the markup so that it matches the observed rendering in popular browsers from Netscape and Microsoft as much as possible. Here are just a few examples of how Html2Xml perfects your HTML for you:

Match the case of the start and end tags
The case of the start and end tags must match. Html2Xml writes all tags in lower case.

HTML:
<P>here is a paragraph with a <b>bold</B> word</p>

Corrected XML:
<p>here is a paragraph with a <b>bold</b> word</p>
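As a rough illustration of this correction, the sketch below lower-cases element names with a regular expression. It is a simplification and not how Html2Xml is implemented: it ignores attribute names, comments and script content, and lowercase_tags is a hypothetical helper name.

```python
import re

# Matches "<" or "</" followed by an element name; attributes are left untouched.
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tags(html: str) -> str:
    """Lower-case every element name in start and end tags."""
    return TAG_NAME.sub(lambda m: m.group(1) + m.group(2).lower(), html)


print(lowercase_tags("<P>here is a paragraph with a <b>bold</B> word</p>"))
# prints: <p>here is a paragraph with a <b>bold</b> word</p>
```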

 

End tags in the wrong order are corrected
XML does not allow start and end tags to overlap, but enforces a strict hierarchy within the document. Html2Xml corrects this as much as possible.

HTML:
<table>
   <tr>
      <td>
          text
       </tr>
   </td>
</table>

Corrected XML:
<table>
   <tr>
      <td>
          text
      </td>
   </tr>
</table>
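One common way to repair overlapping tags is to keep a stack of open elements and let every end tag close whatever is still open inside it. The sketch below shows that idea with Python's standard html.parser; it is an assumption about a workable approach, not a description of Html2Xml's internals. NestingFixer and fix_nesting are hypothetical names, and attributes are dropped to keep the example short.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class NestingFixer(HTMLParser):
    """Re-emit tags so that end tags always close the innermost open element."""

    def __init__(self):
        super().__init__()
        self.out = []    # pieces of the corrected markup
        self.stack = []  # names of the currently open elements

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.out.append(f"<{tag}>")  # attributes omitted in this sketch

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: nothing open to close, so drop it
        # Close everything opened after `tag`, then `tag` itself.
        while True:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def fix_nesting(html: str) -> str:
    fixer = NestingFixer()
    fixer.feed(html)
    fixer.close()
    # Close anything still open at the end of the input.
    while fixer.stack:
        fixer.out.append(f"</{fixer.stack.pop()}>")
    return "".join(fixer.out)


print(fix_nesting("<table><tr><td>text</tr></td></table>"))
# prints: <table><tr><td>text</td></tr></table>
```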

Non-empty elements are closed
All elements must be closed, explicitly or implicitly. Many people use the <p> tag to separate paragraphs, but it is designed to mark the beginning and end of a paragraph. That makes it a "non-empty" tag, since it contains the paragraph text. Html2Xml adds the end tag to all non-empty tags.

Affected Elements: <basefont>, <body>, <colgroup>, <dd>, <dt>, <head>, <html>, <li>, <p>, <tbody>/<thead>/<tfoot>, <th>/<td>, <tr>.

HTML:
<ul>
  <li>
    Bullet 1
  <li>
    Bullet 2
</ul>

Corrected XML:
<ul>
  <li>
    Bullet 1
  </li>
  <li>
    Bullet 2
  </li>
</ul>
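A minimal sketch of how missing end tags can be supplied, assuming the simplified rule that an element such as <li> is implicitly closed when a sibling with the same name starts. The rule, the IMPLIED_END set and the names ImpliedEndTagCloser and close_implied are assumptions made for this illustration; the rules Html2Xml actually applies are not documented here.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these elements end when a same-named sibling starts.
IMPLIED_END = {"li", "p", "dt", "dd", "tr", "td", "th"}


class ImpliedEndTagCloser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # A new <li> directly inside an open <li> closes the previous one first.
        if tag in IMPLIED_END and self.stack and self.stack[-1] == tag:
            self.out.append(f"</{self.stack.pop()}>")
        self.stack.append(tag)
        self.out.append(f"<{tag}>")  # attributes omitted in this sketch

    def handle_endtag(self, tag):
        # Close any elements still open inside `tag` before closing `tag`.
        while tag in self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def close_implied(html: str) -> str:
    closer = ImpliedEndTagCloser()
    closer.feed(html)
    closer.close()
    return "".join(closer.out)


print(close_implied("<ul><li>Bullet 1<li>Bullet 2</ul>"))
# prints: <ul><li>Bullet 1</li><li>Bullet 2</li></ul>
```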

 

 

Empty elements are terminated
While end tags may be optional with certain HTML elements, all elements in XML must have an end tag. A <br> tag is "empty" because it never contains anything. Other tags like this are <hr> and <img src="valid.gif">. Html2Xml terminates them by placing a forward slash (/) before the closing angle bracket.

Affected Elements: <area>, <base>, <br>, <col>, <frame>, <hr>, <img>, <input>, <isindex>, <link>, <meta>, <option>, <param>.

HTML:
text
<br>
text

Corrected XML:
text
<br />
text
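The termination step can be approximated with a regular expression over a known set of empty elements. The sketch below covers only a subset of the elements listed above, assumes attribute values do not contain ">", and uses the hypothetical name terminate_empty; it is an illustration rather than Html2Xml's own code.

```python
import re

# A subset of the empty elements listed above, enough for the illustration.
EMPTY = "area|base|br|col|frame|hr|img|input|link|meta|param"

# Start tag of an empty element, with or without an existing trailing slash.
EMPTY_TAG = re.compile(rf"<({EMPTY})\b([^>]*?)\s*/?>", re.IGNORECASE)


def terminate_empty(html: str) -> str:
    """Rewrite <br> as <br />, keeping any attributes unchanged."""
    return EMPTY_TAG.sub(r"<\1\2 />", html)


print(terminate_empty("text<br>text"))           # prints: text<br />text
print(terminate_empty('<img src="valid.gif">'))  # prints: <img src="valid.gif" />
```

Tags that are already terminated, such as <br />, come out unchanged, so the function can be applied more than once without harm.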

 

Missing quotes around attribute values are added
All attribute values must be quoted, whether or not they contain spaces. Html2Xml inserts quotation marks around all attribute values for you.

HTML:
<img src=example.gif width=40 height=30>

Corrected XML:
<img src='example.gif' width='40' height='30' />
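A tolerant HTML parser already reads unquoted attribute values, so adding the quotes is mostly a matter of writing the tag back out. The sketch below does this with Python's html.parser and xml.sax.saxutils.quoteattr; AttributeQuoter and quote_attributes are hypothetical names. Note that quoteattr prefers double quotes where the example above shows single quotes; both are valid XML. Terminating the empty <img> element is a separate step that this sketch leaves out.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr


class AttributeQuoter(HTMLParser):
    """Re-emit every tag with all attribute values quoted and escaped."""

    def __init__(self):
        super().__init__()
        self.out = []

    def _attrs(self, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent,
        # in which case this sketch writes an empty string.
        return "".join(f" {name}={quoteattr(value or '')}" for name, value in attrs)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)  # text is passed through untouched


def quote_attributes(html: str) -> str:
    quoter = AttributeQuoter()
    quoter.feed(html)
    quoter.close()
    return "".join(quoter.out)


print(quote_attributes("<img src=example.gif width=40 height=30>"))
# prints: <img src="example.gif" width="40" height="30">
```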

 

Duplicate attributes are removed
An attribute may only be used once within a start tag. Html2Xml removes duplicate attributes for you.

HTML:
<img src='example.gif' width='30' width='40' height='30' />

Corrected XML:
<img src='example.gif' width='30' height='30' />
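In the example above the first width value survives, so the sketch below assumes a "first occurrence wins" rule. It works on a list of (name, value) pairs, the form in which Python's html.parser reports attributes; drop_duplicate_attributes is a hypothetical helper name.

```python
def drop_duplicate_attributes(attrs):
    """Keep the first occurrence of each attribute name, preserving order."""
    seen = set()
    unique = []
    for name, value in attrs:
        key = name.lower()  # treat WIDTH and width as the same attribute
        if key not in seen:
            seen.add(key)
            unique.append((name, value))
    return unique


attrs = [("src", "example.gif"), ("width", "30"), ("width", "40"), ("height", "30")]
print(drop_duplicate_attributes(attrs))
# prints: [('src', 'example.gif'), ('width', '30'), ('height', '30')]
```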

 

Minimized attributes (used without a value) get a 'dummy' value
An attribute must have a value. Html2Xml gives attributes that have no values (e.g. nowrap, selected) a 'dummy' value.

HTML:
<td align='center' nowrap width='30'>

Corrected XML:
<td align='center' nowrap='value' width='30'>
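A small sketch of this rule, again on (name, value) pairs as reported by Python's html.parser, which uses None for an attribute written without a value. The placeholder 'value' is taken from the example above; fill_minimized_attributes is a hypothetical helper name.

```python
DUMMY_VALUE = "value"  # placeholder taken from the example above


def fill_minimized_attributes(attrs, dummy=DUMMY_VALUE):
    """Give every attribute that has no value (None) the dummy value."""
    return [(name, dummy if value is None else value) for name, value in attrs]


attrs = [("align", "center"), ("nowrap", None), ("width", "30")]
print(fill_minimized_attributes(attrs))
# prints: [('align', 'center'), ('nowrap', 'value'), ('width', '30')]
```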

 

Script blocks containing unparseable characters are enclosed in a CDATA section
Script blocks in HTML can contain unparseable characters, namely < and &. These must be escaped in well-formed XML, which Html2Xml does by enclosing the content of the script block in a CDATA section.

HTML:
<SCRIPT>
// checks a number against 5
function checkFive(n) {
    return n < 5;
}
</SCRIPT>

Corrected XML:
<SCRIPT><![CDATA[
// checks a number against 5
function checkFive(n) {
    return n < 5;
}
]]></SCRIPT>
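Wrapping script text in a CDATA section takes one extra precaution: the sequence "]]>" inside the script would end the section early. The sketch below splits that sequence across two CDATA sections, a standard XML technique; whether Html2Xml handles this case the same way is not stated here, and wrap_in_cdata is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(script_text: str) -> str:
    """Wrap script text in CDATA so that < and & need no escaping."""
    # "]]>" would terminate the section, so split it over two sections.
    safe = script_text.replace("]]>", "]]]]><![CDATA[>")
    return f"<![CDATA[{safe}]]>"


script = "function checkFive(n) { return n < 5; }"
xml = f"<script>{wrap_in_cdata(script)}</script>"
print(xml)

# An XML parser now accepts the block and returns the original script text.
print(ET.fromstring(xml).text == script)  # prints True
```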

 

Convert built-in characters
You cannot use the characters <, >, or & directly within the text of an Xml document. Html2Xml converts them to the Xml built-in character entities. The following characters are converted:

  • &lt; for <
  • &gt; for >
  • &amp; for &
  • &quot; for "
  • &apos; for '

Nonbreaking spaces (&nbsp;) are converted to &#160;, because &nbsp; is not one of the built-in Xml entities.
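The same conversions can be sketched with Python's xml.sax.saxutils.escape, which handles &, < and > itself and accepts a dictionary of extra replacements. The function works on already decoded text, where a nonbreaking space is the character U+00A0; to_xml_text and EXTRA_ENTITIES are hypothetical names used only for this illustration.

```python
from xml.sax.saxutils import escape

# escape() always converts &, < and >; these are the additional replacements.
EXTRA_ENTITIES = {
    '"': "&quot;",
    "'": "&apos;",
    "\u00a0": "&#160;",  # nonbreaking space
}


def to_xml_text(text: str) -> str:
    """Escape text so it can be placed inside an XML document."""
    return escape(text, EXTRA_ENTITIES)


print(to_xml_text('a < b & "c"'))  # prints: a &lt; b &amp; &quot;c&quot;
print(to_xml_text("x\u00a0y"))     # prints: x&#160;y
```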

 

Development

The Html2Xml webservice is developed using:

  • Microsoft .NET Framework 1.1
  • Microsoft Visual Studio .NET 2003
  • C#

 

Sample 

The Stock Sample Web Service uses the Html2Xml service to retrieve the Yahoo Finance page. It then searches the Xml, using an XPath expression, for the current value and returns it.

For more information see: Stock Sample Web Service
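To give an idea of the XPath step, the sketch below searches a small piece of converted markup with Python's xml.etree.ElementTree, which supports a subset of XPath. The XML, the class names and the expression are invented for illustration; the actual structure of the finance page and the expression used by the Stock Sample Web Service are not shown in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical converted page; the real markup is not shown in this article.
CONVERTED_XML = """<html><body><table>
<tr><td class="label">Last Trade</td><td class="value">42.50</td></tr>
</table></body></html>"""


def find_value(xml_text: str, xpath: str):
    """Return the text of the first element matching xpath, or None."""
    node = ET.fromstring(xml_text).find(xpath)
    return None if node is None else node.text


print(find_value(CONVERTED_XML, ".//td[@class='value']"))  # prints: 42.50
```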

 

Conclusion

The Html2Xml webservice is an easy-to-use, powerful solution for converting Html to Xml.

Disclaimer

Reflection IT makes these WebServices available for free use without any sort of guarantees or promises of proper operation. Reflection IT cannot guarantee support of any kind to fix bugs or effect modifications based on feedback provided by the users. All that can be said is that these WebServices are being used in The Netherlands and viewers are free to use them.

Reflection IT will in no way be responsible for problems arising out of the use of the WebServices. Using them is at your own risk.

Tags

Web XML ASP.NET

All postings/content on this blog are provided "AS IS" with no warranties, and confer no rights. All entries in this blog are my opinion and don't necessarily reflect the opinion of my employer or sponsors. The content on this site is licensed under a Creative Commons Attribution By license.
