With the advent of the Web, many organizations have published large amounts of information as HTML pages. These pages are tied to a single presentation. Extensible Markup Language (XML) allows us to separate the content from the presentation. Migrating these HTML pages to XML by hand, whether by reworking the existing HTML into well-formed documents or by cutting and pasting content from HTML into newly created XML files, would be a daunting task. This article shows how the tool HTML Tidy and a COM wrapper for it can make the job simpler. In my article Server side use of MSXMLDOM with HTML, I showed how we can exploit the functionality of the Document Object Model (DOM) parser to work with HTML documents, provided they are well-formed. This article extends the same idea. Here we discuss a sample conversion of a bookmark file from HTML to XML and then into a browser-neutral tree view.
This article will be helpful for sites where a lot of information is maintained as flat HTML pages that share a similar structure, and where the developers want to separate the content from the publishing elements, either to support rendering media other than a browser or to gain more control over the rendering itself.
The first step in the process is to clean up your HTML pages so that an XSLT (XSL Transformations) processor or a DOM or SAX parser can work with the documents. Dave Raggett's HTML Tidy is a good tool for converting untidy (not well-formed) HTML to well-formed documents and also to XHTML and XML. For a review of HTML Tidy, read HTML Tidy: Keeping it Clean. HTML Tidy is a command-line tool and will not be of much help if we are considering a Web-based interface. André Blavier has developed a COM wrapper for HTML Tidy, called TidyCOM, and also a Windows-based GUI front end called TidyGUI. TidyCOM can be used from scripting languages.
After the Conversion
Once TidyCOM has created an XML document from the input HTML page, what remains is rendering the resultant XML. To make the rendering browser-compatible, we do a server-side transformation of this XML tree to HTML/XHTML using a DOM parser or an XSLT style sheet. In this article we discuss the conversion of a bookmark HTML file to XML. We use the DOM parser to do this transformation, although XSLT would do well in most cases. The reason for choosing DOM will become clear as we get deeper into the transformation and look at the complexity behind generating a tree view in the browser. XSLT variables are immutable, so XSLT requires programming in a different paradigm: recursion rather than the sequential style of procedural languages. In complex transformations, where maintaining state matters more, DOM is the better alternative, and we have to forgo the declarative nature of XSLT. When performance and scalability become critical with large page sizes, we have to choose SAX (the Simple API for XML) instead.
The Bookmark File
You can read about Netscape's Bookmark file format at http://msdn.microsoft.com/workshop/browser/external/overview/Bookmark_File_Format.asp. The bookmark file is an HTML document and is not well-formed. Programs that want to extract bookmarks and convert them into a tree user interface have to include a parser that obeys the rules of the Bookmark file format. An XML file would arguably be a better format for storing bookmarks, but when Netscape defined the format, XML was still a long way off. The XML Bookmark Exchange Language (XBEL) was later developed as an interchange format for Internet bookmarks. But let us concentrate on our current topic, the conversion from HTML to XML.
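To make "not well-formed" concrete, here is a small JavaScript sketch that reports the first place where tags fail to nest the way XML requires. It is a deliberately naive illustration, not how HTML Tidy works: the function name firstNestingError and both sample fragments are assumptions made for this example, and the check ignores comments, CDATA sections, and other real-world cases.

```javascript
// Naive illustration only: reports the first tag that breaks XML-style nesting.
// Ignores comments, CDATA, attribute values containing '>' and similar cases.
function firstNestingError(markup) {
  var tagPattern = /<(\/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(\/?)>/g;
  var open = []; // stack of currently open element names
  var match;
  while ((match = tagPattern.exec(markup)) !== null) {
    var isClose = match[1] === '/';
    var name = match[2].toLowerCase();
    var selfClosing = match[3] === '/';
    if (selfClosing) continue;
    if (!isClose) {
      open.push(name);
      continue;
    }
    var expected = open.pop();
    if (expected !== name) {
      return 'found </' + name + '> but expected ' +
        (expected ? '</' + expected + '>' : 'no closing tag');
    }
  }
  return open.length ? 'unclosed <' + open[open.length - 1] + '>' : null;
}

// Hypothetical fragment in the style of a bookmark file: <p> and <DT> are never closed.
var bookmarkStyle = '<DL><p><DT><A HREF="http://www.real.com">RealPlayer</A></DL>';
// The same content after a Tidy-like cleanup (written by hand for this example).
var wellFormed = '<dl><dt><a href="http://www.real.com">RealPlayer</a></dt></dl>';

console.log(firstNestingError(bookmarkStyle));
console.log(firstNestingError(wellFormed));
```

An XML parser rejects the first fragment for the same reason the sketch does, which is why the clean-up step has to come before any DOM or XSLT processing.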
The Conversion Interface: Convert.asp
In the interface I have provided, the Webmaster enters the path of the bookmark file and submits the form to get a converted file. As soon as the conversion is done, the tree view built from the XML file can be previewed. You can also provide an interface to upload the bookmark file from the local system to the server, or a button to export the IE Favorites to the server. Exported IE Favorites use the same format as Netscape's bookmark file.
To run the conversion code you will have to download the TidyCOM component from http://perso.wanadoo.fr/ablavier/TidyCOM/#download and register it on your system. Note that the TidyCOM component is not re-entrant. Do not cache the instance; instead, create and release it on each page. It is also not advisable to expose it to a potentially large number of requests.
The conversion is pretty simple. Provide the absolute path of the HTML file and provide the destination file path, and you get a well-formed XML file. The TidyObject is set to output XML from the source HTML file, and this does half of our job.
bookmarksFilePath = Request.Form("txtFilePath")
if bookmarksFilePath <> "" then
    set fso = Server.CreateObject("Scripting.FileSystemObject")
    if fso.FileExists( bookmarksFilePath ) then
        destBookmarksPath = Server.MapPath(".") & "\bookmark.xml"
        ' create the TidyCOM object for this request only; it is not re-entrant
        set TidyObj = Server.CreateObject("TidyCOM.TidyObject")
        TidyObj.Options.Doctype = "strict"
        TidyObj.Options.DropFontTags = true
        TidyObj.Options.OutputXml = true ' set the output type to XML
        TidyObj.Options.Indent = 2 'AutoIndent
        TidyObj.Options.TabSize = 8
        TidyObj.TidyToFile bookmarksFilePath, destBookmarksPath
        ' release the object as soon as the file has been written
        set TidyObj = nothing
    else
        Response.Write "<script language='javascript'>alert('File Not Found');location.href='convert.asp'</script>"
        set fso = nothing
        Response.End
    end if
    set fso = nothing
    ' conversion succeeded: offer the tree preview and a way back to the form
    Response.Write "<html><body><a href='allframes.htm' target='_blank' >Preview </a><br><a href='convert.asp' >Back</a></body></html>"
    Response.End
end if
The Generated Bookmark XML File
You can view the generated XML file in Internet Explorer 5.0 for confirmation. In case of any errors, IE prompts with the appropriate error message and the line number.
An editor like XML Notepad can help you analyze the generated hierarchy visually, as an alternative to reading through the generated XML file.
With complex hierarchies, it is good practice to draw the hierarchies on paper for a better visualization. This will help before you jump into creating your XSLT documents or start programming using DOM. The following figure shows such a diagram.
Note that this article discusses a single HTML document structure, that of the bookmark file. When we start converting a large number of pages, we are likely to find pages in the application that have different structures. We then have to develop a model that is the superset of all the pages, or group the pages with similar structures and create a model for each group. If most of the pages have totally different structures, or the application has very few pages, the effort is not worth it. You can simply do the conversion manually, or not at all.
Transforming the XML into a Cross-Browser Tree
To create a browser-neutral tree, we first create a JavaScript Object tree with a DOM-like structure: a root node and child nodes in a tree hierarchy. We build this Object tree in a hidden frame named "tree" and write the HTML tree into the document of the visible frame named "temp." This will be clear from the following code in the HTML page "allframes.htm":
The tree.asp contains the JavaScript that performs all the magic behind the scenes.
<SCRIPT LANGUAGE=javascript >
<!--
var tempDoc = parent.frames["temp"].document;
function treeNode(nodeName, url, id, text ){
    this.nodeName = nodeName ;
    this.url = url ;
    // the expanded state is kept as the string 'true' or 'false'
    this.expanded = 'false';
    this.setAttribute = new Function("attributeName" , "value", "if (attributeName=='expanded'){ this.expanded=value }" );
    this.getAttribute = new Function("attributeName" , "if (attributeName=='expanded') return this.expanded " );
    this.id = id ;
    this.text = text ;
    this.childNodes = new Array();
    this.hasChildNodes = new Function("return (this.childNodes.length > 0 ? true :false)" );
}
The function treeNode(nodeName, url, id, text) creates a node with the attributes nodeName, url, id, and text. We use the same node object to represent both a folder and a leaf node (a URL), and we differentiate them by the nodeName attribute.
A folder node is created as
var x1 =new treeNode('folder','folder','1' ,'Personal Toolbar Folder');
We append it to its parent node using the append() method as
append ( x,x1 );
A leaf node is created as
childNode = new treeNode('leaf','http://home.netscape.com/bookmark/4_7/whatsnew.html' ,'10', 'What\'s New');
append ( x10,childNode );
Although JavaScript is not a class-based object-oriented language, and the setAttribute() and getAttribute() methods do not give us any real encapsulation, I have provided these methods and hasChildNodes() on every node just to give a feel of the DOM application program interface (API).
The method displayTreeNode() is the one that really draws the tree. Every folder in the hierarchy has a unique ID. We use this ID to identify the folder that was just expanded or collapsed in the Object tree.
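As a quick illustration of how this DOM-like interface is used, the following self-contained sketch builds a three-node tree and counts its leaves. The constructor is restated with ordinary function expressions so the block runs on its own. The helper countLeaves and the sample nodes are assumptions made for this example and are not part of the article's pages.

```javascript
// Same shape as the article's treeNode, restated so this block is self-contained.
function treeNode(nodeName, url, id, text) {
  this.nodeName = nodeName;
  this.url = url;
  this.id = id;
  this.text = text;
  this.expanded = 'false';
  this.childNodes = [];
  this.setAttribute = function (name, value) {
    if (name == 'expanded') this.expanded = value;
  };
  this.getAttribute = function (name) {
    if (name == 'expanded') return this.expanded;
  };
  this.hasChildNodes = function () {
    return this.childNodes.length > 0;
  };
}

function append(parentNode, childNode) {
  parentNode.childNodes[parentNode.childNodes.length] = childNode;
}

// Sample data: a root folder, one sub-folder, and one link inside it.
var root = new treeNode('folder', 'folder', '0', 'Bookmarks');
var toolbar = new treeNode('folder', 'folder', '1', 'Personal Toolbar Folder');
var link = new treeNode('leaf', 'http://www.real.com', '1', 'RealPlayer Home Page');
append(root, toolbar);
append(toolbar, link);

// Hypothetical helper: counts the leaf nodes below a node by walking childNodes.
function countLeaves(node) {
  if (node.nodeName == 'leaf') return 1;
  var total = 0;
  for (var i = 0; i < node.childNodes.length; i++) {
    total += countLeaves(node.childNodes[i]);
  }
  return total;
}

console.log(countLeaves(root));
```

Because folders and leaves share one constructor, any code that walks the tree only needs to check nodeName and hasChildNodes(), much as it would with a real DOM.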
function displayTreeNode( node , expandedNodeID , expand ){
    // the toggled node takes the new state; every other node keeps its current state
    if ( ( node.id == expandedNodeID && expand) || (node.getAttribute('expanded')=='true' && node.id != expandedNodeID ) ){
        node.setAttribute('expanded','true');
    }else{
        node.setAttribute('expanded','false');
    }
    var i = 0 ;
    var html = '<TR><TD>';
    var nodeId = "\'" + node.id + "\'" ;
    var treeFrame= "\'" + 'tree' + "\'" ;
    // getSpaces() is not shown in this listing; it supplies the indentation width for the node
    html = html + "<img src='images/spacer.gif' width='"+ getSpaces( node.id ) + "' height='10' > "
    if ( node.nodeName == 'folder' ){
        if ( node.getAttribute('expanded')=='true' ){
            html = html + ' <a href="javascript:parent.frames['+ treeFrame + '].collapse('+ nodeId +')" > ' +
                '<img src=images/minus.gif border=0 ></a><img src=images/folder_open.gif >' ;
        }else{
            html = html + ' <a href="javascript:parent.frames['+ treeFrame + '].expand('+ nodeId +')" > ' +
                '<img src=images/plus.gif border=0 ></a><img src=images/folder_closed.gif >' ;
        }
        html = html + " " + node.text + "</TD></TR>"
    }else if ( node.nodeName == 'leaf' ){
        html = html + '<img src=images/iefile.gif width=10 height=10 >' ;
        html = html + " <a href='"+ node.url+"' target='_blank' >" + node.text + "</a></TD></TR>"
    }
    tempDoc.writeln( html );
    // only expanded folders show their children
    if ( node.nodeName == 'folder' && node.getAttribute('expanded')=='true' && node.hasChildNodes() ){
        for(i = 0; i < node.childNodes.length ; i++ ) {
            var currNode = node.childNodes[i] ;
            displayTreeNode( currNode , expandedNodeID , expand )
        }
    }
}
function beginHTML(){
    tempDoc.open("text/html","replace");//open the document for writing
    tempDoc.writeln( '<HTML><HEAD><TITLE>Bookmarks</TITLE></HEAD><BODY><TABLE >' );
}
function endHTML(){
    tempDoc.writeln( '</TABLE></BODY></HTML>' );
    tempDoc.close();
}
function expand( nodeID ){
    beginHTML();
    displayTreeNode( parent.frames['tree'].x , nodeID , true );
    endHTML();
}
function collapse( nodeID ){
    beginHTML();
    displayTreeNode( parent.frames['tree'].x , nodeID ,false );
    endHTML();
}
function append(parentNode,childNode){
    parentNode.childNodes[parentNode.childNodes.length ]= childNode;
}
//-->
</SCRIPT>
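The expand and collapse behaviour of displayTreeNode() is easier to follow when the rows are returned as strings instead of being written into a frame. The sketch below does that under a few assumptions: nodes are plain objects created by a made-up makeNode helper, the links are simplified, and indentWidth is a hypothetical stand-in for getSpaces(), which the listing above does not show.

```javascript
// Minimal node factory for this sketch (plain objects instead of treeNode).
function makeNode(nodeName, id, text, url) {
  return { nodeName: nodeName, id: id, text: text, url: url, expanded: false, childNodes: [] };
}

// Hypothetical stand-in for getSpaces(): 16 pixels per underscore-separated level of the ID.
function indentWidth(id) {
  return id.split('_').length * 16;
}

// Returns the table rows for `node` and, if it is an expanded folder, for its descendants.
// toggledId and expand play the roles of expandedNodeID and expand in displayTreeNode().
function renderRows(node, toggledId, expand) {
  // Only the toggled folder changes state; all other folders keep theirs.
  if (node.nodeName === 'folder' && node.id === toggledId) {
    node.expanded = expand;
  }
  var indent = '<img src="images/spacer.gif" width="' + indentWidth(node.id) + '" height="10">';
  var rows = [];
  if (node.nodeName === 'folder') {
    var action = node.expanded ? 'collapse' : 'expand';
    rows.push('<TR><TD>' + indent + '<a href="javascript:' + action + '(\'' + node.id + '\')">' +
      (node.expanded ? '-' : '+') + '</a> ' + node.text + '</TD></TR>');
    if (node.expanded) {
      for (var i = 0; i < node.childNodes.length; i++) {
        rows = rows.concat(renderRows(node.childNodes[i], toggledId, expand));
      }
    }
  } else {
    rows.push('<TR><TD>' + indent + '<a href="' + node.url + '" target="_blank">' +
      node.text + '</a></TD></TR>');
  }
  return rows;
}

// Sample tree: root folder '0' containing folder '1', which contains one link.
var root = makeNode('folder', '0', 'Bookmarks');
var asp = makeNode('folder', '1', 'ASP Bookmarks');
asp.childNodes.push(makeNode('leaf', '1', 'RealPlayer Home Page', 'http://www.real.com'));
root.childNodes.push(asp);

console.log(renderRows(root, '0', true).join('\n'));
```

Because the expanded flag lives on the nodes themselves, collapsing and re-expanding a parent folder restores the previous state of its sub-folders without any extra bookkeeping.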
As we parse the XML document on the server, we need to generate repeated calls to the treeNode and append functions, which build the client-side Object tree.
The JavaScript code required to create the Object tree should look something like this:
var x= new treeNode('folder', 'folder' ,'0' , 'Bookmarks for ashwin' );
var x1 =new treeNode('folder','folder','1' ,
'Personal Toolbar Folder');
append ( x,x1 );
var x1_1 =new treeNode('folder','folder','1_1' ,
'ASP Bookmarks');
append ( x1,x1_1 );
. .
. .
. .
append ( x10,childNode );
childNode = new treeNode('leaf','http://home.netscape.com/bookmark/4_7/whatsnew.html' ,'10',
'What\'s New');
append ( x10,childNode );
var x11 =new treeNode('folder','folder','11' ,
'Personal Bookmarks');
append ( x,x11 );
childNode = new treeNode('leaf','http://www.real.com' ,'',
'RealPlayer Home Page');
append ( x,childNode );
childNode = new treeNode('leaf','http://home.netscape.com/escapes/search/netsearch_1.html' ,'',
'Net Search Page');
append ( x,childNode );
The above code is generated by the ASP page "JSGenerator.asp," which is explained in the next section.
Generating the JavaScript Code from the Bookmark XML: JSGenerator.asp
This program uses the XML parser installed with IE 5.0. You need not install the latest version. Following the analysis of the document structure above, we create three methods: getDLChilds, getDDChilds, and displayLeaf.
getDLChilds walks the child nodes of a DL node; it is called recursively, as the structure suggests. getDDChilds writes the folder-related code and calls getDLChilds recursively for any nested DL. displayLeaf writes the URL-related code.
filePath = Server.MapPath(".") & "\bookmark.xml"
set xmlObj = Server.CreateObject("Microsoft.xmldom")
xmlObj.validateOnParse = false
xmlObj.async = false
xmlObj.preserveWhiteSpace = false
xmlObj.load(filePath )
set rootNode = xmlObj.documentElement
set nodeList = rootNode.childNodes(0)
'get the title for display with the root node
title = rootNode.getElementsByTagName("head")(0).getElementsByTagName("title")(0).text
set bodyNode = rootNode.getElementsByTagName("body")(0)
'the root node
Response.Write "var x= new treeNode('folder', 'folder' ,'0' , '"&title&"' ); "
'start with the child nodes 'DL' immediately under the BODY node
for each child in bodyNode.childNodes
    if child.nodeName = "dl" then
        getDLChilds child,null
    end if
next
'process all the child nodes of a DL node:
'DD nodes hold folders, DT nodes hold links
sub getDLChilds(node,folderNumberStr )
    dim fileOrder,folderOrder
    fileOrder = 0
    folderOrder = 0
    for each child in node.childNodes
        if child.nodeName = "dd" then
            getDDChilds child,folderOrder,folderNumberStr
        elseif child.nodeName = "dt" then
            fileOrder = fileOrder +1
            'create the URL node
            displayLeaf child, folderNumberStr
        end if
    next
end sub
sub getDDChilds(DDNode, ByRef folderOrder, folderNumberStr )
    dim variableName
    dim folderId
    for each child in DDNode.childNodes
        if child.nodeName = "h3" then
            folderOrder = folderOrder + 1
            Response.Write vbCrLf & "var "
            variableName = "x"
            'create a new unique variable to account for the newly encountered folder
            if IsNull(folderNumberStr) or folderNumberStr="" then
                variableName = variableName & folderOrder
                folderId = folderOrder
            else
                variableName = variableName & folderNumberStr & "_" &folderOrder
                folderId = folderNumberStr & "_" &folderOrder
            end if
            'create and assign the new folder node to the variable
            Response.Write variableName &" =" &_
                "new treeNode(" &_
                "'folder','folder','"& folderId &"' , "& vbCrLf & " '"& removeNewLine( child.text) & "'" &_
                ");" & vbCrLf
            'append the newly created folder node to its parent
            Response.Write "append ( x" &_
                folderNumberStr &","&variableName&" " &_
                ");" & vbCrLf
        elseif child.nodeName = "dl" then
            if IsNull(folderNumberStr) or folderNumberStr="" then
                getDLChilds child,folderOrder
            else
                getDLChilds child,folderNumberStr &"_"&folderOrder
            end if
        end if
    next
end sub
'create the URL node and append it to the parent
sub displayLeaf(DTNode,parentFolderNumberStr )
    if DTNode.hasChildNodes() then
        Response.Write "childNode = " &_
            "new treeNode(" &_
            "'leaf','"& DTNode.firstChild.getAttribute("href") & "' ,'"&parentFolderNumberStr&"', "& vbCrLf &_
            " '"& removeNewLine(DTNode.firstChild.text) & "'" &_
            ");" & vbCrLf
        Response.Write "append ( x" &_
            parentFolderNumberStr &",childNode " &_
            ");" & vbCrLf
    end if
end sub
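The naming scheme used by the generator (x1, x1_1, and so on, with every leaf appended to its enclosing folder) can be sketched in JavaScript as follows. This is an illustration only: it walks a simplified in-memory structure rather than the DL/DD/DT nodes of the XML file, and the names generate, jsLiteral, and bookmarks are assumptions made for the example. In particular, jsLiteral is only a guess at what the removeNewLine helper, which is not listed in this article, has to achieve.

```javascript
// Make a title safe inside a single-quoted JavaScript string literal.
// (A guess at the job done by the unlisted removeNewLine helper.)
function jsLiteral(text) {
  return text.replace(/\\/g, '\\\\').replace(/'/g, "\\'").replace(/[\r\n]+/g, ' ');
}

// items: array of links { title, href } or folders { title, children } (assumed shape).
// path: ID of the enclosing folder, '' for the root. out: array collecting generated lines.
function generate(items, path, out) {
  var folderOrder = 0; // counts the folders seen at this level, like folderOrder in the ASP code
  for (var i = 0; i < items.length; i++) {
    var item = items[i];
    if (item.children) {
      folderOrder++;
      var id = path === '' ? String(folderOrder) : path + '_' + folderOrder;
      out.push("var x" + id + " = new treeNode('folder','folder','" + id + "','" +
        jsLiteral(item.title) + "');");
      out.push("append(x" + path + ", x" + id + ");");
      generate(item.children, id, out); // nested folders extend the ID path
    } else {
      out.push("childNode = new treeNode('leaf','" + item.href + "','" + path + "','" +
        jsLiteral(item.title) + "');");
      out.push("append(x" + path + ", childNode);");
    }
  }
  return out;
}

// Sample data for the sketch.
var bookmarks = [
  { title: 'Personal Toolbar Folder', children: [
    { title: 'ASP Bookmarks', children: [] },
    { title: "What's New", href: 'http://home.netscape.com/bookmark/4_7/whatsnew.html' }
  ] },
  { title: 'RealPlayer Home Page', href: 'http://www.real.com' }
];

var lines = generate(bookmarks, '', ["var x = new treeNode('folder','folder','0','Bookmarks');"]);
console.log(lines.join('\n'));
```

Encoding the folder path in the variable name is what lets the generated code refer to any parent folder without keeping a stack on the client.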
Once the JavaScript that creates the Object tree has been generated, the tree is written into the document, and we get the tree view shown in the following figure, which is our final product. In addition to browser independence, another benefit of the JavaScript Object tree is that the data is cached on the client, so there are no round trips to the server to read the XML file unless the user refreshes the page.
Summary
We have seen how to clean up our HTML, produce XML from it, and then render it in a style of our choice. We looked at a sample bookmarks conversion and at creating a browser-neutral tree from an XML document. If this has encouraged you to start the conversion you have been contemplating for some time, good luck!
Ashwin Kamanna is a software engineer working at AINS INDIA, Pvt. Ltd. He has worked on different technologies, such as ASP, DHTML, XML, and Java.
He can be reached at kamanna_ashwin@hotmail.com.