asp tutorials, asp.net tutorials, sample code, and Microsoft news from 15Seconds
Converting Your existing HTML to XML
By Ashwin Kamanna

    Introduction

    With the advent of the Web, many organizations have published large amounts of information as HTML pages. These pages are tied to a single presentation. Extensible Markup Language (XML) allows us to separate the content from the presentation. Migrating these HTML pages to XML by hand (creating well-formed documents out of the existing HTML, or cutting and pasting content from HTML into newly created XML files) would be a daunting task. This article shows how the HTML Tidy tool and a COM wrapper can make the job simpler. In my article Server side use of MSXMLDOM with HTML, I showed how we can use the Document Object Model (DOM) parser to work with HTML documents, provided they are well-formed. This article extends the same idea. Here we discuss a sample conversion of a bookmark file from HTML to XML and then into a browser-neutral tree view.

    This article will be helpful for sites where a lot of information is maintained as flat HTML pages, where the pages share a similar structure, and where the site developers want to separate the content from the publishing elements, either to support rendering media other than a browser or to gain more control over the rendering itself.

    Download supporting source code.

    TidyCOM

    The first step in the process is to clean up your HTML pages so that an XSLT (XSL Transformations) style sheet or a DOM or SAX parser can work with the documents. Dave Raggett's HTML Tidy is a good tool for converting untidy (not well-formed) HTML to well-formed documents, and also to XHTML and XML. For a review of HTML Tidy, read HTML Tidy: Keeping it Clean. HTML Tidy is a command-line tool and will not be of much help if we are considering a Web-based interface. André Blavier has developed a COM wrapper for HTML Tidy, called TidyCOM, and also a Windows-based GUI front end called TidyGUI. TidyCOM can be used from scripting languages.

    After the Conversion

    Once TidyCOM has created an XML document from the input HTML page, what remains is rendering the resulting XML. To make the rendering browser compatible, we do a server-side transformation of this XML tree to HTML/XHTML using a DOM parser or an XSLT style sheet. In this article we discuss the conversion of a bookmark HTML file to XML. We use the DOM parser to do this transformation, although XSLT would do well in most cases. The reason for choosing DOM will become clear as we get deeper into the transformation and look at the complexity behind generating a tree view in the browser. XSLT variables are immutable, so XSLT requires a recursive style of programming rather than the sequential style of procedural languages. In complex transformations where maintaining state is essential, DOM is a better alternative, at the cost of giving up the declarative nature of XSLT. And when performance and scalability become critical with large pages, SAX (the Simple API for XML) is the alternative to choose.

    The Bookmark File

    You can read about Netscape's bookmark file format at http://msdn.microsoft.com/workshop/browser/external/overview/Bookmark_File_Format.asp. The file is an HTML document and is not well-formed. Programs that want to extract bookmarks and convert them into a tree user interface therefore have to include a parser written to the rules of the bookmark file format. XML would arguably be a better format for storing bookmarks, but when Netscape defined the format, XML was still a long way off. The XML Bookmark Exchange Language (XBEL) is a standard that was developed as an Internet bookmarks interchange format. But let us concentrate on our current discussion of conversion from HTML to XML.
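
    To make "not well-formed" concrete, the sketch below embeds a short, hand-written sample in the style of a Netscape bookmark file and counts opening versus closing tags. The sample is abbreviated and illustrative, not taken from the article's download, and the helper name countTags is an assumption made for this illustration.

```javascript
// Abbreviated, hand-written sample in the Netscape bookmark style (illustrative only).
var sample = [
  '<!DOCTYPE NETSCAPE-Bookmark-file-1>',
  '<TITLE>Bookmarks</TITLE>',
  '<H1>Bookmarks</H1>',
  '<DL><p>',
  '    <DT><H3>Personal Toolbar Folder</H3>',
  '    <DL><p>',
  '        <DT><A HREF="http://home.netscape.com/bookmark/4_7/whatsnew.html">What\'s New</A>',
  '    </DL><p>',
  '</DL><p>'
].join('\n');

// Hypothetical helper: count opening and closing occurrences of one tag name.
function countTags(html, tagName) {
  var open = html.match(new RegExp('<' + tagName + '(\\s[^>]*)?>', 'gi')) || [];
  var close = html.match(new RegExp('</' + tagName + '\\s*>', 'gi')) || [];
  return { open: open.length, close: close.length };
}

var dt = countTags(sample, 'DT');   // DT elements are opened but never closed
var p = countTags(sample, 'p');     // so are the P elements
var dl = countTags(sample, 'DL');   // DL elements happen to be balanced
```

    An XML parser rejects the first unclosed DT or P element, which is why a clean-up pass such as HTML Tidy is needed before any DOM or XSLT processing.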

    The Conversion Interface: Convert.asp

    In the interface I have provided, the Webmaster enters the path of the bookmark file and submits the form to get a converted file. Once the conversion finishes, the tree view built from the XML file can be previewed. You could also provide an interface to upload the bookmark file from the local system to the server, or a button to export the IE Favorites to the server. The IE Favorites are stored in the same format as Netscape's bookmarks.

    To run the conversion code you will have to download the TidyCOM component from http://perso.wanadoo.fr/ablavier/TidyCOM/#download and register it on your system. Note that the TidyCOM component is not re-entrant. Do not cache the instance; instead, create and release it on each page. It also would not be advisable to expose it to a potentially large number of requests.

    The conversion is pretty simple. Provide the absolute path of the HTML file and provide the destination file path, and you get a well-formed XML file. The TidyObject is set to output XML from the source HTML file, and this does half of our job.

    
    bookmarksFilePath = Request.Form("txtFilePath")

    if bookmarksFilePath <> "" then
        set fso = Server.CreateObject("Scripting.FileSystemObject")
        if fso.FileExists(bookmarksFilePath) then
            destBookmarksPath = Server.MapPath(".") & "\bookmark.xml"
            ' create and release the Tidy object on every request; it is not re-entrant
            set TidyObj = Server.CreateObject("TidyCOM.TidyObject")
            TidyObj.Options.Doctype = "strict"
            TidyObj.Options.DropFontTags = true
            TidyObj.Options.OutputXml = true ' set the output type to XML
            TidyObj.Options.Indent = 2 'AutoIndent
            TidyObj.Options.TabSize = 8
            TidyObj.TidyToFile bookmarksFilePath, destBookmarksPath
            set TidyObj = nothing
        else
            ' source file is missing: report it and return to the form
            Response.Write "<script language='javascript'>alert('File Not Found');location.href='convert.asp'</script>"
            set fso = nothing
            Response.End
        end if
        set fso = nothing
        Response.Write "<html><body><a  href='allframes.htm' target='_blank'  >Preview  </a><br><a  href='convert.asp'  >Back</a></body></html>"
        Response.End
    end if
    
    

    The Generated Bookmark XML File

    You can view the generated XML file in Internet Explorer 5.0 for confirmation. In case of any errors, IE prompts with the appropriate error message and the line number.

    An editor such as XML Notepad lets you analyze the generated hierarchy visually, as an alternative to reading through the generated XML file.

    With complex hierarchies, it is good practice to draw the hierarchies on paper for a better visualization. This will help before you jump into creating your XSLT documents or start programming using DOM. The following figure shows such a diagram.

    Note that this article discusses a single HTML document structure, that of the bookmark file. When we start converting a large number of pages, we are likely to find pages in the application with different structures. We then have to develop a model that is the superset of all the pages, or group the pages with similar structures and create a model for each group. If most of the pages have totally different structures, or the application has very few pages, the effort is not worth it. You can simply do the conversion manually or not at all.

    Transforming the XML into a Cross-Browser Tree

    In order to create a browser-neutral tree, we first create a JavaScript Object tree with a DOM-like structure: a root node and child nodes in a tree hierarchy. We build this tree in a hidden frame named "tree" and write the HTML tree into the document of the visible frame named "temp." This will be clear from the following code in the HTML page "allframes.htm":

    
        <frameset frameborder="0" framespacing="0" border="0" cols="*" rows="0,*">
          <frame marginwidth="0" marginheight="0" src="tree.asp" name="tree" noresize scrolling="no" frameborder="0">
          <frame marginwidth="5" marginheight="5" src="temp.htm" name="temp" noresize scrolling="auto" frameborder="0">
        </frameset>
    
    
    The page tree.asp contains the JavaScript that does all the work behind the scenes.
    
    <SCRIPT LANGUAGE=javascript>
    <!--
    var tempDoc = parent.frames["temp"].document;

    function treeNode(nodeName, url, id, text){
        this.nodeName = nodeName;
        this.url = url;
        this.expanded = 'false';
        this.setAttribute = new Function("attributeName", "value", "if (attributeName=='expanded'){ this.expanded=value }");
        this.getAttribute = new Function("attributeName", "if (attributeName=='expanded') return this.expanded ");
        this.id = id;
        this.text = text;
        this.childNodes = new Array();
        this.hasChildNodes = new Function("return  (this.childNodes.length > 0 ? true :false)");
    }
    
    
    The function treeNode(nodeName, url, id, text) creates a node with attributes such as nodeName, text, id, and url. We use the same node object to represent both a folder and a leaf node (a URL). We differentiate them by the nodeName attribute.

    A folder node is created as

    
    var x1 =new  treeNode('folder','folder','1' ,'Personal Toolbar Folder');
    
    
    We append it to its parent node using the append() function: append ( x,x1 );
    
    A leaf node is created as 
    childNode  = new  treeNode('leaf','http://home.netscape.com/bookmark/4_7/whatsnew.html' ,'10',  'What\'s New');
    append (  x10,childNode  );
    
    
    Although JavaScript objects do not enforce encapsulation, so setAttribute( ) and getAttribute( ) hide nothing, I have provided these methods and hasChildNodes( ) on every node just to give a feel of the DOM application programming interface (API).

    The method displayTreeNode( ) is the one that really draws the tree. Every folder in the hierarchy has a unique ID. We use this ID to identify the most recently expanded or collapsed folder in the Object tree.
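
    The node helpers described above can be exercised outside the browser. The following is a minimal sketch rather than the article's listing: it declares the same treeNode and append names but uses ordinary function expressions instead of new Function, and the countLeaves helper is an assumption added purely for illustration.

```javascript
// Sketch of the DOM-like node described above, using ordinary function expressions.
function treeNode(nodeName, url, id, text) {
  this.nodeName = nodeName;   // 'folder' or 'leaf'
  this.url = url;
  this.id = id;
  this.text = text;
  this.expanded = 'false';    // kept as a string, as in the article's listing
  this.childNodes = [];
  this.setAttribute = function (attributeName, value) {
    if (attributeName == 'expanded') { this.expanded = value; }
  };
  this.getAttribute = function (attributeName) {
    if (attributeName == 'expanded') { return this.expanded; }
  };
  this.hasChildNodes = function () {
    return this.childNodes.length > 0;
  };
}

function append(parentNode, childNode) {
  parentNode.childNodes[parentNode.childNodes.length] = childNode;
}

// Hypothetical helper: count the leaf nodes below a node by walking childNodes.
function countLeaves(node) {
  if (node.nodeName == 'leaf') { return 1; }
  var total = 0;
  for (var i = 0; i < node.childNodes.length; i++) {
    total += countLeaves(node.childNodes[i]);
  }
  return total;
}

var x = new treeNode('folder', 'folder', '0', 'Bookmarks');
var x1 = new treeNode('folder', 'folder', '1', 'Personal Toolbar Folder');
append(x, x1);
append(x1, new treeNode('leaf', 'http://home.netscape.com/bookmark/4_7/whatsnew.html', '1', 'What\'s New'));
var leafCount = countLeaves(x);
```

    Because folders and leaves share one constructor, any walk over the tree only needs to check nodeName to decide whether to recurse.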

    
    function displayTreeNode( node, expandedNodeID, expand ){

        // The clicked node takes the requested state; every other node keeps its state.
        if ( ( node.id == expandedNodeID && expand ) || ( node.getAttribute('expanded') == 'true' && node.id != expandedNodeID ) ){
            node.setAttribute('expanded','true');
        }else{
            node.setAttribute('expanded','false');
        }

        var i = 0;
        var html = '<TR><TD>';
        var nodeId = "\'" + node.id + "\'";
        var treeFrame = "\'" + 'tree' + "\'";

        // getSpaces() supplies the indentation width; it is not shown in this listing
        html = html + "<img src='images/spacer.gif' width='" + getSpaces( node.id ) + "'  height='10' > ";

        if ( node.nodeName == 'folder' ){
            if ( node.getAttribute('expanded') == 'true' ){
                html = html + ' <a href="javascript:parent.frames[' + treeFrame + '].collapse(' + nodeId + ')"   > ' +
                    '<img  src=images/minus.gif  border=0 ></a><img  src=images/folder_open.gif    >';
            }else{
                html = html + ' <a href="javascript:parent.frames[' + treeFrame + '].expand(' + nodeId + ')"   > ' +
                    '<img  src=images/plus.gif  border=0 ></a><img  src=images/folder_closed.gif    >';
            }
            html = html + " " + node.text + "</TD></TR>";
        }else if ( node.nodeName == 'leaf' ){
            html = html + '<img  src=images/iefile.gif   width=10 height=10  >';
            html = html + " <a  href='" + node.url + "'  target='_blank'   >" + node.text + "</a></TD></TR>";
        }

        tempDoc.writeln( html );

        // Recurse only into folders that are currently expanded.
        if ( node.nodeName == 'folder' && node.getAttribute('expanded') == 'true' && node.hasChildNodes() ){
            for ( i = 0; i < node.childNodes.length; i++ ){
                var currNode = node.childNodes[i];
                displayTreeNode( currNode, expandedNodeID, expand );
            }
        }
    }
    
    function beginHTML(){
    tempDoc.open("text/html","replace");//open the document for writing
    tempDoc.writeln( '<HTML><HEAD><TITLE>Bookmarks</TITLE></HEAD><BODY><TABLE >' );
    }
    
    function endHTML(){
    	tempDoc.writeln( '</TABLE></BODY></HTML>' );
    	tempDoc.close();
    	
    }
    
    function expand( nodeID ){
    		beginHTML();				
    		displayTreeNode( parent.frames['tree'].x , nodeID , true  );
    		endHTML();
    }
    
    function collapse( nodeID ){
    		beginHTML();				
    		displayTreeNode( parent.frames['tree'].x , nodeID ,false  );
    		endHTML();
    }
    
    function append(parentNode,childNode){
    	parentNode.childNodes[parentNode.childNodes.length ]= childNode;
    }
    //-->
    </SCRIPT>
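
    The expand and collapse rule in displayTreeNode can be tried without frames or document.write. The sketch below is a simplification under stated assumptions: it stores expanded as a boolean on plain object literals (the kind and children property names are hypothetical), and the visibleLabels function returns the labels of the visible nodes as an array instead of writing table rows.

```javascript
// Collect the labels of the visible nodes, applying the same rule as displayTreeNode:
// the clicked folder takes the requested state, every other folder keeps its state.
function visibleLabels(node, clickedId, expand, out) {
  if (node.kind === 'folder' && node.id === clickedId) {
    node.expanded = expand;
  }
  out.push(node.text);
  if (node.kind === 'folder' && node.expanded) {
    for (var i = 0; i < node.children.length; i++) {
      visibleLabels(node.children[i], clickedId, expand, out);
    }
  }
  return out;
}

// Small illustrative tree; the state lives in the objects between redraws.
var root = {
  kind: 'folder', id: '0', text: 'Bookmarks', expanded: false,
  children: [
    { kind: 'folder', id: '1', text: 'Personal Toolbar Folder', expanded: false,
      children: [
        { kind: 'leaf', id: '1', text: 'What\'s New', children: [] }
      ] },
    { kind: 'leaf', id: '', text: 'Net Search Page', children: [] }
  ]
};

var afterRoot = visibleLabels(root, '0', true, []);       // expand the root
var afterFolder = visibleLabels(root, '1', true, []);     // then expand folder 1
var afterCollapse = visibleLabels(root, '1', false, []);  // then collapse folder 1
```

    Each call redraws the whole tree from the root, as expand( ) and collapse( ) do, and the previously expanded folders stay open because their state is stored on the nodes themselves.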
    
    
    To build the client-side Object tree, the page has to call the treeNode and append functions once for each node found while parsing the XML document. The JavaScript code required to create the Object tree should look something like this:
    
    var x= new treeNode('folder', 'folder' ,'0' ,  'Bookmarks for ashwin' );  
    var x1 =new  treeNode('folder','folder','1' , 
       'Personal Toolbar Folder');
    append (  x,x1  );
    
    var x1_1 =new  treeNode('folder','folder','1_1' , 
       'ASP Bookmarks');
    append (  x1,x1_1  );
    .      .
    .      .
    .      .
    append (  x10,childNode  );
    childNode  = new  treeNode('leaf','http://home.netscape.com/bookmark/4_7/whatsnew.html' ,'10', 
      'What\'s New');
    append (  x10,childNode  );
    
    var x11 =new  treeNode('folder','folder','11' , 
       'Personal Bookmarks');
    append (  x,x11  );
    childNode  = new  treeNode('leaf','http://www.real.com' ,'', 
      'RealPlayer      Home Page');
    append (  x,childNode  );
    childNode  = new  treeNode('leaf','http://home.netscape.com/escapes/search/netsearch_1.html' ,'', 
      'Net Search Page');
    append (  x,childNode  );
    
    
    The above code is generated by the ASP page "JSGenerator.asp," which is explained in the next section. It is included in the page and followed by a call that draws the tree by expanding the root node:
    
    <SCRIPT LANGUAGE=javascript>
    <!--
    	<!--#include file="JSGenerator.asp"  -->
    	expand('0');	
    //-->
    </SCRIPT>
    
    

    Generating the JavaScript Code from the Bookmark XML: JSGenerator.asp

    This program uses the XML parser installed with IE 5.0. You need not install the latest version. Following the document structure analyzed above, we create three procedures: getDLChilds, getDDChilds, and displayLeaf.

    getDLChilds walks all the child nodes of a DL node; it is called recursively, mirroring the nesting seen in the document structure. getDDChilds writes the folder-related code and makes a recursive call to getDLChilds. displayLeaf writes the URL-related code.

    
    filePath = Server.MapPath(".")&"\bookmark.xml"
    set xmlObj= Server.CreateObject("Microsoft.xmldom")
    xmlObj.validateOnParse = false
    xmlObj.async = false
    xmlObj.preserveWhiteSpace = false
    xmlObj.load(filePath ) 
    
    set rootNode  = xmlObj.documentElement
    set nodeList = rootNode.childNodes(0)
    'get the title for display with the root node
    title = rootNode.getElementsByTagName("head")(0).getElementsByTagName("title")(0).text
    set bodyNode = rootNode.getElementsByTagName("body")(0)
    
    'the root node
    Response.Write "var x= new treeNode('folder', 'folder' ,'0' ,  '"&title&"' );  "
    
    'start with the child nodes 'DL' immediately  under the BODY node
    for each child in bodyNode.childNodes
    	if child.nodeName = "dl" then
    		getDLChilds child,null
    	end if
    next
    
    'get all the child 
    sub  getDLChilds(node,folderNumberStr )
    	dim fileOrder,folderOrder
    	fileOrder = 0
    	folderOrder = 0
    	for each child in node.childNodes
    			if child.nodeName = "dd" then
    				getDDChilds child,folderOrder,folderNumberStr
    			elseif 	child.nodeName = "dt" then
    				fileOrder = fileOrder +1 	
    				'create the URL node
    				displayLeaf child, folderNumberStr
    			end if
    	next
    		
    end sub
    
    sub getDDChilds(DDNode, ByRef folderOrder, folderNumberStr )	
    
    	dim variableName 
    	dim folderId
    	for each child in DDNode.childNodes
    		if child.nodeName = "h3" then
    				folderOrder = folderOrder + 1
    					
    				Response.Write vbCrLf & "var "
    				variableName = "x"
    				'create a new unique variable to account for the newly encountered folder
    				if IsNull(folderNumberStr) or folderNumberStr="" then
    					variableName = variableName &  folderOrder 	
    					folderId = folderOrder 	
    				else 	
    					variableName = variableName & folderNumberStr & "_" &folderOrder 
    					folderId = folderNumberStr & "_" &folderOrder 
    				end if 				
    					
    				'create and assign the new folder node to the variable
    		Response.Write variableName &" =" &_
    							"new  treeNode("    &_
    		"'folder','folder','"& folderId &"' , "& vbCrLf  & "   '"& removeNewLine( child.text) & "'"    &_
    							 ");" & vbCrLf 
    			'append the newly created  folder node to its parent
    			Response.Write  "append (  x" &_
    								  folderNumberStr &","&variableName&"  "   &_
    							");" & vbCrLf 
    		
    			elseif 	child.nodeName = "dl" then
    				if IsNull(folderNumberStr) or folderNumberStr="" then
    					getDLChilds child,folderOrder
    				else
    					getDLChilds child,folderNumberStr &"_"&folderOrder
    				end if	
    							
    			end if
    	next
    end sub
    
    
    'create the URL node and append it to the parent 
    sub displayLeaf(DTNode, parentFolderNumberStr)
        if DTNode.hasChildNodes() then

            Response.Write "childNode  = " &_
                "new  treeNode(" &_
                "'leaf','" & DTNode.firstChild.getAttribute("href") & "' ,'" & parentFolderNumberStr & "', " & vbCrLf & "  " &_
                "'" & removeNewLine(DTNode.firstChild.text) & "'" &_
                ");" & vbCrLf

            Response.Write "append (  x" &_
                parentFolderNumberStr & ",childNode  " &_
                ");" & vbCrLf
        end if
    end sub
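
    The recursive walk in the listing above can be mirrored in a few lines of JavaScript, which may make the numbering scheme easier to follow. This is a sketch under stated assumptions, not a port of JSGenerator.asp: the input is a plain object tree shaped like the Tidy output the listing expects (DL containing DD and DT, DD containing H3 and DL), the names walkDL, walkDD, and escapeJs are hypothetical, and a small counter object stands in for the ByRef folderOrder parameter. The escapeJs helper is an assumption about what a generator needs before placing titles inside single-quoted literals; the article's removeNewLine function is not shown.

```javascript
// Hypothetical helper: make text safe inside a single-quoted JavaScript literal.
function escapeJs(text) {
  return text.replace(/\\/g, '\\\\').replace(/'/g, "\\'").replace(/\r?\n/g, ' ');
}

// Walk a DL element: DD children describe folders, DT children describe links.
function walkDL(dl, folderNumberStr, out) {
  var counter = { folderOrder: 0 };   // shared with walkDD, like the ByRef parameter
  for (var i = 0; i < dl.children.length; i++) {
    var child = dl.children[i];
    if (child.name === 'dd') {
      walkDD(child, counter, folderNumberStr, out);
    } else if (child.name === 'dt' && child.children.length > 0) {
      var a = child.children[0];
      out.push("childNode = new treeNode('leaf','" + escapeJs(a.href) + "','" +
        folderNumberStr + "','" + escapeJs(a.text) + "');");
      out.push('append(x' + folderNumberStr + ',childNode);');
    }
  }
  return out;
}

// Walk a DD element: H3 names a new folder, a nested DL holds its contents.
function walkDD(dd, counter, folderNumberStr, out) {
  var prefix = folderNumberStr === '' ? '' : folderNumberStr + '_';
  for (var i = 0; i < dd.children.length; i++) {
    var child = dd.children[i];
    if (child.name === 'h3') {
      counter.folderOrder += 1;
      var id = prefix + counter.folderOrder;
      out.push("var x" + id + " = new treeNode('folder','folder','" + id + "','" +
        escapeJs(child.text) + "');");
      out.push('append(x' + folderNumberStr + ',x' + id + ');');
    } else if (child.name === 'dl') {
      walkDL(child, prefix + counter.folderOrder, out);
    }
  }
}

// Illustrative input: one folder containing one link, plus one top-level link.
var body = {
  name: 'dl', children: [
    { name: 'dd', children: [
      { name: 'h3', text: 'Personal Toolbar Folder' },
      { name: 'dl', children: [
        { name: 'dt', children: [
          { name: 'a', href: 'http://home.netscape.com/bookmark/4_7/whatsnew.html', text: "What's New" }
        ] }
      ] }
    ] },
    { name: 'dt', children: [
      { name: 'a', href: 'http://www.real.com', text: 'RealPlayer Home Page' }
    ] }
  ]
};

var statements = walkDL(body, '', []);
```

    The counter is exactly the kind of mutable state that is awkward to carry through an XSLT style sheet, which is the reason given earlier for choosing DOM here.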
    
    
    Once the JavaScript that creates the Object tree has been generated, the tree is written into the document and we get the tree view shown in the following figure, which is our final product. In addition to browser independence, another benefit of the JavaScript Object tree is that the data is cached on the client, so there are no round trips to the server to read the XML file unless the user refreshes the page.

    Summary

    We have now seen how to clean up our HTML, produce XML, and then render it in a different style of our choice. We looked at a sample bookmark conversion and at creating a browser-neutral tree from an XML document. If this has you geared up for a conversion you have been contemplating for some time, good luck!

    References

    XHTML by Chelsea Valentine and Chris Minnix, see: http://www.newriders.com/books/title.cfm?isbn=0735710341

    HTML Tidy's Web site, see: http://www.w3.org/People/Raggett/tidy/

    TidyCOM: A COM Wrapper for HTML Tidy's Web site, see: http://perso.wanadoo.fr/ablavier/TidyCOM/

    About the Author

    Ashwin Kamanna is a software engineer working at AINS INDIA, Pvt. Ltd. He has worked on different technologies, such as ASP, DHTML, XML, and Java. He can be reached at kamanna_ashwin@hotmail.com.
