Saturday, March 19, 2011

DOM and XPath

I've been working on a project that will help college students plan their upcoming semester by providing an easy-to-use system to filter through the courses they need. One of its features is that it will automatically grab the latest schedule data from a web page (or a series of web pages). The most complicated and cumbersome way to scrape HTML from a web page is to write a custom routine that finds exactly what you need.

I've already used PHP's SimpleXML extension (the SimpleXMLElement class) to pull posts from this blog (which is hosted by Blogger) and put them into a local database. But that was XML, and the web page that has the schedule data is in (ir)regular HTML. So, I decided to work with PHP's DOM extension. In my trial-and-error experimentation with it, I cooked up a little utility in the lab that shows the DOM tree as the extension sees it, along with the attributes of each element.
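For anyone curious what that kind of utility looks like, here is a minimal sketch, not the actual code from the lab. The sample markup and the dumpNode() function name are made up for illustration; DOMDocument, loadHTML() and the libxml error functions are part of PHP itself.

```php
<?php
// Hypothetical sample markup standing in for the real schedule page.
$html = '<html><body><table id="schedule">'
      . '<tr class="course"><td>CS 101</td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings quietly
// instead of letting them spill onto the page.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// dumpNode() is an illustrative helper: it returns one indented line per
// element, followed by that element's attributes in brackets.
function dumpNode(DOMNode $node, $depth = 0)
{
    $out = '';
    if ($node instanceof DOMElement) {
        $attrs = array();
        foreach ($node->attributes as $attr) {
            $attrs[] = $attr->name . '="' . $attr->value . '"';
        }
        $out .= str_repeat('  ', $depth) . $node->nodeName;
        if ($attrs) {
            $out .= ' [' . implode(', ', $attrs) . ']';
        }
        $out .= "\n";
        // Only elements are walked further; text nodes are skipped.
        foreach ($node->childNodes as $child) {
            $out .= dumpNode($child, $depth + 1);
        }
    }
    return $out;
}

echo dumpNode($doc->documentElement);
```

Text nodes are left out on purpose here so the output stays a readable outline of the element structure.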

Once you create the DOM document object, the next step is to locate the data you need using XPath. XPath is a W3C standard query language for selecting nodes in an XML (or in this case, HTML) document: you write a path expression, and it returns the elements that match. I'm slightly embarrassed that I'd never used - or even heard of - XPath until now. However, now that I know about it, I think I'll be using it a lot more.
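To make that concrete, here is a rough sketch of an XPath lookup with the DOM extension. The table id, the class name and the course data are all invented for the example (the real schedule page will look different); DOMXPath and its query() method are the standard PHP classes.

```php
<?php
// Hypothetical schedule markup; the id and class names are assumptions.
$html = '<table id="schedule">'
      . '<tr class="course"><td>CS 101</td><td>MWF 9:00</td></tr>'
      . '<tr class="course"><td>MATH 201</td><td>TR 10:30</td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every course row anywhere inside the schedule table.
// Using // before tr keeps the query working whether or not the
// markup wraps the rows in a tbody element.
$rows = $xpath->query('//table[@id="schedule"]//tr[@class="course"]');

$courses = array();
foreach ($rows as $row) {
    // A relative query, evaluated with the current row as context node.
    $cells = $xpath->query('td', $row);
    $courses[] = array(
        'name' => trim($cells->item(0)->textContent),
        'time' => trim($cells->item(1)->textContent),
    );
}

print_r($courses);
```

The nice part is that the whole "find exactly what you need" step shrinks to a single path expression, and changing what gets scraped usually means changing only that string.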

Meanwhile, my schedule searching project is coming along nicely. I hope to have it up and running by the time Fall registration hits.