The Artima Developer Community
Sponsored Link

Python Buzz Forum
Using XPath to mine XHTML

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Simon Willison

Posts: 282
Nickname: simonw
Registered: Jun, 2003

Simon Willison is a web technology enthusiast studying for a Computer Science degree at Bath Uni, UK
Using XPath to mine XHTML Posted: Oct 20, 2003 11:12 PM
Reply to this message Reply

This post originated from an RSS feed registered with Python Buzz by Simon Willison.
Original Post: Using XPath to mine XHTML
Feed Title: Simon Willison: Python
Feed URL: http://simon.incutio.com/syndicate/python/rss1.0
Feed Description: Simon Willison's Python cateory
Latest Python Buzz Posts
Latest Python Buzz Posts by Simon Willison
Latest Posts From Simon Willison: Python

Advertisement

This morning, I finally decided to install libxml2 and see what all the fuss was about, in particular with respect to XPath. What followed is best described as an enlightening experience.

XPath is a beautifully elegant way of adressing "nodes" within an XML document. XPath expressions look a little like file paths, for example:

/first/second
Match any <second> elements that occur inside a <first> element that is the root element of the document
//second
Match all <second> elements irrespective of their place in the document
//second[@hi]
Match all <second> elements with a 'hi' attribute
//second[@hi="there"]
Match all <second> elements with a 'hi' attribute that equals "there"

A full XPath tutorial is available.

The Python libxml2 bindings make running XPath expressions incredibly simple. Here's some code that extracts the titles of all of the entries on my Kansas blog from the site's RSS feed:

>>> import libxml2
>>> import urllib
>>> rss = libxml2.parseDoc(
      urllib.urlopen('http://www.a-year-in-kansas.com/syndicate/').read())
>>> rss.xpathEval('//item/title')
[<xmlNode (title) object at 0xb4b260>, <xmlNode (title) object at 0xa99968>, 
<xmlNode (title) object at 0x10dce68>]
>>> [node.content for node in rss.xpathEval('//item/title')]
['Music and Brunch', 'House hunting', 'Arrival']
>>> 

Why is this so exciting? I've been saying for over a year that XHTML is an ideal format for storing pieces of content in a database or content management system. Serving content to browsers as HTML 4 makes perfect sense, but storing your actual content as XML gives you the ability to process that content in the future using XML tools.

So far, the best example of a powerful tool for manipulating this stored XML has been XSLT. XSLT has its fans, but is also often criticised as being unintuitive and having a steep learning curve. XPath is a far better example of a powerful, easy to use tool that can be brought to bare on XHTML content.

Enough talk, here's an example of what I mean. The following code snippet creates a Python dictionary of all of the acronyms currently visible on the front page of my blog, mapping their shortened version to the expanded text (extracted from the title attribute):


>>> blog = libxml2.parseDoc(
    urllib.urlopen('http://simon.incutio.com/').read())
>>> ctxt = blog.xpathNewContext()
>>> ctxt.xpathRegisterNs('xhtml', 'http://www.w3.org/1999/xhtml')
0
>>> acronyms = dict([(a.content, a.prop('title')) 
    for a in ctxt.xpathEval('//xhtml:acronym')])
>>> for acronym, fulltext in acronyms.items():
	print acronym, ':', fulltext


DHTML : Dynamic HyperText Markup Language
URL : Universal Republic of Love
HTML : HyperText Markup Language
SIG : Special Interest Group
PHP : PHP: Hypertext Preprocessor
CSS : Cascading Style Sheets
>>> 

The above code is slightly more complicated than the first example, as using XPath with a document that uses XML namespaces requires some extra work to register the namespace with the XPath parser. Still, it's a pretty short piece of code considering what it does.

For an example of how powerful XPath can be on a much larger scale, take a look at Sam Ruby's XPath enabled blog search feature.

Read: Using XPath to mine XHTML

Topic: HTMLifying user input Previous Topic   Next Topic Topic: The Python Web SIG

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use