The Artima Developer Community

Python Buzz Forum
Capturing the power of re.split

Simon Willison

Simon Willison is a web technology enthusiast studying for a Computer Science degree at Bath Uni, UK
Capturing the power of re.split Posted: Oct 25, 2003 8:18 PM

This post originated from an RSS feed registered with Python Buzz by Simon Willison.
Original Post: Capturing the power of re.split
Feed Title: Simon Willison: Python
Feed URL: http://simon.incutio.com/syndicate/python/rss1.0
Feed Description: Simon Willison's Python category

A couple of Python tips. The first is really a tip for Mozilla/Firebird: You can set up a Custom Keyword for instantly accessing Python module documentation using the string www.python.org/doc/current/lib/module-%s.html - I have this set up as pydoc, so I can type pydoc re to jump straight to the re module documentation. I only set it up half an hour ago and I've already used it about a dozen times.

The second tip is so powerful I've been kicking myself for not finding out about it sooner. It relates to the regular expression module's re.split() function. Just like string.split(), this lets you split up a string based on a certain token. With string.split() the token you split on isn't included in the resulting list:

>>> 'pipe|separated|values'.split('|')
['pipe', 'separated', 'values']

This is also true of re.split:

>>> import re
>>> splitter = re.compile('<.>')
>>> splitter.split('hi<a>there<b>from<c>python')
['hi', 'there', 'from', 'python']

Here's the magic part though. If you put part or all of the regular expression in parentheses, the text matched by that group gets included in the resulting list:

>>> splitter = re.compile('(<.>)')
>>> splitter.split('hi<a>there<b>from<c>python')
['hi', '<a>', 'there', '<b>', 'from', '<c>', 'python']
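
To illustrate the "part of the expression" case, here is a small sketch (my own variation on the example above, not taken from the module documentation) in which only the character between the angle brackets is captured. Only that character, not the whole tag, should then show up in the result:

```python
import re

# Only the character inside the angle brackets is in a capturing group,
# so the brackets themselves are dropped from the split result.
inner_splitter = re.compile('<(.)>')

parts = inner_splitter.split('hi<a>there<b>from<c>python')
print(parts)
# ['hi', 'a', 'there', 'b', 'from', 'c', 'python']
```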

Why is this a big deal? Because it suddenly makes writing simple parsers and tokenisers a whole heck of a lot easier. Using the above example, say you wanted to do something with each of the <?> style tags. You can just iterate through the resulting list, identify each tag using the regular expression you've already compiled, alter just those list items, and then join the whole list back together again at the end.
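
A rough sketch of that split, alter and join loop might look like the following. The function name shout_tags and the choice of upper-casing each tag are made up purely for illustration; any per-tag transformation could go in their place:

```python
import re

# The capturing group keeps the matched tags in the split result.
splitter = re.compile('(<.>)')

def shout_tags(text):
    """Upper-case every <x> style tag, leaving the other text untouched.

    Hypothetical helper for illustration only.
    """
    parts = splitter.split(text)
    for i, part in enumerate(parts):
        # fullmatch checks that the whole list item is a tag,
        # not just a piece of text that happens to contain one.
        if splitter.fullmatch(part):
            parts[i] = part.upper()
    return ''.join(parts)

print(shout_tags('hi<a>there<b>from<c>python'))
# hi<A>there<B>from<C>python
```

With a single capturing group the tags also always land at the odd indexes of the list, so stepping through parts[1::2] would be another way to pick them out.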

Simple parsing and replacement of easily identified tags can already be achieved using the re.sub() method, which allows you to provide a callback function to process each matching token. The difference with using re.split() is that you can easily take into account the order of the tokens, allowing you to build systems that use special tags to define areas of documents without getting confused by nested tag sets. As a simple example, you could build a basic event-based XML parser using just a couple of expressions. In fact, I discovered this technique while examining the source code for the tinpy tiny python template module, which gives a clue to why I'm so interested in it.
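
To make the ordering point concrete, here is a minimal sketch of an event-style tokeniser built on re.split(). It is nowhere near a real XML parser, and it is not how tinpy does it: the tag syntax (<name> opens, </name> closes, no attributes) and the events function are assumptions invented for this example. Because the tokens arrive in document order, a simple stack is enough to keep track of nesting:

```python
import re

# Assumed tag syntax for illustration: <name> opens, </name> closes.
tag = re.compile(r'(</?\w+>)')

def events(text):
    """Yield ('start', name), ('end', name) or ('text', data) in document order."""
    stack = []  # names of the currently open tags, innermost last
    for token in tag.split(text):
        if not token:
            continue  # split yields '' between adjacent tags and at the ends
        if tag.fullmatch(token):
            if token.startswith('</'):
                name = token[2:-1]
                if not stack or stack[-1] != name:
                    raise ValueError('unexpected closing tag: ' + token)
                stack.pop()
                yield ('end', name)
            else:
                name = token[1:-1]
                stack.append(name)
                yield ('start', name)
        else:
            yield ('text', token)
    if stack:
        raise ValueError('unclosed tag: <' + stack[-1] + '>')

for event in events('<doc>hi <b>there</b></doc>'):
    print(event)
# ('start', 'doc')
# ('text', 'hi ')
# ('start', 'b')
# ('text', 'there')
# ('end', 'b')
# ('end', 'doc')
```

A re.sub() callback sees each match in isolation, whereas here the stack lets the code notice a mismatched closing tag, which is the kind of ordering information the split-based approach makes easy to use.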

Having discovered this feature in Python, I just had to see if it existed in other languages as well. Unsurprisingly it does: PHP's preg_split offers an optional PREG_SPLIT_DELIM_CAPTURE flag (added in PHP 4.0.5), and JavaScript has similar behaviour to Python, including the splitting token if it is wrapped in parentheses.

I'm probably the last person to find out about this, but it's such a useful technique I felt I just had to share it with the world.
