The Artima Developer Community
Sponsored Link

Computing Thoughts
Beautiful Soup 4 Now in Beta
by Bruce Eckel
February 23, 2012
Summary
If you do any HTML/XML processing, you've probably heard of this powerful and useful package. Download the latest beta and run it on your code to help debug it and make sure it does what you need!

Advertisement

Many improvements have been made in this new version -- for one thing, it's compatible with both Python 2 and Python 3. One of the biggest changes is that Beautiful Soup no longer uses its own parser; instead it chooses from what's available on your system, preferring the blazingly-fast lxml but falling back to other parsers and using Python's batteries-included parser if nothing else is available.

A useful tip if you're on Windows: you can find a pre-compiled Windows version of lxml here. That site has lots of pre-compiled Python extensions which is extremely helpful, as some of these packages (like lxml) otherwise require some serious gyrations in order to install them on your Windows box. (I work regularly on Mac, Windows 7 and Ubuntu Linux, in order to ensure that whatever I'm working on is cross-platform.)

Beautiful Soup has been refactored in many places; sometimes these changes constitute a significant improvement to the programming model, other times the changes are just to conform to the Python naming syntax or to ensure Python 2/3 compatibility.

The author Leonard Richardson is open to suggestions for improvements, so if you've had a feature request sitting on your back burner, now's the time.

Here's the introductory link to the Beautiful Soup 4 Beta.

I've been using Beautiful Soup to process a book that I'm coauthoring via Google Docs. We can work on the book remotely at the same time, which is something I've tried to do with other technologies via screen sharing. It works best with Google Docs because there's no setup necessary if we want to have a phone conversation about the document while working on it. Then I download the book in HTML format and apply the Beautiful Soup tools I've written to process the HTML. Although I've spent a fair amount of time on these, the investment is worth it because HTML isn't going away anytime soon so my Beautiful Soup skills should come in handy again and again.

Talk Back!

Have an opinion? Be the first to post a comment about this weblog entry.

RSS Feed

If you'd like to be notified whenever Bruce Eckel adds a new entry to his weblog, subscribe to his RSS feed.

About the Blogger

Bruce Eckel (www.BruceEckel.com) provides development assistance in Python with user interfaces in Flex. He is the author of Thinking in Java (Prentice-Hall, 1998, 2nd Edition, 2000, 3rd Edition, 2003, 4th Edition, 2005), the Hands-On Java Seminar CD ROM (available on the Web site), Thinking in C++ (PH 1995; 2nd edition 2000, Volume 2 with Chuck Allison, 2003), C++ Inside & Out (Osborne/McGraw-Hill 1993), among others. He's given hundreds of presentations throughout the world, published over 150 articles in numerous magazines, was a founding member of the ANSI/ISO C++ committee and speaks regularly at conferences.

This weblog entry is Copyright © 2012 Bruce Eckel. All rights reserved.

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use