The Artima Developer Community

Omit Needless Words
Screen Scraping With Python
by Greg Jorgensen
August 24, 2004
Summary
Web-enabling an old terminal-oriented application turns into more fun than expected. A blow-by-blow account of writing a screen scraper with Python and pexpect.

I recently finished a project for a local freight broker. They run their business on an old SCO Unix-based "green screen" terminal application. They wanted to enable some functionality on their web site, so customers could track their shipments, and carriers could update their location and status.

Old Code On Life Support

By the time the project got to me the client had almost given up on finding a solution. The company responsible for the terminal-based application pushed an expensive upgrade instead of offering any help. Reliably reading the proprietary and undocumented database didn't look easy, and safely writing to it appeared impossible.

I've run into this situation quite a few times. Companies often depend on old software, but they can't get basic support, much less enhancements. The company gets stuck with a hardware and software infrastructure they don't dare change. Replacing the software they depend on usually means expensive hardware and software upgrades, upsetting business processes and routines, and re-training staff. And like any major software project, an upgrade or conversion can fail or stall for unexpected reasons. Relying on old software puts a company at risk, but replacing that software can seem even riskier.

Imitating A Terminal Session

My plan involved setting up an intermediate server that would talk to the SCO box over telnet (the only network protocol I could use), and to the web server over HTTP. Rather than spend money on new hardware with so many unknowns, I installed an old all-in-one iMac running Yellow Dog Linux and the latest version of Python. I confirmed I could telnet to the SCO box and run the freight application, and that I could talk to the iMac through the firewall.

When you can't read the data files, and the program has no usable API, you have to resort to the technique known as screen scraping. You write code that simulates a human user, sending keystrokes and capturing the output. Generally you need code that simulates an output device, in my case a VT100 terminal.

It's tempting to think you can skip the terminal emulation and simply use regular expression matching on the output stream, but if it were that easy everyone would do it. If you're old enough to have written or worked with terminal-based applications you know that characters are not necessarily sent to the screen in the order you expect. Those programs were written to work over low-bandwidth connections, and they often optimize the output stream, inject control characters in the middle of words, and even skip sending characters that haven't changed on the terminal's display. If you watch the raw output coming from the program you'll see a jumble that only makes sense when projected on a terminal screen.
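To make the point concrete, here is a toy sketch of why the raw stream defeats pattern matching while a screen buffer does not. It is not the emulator I used; it understands only the VT100 cursor-position sequence (ESC [ row ; col H) and printable characters, and the sample stream is invented for illustration.

```python
import re

# Matches a VT100 cursor-position sequence such as ESC [ 5 ; 10 H.
CURSOR_POS = re.compile(r"\x1b\[(\d+);(\d+)H")


class TinyScreen:
    """A toy screen buffer: cursor positioning and printable text only."""

    def __init__(self, rows=24, cols=80):
        self.rows, self.cols = rows, cols
        self.cells = [[" "] * cols for _ in range(rows)]
        self.row = self.col = 0

    def feed(self, data):
        i = 0
        while i < len(data):
            m = CURSOR_POS.match(data, i)
            if m:
                # VT100 coordinates are 1-based; the buffer is 0-based.
                self.row = max(1, min(int(m.group(1)), self.rows)) - 1
                self.col = max(1, min(int(m.group(2)), self.cols)) - 1
                i = m.end()
                continue
            ch = data[i]
            # Anything else a real terminal understands is ignored here.
            if ch.isprintable() and self.col < self.cols:
                self.cells[self.row][self.col] = ch
                self.col += 1
            i += 1

    def line(self, row):
        """Return the text of a 1-based row with trailing blanks removed."""
        return "".join(self.cells[row - 1]).rstrip()


# Invented stream: the program paints one row of the screen out of order.
raw = "\x1b[5;16HRED\x1b[5;1HSTATUS:\x1b[5;10HDELIVE"

screen = TinyScreen()
screen.feed(raw)
print("DELIVERED" in raw)   # False: the word never appears in the stream
print(screen.line(5))       # STATUS:  DELIVERED
```

A real emulator has to handle many more sequences (erase, scroll regions, attributes), which is why a ready-made one is worth looking for.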

I originally planned to use expect, but it works best for command-line applications, not programs that paint a full screen with control codes. You can extend expect with Tk and you can find expect-Tk/Tcl screen scrapers, but I don't know Tk/Tcl, and learning a new language was not part of my plan for this project.

After some poking around, I found Noah Spurrier's pexpect module for Python and decided to give that a try. In the pexpect download I found an undocumented but seemingly-complete finite state machine VT100 emulator. After a little tinkering I could connect to the SCO box, log in, start up the freight broker program, navigate the menus, and capture output. I just had to figure out the specific keystrokes and screen positions and send those through pexpect and the VT100 emulator.
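A login sequence along these lines can be sketched with pexpect's spawn, expect, and sendline calls. This is an illustration, not the project code: the prompt strings ("login: ", "Password: ", a "$ " shell prompt), the function names, and the menu keystrokes are all assumptions, and the real application's prompts will differ.

```python
def login(child, user, password, prompt=r"\$ ", timeout=10):
    """Drive a telnet login on an already-spawned pexpect-style child.

    The prompt patterns are assumptions; adjust them to the real host.
    """
    child.expect("login: ", timeout=timeout)
    child.sendline(user)
    child.expect("Password: ", timeout=timeout)
    child.sendline(password)
    child.expect(prompt, timeout=timeout)   # wait for the shell prompt
    return child


def send_keys(child, keys, marker, timeout=10):
    """Send raw keystrokes, wait for marker, return the output before it."""
    child.send(keys)
    child.expect(marker, timeout=timeout)
    return child.before


def open_session(host, user, password):
    """Spawn telnet and log in. Needs pexpect and a telnet client."""
    # Imported here so the helpers above can be tried without pexpect.
    import pexpect
    child = pexpect.spawn("telnet " + host)
    return login(child, user, password)
```

The output captured by send_keys is still the raw stream, so it would be fed to the terminal emulator before reading anything from fixed screen positions.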

The Last Mile

Once I had working code that could reliably do simple queries, like find the last known status of a shipment, I worked on web-enabling my screen scraper. I wrote a very simple HTTP server (a few lines of code in Python) that serves only specific requests from the outside web server. On the web server I wrote code (in VBScript, alas) that sent a request to my Python-on-iMac server and waited for the response, then displayed the whole thing on a beautiful HTML page.
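A server of that kind might look roughly like the sketch below, written against the standard library's http.server module in current Python. The /status path, the id parameter, the plain-text response, and the lookup_status placeholder are all hypothetical; the point is only that one request shape is answered and everything else is refused.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs


def lookup_status(shipment_id):
    """Placeholder for the scraper call; returns canned text in this sketch."""
    return shipment_id + " STATUS UNKNOWN"


def route(path):
    """Map a request path to (HTTP status code, body text)."""
    url = urlparse(path)
    ids = parse_qs(url.query).get("id", [])
    # Serve exactly one kind of request; refuse everything else.
    if url.path != "/status" or len(ids) != 1 or not ids[0].isalnum():
        return 404, "not found\n"
    return 200, lookup_status(ids[0]) + "\n"


class TrackingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, text = route(self.path)
        body = text.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real server would log requests


def make_server(host="127.0.0.1", port=8000):
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), TrackingHandler)
```

Because HTTPServer handles one request at a time, requests reach the scraper in sequence, which suits a single terminal session. Restricting requests to the web server's address could be added by checking self.client_address in the handler.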

I had thought getting the screen scraper to work reliably would take the most time, but thanks to the well-designed and tested pexpect module I got that part working without any big problems. The real trouble came from connecting the asynchronous web application to the synchronous terminal application, which created timing and synchronization problems. Once the web server sent a request, my Python scraper went through the whole process, even if the web user clicked off the page or submitted another request. The freight tracking pages did not get a lot of traffic, but the scraper was slow enough that the whole thing would break down under even a light load.

Python makes experimenting and debugging a breeze, and because I could run my scraper from the command line, or from IDLE (the Python IDE), finding the slow parts didn't take long. Logging in to the SCO box over telnet could take a few seconds, and sometimes the SCO box would just drop the connection during the telnet login. So my first optimization was to keep my scraper logged in to the SCO box, sitting at a command prompt. Instead of logging in for every web request, the scraper could just make sure it still had a command prompt.

I changed my scraper to use the timeout features of pexpect, so it could fail gracefully if the freight broker program running on the SCO box locked up or quit or spit out something unexpected. With shorter timeouts and keeping the scraper logged in, the performance problems went away and the web application could reliably talk to the legacy program.
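The two changes fit together: keep one session open, check for a prompt with a short timeout before each request, and reconnect only when that check fails. The sketch below shows the shape of that logic under stated assumptions. The ScraperSession class, the connect callable, and the "$ " prompt are hypothetical names for illustration; with pexpect, the errors tuple would be (pexpect.TIMEOUT, pexpect.EOF).

```python
class ScraperSession:
    """Keep one logged-in session and reuse it across requests."""

    def __init__(self, connect, prompt=r"\$ ", timeout=5, errors=(Exception,)):
        self.connect = connect    # callable returning a logged-in child
        self.prompt = prompt      # assumed shell prompt pattern
        self.timeout = timeout    # seconds to wait for the prompt
        self.errors = errors      # with pexpect: (pexpect.TIMEOUT, pexpect.EOF)
        self.child = None

    def ensure_prompt(self):
        """Return a child sitting at a prompt, reconnecting if needed."""
        if self.child is not None:
            try:
                # Send an empty line and expect the prompt to come back.
                self.child.sendline("")
                self.child.expect(self.prompt, timeout=self.timeout)
                return self.child
            except self.errors:
                self._drop()      # hung or dropped: discard and start over
        self.child = self.connect()
        return self.child

    def _drop(self):
        try:
            self.child.close()
        except self.errors:
            pass                  # already gone; nothing more to clean up
        self.child = None
```

Each web request would call ensure_prompt() first, so the slow telnet login is paid only when the connection has actually been lost.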

Why Prolong The Inevitable?

Eventually the company I did this work for will have to replace their terminal-based application and their creaky SCO server. But my solution cost only a fraction of an upgrade or replacement, and it gave their web site some important functionality. Keeping the old program alive for another year or two and enabling more pieces of it on the web site makes business and economic sense. My client got some breathing room to evaluate replacements for their core business system, without falling behind their competition. The vendor pushing them to upgrade lost some clout and credibility; their new version still doesn't work over the web, but it does offer an ODBC interface.

I've used Python quite a bit since I discovered it back in 1998. But this was the most complicated and important program I've written in Python (so far). At first I liked the simple and elegant syntax, but as I used Python for real work, I came to appreciate the core libraries and the selection of free high-quality modules such as pexpect. When arguing about programming languages, programmers should care a little less about things like continuations and metaclasses, and pay more attention to the scope and quality of the libraries and working code.

Resources

pexpect, expect-like module for Python: pexpect.sourceforge.net

expect, a tool for automating interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, etc.: expect.nist.gov

The Python language site: www.python.org

My web site: www.pdxperts.com

About the Blogger

Greg Jorgensen started programming in 1974. Tainted by BASIC, Fortran, COBOL, and assembler at an impressionable age, he nevertheless managed to make a career out of programming and went on to work at Nike, Apple, and many other companies. Today Greg runs a small consulting business in Portland, Oregon. Besides making a living writing code, he teaches kids how to program and enjoys motorcycle riding.

This weblog entry is Copyright © 2004 Greg Jorgensen. All rights reserved.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved.