The Artima Developer Community
Weblogs Forum
Screen Scraping With Python

5 replies on 1 page. Most recent reply: May 14, 2010 6:01 AM by Bruce Wade

Greg Jorgensen

Posts: 65
Nickname: gregjor
Registered: Feb, 2004

Screen Scraping With Python
Posted: Aug 24, 2004 1:49 AM
Summary
Web-enabling an old terminal-oriented application turns into more fun than expected. A blow-by-blow account of writing a screen scraper with Python and pexpect.

I recently finished a project for a local freight broker. They run their business on an old SCO Unix-based "green screen" terminal application. They wanted to enable some functionality on their web site, so customers could track their shipments, and carriers could update their location and status.

Old Code On Life Support

By the time the project got to me the client had almost given up on finding a solution. The company responsible for the terminal-based application pushed an expensive upgrade instead of offering any help. Reliably reading the proprietary and undocumented database didn't look easy, and safely writing to it appeared impossible.

I've run into this situation quite a few times. Companies often depend on old software, but they can't get basic support, much less enhancements. The company gets stuck with a hardware and software infrastructure they don't dare change. Replacing the software they depend on usually means expensive hardware and software upgrades, upsetting business processes and routines, and re-training staff. And like any major software project, an upgrade or conversion can fail or stall for unexpected reasons. Relying on old software puts a company at risk, but replacing that software can seem even riskier.

Imitating A Terminal Session

My plan involved setting up an intermediate server that would talk to the SCO box over telnet (the only network protocol I could use), and to the web server over HTTP. Rather than spend money on new hardware with so many unknowns, I installed an old all-in-one iMac running Yellow Dog Linux and the latest version of Python. I confirmed I could telnet to the SCO box and run the freight application, and that I could talk to the iMac through the firewall.

When you can't read the data files, and the program has no usable API, you have to resort to the technique known as screen scraping. You write code that simulates a human user, sending keystrokes and capturing the output. Generally you need code that simulates an output device, in my case a VT100 terminal.

It's tempting to think you can skip the terminal emulation and simply use regular expression matching on the output stream, but if it was that easy everyone would do it. If you're old enough to have written or worked with terminal-based applications you know that characters are not necessarily sent to the screen in the order you expect. Those programs were written to work over low-bandwidth connections, and they often optimize the output stream, inject control characters in the middle of words, and even skip sending characters that didn't change on the terminal's display. If you watch the raw output coming from the program you'll see a jumble that only makes sense when projected on a terminal screen.
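
A minimal sketch of why the raw stream misleads, assuming a made-up status screen and handling only one VT100 control sequence (cursor positioning, `ESC [ row ; col H`). The `render` function and the sample stream are hypothetical illustrations, not code from this project or from pexpect:

```python
import re

# Matches either a cursor-position sequence (ESC [ row ; col H) or one character.
TOKEN = re.compile(r"\x1b\[(\d+);(\d+)H|(.)", re.DOTALL)

def render(stream, rows=24, cols=80):
    """Project a raw output stream onto a rows x cols character grid."""
    grid = [[" "] * cols for _ in range(rows)]
    row = col = 0
    for m in TOKEN.finditer(stream):
        if m.group(1) is not None:
            # VT100 coordinates are 1-based; the grid is 0-based.
            row, col = int(m.group(1)) - 1, int(m.group(2)) - 1
        else:
            if 0 <= row < rows and 0 <= col < cols:
                grid[row][col] = m.group(3)
            col += 1
    return ["".join(line).rstrip() for line in grid]

# Hypothetical stream: the program paints the value first, then the label.
raw = "\x1b[2;9HIN TRANSIT\x1b[2;1HSTATUS:"
screen = render(raw, rows=3, cols=20)
print(screen[1])  # STATUS: IN TRANSIT
```

Read in arrival order, the visible characters are "IN TRANSITSTATUS:", so a pattern such as "STATUS: (.*)" would not match. Only after the characters are placed on the grid does the line read the way a person sees it. A real emulator must also handle the rest of the control codes, which this sketch ignores.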

I originally planned to use expect, but it works best for command-line applications, not programs that paint a full screen with control codes. You can extend expect with Tk and you can find expect-Tk/Tcl screen scrapers, but I don't know Tk/Tcl, and learning a new language was not part of my plan for this project.

After some poking around, I found Noah Spurrier's pexpect module for Python and decided to give that a try. In the pexpect download I found an undocumented but seemingly-complete finite state machine VT100 emulator. After a little tinkering I could connect to the SCO box, log in, start up the freight broker program, navigate the menus, and capture output. I just had to figure out the specific keystrokes and screen positions and send those through pexpect and the VT100 emulator.
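
Once the emulator has produced a screen, reading a value is a matter of slicing at known positions. The sketch below assumes the screen is available as a list of text lines; the field names, coordinates, and sample screen are invented for illustration and do not describe the actual freight application:

```python
from typing import NamedTuple

class Field(NamedTuple):
    row: int    # 1-based screen row
    col: int    # 1-based starting column
    width: int  # number of characters to read

# Hypothetical layout, worked out by hand from screen shots.
LAYOUT = {
    "shipment": Field(row=2, col=12, width=8),
    "status": Field(row=3, col=12, width=12),
    "location": Field(row=4, col=12, width=20),
}

def read_fields(screen_lines, layout):
    """Slice each named field out of the rendered screen."""
    result = {}
    for name, f in layout.items():
        line = screen_lines[f.row - 1] if f.row - 1 < len(screen_lines) else ""
        result[name] = line[f.col - 1 : f.col - 1 + f.width].strip()
    return result

screen = [
    "FREIGHT TRACKING",
    "SHIPMENT:  A1234567",
    "STATUS:    IN TRANSIT",
    "LOCATION:  PORTLAND OR",
]
fields = read_fields(screen, LAYOUT)
print(fields["status"])  # IN TRANSIT
```

Keeping the coordinates in one table means a change in the screen layout touches the table, not the code that uses the values.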

The Last Mile

Once I had working code that could reliably do simple queries, like find the last known status of a shipment, I worked on web-enabling my screen scraper. I wrote a very simple HTTP server (a few lines of code in Python) that serves only specific requests from the outside web server. On the web server I wrote code (in VBScript, alas) that sent a request to my Python-on-iMac server and waited for the response, then displayed the whole thing on a beautiful HTML page.
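
A small HTTP server of this kind might look like the sketch below, written for current Python 3 (the 2004 original would have used the older BaseHTTPServer module). The "/status" path, the "shipment" parameter, the ID format, and the `lookup_status` placeholder are all assumptions made for illustration:

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

SHIPMENT_ID = re.compile(r"[A-Za-z0-9]{1,12}")  # hypothetical ID format

def parse_request(path):
    """Return the shipment ID if this is a request we serve, else None."""
    url = urlparse(path)
    if url.path != "/status":
        return None
    ids = parse_qs(url.query).get("shipment", [])
    if len(ids) != 1 or not SHIPMENT_ID.fullmatch(ids[0]):
        return None
    return ids[0]

def lookup_status(shipment_id):
    # Placeholder: the real version would drive the terminal session.
    return "UNKNOWN"

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        shipment_id = parse_request(self.path)
        if shipment_id is None:
            self.send_error(404)  # anything off the whitelist is refused
            return
        body = lookup_status(shipment_id).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Run the server until interrupted."""
    HTTPServer(("", port), StatusHandler).serve_forever()
```

Validating the request in a separate function keeps the whitelist easy to test without opening a socket. A stricter version could also check the client address so that only the public web server is answered.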

I had thought getting the screen scraper to work reliably would take the most time, but thanks to the well-designed and tested pexpect module I got that part working without any big problems. It turned out that connecting the asynchronous web application to the synchronous terminal application created timing problems and synchronization issues. Once the web server sent a request, my Python scraper went through the whole process, even if the web user clicked off the page or submitted another request. The freight tracking pages did not get a lot of traffic, but the scraper was slow enough that the whole thing would break down under even a light load.

Python makes experimenting and debugging a breeze, and because I could run my scraper from the command line, or from IDLE (the Python IDE), finding the slow parts didn't take long. Logging in to the SCO box over telnet could take a few seconds, and sometimes the SCO box would just drop the connection during the telnet login. So my first optimization was to keep my scraper logged in to the SCO box, sitting at a command prompt. Instead of logging in for every web request, the scraper could just make sure it still had a command prompt.
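
The keep-logged-in idea can be sketched as a small wrapper that hands out the existing session when it is still at a prompt and logs in again only when it is not. `FakeConnection` and `SessionKeeper` are hypothetical names; the fake stands in for a real telnet session so the example runs on its own:

```python
class FakeConnection:
    """Stand-in for a logged-in telnet session (illustration only)."""

    def __init__(self):
        self.alive = True

    def at_prompt(self):
        # A real check would send a newline and wait for the shell prompt.
        return self.alive

class SessionKeeper:
    """Reuse one session across requests; log in again only when needed."""

    def __init__(self, login):
        self._login = login  # callable that returns a fresh connection
        self._conn = None
        self.login_count = 0

    def connection(self):
        if self._conn is None or not self._conn.at_prompt():
            self._conn = self._login()
            self.login_count += 1
        return self._conn

keeper = SessionKeeper(login=FakeConnection)
first = keeper.connection()
second = keeper.connection()  # reuses the existing session
first.alive = False           # simulate the SCO box dropping the connection
third = keeper.connection()   # logs in again
print(keeper.login_count)     # 2
```

Three requests cost two logins here, and in the normal case every request after the first costs none.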

I changed my scraper to use the timeout features of pexpect, so it could fail gracefully if the freight broker program running on the SCO box locked up or quit or spit out something unexpected. With shorter timeouts and keeping the scraper logged in, the performance problems went away and the web application could reliably talk to the legacy program.
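
In pexpect, `expect()` accepts a `timeout` argument and raises `pexpect.TIMEOUT` when the pattern does not appear in time. The sketch below shows the same idea using only the standard library so that it runs without a remote host. `read_until`, `safe_query`, and the canned output are hypothetical and are not pexpect's API:

```python
import time

class ScreenTimeout(Exception):
    """Raised when the expected text does not appear in time."""

def read_until(read_chunk, marker, timeout_s):
    """Collect output until marker appears or timeout_s seconds pass."""
    deadline = time.monotonic() + timeout_s
    buffer = ""
    while marker not in buffer:
        if time.monotonic() >= deadline:
            raise ScreenTimeout(f"no {marker!r} within {timeout_s}s")
        chunk = read_chunk()  # returns "" when nothing is available yet
        if chunk:
            buffer += chunk
        else:
            time.sleep(0.01)
    return buffer

def safe_query(read_chunk, marker, timeout_s=2.0):
    """Return a result the web layer can report instead of hanging."""
    try:
        return {"ok": True, "output": read_until(read_chunk, marker, timeout_s)}
    except ScreenTimeout as exc:
        return {"ok": False, "error": str(exc)}

# Canned output standing in for the terminal session.
chunks = iter(["STATUS: IN ", "TRANSIT\r\n", "$ "])
good = safe_query(lambda: next(chunks, ""), marker="$ ")
hung = safe_query(lambda: "", marker="$ ", timeout_s=0.05)
print(good["ok"], hung["ok"])  # True False
```

A locked-up program now costs the web request at most the timeout, and the caller gets an error it can show to the user.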

Why Prolong The Inevitable?

Eventually the company I did this work for will have to replace their terminal-based application and their creaky SCO server. But my solution cost only a fraction of what an upgrade or replacement would, and it gave their web site some important functionality. Keeping the old program alive for another year or two and enabling more pieces of it on the web site makes business and economic sense. My client got some breathing room to evaluate replacements for their core business system, without falling behind their competition. The vendor pushing them to upgrade lost some clout and credibility; their new version still doesn't work over the web, but it does offer an ODBC interface.

I've used Python quite a bit since I discovered it back in 1998. But this was the most complicated and important program I've written in Python (so far). At first I liked the simple and elegant syntax, but as I used Python for real work, I came to appreciate the core libraries and the selection of free high-quality modules such as pexpect. When arguing about programming languages, programmers should care a little less about things like continuations and metaclasses, and pay more attention to the scope and quality of the libraries and working code.

Resources

pexpect, expect-like module for Python: pexpect.sourceforge.net

expect, a tool for automating interactive applications such as telnet, ftp, passwd, fsck, rlogin, tip, etc.: expect.nist.gov

The Python language site: www.python.org

My web site: www.pdxperts.com


Jason Yip

Posts: 31
Nickname: jchyip
Registered: Mar, 2003

Re: Screen Scraping With Python Posted: Aug 24, 2004 6:13 AM
Out of curiosity, how long did you spend on this?

Cool stuff in any case.

Matt Gerrans

Posts: 1153
Nickname: matt
Registered: Feb, 2002

Re: Screen Scraping With Python Posted: Aug 24, 2004 10:48 AM
Congratulations -- this is the kind of problem solving that can be much more valuable (and satisfying) than simply programming.

Greg Jorgensen

Posts: 65
Nickname: gregjor
Registered: Feb, 2004

Re: Screen Scraping With Python Posted: Aug 24, 2004 8:00 PM
> Out of curiosity, how long did you spend on this?

The project stretched out over several months, mostly due to the client deciding what he wanted, and working around lots of vacations. I spent several hours going through the terminal-based program with one of the people who knew it, taking screen shots and working out the exact key sequences to send and the positions of things on the screen. The Python code came together in less than two days, even accounting for me tossing the first version and rewriting it. The ASP code on the public web site took a couple of days, mainly because VBScript is such a poor language. So overall I spent about a week on the core stuff, and a couple of weeks going back and forth with the client and the network admin.

Greg Jorgensen

Rudi Seiberlich

Posts: 1
Nickname: rudis
Registered: Jul, 2005

Re: Screen Scraping With Python Posted: Jul 23, 2005 1:54 AM
Greg,
I am currently looking for a solution for my company. Are you willing, or do you know anybody who is willing, to do this for us?
Here are our requirements:
We would like to have some kind of "Meta Search" on our website for the American public: www.sky-tours.com

1. Results from www.cheap-flight-tickets.com (Travelstoremaker engine),
2. Results from www.skytours.net (Patheo engine)
3. Results from www.sky-tours.com/index_fo.html (Fareoffice engine)
4. Results from www.no-frill-flights.com (low cost carrier web scraping site)

(The above may at the final stage get different domain names, like Airbookers.com, Flightdiscount.com, Budgetfares.com, Bayoo.com.)

The client should see on the welcome page :

from:
to
date out:
date back:
# of adults (over 12)
# of children (2-11)
# of infants (0-1)

The scraping of the above sites then starts, and the five cheapest offers (identical offers are suppressed) are shown on a new page, with the possibility to click on the chosen offer, which brings the client to the booking page.

A site that works like this is www.kelkoo.co.uk (select flights); you will see the site scraping different companies.


In the future we would also like to sign up other sites (competitors) and, once an agreement is reached, show them in our comparison engine as well.

Is the above enough for you to give us an offer?

brgds

Rudi Seiberlich

Bruce Wade

Posts: 1
Nickname: stealth
Registered: May, 2010

Re: Screen Scraping With Python Posted: May 14, 2010 6:01 AM
Nice post. I love the fact that more and more people are writing screen scrapers in Python. When I first started I was using PHP, and I learned from the book Webbots, Spiders, and Screen Scrapers:

http://www.wadecybertech.com/2010/05/13/review-webbots-spiders-and-screen-scrapers/

My new blog is completely dedicated to scraping and web automation.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use