The Artima Developer Community
Python Buzz Forum
Smalltalk Scripting

Ng Pheng Siong

Posts: 410
Nickname: ngps
Registered: Apr, 2004

Ng Pheng Siong is just another guy with a website.
Smalltalk Scripting Posted: Sep 2, 2004 10:54 AM

This post originated from an RSS feed registered with Python Buzz by Ng Pheng Siong.
Original Post: Smalltalk Scripting
Feed Title: (render-blog Ng Pheng Siong)
Feed URL: http://sandbox.rulemaker.net/ngps/rdf10_xml
Feed Description: Just another this here thing blog.

With fine timing, Chris Petrilli posts a pointer to freely available online Smalltalk books. A perfect opening for some Smalltalk web scraping.

Mission: Download the PDF files for the book Smalltalk By Example.

Sketch of approach: 1. Fetch the page. 2. Parse the HTML. 3. For each link that is a downloadable file, download it.

Tools: Smalltalk/X system browser, workspace, transcript, inspector and a Unix shell.

1. Using the shell and system browser, look for classes about HTTP. Ah, found HTTPInterface.

2. Click about to figure out how to use the class. In workspace, try:

httpResp := HTTPInterface 
            get: '/~ducasse/FreeBooks/ByExample/'
            fromHost: 'www.iam.unibe.ch'

3. Inspect httpResp and see that it is an HTTPResponse instance; use the system browser to navigate to its implementation, look-look see-see.

4. Find classes about HTML parsing: Found HTMLParser. Figure out how to use the class.

5. In workspace,

parsed := HTMLParser new parseText: httpResp data

6. Inspect parsed and see that it is an HTMLDocument; use the system browser to navigate to its implementation, look-look see-see.

7. Ok, the message (or is it called a selector?) 'anchorElements' returns an OrderedCollection of, duh, anchor elements in said HTML document.

8. I want those links with one of these suffixes: pdf, zip or gif. In workspace,

parsed anchorElements do:
  [:each | each hrefString ifNotNil:
    [ |f| f := each hrefString asFilename.
          (((f hasSuffix: 'pdf') or:
            [f hasSuffix: 'zip']) or:
            [f hasSuffix: 'gif']) ifTrue:
                                  [Transcript showCR: f asString]]]

The statement (expression? command?) "Transcript ..." prints each 'f' in the transcript window. Its output looks like this:

CodeExamples.zip
SmalltalkByExampleNewRelease.pdf
SmalltalkbyExMissingChapter27.pdf
byExample.gif
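
If the matching links are needed again later (for the download step, say), the same filter can answer a collection instead of printing as it goes. This is a minimal sketch, not something from the original session: the variable name wanted is made up, and it assumes anchorElements answers an ordinary collection that understands collect: and select:.

```smalltalk
"Sketch only: 'wanted' is a hypothetical name. Uses the same
 hrefString / asFilename / hasSuffix: messages as step 8."
| wanted |
wanted := (parsed anchorElements
             collect: [:each | each hrefString])
             select: [:href |
               href notNil and:
                 [ |f| f := href asFilename.
                   ((f hasSuffix: 'pdf') or:
                     [f hasSuffix: 'zip']) or:
                     [f hasSuffix: 'gif']]].

"Print what was kept, one link per line."
wanted do: [:href | Transcript showCR: href]
```

Keeping the links in a collection separates "which files" from "what to do with them", so the printing loop can later be swapped for something else without touching the filter.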

From here it is just a little more work to apply class HTTPInterface to 'each' (which contains the URL) to save the downloaded content into a file named by the Filename instance 'f'.
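
That remaining step might look roughly like the sketch below. It is untested and rests on several assumptions: the host 'localhost' and the path '/books/' are placeholders for your own server, the links are taken to be relative to the page's directory (as the transcript output suggests), the response is assumed to answer its body to data as in step 5, and the Filename is assumed to understand writeStream. If data turns out to be binary, the stream may need to be switched to binary mode first.

```smalltalk
"Sketch only. host and basePath are placeholders for your own server,
 not values from the original post."
| host basePath |
host := 'localhost'.
basePath := '/books/'.
parsed anchorElements do:
  [:each | each hrefString ifNotNil:
    [ |f resp out|
      f := each hrefString asFilename.
      (((f hasSuffix: 'pdf') or:
        [f hasSuffix: 'zip']) or:
        [f hasSuffix: 'gif']) ifTrue:
          ["Fetch the linked file, assuming a relative link."
           resp := HTTPInterface
                     get: basePath , each hrefString
                     fromHost: host.
           "Save the body under the link's own file name."
           out := f writeStream.
           out nextPutAll: resp data.
           out close]]]
```

There is no error handling here; a failed request or an unwritable file would need checking before this is used for anything real.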

All this just by using 'Find Class' and 'Find Method' in the system browser. Nice! I'm looking forward to discovering what the rest of the IDE does.

Conclusion: Don't know how to indent. Don't know the terminology. My fingers want to type a closing parenthesis at the end of each line. ;-) Notwithstanding all that, Smalltalk rocks!

BTW, if you want to try out the above, remember to practise on your own server and not on Ducasse's site.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use