This post originated from an RSS feed registered with Python Buzz
by Ng Pheng Siong.
Original Post: Smalltalk Scripting
Feed Title: (render-blog Ng Pheng Siong)
Feed URL: http://sandbox.rulemaker.net/ngps/rdf10_xml
Feed Description: Just another this here thing blog.
With fine timing, Chris Petrilli posts a pointer to freely available online Smalltalk books. A perfect opening for some Smalltalk web scraping.
Mission: Download the PDF files for the book Smalltalk By Example.
Sketch of approach: 1. Fetch the page. 2. Parse the HTML. 3. For each
link that is a downloadable file, download it.
Tools: Smalltalk/X system browser, workspace, transcript, inspector and a Unix shell.
1. Using the shell and system browser, look for classes about HTTP. Ah, found HTTPInterface.
2. Click about to figure out how to use the class. In workspace, try:
    httpResp := HTTPInterface
        get: '/~ducasse/FreeBooks/ByExample/'
        fromHost: 'www.iam.unibe.ch'
3. Inspect httpResp, see that it is an HTTPResponse instance; use
system browser to navigate to its implementation, look-look see-see.
4. Find classes about HTML parsing: Found HTMLParser. Figure out how to use the class.
5. In workspace,
    parsed := HTMLParser new parseText: httpResp data
6. Inspect parsed, see that it is an HTMLDocument; use system browser
to navigate to its implementation, look-look see-see.
7. Ok, the message (or is it called a selector?) 'anchorElements'
returns an OrderedCollection of, duh, anchor elements in said HTML
document.
8. I want those links with one of these suffixes: pdf, zip or
gif. In workspace,
    parsed anchorElements do:
        [:each | each hrefString ifNotNil:
            [ |f| f := each hrefString asFilename.
            (((f hasSuffix: 'pdf') or:
                [f hasSuffix: 'zip']) or:
                [f hasSuffix: 'gif']) ifTrue:
                    [Transcript showCR: f asString]]]
The statement (expression? command?) "Transcript ..." prints each 'f'
in the transcript window. Its output looks like this:
    CodeExamples.zip
    SmalltalkByExampleNewRelease.pdf
    SmalltalkbyExMissingChapter27.pdf
    byExample.gif
From here it is just a little more work to apply class HTTPInterface to 'each' (which contains the URL) to save the downloaded content into a file named by the Filename instance 'f'.
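A minimal sketch of that last step, not taken from the original session. It reuses only the calls shown above (HTTPInterface get:fromHost:, data, hrefString, asFilename, hasSuffix:) plus the standard Filename writeStream / nextPutAll: / close stream protocol. The names host and basePath are placeholders introduced here for illustration. It assumes each href is a bare file name relative to basePath, as in the transcript output, and that whatever 'data' answers can be written to a file stream as-is; for binary files the stream may first need to be switched to binary mode.

```smalltalk
"Sketch only. host and basePath are hypothetical placeholders;
 point them at a server you control."
| host basePath parsed |
host := 'localhost'.
basePath := '/books/'.
parsed := HTMLParser new parseText:
    (HTTPInterface get: basePath fromHost: host) data.
parsed anchorElements do:
    [:each | each hrefString ifNotNil:
        [ |f resp out|
        f := each hrefString asFilename.
        (((f hasSuffix: 'pdf') or:
            [f hasSuffix: 'zip']) or:
            [f hasSuffix: 'gif']) ifTrue:
                ["Fetch the linked file with the same call used for the index page.
                  Assumption: the href is relative to basePath."
                resp := HTTPInterface
                    get: basePath , each hrefString
                    fromHost: host.
                "Save the body into a local file named after the link;
                 ensure: closes the stream even if the write fails."
                out := f writeStream.
                [out nextPutAll: resp data] ensure: [out close].
                Transcript showCR: 'saved ' , f asString]]]
```

Evaluated in a workspace, this should leave one file per matching link in the current directory and log each name to the transcript. Absolute or parent-relative hrefs would need extra handling that is not shown.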
All this just by using 'Find Class' and 'Find Method' in the system
browser. Nice! I'm looking forward to discovering what the rest of the IDE does.
Conclusion: Don't know how to indent. Don't know the
terminology. My fingers want to type a closing parenthesis at the end
of each line. ;-) Notwithstanding all that, Smalltalk rocks!
BTW, if you want to try out the above, remember to practise on your
own server and not on Ducasse's site.