Alan Knowles
Posts: 390
Nickname: alank
Registered: Sep, 2004
Alan Knowles is a freelance developer who works on PHP extensions and PEAR.
Making simple things easy, and difficult things possible. Yet another HTML parser.
Posted: Nov 1, 2004 7:41 AM
When John released his bindings to HTML Tidy, I joked with him that it would have been far more interesting (as a project) to write a proper HTML lexer rather than bind to an existing library. (Mainly because, having written one in PHP, I didn't think it would be that difficult, and I have a strange idea of fun...)
Well, over the weekend I was re-pondering this, partly because I had used the Flexy parser to try to parse HTML from a web site and found that the tokenizer in Flexy was getting slower with age (5 seconds on average to parse a page). This is not normally a huge issue, as the parsed result is cached during the compile phase of the template engine. It is a huge issue if you are pulling pages down, parsing out the forms, and reposting the forms in a web test script.
So, after a little Google search-and-discover trip, I ran across a little W3C project, "A Lexical Analyzer for HTML and SGML". It looked interesting, but it wasn't until I pulled the code down, untarred it and built it that I realized it could be used to write a really fast and simple HTML tokenizer. (Not only that, it could easily form the basis of a C-based backend for Flexy.)
Creating an extension that used the code (not as a library, just the C code pulled into a PHP extension) and parsed a string of HTML took about 30 minutes. It took an extra 3 hours, on and off over a few days, to make it return an array of tokens (with attributes sorted into a sensible structure).
So now I have a cute extension that has 1 function and 1 result, KISS at its best.
    <?php
    print_r(flexyparser_tokenize(file_get_contents("..some file...")));

Outputs:

    [0] => Array
        (
            [0] => 14    // token type (look up the source)
            [1] =>       // data (tag name or string)
            [2] => 1     // line number
            [3] => 0     // character position
        )
    [1] => Array
        (
            [0] => 1
            [1] =>

            [2] => 2
            [3] => 50
        )
    [2] => Array
        (
            [0] => 2
            [1] => HTML
            [2] => 2
            [3] => 51
        )
    [3] => Array
        (
            [0] => 2
            [1] => HEAD
            [2] => 2
            [3] => 57
        )
    ..... ......
    [15] => Array
        (
            [0] => 2
            [1] => A
            [2] => 6
            [3] => 212
            [4] => Array    // array of attributes
                (
                    [HREF] => "/pub/WWW/Consortium/"
                )
        )
    [16] => Array
        (
            [0] => 2
            [1] => IMG
            [2] => 7
            [3] => 243
            [4] => Array
                (
                    [align] => bottom
                    [src] => "/pub/WWW/Icons/WWW/w3c_48x48"
                )
        )
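As a rough illustration of how this token array might be consumed in something like a web test script, the sketch below walks a hard-coded copy of two tokens from the output above and collects the values of one attribute for one tag. It rests on a few assumptions: that type 2 marks a start tag (inferred from the sample, not checked against the source), that attribute names can come back in either case, and that values keep their surrounding quotes. `collect_attribute()` is a hypothetical userland helper, not part of the extension, and it uses PHP 7+ syntax.

```php
<?php
// Hypothetical helper: NOT part of the flexyparser extension.
// ASSUMPTION: token type 2 is a start tag (inferred from the sample output).
const ASSUMED_START_TAG = 2;

function collect_attribute(array $tokens, string $tag, string $attr): array
{
    $found = [];
    foreach ($tokens as $token) {
        // Layout per the output above:
        // [0] type, [1] tag name or string, [2] line, [3] position, [4] attributes (optional)
        if ((int) $token[0] !== ASSUMED_START_TAG) {
            continue;
        }
        if (strcasecmp((string) $token[1], $tag) !== 0) {
            continue;
        }
        foreach ($token[4] ?? [] as $name => $value) {
            if (strcasecmp((string) $name, $attr) === 0) {
                // The sample shows values with their quotes intact, so strip them.
                $found[] = trim((string) $value, "\"'");
            }
        }
    }
    return $found;
}

// Two tokens copied from the output above, standing in for a live
// flexyparser_tokenize() call so the sketch runs without the extension.
$tokens = [
    [2, 'A', 6, 212, ['HREF' => '"/pub/WWW/Consortium/"']],
    [2, 'IMG', 7, 243, ['align' => 'bottom', 'src' => '"/pub/WWW/Icons/WWW/w3c_48x48"']],
];

print_r(collect_attribute($tokens, 'a', 'href'));
```

Matching tag and attribute names case-insensitively is a deliberate choice here, since the sample output shows both `HREF` and `src`.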
The code is on my SVN server, under akpear/flexyparser, and works perfectly with PHP5 and PHP4 at the moment.
I really want to do a tree version of this that loads the data into a user-defined object, e.g.:

    <?php
    $tree = flexyparser_toTree($data, new MyClass);

so it can be used 'how you want it...'
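To give a feel for what such a tree version might do, here is a userland sketch that folds a token array into nested copies of a user-supplied object. Everything in it is an assumption: `tokens_to_tree()` and `MyNode` are hypothetical names, type 2 as a start tag is inferred from the sample output, type 3 as an end tag is a made-up placeholder (look up the real value in the source), and the sample tokens are invented for illustration. It is not the planned `flexyparser_toTree()`, which would live in C, and it uses PHP 7.2+ syntax.

```php
<?php
// Userland sketch only; flexyparser_toTree() is a planned function, not an existing one.
// ASSUMPTIONS: 2 = start tag (inferred from the sample output),
//              3 = end tag (placeholder value, check the extension source).
const ASSUMED_START_TAG = 2;
const ASSUMED_END_TAG = 3;
// Tags that never take an end tag, so they must not become a parent.
const VOID_TAGS = ['IMG', 'BR', 'HR', 'INPUT', 'META', 'LINK'];

// Hypothetical user-defined node class, standing in for MyClass.
class MyNode
{
    public $name = '';
    public $attributes = [];
    public $children = [];
}

function tokens_to_tree(array $tokens, object $prototype): object
{
    $root = clone $prototype;
    $root->name = '#root';
    $stack = [$root]; // open elements, innermost last
    foreach ($tokens as $token) {
        $type = (int) $token[0];
        if ($type === ASSUMED_START_TAG) {
            $node = clone $prototype;
            $node->name = $token[1];
            $node->attributes = $token[4] ?? [];
            $parent = $stack[count($stack) - 1];
            $parent->children[] = $node;
            if (!in_array(strtoupper($node->name), VOID_TAGS, true)) {
                $stack[] = $node;
            }
        } elseif ($type === ASSUMED_END_TAG && count($stack) > 1) {
            // Naive: assumes end tags are properly nested.
            array_pop($stack);
        }
    }
    return $root;
}

// Invented sample tokens (positions are illustrative, not real output).
$tokens = [
    [2, 'HTML', 1, 0],
    [2, 'HEAD', 1, 6],
    [3, 'HEAD', 1, 12],
    [2, 'A', 2, 0, ['HREF' => '"/pub/WWW/Consortium/"']],
    [2, 'IMG', 2, 30, ['src' => '"/pub/WWW/Icons/WWW/w3c_48x48"']],
    [3, 'A', 2, 70],
    [3, 'HTML', 3, 0],
];

$tree = tokens_to_tree($tokens, new MyNode);
print_r($tree);
```

Cloning the object passed in is only one way to read "loads data into a user defined object"; a real implementation would also need to cope with unclosed and mis-nested tags, which this sketch ignores.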