This post originated from an RSS feed registered with Python Buzz
by Thomas Guest.
Original Post: A Python syntax highlighter
Feed Title: Word Aligned: Category Python
Feed URL: http://feeds.feedburner.com/WordAlignedCategoryPython
Feed Description: Dynamic languages in general. Python in particular. The adventures of a space sensitive programmer.
In a recent post
I described my first ever Ruby program: a syntax
highlighter for Python, ready for use in a Typo web log.
Since that post was rather long, I decided to publish
the code itself separately. Here it is, then.
The Test Code
As you can see, currently only comments, single- and triple-quoted
strings, keywords and identifiers are recognised. That’s really all I
wanted, for now. For completeness, I may well add support for numeric
literals. Watch this space!
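As a starting point for that, here is one candidate pattern. This is my own sketch, not code from the plugin: the `NUMBER` name and the exact coverage (hex integers, floats with an optional exponent, plain integers, plus the l/L long and j/J imaginary suffixes) are assumptions. It would slot into the tokenizer's `step` method as another `elsif scan(...)` branch ahead of the catch-all `:normal` case.

```ruby
require 'strscan'

# Candidate regex for Python numeric literals (a sketch, not from the
# plugin): hex integers, floats with an optional exponent, and plain
# integers, each allowing the l/L (long) or j/J (imaginary) suffix.
NUMBER = /0[xX][0-9a-fA-F]+[lL]?|\d+\.\d*(?:[eE][+-]?\d+)?[jJ]?|\.\d+(?:[eE][+-]?\d+)?[jJ]?|\d+(?:[eE][+-]?\d+)?[lLjJ]?/

# Scan a numeric literal from the front of text, or return nil.
def scan_number(text)
  StringScanner.new(text).scan(NUMBER)
end

p scan_number("0x1F")     # => "0x1F"
p scan_number("3.14e-2")  # => "3.14e-2"
p scan_number("42L")      # => "42L"
p scan_number("foo")      # => nil
```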
typo/vendor/syntax/test/syntax/tc_python.rb
require File.dirname(__FILE__) + "/tokenizer_testcase"

class TC_Syntax_Python < TokenizerTestCase

  syntax "python"

  def test_empty
    tokenize ""
    assert_no_next_token
  end

  def test_comment_eol
    tokenize "# a comment\nfoo"
    assert_next_token :comment, "# a comment"
    assert_next_token :normal, "\n"
    assert_next_token :ident, "foo"
  end

  def test_two_comments
    tokenize "# first comment\n# second comment"
    assert_next_token :comment, "# first comment"
    assert_next_token :normal, "\n"
    assert_next_token :comment, "# second comment"
  end

  def test_string
    tokenize "'' 'aa' r'raw' u'unicode' UR''"
    assert_next_token :string, "''"
    skip_token
    assert_next_token :string, "'aa'"
    skip_token
    assert_next_token :string, "r'raw'"
    skip_token
    assert_next_token :string, "u'unicode'"
    skip_token
    assert_next_token :string, "UR''"
    tokenize '"aa\"bb"'
    assert_next_token :string, '"aa\"bb"'
  end

  def test_triple_quoted_string
    tokenize "'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "'''\nfoo\n'''"
    tokenize '"""\nfoo\n"""'
    assert_next_token :triple_quoted_string, '"""\nfoo\n"""'
    tokenize "uR'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "uR'''\nfoo\n'''"
    tokenize '"""\'a\'"b"c"""'
    assert_next_token :triple_quoted_string, '"""\'a\'"b"c"""'
  end

  def test_keyword
    Syntax::Python::KEYWORDS.each do |word|
      tokenize word
      assert_next_token :keyword, word
    end
    Syntax::Python::KEYWORDS.each do |word|
      tokenize "x#{word}"
      assert_next_token :ident, "x#{word}"
      tokenize "#{word}x"
      assert_next_token :ident, "#{word}x"
    end
  end
end
The Python Tokenizer
typo/vendor/syntax/python.rb
require 'syntax'

module Syntax

  # A basic tokenizer for the Python language. It recognises
  # comments, keywords and strings.
  class Python < Tokenizer

    # The list of all identifiers recognized as keywords.
    # http://docs.python.org/ref/keywords.html
    # Strictly speaking, "as" isn't yet a keyword -- but for syntax
    # highlighting, we'll treat it as such.
    KEYWORDS =
      %w{as and del for is raise assert elif from lambda return break
         else global not try class except if or while continue exec
         import pass yield def finally in print}

    # Step through a single iteration of the tokenization process.
    def step
      if scan(/#.*$/)
        start_group :comment, matched
      elsif scan(/u?r?'''.*?'''|""".*?"""/im)
        start_group :triple_quoted_string, matched
      elsif scan(/u?r?'([^\\']|\\.)*'/i)
        start_group :string, matched
      elsif scan(/u?r?"([^\\"]|\\.)*"/i)
        start_group :string, matched
      elsif check(/[_a-zA-Z]/)
        word = scan(/\w+/)
        if KEYWORDS.include?(word)
          start_group :keyword, word
        else
          start_group :ident, word
        end
      else
        start_group :normal, scan(/./m)
      end
    end

  end

  SYNTAX["python"] = Python

end
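If you want to try the `step` logic without installing the Syntax library, the same ordered checks can be driven in a plain loop over Ruby's standard StringScanner, which is what Syntax's Tokenizer wraps. The sketch below is my own, not part of the plugin: the keyword list is abbreviated, and unlike the real Tokenizer it emits adjacent `:normal` characters as separate tokens rather than coalescing them.

```ruby
require 'strscan'

KEYWORDS = %w{def return if else for in print}  # abbreviated for the demo

# A standalone mimic of the tokenizer above: repeatedly apply the same
# ordered regex checks and collect [group, text] pairs until the input
# is consumed.
def tokenize(source)
  s = StringScanner.new(source)
  tokens = []
  until s.eos?
    if s.scan(/#.*$/)
      tokens << [:comment, s.matched]
    elsif s.scan(/u?r?'''.*?'''|""".*?"""/im)
      tokens << [:triple_quoted_string, s.matched]
    elsif s.scan(/u?r?'([^\\']|\\.)*'/i)
      tokens << [:string, s.matched]
    elsif s.scan(/u?r?"([^\\"]|\\.)*"/i)
      tokens << [:string, s.matched]
    elsif s.check(/[_a-zA-Z]/)
      word = s.scan(/\w+/)
      tokens << [KEYWORDS.include?(word) ? :keyword : :ident, word]
    else
      tokens << [:normal, s.scan(/./m)]  # one character at a time
    end
  end
  tokens
end

p tokenize("def foo(): # comment")
```

The branch order matters: the triple-quote pattern must be tried before the single-quote patterns, or `'''` would be consumed as an empty string followed by a stray quote.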