Python Buzz Forum - HTMLifying user input

Simon Willison

Simon Willison is a web technology enthusiast studying for a Computer Science degree at Bath Uni, UK

Posted: Oct 18, 2003 8:01 PM

This post originated from an RSS feed registered with Python Buzz by Simon Willison.
Original Post: HTMLifying user input
Feed Title: Simon Willison: Python
Feed URL: http://simon.incutio.com/syndicate/python/rss1.0
Feed Description: Simon Willison's Python category

I've added a comment system to my new Kansas blog. Since the target audience for that site is friends and family rather than fellow web developers, I've taken a very different approach to processing the input from comments. While this blog insists upon valid XHTML and gives very little help to comment posters aside from highlighting validation problems, my new site's comment system takes the more traditional route of disallowing HTML while automatically converting line breaks and links.

The standard way of doing this with PHP is to use the nl2br function. I've never been a big fan of this method as I prefer blocks of text to be surrounded by paragraph tags. Luckily, adding paragraph tags to blocks of text is a relatively easy task. Here's the pseudo-code, mocked up in Python because it's quicker to experiment with than PHP:

>>> text = '''... lengthy text block here ...'''
>>> paras = text.split('\n\n')
>>> paras = ['<p>%s</p>' % para.strip() for para in paras]
>>> print('\n\n'.join(paras))

The above code splits the text block on any occurrence of a double newline, then wraps each of the resulting blocks in a paragraph tag (after stripping off any remaining whitespace). It then joins the blocks back together with a pair of newlines between each one, because I like to keep my HTML nicely formatted. What it doesn't do is handle any necessary <br> tags. The trick now is to replace any single line breaks with <br> without interfering with the paragraph tags. The easiest way to do this is to put the replacement inside the loop, so that only line breaks that occur within a paragraph are replaced. Here's the updated list comprehension, which replaces the one above:


>>> paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
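Putting the two steps together, a minimal runnable sketch in Python 3 might look like the following. The function name text_to_paragraphs is made up for illustration, and the sketch assumes line endings have already been normalised to "\n".

```python
def text_to_paragraphs(text):
    """Wrap blank-line-separated blocks in <p> tags and turn the
    remaining single newlines into <br> tags."""
    paras = text.split('\n\n')
    return '\n\n'.join(
        '<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras
    )


sample = 'First line\nsecond line\n\nNew paragraph'
print(text_to_paragraphs(sample))
```

One limitation worth knowing about: because the split is on exactly two newlines, a run of four newlines yields an empty <p></p> between the two paragraphs.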

The final job is to convert the above into PHP:

$paras = explode("\n\n", $text);
for ($i = 0, $j = count($paras); $i < $j; $i++) {
    $paras[$i] = '<p>'.
        str_replace("\n", "<br>\n", trim($paras[$i])).
        '</p>';
}
$text = implode("\n\n", $paras);

That's the line conversions handled, but there are a few other important steps. Any HTML tags entered by the user need to be either stripped out or disabled by converting them to entities. Converting them to entities carries the risk of ugly failed attempts at HTML appearing on the comments page, but stripping tags carries an equal risk of innocent parts of a legitimate comment (such as a <wink>) being discarded. I chose to go the entity conversion route but force commenters to preview their comments before posting them, a trick I picked up from Adrian's blog. The final step is to automatically convert links into <a href=""> tags. I achieve this using a pair of naive regular expressions, in the hope that the preview screen will let authors catch any case where the expressions mangle a comment in a way they did not intend.
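To make the trade-off between the two options concrete, here is a small Python 3 sketch. It uses the standard library's html.escape, which only escapes &, <, > and quotes and so is closer to PHP's htmlspecialchars than to htmlentities. The tag-stripping regex is a deliberately crude assumption for illustration, not how any particular library strips tags.

```python
import html
import re

comment = 'Nice post <wink> but <b>bold</b> claims'

# Option 1: convert markup characters to entities so tags display as text
escaped = html.escape(comment)

# Option 2: strip anything that looks like a tag (crude illustrative regex)
stripped = re.sub(r'<[^>]*>', '', comment)

print(escaped)
print(stripped)
```

The escaped version keeps the <wink> visible (as &lt;wink&gt;) along with the failed attempt at bold, while the stripped version silently drops both.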

Here's the finished PHP function:

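function untrustedTextToHTML($text) {
    // Escape markup first so user-supplied tags are shown, not rendered
    $text = htmlentities($text);
    // Wrap each blank-line-separated block in <p>; single newlines become <br>
    $paras = explode("\n\n", $text);
    for ($i = 0, $j = count($paras); $i < $j; $i++) {
        $paras[$i] = '<p>'.
            str_replace("\n", "<br>\n", trim($paras[$i])).
            '</p>';
    }
    $text = implode("\n\n", $paras);
    // Convert http:// links
    $text = preg_replace('|\\b(http://[^\s)<]+)|',
        '<a href="$1">$1</a>', $text);
    // Convert www. links. The dot is escaped so it only matches a literal ".",
    // and (?<!/) skips a "www." directly after a slash so that URLs already
    // linked by the http:// pass above are not linked a second time
    $text = preg_replace('|(?<!/)\\b(www\\.[^\s)<]+)|',
        '<a href="http://$1">$1</a>', $text);
    return $text;
}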

I have no doubt it could be improved, but my tests so far have shown it to be good enough for the job at hand.
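For anyone who would rather experiment in Python, a rough Python 3 port might look like this. It is a sketch under a few assumptions: the name untrusted_text_to_html is made up, html.escape stands in for htmlentities (it escapes fewer characters), and the www. pattern escapes the dot and skips any "www." that directly follows a slash, so that a URL already linked by the http:// pass is not linked twice.

```python
import html
import re


def untrusted_text_to_html(text):
    # Escape markup first so user-supplied tags show up as plain text
    text = html.escape(text)
    # Wrap blank-line-separated blocks in <p>; single newlines become <br>
    paras = text.split('\n\n')
    text = '\n\n'.join(
        '<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras
    )
    # Convert http:// links
    text = re.sub(r'\b(http://[^\s)<]+)', r'<a href="\1">\1</a>', text)
    # Convert www. links, skipping any "www." that directly follows a slash
    text = re.sub(r'(?<!/)\b(www\.[^\s)<]+)',
                  r'<a href="http://\1">\1</a>', text)
    return text


print(untrusted_text_to_html('Hi <b>\nsee www.example.com'))
```

The patterns are still naive: a "www." appearing elsewhere inside an http:// URL (after a "=" in a query string, say) would still be linked a second time, which is exactly the sort of thing a preview step is there to catch.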
