The Artima Developer Community

Python Buzz Forum
HTMLifying user input

Simon Willison

Posts: 282
Nickname: simonw
Registered: Jun, 2003

Simon Willison is a web technology enthusiast studying for a Computer Science degree at Bath Uni, UK
HTMLifying user input Posted: Oct 18, 2003 8:01 PM

This post originated from an RSS feed registered with Python Buzz by Simon Willison.
Original Post: HTMLifying user input
Feed Title: Simon Willison: Python
Feed URL: http://simon.incutio.com/syndicate/python/rss1.0
Feed Description: Simon Willison's Python category

I've added a comment system to my new Kansas blog. Since the target audience for that site is friends and family rather than fellow web developers, I've taken a very different approach to processing the input from comments. While this blog insists upon valid XHTML and gives very little help to comment posters aside from highlighting validation problems, my new site's comment system takes the more traditional route of disallowing HTML while automatically converting line breaks and links.

The standard way of doing this with PHP is to use the nl2br function. I've never been a big fan of this method as I prefer blocks of text to be surrounded by paragraph tags. Luckily, adding paragraph tags to blocks of text is a relatively easy task. Here's the pseudo-code, mocked up in Python because it's quicker to experiment with than PHP:

>>> text = '''... lengthy text block here ...'''
>>> paras = text.split('\n\n')
>>> paras = ['<p>%s</p>' % para.strip() for para in paras]
>>> print('\n\n'.join(paras))

The above code splits the text block on any occurrence of a double newline, then wraps each of the resulting blocks in a paragraph tag (after stripping off any remaining whitespace) before joining the blocks back together with a pair of newlines between each one - because I like to keep my HTML nicely formatted. What it doesn't do is handle any necessary <br> tags. The trick now is to replace any single line breaks with <br> without interfering with the paragraph tags. The easiest way to do this is to put the replacement inside the loop, so that only line breaks that occur within a paragraph are replaced. Here's the updated list comprehension:


>>> paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
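For experimenting outside the interactive prompt, the two steps can be wrapped in a single function. This is only a minimal sketch in Python 3: the name paragraphs_to_html is made up for illustration, and it assumes the line endings in the input are already plain "\n".

```python
def paragraphs_to_html(text):
    """Wrap blank-line-separated blocks in <p> tags and mark the single
    line breaks inside each block with <br>."""
    paras = text.split('\n\n')
    paras = ['<p>%s</p>' % p.strip().replace('\n', '<br>\n') for p in paras]
    return '\n\n'.join(paras)


sample = 'First line\nsecond line\n\nNew paragraph'
print(paragraphs_to_html(sample))
```

One thing the sketch makes easy to spot: a run of four or more newlines splits into an empty block, which comes out as an empty <p></p>. Filtering out empty blocks before wrapping would avoid that if it matters.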

The final job is to convert the above into PHP:

$paras = explode("\n\n", $text);
for ($i = 0, $j = count($paras); $i < $j; $i++) {
    $paras[$i] = '<p>'.
        str_replace("\n", "<br>\n", trim($paras[$i])).
        '</p>';
}
$text = implode("\n\n", $paras);

That's the line conversions handled, but there are a few other important steps. Any HTML tags entered by the user need to be either stripped out or disabled by converting them to entities. Converting them to entities carries the risk of ugly failed attempts at HTML appearing on the comments page, but stripping tags carries an equal risk of innocent parts of a legitimate comment (such as a <wink>) being discarded. I chose to go the entity conversion route but force commenters to preview their comments before posting them, a trick I picked up from Adrian's blog. The final step is to automatically convert links into <a href=""> tags. I achieve this using a pair of naive regular expressions, in the hope that the preview screen will catch any cases where they mangle a comment in a way the author did not intend.
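The trade-off between the two options is easy to see in a quick Python mock-up. This is a rough sketch rather than the code the site runs: html.escape stands in for PHP's htmlentities, and the tag-stripping regex is a deliberately simple one chosen for illustration.

```python
import html
import re

comment = 'I <b>love</b> this <wink>'

# Entity conversion: everything the commenter typed stays visible,
# including the failed attempt at bold text.
escaped = html.escape(comment)

# Naive tag stripping: the bold tags disappear, but so does the <wink>.
stripped = re.sub(r'<[^>]*>', '', comment)

print(escaped)
print(stripped)
```

The escaped version shows the literal <b> tags on the page, while the stripped version silently loses the <wink>, which is exactly the kind of thing a preview step lets the commenter notice.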

Here's the finished PHP function:

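function untrustedTextToHTML($text) {
    // Disable any HTML by converting special characters to entities
    $text = htmlentities($text);
    // Normalise line endings; form submissions typically use \r\n
    $text = str_replace(array("\r\n", "\r"), "\n", $text);
    $paras = explode("\n\n", $text);
    for ($i = 0, $j = count($paras); $i < $j; $i++) {
        $paras[$i] = '<p>'.
            str_replace("\n", "<br>\n", trim($paras[$i])).
            '</p>';
    }
    $text = implode("\n\n", $paras);
    // Convert http:// links
    $text = preg_replace('|\\b(http://[^\s)<]+)|',
        '<a href="$1">$1</a>', $text);
    // Convert www. links; the dot is escaped, and the lookbehind skips
    // any www. that is already part of a link made by the previous step
    $text = preg_replace('|(?<!/)\\b(www\.[^\s)<]+)|',
        '<a href="http://$1">$1</a>', $text);
    return $text;
}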

I have no doubt it could be improved, but my tests so far have shown it to be good enough for the job at hand.
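The link conversion step can also be tried on its own in Python. The sketch below is an approximation, not a port of the live code: linkify is a made-up name, and the escaped dot and the negative lookbehind in the second pattern are assumptions added for this example. The reason for the lookbehind is that a www. pattern with no guard, run after the http:// pass, would match again inside the anchors that pass has just produced.

```python
import re


def linkify(text):
    """Turn bare http:// and www. URLs in already-escaped text into links."""
    # Convert http:// links
    text = re.sub(r'\b(http://[^\s)<]+)', r'<a href="\1">\1</a>', text)
    # Convert www. links. The dot is escaped so it only matches a literal
    # ".", and the lookbehind skips any "www." that directly follows a "/",
    # i.e. one that is already part of an http:// URL linked above.
    text = re.sub(r'(?<!/)\b(www\.[^\s)<]+)',
                  r'<a href="http://\1">\1</a>', text)
    return text


print(linkify('<p>See http://www.example.com/ or www.example.org</p>'))
```

The patterns are still naive in other ways. For example, punctuation that directly follows a URL, such as a trailing full stop, ends up inside the link.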
