This post originated from an RSS feed registered with .NET Buzz
by Roy Osherove.
Original Post: Using Regex to return the first N words in a string
Feed Title: ISerializable
Feed URL: http://www.asp.net/err404.htm?aspxerrorpath=/rosherove/Rss.aspx
Feed Description: Roy Osherove's persistent thoughts
Jeff Perrin needed a function to return the first N words in a string (to create
a small summary or snippet). He did it by parsing the string manually, an awkward
approach that is more error prone and usually makes for less readable code.
Fortunately, regular expressions handle this quite nicely. Here's a test that
makes sure we get the first 4 words in a string, and the function
"FindFirstWords", which does this very easily using a simple regular expression.
What I'm doing here is using the expression to find the first 4 occurrences of
alphanumeric text followed by one or more spaces. Then I simply iterate over the
captures in the match I found. The match should contain 4 captures inside it,
one for each "word" that was found.
It's not fully tested, as you can see. I only wrote one test to see that it works
on this sort of sentence. More tests could and should be added to cover other
cases. In fact, if this were real TDD, I would have started with a test of an
empty string, then continued on to test getting only one word, then two, and
so on.
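As a rough sketch of the approach described above (not the post's own listing), a function along these lines would collect the captures of a repeated "word plus whitespace" group. The class name `Snippets`, the parameter names, and the `{1,N}` quantifier (so that shorter inputs still match) are assumptions made for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class Snippets
{
    // Hypothetical sketch: returns up to wordCount leading words of input,
    // separated by single spaces.
    public static string FindFirstWords(string input, int wordCount)
    {
        if (string.IsNullOrEmpty(input) || wordCount <= 0)
            return string.Empty;

        // A "word" here is one or more word characters followed by one or
        // more whitespace characters, repeated up to wordCount times.
        string pattern = @"(\w+\s+){1," + wordCount + "}";
        Match match = Regex.Match(input, pattern);
        if (!match.Success)
            return string.Empty;

        // The repeated group records one capture per word it matched.
        StringBuilder result = new StringBuilder();
        foreach (Capture word in match.Groups[1].Captures)
        {
            if (result.Length > 0)
                result.Append(' ');
            result.Append(word.Value.Trim());
        }
        return result.ToString();
    }
}
```

With the sentence used in the test, `Snippets.FindFirstWords(INPUT, 4)` should return `this is word four`, and an empty string yields an empty result. Because each word must be followed by whitespace, a final word with nothing after it is not counted, which is exactly the kind of case the extra tests would surface.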
[Test]
public void TestRegexFindFirstNWords()
{
const string INPUT =
"this is word four five six seven eight nine ten eleven twelve thirteen!";