Tuesday, November 9, 2010

count the total number of words in a string

There are many scenarios in which you may wish to count the number of words in a string. For example, imagine that you run a Web site with a classified section and you restrict users to posting a classified ad with only, say, 200 words (or perhaps you charge for the ad based on the number of words in the ad).

As with removing extraneous spaces in a string, there are a number of ways to count the words in a string. One method uses the VBScript split function to turn the string into an array, splitting on the space character. (To learn more about split be sure to read Parsing with join and split.) So, if you have the string:
Dim str
str = "Today is a great day indeed, Bob."

And you use split to break it down into an array like so:
Dim aWords
aWords = split(str, " ")

The array aWords would have the following elements:

aWords(0) == "Today"
aWords(1) == "is"
aWords(2) == "a"
aWords(3) == "great"
aWords(4) == "day"
aWords(5) == "indeed,"
aWords(6) == "Bob."

So, to get the total number of words, all you have to do is use UBound(aWords) + 1. (You need to add one because the array is indexed from zero, so UBound(aWords) returns 6 for these seven words.) Things get a little more complex with this technique if your string contains consecutive spaces, like:
Dim str
str = "Hi.  How are you?"

Note that there are two spaces between "Hi." and "How are you?" When using split this will return the array as:

aWords(0) == "Hi."
aWords(1) == ""
aWords(2) == "How"
aWords(3) == "are"
aWords(4) == "you?"

Ah! The empty string between the two spaces is being counted as a word (see aWords(1)), so UBound(aWords) + 1 reports five words instead of four. To compensate for this we need to strip out all of the extraneous spaces in the string before applying the split solution. Fortunately there is a previous FAQ demonstrating how to remove extraneous spaces in a string: How can I remove multiple spaces between words in a string? Using the code presented in that FAQ, we have:
Dim str
str = "Hi.  How are you?"

'Start by trimming leading/trailing spaces
str = Trim(str)

'Now, while we have 2 consecutive spaces, replace them
'with a single space...
Do While InStr(1, str, "  ")
  str = Replace(str, "  ", " ")
Loop

Dim aWords
aWords = split(str, " ")
Response.Write "There are " & UBound(aWords) + 1 & " words in " & str
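
The steps above can be packaged into a reusable function. What follows is only a minimal sketch: the name CountWordsBySplit and its parameter sText are hypothetical (they do not come from the FAQ referenced above), and it assumes words are separated by ordinary spaces only, not tabs or line breaks.

```vbscript
' Hypothetical helper: counts space-delimited words in sText.
' Assumes words are separated by spaces only (no tabs or line breaks).
Function CountWordsBySplit(ByVal sText)
  ' Start by trimming leading/trailing spaces
  sText = Trim(sText)

  ' An empty string contains no words, so return early
  If Len(sText) = 0 Then
    CountWordsBySplit = 0
    Exit Function
  End If

  ' Collapse runs of spaces into single spaces
  Do While InStr(1, sText, "  ") > 0
    sText = Replace(sText, "  ", " ")
  Loop

  ' UBound is the highest zero-based index, so add one
  CountWordsBySplit = UBound(Split(sText, " ")) + 1
End Function

Response.Write CountWordsBySplit("Hi.  How are you?")   ' writes 4
```

Because the parameter is declared ByVal, the caller's string is left untouched even though the function trims and collapses its own copy.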

Neat, eh? There is, however, a much cleaner way of counting the number of words in a string, and it involves regular expressions. (For more information on regular expressions be sure to visit the Regular Expressions Article Index!) The regular expression to count the number of words in a string uses the non-greedy repetition pattern matching symbol. This special symbol is only available in the regular expression engine that ships with the Microsoft Scripting Engines version 5.5 or greater. To learn more about this special non-greedy matching symbol be sure to read: Picking Out Delimited Text with Regular Expressions.

To count the number of words in a sentence our regular expression should search for one or more word characters surrounded by word boundaries. A word boundary is the position where a word begins or ends: the point between a word character and a non-word character such as a space or punctuation mark, or the start or end of the string. For example, the string "Hello, how are you?" has two word boundaries around each word. The first occurs right before the first letter of the string, the second right before the comma after "Hello", the next right before the "h" in "how", and so on. Regular expressions have a special sequence for matching a word boundary: \b. Since we are looking for one or more word characters between word boundaries, our regular expression is:

\b(\w+?)\b

The \w character class matches any word character (a letter, a digit, or an underscore); the + means match one or more such characters; the ? makes the match non-greedy, which means it consumes as few characters as possible while still reaching the closing word boundary. So, in plain English, the regular expression states: "Match one or more word characters between word boundaries."
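
To see what the pattern actually picks out, it can help to list the individual matches. The following is a small sketch that assumes it runs in an ASP page (hence Response.Write); the variable names re and oMatch are just illustrative.

```vbscript
Dim re, oMatch
Set re = New RegExp
re.Global = True              ' find every match, not just the first
re.Pattern = "\b(\w+?)\b"

' Each Match object's Value property holds the matched text
For Each oMatch In re.Execute("Hello, how are you?")
  Response.Write oMatch.Value & "<br>"
Next
' Writes four lines: Hello, how, are, you
```

Note that the comma and the question mark never appear in the output, since neither is a word character.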

Unfortunately apostrophes count as word boundaries, meaning the string:

I'm funny.

Will be counted as three words: I, m, and funny. So... how can we fix this? It's a bit of a hack, but we can replace all apostrophes with empty strings in the string we pass to the Execute method. Examine the example below to see how this is done.
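
The effect of the apostrophe can be checked directly by comparing the match counts before and after stripping it. This is only an illustrative sketch, again assuming an ASP page; the variable names re and sText are made up for the example.

```vbscript
Dim re, sText
Set re = New RegExp
re.Global = True
re.Pattern = "\b(\w+?)\b"

sText = "I'm funny."

' With the apostrophe left in, "I" and "m" match separately
Response.Write re.Execute(sText).Count & "<br>"                     ' writes 3

' With the apostrophe removed, "Im" is matched as one word
Response.Write re.Execute(Replace(sText, "'", "")).Count & "<br>"   ' writes 2
```

The Replace call works on a temporary copy, so sText itself still contains the apostrophe and can be displayed to the user unchanged.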

Once we Execute this regular expression, we simply need to count the number of Matches returned and that will let us know how many words are in our string. An example can be seen below:
dim regex
set regex = new RegExp
regex.IgnoreCase = True
regex.Global = True
regex.Pattern = "\b(\w+?)\b"

'Remember to remove all apostrophes in str!
'Note the Replace statement in the Execute function
Response.Write "<p>There are " & _
                FormatNumber(regex.Execute(Replace(str,"'","")).Count, 0) & _
                " words in your sentence(s): """ & _
                str & """.</p>"