Tuesday, November 9, 2010
count the total number of words in a string
There are many scenarios in which you may wish to count the number of words in a string. For example, imagine that you run a Web site with a classified section and you restrict users to posting a classified ad with only, say, 200 words (or perhaps you charge for the ad based on the number of words in the ad).
As with removing extraneous spaces in a string, there are a number of ways to count the words in a string. One method involves using split to turn the string into an array. Basically, you are just using the VBScript split function to delimit on the space character. (To learn more about split, be sure to read Parsing with join and split.) So, if you have the string: Dim str |
And you use split to break it down into an array like so: Dim aWords |
The array aWords would have the following elements:
aWords(0) == "Today"
aWords(1) == "is"
aWords(2) == "a"
aWords(3) == "great"
aWords(4) == "day"
aWords(5) == "indeed,"
aWords(6) == "Bob."
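The split step can be sketched as follows. This is a minimal illustration, not the article's original listing: the sentence is an assumption inferred from the elements listed above, and CountWordsBySplit is a hypothetical helper name. It also returns UBound + 1, which the next paragraph explains.

```vbscript
' Hypothetical helper (not from the original article): counts words by
' splitting the string on single space characters.
Function CountWordsBySplit(ByVal str)
    Dim aWords
    aWords = Split(str, " ")
    ' UBound returns the highest index; the array is zero-based, so add one.
    CountWordsBySplit = UBound(aWords) + 1
End Function

Dim sampleSentence, sampleCount
' Assumed sentence, inferred from the array elements shown above.
sampleSentence = "Today is a great day indeed, Bob."
sampleCount = CountWordsBySplit(sampleSentence)   ' 7
```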
So, to get the total number of words all you would have to do is use UBound(aWords) + 1 (you need to add one because UBound(aWords) would return 6, as the array is indexed from zero). Things get a little more complex with this technique if your sentence has multiple spaces in the string, like: Dim str |
Note that there are two spaces between "Hi." and "How are you?" When using split this will return the array as:
aWords(0) == "Hi."
aWords(1) == ""
aWords(2) == "How"
aWords(3) == "are"
aWords(4) == "you?"
Ah! It's counting the empty string between the two spaces as a word (see aWords(1)). To compensate for this we would need to strip out all of the extraneous spaces in the string before applying the split solution. Fortunately, there is a previous FAQ demonstrating how to remove extraneous spaces in a string: How can I remove multiple spaces between words in a string? Using the code presented in that FAQ, we have: Dim str |
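One way to picture the compensation step is the sketch below. It is not the code from the FAQ mentioned above; the Replace loop is a stand-in technique, and CollapseSpaces and CountWordsNormalized are hypothetical helper names introduced only for illustration.

```vbscript
' Hypothetical helper: collapses every run of spaces into a single space
' and trims the ends. A stand-in for the FAQ's code, which is not shown here.
Function CollapseSpaces(ByVal str)
    Do While InStr(str, "  ") > 0
        str = Replace(str, "  ", " ")
    Loop
    CollapseSpaces = Trim(str)
End Function

' Hypothetical helper: normalizes the spacing, then counts with Split.
Function CountWordsNormalized(ByVal str)
    Dim cleaned, aWords
    cleaned = CollapseSpaces(str)
    If Len(cleaned) = 0 Then
        ' Nothing left after trimming, so there are no words.
        CountWordsNormalized = 0
    Else
        aWords = Split(cleaned, " ")
        CountWordsNormalized = UBound(aWords) + 1
    End If
End Function
```

With the double-spaced example, a plain Split reports five elements, while CountWordsNormalized("Hi.  How are you?") returns 4.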
Neat, eh? There is, however, a much cleaner way of counting the number of words in a string, and it involves regular expressions. (For more information on regular expressions be sure to visit the Regular Expressions Article Index!) The regular expression to count the number of words in a string uses the non-greedy repetition pattern matching symbol. This special symbol is only available in the regular expression engine that ships with the Microsoft Scripting Engines version 5.5 or greater. To learn more about this special non-greedy matching symbol be sure to read: Picking Out Delimited Text with Regular Expressions.
To count the number of words in a sentence our regular expression should search for one or more word characters surrounded by word boundaries. A word boundary is the position where a word begins or ends, that is, where a word character sits next to a space, a punctuation mark, or the start or end of the string. For example, the string "Hello, how are you?" has two word boundaries around each word. The first occurs right before the first letter of the string, the second right before the comma after "Hello", the next right before the "h" in "how," and so on. Regular expressions have a special character for matching a word boundary: \b. Since we are looking for one or more word characters between word boundaries, our regular expression is:
\b(\w+?)\b
The \w character translates to any word character (any alphanumeric character or the underscore); the + means match one or more such characters; the ? means to apply the non-greedy search, which basically means match the fewest characters that appear between two word boundaries. So, in plain English, the regular expression states: "Match one or more word characters between word boundaries."
Unfortunately, an apostrophe is not a word character, so it creates word boundaries of its own, meaning the string:
I'm funny.
will be counted as three words: I, m, and funny. So... how can we fix this? It's a bit of a hack, but in the call to Execute we can replace all apostrophes with empty strings. Examine the example below to see how this is done.
Once we Execute this regular expression, we simply need to count the number of Matches returned and that will let us know how many words are in our string. An example can be seen below: dim regex |
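A minimal sketch of the regular expression approach might look like the following. It assumes the VBScript RegExp object from the scripting engines version 5.5 or later, uses the pattern given above, and strips apostrophes before calling Execute. CountWordsRegex is a hypothetical name, not the article's original listing.

```vbscript
' Hypothetical helper: counts words with the pattern \b(\w+?)\b.
Function CountWordsRegex(ByVal str)
    Dim regex, matches
    Set regex = New RegExp
    regex.Pattern = "\b(\w+?)\b"
    regex.Global = True   ' find every match, not just the first one
    ' Remove apostrophes first so that "I'm" is treated as a single word.
    Set matches = regex.Execute(Replace(str, "'", ""))
    CountWordsRegex = matches.Count
End Function
```

Under these assumptions, CountWordsRegex("I'm funny.") returns 2 and CountWordsRegex("Hello, how are you?") returns 4. Extra spaces between words need no special handling here, since the pattern only matches runs of word characters.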