13. Counting: Repetition and Regexps

Repetition and regular expression searches are powerful tools that you often use when you write code in Emacs Lisp. This chapter illustrates the use of regular expression searches through the construction of word count commands using while loops and recursion.

Counting words
13.1 The `count-words-region` Function		Use a regexp, but find a problem.
13.2 Count Words Recursively		Start with case of no words in region.
13.3 Exercise: Counting Punctuation

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

Counting words

The standard Emacs distribution contains a function for counting the number of lines within a region. However, there is no corresponding function for counting words.

Certain types of writing ask you to count words. Thus, if you write an essay, you may be limited to 800 words; if you write a novel, you may discipline yourself to write 1000 words a day. It seems odd to me that Emacs lacks a word count command. Perhaps people use Emacs mostly for code or types of documentation that do not require word counts; or perhaps they restrict themselves to the operating system word count command, wc. Alternatively, people may follow the publishers' convention and compute a word count by dividing the number of characters in a document by five. In any event, here are commands to count words.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

13.1 The `count-words-region` Function

A word count command could count words in a line, paragraph, region, or buffer. What should the command cover? You could design the command to count the number of words in a complete buffer. However, the Emacs tradition encourages flexibility--you may want to count words in just a section, rather than all of a buffer. So it makes more sense to design the command to count the number of words in a region. Once you have a count-words-region command, you can, if you wish, count words in a whole buffer by marking it with C-x h (mark-whole-buffer).

Clearly, counting words is a repetitive act: starting from the beginning of the region, you count the first word, then the second word, then the third word, and so on, until you reach the end of the region. This means that word counting is ideally suited to recursion or to a while loop.

Designing `count-words-region`		The definition using a `while` loop.
13.1.1 The Whitespace Bug in `count-words-region`

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

Designing `count-words-region`

First, we will implement the word count command with a while loop, then with recursion. The command will, of course, be interactive.

The template for an interactive function definition is, as always:

(defun name-of-function (argument-list)
  "documentation..."
  (interactive-expression...)
  body...)

What we need to do is fill in the slots.

The name of the function should be self-explanatory and similar to the existing count-lines-region name. This makes the name easier to remember. count-words-region is a good choice.

The function counts words within a region. This means that the argument list must contain symbols that are bound to the two positions, the beginning and end of the region. These two positions can be called `beginning' and `end' respectively. The first line of the documentation should be a single sentence, since that is all that is printed as documentation by a command such as apropos. The interactive expression will be of the form `(interactive "r")', since that will cause Emacs to pass the beginning and end of the region to the function's argument list. All this is routine.

The body of the function needs to be written to do three tasks: first, to set up conditions under which the while loop can count words, second, to run the while loop, and third, to send a message to the user.

When a user calls count-words-region, point may be at the beginning or the end of the region. However, the counting process must start at the beginning of the region. This means we will want to put point there if it is not already there. Executing (goto-char beginning) ensures this. Of course, we will want to return point to its expected position when the function finishes its work. For this reason, the body must be enclosed in a save-excursion expression.

The central part of the body of the function consists of a while loop in which one expression jumps point forward word by word, and another expression counts those jumps. The true-or-false-test of the while loop should test true so long as point should jump forward, and false when point is at the end of the region.

We could use (forward-word 1) as the expression for moving point forward word by word, but it is easier to see what Emacs identifies as a `word' if we use a regular expression search.

A regular expression search that finds the pattern for which it is searching leaves point after the last character matched. This means that a succession of successful word searches will move point forward word by word.

As a practical matter, we want the regular expression search to jump over whitespace and punctuation between words as well as over the words themselves. A regexp that refuses to jump over interword whitespace would never jump more than one word! This means that the regexp should include the whitespace and punctuation that follows a word, if any, as well as the word itself. (A word may end a buffer and not have any following whitespace or punctuation, so that part of the regexp must be optional.)

Thus, what we want for the regexp is a pattern defining one or more word constituent characters followed, optionally, by one or more characters that are not word constituents. The regular expression for this is:

\w+\W*

The buffer's syntax table determines which characters are and are not word constituents. (See section What Constitutes a Word or Symbol?, for more about syntax. Also, see section `The Syntax Table' in The GNU Emacs Manual, and section `Syntax Tables' in The GNU Emacs Lisp Reference Manual.)

The search expression looks like this:

(re-search-forward "\\w+\\W*")

(Note that paired backslashes precede the `w' and `W'. A single backslash has special meaning to the Emacs Lisp interpreter. It indicates that the following character is interpreted differently than usual. For example, the two characters, `\n', stand for `newline', rather than for a backslash followed by `n'. Two backslashes in a row stand for an ordinary, `unspecial' backslash.)

We need a counter to count how many words there are; this variable must first be set to 0 and then incremented each time Emacs goes around the while loop. The incrementing expression is simply:

(setq count (1+ count))

Finally, we want to tell the user how many words there are in the region. The message function is intended for presenting this kind of information to the user. The message has to be phrased so that it reads properly regardless of how many words there are in the region: we don't want to say that "there are 1 words in the region". The conflict between singular and plural is ungrammatical. We can solve this problem by using a conditional expression that evaluates different messages depending on the number of words in the region. There are three possibilities: no words in the region, one word in the region, and more than one word. This means that the cond special form is appropriate.

All this leads to the following function definition:

;;; First version; has bugs!
(defun count-words-region (beginning end)
  "Print number of words in the region.
Words are defined as at least one word-constituent
character followed by at least one character that
is not a word-constituent.  The buffer's syntax
table determines which characters these are."
  (interactive "r")
  (message "Counting words in region ... ")

;;; 1. Set up appropriate conditions.
  (save-excursion
    (goto-char beginning)
    (let ((count 0))

;;; 2. Run the while loop.
      (while (< (point) end)
        (re-search-forward "\\w+\\W*")
        (setq count (1+ count)))

;;; 3. Send a message to the user.
      (cond ((zerop count)
             (message
              "The region does NOT have any words."))
            ((= 1 count)
             (message
              "The region has 1 word."))
            (t
             (message
              "The region has %d words." count))))))

As written, the function works, but not in all circumstances.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

13.1.1 The Whitespace Bug in `count-words-region`

The count-words-region command described in the preceding section has two bugs, or rather, one bug with two manifestations. First, if you mark a region containing only whitespace in the middle of some text, the count-words-region command tells you that the region contains one word! Second, if you mark a region containing only whitespace at the end of the buffer or the accessible portion of a narrowed buffer, the command displays an error message that looks like this:

Search failed: "\\w+\\W*"

If you are reading this in Info in GNU Emacs, you can test for these bugs yourself.

First, evaluate the function in the usual manner to install it. Here is a copy of the definition. Place your cursor after the closing parenthesis and type C-x C-e to install it.

;; First version; has bugs!
(defun count-words-region (beginning end)
  "Print number of words in the region.
Words are defined as at least one word-constituent character followed
by at least one character that is not a word-constituent.  The buffer's
syntax table determines which characters these are."
  (interactive "r")
  (message "Counting words in region ... ")

;;; 1. Set up appropriate conditions.
  (save-excursion
    (goto-char beginning)
    (let ((count 0))

;;; 2. Run the while loop.
      (while (< (point) end)
        (re-search-forward "\\w+\\W*")
        (setq count (1+ count)))

;;; 3. Send a message to the user.
      (cond ((zerop count)
             (message "The region does NOT have any words."))
            ((= 1 count) (message "The region has 1 word."))
            (t (message "The region has %d words." count))))))

If you wish, you can also install this keybinding by evaluating it:

(global-set-key "\C-c=" 'count-words-region)

To conduct the first test, set mark and point to the beginning and end of the following line and then type C-c = (or M-x count-words-region if you have not bound C-c =):

    one   two  three

Emacs will tell you, correctly, that the region has three words.

Repeat the test, but place mark at the beginning of the line and place point just before the word `one'. Again type the command C-c = (or M-x count-words-region). Emacs should tell you that the region has no words, since it is composed only of the whitespace at the beginning of the line. But instead Emacs tells you that the region has one word!

For the third test, copy the sample line to the end of the `*scratch*' buffer and then type several spaces at the end of the line. Place mark right after the word `three' and point at the end of line. (The end of the line will be the end of the buffer.) Type C-c = (or M-x count-words-region) as you did before. Again, Emacs should tell you that the region has no words, since it is composed only of the whitespace at the end of the line. Instead, Emacs displays an error message saying `Search failed'.

The two bugs stem from the same problem.

Consider the first manifestation of the bug, in which the command tells you that the whitespace at the beginning of the line contains one word. What happens is this: The M-x count-words-region command moves point to the beginning of the region. The while tests whether the value of point is smaller than the value of end, which it is. Consequently, the regular expression search looks for and finds the first word. It leaves point after the word. count is set to one. The while loop repeats; but this time the value of point is larger than the value of end, the loop is exited; and the function displays a message saying the number of words in the region is one. In brief, the regular expression search looks for and finds the word even though it is outside the marked region.

In the second manifestation of the bug, the region is whitespace at the end of the buffer. Emacs says `Search failed'. What happens is that the true-or-false-test in the while loop tests true, so the search expression is executed. But since there are no more words in the buffer, the search fails.

In both manifestations of the bug, the search extends or attempts to extend outside of the region.

The solution is to limit the search to the region--this is a fairly simple action, but as you may have come to expect, it is not quite as simple as you might think.

As we have seen, the re-search-forward function takes a search pattern as its first argument. But in addition to this first, mandatory argument, it accepts three optional arguments. The optional second argument bounds the search. The optional third argument, if t, causes the function to return nil rather than signal an error if the search fails. The optional fourth argument is a repeat count. (In Emacs, you can see a function's documentation by typing C-h f, the name of the function, and then RET.)

In the count-words-region definition, the value of the end of the region is held by the variable end which is passed as an argument to the function. Thus, we can add end as an argument to the regular expression search expression:

(re-search-forward "\\w+\\W*" end)

However, if you make only this change to the count-words-region definition and then test the new version of the definition on a stretch of whitespace, you will receive an error message saying `Search failed'.

What happens is this: the search is limited to the region, and fails as you expect because there are no word-constituent characters in the region. Since it fails, we receive an error message. But we do not want to receive an error message in this case; we want to receive the message that "The region does NOT have any words."

The solution to this problem is to provide re-search-forward with a third argument of t, which causes the function to return nil rather than signal an error if the search fails.

However, if you make this change and try it, you will see the message "Counting words in region ... " and ... you will keep on seeing that message ..., until you type C-g (keyboard-quit).

Here is what happens: the search is limited to the region, as before, and it fails because there are no word-constituent characters in the region, as expected. Consequently, the re-search-forward expression returns nil. It does nothing else. In particular, it does not move point, which it does as a side effect if it finds the search target. After the re-search-forward expression returns nil, the next expression in the while loop is evaluated. This expression increments the count. Then the loop repeats. The true-or-false-test tests true because the value of point is still less than the value of end, since the re-search-forward expression did not move point. ... and the cycle repeats ...

The count-words-region definition requires yet another modification, to cause the true-or-false-test of the while loop to test false if the search fails. Put another way, there are two conditions that must be satisfied in the true-or-false-test before the word count variable is incremented: point must still be within the region and the search expression must have found a word to count.

Since both the first condition and the second condition must be true together, the two expressions, the region test and the search expression, can be joined with an and special form and embedded in the while loop as the true-or-false-test, like this:

(and (< (point) end) (re-search-forward "\\w+\\W*" end t))

(See section 12.4 forward-paragraph: a Goldmine of Functions, for information about and.)

The re-search-forward expression returns t if the search succeeds and as a side effect moves point. Consequently, as words are found, point is moved through the region. When the search expression fails to find another word, or when point reaches the end of the region, the true-or-false-test tests false, the while loop exists, and the count-words-region function displays one or other of its messages.

After incorporating these final changes, the count-words-region works without bugs (or at least, without bugs that I have found!). Here is what it looks like:

;;; Final version: while
(defun count-words-region (beginning end)
  "Print number of words in the region."
  (interactive "r")
  (message "Counting words in region ... ")

;;; 1. Set up appropriate conditions.
  (save-excursion
    (let ((count 0))
      (goto-char beginning)

;;; 2. Run the while loop.
      (while (and (< (point) end)
                  (re-search-forward "\\w+\\W*" end t))
        (setq count (1+ count)))

;;; 3. Send a message to the user.
      (cond ((zerop count)
             (message
              "The region does NOT have any words."))
            ((= 1 count)
             (message
              "The region has 1 word."))
            (t
             (message
              "The region has %d words." count))))))

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

13.2 Count Words Recursively

You can write the function for counting words recursively as well as with a while loop. Let's see how this is done.

First, we need to recognize that the count-words-region function has three jobs: it sets up the appropriate conditions for counting to occur; it counts the words in the region; and it sends a message to the user telling how many words there are.

If we write a single recursive function to do everything, we will receive a message for every recursive call. If the region contains 13 words, we will receive thirteen messages, one right after the other. We don't want this! Instead, we must write two functions to do the job, one of which (the recursive function) will be used inside of the other. One function will set up the conditions and display the message; the other will return the word count.

Let us start with the function that causes the message to be displayed. We can continue to call this count-words-region.

This is the function that the user will call. It will be interactive. Indeed, it will be similar to our previous versions of this function, except that it will call recursive-count-words to determine how many words are in the region.

We can readily construct a template for this function, based on our previous versions:

;; Recursive version; uses regular expression search
(defun count-words-region (beginning end)
  "documentation..."
  (interactive-expression...)

;;; 1. Set up appropriate conditions.
  (explanatory message)
  (set-up functions...

;;; 2. Count the words.
    recursive call

;;; 3. Send a message to the user.
    message providing word count))

The definition looks straightforward, except that somehow the count returned by the recursive call must be passed to the message displaying the word count. A little thought suggests that this can be done by making use of a let expression: we can bind a variable in the varlist of a let expression to the number of words in the region, as returned by the recursive call; and then the cond expression, using binding, can display the value to the user.

Often, one thinks of the binding within a let expression as somehow secondary to the `primary' work of a function. But in this case, what you might consider the `primary' job of the function, counting words, is done within the let expression.

Using let, the function definition looks like this:

(defun count-words-region (beginning end)
  "Print number of words in the region."
  (interactive "r")

;;; 1. Set up appropriate conditions.
  (message "Counting words in region ... ")
  (save-excursion
    (goto-char beginning)

;;; 2. Count the words.
    (let ((count (recursive-count-words end)))

;;; 3. Send a message to the user.
      (cond ((zerop count)
             (message
              "The region does NOT have any words."))
            ((= 1 count)
             (message
              "The region has 1 word."))
            (t
             (message
              "The region has %d words." count))))))

Next, we need to write the recursive counting function.

A recursive function has at least three parts: the `do-again-test', the `next-step-expression', and the recursive call.

The do-again-test determines whether the function will or will not be called again. Since we are counting words in a region and can use a function that moves point forward for every word, the do-again-test can check whether point is still within the region. The do-again-test should find the value of point and determine whether point is before, at, or after the value of the end of the region. We can use the point function to locate point. Clearly, we must pass the value of the end of the region to the recursive counting function as an argument.

In addition, the do-again-test should also test whether the search finds a word. If it does not, the function should not call itself again.

The next-step-expression changes a value so that when the recursive function is supposed to stop calling itself, it stops. More precisely, the next-step-expression changes a value so that at the right time, the do-again-test stops the recursive function from calling itself again. In this case, the next-step-expression can be the expression that moves point forward, word by word.

The third part of a recursive function is the recursive call.

Somewhere, also, we also need a part that does the `work' of the function, a part that does the counting. A vital part!

But already, we have an outline of the recursive counting function:

(defun recursive-count-words (region-end)
  "documentation..."
   do-again-test
   next-step-expression
   recursive call)

Now we need to fill in the slots. Let's start with the simplest cases first: if point is at or beyond the end of the region, there cannot be any words in the region, so the function should return zero. Likewise, if the search fails, there are no words to count, so the function should return zero.

On the other hand, if point is within the region and the search succeeds, the function should call itself again.

Thus, the do-again-test should look like this:

(and (< (point) region-end)
     (re-search-forward "\\w+\\W*" region-end t))

Note that the search expression is part of the do-again-test--the function returns t if its search succeeds and nil if it fails. (See section The Whitespace Bug in count-words-region, for an explanation of how re-search-forward works.)

The do-again-test is the true-or-false test of an if clause. Clearly, if the do-again-test succeeds, the then-part of the if clause should call the function again; but if it fails, the else-part should return zero since either point is outside the region or the search failed because there were no words to find.

But before considering the recursive call, we need to consider the next-step-expression. What is it? Interestingly, it is the search part of the do-again-test.

In addition to returning t or nil for the do-again-test, re-search-forward moves point forward as a side effect of a successful search. This is the action that changes the value of point so that the recursive function stops calling itself when point completes its movement through the region. Consequently, the re-search-forward expression is the next-step-expression.

In outline, then, the body of the recursive-count-words function looks like this:

(if do-again-test-and-next-step-combined
    ;; then
    recursive-call-returning-count
  ;; else
  return-zero)

How to incorporate the mechanism that counts?

If you are not used to writing recursive functions, a question like this can be troublesome. But it can and should be approached systematically.

We know that the counting mechanism should be associated in some way with the recursive call. Indeed, since the next-step-expression moves point forward by one word, and since a recursive call is made for each word, the counting mechanism must be an expression that adds one to the value returned by a call to recursive-count-words.

Consider several cases:

If there are two words in the region, the function should return a value resulting from adding one to the value returned when it counts the first word, plus the number returned when it counts the remaining words in the region, which in this case is one.
If there is one word in the region, the function should return a value resulting from adding one to the value returned when it counts that word, plus the number returned when it counts the remaining words in the region, which in this case is zero.
If there are no words in the region, the function should return zero.

From the sketch we can see that the else-part of the if returns zero for the case of no words. This means that the then-part of the if must return a value resulting from adding one to the value returned from a count of the remaining words.

The expression will look like this, where 1+ is a function that adds one to its argument.

(1+ (recursive-count-words region-end))

The whole recursive-count-words function will then look like this:

(defun recursive-count-words (region-end)
  "documentation..."

;;; 1. do-again-test
  (if (and (< (point) region-end)
           (re-search-forward "\\w+\\W*" region-end t))

;;; 2. then-part: the recursive call
      (1+ (recursive-count-words region-end))

;;; 3. else-part
    0))

Let's examine how this works:

If there are no words in the region, the else part of the if expression is evaluated and consequently the function returns zero.

If there is one word in the region, the value of point is less than the value of region-end and the search succeeds. In this case, the true-or-false-test of the if expression tests true, and the then-part of the if expression is evaluated. The counting expression is evaluated. This expression returns a value (which will be the value returned by the whole function) that is the sum of one added to the value returned by a recursive call.

Meanwhile, the next-step-expression has caused point to jump over the first (and in this case only) word in the region. This means that when (recursive-count-words region-end) is evaluated a second time, as a result of the recursive call, the value of point will be equal to or greater than the value of region end. So this time, recursive-count-words will return zero. The zero will be added to one, and the original evaluation of recursive-count-words will return one plus zero, which is one, which is the correct amount.

Clearly, if there are two words in the region, the first call to recursive-count-words returns one added to the value returned by calling recursive-count-words on a region containing the remaining word--that is, it adds one to one, producing two, which is the correct amount.

Similarly, if there are three words in the region, the first call to recursive-count-words returns one added to the value returned by calling recursive-count-words on a region containing the remaining two words--and so on and so on.

With full documentation the two functions look like this:

The recursive function:

(defun recursive-count-words (region-end)
  "Number of words between point and REGION-END."

;;; 1. do-again-test
  (if (and (< (point) region-end)
           (re-search-forward "\\w+\\W*" region-end t))

;;; 2. then-part: the recursive call
      (1+ (recursive-count-words region-end))

;;; 3. else-part
    0))

The wrapper:

;;; Recursive version
(defun count-words-region (beginning end)
  "Print number of words in the region.

Words are defined as at least one word-constituent
character followed by at least one character that is
not a word-constituent.  The buffer's syntax table
determines which characters these are."
  (interactive "r")
  (message "Counting words in region ... ")
  (save-excursion
    (goto-char beginning)
    (let ((count (recursive-count-words end)))
      (cond ((zerop count)
             (message
              "The region does NOT have any words."))
            ((= 1 count)
             (message "The region has 1 word."))
            (t
             (message
              "The region has %d words." count))))))

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

13.3 Exercise: Counting Punctuation

Using a while loop, write a function to count the number of punctuation marks in a region--period, comma, semicolon, colon, exclamation mark, and question mark. Do the same using recursion.

[ << ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

This document was generated by Dohn Arms on March, 6 2005 using texi2html

13. Counting: Repetition and Regexps

Counting words

13.1 The count-words-region Function

Designing count-words-region

13.1.1 The Whitespace Bug in count-words-region

13.2 Count Words Recursively

13.3 Exercise: Counting Punctuation

13.1 The `count-words-region` Function

Designing `count-words-region`

13.1.1 The Whitespace Bug in `count-words-region`