Token Count Metric
Token Count and Halstead’s Metrics
Token count measures program size and complexity by treating source code as a sequence of tokens, each classified as either an operator or an operand. This idea underlies Halstead’s software metrics, which are widely used in tools that analyse code and estimate complexity.
In this context:
- Operators include:
  - Arithmetic symbols – + - * /
  - Keywords – while, for, if, return
  - Special symbols – { } ( ) = ; , [ ]
  - Function names used as actions – e.g. printf, scanf, eof, sort
- Operands include variables, constants, and labels used in the program.
Halstead’s central idea is that a program (an implementation of an algorithm) can be viewed as a collection of operator and operand tokens. From their counts, several base and derived measures are computed.
Base Measures
By scanning the source code and classifying tokens, we collect four base measures:
- n1 – number of distinct operators.
- n2 – number of distinct operands.
- N1 – total number of operator occurrences.
- N2 – total number of operand occurrences.
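As a minimal sketch of this counting step, the snippet below assumes the program has already been split into tokens and uses a small hand-written operator set. The names `OPERATORS` and `base_measures` are hypothetical, and a real tool would derive the operator set from the language grammar rather than hard-code it.

```python
from collections import Counter

# Hypothetical operator set for a small C-like language (an assumption
# made for this sketch, not a complete or standard list).
OPERATORS = {"+", "-", "*", "/", "=", "<", "<=", "++", ";", ",",
             "()", "[]", "{}", "if", "for", "while", "return", "int"}


def base_measures(tokens):
    """Return (n1, n2, N1, N2) for an already-tokenized program."""
    # Any token that is not a known operator is treated as an operand.
    operators = Counter(t for t in tokens if t in OPERATORS)
    operands = Counter(t for t in tokens if t not in OPERATORS)
    n1, n2 = len(operators), len(operands)                    # distinct counts
    N1, N2 = sum(operators.values()), sum(operands.values())  # total occurrences
    return n1, n2, N1, N2


# Tokens for the single statement `x = x + 1;`
tokens = ["x", "=", "x", "+", "1", ";"]
print(base_measures(tokens))  # (3, 2, 3, 3)
```

Here the operators are `=`, `+`, and `;` (three distinct, three occurrences), and the operands are `x` (twice) and `1` (once).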
Derived Halstead Metrics
From these four base values, Halstead defined several derived metrics:
Program vocabulary – total number of distinct tokens:
$$n = n_1 + n_2$$
Program length – total number of token occurrences:
$$N = N_1 + N_2$$
Estimated program length – theoretical length based on the vocabulary:
$$\hat{N} = n_1 \log_2 n_1 + n_2 \log_2 n_2$$
Program volume – information content of the program in bits:
$$V = N \log_2 n$$
Program difficulty – how hard the program is to write or understand:
$$D = \frac{n_1}{2} \cdot \frac{N_2}{n_2}$$
Program level – inverse of difficulty; higher level means easier (better) code:
$$L = \frac{1}{D}$$
Programming effort – estimated mental effort to implement or understand the program:
$$E = D \times V$$
As a rough guideline:
- Larger vocabulary and volume → more complex program.
- Higher difficulty and effort → more error-prone and harder to maintain.
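The derived metrics in this section can be sketched as one small function. The function name `halstead`, the dictionary keys, and the input counts in the usage line are illustrative assumptions, and the sketch assumes n1, n2, and N2 are all greater than zero so that the logarithms and divisions are defined.

```python
import math


def halstead(n1, n2, N1, N2):
    """Compute the derived Halstead metrics from the four base measures.

    Assumes n1 > 0, n2 > 0 and N2 > 0.
    """
    n = n1 + n2                                       # program vocabulary
    N = N1 + N2                                       # program length
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)   # estimated length
    V = N * math.log2(n)                              # volume, in bits
    D = (n1 / 2) * (N2 / n2)                          # difficulty
    L = 1 / D                                         # level (inverse of difficulty)
    E = D * V                                         # effort
    return {
        "vocabulary": n,
        "length": N,
        "estimated_length": N_hat,
        "volume": V,
        "difficulty": D,
        "level": L,
        "effort": E,
    }


# Made-up counts chosen so the arithmetic is easy to check by hand.
metrics = halstead(n1=4, n2=4, N1=8, N2=8)
print(metrics["volume"], metrics["difficulty"], metrics["effort"])  # 48.0 4.0 192.0
```

With these counts the vocabulary is 8 and the length is 16, so V = 16 × log₂ 8 = 48 bits, D = (4/2) × (8/4) = 4, and E = 4 × 48 = 192.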
Example: Token Count
Table 5 shows a sample token count for a small program. Operators and operands are listed with their number of occurrences.
| Operators | Occurrences | Operands | Occurrences |
|---|---|---|---|
| int | 4 | SORT | 1 |
| () | 5 | x | 7 |
| , | 4 | n | 3 |
| [] | 7 | i | 8 |
| if | 2 | j | 7 |
| < | 2 | save | 3 |
| ; | 11 | im1 | 3 |
| for | 2 | 2 | 2 |
| = | 6 | 1 | 3 |
| - | 1 | 0 | 1 |
| <= | 2 | – | – |
| ++ | 2 | – | – |
| return | 2 | – | – |
| { } | 3 | – | – |
| n1 = 14 | N1 = 53 | n2 = 10 | N2 = 38 |
From these totals you can compute the vocabulary $n = n_1 + n_2$, length $N = N_1 + N_2$, volume $V = N \log_2 n$, difficulty $D = (n_1/2) \cdot (N_2/n_2)$, and effort $E = D \times V$.
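As a worked check, substituting the table totals (n1 = 14, n2 = 10, N1 = 53, N2 = 38) into the standard Halstead formulas gives the following values, rounded for readability:

$$
\begin{aligned}
n &= 14 + 10 = 24 \\
N &= 53 + 38 = 91 \\
\hat{N} &= 14 \log_2 14 + 10 \log_2 10 \approx 53.3 + 33.2 = 86.5 \\
V &= 91 \log_2 24 \approx 91 \times 4.585 \approx 417.2 \text{ bits} \\
D &= \frac{14}{2} \cdot \frac{38}{10} = 7 \times 3.8 = 26.6 \\
L &= \frac{1}{26.6} \approx 0.0376 \\
E &= 26.6 \times 417.2 \approx 1.11 \times 10^{4}
\end{aligned}
$$

The estimated length of about 86.5 is close to the observed length of 91, which is the kind of agreement the estimate is meant to show.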