Token Count Metric

Token Count and Halstead’s Metrics

Token count measures program size and complexity by treating source code as a sequence of tokens, each classified as either an operator or an operand. This idea underlies Halstead’s software metrics, which are widely used in tools that analyse code and estimate complexity.

In this context, operators include:

- Arithmetic symbols – + - * /
- Keywords – while, for, if, return
- Special symbols – { } ( ) = ; , [ ]
- Function names used as actions – e.g. printf, scanf, eof, sort

Operands include variables, constants, and labels used in the program.
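
The classification above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `KEYWORDS`, `FUNCTIONS`, `TOKEN_RE`, and `classify` are hypothetical, the operator sets are deliberately tiny, and each bracket is treated as its own token, whereas some counting conventions treat a pair such as ( ) as a single operator.

```python
import re

# Assumed, illustrative operator sets; a real tool would cover the whole language.
KEYWORDS = {"while", "for", "if", "return", "int"}
FUNCTIONS = {"printf", "scanf", "eof", "sort"}  # function names used as actions

TOKEN_RE = re.compile(r"""
    [A-Za-z_]\w*            # identifiers and keywords
  | \d+                     # integer constants
  | <=|>=|==|!=|\+\+|--     # two-character operators
  | [-+*/=<>;,(){}\[\]]     # single-character operators and special symbols
""", re.VERBOSE)

def classify(source):
    """Return (token, kind) pairs, where kind is 'operator' or 'operand'."""
    pairs = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS or tok in FUNCTIONS:
            kind = "operator"
        elif tok[0].isalnum() or tok[0] == "_":
            kind = "operand"   # variables and constants
        else:
            kind = "operator"  # arithmetic and special symbols
        pairs.append((tok, kind))
    return pairs

for token, kind in classify("if (i <= n) return x;"):
    print(token, kind)
```

Whitespace is skipped because the pattern never matches it. Comments, string literals, and labels are not handled in this sketch.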

Halstead’s central idea is that a program (an implementation of an algorithm) can be viewed as a collection of operator and operand tokens. From their counts, several base and derived measures are computed.

Base Measures

By scanning the source code and classifying tokens, we collect four base measures:

- n₁ – number of distinct operators.
- n₂ – number of distinct operands.
- N₁ – total number of operator occurrences.
- N₂ – total number of operand occurrences.
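
Once tokens are classified, the four base measures reduce to counting. A minimal sketch, assuming the tokens arrive as hand-classified `(token, kind)` pairs (the function name `base_measures` and the input format are illustrative choices, not part of Halstead's definition):

```python
from collections import Counter

def base_measures(pairs):
    """Compute (n1, n2, N1, N2) from (token, kind) pairs."""
    operators = Counter(tok for tok, kind in pairs if kind == "operator")
    operands = Counter(tok for tok, kind in pairs if kind == "operand")
    n1, n2 = len(operators), len(operands)                      # distinct tokens
    N1, N2 = sum(operators.values()), sum(operands.values())    # total occurrences
    return n1, n2, N1, N2

# Hand-classified tokens for the statement: x = x + 1;
pairs = [("x", "operand"), ("=", "operator"), ("x", "operand"),
         ("+", "operator"), ("1", "operand"), (";", "operator")]
print(base_measures(pairs))  # (3, 2, 3, 3)
```

Here the operators are =, + and ; (three distinct, three occurrences), and the operands are x (twice) and 1 (two distinct, three occurrences).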

Derived Halstead Metrics

From these four base values, Halstead defined several derived metrics:

Program vocabulary – total number of distinct tokens:

$$\mathrm{n} = n_1 + n_2$$

Program length – total number of token occurrences:

$$\mathrm{N} = N_1 + N_2$$

Estimated program length – theoretical length based on the vocabulary:

$$\hat{\mathrm{N}} = n_1 \log_2 n_1 + n_2 \log_2 n_2$$

Program volume – information content of the program in bits:

$$\mathrm{V} = \mathrm{N} \log_2 \mathrm{n}$$

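V is measured in bits: it is the number of bits needed to write the program if each of the n vocabulary tokens is given a fixed-length code of log₂ n bits.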

Program difficulty – how hard the program is to write or understand:

$$\mathrm{D} = \frac{n_1}{2}\cdot\frac{N_2}{n_2}$$

Program level – inverse of difficulty; higher level means easier (better) code:

$$\mathrm{L} = \frac{1}{\mathrm{D}}$$

Programming effort – estimated mental effort to implement or understand the program:

$$\mathrm{E} = \mathrm{D} \times \mathrm{V}$$

As a rough guideline:

- Larger vocabulary and volume → more complex program.
- Higher difficulty and effort → more error-prone and harder to maintain.
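
The derived formulas above translate directly into code. The following is a minimal sketch; the function name `halstead` and the dictionary keys are assumptions made for illustration, and the guard against zero counts is an added safety check rather than part of the metric definitions.

```python
import math

def halstead(n1, n2, N1, N2):
    """Derive Halstead metrics from the four base measures."""
    if n1 < 1 or n2 < 1:
        # log2(0) and division by zero are undefined, so reject empty inputs.
        raise ValueError("need at least one distinct operator and operand")
    n = n1 + n2                                        # program vocabulary
    N = N1 + N2                                        # program length
    N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)    # estimated length
    V = N * math.log2(n)                               # volume, in bits
    D = (n1 / 2) * (N2 / n2)                           # difficulty
    L = 1 / D                                          # level
    E = D * V                                          # effort
    return {"vocabulary": n, "length": N, "estimated_length": N_hat,
            "volume": V, "difficulty": D, "level": L, "effort": E}

metrics = halstead(3, 2, 3, 3)
print(metrics["difficulty"])  # 2.25
```

Returning a dictionary keeps each metric labelled, which makes it easier to compare programs metric by metric.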

Example: Token Count

Table 5 shows a sample token count for a small program. Operators and operands are listed with their number of occurrences.

Table 5: A token count example
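| Operators | Occurrences | Operands | Occurrences |
|---|---|---|---|
| int | 4 | SORT | 1 |
| () | 5 | x | 7 |
| , | 4 | n | 3 |
| [] | 7 | i | 8 |
| if | 2 | j | 7 |
| < | 2 | save | 3 |
| ; | 11 | im1 | 3 |
| for | 2 | 2 | 2 |
| = | 6 | 1 | 3 |
| - | 1 | 0 | 1 |
| <= | 2 | – | – |
| ++ | 2 | – | – |
| return | 2 | – | – |
| { } | 3 | – | – |
| n₁ = 14 | N₁ = 53 | n₂ = 10 | N₂ = 38 |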
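
The totals in the last row can be checked by summing the columns. A small sketch that transcribes Table 5 into two dictionaries (the variable names are illustrative):

```python
# Occurrence counts transcribed from Table 5.
operators = {"int": 4, "()": 5, ",": 4, "[]": 7, "if": 2, "<": 2, ";": 11,
             "for": 2, "=": 6, "-": 1, "<=": 2, "++": 2, "return": 2, "{ }": 3}
operands = {"SORT": 1, "x": 7, "n": 3, "i": 8, "j": 7, "save": 3,
            "im1": 3, "2": 2, "1": 3, "0": 1}

n1, N1 = len(operators), sum(operators.values())  # distinct and total operators
n2, N2 = len(operands), sum(operands.values())    # distinct and total operands
print(n1, N1, n2, N2)  # 14 53 10 38
```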

Using these values, you can compute vocabulary n = n₁ + n₂, length N = N₁ + N₂, volume V = N log₂ n, difficulty D = (n₁/2) · (N₂/n₂), and effort E = D × V from the formulas above.
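
As a worked check, substituting the Table 5 totals (n₁ = 14, n₂ = 10, N₁ = 53, N₂ = 38) into the formulas gives the following. Logarithms and products are rounded, so the last digits are approximate.

$$\mathrm{n} = 14 + 10 = 24, \qquad \mathrm{N} = 53 + 38 = 91$$

$$\hat{\mathrm{N}} = 14 \log_2 14 + 10 \log_2 10 \approx 53.30 + 33.22 = 86.52$$

$$\mathrm{V} = 91 \log_2 24 \approx 91 \times 4.585 \approx 417.2 \text{ bits}$$

$$\mathrm{D} = \frac{14}{2}\cdot\frac{38}{10} = 7 \times 3.8 = 26.6$$

$$\mathrm{L} = \frac{1}{26.6} \approx 0.0376$$

$$\mathrm{E} = 26.6 \times 417.2 \approx 11098$$

The estimated length of about 86.5 is close to the observed length of 91 tokens.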
