2. To err is human, to really foul things up requires a computer!#

2.1. Computer Number Representations#

2.1.1. Introduction to Computer Number Representations#

Computers are powerful, but they are finite.

A problem in computer design is how to represent an arbitrary number using a finite amount of memory space, and then how to deal with the limitations arising from this representation.

Computer memories are based on magnetic or electronic realizations of a spin pointing up or down. The most elementary units of computer memory are therefore the two binary digits (bits), 0 and 1.

All numbers are stored in computer memory in binary form, i.e. as long strings of 0s and 1s.

\(N\) bits can store integers in the range \([0, 2^{N-1}-1]\): the first bit represents the sign (e.g. 0 for positive numbers), leaving \(N-1\) bits for the magnitude.

The description of a particular computer’s system or language states the “word length”, i.e. the number of bits used to store a number.

The length is often expressed in bytes (i.e. a mouthful of bits!), where:

1 byte = 1 B = 8 bits

Memory sizes are measured in bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes.

1 K = 1 kB = \(2^{10}\) bytes = 1024 bytes.

1 byte is also the amount of memory needed to store a single letter like “a”. This adds up to a typical printed page requiring ~3 kB.

Memory in older personal computers (PCs) used 8-bit words, and therefore for these systems the maximum signed integer was \(2^7 - 1 = 127\).

With 64-bit words (modern PCs), the maximum is \(2^{63} - 1 \simeq 10^{19}\).

Trying to store a number larger than the hardware/software was designed for (overflow) was common in older machines!
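As an illustrative check, the 64-bit integer limit can be inspected directly with NumPy (note that Python's own built-in int type is arbitrary precision and does not overflow):

import numpy as np

# Largest value a fixed-width 64-bit signed integer can hold
print(np.iinfo(np.int64).max)   # 9223372036854775807, about 9.2e18
print(2**63 - 1)                # the same value, computed with Python's arbitrary-precision ints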

2.1.2. Floating-Point Numbers#

Real numbers are represented on computers in either fixed-point or floating-point notations.

In fixed-point notation, numbers are represented with a fixed number of places beyond the decimal point (e.g. integers).

Most scientific computations use double-precision floating-point numbers with 64 bits = 8 B. (Note that in particle physics, quadruple precision is necessary for some calculations! There are specialized libraries for this purpose).

64-bit double-precision floating-point numbers correspond to the float type in Python.

In 1987 the Institute of Electrical & Electronics Engineers (IEEE) and the American National Standards Institute (ANSI) adopted the IEEE754 standard for floating-point arithmetic.

Bit allocation (IEEE754 standard):

  • 1 bit is allocated to the sign indicator, \(s\).

  • 11 bits are allocated to the exponent, \(e\).

  • 52 bits are allocated to the fraction, \(f\).

A 64-bit (double precision) float can then be represented in decimal (base 10) by the formula:

\(n= (-1)^s\times 2^{e-1023} \times (1+f)\)

The value subtracted from the exponent is known as the bias, 1023 in this case.

Let’s see how this formula can be used to represent a number:

What is the number 1 10000000010 1000000000000000000000000000000000000000000000000000 (IEEE754) in base 10?

Answer:

The sign of the number is \((-1)^1=-1\).

The exponent field in decimal is \(e = 1\cdot 2^{10} + 1 \cdot 2^{1} = 1026\), so the exponent of the number is \(e - 1023 = 1026 - 1023 = 3\).

The fraction is \(f = 1 \cdot \frac{1}{2^1} + 0 \cdot \frac{1}{2^2} + ... = 0.5\).

So:

\(n = (-1)^1 \times 2^3 \times (1+0.5) = -12.0\) in base 10.
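We can verify this decoding with Python's standard struct module (an illustrative check, not part of the derivation):

import struct

# sign (1 bit) + exponent (11 bits) + fraction (52 bits) from the example above
bits = '1' + '10000000010' + '1' + '0' * 51
value = struct.unpack('>d', int(bits, 2).to_bytes(8, byteorder='big'))[0]
print(value)   # should print -12.0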

Let’s try the opposite operation: given a number in base 10, find its IEEE754 binary representation:

What is 15.0 (in base 10) in IEEE754 binary? What is the largest number smaller than 15.0? What is the smallest number larger than 15.0?

The number is positive, so \(s=0\).

We then ask for the exponent \(2^{e-1023}\) to give the largest power of two that is smaller than 15.0, i.e. \(2^3 = 8\), and therefore \(e=1026\). In binary form, 1026 is \(2^{10} + 2^{1}\), i.e. 10000000010.

Then the fraction remains. To obtain this we solve:

\((1+f) = n/2^3\), so \(f = 15/8 - 1 = 0.875\). For the binary representation, note that \(0.875 = 1 \cdot \frac{1}{2^1} + 1 \cdot \frac{1}{2^2} + 1 \cdot \frac{1}{2^3}\), so 0.875 (base 10) = 1110000000000000000000000000000000000000000000000000 (base 2).

Putting these numbers together gives us the representation of 15.0 in IEEE754:

0 10000000010 1110000000000000000000000000000000000000000000000000
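We can check this representation with the struct module as well (an illustrative check):

import struct

# Pack 15.0 as a big-endian double and print its sign, exponent and fraction fields
bits = format(int.from_bytes(struct.pack('>d', 15.0), byteorder='big'), '064b')
print(bits[0], bits[1:12], bits[12:])   # prints the three fields shown above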

The largest representable number smaller than 15.0 is: 0 10000000010 1101111111111111111111111111111111111111111111111111, which we can calculate to be:

\(n = (-1)^0 \times 2^3 \times \left(1 + \frac{1}{2^1} + \frac{1}{2^2} + \frac{1}{2^4} + \frac{1}{2^5} + \cdots + \frac{1}{2^{52}}\right) = 14.9999999999999982236431605997\)

and the smallest representable number larger than 15.0 is: 0 10000000010 1110000000000000000000000000000000000000000000000001 = 15.0000000000000017763568394003
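These immediate neighbors can also be obtained with NumPy (an illustrative check):

import numpy as np

print(np.nextafter(15.0, 0.0))     # 14.999999999999998, the next float towards 0
print(np.nextafter(15.0, 100.0))   # 15.000000000000002, the next float towards +infinity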

From the above, we can conclude that the IEEE754 number representing 15.0 also represents all real numbers lying within half a gap of it on either side: any computation whose result falls within this interval will be rounded to 15.0.
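For instance (the offsets 1e-16 and 1e-15 below are illustrative choices, lying below and above half the gap at 15.0):

print(15.0 + 1e-16 == 15.0)   # True: 1e-16 is less than half the gap at 15.0
print(15.0 + 1e-15 == 15.0)   # False: this rounds to the neighboring float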

We call the distance from one number to the next the gap. Since the fraction is multiplied by \(2^{e-1023}\), the gap grows as the number represented grows. The gap at any given number can be computed using numpy:

import numpy as np

np.spacing(1e9)   # the gap between 1e9 and the next representable double
1.1920928955078125e-07

There are special cases:

  • When \(e=0\): the leading 1 in the fraction takes the value 0 instead. The result is called a subnormal number and is computed by \(n=(-1)^s 2^{-1022} (0+f)\).

  • When \(e=2047\) and \(f\) is nonzero, the result is “not a number”, NaN, i.e. undefined.

  • When \(e=2047\), \(f=0\), \(s=0\) the result is positive infinity, +INF.

  • When \(e=2047\), \(f=0\), \(s=1\) the result is negative infinity, -INF.
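These special values can be produced directly in Python (illustrative examples):

print(float('inf') - float('inf'))   # nan: the result of an undefined operation
print(-float('inf'))                 # -inf
print(5e-324)                        # the smallest positive subnormal double
print(5e-324 / 2)                    # 0.0: underflows below the subnormal range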

Overflow occurs when a number exceeds the maximum value that can be represented. Python assigns such a number the value inf.

Underflow occurs when a nonzero result becomes smaller in magnitude than the smallest subnormal number. Python assigns it zero.

You can check these values using the sys module:

import sys
sys.float_info

print(sys.float_info.max)
print(sys.float_info.min)
1.7976931348623157e+308
2.2250738585072014e-308
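Crossing these limits behaves as described above (an illustrative check):

import sys

print(sys.float_info.max * 2)       # inf: overflow beyond the largest representable double
print(sys.float_info.min / 1e300)   # 0.0: underflow, far below even the subnormal range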

2.2. Errors and Uncertainties in Computations#

2.2.1. Types of Errors#

Four general types of errors exist to plague your computations:

  1. Blunders or bad theory: typos in the program or data, a fault in the reasoning (i.e. in the theoretical description), using the wrong data files, etc.

  2. Random errors: imprecision caused by events such as fluctuations in electronics, cosmic rays, etc.

  3. Approximation errors: Imprecision arising from simplifying the mathematics so that a problem can be solved on a computer, e.g. replacement of infinite series by finite sums, of infinitesimal intervals by finite ones, or of variable functions by constants, e.g.:

\(\sin(x) = \sum_{n=1}^{\infty} \frac{ (-1)^{n-1} x^{2n-1} } { (2n-1)! }\) (exact)

vs.

\(\sin(x) \simeq \sum_{n=1}^{N} \frac{ (-1)^{n-1} x^{2n-1} } { (2n-1)! } + \epsilon(x,N)\)

\(\epsilon(x,N)\) is the approximation (or algorithmic) error = the series from \(N+1\) to \(\infty\).

For any reasonable approximation, this error should decrease as \(N\) increases and should vanish in the limit \(N\rightarrow \infty\).

  4. Round-off errors: Imprecision arising from the finite number of digits used to store floating-point numbers. This is analogous to the uncertainty in the measurement of a physical quantity in an elementary physics lab. Round-off errors accumulate as the computer handles more numbers, i.e. as the number of steps in a computation increases. This may cause some algorithms to become unstable, with a rapid increase in error. In some cases the round-off error may become a major component in your answer, leading to “garbage”.

e.g. if a computer stores 1/3 as 0.3333 and 2/3 as 0.6667, then asking it to compute 2*(1/3) - 2/3 gives 0.6666 - 0.6667 = -0.0001. Although small, this error is nonzero. If such errors are repeated millions of times, the final answer might not even be small!
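The same effect is visible directly in Python’s double-precision arithmetic (an analogous illustration; here the rounding happens in binary rather than in four decimal digits):

print(0.1 + 0.2)          # 0.30000000000000004: neither 0.1, 0.2 nor 0.3 is stored exactly
print(0.1 + 0.2 == 0.3)   # False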

2.2.2. Subtractive Cancelation#

Calculations employing numbers that are stored only approximately on the computer can only be expected to yield approximate answers.

We model the computer representation, \(x_c\), of the exact number \(x\), as:

\(x_c \simeq x ( 1 + \epsilon_x )\),

where \(\epsilon_x\) is the relative error in \(x_c\). We expect this to be of the same order as the “machine precision”. In particular, we expect double-precision numbers to have an error in the 15th decimal place.

We can apply this notation to simple subtraction: \(a = b - c\).

On a computer, this would be: \(a_c \simeq b_c - c_c \simeq b(1+\epsilon_b) - c( 1 + \epsilon_c)\).

So:

\(\frac{a_c}{a} \simeq 1 + \epsilon_b \frac{b}{a} - \frac{c}{a} \epsilon_c\)

Observe that: when we subtract two nearly equal numbers, i.e.,

\( b \simeq c \gg a\)

Then:

\(\frac{a_c}{a} \equiv 1 + \epsilon_a \simeq 1 + \frac{b}{a} (\epsilon_b - \epsilon_c) \simeq 1 + \frac{b}{a}(|\epsilon_b|+|\epsilon_c|)\)

Even if the relative errors partly cancel, they are multiplied by a large number, \(b/a\), which can significantly magnify the error. Since we cannot assume a sign for the errors, we assume the worst, i.e. that they add up (the last step above).
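A small Python illustration (the offset 1e-15 is an arbitrary choice for demonstration):

b = 1.0 + 1e-15
c = 1.0
a = b - c
print(a)                        # 1.1102230246251565e-15, not 1e-15
print(abs(a - 1e-15) / 1e-15)   # relative error of roughly 0.11, i.e. about 11%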

As a more explicit mathematical example, consider constructing the expansion of \(e^{-x}\) for large values of \(x\):

\(e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + ... \)

The first few terms are large but of alternating sign, so an almost total cancelation must occur to yield the final small result.

This subtractive cancelation can be eliminated by considering \(e^{-x} = 1/e^x\) instead, but round-off errors still remain.
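A minimal numerical sketch of this effect (exp_series below is an illustrative naive partial sum written for this example, not a library routine):

import math

def exp_series(x, n_terms=200):
    """Naive partial sum of the Taylor series for e**x."""
    total, term = 1.0, 1.0
    for n in range(1, n_terms):
        term *= x / n          # term = x**n / n!
        total += term
    return total

x = 30.0
print(exp_series(-x))        # ruined by subtractive cancelation among huge alternating terms
print(1.0 / exp_series(x))   # accurate: the series for e**x has no cancelation
print(math.exp(-x))          # reference value, about 9.36e-14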

2.2.3. Round-off Errors#

Error arises from a single division of the computer representations of two numbers:

\(a = \frac{b}{c} \rightarrow a_c = \frac{b_c}{c_c} = \frac{b}{c} \left( \frac{1+\epsilon_b}{1+\epsilon_c} \right)\),

so:

\(\frac{a_c}{a} = \frac{1+\epsilon_b}{1+\epsilon_c} \simeq (1+\epsilon_b) (1- \epsilon_c) \simeq 1 + \epsilon_b - \epsilon_c\),

where we have ignored terms of order \(\epsilon^2\).

We need to assume the worst for the error subtraction, and get:

\(\frac{a_c}{a} \simeq 1 + |\epsilon_b| + |\epsilon_c|\).

We can generalize this to estimate the error in the evaluation of a general function \(f(x)\):

\(\epsilon = \frac{ f(x) - f(x_c) } { f(x) } \simeq \frac{\mathrm{d} f(x) / \mathrm{d}x}{f(x)} (x-x_c)\).

E.g. for \(f(x) = \sqrt{1+x}\): \(\frac{\mathrm{d} f}{\mathrm{d}x} = \frac{1}{2\sqrt{1+x}} = \frac{f(x)}{2(1+x)}\;,\) so:

\(\epsilon \simeq \frac{1}{f(x)} \frac{\mathrm{d} f}{\mathrm{d}x} (x - x_c) = \frac{x - x_c}{2 (1+x) }\;.\)

e.g. evaluating at \(x=\pi/4\) and assuming an error in the 4th decimal place of \(x\), we obtain a similar error of \(1.5\times 10^{-4}\) in \(f(x) = \sqrt{1+x}\).

2.2.4. Round-off Error Accumulation#

If a calculation has a large number of steps, we can view the error as a step in a random walk.

For a random walk (see later), the total distance \(R\) covered in \(N\) steps of length \(r\) is:

\(R = \sqrt{N} r\)

By analogy, the total relative error after \(N\) calculational steps, each with machine precision, \(\epsilon_m\), is on average:

\(\epsilon_\mathrm{ro} = \sqrt{N} \epsilon_m\)

If the round-off error does not accumulate in a random manner, a more detailed analysis is needed.

In some cases there may be no cancelation, and the error may increase as \(N \epsilon_m\) instead. Even worse, in some recursive algorithms error generation can be coherent, leading to an \(N!\) increase in error.
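A rough numerical sketch of accumulation (illustrative: adding 0.1 one million times is an arbitrary choice, and since the individual rounding errors here are correlated rather than truly random, the observed error falls between the two estimates):

import sys

eps_m = sys.float_info.epsilon   # machine precision, about 2.2e-16

N = 10**6
total = 0.0
for _ in range(N):
    total += 0.1                 # each addition commits a small round-off error

rel_err = abs(total - 0.1 * N) / (0.1 * N)
print(rel_err)            # observed accumulated relative error
print(N**0.5 * eps_m)     # random-walk estimate, sqrt(N) * eps_m
print(N * eps_m)          # worst-case estimate, N * eps_m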