# Quantization Noise: What Is It, And Why Should You Care?

Zsolt Kollar, Monther Alrwashdeh

In this blog, we discuss two approaches to handling finite word-length arithmetic: fixed-point and floating-point arithmetic.


Computers use binary digits (AKA “bits”) to represent numbers of all kinds, including real or complex values, integers, or fractions. Other number systems exist, such as octal, decimal, and hexadecimal, but the binary system is the most relevant and the most widely used for representing numbers in digital computer systems.

When discrete-time systems are implemented in hardware or software, all parameters and arithmetic operations use finite-precision numbers, so the effects of finite precision show up in the results. There are two approaches to handling the finite word-length arithmetic needed to process these numbers: fixed-point and floating-point arithmetic. The choice between these approaches depends on factors such as ease of implementation, required accuracy, and dynamic range.

Quantization is the process of approximating a continuous signal by a set of discrete symbols or integer values. Physical quantities are continuous, but when they are digitized and fed into a computer or a digital system, they are rounded to the nearest representable value. This rounding introduces quantization error, which is usually treated as noise. The nearest representable value is defined by the type of number representation.

Fixed-point arithmetic is easier to implement but has a limited dynamic range and precision, meaning it cannot represent very large and very small numbers at the same time. In contrast, floating-point arithmetic has a broader dynamic range and an accuracy that varies with the magnitude of the number being processed. However, it is more complex to implement and analyze.

Field-programmable gate arrays (“FPGAs”) and microcontrollers can implement both fixed-point and floating-point arithmetic. However, floating-point arithmetic requires more hardware resources and has a longer computation time compared to fixed-point arithmetic. Therefore, the choice between fixed-point and floating-point implementation depends on the application’s requirements and the available hardware resources.

Numeric classes in MATLAB® – a product developed by MathWorks – include signed and unsigned integers and single-precision and double-precision floating-point numbers. By default, MATLAB stores all numeric values as double-precision floating-point numbers. You can also store any number or array of numbers as integers or as single-precision values. Integer and single-precision arrays offer more memory-efficient storage than their double-precision counterparts.

The following summarizes these two number representations.

### Fixed-point number representation

A fixed-point number representation has three components: the sign, the integer part, and the fractional part, as shown in Figure 1.

The input-output characteristic of the fixed-point quantizer, shown in Figure 2, is a staircase function with uniform steps.

Figure 2: Input-output characteristic of the fixed-point quantizer (Source: Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications by B. Widrow and I. Kollar)
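The uniform staircase can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `quantize_fixed`, the default word length (1 sign bit, 3 integer bits, 4 fractional bits), the two's-complement range, and the choice to saturate on overflow are all assumptions made for the example.

```python
def quantize_fixed(x, int_bits=3, frac_bits=4):
    """Round x to the nearest value of a signed fixed-point format.

    Assumed (illustrative) format: 1 sign bit, `int_bits` integer bits,
    `frac_bits` fractional bits, two's-complement range, saturation on overflow.
    """
    step = 2.0 ** -frac_bits                 # uniform step size q
    lo = -(2 ** (int_bits + frac_bits))      # most negative integer code
    hi = 2 ** (int_bits + frac_bits) - 1     # most positive integer code
    code = round(x / step)                   # nearest code (Python rounds ties to even)
    code = max(lo, min(hi, code))            # saturate instead of wrapping around
    return code * step


# Inside the representable range the error never exceeds half a step;
# outside it, the output sticks at the largest or smallest value.
for x in (0.30, -1.23, 7.99, 100.0):
    y = quantize_fixed(x)
    print(f"{x:8.3f} -> {y:8.4f}  (error {y - x:+.4f})")
```

With these assumed defaults the step is 1/16, so 0.30 maps to 0.3125, while 100.0 saturates at 7.9375, the largest representable value.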

### Floating-point number representation

A floating-point representation defines a value with a bit vector containing three elements: sign, exponent, and fraction, as shown in Figure 3. It is mainly characterized by its precision, defined here as the number of fraction bits plus the sign bit.

The input-output staircase function for the floating-point quantizer with a 3-bit fraction is shown in Figure 4. The input to this quantizer is x, a variable that is generally continuous in amplitude. The output of this quantizer is x’, a variable that is discrete in amplitude and can only take on values that lie on a floating-point number scale. The input-output relation for this quantizer is a staircase function whose steps are not uniform.

Figure 4: Input-output characteristic of the floating-point quantizer (Source: Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications by B. Widrow and I. Kollar)
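A floating-point quantizer with a short fraction can be sketched as follows. The function name `quantize_float` is hypothetical, and the sketch assumes an unbounded exponent range (no overflow, underflow, or subnormal numbers), so it only models the rounding of the significand.

```python
import math


def quantize_float(x, frac_bits=3):
    """Round x to a binary floating-point grid with `frac_bits` fraction bits.

    Assumption for this sketch: the exponent range is unlimited, so only the
    significand is rounded.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    scaled = round(m * 2 ** (frac_bits + 1))  # integer significand (1 hidden bit + fraction)
    return math.ldexp(scaled, e - (frac_bits + 1))


# The step doubles every time the input crosses a power of two.
for x in (0.3, 1.3, 5.3, -5.3):
    print(f"{x:6.2f} -> {quantize_float(x):8.5f}")
```

With a 3-bit fraction the step is 1/8 between 1 and 2 but 1/2 between 4 and 8, which is why 1.3 maps to 1.25 while 5.3 maps to 5.5.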

The following floating-point representations are the most commonly used:

• Double-precision floating point: The total number of bits is 64, divided as follows: 1 bit for sign, 11 bits for exponent, and 52 bits for fraction.
• Single-precision floating-point: The total number of bits is 32, divided as follows: 1 bit for sign, 8 bits for exponent, and 23 bits for fraction.
• Half-precision floating-point: The total number of bits is 16, divided as follows: 1 bit for sign, 5 bits for exponent, and 10 bits for fraction.
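The three layouts above can be inspected with Python's standard `struct` module, whose format characters `e`, `f`, and `d` correspond to half, single, and double precision. The helper names `fields` and `round_trip` and the `LAYOUTS` table are made up for this sketch; note that the exponent field is returned as stored, that is, with its bias still included.

```python
import struct

# name: (float format, same-size unsigned integer format, exponent bits, fraction bits)
LAYOUTS = {
    "half":   ("e", "H", 5, 10),
    "single": ("f", "I", 8, 23),
    "double": ("d", "Q", 11, 52),
}


def fields(x, name):
    """Split x, stored in the given format, into (sign, biased exponent, fraction)."""
    ffmt, ifmt, exp_bits, frac_bits = LAYOUTS[name]
    bits = struct.unpack("<" + ifmt, struct.pack("<" + ffmt, x))[0]
    sign = bits >> (exp_bits + frac_bits)
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = bits & ((1 << frac_bits) - 1)
    return sign, exponent, fraction


def round_trip(x, name):
    """Store x in the given format and read it back as a Python float."""
    ffmt = LAYOUTS[name][0]
    return struct.unpack("<" + ffmt, struct.pack("<" + ffmt, x))[0]


for name in LAYOUTS:
    y = round_trip(0.1, name)
    print(f"{name:>6}: fields(1.0) = {fields(1.0, name)}, "
          f"0.1 stored as {y!r}, relative error {abs(y - 0.1) / 0.1:.2e}")
```

Storing 0.1 in half precision gives 0.0999755859375, a relative error of roughly 2.4e-4, while the single-precision error is on the order of 1e-8.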

### Comparison

Fixed-point arithmetic represents numbers with a fixed number of bits for the integer and fractional parts. In a fixed-point implementation, the range and precision of the numbers are determined by the bit width of the operands. Fixed-point arithmetic is more straightforward and faster than floating-point arithmetic, which makes it suitable for many applications that require real-time processing, such as signal processing, control systems, and image processing.

Floating-point arithmetic, on the other hand, represents numbers with a significand scaled by an exponent, so the position of the binary point varies with the magnitude of the number. In a floating-point implementation, the range and precision of the numbers are determined by the exponent and the significand. Floating-point arithmetic is more complex than fixed-point arithmetic, but it provides a wider range of numbers and higher precision. It is commonly used in scientific computing, simulation, and graphics processing.
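The difference in dynamic range can be made concrete by comparing two 16-bit formats. This is only a sketch under assumed parameters: the fixed-point word (1 sign bit, 7 integer bits, 8 fractional bits, saturating) and the helper names `to_fixed16` and `to_half` are chosen for illustration, and half precision is emulated through Python's `struct` module.

```python
import struct


def to_fixed16(x, frac_bits=8):
    """Quantize to an assumed 16-bit signed fixed-point word with saturation."""
    step = 2.0 ** -frac_bits
    code = max(-32768, min(32767, round(x / step)))
    return code * step


def to_half(x):
    """Quantize to IEEE half precision by packing and unpacking the value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Fixed point keeps the absolute error bounded inside its range;
# floating point keeps the relative error bounded instead.
for x in (0.001, 0.3, 3.3, 100.3, 1000.0):
    fx, fl = to_fixed16(x), to_half(x)
    print(f"x={x:<8} fixed rel. err={abs(fx - x) / x:.2e}   "
          f"half rel. err={abs(fl - x) / x:.2e}")
```

With the same number of bits, the fixed-point word rounds 0.001 to zero and saturates near 128, whereas half precision represents both 0.001 and 1000.0 with a small relative error.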

## Example: Quantization noise in FBMC systems

In this section, we examine the quantization errors in a signal processing system used for communications to illustrate how the choice of number representation affects the output signal.

Filter bank multi-carrier (“FBMC”) is a modulation scheme used in digital communication systems. It is similar to orthogonal frequency division multiplexing (“OFDM”) in that it divides a wideband signal into multiple narrowband subcarriers to transmit data simultaneously. The FBMC signal can be generated using two different approaches: poly-phase network (“PPN”) and frequency spreading (“FS”). Implementing FBMC transmitters with finite word-length arithmetic generates quantization noise. This noise depends on several factors, including the type of number representation, the number of bits used in the quantization process, and the type of signal processing blocks. A block diagram of various implementation possibilities of FBMC transmitters using PPN and FS can be viewed in this paper.

Quantization noise can affect the power spectral density (“PSD”) of the FBMC signal. Figure 5 and Figure 6 show how the number of fraction bits used in the quantization process affects the PSD of the FBMC signal for different FBMC transmitter implementations using fixed-point and floating-point quantization. The figures show that, when a low number of quantization bits is used, quantization noise raises the spectral sidelobes of the FBMC signal and reduces its spectral efficiency. Furthermore, in the case of a fixed-point implementation, the chosen FBMC transmitter architecture is also critical. In the case of a floating-point implementation, double and single precision perform similarly; degradation is visible only for half precision.

## Conclusion

We have highlighted two types of number representations for implementing a 5G communication system. The double-precision floating-point implementation can be used to achieve a system virtually free of quantization error. However, it requires more memory and more complex arithmetic units, and it is harder to implement than a fixed-point design. In either case, the number of quantization bits and the type of number representation should be selected carefully, since they determine the quantization noise, which may affect the system’s performance. Consequently, a tradeoff between system performance requirements and the available hardware resources should be considered, especially in FPGA or microcontroller environments. As a final takeaway, quantization error can be both simulated (using tool sets provided by MATLAB) and analyzed theoretically prior to implementation. In this way, the hardware resources needed can be kept to a minimum while fulfilling the accuracy requirements.
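As a small illustration of comparing simulation with theory, the sketch below estimates the noise power of a uniform (fixed-point) quantizer and compares it with the standard uniform-noise model, in which a quantizer with step q produces noise of power q²/12. It is written in plain Python rather than MATLAB, and the function name `noise_power`, the test signal (uniformly distributed samples in [-1, 1]), and the sample count are assumptions made for this example.

```python
import random


def noise_power(frac_bits, n=20000, seed=0):
    """Return (simulated, theoretical) quantization noise power for step 2**-frac_bits."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    q = 2.0 ** -frac_bits
    total = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)       # assumed test signal
        err = round(x / q) * q - x       # rounding error of a uniform quantizer
        total += err * err
    return total / n, q * q / 12.0       # q**2 / 12: uniform-noise model


for bits in (4, 8, 12):
    sim, theory = noise_power(bits)
    print(f"{bits:2d} fraction bits: simulated {sim:.3e}, model {theory:.3e}")
```

In this model each additional fraction bit halves q and therefore lowers the noise power by a factor of four, about 6 dB, which gives a quick first estimate of the word length needed before committing hardware resources.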