Q (number format)

Q is a fixed point number format where the number of fractional bits (and optionally the number of integer bits) is specified. For example, a Q15 number has 15 fractional bits; a Q1.14 number has 1 integer bit and 14 fractional bits. Q format is often used in hardware that does not have a floating-point unit and in applications that require constant resolution.

Characteristics

Q format numbers are notionally fixed point numbers: they are stored and operated upon as regular binary integers (usually signed), which allows standard integer hardware such as an ALU to perform rational number calculations. The programmer chooses the number of integer bits, the number of fractional bits and the underlying word size for each application, according to the range and resolution the numbers require.

Some DSP architectures offer native support for common formats, such as Q1.15. In this case, the processor can support arithmetic in one step, offering saturation (for addition and subtraction) and renormalization (for multiplication) in a single instruction. Most standard CPUs do not. If the architecture does not directly support the particular fixed point format chosen, the programmer will need to handle saturation and renormalization explicitly with bounds checking and bit shifting.

There are two conflicting notations for fixed point. Both are written as Qm.n, where m is the number of integer bits and n is the number of fractional bits.

One convention includes the sign bit in the value of m,[1] and the other convention does not. The choice of convention can be determined by summing m+n. If the value is equal to the register size, then the sign bit is included in the value of m. If it is one less than the register size, the sign bit is not included in the value of m.
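The m+n test described above can be written as a small check. This is only a sketch: the function name `q_m_includes_sign` and its return codes are invented for illustration and do not come from any library.

```c
/* Hypothetical helper, not from any library: report which Qm.n convention
   is in use for a given register width.
   Returns 1 if m counts the sign bit, 0 if it does not, and -1 if the
   format matches neither convention for that register size. */
int q_m_includes_sign(int m, int n, int register_bits)
{
    if (m + n == register_bits)     return 1;
    if (m + n == register_bits - 1) return 0;
    return -1;
}
```

For a 16-bit register, `q_m_includes_sign(1, 15, 16)` returns 1, while `q_m_includes_sign(1, 14, 16)` returns 0.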

In addition, the letter U can be prefixed to the Q to indicate an unsigned value, such as UQ1.15, indicating values from 0.0 to +1.99997.

Signed Q values are stored in 2's complement format, just like signed integer values on most processors. In 2's complement, the sign bit is extended to the register size.
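As an illustration of sign extension, the sketch below widens a two's complement value that occupies only the low bits of a word to a full 32-bit signed integer. The helper name `sign_extend` is hypothetical, and the final conversion to `int32_t` assumes the usual two's complement behaviour of the compiler.

```c
#include <stdint.h>

/* Hypothetical helper: treat the low `bits` bits of `raw` (1 <= bits <= 32)
   as a two's complement number and sign-extend it to 32 bits. */
int32_t sign_extend(uint32_t raw, int bits)
{
    uint32_t sign = (uint32_t)1 << (bits - 1);  /* mask of the sign bit      */
    raw &= (sign << 1) - 1;                     /* keep only the low bits    */
    return (int32_t)((raw ^ sign) - sign);      /* copy sign bit upwards     */
}
```

For example, the 16-bit pattern 0xC000 sign-extends to -16384, which is -1.0 when read with 14 fractional bits.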

For a given Qm.n format, using an m+n+1 bit signed integer container with n fractional bits, the range is [-2^m, 2^m - 2^-n] and the resolution is 2^-n.

For a given UQm.n format, using an m+n bit unsigned integer container with n fractional bits, the range is [0, 2^m - 2^-n] and the resolution is 2^-n.

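For example, a Q14.1 format number requires 14+1+1 = 16 bits, has a range of [-2^14, 2^14 - 2^-1] = [-16384.0, +16383.5], and has a resolution of 2^-1 = 0.5.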
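The limits of a signed Qm.n format can also be computed from m and n. The sketch below assumes the convention in which m does not count the sign bit, so the smallest value is -2^m, the largest is 2^m - 2^-n, and the step is 2^-n. The function names are illustrative only.

```c
/* Hypothetical helpers for a signed Qm.n format (m excludes the sign bit).
   Valid for 0 <= m, n < 62 so that the shifts below do not overflow. */
double q_resolution(int n) { return 1.0 / (double)(1LL << n); }   /* 2^-n        */
double q_min(int m)        { return -(double)(1LL << m); }        /* -2^m        */
double q_max(int m, int n) { return (double)(1LL << m) - q_resolution(n); } /* 2^m - 2^-n */
```

With m = 14 and n = 1 these return -16384.0, 16383.5 and 0.5 respectively.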

Unlike floating point numbers, the resolution of Q numbers will remain constant over the entire range.

Conversion

Float to Q

To convert a number from floating point to Qm.n format:

  1. Multiply the floating point number by 2^n
  2. Round to the nearest integer
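The two steps above can be sketched in C as follows. The function name `double_to_q`, the 16-bit container, the choice to round halves away from zero and the saturation on overflow are all assumptions made for this example rather than part of the Q format itself.

```c
#include <stdint.h>

/* Hypothetical helper: convert a double to a signed 16-bit Q value with n
   fractional bits (0 <= n <= 15). Out-of-range inputs saturate; NaN is not
   handled. */
int16_t double_to_q(double x, int n)
{
    double scaled = x * (double)(1L << n);     /* step 1: multiply by 2^n      */
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;    /* step 2: round to nearest     */
    if (scaled >= 32767.0)  return INT16_MAX;  /* clamp instead of wrapping    */
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)scaled;                    /* cast truncates toward zero   */
}
```

For instance, `double_to_q(1.5, 8)` yields 384, and `double_to_q(1.0, 15)` saturates to 32767 because +1.0 is just outside the range of a 16-bit value with 15 fractional bits.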

Q to float

To convert a number from Qm.n format to floating point:

  1. Convert the number to floating point as if it were an integer
  2. Multiply by 2^-n (that is, divide by 2^n)
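A minimal sketch of the reverse conversion, which scales the stored integer by 2^-n (equivalently, divides it by 2^n). The function name `q_to_double` and the 16-bit container are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: interpret a signed 16-bit Q value with n fractional
   bits (0 <= n <= 15) as a double. */
double q_to_double(int16_t q, int n)
{
    /* Step 1 is the cast to double; step 2 is the division by 2^n. */
    return (double)q / (double)(1L << n);
}
```

For instance, `q_to_double(384, 8)` gives 1.5.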

Math operations

Q numbers are a ratio of two integers: the numerator is kept in storage, and the denominator is equal to 2^n.

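Consider the following example: with 8 fractional bits the denominator is 2^8 = 256, so the value 1.5 is represented as 384/256. Only the numerator 384 is stored; the denominator 256 is implied by the format.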

If the Q number's base is to be maintained (n remains constant), the math operations must keep the denominator constant. The following formulas show math operations on two Q numbers with numerators N_1 and N_2 and common denominator d.

\begin{align}
\frac{N_1}{d} + \frac{N_2}{d} &= \frac{N_1+N_2}{d}\\
\frac{N_1}{d} - \frac{N_2}{d} &= \frac{N_1-N_2}{d}\\
\left(\frac{N_1}{d} \times \frac{N_2}{d}\right) \times d &= \frac{N_1\times N_2}{d}\\
\left(\frac{N_1}{d} / \frac{N_2}{d}\right)/d &= \frac{N_1/N_2}{d}
\end{align}

Because the denominator is a power of two, multiplication by it can be implemented as an arithmetic shift to the left and division by it as an arithmetic shift to the right; on many processors shifts are faster than multiplication and division.
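The sketch below illustrates rescaling by the denominator with shifts, assuming 8 fractional bits. The names `FRAC_BITS`, `scale_up` and `scale_down` are invented for this example. Two C-specific caveats apply: left-shifting a negative value is undefined behaviour, so the upscale is written as a multiplication by a power-of-two constant (which compilers typically compile to a shift), and right-shifting a negative value is implementation-defined, although most compilers perform an arithmetic shift.

```c
#include <stdint.h>

#define FRAC_BITS 8   /* assumed n for this sketch, so d = 2^8 = 256 */

/* Multiply by d = 2^FRAC_BITS; the caller must ensure the result fits. */
int32_t scale_up(int32_t x)
{
    return x * (1 << FRAC_BITS);
}

/* Divide by d using a right shift. With an arithmetic shift this rounds
   toward negative infinity, whereas x / 256 would round toward zero. */
int32_t scale_down(int32_t x)
{
    return x >> FRAC_BITS;
}
```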

To maintain accuracy, the intermediate multiplication and division results must be held at double the word width, and care must be taken in rounding the intermediate result before converting back to the desired Q number.

Using C, the operations are as follows (here, Q refers to the number of bits in the fractional part):

Addition

 signed int a, b, result;
 result = a+b;

With saturation

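 #include <stdint.h>

 int32_t a, b, result;
 int64_t tmp;
 // widen before adding so that the sum itself cannot overflow
 tmp = (int64_t)a + (int64_t)b;
 // clamp to the range of the 32-bit result
 if (tmp > INT32_MAX) tmp = INT32_MAX;
 if (tmp < INT32_MIN) tmp = INT32_MIN;
 result = (int32_t)tmp;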

Subtraction

 signed int a,b,result;
 result = a-b;

Multiplication

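 #include <stdint.h>

 // precomputed value: one half in Q format, used for rounding
 #define K   (1 << (Q-1))
 
 int32_t  a, b, result;
 int64_t  temp;
 // widen before multiplying: the product needs up to 64 bits
 temp = (int64_t)a * (int64_t)b;
 // Rounding; mid values are rounded up
 temp += K;
 // Correct by dividing by base (assumes >> is an arithmetic shift for negative values)
 result = (int32_t)(temp >> Q);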

Division

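 #include <stdint.h>

 int32_t  a, b, result;
 int64_t  temp;
 // pre-multiply by the base so that the quotient keeps Q fractional bits
 // (a multiplication is used because left-shifting a negative value is undefined in C)
 temp = (int64_t)a * ((int64_t)1 << Q);
 // Round to nearest: offset by half the divisor, in the direction of the quotient's sign
 if ((temp >= 0) == (b > 0))
     temp += b / 2;
 else
     temp -= b / 2;
 result = (int32_t)(temp / b); // b must be nonzero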
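As a worked example, the multiplication and division steps can be wrapped in functions and checked with concrete numbers. This is a sketch only: the names `q_mul` and `q_div` are hypothetical, Q is fixed at 8 fractional bits, no saturation is performed, and the right shift is assumed to be arithmetic for negative values.

```c
#include <stdint.h>

#define Q 8                 /* assumed number of fractional bits        */
#define K (1 << (Q - 1))    /* one half in Q format, used for rounding  */

/* Hypothetical helper: multiply two Q8 values, rounding to nearest. */
int32_t q_mul(int32_t a, int32_t b)
{
    int64_t temp = (int64_t)a * (int64_t)b;  /* product has 2*Q fractional bits */
    temp += K;                               /* round; mid values go up         */
    return (int32_t)(temp >> Q);             /* back to Q fractional bits       */
}

/* Hypothetical helper: divide two Q8 values, rounding to nearest.
   b must be nonzero. */
int32_t q_div(int32_t a, int32_t b)
{
    int64_t temp = (int64_t)a * ((int64_t)1 << Q);  /* pre-multiply by the base */
    if ((temp >= 0) == (b > 0))
        temp += b / 2;    /* quotient is non-negative: offset toward +inf */
    else
        temp -= b / 2;    /* quotient is negative: offset toward -inf     */
    return (int32_t)(temp / b);
}
```

In Q8, 1.5 is stored as 384 and 2.25 as 576. `q_mul(384, 576)` returns 864, which represents 864/256 = 3.375, and `q_div(864, 384)` returns 576, recovering 2.25.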
