Up to this point we have discussed dither as a remedy for the damage done by truncation and the quantization error that comes with it. We do have another problem, though: digital mixing and processing. Any time we process the signal, we increase the chance of quantization error. Every change to the signal forces a rearrangement of the ones and zeros that represent it, and each result has to be rounded back to a value the bit depth can hold. That rounding gives rise to more quantization error. Obviously, the more bits one has to work with, the less chance of misrepresentation. So better than 16-bit we have 24-bit. What is better than 24-bit, then? We have 32-bit, and not only 32-bit fixed like 16 and 24, but 32-bit floating point. Floating point, as opposed to fixed point, is another way of representing the data of the signal.
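To make the processing problem concrete, here is a minimal sketch, not taken from any particular DAW. It turns a sine wave down by a gain factor, stores the result at a fixed bit depth, then turns it back up and measures the worst-case damage. The helper names `quantize` and `gain_round_trip_error`, the gain of 0.01, and the test signal are all illustrative assumptions.

```python
import math

def quantize(x, bits):
    """Round x (nominally in [-1.0, 1.0)) to the nearest step of a signed fixed-point grid."""
    scale = 2 ** (bits - 1)
    return round(x * scale) / scale

def gain_round_trip_error(bits, gain=0.01, n=1000):
    """Attenuate a sine by `gain`, store it at `bits`, restore the level,
    and return the worst absolute difference from the original samples."""
    worst = 0.0
    for i in range(n):
        x = 0.5 * math.sin(2 * math.pi * i / n)
        original = quantize(x, bits)              # the signal as first recorded
        stored = quantize(original * gain, bits)  # re-quantized after processing
        restored = stored / gain                  # gain brought back up
        worst = max(worst, abs(restored - original))
    return worst

err16 = gain_round_trip_error(16)
err24 = gain_round_trip_error(24)
print(f"worst error after round trip, 16-bit: {err16:.2e}")
print(f"worst error after round trip, 24-bit: {err24:.2e}")
```

Under these assumptions the 24-bit round trip leaves a much smaller error than the 16-bit one, because the rounding step after the gain change is 256 times finer.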
Floating-point numbers differ from integer numbers in that they can scale themselves internally to represent very big and very small numbers without losing significant detail. The fixed bit depths obviously have a fixed, finite range: a 16-bit fixed number can represent 65,536 different discrete values, a 24-bit number can represent 16,777,216, and a 32-bit fixed number can represent 4,294,967,296. That sounds great, but all these fixed formats share the same disadvantage. As we said earlier, an analog signal has an infinite number of points, so no matter how high these bit depths are, they certainly are not as high as infinity! The steps of a fixed format are all the same size, whether the signal is loud or quiet. As the analog value gets smaller and smaller, it spans fewer and fewer of those steps, so the rounding error becomes a larger and larger fraction of the value itself. The small sample can only land on one of the fixed values available in the binary code, never on a specific "in-between" value.
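The counts and the small-signal problem above can be checked with a few lines of arithmetic. This is only a sketch: `levels` and `relative_error` are hypothetical helper names, and the two sample values (0.3 and 0.0003 of full scale) are arbitrary picks for illustration.

```python
def levels(bits):
    """Number of distinct values a fixed-point word of `bits` bits can hold."""
    return 2 ** bits

def relative_error(value, bits):
    """Rounding error of `value` on a signed fixed-point grid, as a fraction of the value."""
    scale = 2 ** (bits - 1)
    quantized = round(value * scale) / scale
    return abs(quantized - value) / abs(value)

for bits in (16, 24, 32):
    print(f"{bits}-bit fixed: {levels(bits):,} values")

loud = relative_error(0.3, 16)      # a fairly loud sample
quiet = relative_error(0.0003, 16)  # the same shape, 60 dB quieter
print(f"relative error, loud sample:  {loud:.2e}")
print(f"relative error, quiet sample: {quiet:.2e}")
```

The step size is identical in both cases; only the size of the sample changes, which is why the quiet sample comes out proportionally worse.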
Floating point offers us this nuance. It is a format in which the decimal point (strictly speaking, the binary point) can move, which gives us the flexibility to represent these fractional or "in-between" parts. Internally, a 32-bit floating-point number in the usual IEEE 754 layout uses 1 bit for the sign, 23 stored bits plus one implied leading bit for the mantissa (24 bits of precision, representing a value from 1.0 to just under 2.0), and the remaining 8 bits for an exponent that scales the number to the right range. So a floating-point number can represent prodigious numbers and tiny ones alike, while keeping the same relative accuracy in the significant digits of each.
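The fields inside a 32-bit float can be inspected directly. The sketch below assumes the standard IEEE 754 single-precision layout and uses Python's `struct` module; `float32_fields` and `rebuild` are names made up for this example, and `rebuild` deliberately handles only ordinary (normal) numbers, not zero, denormals, infinities, or NaN.

```python
import struct

def float32_fields(x):
    """Split x, stored as a 32-bit float, into (sign, exponent, fraction) bit fields."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = raw >> 31                # 1 bit
    exponent = (raw >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = raw & 0x7FFFFF       # 23 stored mantissa bits
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Reassemble a normal float32 value from its fields."""
    mantissa = 1 + fraction / 2 ** 23   # implied leading 1 gives 24 bits of precision
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)

for value in (0.5, -6.0, 0.15625):
    s, e, f = float32_fields(value)
    print(f"{value}: sign={s} exponent={e} (scale 2^{e - 127}) fraction=0x{f:06X}")
```

Halving or doubling a value changes only the exponent field; the mantissa bits, and therefore the precision, stay the same.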
On the x87 floating-point unit, single and double precision floating-point values are converted to the 80-bit extended precision format (64-bit mantissa, 15-bit exponent) for calculations, and the result is rounded back to 32 or 64 bits. (Results are not always rounded; it depends on compiler flags, the code, and some other things.) What this means in Sonar is that the full signal path is 64-bit float [1] (53-bit mantissa, 11-bit exponent) and calculations are done at 80-bit extended precision. Depending on the situation, results may remain at 80 bits between calculations.
32-bit float DAWs also benefit from the 80-bit extended format, but their results are rounded to 32-bit float instead of 64-bit float.
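The benefit of calculating in a wider format and rounding only at the end can be sketched in a few lines. Python has no portable 80-bit type, so this example is an analogy rather than a model of any DAW: it uses Python's native 64-bit float as the "wide" format and emulates 32-bit storage with `struct`. The function names and the test data (0.1 added 100,000 times) are assumptions made for illustration.

```python
import math
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_rounding_each_step(values):
    """Accumulate with the running total rounded to 32-bit after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total

def sum_wide_then_round(values):
    """Accumulate in the wider format and round to 32-bit only once, at the end."""
    total = 0.0
    for v in values:
        total += to_float32(v)
    return to_float32(total)

values = [0.1] * 100000
reference = math.fsum(to_float32(v) for v in values)  # correctly rounded sum of the inputs

narrow = sum_rounding_each_step(values)
wide = sum_wide_then_round(values)
print(f"rounded every step: error {abs(narrow - reference):.6f}")
print(f"rounded once:       error {abs(wide - reference):.6f}")
```

Both results end up as 32-bit floats, but the one that kept extra precision between calculations lands much closer to the reference sum.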


