
Adding smallest possible float to a float

Posted by: admin November 29, 2017

Questions:

I want to add the smallest possible value of a float to a float. So, for example, I tried doing this to get 1.0 + the smallest possible float:

float result = 1.0f + std::numeric_limits<float>::min();

But after doing that, I get the following results:

(result > 1.0f) == false
(result == 1.0f) == true

I’m using Visual Studio 2015. Why does this happen? What can I do to get around it?

Answers:

If you want the next representable value after 1, there is a function for that called std::nextafter, from the <cmath> header.

float result = std::nextafter(1.0f, 2.0f);

It returns the next representable value starting from the first argument in the direction of the second argument. So if you wanted to find the next value below 1, you could do this:

float result = std::nextafter(1.0f, 0.0f);

Adding the smallest positive representable value to 1 doesn’t work because the difference between 1 and the next representable value is greater than the difference between 0 and the next representable value.

Answers:

The “problem” you’re observing is because of the very nature of floating point arithmetic.

In floating point, the precision depends on the scale. Around the value 1.0 the precision is not enough to distinguish between 1.0 and 1.0+min_representable, where min_representable is the smallest possible value greater than zero. This holds even if we only consider the smallest normalized number, std::numeric_limits<float>::min(); the smallest denormal is several orders of magnitude smaller still.

For example, with double-precision 64-bit IEEE754 floating point numbers, around the scale of x=10000000000000000 (10^16) it’s impossible to distinguish between x and x+1.

The fact that the resolution changes with scale is the very reason for the name “floating point”: the radix point “floats”. A fixed point representation instead has a fixed resolution (for example, with 16 binary digits below the units you have a resolution of 1/65536 ≈ 0.000015).

For example, the IEEE754 32-bit floating point format has one bit for the sign, 8 bits for the exponent and 23 bits for the mantissa:

sign (1 bit) | exponent (8 bits) | mantissa (23 bits)

The gap between 1.0f and the next representable float is available as the pre-defined constant FLT_EPSILON, or as std::numeric_limits<float>::epsilon(). See also machine epsilon on Wikipedia, which discusses how epsilon relates to rounding errors.

I.e. epsilon is the step that does what you were expecting here: adding it to 1.0 gives the next larger float.

The more general version of this (for numbers other than 1.0) is called one unit in the last place (ULP) of the mantissa. See Wikipedia’s ULP article.

Answers:

min is the smallest non-zero value that a (normalized-form) float can assume, i.e. 2^-126 (-126 is the minimum allowed exponent for a float). If you add it to 1 you’ll still get 1, since a float has just 23 bits of mantissa, so such a small change cannot be represented in such a “big” number (you would need a 126-bit mantissa to see a change when adding 2^-126 to 1).

The smallest possible increase to 1, instead, is epsilon (the so-called machine epsilon), which is in fact 2^-23, as it affects the last bit of the mantissa.

Answers:

To increment or decrement a floating point value by the smallest possible amount, use std::nextafter towards +infinity or -infinity (std::numeric_limits<T>::infinity()).

If you just use std::nextafter(x, std::numeric_limits<T>::max()), the result will be wrong when x is infinity.

#include <iostream>
#include <limits>
#include <cmath>

// Next representable value above v.
template<typename T>
T next_above(const T& v){
    return std::nextafter(v, std::numeric_limits<T>::infinity());
}
// Next representable value below v.
template<typename T>
T next_below(const T& v){
    return std::nextafter(v, -std::numeric_limits<T>::infinity());
}

int main(){
  std::cout << next_below(1.0) - 1.0 << std::endl; // gives -eps/2
  std::cout << next_above(1.0) - 1.0 << std::endl; // gives eps

  // Note:
  std::cout << std::nextafter(std::numeric_limits<double>::infinity(),
     std::numeric_limits<double>::infinity()) << std::endl; // gives inf
  std::cout << std::nextafter(std::numeric_limits<double>::infinity(),
     std::numeric_limits<double>::max()) << std::endl; // gives 1.79769e+308

}