This article is just a simplification of the IEEE 754 standard. Here we will see how floating-point numbers are stored in memory, floating-point exceptions/rounding, etc. But if you want to find more authoritative sources, then go for
 What Every Computer Scientist Should Know About Floating-Point Arithmetic
 https://en.wikipedia.org/wiki/IEEE_754-1985
 https://en.wikipedia.org/wiki/Floating_point
Floating-point numbers are stored by encoding the significand and the exponent (along with a sign bit).
 The above line contains 2-3 abstract terms, and I think you will be unable to understand it until you read further.
Floating Point Number Memory Layout
A typical single-precision 32-bit floating-point memory layout has the following fields:
 sign
 exponent
 significand (AKA mantissa)
Sign
 The high-order bit indicates the sign. `0` indicates a positive value, `1` indicates a negative one.
Exponent
 The next 8 bits are used for the exponent, which can be positive or negative; but instead of reserving another sign bit, it is stored with a bias of 127. So `0111 1111` represents 0, while the two extremes `0000 0000` and `1111 1111` are reserved for special values (zero/denormals and infinity/NaN, respectively). How does this encoding work? Go to exponent bias, or see it practically in the next point.
Significand
 The remaining 23 bits are used for the significand (AKA mantissa). Each bit represents a negative power of 2, counting from the left: the first bit is 2^-1, the second is 2^-2, and so on.
OK! We are done with the basics.
Let’s Understand Practically
 So, we consider the very famous float value `3.14` (PI) as an example.
 Sign: zero here, as PI is positive!
Exponent Calculation
 `3` is easy: `0011` in binary. The rest, `0.14`, is trickier.
 So, `0.14 = .001000111...`
 If you don’t know how to convert a decimal number to binary, then refer to this: float to binary.
 Add the `3` back: `11.001000111...` with exp 0 (3.14 * 2^0).
 Now shift it (normalize it) and adjust the exponent accordingly: `1.1001000111...` with exp +1 (1.57 * 2^1).
 Now you only have to add the bias of `127` to the exponent `1` and store it (i.e. `128` = `1000 0000`):
 `0 1000 0000 1100 1000 111...`
 Forget the top `1` of the mantissa (which is always supposed to be `1`, except for some special values, so it is not stored), and you get:
 `0 1000 0000 1001 0001 111...`
 So our value of `3.14` would be represented as something like:
 `0 10000000 10010001111010111000011` (sign | exponent | significand)
 The number of bits in the exponent determines the range (the minimum and maximum values you can represent).
Summing up the Significand
 If you add up all the bits in the significand, they don’t total `0.7853975` (which they should, according to 7-digit precision). They come out to `0.78539747`.
 There aren’t quite enough bits to store the value exactly; we can only store an approximation.
 The number of bits in the significand determines the precision.
 23 bits give us roughly 6 decimal digits of precision. 64-bit floating-point types give roughly 15 to 16 digits of precision.
Strange! But Fact
 Some values cannot be represented exactly no matter how many bits you use. Just as a value like 1/3 cannot be represented in a finite number of decimal digits, a value like 1/10 cannot be represented in a finite number of bits.
 Since values are approximate, calculations with them are also approximate, and rounding errors accumulate.
Let’s See Things Working
 This C code will print the binary representation of a float on the console.
Where Is the Decimal Point Stored?
 The decimal point is not explicitly stored anywhere; the exponent implicitly tells you where the binary point sits relative to the significand.
 As I wrote earlier, `Floating-point numbers are stored by encoding the significand and the exponent (along with a sign bit)`, but you don’t get it the first time. Don’t worry, 99% of people don’t get it the first time, including me.
A Bit More About Representing Numbers
 According to the IEEE 754-1985 worldwide standard, you can also store zero, negative/positive infinity and even `NaN` (Not a Number). Don’t worry if you don’t know what `NaN` is, I will explain shortly (but be worried if you don’t know infinity).
Zero Representation
 sign = 0 for positive zero, 1 for negative zero.
 exponent = 0.
 fraction = 0.
Positive & Negative Infinity Representation
 sign = 0 for positive infinity, 1 for negative infinity.
 exponent = all 1 bits.
 fraction = all 0 bits.
NaN Representation
 sign = either 0 or 1.
 exponent = all 1 bits.
 fraction = anything except all 0 bits (since all 0 bits represent infinity).
Why Do We Need NaN?
 Some operations of floating-point arithmetic are invalid, such as dividing by zero or taking the square root of a negative number.
 The act of reaching an invalid result is called a floating-point exception (next point). An exceptional result is represented by a special code called a `NaN`, for “Not a Number”.
Floating-Point Exceptions
 The IEEE 754-1985 standard defines five exceptions that can occur during a floating-point calculation, named as follows:
 Invalid Operation: occurs due to many causes, like multiplying infinity by zero, dividing infinity by infinity or zero by zero, taking the square root of an operand less than zero, etc.
 Division by Zero: occurs when, as its name sounds, a finite nonzero number is divided by zero.
 Overflow: raised whenever the result cannot be represented as a finite value in the precision format of the destination.
 Underflow: raised when an intermediate result is too small to be calculated accurately, or if the operation’s result rounded to the destination precision is too small to be normalized.
 Inexact: raised when a rounded result is not exact.
Rounding in Floating-Point
 As we saw, floating-point numbers have a limited number of digits, so they cannot represent all real numbers accurately: when there are more digits than the format allows, the leftover ones are omitted and the number is rounded.
 There are 4 rounding modes:
1. Round to Nearest: rounds to the nearest representable value; if the number falls exactly midway, it is rounded to the value with an even (zero) least significant bit. This is the default mode.
2. Round toward 0: simply truncate the extra digits.
3. Round toward +∞: round towards positive infinity.
4. Round toward −∞: round towards negative infinity.
Misc Points
 In older times, embedded-system processors did not use floating-point numbers, as they didn’t have such hardware capabilities.
 So there is an alternative to floating-point numbers, called fixed-point numbers.
 A fixed-point number is usually used in special-purpose applications on embedded processors that can only do integer arithmetic; the position of the decimal point (’.’) is managed by a software library.
 But nowadays, microcontrollers have separate FPUs too, like the STM32F series.