Tuesday, June 8, 2021

Stuck in Fixed & Floating point?

Hey, 👩

  In the digital world, we represent numbers as binary bits (0s and 1s). Two popular ways to represent real numbers in the binary number system are:

  • Fixed-point representation 
  • Floating-point representation
  i. Fixed-point representation 

      > stores the integer and fractional parts in separate bit fields of fixed width, so the binary point always stays at the same position.

  Variants:

     > Unsigned fixed point representation: no extra sign bit

         Example:  (3.5) (base = 10) = (11.1) (base = 2)
                          integer part(=3) + fractional part(=0.5) = 11 + 0.1 = 11.1
                          (0.5 = 1 x 2^-1, so the fractional part is .1 in binary)

     > Signed fixed point representation: has an extra sign bit at the beginning

         Example:  (3.5) (base = 10) = (011.1) (base = 2)
                          sign(=+ve) + integer part(=3) + fractional part(=0.5) = 0 + 11 + 0.1 = 011.1

        [ NOTE:  numbers starting with '0' are +ve and those with '1' are -ve. This is the sign-magnitude convention; two's complement is another common choice. ]

        So, for -3.5, we will have 111.1 as the signed fixed point representation
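
        If you'd like to check such conversions yourself, here is a minimal Python sketch. The helper name to_fixed_point and its parameters are made up for this post (they are not from any library). It assumes the sign-magnitude convention from the note above and simply drops any fractional bits beyond frac_bits instead of rounding.

```python
def to_fixed_point(value, frac_bits=1, signed=True):
    """Return a fixed-point bit string such as '011.1' (hypothetical helper)."""
    if value < 0 and not signed:
        raise ValueError("unsigned format cannot hold a negative number")

    magnitude = abs(value)
    int_part = int(magnitude)
    frac_part = magnitude - int_part

    # Fractional bits: multiply by 2 repeatedly and peel off the integer bit.
    frac_str = ""
    for _ in range(frac_bits):
        frac_part *= 2
        bit = int(frac_part)
        frac_str += str(bit)
        frac_part -= bit

    # Sign-magnitude: '0' for +ve, '1' for -ve, nothing at all if unsigned.
    sign_bit = ("1" if value < 0 else "0") if signed else ""
    return f"{sign_bit}{int_part:b}.{frac_str}"


print(to_fixed_point(3.5, signed=False))  # 11.1
print(to_fixed_point(3.5))                # 011.1
print(to_fixed_point(-3.5))               # 111.1
```

        Try a larger frac_bits with values like 0.625 or 0.1 to see which fractions fit exactly and which get cut off.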

-----------------------------------------------------------------------------------------------------------------------------

  ii. Floating-point representation 

      It's much like the scientific notation (order of magnitude representation) that we've learnt in Physics, e.g. 3.5 x 10^2, but with powers of 2 instead of powers of 10.
      Let's learn by an example:
      
      Just like the previous case, we have here (3.5) (base = 10) = (011.1) (base = 2) in signed form. Now, 
   
    step 1: separate the sign bit (here = 0)
   
    step 2: shift the binary point towards the left until only one '1' remains on its left (as we do for order of magnitude
       representation) :
                                 1.11000000.....
       (you can keep as many zeros as you like because trailing zeros after the binary point and leading zeros before it don't matter.)
   
    step 3: express as powers of 2:

                                 11.1 = 1.11 x 2^1   ( 2^1 since the binary point was shifted by only 1 place)
        [ steps 2 and 3 together are called normalization ]
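
        Steps 1 to 3 can be tried out with a small Python sketch. The function name normalize is made up for this post. It leans on math.frexp from the standard library, which returns a fraction in [0.5, 1), so the sketch doubles that fraction and lowers the exponent by one to get the "one 1 before the point" form. Zero is rejected because it has no such form.

```python
import math


def normalize(value):
    """Return (sign, significand, exponent) with 1 <= significand < 2.

    Hypothetical helper: value == (-1)**sign * significand * 2**exponent.
    """
    if value == 0:
        raise ValueError("zero cannot be normalized")

    # Step 1: separate the sign.
    sign = 1 if value < 0 else 0

    # Steps 2 and 3: frexp gives abs(value) = frac * 2**exp with 0.5 <= frac < 1,
    # so shift by one more place to land in [1, 2).
    frac, exp = math.frexp(abs(value))
    return sign, frac * 2, exp - 1


print(normalize(3.5))   # (0, 1.75, 1)  and 1.75 is 1.11 in binary
print(normalize(-3.5))  # (1, 1.75, 1)
```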
    
    step 4: Figure out the exponent and mantissa:
                   
                This is the most vital step. Basically, the number is: (-1)^s x (1 + m) x 2^(e - Bias) 
      where:
             s = 0 (for +ve numbers) and 1 (for -ve numbers) (it determines the sign)
             m = the mantissa bits read as a binary fraction (0.m)
             e = the exponent as stored, and Bias = a fixed offset subtracted from it

       to determine m, e and Bias, we have two standardized options to follow:

  • IEEE single-precision floating-point standard representation: it provides a 32-bit representation of floating-point numbers as follows:-

          | sign (1 bit) | exponent e (8 bits) | mantissa m (23 bits) |

          m = mantissa (or significand) = the part after the binary point in the step 3 representation

          here, > m = 1100000000....up to 23 bits (following IEEE single precision format)
                   > Bias = 127 (127 = max. no. in 7 bits)
          So, we need e = 128 because then 2^(e - Bias) will be 2^(128 - 127) = 2^1 and that is what 
          we want. [ see step 3 ]  
          The bias keeps the stored exponent non-negative: numbers below 1 get e < 127, numbers from 1 up to (but not including) 2 get e = 127,
          and larger numbers get e > 127. For normalized numbers e runs from 1 to 254 (0 and 255 are reserved for special values such as zero,
          infinity and NaN), which is why we've reserved 8 bits for exponent (e) and remaining 32 - (1+8) = 23 bits for mantissa.
          Hence, the representation is as:

                0 (sign bit) e = 10000000 (binary of 128) and m = 1100000000....up to 23 bits

                i.e. 0 10000000 11000000000000000000000
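
          You can double-check a single-precision bit pattern with Python's standard struct module. This is only a sketch: the helper names float32_fields and decode_float32 are made up for this post, and the decoder applies the step 4 formula, so it is valid for normalized numbers only (not zero, infinity or NaN).

```python
import struct


def float32_fields(value):
    """Split a number's 32-bit pattern into (sign, exponent, mantissa) strings."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


def decode_float32(sign, exponent, mantissa, bias=127):
    """Apply (-1)^s x (1 + m) x 2^(e - Bias) to the three bit strings."""
    s = int(sign, 2)
    e = int(exponent, 2)
    m = int(mantissa, 2) / 2 ** len(mantissa)  # mantissa bits read as 0.m
    return (-1) ** s * (1 + m) * 2.0 ** (e - bias)


fields = float32_fields(3.5)
print(fields)                   # ('0', '10000000', '11000000000000000000000')
print(decode_float32(*fields))  # 3.5
```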

-----------------------------------------------------------------------------------------------------------------------------
   
  • IEEE double-precision floating-point standard representation: it provides a 64-bit representation of floating-point numbers as follows:-

          | sign (1 bit) | exponent e (11 bits) | mantissa m (52 bits) |

          here, > m = 1100000000....up to 52 bits (following IEEE double precision format)
                   > Bias = 1023 (1023 = max. no. in 10 bits)
          So, we need e = 1024 because then 2^(e - Bias) will be 2^(1024 - 1023) = 2^1 and that is 
          what we want here. [ refer step 3 ]  
          As before, the bias keeps the stored exponent non-negative. For normalized numbers e runs from 1 to 2046
          (0 and 2047 are reserved for special values), which is why we've reserved 11 bits for exponent (e) and
          remaining 64 - (1+11) = 52 bits for mantissa.
          Hence, the representation is as:

              0 (sign bit) e = 10000000000 (binary of 1024) and m = 1100000000....up to 52 bits

              i.e. 0 10000000000 1100000000000000000000000000000000000000000000000000

          phewww... that's really huuuge!
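
          Counting 52 bits by hand is error-prone, so here is a small Python sketch that prints the three double-precision fields for you. The helper name float64_fields is made up for this post. It relies on the standard struct module, and on Python's own float being an IEEE double on practically every platform.

```python
import struct


def float64_fields(value):
    """Split a number's 64-bit pattern into (sign, exponent, mantissa) strings."""
    # Pack as a big-endian float64, then reinterpret the same 8 bytes as an integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]


sign, exponent, mantissa = float64_fields(3.5)
print(sign)              # 0
print(exponent)          # 10000000000
print(int(exponent, 2))  # 1024
print(len(mantissa))     # 52
```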

-----------------------------------------------------------------------------------------------------------------------------

     [ NOTE: many real numbers (for example 0.1) have a binary expansion that never terminates, and
      single-precision format only affords us 23 bits to represent the fractional part. Thus, we have to
      settle for an approximation, rounding things to the 23rd bit (and correspondingly, the 52nd bit for
      double precision format). ]
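
     To see this rounding in action, here is a tiny Python sketch. The helper name round_to_float32 is made up for this post. It squeezes a number through a 32-bit float using the standard struct module and hands back whatever survived.

```python
import struct


def round_to_float32(value):
    """Store value as a 32-bit float and read it back (hypothetical helper)."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


# 3.5 needs only two mantissa bits, so it survives unchanged.
print(round_to_float32(3.5) == 3.5)  # True

# 0.1 never terminates in binary, so the 23-bit mantissa has to round it.
print(round_to_float32(0.1) == 0.1)  # False
```

     The second comparison fails because Python's own 0.1 is a double with a 52-bit mantissa, which keeps more digits than the single-precision copy can hold.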

     (Images collected from: Google images)
-----------------------------------------------------------------------------------------------------------------------------

       That's it! 
       For similar interesting posts, keep in touch with topics@today...💗
