Type punning

In computer science, type punning is a common term for any programming technique that subverts or circumvents the type system of a programming language in order to achieve an effect that would be difficult or impossible to achieve within the bounds of the formal language.

In C and C++, constructs such as pointer type conversion and union (C++ adds reinterpret_cast to this list) are provided in order to permit many kinds of type punning, although some kinds are not actually supported by the standard language.

In the Pascal programming language, records with variants may be used to treat a particular data type in more than one manner, or in a manner not normally permitted.

Sockets example

One classic example of type punning is found in the Berkeley sockets interface. The function to bind an opened but uninitialized socket to an IP address is declared as follows:

int bind(int sockfd, struct sockaddr *my_addr, socklen_t addrlen);

The bind function is usually called as follows:

struct sockaddr_in sa = {0};
int sockfd = ...;
sa.sin_family = AF_INET;
sa.sin_port = htons(port);
bind(sockfd, (struct sockaddr *)&sa, sizeof sa);

The Berkeley sockets library fundamentally relies on the fact that in C, a pointer to struct sockaddr_in is freely convertible to a pointer to struct sockaddr; and, in addition, that the two structure types share the same memory layout for their initial members. Therefore, a reference to the structure field my_addr->sa_family (where my_addr is of type struct sockaddr*) will actually refer to the field sa.sin_family (where sa is of type struct sockaddr_in). In other words, the sockets library uses type punning to implement a rudimentary form of inheritance.
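The inheritance-like pattern described above can be sketched in portable C by embedding a "base" structure as the first member of a "derived" one. The names below (struct shape, struct circle, shape_kind_of, circle_radius) are hypothetical and are not part of the sockets API. The sketch assumes only the C guarantee that a pointer to a structure, suitably converted, points to its first member.

```c
/* Hypothetical "base" type: every derived type starts with this header,
 * much as every socket address structure starts with an address family. */
enum shape_kind { SHAPE_CIRCLE = 1, SHAPE_SQUARE = 2 };

struct shape {
    enum shape_kind kind;
};

/* Hypothetical "derived" type: the base is embedded as the first member. */
struct circle {
    struct shape base;
    int radius;
};

/* Generic code sees only the base type, as bind() sees struct sockaddr. */
int shape_kind_of(const struct shape *s) {
    return (int)s->kind;
}

/* "Downcast" only after checking the tag; returns -1 for other shapes. */
int circle_radius(const struct shape *s) {
    if (s->kind != SHAPE_CIRCLE)
        return -1;
    return ((const struct circle *)s)->radius;
}
```

A caller would pass &c.base for a struct circle c, in the same way that a struct sockaddr_in is passed to bind through a struct sockaddr pointer.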

A related and frequently seen technique is the use of "padded" data structures, which allow different kinds of values to be stored in what is effectively the same storage space. This is typically done as an optimization when two structures are used in mutual exclusivity.
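One way to read the paragraph above is as a description of a tagged union, in which mutually exclusive values share storage and a tag records which one is live. The following is a minimal sketch under that assumption; the names (struct value, make_int, make_float, value_as_double) are invented for illustration.

```c
/* Hypothetical tag recording which union member currently holds a value. */
enum value_kind { VALUE_INT, VALUE_FLOAT };

struct value {
    enum value_kind kind;
    union {            /* both members occupy the same storage */
        long  as_int;
        float as_float;
    } u;
};

struct value make_int(long n) {
    struct value v;
    v.kind = VALUE_INT;
    v.u.as_int = n;
    return v;
}

struct value make_float(float f) {
    struct value v;
    v.kind = VALUE_FLOAT;
    v.u.as_float = f;
    return v;
}

/* Reads back only the member that the tag says was stored. */
double value_as_double(const struct value *v) {
    return v->kind == VALUE_INT ? (double)v->u.as_int
                                : (double)v->u.as_float;
}
```

Because the tag is always checked before a member is read, this sketch shares storage without reinterpreting one type's bits as another.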

Floating-point example

Not all examples of type punning involve structures, as the previous example did. Suppose we want to determine whether a floating-point number is negative. We could write:

bool is_negative(float x) {
    return x < 0.0;
}

However, supposing that floating-point comparisons are expensive, and also supposing that float is represented according to the IEEE floating-point standard, and integers are 32 bits wide, we could engage in type punning to extract the sign bit of the floating-point number using only integer operations:

bool is_negative(float x) {
    unsigned int *ui = (unsigned int *)&x;
    return *ui & 0x80000000;
}

Note that the behaviour will not be exactly the same: in the special case of x being negative zero, the first implementation yields false while the second yields true.

This kind of type punning is more dangerous than most. Whereas the former example relied only on guarantees made by the C programming language about structure layout and pointer convertibility, the latter example relies on assumptions about a particular system's hardware. Some situations, such as time-critical code that the compiler otherwise fails to optimize, may require dangerous code. In these cases, documenting all such assumptions in comments, and introducing static assertions to verify portability expectations, helps to keep the code maintainable.
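The advice about documenting assumptions and adding static assertions might look like the following sketch. It assumes a C11 compiler (for _Static_assert), and the function name is_negative_bits is hypothetical. The checks on FLT_MANT_DIG and FLT_MAX_EXP are a heuristic for "float looks like IEEE 754 binary32", not a complete proof. The bits are read through a union rather than a pointer cast.

```c
#include <float.h>
#include <limits.h>
#include <stdbool.h>

/* Assumptions, checked at compile time so that a port to an unusual
 * platform fails to build instead of silently misbehaving. */
_Static_assert(CHAR_BIT == 8, "bytes are assumed to be 8 bits wide");
_Static_assert(sizeof(unsigned int) * CHAR_BIT == 32,
               "unsigned int is assumed to be 32 bits wide");
_Static_assert(sizeof(float) == sizeof(unsigned int),
               "float and unsigned int are assumed to have the same size");
_Static_assert(FLT_RADIX == 2 && FLT_MANT_DIG == 24 && FLT_MAX_EXP == 128,
               "float is assumed to be IEEE 754 binary32");

/* Returns true if the sign bit of x is set (so true for negative zero). */
bool is_negative_bits(float x) {
    union { float f; unsigned int ui; } pun = { .f = x };
    return (pun.ui & 0x80000000u) != 0;
}
```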

For a practical example popularized by Quake III, see fast inverse square root.

In addition to the assumption about the bit representation of floating-point numbers, the previous floating-point type-punning example also violates the C language's constraints on how objects are accessed:[1] the declared type of x is float, but it is read through an expression of type unsigned int. On many common platforms, this kind of pointer punning can also cause problems when the two types have different alignment requirements. Furthermore, pointers to types of different sizes can alias accesses to the same memory, causing problems that go unchecked by the compiler.
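One commonly used way to avoid both the access-type and the alignment problem, not otherwise covered in this article, is to copy the object representation with memcpy instead of dereferencing a converted pointer. The sketch below is illustrative only: the name is_negative_memcpy is hypothetical, and it still assumes an IEEE 754 binary32 float.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copies the bytes of x into an integer of the same size, so no object
 * is ever read through an lvalue of the wrong type. */
bool is_negative_memcpy(float x) {
    uint32_t bits;
    _Static_assert(sizeof(uint32_t) == sizeof(float),
                   "float is assumed to be 32 bits wide");
    memcpy(&bits, &x, sizeof bits);
    return (bits >> 31) != 0;   /* the top bit is the IEEE 754 sign bit */
}
```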

Use of union

This aliasing problem can be fixed by the use of a union (although this example still makes the assumption about the IEEE 754 bit representation of floating-point types):

bool is_negative(float x) {
    union {
        unsigned int ui;
        float d;
    } my_union = { .d = x };
    return my_union.ui & 0x80000000;
}

This is legal in C99 and later because the stored value of an object may be accessed through an lvalue of an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union).[1] Here the declared type is float for the store and unsigned int for the read through the union, thus complying with the rules of s6.5.

Some compilers like GCC support such constructs as a language extension.[2]
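The union technique also works in the other direction, from bits back to a floating-point value. The following sketch uses it to clear the sign bit and so compute an absolute value. The helper names (float_to_bits, bits_to_float, abs_via_bits) are hypothetical, and the code assumes C11 and an IEEE 754 binary32 float.

```c
#include <stdint.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "float is assumed to be 32 bits wide");

/* Store as float, read the same bytes back as an unsigned integer. */
uint32_t float_to_bits(float x) {
    union { float f; uint32_t u; } pun = { .f = x };
    return pun.u;
}

/* Store as an unsigned integer, read the same bytes back as a float. */
float bits_to_float(uint32_t bits) {
    union { float f; uint32_t u; } pun = { .u = bits };
    return pun.f;
}

/* Clears the sign bit using only integer operations. */
float abs_via_bits(float x) {
    return bits_to_float(float_to_bits(x) & 0x7FFFFFFFu);
}
```

Under these assumptions, float_to_bits(1.0f) is 0x3F800000, and abs_via_bits maps negative zero to positive zero.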

For another example of type punning, see Stride of an array.

Pascal

A variant record permits treating a data type as multiple kinds of data depending on which variant is being referenced. In the following example, integer is presumed to be 16 bits wide, longint and real are presumed to be 32 bits wide, and a character is presumed to be 8 bits wide:

  type variant_record = record
     case rec_type : longint of
         1: ( I : array [1..2] of integer );
         2: ( L : longint );
         3: ( R : real );
         4: ( C : array [1..4] of char );
     end;
   var V: variant_record;
      K: integer;
      LA: longint;
      RA: real;
      Ch: char;
  ...
   V.I[1] := 1;
   Ch := V.C[1];   (* This would extract the first byte of V.I[1] *)
   V.R := 8.3;
   LA := V.L;     (* This would store the bit pattern of a real into an integer *)

In Pascal, converting a real to an integer (for example with trunc) yields the truncated value. The variant record instead reinterprets the binary representation of the floating-point number as a 32-bit long integer. The result is not the same as the truncated value, and it may not be a meaningful long integer value on some systems.
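The difference between converting a value and reinterpreting its bits can be illustrated with a C analogue of the Pascal fragment above. This is a sketch rather than a translation: the function names are hypothetical, and it assumes C11 and an IEEE 754 binary32 float.

```c
#include <stdint.h>

_Static_assert(sizeof(float) == sizeof(int32_t),
               "float is assumed to be 32 bits wide");

/* Value conversion: the fractional part is discarded. */
int32_t convert_value(float x) {
    return (int32_t)x;
}

/* Reinterpretation: the bytes of the float are read as an integer,
 * as with V.R and V.L in the variant record. */
int32_t reinterpret_bits(float x) {
    union { float f; int32_t i; } pun = { .f = x };
    return pun.i;
}
```

Under these assumptions, convert_value(8.3f) is 8, whereas reinterpret_bits(8.3f) is 0x4104CCCD, the binary32 encoding of the nearest representable value to 8.3.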

These examples could be used to create strange conversions, although, in some cases, there may be legitimate uses for these types of constructs, such as for determining locations of particular pieces of data. In the following example a pointer and a longint are both presumed to be 32 bit:

 Type PA = ^Arec;
 
    Arec = record
      case rt : longint of
         1: (P: PA);
         2: (L: Longint);
    end;
 
  Var PP: PA;
   K: Longint;
  ...
   New(PP);
   PP^.P := PP;
   Writeln('Variable PP is located at address ', hex(PP^.L));

Here, "new" is the standard Pascal routine for allocating memory for a pointer, and "hex" is presumed to be a routine that returns the hexadecimal string describing the value of an integer. This allows the address held in a pointer to be displayed, something which is not normally permitted. (Pointers cannot be read or written, only assigned.) Assigning a value to the integer variant of a pointer would allow examining or writing to any location in system memory:

 PP^.L := 0;
 PP := PP^.P;  (*PP now points to address 0 *)
 K := PP^.L;   (*K contains the value of word 0 *)
 Writeln('Word 0 of this machine contains ',K);

This construct may cause a program check or protection violation if address 0 is protected against reading on the machine the program is running upon or the operating system it is running under.

C#

In C# (and other .NET languages), type punning is somewhat harder to achieve because of the type system, but it can nonetheless be done using pointers or struct unions.

Pointers

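C# only allows pointers to so-called unmanaged types: the built-in numeric types, char, bool, any enum or pointer type, and any struct composed only of fields of unmanaged types. Reference types such as string and arrays are excluded, although a pointer to the elements of an array can be obtained with the fixed statement. Note that pointers are only allowed in code marked 'unsafe'.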

 float pi = 3.14159f;
 uint piAsRawData = *(uint*)&pi;

Struct unions

Struct unions are allowed without any notion of 'unsafe' code, but they do require the definition of a new type.

 [StructLayout(LayoutKind.Explicit)]
 struct FloatAndUIntUnion
 {
     [FieldOffset(0)]
     public float DataAsFloat;
     [FieldOffset(0)]
     public uint DataAsUInt;
 }

 // ...

 FloatAndUIntUnion union = new FloatAndUIntUnion();
 union.DataAsFloat = 3.14159f;
 uint piAsRawData = union.DataAsUInt;

Raw CIL code

Raw CIL can be used instead of C#, because it doesn't have most of the type limitations. This allows one to, for example, combine two enum values of a generic type:

 TEnum a = ...;
 TEnum b = ...;
 TEnum combined = a | b; // illegal

This can be circumvented by the following CIL code:

 .method public static hidebysig
     !!TEnum CombineEnums<valuetype .ctor ([mscorlib]System.ValueType) TEnum>(
         !!TEnum a,
         !!TEnum b
     ) cil managed
 {
     .maxstack 2

     ldarg.0 
     ldarg.1
     or  // this will not cause an overflow, because a and b have the same type, and therefore the same size.
     ret
 }

The cpblk CIL opcode allows for some other tricks, such as converting a struct to a byte array:

 .method public static hidebysig
     uint8[] ToByteArray<valuetype .ctor ([mscorlib]System.ValueType) T>(
         !!T& v // 'ref T' in C#
     ) cil managed
 {
     .locals init (
         [0] uint8[]
     )

     .maxstack 3

     // create a new byte array with length sizeof(T) and store it in local 0
     sizeof !!T
     newarr uint8
     dup           // keep a copy on the stack for later (1)
     stloc.0

     ldc.i4.0
     ldelema uint8

     // memcpy(local 0, &v, sizeof(T));
     // <the array is still on the stack, see (1)>
     ldarg.0 // this is the *address* of 'v', because its type is '!!T&'
     sizeof !!T
     cpblk

     ldloc.0
     ret
 }

References

  1. ISO/IEC 9899:1999 s6.5/7
  2. GCC: Non-Bugs
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.