Type safety
Type systems |
---|
In computer science, type safety is the extent to which a programming language discourages or prevents type errors. A type error is erroneous or undesirable program behaviour caused by a discrepancy between differing data types for the program's constants, variables, and methods (functions), e.g., treating an integer (int) as a floating-point number (float). Type safety is sometimes alternatively considered to be a property of a computer program rather than the language in which that program is written; that is, some languages have type-safe facilities that can be circumvented by programmers who adopt practices that exhibit poor type safety. The formal type-theoretic definition of type safety is considerably stronger than what is understood by most programmers.
Type enforcement can be static, catching potential errors at compile time, or dynamic, associating type information with values at run-time and consulting them as needed to detect imminent errors, or a combination of both.
The behaviors classified as type errors by a given programming language are usually those that result from attempts to perform operations on values that are not of the appropriate data type. This classification is partly based on opinion; it may imply that any operation not leading to program crashes, security flaws or other obvious failures is legitimate and need not be considered an error, or it may imply that any contravention of the programmer's explicit intent (as communicated via typing annotations) to be erroneous and not "type-safe".
In the context of static (compile-time) type systems, type safety usually involves (among other things) a guarantee that the eventual value of any expression will be a legitimate member of that expression's static type. The precise requirement is more subtle than this — see, for example, subtype and polymorphism for complications.
Type safety is closely linked to memory safety, a restriction on the ability to copy arbitrary bit patterns from one memory location to another. For instance, in an implementation of a language that has some type , such that some sequence of bits (of the appropriate length) does not represent a legitimate member of , if that language allows data to be copied into a variable of type , then it is not type-safe because such an operation might assign a non- value to that variable. Conversely, if the language is type-unsafe to the extent of allowing an arbitrary integer to be used as a pointer, then it is not memory-safe.
Most statically typed languages provide a degree of type safety that is strictly stronger than memory safety, because their type systems enforce the proper use of abstract data types defined by programmers even when this is not strictly necessary for memory safety or for the prevention of any kind of catastrophic failure.
Definitions
Type-safe code accesses only the memory locations it is authorized to access. (For this discussion, type safety specifically refers to memory type safety and should not be confused with type safety in a broader respect.) For example, type-safe code cannot read values from another object's private fields.
Robin Milner provided the following slogan to describe type safety:
- Well-typed programs cannot "go wrong".[1]
The appropriate formalization of this slogan depends on the style of formal semantics used for a particular language. In the context of denotational semantics, type safety means that the value of an expression that is well-typed, say with type τ, is a bona fide member of the set corresponding to τ.
In 1994, Andrew Wright and Matthias Felleisen formulated what is now the standard definition and proof technique for type safety in languages defined by operational semantics. Under this approach, type safety is determined by two properties of the semantics of the programming language:
- (Type-) preservation or subject reduction
- "Well typedness" of programs remains invariant under the transition rules (i.e. evaluation rules or reduction rules) of the language.
- Progress
- A well typed program never gets "stuck", i.e., never gets into an undefined state where no further transitions are possible.
These properties do not exist in a vacuum; they are linked to the semantics of the programming language they describe, and there is a large space of varied languages that can fit these criteria, since the notion of "well typed" program is part of the static semantics of the programming language and the notion of "getting stuck" (or "going wrong") is a property of its dynamic semantics.
Vijay Saraswat provides the following definition:
- "A language is type-safe if the only operations that can be performed on data in the language are those sanctioned by the type of the data." [2]
Relation to other forms of safety
Type safety is ultimately aimed at excluding other problems, e.g.:-
- Prevention of illegal operations. For example, we can identify an expression
3 / "Hello, World"
as invalid, because the rules of arithmetic do not specify how to divide an integer by a string. - Memory safety
- Wild pointers can arise when a pointer to one type object is treated as a pointer to another type. For instance, the size of an object depends on the type, so if a pointer is incremented under the wrong credentials, it will end up pointing at some random area of memory.
- Buffer overflow - Out-of bound writes can corrupt the contents of objects already present on the heap. This can occur when a larger object of one type is crudely copied into smaller object of another type.
- Logic errors originating in the semantics of different types. For instance, inches and millimeters may both be stored as integers, but should not be substituted for each other or added. A type system can enforce two different types of integer for them.
Type-safe and type-unsafe languages
Type safety is usually a requirement for any toy language proposed in academic programming language research. Many languages, on the other hand, are too big for human-generated type safety proofs, as they often require checking thousands of cases. Nevertheless, some languages such as Standard ML, which has rigorously defined semantics, have been proved to meet one definition of type safety.[3] Some other languages such as Haskell are believed to meet some definition of type safety, provided certain "escape" features are not used (for example Haskell's unsafePerformIO, used to "escape" from the usual restricted environment in which I/O is possible, circumvents the type system and so can be used to break type safety.[4]) Type punning is another example of such an "escape" feature. Regardless of the properties of the language definition, certain errors may occur at run-time due to bugs in the implementation, or in linked libraries written in other languages; such errors could render a given implementation type unsafe in certain circumstances. An early version of Sun's Java Virtual Machine was vulnerable to this sort of problem.[2]
Type safety and "strong typing"
Some people use the term "strong typing" to refer to certain aspects of type safety. For example, a language with a statically checked type system may be described as "strongly typed", because it statically disallows conversions between values of incompatible type. Similarly, a language with a dynamically checked type system may also be described as "strongly typed", because a program which tries to convert a value to an incompatible type will fail at runtime.
Type safety in object oriented languages
In object oriented languages type safety is usually intrinsic in the fact a type system is in place. This is expressed in terms of class definitions.
A class essentially defines the structure of the objects derived from it and an API as a contract for handling these objects. Each time a new object is created it will comply with that contract.
Each function that exchanges objects derived from a specific class, or implementing a specific interface, will adhere to that contract: hence in that function the operations permitted on that object will be only those defined by the methods of the class the object implements. This will guarantee that the object integrity will be preserved.[5]
Exception to this are object oriented languages that allow dynamic modification of the object structure, or the use of reflection to modify the content of an object to overcome the constraints imposed by the class methods definitions.
Type safety issues in specific languages
Ada
The Wikibook Ada Programming has a page on the topic of: Type System |
Ada was designed to be suitable for embedded systems, device drivers and other forms of system programming, but also to encourage type safe programming. To resolve these conflicting goals, Ada confines type-unsafety to a certain set of special constructs whose names usually begin with the string Unchecked_. Unchecked_Deallocation can be effectively banned from a unit of Ada text by applying pragma Pure to this unit. It is expected that programmers will use Unchecked_ constructs very carefully and only when necessary; programs that do not use them are type safe.
The SPARK programming language is a subset of Ada eliminating all its potential ambiguities and insecurities while at the same time adding statically checked contracts to the language features available. SPARK avoids the issues with dangling pointers by disallowing allocation at run time entirely.
Ada2012 adds statically checked contracts to the language itself (in form of pre-, and post-conditions, as well as type invariants).
C
Wikibooks has a book on the topic of: C Programming |
The C programming language is typesafe in limited contexts; for example, a compile-time error is generated when an attempt is made to convert a pointer to one type of structure to a pointer to another type of structure, unless an explicit cast is used. However, a number of very common operations are non-typesafe; for example, the usual way to print an integer is something like printf("%d", 12)
, where the %d
tells printf
at run-time to expect an integer argument. (Something like printf("%s", 12)
, which erroneously tells the function to expect a pointer to a character-string, may be accepted by compilers, but will produce undefined results.) This is partially mitigated by some compilers (such as gcc) checking type correspondences between printf arguments and format strings.
In addition, C, like Ada, provides unspecified or undefined explicit conversions; and unlike in Ada, idioms that use these conversions are very common, and have helped to give C a type-unsafe reputation. For example, the standard way to allocate memory on the heap is to invoke a memory allocation function, such as malloc
, with an argument indicating how many bytes are required. The function returns an untyped pointer (type void *
), which the calling code must cast to the appropriate pointer type. Older C specifications required an explicit cast to do so, therefore the code (struct foo *) malloc(sizeof(struct foo))
became the accepted practice.[6] However, this practice is discouraged in ANSI C as it can mask a failure to include the header file in which malloc
is defined, resulting in downstream errors on machines where the int and pointer types are of different sizes, such as most common implementations of C for the now-ubiquitous x86 64 architecture.[7] A conflict arises in code that is required to compile as C++, since the cast is necessary in that language.
C++
Some features of C++ that promote more type-safe code:
- The new operator returns a pointer of type based on operand, whereas malloc returns a void pointer.
- C++ code can use virtual functions and templates to achieve polymorphism without void pointers.
- Preprocessor constants (without type) can be rewritten as const variables (typed).
- Preprocessor macro functions (without type) can be rewritten as inline functions (typed). The flexibility of accepting and returning different types can still be obtained by function overloading.
- Safer casting operators, such as dynamic cast that performs run-time type checking.
C#
C# is type-safe (but not statically type-safe). It has support for untyped pointers, but this must be accessed using the "unsafe" keyword which can be prohibited at the compiler level. It has inherent support for run-time cast validation. Casts can be validated by using the "as" keyword that will return a null reference if the cast is invalid, or by using a C-style cast that will throw an exception if the cast is invalid. See C Sharp conversion operators.
Undue reliance on the object type (from which all other types are derived) runs the risk of defeating the purpose of the C# type system. It is usually better practice to abandon object references in favour of generics, similar to templates in C++ and generics in Java.
Java
Wikibooks has a book on the topic of: Java Programming |
The Java language is designed to enforce type safety. Anything in Java happens inside an object and each object is an instance of a class.
To implement the type safety enforcement, each object, before usage, needs to be allocated. Java allows usage of primitive types but only inside properly allocated objects.
Sometimes a part of the type safety is implemented indirectly: e.g. the class BigDecimal represents a floating point number of arbitrary precision, but handles only numbers that can be expressed with a finite representation. The operation BigDecimal.divide() calculates a new object as the division of two numbers expressed as BigDecimal.
In this case if the division has no finite representation, as when one computes e.g. 1/3=0.33333..., the divide() method can rise an exception if no rounding mode is defined for the operation. Hence the library, rather than the language, guarantees that the object respects the contract implicit in the class definition.
Standard ML
Wikibooks has a book on the topic of: Standard ML Programming |
SML has rigorously defined semantics and is known to be type-safe. However, some implementations of SML, including Standard ML of New Jersey (SML/NJ), its syntactic variant Mythryl and Mlton, provide libraries that offer certain unsafe operations. These facilities are often used in conjunction with those implementations' foreign function interfaces to interact with non-ML code (such as C libraries) that may require data laid out in specific ways. Another example is the SML/NJ interactive toplevel itself, which must use unsafe operations to execute ML code entered by the user.
Modula-2
Modula-2 is a strongly typed language with a design philosophy to require any unsafe facilities to be explicitly marked as unsafe. This is achieved by "moving" such facilities into a built-in pseudo-library called SYSTEM from where they must be imported before they can be used. The import thus makes it visible when such facilities are used. Unfortunately, this was not consequently implemented in the original language report and its implementation.[8] There still remained unsafe facilities such as the type cast syntax and variant records (inherited from Pascal) that could be used without prior import.[9] The difficulty in moving these facilities into the SYSTEM pseudo-module was the lack of any identifier for the facility that could then be imported since only identifiers can be imported, but not syntax.
IMPORT SYSTEM; (* allows the use of certain unsafe facilities: *)
VAR word : SYSTEM.WORD; addr : SYSTEM.ADDRESS;
addr := SYSTEM.ADR(word);
(* but type cast syntax can be used without such import *)
VAR i : INTEGER; n : CARDINAL;
n := CARDINAL(i); (* or *) i := INTEGER(n);
The ISO Modula-2 standard corrected this for the type cast facility by changing the type cast syntax into a function called CAST which has to be imported from pseudo-module SYSTEM. However, other unsafe facilities such as variant records remained available without any import from pseudo-module SYSTEM.[10]
IMPORT SYSTEM;
VAR i : INTEGER; n : CARDINAL;
i := SYSTEM.CAST(INTEGER, n); (* Type cast in ISO Modula-2 *)
A recent revision of the language applied the original design philosophy rigorously. First, pseudo-module SYSTEM was renamed to UNSAFE to make the unsafe nature of facilities imported from there more explicit. Then all remaining unsafe facilities where either removed altogether (for example variant records) or moved to pseudo-module UNSAFE. For facilities where there is no identifier that could be imported, enabling identifiers were introduced. In order to enable such a facility, its corresponding enabling identifier must be imported from pseudo-module UNSAFE. No unsafe facilities remain in the language that do not require import from UNSAFE.[9]
IMPORT UNSAFE;
VAR i : INTEGER; n : CARDINAL;
i := UNSAFE.CAST(INTEGER, n); (* Type cast in Modula-2 Revision 2010 *)
FROM UNSAFE IMPORT FFI; (* enabling identifier for foreign function interface facility *)
<*FFI="C"*> (* pragma for foreign function interface to C *)
Pascal
Wikibooks has a book on the topic of: Pascal Programming |
Pascal has had a number of type safety requirements, some of which are kept in some compilers. Where a Pascal compiler dictates "strict typing", two variables cannot be assigned to each other unless they are either compatible (such as conversion of integer to real) or assigned to the identical subtype. For example, if you have the following code fragment:
type
TwoTypes = record
I: Integer;
Q: Real;
end;
DualTypes = record
I: Integer;
Q: Real;
end;
var
T1, T2: TwoTypes;
D1, D2: DualTypes;
Under strict typing, a variable defined as TwoTypes is not compatible with DualTypes (because they are not identical, even though the components of that user defined type are identical) and an assignment of T1 := D2; is illegal. An assignment of T1 := T2; would be legal because the subtypes they are defined to are identical. However, an assignment such as T1.Q := D1.Q; would be legal.
Common Lisp
In general, Common Lisp is a type-safe language. A Common Lisp compiler is responsible for inserting dynamic checks for operations whose type safety cannot be proven statically. However, a programmer may indicate that a program should be compiled with a lower level of dynamic type-checking.[11] A program compiled in such a mode cannot be considered type-safe.
C++ Examples
The following examples illustrates how C++ cast operators can break type safety when used incorrectly. The first example shows how basic data types can be incorrectly casted:
#include <iostream>
using namespace std;
int main () {
int ival = 5; // integer value
float fval = reinterpret_cast<float&>(ival); // reinterpret bit pattern
cout << fval << endl; // output integer as float
return 0;
}
In this example, reinterpret_cast
explicitly prevents the compiler from performing a safe conversion from integer to floating-point value.[12] When the program runs it will output a garbage floating-point value. The problem could have been avoided by instead writing float fval = ival;
The next example shows how object references can be incorrectly downcasted:
#include <iostream>
using namespace std;
class Parent {
public:
virtual ~Parent() {} // virtual destructor for RTTI
};
class Child1 : public Parent {
public:
int a;
};
class Child2 : public Parent {
public:
double b;
};
int main () {
Child1 c1;
c1.a = 5;
Parent & p = c1; // upcast always safe
Child2 & c2 = static_cast<Child2&>(p); // invalid downcast
cout << c2.b << endl; // will output garbage data
return 0;
}
The two child classes have members of different types. When downcasting a parent class pointer to a child class pointer, then the resulting pointer may not point to a valid object of correct type. In the example, this leads to garbage value being printed. The problem could have been avoided by replacing static_cast
with dynamic_cast
that throws an exception on invalid casts.[13]
See also
Notes
- ↑ Milner, Robin (1978), "A Theory of Type Polymorphism in Programming", Jcss 17: 348–375
- 1 2 Saraswat, Vijay (1997-08-15). "Java is not type-safe". Retrieved 2008-10-08.
- ↑ Standard ML. Smlnj.org. Retrieved on 2013-11-02.
- ↑ "System.IO.Unsafe". GHC libraries manual: base-3.0.1.0. Retrieved 2008-07-17.
- ↑ Type safety is hence also a matter of good class definition: public methods that modify the internal state of an object shall preserve the object itegrity
- ↑ Kernighan; Dennis M. Ritchie (March 1988). The C Programming Language (2nd ed.). Englewood Cliffs, NJ: Prentice Hall. p. 116. ISBN 0-13-110362-8.
In C, the proper method is to declare that malloc returns a pointer to void, then explicitly coerce the pointer into the desired type with a cast.
- ↑ "Do I cast the result of malloc?". Stack Overflow. 7 November 2012. Retrieved 21 February 2013.
- ↑ Niklaus Wirth (1985). Programming in Modula-2. Springer Verlag.
- 1 2 "The Separation of Safe and Unsafe Facilities". Retrieved 24 March 2015.
- ↑ "ISO Modula-2 Language Reference". Retrieved 24 March 2015.
- ↑ "Common Lisp HyperSpec". Retrieved 26 May 2013.
- ↑ http://en.cppreference.com/w/cpp/language/reinterpret_cast
- ↑ http://en.cppreference.com/w/cpp/language/dynamic_cast
References
- Pierce, Benjamin C. (2002). Types and Programming Languages. MIT Press. ISBN 0-262-16209-1.
- "Type Safe". Portland Pattern Repository Wiki.
- Wright, Andrew K.; Matthias Felleisen (1994). "A Syntactic Approach to Type Soundness". Information and Computation 115 (1): 38–94. doi:10.1006/inco.1994.1093.
- Macrakis, Stavros (April 1982). "Safety and power" (requires subscription). ACM SIGSOFT Software Engineering Notes 7 (2): 25–26. doi:10.1145/1005937.1005941.