In computer science, a type system defines how a programming language classifies values and expressions into types, how it can manipulate those types and how they interact. A type identifies a value or set of values as having a particular meaning or purpose (although some types, such as abstract types and function types, might not be represented as values in the running computer program). Type systems vary significantly between languages with, perhaps, the most important variations being their compile-time syntactic and run-time operational implementations.
A compiler may use the static type of a value to optimize the storage it needs and the choice of algorithms for operations on the value. In many C compilers the "float" data type, for example, is represented in 32 bits, in accordance with the IEEE specification for single-precision floating point numbers. C thus uses floating-point-specific operations on those values (floating-point addition, multiplication, etc.).
The depth of type constraints and the manner of their evaluation affect the typing of the language. A programming language may further associate an operation with varying concrete algorithms on each type in the case of type polymorphism. Type theory is the study of type systems, although the concrete type systems of programming languages originate from practical issues of computer architecture, compiler implementation, and language design.
Contents |
Assigning data types (typing) gives meaning to collections of bits. Types usually have associations either with values in memory or with objects such as variables. Because any value simply consists of a set of bits in a computer, hardware makes no distinction even between memory addresses, instruction code, characters, integers and floating-point numbers. Assignment to a type informs programs and programmers how those bit collections should be treated.
Major functions type systems provide include:
3 / "Hello, World"
as invalid because one cannot divide (in any usual sense) an integer by a string. As discussed below, strong typing offers more safety, but does not necessarily guarantee complete safety (see type-safety for more information).Notice, however, that type safety is not equivalent to program correctness. A program may give the wrong result and be safely typed, producing no compiler errors. Other methods are needed to ensure a program is correct.
A program typically associates each value with one particular type (although a type may have more than one subtype). Other entities, such as objects, modules, communication channels, dependencies, or even types themselves, can become associated with a type. For example:
A type system, specified in for each programming language, controls the ways typed programs may behave, and makes behavior outside these rules illegal. An effect system typically provides more fine-grained control than does a type system.
More formally, type theory studies type systems.
The process of verifying and enforcing the constraints of types – type checking – may occur either at compile-time (a static check) or run-time (a dynamic check). If a language specification requires its typing rules strongly (ie, more or less allowing only those automatic type conversions which do not lose information), one can refer to the process as strongly typed, if not, as weakly typed. The terms are not used in a strict sense.
A programming language is said to use static typing when type checking is performed during compile-time as opposed to run-time. Examples of languages that use static typing include Ada, C, C++, C#, Java, Fortran, ML, Pascal, and Haskell. Static typing is a limited form of program verification (see type safety): accordingly, it allows many errors to be caught early in the development cycle. Program execution may also be made more efficient (i.e. faster or taking reduced memory).
A programming language is said to be dynamically typed, or just 'dynamic', when the majority of its type checking is performed at run-time as opposed to at compile-time. Examples of languages that are dynamically typed include Groovy, JavaScript, Lisp, Objective-C, Perl, PHP, Prolog, Python, Ruby, and Smalltalk. Compared to static typing (or 'early-binding'), dynamic typing is more flexible because of theoretical limitations on the decidability of certain static program analysis problems; these prevent the same level of flexibility from being achieved with static typing.
Dynamically-typed language systems, compared to their statically-type cousins, make fewer "compile-time" checks on the source code (but will check, for example, that the program is syntactically correct). Run-time checks can potentially be more sophisticated, since they can use dynamic information as well as any information that was present at "compile" time. On the other hand, runtime checks only assert that conditions hold in a particular execution of the program, and are repeated for every execution of the program. Static languages restrict type information to that which can be determined at compile time which means that static checks are not repeated on execution which in practice leads to faster execution.
Development in dynamically-typed languages is often supported by programming practices such as unit testing. Testing is a key practice in professional software development, and is particularly important in dynamically-typed languages because the greater flexibility afforded by dynamic typing needs a more dynamic, that is runtime testing. Unit testing can be incorporated into the build cycle and so can be thought of as a "compile-time" check, in that the program user will not have to manually run such tests.
Some statically typed languages have a "back door" in the language specification that enables programmers to write code that does not statically type check. For example, Java and most C-style languages have "casting"; such operations may be unsafe at runtime, in that they can cause unwanted behavior due to incorrect typing of values when the program runs.
The presence of static typing in a programming language does not necessarily imply the absence of all dynamic typing mechanisms. For example, Java, and various other object-oriented languages, while using static typing, require for certain operations (downcasting) the support of runtime type tests, a form of dynamic typing. See programming language for more discussion of the interactions between static and dynamic typing.
C++, besides being mostly statically typed, also supports a type of late-binding, that is mostly implemented through its support of polymorphism.
The choice between static and dynamic typing requires trade-offs.
Static typing can find type errors reliably at compile time. This should increase the reliability of the delivered program. However, programmers disagree over how commonly type errors occur, and thus what proportion of those bugs which are written would be caught by static typing. Static typing advocates believe programs are more reliable when they have been well type-checked, while dynamic typing advocates point to distributed code that has proven reliable and to small bug databases. The value of static typing, then, presumably increases as the strength of the type system is increased. Advocates of dependently-typed languages such as Dependent ML and Epigram have suggested that almost all bugs can be considered type errors, if the types used in a program are properly declared by the programmer or correctly inferred by the compiler.[1]
Static typing usually results in compiled code that executes more quickly. When the compiler knows the exact data types that are in use, it can produce optimized machine code. Further, compilers for statically typed languages can find assembler shortcuts more easily. Some dynamically-typed languages such as Common Lisp allow optional type declarations for optimization for this very reason. Static typing makes this pervasive. See optimization.
By contrast, dynamic typing may allow compilers and interpreters to run more quickly, since changes to source code in dynamically-typed languages may result in less checking to perform and less code to revisit. This too may reduce the edit-compile-test-debug cycle.
Statically-typed languages which lack type inference (such as Java and C) require that programmers declare the types they intend a method or function to use. This can serve as additional documentation for the program, which the compiler will not permit the programmer to ignore or permit to drift out of synchronization. However, a language can be statically typed without requiring type declarations (examples include Scala and C#3.0), so this is not a necessary consequence of static typing.
Dynamic typing allows constructs that some static type checking would reject as illegal. For example, eval functions, which execute arbitrary data as code, become possible (however, the typing within that evaluated code might remain static). Furthermore, dynamic typing accommodates transitional code and prototyping, such as allowing a string to be used in place of a data structure. Recent enhancements to statically typed languages (e.g. Haskell Generalized algebraic data types) have allowed eval functions to be written in a type-safe way.[2]
Dynamic typing typically makes metaprogramming more effective and easier to use. For example, C++ templates are typically more cumbersome to write than the equivalent Ruby or Python code. More advanced run-time constructs such as metaclasses and introspection are often more difficult to use in statically-typed languages.
One definition of strongly typed involves preventing success for an operation on arguments which have the wrong type. A C cast gone wrong exemplifies the problem of absent strong typing; if a programmer casts a value from one type to another in C, not only must the compiler allow the code at compile time, but the runtime must allow it as well. This may permit more compact and faster C code, but it can make debugging more difficult.
Some observers use the term memory-safe language (or just safe language) to describe languages that do not allow undefined operations to occur. For example, a memory-safe language will check array bounds, or else statically guarantee (i.e., at compile time before execution) that array accesses out of the array boundaries will cause compile-time and perhaps runtime errors.
Weak typing means that a language implicitly converts (or casts) types when used. Revisiting the previous example, we have:
var x := 5; // (1) (x is an integer) var y := "37"; // (2) (y is a string) x + y; // (3) (?)
It is not clear what result one would get in a weakly typed language. Some languages, such as Visual Basic, would produce runnable code producing the result 42: the system would convert the string "37" into the number 37 to forcibly make sense of the operation. Other languages like JavaScript would produce the result "537": the system would convert the number 5 to the string "5" and then concatenate the two. In both Visual Basic and JavaScript, the resulting type is determined by rules that take both operands into consideration. In some languages, such as AppleScript, the type of the resulting value is determined by the type of the left-most operand only.
Reduction of operator overloading, such as not using "+" for string concatenation in addition to arithmetic addition, can reduce some of the confusion caused by weak typing. For example, PHP uses periods (.) for string concatenation, and similarly Ada uses the ampersand (&).
A third way of categorizing the type system of a programming language uses the safety of typed operations and conversions. Computer scientists consider a language "type-safe" if it does not allow operations or conversions which lead to erroneous conditions.
Let us again have a look at the pseudocode example:
var x := 5; // (1) var y := "37"; // (2) var z := x + y; // (3)
In languages like Visual Basic variable z in the example acquires the value 42. While the programmer may or may not have intended this, the language defines the result specifically, and the program does not crash or assign an ill-defined value to z. In this respect, such languages are type-safe.
Now let us look at the same example in C:
int x = 5; char y[] = "37"; char* z = x + y;
In this example z will point to a memory address five characters beyond y, equivalent to two characters after the terminating zero character of the string pointed to by y. The content of that location is undefined, and might lie outside addressable memory. The mere computation of such a pointer may result in undefined behavior (including the program crashing) according to C standards, and in typical systems dereferencing z at this point could cause the program to crash. We have a well-typed, but not memory-safe program — a condition that cannot occur in a type-safe language.
The term "polymorphism" refers to the ability of code (in particular, methods or classes) to act on values of multiple types, or to the ability of different instances of the same data-structure to contain elements of different types. Type systems that allow polymorphism generally do so in order to improve the potential for code re-use: in a language with polymorphism, programmers need only implement a data structure such as a list or an associative array once, rather than once for each type of element with which they plan to use it. For this reason computer scientists sometimes call the use of certain forms of polymorphism generic programming. The type-theoretic foundations of polymorphism are closely related to those of abstraction, modularity and (in some cases) subtyping.
In "duck typing," a statement calling a method on an object does not rely on the type of the object; only that the object, of whatever type, must implement the method called.
Duck typing differs from structural typing in that, if the part (of the whole module structure) needed for a given local computation is present at runtime, the duck type system is satisfied in its type identity analysis. On the other hand, a structural type system would require the analysis of the whole module structure at compile-time to determine type identity or type dependence.
Duck typing differs from a nominative type system in a number of aspects. The most prominent ones are that, for duck typing, type information is determined at runtime (as contrasted to compile-time) and the name of the type is irrelevant to determine type identity or type dependence; only partial structure information is required for that, for a given point in the program execution.
Initially coined by Alex Martelli in the Python community, duck typing uses the premise that (referring to a value) "if it walks like a duck, and quacks like a duck, then it is a duck".
Many type systems have been created that are specialized for use in certain environments, with certain types of data, or for out-of-band static program analysis. Frequently these are based on ideas from formal type theory and are only available as part of prototype research systems.
Dependent types are based on the idea of using scalars or values to more precisely describe the type of some other value. For example, "matrix(3,3)" might be the type of a 3×3 matrix. We can then define typing rules such as the following rule for matrix multiplication:
where k, m, n are arbitrary positive integer values. A variant of ML called Dependent ML has been created based on this type system, but because type-checking conventional dependent types is undecidable, not all programs using them can be type-checked without some kind of limitations. Dependent ML limits the sort of equality it can decide to Presburger arithmetic;other languages such as Epigram make the value of all expressions in the language decidable so that type checking can be decidable, it is also possible to make the language Turing complete at the price of undecidable type checking like in Cayenne .
Linear types, based on the theory of linear logic, and closely related to uniqueness types, are types assigned to values having the property that they have one and only one reference to them at all times. These are valuable for describing large immutable values such as strings, files, and so on, because any operation that simultaneously destroys a linear object and creates a similar object (such as 'str = str + "a"
') can be optimized "under the hood" into an in-place mutation. Normally this is not possible because such mutations could cause side effects on parts of the program holding other references to the object, violating referential transparency. They are also used in the prototype operating system Singularity for interprocess communication, statically ensuring that processes cannot share objects in shared memory in order to prevent race conditions. The Clean language (a Haskell-like language) uses this type system in order to gain a lot of speed while remaining safe.
Intersection types are types describing values that belong to both of two other given types with overlapping value sets. For example, in C the signed char has range -128 to 127 and the unsigned char has range 0 to 255, so the intersection type of these two types would have range 0 to 127. Such an intersection type could be safely passed into functions expecting either signed or unsigned chars, because it is compatible with both types.
Intersection types are useful for describing overloaded function types: For example, if "int → int" is the type of functions taking an integer argument and returning an integer, and "float → float" is the type of functions taking a float argument and returning a float, then the intersection of these two types can be used to describe functions that do one or the other, based on what type of input they are given. Such a function could be passed into another function expecting an "int → int" function safely; it simply would not use the "float → float" functionality.
In a subclassing hierarchy, the intersection of a type and an ancestor type (such as its parent) is the most derived type. The intersection of sibling types is empty.
The Forsythe language includes a general implementation of intersection types. A restricted form is refinement types.
Union types are types describing values that belong to either of two types. For example, in C, the signed char has range -128 to 127, and the unsigned char has range 0 to 255, so the union of these two types would have range -128 to 255. Any function handling this union type would have to deal with integers in this complete range. More generally, the only valid operations on a union type are operations that are valid on both types being unioned. C's "union" concept is similar to union types, but is not typesafe because it permits operations that are valid on either type, rather than both. Union types are important in program analysis, where they are used to represent symbolic values whose exact nature (eg, value or type) is not known.
In a subclassing hierarchy, the union of a type and an ancestor type (such as its parent) is the ancestor type. The union of sibling types is a subtype of their common ancestor (that is, all operations permitted on their common ancestor are permitted on the union type, but they may also have other valid operations in common).
Existential types are frequently used to represent modules and abstract data types because of their ability to separate implementation from interface. For example, in C pseudocode, the type "T = ∃X { X a; int f(X); }" describes a module interface that has a data member of type X and a function that takes a parameter of the same type X and returns an integer. This could be implemented in different ways; for example:
These types are both subtypes of the more general existential type T and correspond to concrete implementation types, so any value of one of these types is a value of type T. Given a value "t" of type "T", we know that "t.f(t.a)" is well-typed, regardless of what the abstract type X is. This gives flexibility for choosing types suited to a particular implementation while clients that use only values of the interface type — the existential type — are isolated from these choices.
Many static type systems, such as C's and Java's, require type declarations: The programmer must explicitly associate each variable with a particular type. Others, such as Haskell's, perform type inference: The compiler draws conclusions about the types of variables based on how programmers use those variables. For example, given a function f(x,y) which adds x and y together, the compiler can infer that x and y must be numbers – since addition is only defined for numbers. Therefore, any call to f elsewhere in the program that specifies a non-numeric type (such as a string or list) as an argument would signal an error.
Numerical and string constants and expressions in code can and often do imply type in a particular context. For example, an expression 3.14
might imply a type of floating-point, while [1, 2, 3]
might imply a list of integers – typically an array.
Type inference is only possible if it is decidable in the type theory in question. Haskell's type system, a version of Hindley-Milner, is a restriction of System Fω to so-called rank-1 polymorphic types, in which type inference is decidable. Most Haskell compilers allow arbitrary-rank polymorphism as an extension, but this makes type inference undecidable. (Type checking is decidable, however, and rank-1 programs still have type inference; higher rank polymorphic programs are rejected unless given explicit type annotations.)
A type of types is a kind. Kinds appear explicitly in typeful programming, such as a type constructor in the Haskell programming language.
Types fall into several broad categories:
A type-checker for a statically typed language must verify that the type of any expression is consistent with the type expected by the context in which that expression appears. For instance, in an assignment statement of the form x := e
, the inferred type of the expression e must be consistent with the declared or inferred type of the variable x
. This notion of consistency, called compatibility, is specific to each programming language.
If the type of e and the type of x
are the same and assignment is allowed for that type, then this is a valid expression. In the simplest type systems, therefore, the question of whether two types are compatible reduces to that of whether they are equal (or equivalent). Different languages, however, have different criteria for when two type expressions are understood to denote the same type. These different equational theories of types vary widely, two extreme cases being structural type systems, in which any two types are equivalent that describe values with the same structure, and nominative type systems, in which no two syntactically distinct type expressions denote the same type (i.e., types must have the same "name" in order to be equal).
In languages with subtyping, the compatibility relation is more complex. In particular, if A is a subtype of B, then a value of type A can be used in a context where one of type B is expected, even if the reverse is not true. Like equivalence, the subtype relation is defined differently for each programming language, with many variations possible. The presence of parametric or ad hoc polymorphism in a language may also have implications for type compatibility.
There are often conflicts between those who prefer statically-typed languages and those who prefer dynamically typed languages. The first group advocates for the early detection of type errors during compilation and increased runtime performance, while the latter group advocates for rapid prototyping that is possible with a more dynamic typing system and that type errors are only a small subset of errors in a program.[3][4]. Related to this is the consideration that often there is no need to manually declare all types in statically typed programming languages with type inference; thus, the need for the programmer to explicitly specify types of variables is automatically lowered for such languages; and some dynamic languages have run-time optimisers[5][6] that can generate fast code approaching the speed of static language compilers.