String interning
In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
The single copy of each string is called its 'intern' and is typically looked up by a method of the string class, for example String.intern() in Java. All compile-time constant strings in Java are automatically interned using this method.[1]
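As a minimal sketch of the behaviour described above, the following Java program compares a string literal with a separately allocated copy, before and after calling String.intern(). The class name InternDemo and the helper sameObject are illustrative names chosen for this example, not part of any library.

```java
final class InternDemo {
    // Reference comparison: true only if both variables point to one object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "hello";            // compile-time constant, interned automatically
        String built = new String("hello");  // deliberately a distinct object with equal contents

        System.out.println(literal.equals(built));               // true: same characters
        System.out.println(sameObject(literal, built));          // false: two objects
        System.out.println(sameObject(literal, built.intern())); // true: intern() returns the pooled copy
    }
}
```

The explicit new String(...) call is used here only to force a second object; ordinary code rarely needs it.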
String interning is supported by some modern object-oriented programming languages, including Python, Ruby (with its symbols), Java and .NET languages. Lisp, Scheme, and Smalltalk are among the languages with a symbol type that is essentially an interned string. The library of the Standard ML of New Jersey contains an atom type that does the same thing. Objective-C's selectors, which are mainly used as method names, are interned strings. In .NET languages, Lua and JavaScript, string values are immutable and interned.[2][3]
Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any boolean, any byte, any char from 0 to 127, and any short or int between -128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object.[4]
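The boxing guarantee can be illustrated with a short sketch. The class BoxingCacheDemo and its method boxesToSameObject are hypothetical names used only for this example. No result is stated for the value 128 because it lies outside the guaranteed range and the outcome depends on the runtime.

```java
final class BoxingCacheDemo {
    // Boxes the same int twice and reports whether both boxes are one object.
    static boolean boxesToSameObject(int value) {
        Integer first = value;   // boxing conversion
        Integer second = value;  // a second, independent boxing conversion
        return first == second;  // reference comparison, not numeric comparison
    }

    public static void main(String[] args) {
        System.out.println(boxesToSameObject(127));   // true: inside the cached range
        System.out.println(boxesToSameObject(-128));  // true: inside the cached range
        // Outside -128..127 the result is implementation-dependent.
        System.out.println(boxesToSameObject(128));
    }
}
```

Because values outside the range may or may not share an object, wrapper objects should generally be compared with equals() rather than ==.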
History
Lisp introduced the notion of interned strings for its symbols. Historically, the data structure used as a string intern pool was called an 'oblist' (when it was implemented as a linked list) or an 'obarray' (when it was implemented as an array).
Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose name is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.
Motivation
String interning speeds up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fill up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
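The trade-off described above, paying for one hash lookup at intern time so that later comparisons reduce to an identity test, can be sketched with a small hand-written pool. SimpleInternPool is an illustrative class assumed for this example; it is not how any particular runtime implements its pool, and it keeps strong references, so entries are never reclaimed.

```java
import java.util.HashMap;
import java.util.Map;

final class SimpleInternPool {
    // Maps each distinct string value to its single canonical instance.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the canonical instance for s, adding s to the pool if it is new.
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        SimpleInternPool pool = new SimpleInternPool();
        String a = pool.intern(new String("identifier"));
        String b = pool.intern(new String("identifier"));
        // After interning, equality can be decided by comparing references.
        System.out.println(a == b);       // true
        System.out.println(pool.size());  // 1
    }
}
```

The identity test is valid only when both operands have passed through the same pool; comparing an interned string with a non-interned one by reference can give false even when the contents match.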
String interning also reduces memory usage if there are many instances of the same string value, for instance when the strings are read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory. Network transfer of objects over Java RMI serialization object streams can transfer strings that are interned more efficiently, as the String object's handle is used in place of duplicate objects upon serialization.[5]
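The memory effect can be made concrete by counting how many distinct String objects survive when repeated tag names are read. This is a rough sketch: TagNameDedup and distinctObjects are hypothetical names, and new String(...) stands in for a parser allocating a fresh string for each occurrence.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class TagNameDedup {
    // Counts distinct String objects (by identity) produced for the given names.
    static int distinctObjects(String[] names, boolean intern) {
        // An identity-based set distinguishes objects even when their contents are equal.
        Set<String> identities = Collections.newSetFromMap(new IdentityHashMap<>());
        for (String name : names) {
            String parsed = new String(name); // simulate a freshly allocated parsed string
            identities.add(intern ? parsed.intern() : parsed);
        }
        return identities.size();
    }

    public static void main(String[] args) {
        String[] tags = {"div", "span", "div", "div", "span"};
        System.out.println(distinctObjects(tags, false)); // 5: one object per occurrence
        System.out.println(distinctObjects(tags, true));  // 2: one object per distinct value
    }
}
```

With interning, the temporary copies become unreachable immediately and only one instance per distinct name needs to be retained.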
External links
- Visual J# String class
- .NET String Class
- Guava Java Library - Interner - A non-permgen alternative to String.intern that also supports other immutable types, with weak- and strong-referenced implementations