C++ string handling

The C++ programming language has support for string handling, mostly implemented in its standard library. The language standard specifies several string types, some inherited from C, some newly designed to make use of the language's features, such as templates and the RAII resource management idiom.

Since the initial versions of C++ had only the "low-level" C string handling functionality and conventions, multiple incompatible string handling classes have been designed over the years, and C++ programmers may need to handle multiple conventions in a single application.

History

The std::string type is the main string datatype in standard C++ since 1998, but it was not always part of C++, and still is not the only standard string type: from C, C++ inherited the convention of using null-terminated strings that are handled by a pointer to their first element, and a library of functions that manipulate such strings. In modern standard C++, a string literal such as "hello" still denotes a NUL-terminated array of characters[1] and std::string has support for converting itself to such an array.[2]
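The relationship between string literals, C-style strings and std::string can be illustrated with a minimal sketch. The helper names c_length, round_trip and literal_demo are hypothetical and exist only for this illustration; the standard facilities used are std::strlen and std::string::c_str.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: length of a C-style string as seen by the C library.
std::size_t c_length(const char* s)
{
    return std::strlen(s);
}

// Hypothetical helper: round-trips a std::string through its C-style view.
std::string round_trip(const std::string& s)
{
    const char* raw = s.c_str();  // NUL-terminated view, owned by s
    return std::string(raw);      // copies the characters before the NUL
}

void literal_demo()
{
    // The literal "hello" is an array of 6 chars: 5 letters plus the NUL.
    static_assert(sizeof("hello") == 6, "literal includes the terminator");

    std::string s = "hello";
    assert(s.size() == 5);             // size() does not count the terminator
    assert(s.c_str()[5] == '\0');      // c_str() always ends with a NUL
    assert(c_length(s.c_str()) == 5);  // C functions can read the result
}
```

Note that round_trip stops at the first NUL, so a std::string with embedded NUL characters would not survive the conversion unchanged.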

In a 1991 retrospective on the history of C++, its inventor Bjarne Stroustrup called the lack of a standard string type (and some other standard types) in C++ 1.0 the worst mistake he made in its development; "the absence of those led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes".[3] Over the years, C++ application, library and framework developers produced their own, incompatible string representations, such as the one in AT&T's Standard Components library (the first such implementation, 1983)[3] or the CString type in Microsoft's MFC.[4] While std::string standardized strings, legacy applications still commonly contain such custom string types and libraries may expect C-style strings, making it "virtually impossible" to avoid using multiple string types in C++ programs[1] and requiring programmers to decide on the desired string representation ahead of starting a project.[4]

Implementation issues

The various vendors' string types have different implementation strategies and performance characteristics. In particular, some string types use a copy-on-write strategy, where an operation such as

string a = "hello!";
string b = a; // Copy constructor

does not actually copy the content of a to b; instead, both strings share their contents and a reference count on the content is incremented. The actual copying is postponed until a mutating operation, such as appending a character to either string, makes the strings' contents differ.

Though std::string no longer uses it, third-party string libraries may still implement copy-on-write strings; Qt's QString is an example.[5]
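The copy-on-write idea can be sketched in a few lines. The CowString class below is hypothetical and deliberately simplified: it is single-threaded, uses std::shared_ptr for the reference count, and does not reflect the internals of any particular vendor's string type.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, single-threaded copy-on-write string for illustration only.
class CowString {
public:
    CowString(const char* text)
        : data_(std::make_shared<std::string>(text)) {}

    // The compiler-generated copy constructor copies only the shared_ptr,
    // so a copy shares the buffer and increments the reference count.

    // Read access never copies.
    const std::string& str() const { return *data_; }

    // Mutating operation: take a private copy first if the buffer is shared.
    void push_back(char c)
    {
        detach();
        data_->push_back(c);
    }

    bool shares_buffer_with(const CowString& other) const
    {
        return data_ == other.data_;
    }

private:
    void detach()
    {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);
        }
    }

    std::shared_ptr<std::string> data_;
};

void cow_demo()
{
    CowString a = "hello!";
    CowString b = a;                    // no character data copied yet
    assert(a.shares_buffer_with(b));

    b.push_back('?');                   // the actual copy happens here
    assert(!a.shares_buffer_with(b));
    assert(a.str() == "hello!");
    assert(b.str() == "hello!?");
}
```

A production implementation would also need to handle concurrent access to the reference count, which this sketch ignores.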

Also, third-party string implementations may store 16-bit or 32-bit code points instead of bytes, in order to facilitate processing of Unicode text.[5] However, this means that conversion to these types from std::string or from arrays of bytes is slow and often lossy, depends on the "locale", and can throw exceptions.

Standard string types

The std::string class has been the standard representation for a text string since C++98. Compared to C-style strings (NUL-terminated arrays) and the associated standard functions, this class offers the benefits of automated memory management and a reduced risk of out-of-bounds accesses.[6] The class provides typical string operations such as comparison, concatenation, find and replace, and a function for obtaining substrings. A std::string can be constructed from a C-style string, and a C-style string can also be obtained from one.
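The operations mentioned above can be exercised as in the following sketch. The helpers replace_first, index_is_rejected and operations_demo are hypothetical names chosen for this illustration; the member functions they call (find, replace, substr, at) are part of std::string.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: replaces the first occurrence of `from` in `text`.
std::string replace_first(std::string text,
                          const std::string& from,
                          const std::string& to)
{
    std::string::size_type pos = text.find(from);
    if (pos != std::string::npos) {
        text.replace(pos, from.size(), to);
    }
    return text;
}

// Hypothetical helper: reports whether the bounds-checked at() rejects i.
bool index_is_rejected(const std::string& s, std::string::size_type i)
{
    try {
        (void)s.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

void operations_demo()
{
    std::string a = "bar";
    std::string b = "stool";

    assert(a < b);                         // lexicographic comparison
    assert(a + b == "barstool");           // concatenation
    assert((a + b).substr(3) == "stool");  // substring starting at index 3
    assert(replace_first("barstool", "stool", "tender") == "bartender");
    assert(index_is_rejected(a, 3));       // at() throws instead of overrunning
}
```

Unlike at(), operator[] performs no bounds check, so the reduced risk of out-of-bounds access depends on which accessor is used.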

The individual units making up the string are of type char, at least (and almost always) 8 bits each. In modern usage these are often not "characters", but parts of a multibyte character encoding such as UTF-8.
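The difference between char units and characters can be seen by counting both. The sketch below assumes its input is valid UTF-8; utf8_code_points and utf8_demo are hypothetical helpers, not standard library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in text assumed to be valid UTF-8
// by skipping continuation bytes, which have the bit pattern 10xxxxxx.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

void utf8_demo()
{
    // "café" in UTF-8: the final letter occupies the two bytes C3 A9.
    std::string cafe = "caf\xC3\xA9";
    assert(cafe.size() == 5);             // size() counts char units
    assert(utf8_code_points(cafe) == 4);  // four code points
}
```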

The copy-on-write strategy was deliberately allowed by the initial C++ Standard for std::string because it was deemed a useful optimization, and it was used by nearly all implementations.[2] However, the design had flaws: for instance, operator[] returns a non-const reference and must therefore be treated as potentially mutating, even after the operator has returned (the caller can legally store the reference and modify the byte after the string has been copied).[7][8] This caused some implementations to abandon copy-on-write. Performance problems in multi-threaded applications, due to the locking needed to examine or change the reference count, were soon pointed out.[9] The optimization was finally disallowed in C++11,[7] with the result that even passing a std::string as an argument to a function, viz.

void print(std::string s)
{
    std::cout << s;
}

must be expected to perform a full copy of the string into newly allocated memory. The common idiom to avoid such copying is to pass by const reference:

void print(std::string const &s)
{
    std::cout << s;
}
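The difference between the two signatures can be observed by comparing object addresses. In this sketch, same_object_by_value and same_object_by_reference are hypothetical helpers that report whether the parameter is the caller's own object.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: `s` is a separate std::string object, copy-constructed
// from the argument, so its address differs from the original's.
bool same_object_by_value(std::string s, const std::string* original)
{
    return &s == original;
}

// Hypothetical helper: `s` is an alias for the caller's object, so no copy
// of the character data is made.
bool same_object_by_reference(const std::string& s, const std::string* original)
{
    return &s == original;
}

void parameter_demo()
{
    std::string text = "hello";
    assert(!same_object_by_value(text, &text));
    assert(same_object_by_reference(text, &text));
}
```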

Example usage

#include <iostream>
#include <string>

int main()
{
    // Literals (double-quoted text) denote C strings, but std::string
    // instances can be initialized from such literals.
    std::string foo = "fighters";
    std::string bar = "stool";

    // The operator != compares string contents for inequality. This is different
    // from != on char pointers to C strings, where != would compare the memory
    // addresses of the strings rather than their contents.
    if (foo != bar) {
        std::cout << "The strings are different." << std::endl;
    }

    // Prints "stool fighters". The + operator denotes string concatenation.
    std::cout << (bar + " " + foo) << std::endl;

    return 0;
}

std::string is a typedef for a particular instantiation of the std::basic_string template class.[10] Its definition is found in the <string> header:

typedef basic_string<char> string;

Thus string provides basic_string functionality for strings having elements of type char. There is a similar class std::wstring, whose elements are of type wchar_t, and which is most often used to store UTF-16 text on Windows and UTF-32 on most Unix-like platforms. The C++ standard, however, does not impose any interpretation as Unicode code points or code units on these types and does not even guarantee that a wchar_t holds more bits than a char.[11] To resolve some of the incompatibilities resulting from wchar_t's properties, C++11 added two new classes: std::u16string and std::u32string (made up of the new types char16_t and char32_t), whose code units have the given number of bits on all platforms.[12] C++11 also added new string literals of 16-bit and 32-bit "characters" and syntax for putting Unicode code points into null-terminated (C-style) strings.[13]
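The C++11 string types and literal prefixes can be illustrated as follows. The functions emoji_utf16, emoji_utf32 and unicode_demo are hypothetical; the sketch uses the code point U+1F600, which lies outside the Basic Multilingual Plane and therefore needs two UTF-16 code units.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Each standard string type is an instantiation of basic_string.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "string is basic_string<char>");
static_assert(std::is_same<std::u16string, std::basic_string<char16_t>>::value,
              "u16string is basic_string<char16_t>");
static_assert(std::is_same<std::u32string, std::basic_string<char32_t>>::value,
              "u32string is basic_string<char32_t>");

// The u prefix yields a char16_t literal, the U prefix a char32_t literal.
// \U0001F600 is the escape syntax for the code point U+1F600.
std::u16string emoji_utf16() { return u"\U0001F600"; }
std::u32string emoji_utf32() { return U"\U0001F600"; }

void unicode_demo()
{
    assert(emoji_utf32().size() == 1);       // one 32-bit code unit
    assert(emoji_utf32()[0] == 0x1F600);
    assert(emoji_utf16().size() == 2);       // a surrogate pair
    assert(emoji_utf16()[0] == 0xD83D);
    assert(emoji_utf16()[1] == 0xDE00);
}
```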

A basic_string is guaranteed to be specializable for any type with a char_traits struct to accompany it. As of C++11, only char, wchar_t, char16_t and char32_t specializations are required to be implemented in the standard library; any other types are implementation-defined.[14] Each specialization is also a Standard Library container, and thus the Standard Library algorithms can be applied to the code units in strings.
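Because a string exposes iterators like other containers, generic algorithms from the <algorithm> header work on it directly. The helpers reversed, count_char and upper_ascii below are hypothetical wrappers written for this sketch; upper_ascii assumes plain ASCII input in the default "C" locale.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the code units of s in reverse order.
std::string reversed(std::string s)
{
    std::reverse(s.begin(), s.end());
    return s;
}

// Hypothetical helper: counts occurrences of the code unit c.
std::size_t count_char(const std::string& s, char c)
{
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), c));
}

// Hypothetical helper: upper-cases each code unit (ASCII input assumed).
std::string upper_ascii(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

void algorithm_demo()
{
    assert(reversed("stool") == "loots");
    assert(count_char("hello", 'l') == 2);
    assert(upper_ascii("stool") == "STOOL");
}
```

These algorithms operate on code units, so reversing a string that holds multibyte UTF-8 text would scramble its characters.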

Critiques

The design of std::string has been held up as an example of monolithic design by Herb Sutter, who reckons that of the 103 member functions on the class in C++98, 71 could have been decoupled without loss of implementation efficiency.[15]
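What decoupling means in practice can be suggested with a small sketch. The free functions is_empty and find_char are hypothetical and are not taken from Sutter's article; they only show how behaviour equivalent to the members empty() and find(char) can be written against the public interface of the class.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical non-member counterpart of s.empty(), using only size().
bool is_empty(const std::string& s)
{
    return s.size() == 0;
}

// Hypothetical non-member counterpart of s.find(c), using only iterators
// and the generic std::find algorithm.
std::string::size_type find_char(const std::string& s, char c)
{
    std::string::const_iterator it = std::find(s.begin(), s.end(), c);
    return it == s.end()
               ? std::string::npos
               : static_cast<std::string::size_type>(it - s.begin());
}

void decoupling_demo()
{
    std::string s = "stool";
    assert(is_empty(s) == s.empty());
    assert(find_char(s, 'o') == s.find('o'));
    assert(find_char(s, 'z') == s.find('z'));
}
```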

References

  1. Seacord, Robert C. (2013). Secure Coding in C and C++. Addison-Wesley.
  2. Meyers, Scott (2012). Effective STL. Addison-Wesley. pp. 64–65.
  3. Stroustrup, Bjarne (1993). A History of C++: 1979–1991 (PDF). Proc. ACM History of Programming Languages Conf.
  4. Solter, Nicholas A.; Kleper, Scott J. (2005). Professional C++. John Wiley & Sons. p. 23.
  5. Blanchette, Jasmin; Summerfield, Mark (2008). C++ GUI Programming with Qt4. Pearson Education.
  6. Oualline, Steve (2003). Practical C++ Programming. O'Reilly.
  7. Meredith, Alisdair; Boehm, Hans; Crowl, Lawrence; Dimov, Peter (2008). "Concurrency Modifications to Basic String". ISO/IEC JTC 1/SC 22/WG 21. Retrieved 19 November 2015.
  8. "Bug 21334, comment 45". GCC Bugzilla. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21334#c45
  9. Sutter, Herb (1999). "Optimizations That Aren't (In a Multithreaded World)". C/C++ Users Journal. 17 (6).
  10. "C++ reference for basic_string". Cppreference.com. Retrieved 11 January 2011.
  11. Gillam, Richard (2003). Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard. Addison-Wesley Professional. p. 714.
  12. "C++11 Paper N3336". Open Standards. Programming Language C++, Library Working Group. 13 Jan 2012. Retrieved 2 Nov 2013.
  13. Stroustrup, Bjarne (2013). The C++ Programming Language. Addison Wesley. p. 179.
  14. "char_traits - C++ Reference". Retrieved 2015-08-01.
  15. Sutter, Herb. "Monoliths "Unstrung"". gotw.ca. Retrieved 23 November 2015.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.