strlen

From Wikipedia, the free encyclopedia

In the C standard library, strlen is a string function that determines the length of a character string.

Contents

[edit] Example usage

#include <stdio.h>
#include <string.h>
 
int main(void)
{
    char *string = "Hello World";
    printf("%d\n", strlen(string));
    return 0;
}

This program will print the value 11, which is the length of the string "Hello World". Character strings are stored in an array of a data type called char. The end of a string is found by searching for the first null character in the array.

[edit] Implementation

There are many possible implementations of strlen. The native functions are generally fast; but other methods do exist. The function is usually supplied from a library. Two things are worth keeping in mind though:

  • Good compilers will often optimize calls to strlen() with constant strings arguments, doing the calculation at compile time.
  • No optimization can make strlen() fast for significantly large inputs, so storing the length of the "C-strings" is generally the recommended fix.

[edit] Naive

A possible implementation of strlen might be:

size_t strlen(char *str)
{
    char *s;
    if (str == NULL) {
        return SOME_ERROR_VALUE; /* invalid string */
    }
 
    for (s=str; *s; s++);
    return s-str;
}

The above implementation reads the string from beginning to end to search for a NUL character. It stops when it sees a NUL character, and the difference between the pointers of the location of the NUL character and the beginning of the string is returned.

[edit] Multi-Byte approach

This approach uses the trick of checking more bytes at once. The former approach uses just one-byte checking. Note that this approach is explained on unsigned values. Note that this approach checks groups of bytes by reading them into a variable. It is up to programmer to handle proper memory alignment and availability of size of that variable. Yet you may possibly read from invalid memory address on the last read. So this is not safe without solving those things (aligning allocated strings on searching window, which is done already by malloc, and length being multiply of it).

If there is a zero byte might be checked by understanding how numbers are represented in binary. Each character is on, lets presume 8 bits, byte. Certain number of bytes, lets presume 4, form a word. And that word will be the group we're checking for a 0'th byte.

What properties does this byte have? All bits set to 0. Others have some of bits set to 1. This is just that any number greater than 0 is not nul byte. What do we want to achieve? See if there is any zero byte, and if there is none we must have same answer for all possible variations of other non-nul bytes. This is just that we want some way to map those to same pattern or number, which if any byte is 0 will be broken.

If we were able to map all non-0 bytes to one value and 0 byte to other value it would be solved.

value1 = (value & 0x7f) + 0x7f; // value = (value & 127) + 127

This code produces value >= 128 (0x80) for values from interval <1, 127> by mapping it to <1+127, 127+127>.

And (&) operation is used to exclude bit of value 128, for it causes numerical overflow. And this value is therefore not handled so far. But since we excluded this value we need to do something about it. We simply add it, for it is just another value which has to distinguished from 0 value. And therefore it has to be treated the same way.

value2 = value1 | (value & 128); // add 128 value bit
value2 = value1 |  value; // or just add all bits, including 128 value bit

Now we have made all values which are > 0 set bit value 128.

is_zero = (value2 & 128) == 128; // true or false

This way we check whether the value was >= 0. Multibyte solution should be obvious.

#include <stdlib.h>
size_t strlen (char *ptr)
{
    uint32_t* ip;
    char*     s;
 
    // NOTE: this check is rarely done in C libraries.
    if (NULL == ptr)
       return ERROR_VALUE; //or 0
 
    // Handle alignment here somehow if not handled within string.
    // String has to be aligned on integer boundary
    // or some architectures might behave unexpectedly!
    // NOTE: This is also required so that you don't overflow the
    // bounds of the memory by looking at bytes beyond the end of
    // the C-string.
    ip = (uint32_t *) ptr;
 
    // multi-skip (less times around the loop)
    while (1) { /* check for all zero */
        uint32_t x = *ip; /* check for not-all zero */
        if (((((x & 0x7f7f7f7f) + 0x7f7f7f7f7f) | x) & 0x80808080) != 0x80808080)
            break;
        ++ip; /* moves forward 4 bytes at once */
    }
 
    // count any remaining bytes
    for (s = (char*)ip; 0 != *s; ++s)
        /* do nothing */ ;
 
    return s - ptr;
}

This can be found in Hackmem.

[edit] Assembly

Sometimes the header files for a particular C library emit fast inline versions of strlen written in assembly. The compiler may also do this; or the header files may simply call compiler built-in versions.

Writing strlen in assembly is primarily done for speed. The complexity of code emitted from a compiler is often higher than hand-optimized assembly, even for very short functions. Further, a function call requires setting up a proper call frame on most implementations; these operations can outweigh the size of simple functions like strlen.

[edit] External links

Languages