Suite101

String Tokenizing in C Programming

A guide to using the strtok function

© Guy Lecky-Thompson

The strtok function declared in string.h can be very useful, but there are a few caveats to bear in mind. Some programmers avoid it; others find it indispensable.

Introduction

Breaking a string down into its component parts is one of those useful data processing tasks that is almost always required when reading and writing information in a program. It can also be tough to implement, so most string libraries provide a tokenizing function, and C is no different.

The C strtok function has received varying press over the years. It does have some caveats, but by and large it does what it says on the tin. Anyone who has browsed the open source string.h files provided with most implementations will, however, find comments instructing the programmer to avoid strtok whenever possible.

An Example - Processing CSV

Before we look at why this could be, we should first see what strtok actually does. In essence, it exists to tokenize a string - turn it into a set of sub-strings, based on the processing of 'separators'.

A typical separator is the comma. Indeed, if a file is exported from a spreadsheet application as CSV, each field is delimited by a comma, and each line is terminated by a newline (loosely, a carriage return).

To turn a line of text into a set of fields, two operations are therefore required:

  1. Swap the final carriage return for a null character
  2. Translate the comma delimited fields into a set of sub-strings

The first is required because the carriage return does not form part of the data of the last field, and strtok must be passed a null-terminated string that ends where the data ends. The code to do this might look like:

size_t nLength = strlen(szString);
// Guard against an empty string before looking at the last character
if (nLength > 0 && szString[nLength - 1] == '\n') {
    szString[nLength - 1] = '\0';
}

Of course, one might be tempted to also test for '\r' on the basis that each line could be terminated with a CR/LF combination, but that is system dependent and slightly out of scope. The tokenizing process itself might look like this:

char *tok = strtok(szString, ",");
while (tok != NULL) {
    // Do something with the tok
    tok = strtok(NULL, ",");
}

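The above will move through szString, dividing it into tokens, each delimited by a comma. We pass NULL as the first argument to strtok inside the loop because the function remembers its position in the string between calls; passing szString again would restart the scan from the beginning each time. Note that strtok does not keep a copy of the string: it keeps a hidden pointer into szString for the duration, and overwrites each separator it finds with a null character.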
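The two steps described so far can be combined into one small routine. The following is only a sketch: the names strip_line_ending and split_csv_line are invented for this illustration, and strcspn is used as an alternative way to cut the line at the first '\r' or '\n'.

```c
#include <string.h>

/* Cut the line at the first CR or LF, removing "\n" or "\r\n". */
void strip_line_ending(char *line)
{
    line[strcspn(line, "\r\n")] = '\0';
}

/*
 * Split a comma-separated line in place (hypothetical helper).
 * Pointers to the fields are stored in fields[], up to max_fields.
 * Returns the number of fields stored. The line must be writable,
 * because strtok overwrites each separator with '\0'.
 */
size_t split_csv_line(char *line, char *fields[], size_t max_fields)
{
    size_t count = 0;
    char *tok;

    strip_line_ending(line);
    tok = strtok(line, ",");
    while (tok != NULL && count < max_fields) {
        fields[count++] = tok;
        tok = strtok(NULL, ",");   /* NULL means: carry on where we left off */
    }
    return count;
}
```

Called on a writable buffer holding "apple,banana,cherry\n", this should store three fields. One thing to watch: strtok treats a run of separators as a single separator, so a line such as "a,,b" yields two tokens rather than three, and empty CSV fields are silently skipped.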

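A side effect of this is that the original string is modified as it is processed, and the function's hidden state stays set long after the program has finished with the string, since there is no guarantee that all the fields will be split. Because that state is shared, two tokenizing loops that overlap (or run in different threads) will interfere with each other. This makes strtok less than perfect, earning it a slightly dubious reputation.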
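Where the hidden state is a problem, one commonly used alternative is strtok_r, which makes the caller hold the position instead. It is a POSIX function rather than part of standard C, so its availability is an assumption here (Microsoft's C runtime offers a similar strtok_s). The sketch below uses an invented function, count_fields, to show two nested loops that would trip over each other with plain strtok.

```c
#define _POSIX_C_SOURCE 200809L  /* asks POSIX systems to declare strtok_r */
#include <string.h>

/*
 * Count the fields in text such as "a,b;c,d,e", where records are
 * separated by ';' and fields by ','. Each loop keeps its position in
 * its own pointer, so the inner loop does not disturb the outer one.
 * The text is modified in place, just as with strtok.
 */
size_t count_fields(char *text)
{
    size_t total = 0;
    char *record_state = NULL;   /* position of the outer loop */
    char *field_state = NULL;    /* position of the inner loop */
    char *record;
    char *field;

    for (record = strtok_r(text, ";", &record_state);
         record != NULL;
         record = strtok_r(NULL, ";", &record_state)) {
        for (field = strtok_r(record, ",", &field_state);
             field != NULL;
             field = strtok_r(NULL, ",", &field_state)) {
            total++;
        }
    }
    return total;
}
```

With plain strtok, the inner call on record would overwrite the position saved by the outer loop, so the outer loop could not find its next record afterwards.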

Changing the Separators

Another curio is that we can actually change the second argument between calls, in effect changing the separator or separators that we wish to use. In the above example, we could use this technique to absorb the rest of the line up to the carriage return by changing the separator for the final call, rather than setting the final character to a null.

To do this, of course, we would need to know the exact number of fields to be converted, so as to be able to stop at the appropriate point. It is, after all, only an example, and there are probably far better reasons to want to change separators mid-processing.
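As a sketch of the idea, assume a record with exactly three fields. The function name parse_three_fields and the record layout are made up for this example; only strtok itself comes from string.h.

```c
#include <string.h>

/*
 * Split a line holding exactly three fields, such as "42,widget,blue\n".
 * The first two calls split on ','. The last call switches the separator
 * to '\n', so the final field runs to the end of the line.
 * Returns 1 on success, 0 if a field is missing.
 */
int parse_three_fields(char *line, char *fields[3])
{
    fields[0] = strtok(line, ",");
    if (fields[0] == NULL) {
        return 0;
    }
    fields[1] = strtok(NULL, ",");
    if (fields[1] == NULL) {
        return 0;
    }
    fields[2] = strtok(NULL, "\n");   /* different separator, same string */
    return fields[2] != NULL;
}
```

A consequence of switching separators is that the last field may itself contain commas: given "42,widget,blue, large\n", the third field comes back as "blue, large".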

Safe Use of strtok

To safeguard our use of strtok, there are two things we can do:

  1. Check for null strings before the first call
  2. Check the string has been used up after the last call (a further call to strtok returns NULL)

The second is advisable, while the first is required: passing a null pointer as the string on the first call gives strtok nothing to continue from, and the behaviour is undefined.
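Both checks can be folded into a small wrapper. This is a minimal sketch under a few assumptions: the name count_tokens and the -1 error value are invented for the example, and the wrapper works on a private copy so that the caller's string is left untouched.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Count the tokens in text without modifying it (hypothetical helper).
 * Returns -1 if either argument is NULL or memory cannot be allocated.
 */
long count_tokens(const char *text, const char *separators)
{
    char *copy;
    char *tok;
    long count = 0;

    if (text == NULL || separators == NULL) {
        return -1;                      /* check 1: no null strings */
    }
    copy = malloc(strlen(text) + 1);
    if (copy == NULL) {
        return -1;
    }
    strcpy(copy, text);

    /* Check 2: keep going until strtok reports that nothing is left. */
    for (tok = strtok(copy, separators);
         tok != NULL;
         tok = strtok(NULL, separators)) {
        count++;
    }
    free(copy);
    return count;
}
```

Because the copy is freed at the end, any later tokenizing should begin with a fresh call that passes a string as the first argument, not NULL.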

Conclusion & Further Reading

So, strtok should come with a health warning, but it is not quite the beast that it can often be made out to be. Quite the reverse - with careful handling it is a very useful piece of functionality. The programmer just needs to be sure that they really need it...

Articles:

Using the C String Library

Books:

Just Enough C


The copyright of the article String Tokenizing in C Programming in Computer Programming is owned by Guy Lecky-Thompson. Permission to republish String Tokenizing in C Programming in print or online must be granted by the author in writing.




