String Tokenizing in C Programming

A guide to using the strtok function

The strtok function in the string.h library can be very useful, but there are a few caveats to bear in mind. Some avoid it, some find it indispensable.
Introduction

Breaking a string down into its component parts is one of those useful data processing tasks that is almost always required when reading and writing information in a program. It can also be tough to implement, so most string libraries provide a tokenizing function, and C is no different.

The C strtok function has received varying press over the years. It does have some caveats, but by and large it does what it says on the tin. Anyone who has browsed the open source string.h files provided with most implementations will, however, find comments instructing the programmer to avoid strtok whenever possible.

An Example - Processing CSV

Before we look at why this could be, we should first see what strtok actually does. In essence, it exists to tokenize a string: to turn it into a set of sub-strings, based on the processing of 'separators'. A typical separator is the comma. Indeed, if a file is exported from a spreadsheet application as CSV, each field is delimited by a comma, and each line is delimited by a newline character. To turn a line of text into a set of fields, two operations are therefore required:

1. Remove the newline from the end of the line.
2. Split the line into fields at each comma.
The first is required because the newline does not form part of the data of the last field, and we need to pass strtok a null-terminated string. The code to do this might look like:

    size_t nLength = strlen(szString);
    if (nLength > 0 && szString[nLength - 1] == '\n')
    {
        szString[nLength - 1] = '\0';
    }

Of course, one might be tempted to also test for '\r' on the basis that each line could be terminated with a CR/LF combination, but that is system dependent and slightly out of scope. The tokenizing process itself might look like this:

    char *tok = strtok(szString, ",");
    while (tok != NULL)
    {
        // Do something with tok
        tok = strtok(NULL, ",");
    }

The above will move through szString, dividing it into tokens, each delimited by a comma. We use NULL as the first argument to strtok in the loop because otherwise the function would start again from the beginning of szString each time: strtok remembers its position in the string in internal static storage between calls.

A side effect of this hidden state is that strtok is not reentrant. Starting to tokenize a second string before the first is finished, or calling strtok from more than one thread, overwrites the saved position. The function also modifies szString itself, writing a null character over each separator it finds, so it cannot be used on string literals or on data that must be preserved. This makes strtok less than perfect, earning it a slightly dubious reputation.

Changing the Separators

Another curio is that we can actually change the second argument between calls, in effect changing the separator or separators that we wish to use. In the above example, we could use this technique to absorb the rest of the line up to the newline by changing the separator once the loop has completed, rather than setting the final character to a null. To do this, of course, we would need to know the exact number of fields to be converted, so as to be able to stop at the appropriate point. It is, after all, only an example, and there are probably far better reasons to want to change separators mid-processing.

Safe Use of strtok

To safeguard our use of strtok, there are two things we can do:
The second is advisable, while the first is required, as strtok does not generally deal with null pointers very elegantly.

Conclusion & Further Reading

So, strtok should come with a health warning, but it is not quite the beast that it can often be made out to be. Quite the reverse: with careful handling it is a very useful piece of functionality. The programmer just needs to be sure that they really need it...
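To tie together the CSV steps described earlier in the article, here is a minimal sketch of the two operations (dropping the trailing newline, then splitting on commas) wrapped in one function. The function name split_csv_line, its parameters and the fields array are hypothetical and chosen purely for illustration; only strlen and strtok come from string.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of string.h.
 * Strips one trailing '\n' from szString, then splits it on commas.
 * Pointers to the fields are stored in fields[], up to maxFields of them.
 * Returns the number of fields stored.
 * szString is modified in place, so it must be a writable buffer. */
size_t split_csv_line(char *szString, char *fields[], size_t maxFields)
{
    size_t count = 0;
    size_t nLength;
    char *tok;

    /* Guard against a null pointer before strtok ever sees it. */
    if (szString == NULL || fields == NULL)
    {
        return 0;
    }

    /* Operation 1: lose the newline, which is not part of the last field. */
    nLength = strlen(szString);
    if (nLength > 0 && szString[nLength - 1] == '\n')
    {
        szString[nLength - 1] = '\0';
    }

    /* Operation 2: tokenize. The first call names the string,
     * later calls pass NULL to continue from the saved position. */
    tok = strtok(szString, ",");
    while (tok != NULL && count < maxFields)
    {
        fields[count++] = tok;
        tok = strtok(NULL, ",");
    }

    return count;
}
```

Called on a writable buffer such as char line[] = "alpha,beta,gamma\n"; with room for eight fields, this should store three pointers into line. One thing to be aware of is that strtok treats a run of adjacent separators as a single separator, so an empty CSV field (as in "a,,b") is silently skipped rather than returned as an empty string.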
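The separator-changing technique discussed in the article can be sketched along the following lines. This is only an illustration under stated assumptions: the function name split_fixed_then_rest and its parameters are hypothetical, and the caller is assumed to know how many comma-separated fields precede the free-form remainder.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of string.h.
 * Reads nFixed comma-separated fields from szString, then switches the
 * separator to "\n" so that the rest of the line (commas included)
 * becomes one final field.
 * fields[] must have room for nFixed + 1 pointers.
 * Returns the number of fields stored. szString is modified in place. */
size_t split_fixed_then_rest(char *szString, size_t nFixed, char *fields[])
{
    size_t count = 0;
    const char *sep;
    char *tok;

    if (szString == NULL || fields == NULL)
    {
        return 0;
    }

    /* With no fixed fields, the whole line is the remainder. */
    sep = (nFixed > 0) ? "," : "\n";

    tok = strtok(szString, sep);
    while (tok != NULL)
    {
        fields[count++] = tok;
        if (count > nFixed)
        {
            break;  /* the remainder has just been stored */
        }
        /* Keep using commas until the fixed fields are done,
         * then change the separator for the final call. */
        sep = (count < nFixed) ? "," : "\n";
        tok = strtok(NULL, sep);
    }

    return count;
}
```

Because the final call uses the newline as its separator, there is no need to overwrite the last character with a null beforehand. For a buffer holding "Smith,42,likes tea, coffee\n" and nFixed set to 2, the third field should come back as the text after the second comma, with its embedded comma intact.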
The copyright of the article String Tokenizing in C Programming in Computer Programming is owned by Guy Lecky-Thompson. Permission to republish String Tokenizing in C Programming in print or online must be granted by the author in writing.