Thursday, June 30, 2011

Parsing:  Quick guide to strtok()

This post is sort of a continuation of my Serial Comm guide.

There are a variety of ways to parse C strings, all with their pros and cons.  One method uses the C library function strtok().  Its advantages are that it takes less code on your part to parse a string, and that it is a bit more robust than some other methods without additional coding.  Its disadvantages are that it mangles the C string it parses, and that the binary may be a bit larger than with other methods (important if you're running low on flash space, but the difference will be on the order of about 1500 bytes).

First, we assume you have some C string with data in it you need to parse out.  I'll be using an example from my Serial Comm guide, the Sparkfun Razor IMU output string, which will look something like:

!ANG:7,320,90

Here we have three angle values we are trying to parse out into a usable format, like an int.  To do that, we'll be using another C library function as well: atoi().  This function converts a C string into an int.  You pass it a char pointer to the portion of the string you want converted to an int, and it returns the int value of what you passed it.  Its usage will look something like:

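void setup(){
  char instring[] = "!ANG:7,320,90";
  //Skip the 5 character "!ANG:" header so we point at the first digit
  char* valPosition = instring + 5;
  //atoi stops converting at the first non-digit (the comma), so this is 7
  int value = atoi(valPosition);
  Serial.begin(115200);
  Serial.println(value);
}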

This is a contrived example that is of little real use, but it does show the usage of atoi().  We will now use strtok() to break up our string into its discrete tokens, and use atoi() to convert those token values to ints.  strtok() takes two parameters and returns a char* to the next token in the string.

The first parameter is the string you want to tokenize.  It's important to note that you only pass strtok a pointer to this string once.  It retains this pointer internally on subsequent calls, and returns a pointer to the next token in the string, or NULL when it reaches the end of the string.  The second parameter is a list of delimiting characters.  For our Razor IMU, these delimiters would be the exclamation point, colon, and comma.  Some example usage:
 
void setup(){
  char instring[] = "!ANG:7,320,90";
  char delimiters[] = "!:,";
  char* valPosition;
 
  //This initializes strtok with our string to tokenize
  valPosition = strtok(instring, delimiters);
 
  Serial.begin(115200);
 
  while(valPosition != NULL){
    Serial.println(valPosition);
    //Here we pass in a NULL value, which tells strtok to continue working with the previous string
    valPosition = strtok(NULL, delimiters);
  }
 
}

Using strtok is typically a two-step process.  The first call to strtok is our initialization call.  We pass it the string we want to tokenize, and it passes back a pointer to the first token.  It also stores its position in the string internally for the second step of the process.

This second step is typically a loop of some sort that repeatedly calls strtok.  In this case, we check to see if our return value is NULL or not.  If it's NULL, strtok has finished tokenizing our string.  If it isn't NULL, we make another call to strtok.

This code will provide the following output:

ANG
7
320
90
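
The mangling mentioned at the top is easy to see once the loop has finished.  Here is a small sketch in plain C (so it can be tried on a PC as well as on an Arduino) that tokenizes a buffer and reports how many tokens it found.  The function name count_tokens is made up for this illustration; only strtok() comes from the C library.

```c
#include <string.h>

/* Hypothetical helper for illustration: tokenizes buf in place and
   returns how many tokens strtok() found.  buf must be writable,
   because strtok() overwrites each delimiter that ends a token
   with '\0'. */
int count_tokens(char *buf, const char *delimiters)
{
    int count = 0;
    char *token = strtok(buf, delimiters);

    while (token != NULL) {
        count++;
        token = strtok(NULL, delimiters);
    }
    return count;
}
```

Calling count_tokens() on a buffer holding "!ANG:7,320,90" with the delimiters "!:," finds 4 tokens.  Afterwards strlen() on that buffer reports 4 instead of 13, because the ':' after ANG has been overwritten with a terminator.  If the original line is needed later, copy it into a second buffer with strcpy() before tokenizing.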

So the next step is to use atoi to convert our tokens into actual int values.  Let's just start with a couple additional lines of code:

void setup(){
  char instring[] = "!ANG:7,320,90";
  char delimiters[] = "!:,";
  char* valPosition;
 
  valPosition = strtok(instring, delimiters);
  int angle;
 
  Serial.begin(115200);
 
  while(valPosition != NULL){
    angle = atoi(valPosition);
    Serial.println(angle);
    valPosition = strtok(NULL, delimiters);
  }
 
}

Here we declare an int variable called angle to hold our return value from atoi, and we make a call to atoi() in our while loop.  We are also printing out our new angle value.  The output of this code is similar to our previous code:


0
7
320
90

Our angle values are all correct, but the first line is the result of calling atoi() on a non-numeric string.  We don't really want to convert our ANG token.  It isn't a numeric value so atoi just returns zero.  We have two options here.  The first is to change our code to ignore the first token in our string.  Another is to change our input string to get rid of ANG altogether.  If you recall from the Serial Comm guide, I talked about being able to use either ! or : for the start character with the Razor IMU.  It made little difference to the serial comm code, but here it can simplify things a bit.  So let's look at some code that goes that route:

void setup(){
  char instring[] = ":7,320,90";
  char delimiters[] = "!:,";
  char* valPosition;
 
  valPosition = strtok(instring, delimiters);
  int angle;
 
  Serial.begin(115200);
 
  while(valPosition != NULL){
    angle = atoi(valPosition);
    Serial.println(angle);
    valPosition = strtok(NULL, delimiters);
  }
 
}
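
For comparison, the first option (keeping the !ANG: header and ignoring the first token) could look something like the sketch below.  It is plain C using printf() so it can be tried on a PC; on the Arduino you would swap printf() for Serial.println().  The function name print_angles_skip_label is made up for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: prints every value after the leading label
   token and returns how many values were converted. */
int print_angles_skip_label(char *line)
{
    const char delimiters[] = "!:,";
    int count = 0;

    /* The first call initializes strtok and hands back the label
       ("ANG").  We deliberately never pass it to atoi(). */
    char *valPosition = strtok(line, delimiters);
    if (valPosition == NULL) {
        return 0;   /* empty line, nothing to parse */
    }

    /* Every later call returns one of the numeric tokens. */
    valPosition = strtok(NULL, delimiters);
    while (valPosition != NULL) {
        int angle = atoi(valPosition);
        printf("%d\n", angle);
        count++;
        valPosition = strtok(NULL, delimiters);
    }
    return count;
}
```

The cost is one extra strtok() call per line.  The code just before this aside avoids that by changing the input string instead.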

All we've done is change our input string to get rid of the ANG token, which never changes anyway and so carries no real value.  This code only prints each value of angle, though, and each previous value is lost when the next one is converted.  What we really want is to record all three values, presumably to perform some calculations on them afterwards.  The best way to handle this is with an array of ints, something like this:

void setup(){
  char instring[] = ":7,320,90";
  char delimiters[] = "!:,";
  char* valPosition;
 
  valPosition = strtok(instring, delimiters);
  int angle[] = {0, 0, 0};
 
  Serial.begin(115200);
 
  for(int i = 0; i < 3; i++){
    angle[i] = atoi(valPosition);
    Serial.println(angle[i]);
    valPosition = strtok(NULL, delimiters);
  }

}

The first change is our declaration of angle.  We now declare it as an array by suffixing it with [], and initialize it to three elements all equal to zero.

We then replace our while loop with a for loop.  Our input string is of a known, specific format, with three elements, so we loop 3 times, storing each successive value into the next element of our array.  When it's all done, we have all three of our values stored in our array.  A couple of notes here though.  Because our array was declared inside setup(), once setup() exits, our array is gone.  Also, it's a good idea to add some code to check for a NULL value returned from strtok().  With a well-formed string like this one the code works without it, but if a token is missing, strtok() returns NULL, and passing NULL to atoi() is undefined behavior in C.  It may happen to return zero on an AVR, but that isn't something to rely on.  If you're going to utilize these values for additional calculations, you'll probably want to completely discard anything that didn't fully parse properly to avoid processing garbage data.
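
One way to add that NULL check is sketched below, again in plain C so it can be tried off the board.  The names parse_angles and NUM_ANGLES are made up for this illustration.  The caller passes in the array, so the values are still around after the function returns.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NUM_ANGLES 3

/* Hypothetical helper: fills angles[] from a line such as ":7,320,90".
   Returns true only if all NUM_ANGLES tokens were present, so the
   caller can throw away an incomplete line. */
bool parse_angles(char *line, int angles[NUM_ANGLES])
{
    const char delimiters[] = "!:,";
    char *valPosition = strtok(line, delimiters);

    for (int i = 0; i < NUM_ANGLES; i++) {
        if (valPosition == NULL) {
            return false;   /* ran out of tokens early */
        }
        angles[i] = atoi(valPosition);
        valPosition = strtok(NULL, delimiters);
    }
    return true;
}
```

A caller would only use the array when parse_angles() returns true.  Note that this only catches missing tokens.  atoi() still can't tell a genuine "0" from a token that isn't a number at all, so a garbled but complete line would slip through.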


A Note on converting strings to other types:
atoi() is a useful function for converting a numeric string into an int, and there are other library functions available for some of the other types as well.
atof() can be used to convert float values (it technically returns a double, but doubles and floats are the same on 8-bit AVRs anyway).
atol() returns a long value.
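
As a rough sketch of how these drop into the same strtok() loop, here is a plain C version that converts tokens with atof().  The function name parse_doubles and the input string used to try it are made up for this illustration, not actual Razor output.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts up to max delimited tokens with
   atof() and returns how many were stored in out[]. */
int parse_doubles(char *line, double out[], int max)
{
    const char delimiters[] = "!:,";
    int count = 0;
    char *valPosition = strtok(line, delimiters);

    while (valPosition != NULL && count < max) {
        out[count] = atof(valPosition);
        count++;
        valPosition = strtok(NULL, delimiters);
    }
    return count;
}
```

Given a buffer holding ":7.5,320.25,90", this stores 7.5, 320.25 and 90.0.  Swapping atof() for atol() and double for long gives the same loop for whole numbers too large for an int, which is only 16 bits on these chips.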

These are all part of the AVR Libc package, which provides most of the standard C library functions for the AVR 8-bit micros.  More details can be found at their homepage here:

AVR Libc Home Page