Tokenize strings help

klose86 · Aug 1, 2006

Hi.
I'm writing a program that evaluates expressions, much like the scanner and parser in a compiler. Now I've run into this problem:
I need to check for assignment and comparison expressions, that is, I need to look in the text for symbols like: =, <, >, >=, <=, ==, etc.
The problem comes when I find those symbols right beside a token, for example:
var= a;
or var =a;
instead of var = a; (see the space between the characters)
And same with the other operators:
a>b
instead of a > b

When that happens, the token I get is "var=" or "a>", which is not what I need. I need to get "var" "=" "a", wherever the operator is written (right, center or left). Any idea how I can do that?

BTW: I'm saving each token I get from the file in a singly linked list.
THANKS.

ArkM · Aug 1, 2006

Well, start your own scanner (on char by char basis, of course) then we can help you to proceed...

dEVooXiAm · Aug 2, 2006

Simply DO NOT count on spaces, but on character types instead: alpha characters (A-Z, a-z, _), numbers (0-9), special characters (=, >, <, etc.) and finally whitespace, that is, space or tab. Then construct a TOKEN every time the type changes.

------------------
When you do it, do it right.

cpjust · Aug 2, 2006

You could also write a pre-parser, that searches for those assignment/comparison operators and adds spaces before and after them if necessary.
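
A minimal sketch of such a pre-parser, assuming the operator characters are limited to =, <, > and ! and that the caller supplies the output buffer. The name pad_operators is made up for illustration and does not come from anyone's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed operator characters; extend as the grammar requires. */
static const char OPS[] = "=<>!";

/*
 * Copy src into dst, putting one space before and after every run of
 * operator characters, so that "var>=a" becomes "var >= a".
 * Returns 0 on success, -1 if dst (dst_size bytes) is too small.
 */
int pad_operators(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (strchr(OPS, *src) != NULL) {
            /* Length of the whole operator run, e.g. 2 for ">=". */
            size_t run = strspn(src, OPS);
            if (n + run + 2 >= dst_size)
                return -1;          /* no room for " op " plus the NUL */
            dst[n++] = ' ';
            memcpy(dst + n, src, run);
            n += run;
            dst[n++] = ' ';
            src += run;
        } else {
            if (n + 1 >= dst_size)
                return -1;          /* no room for the char plus the NUL */
            dst[n++] = *src++;
        }
    }
    if (n >= dst_size)
        return -1;
    dst[n] = '\0';
    return 0;
}
```

After padding, any whitespace-based splitter sees the operator as its own token. Extra spaces (for example when the input already had one) are harmless as long as the splitter skips repeated whitespace.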

cdlvj · Aug 3, 2006

Look at the strtok function.

cpjust · Aug 3, 2006

strtok() won't help him if there is no whitespace between the var and '='...

Salem · Aug 3, 2006

Create several character set tests
matchIdent - A-Z, a-z, 0-9, _
matchDigit - 0-9
matchSpace - space, tab, newline, formfeed, vertical tab
matchOp - <, >, =, !, &, |

The first character which matches determines what the sequence which follows should all match.

So reading "var >=a" would be
- read v, which matches with matchIdent, so keep progressing until matchIdent is false (this gets you "var")
- read a space, so keep matching with matchSpace (" ")
- read >, matches with matchOp, so keep matching with that function (gets you ">=")
- read a, gets you "a"
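
The walk-through above could be sketched roughly like this. It is only an illustration under a few assumptions: digits are folded into the identifier class, any character outside the listed sets (parentheses, for instance) becomes a one-character token, and the names classify and next_token are invented here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum char_class { CC_SPACE, CC_IDENT, CC_OP, CC_OTHER };

/* Decide which character set c belongs to. */
static enum char_class classify(int c)
{
    if (isspace(c))
        return CC_SPACE;
    if (isalnum(c) || c == '_')
        return CC_IDENT;
    if (c != '\0' && strchr("<>=!&|", c) != NULL)
        return CC_OP;
    return CC_OTHER;
}

/*
 * Scan one token from text starting at *pos and copy it into tok.
 * Returns 1 if a token was produced, 0 at end of input.
 * Tokens longer than tok_size - 1 are silently truncated.
 */
int next_token(const char *text, size_t *pos, char *tok, size_t tok_size)
{
    size_t i = *pos, n = 0;
    enum char_class cls;

    assert(text != NULL && pos != NULL && tok != NULL && tok_size > 0);

    /* Whitespace separates tokens but is never part of one. */
    while (text[i] != '\0' && classify((unsigned char)text[i]) == CC_SPACE)
        i++;
    if (text[i] == '\0') {
        *pos = i;
        return 0;
    }

    /* The first character fixes the class the rest must match. */
    cls = classify((unsigned char)text[i]);
    do {
        if (n + 1 < tok_size)
            tok[n++] = text[i];
        i++;
    } while (cls != CC_OTHER && text[i] != '\0' &&
             classify((unsigned char)text[i]) == cls);

    tok[n] = '\0';
    *pos = i;
    return 1;
}
```

Calling next_token repeatedly on "var >=a" with pos starting at 0 yields "var", ">=" and "a". Note that any run of operator characters is glued together, so something like "=!" would come back as one token and a later step has to decide whether it is a valid operator.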

Thoughts?

If this is a fairly complex grammar, then lex/yacc (or flex/bison) are worth looking into, since all you have to do is define the rules of the grammar, and it takes care of all the awkward tokenising bits.

cdlvj · Aug 3, 2006

Sure it will.

Code:

// strtok.cpp : Defines the entry point for the console application.
//

#include <stdio.h>
#include <string.h>

// Split t on '=', '<', '>' and space, printing each piece.
void srtest( const char *t )
{
	char data[64];
	char delims[] = "=<> ";
	char *result = NULL;

	// strtok() modifies its input, so work on a bounded copy.
	strncpy( data, t, sizeof data - 1 );
	data[sizeof data - 1] = '\0';

	result = strtok( data, delims );
	while( result != NULL ) {
		printf( "result is \"%s\"\n", result );
		result = strtok( NULL, delims );
	}
}

int main(void)
{
	srtest("var=10");
	srtest("var =10");
	srtest("var= 10");
	srtest("var<10");
	srtest("var>10");
	return 0;
}

Code:

result is "var"
result is "10"
result is "var"
result is "10"
result is "var"
result is "10"
result is "var"
result is "10"
result is "var"
result is "10"

cdlvj · Aug 3, 2006

And this returns the same results:

Code:

	srtest("var               =10");
	srtest("var =10             ");
	srtest("var=                   10");
	srtest("var  <    10");
	srtest("  var       >10");
	srtest("  var>10");

cpjust · Aug 3, 2006

klose86 said:
I need to get "var" "=" "a", wherever the operator is written (right, center or left). Any idea how I can do that?

Based on the original question, your code only fulfills 2/3 of the requirements: it returns "var" and "a", but the operator itself is lost.

klose86 · Aug 5, 2006

Well, thank you very much for your help.
Before I started this thread I was working with the strtok() function, but as "cpjust" said, it didn't fulfill my requirements.
So what I did was write a function that reads the file char by char, using fgetc() inside a while loop, and saves every char in an array that will contain a complete token when the loop ends. I skip white space and other symbols by checking whether what I got from fgetc() matches them. But if the char is an operator (>, <, =, etc.) I save it in the array and then break the loop, so that the function doesn't save the following chars.
It works, but not always, especially when it comes to parentheses (I need to save them too). The closing parenthesis is always at the end of a token, and it seems the loop doesn't handle it, so it saves "a)" as it is, instead of "a" ")". Well, at least I'm happy it works in most cases.

I hope I explained myself clearly. I would post the code of the function, but it's kind of long.

By the way, does anybody know how to check whether a value is an integer or not? I need to atoi() some strings, and I need to check whether what I get is actually a number.

THANKS A LOT.

cpjust · Aug 5, 2006

You could use atoi() as long as zero is always written as "0" and never as something like "000". If that is true, you could do this:

Code:

char* szNum;
int iNum;
...
iNum = atoi( szNum );

/* 0 could mean szNum is "0", or not an integer. */
if ( iNum == 0 )
{
   if ( (strlen( szNum ) == 1) && (szNum[0] == '0') )
   {
      /* szNum is really "0", so do nothing. */
   }
   else
   {
      /* szNum is not an integer! */
      /* Do something to handle this condition. */
   }
}
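
An alternative that does not depend on how zero is written is strtol(), which reports where parsing stopped. The sketch below is one way to use it, assuming the token is a NUL-terminated string and only base 10 is wanted; parse_int is a made-up name.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Returns 1 and stores the value in *out if s consists entirely of a
 * decimal integer that fits in an int; returns 0 otherwise.
 * Like strtol() itself, it accepts leading whitespace and a sign.
 */
int parse_int(const char *s, int *out)
{
    char *end;
    long v;

    assert(s != NULL && out != NULL);

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)
        return 0;                   /* no digits at all, e.g. "abc" or "" */
    if (*end != '\0')
        return 0;                   /* trailing junk, e.g. "12abc" */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;                   /* out of range for int */
    *out = (int)v;
    return 1;
}
```

With this, "0" and "000" both parse as 0, while "12abc" is rejected. atoi() would quietly return 12 for that last one, which the zero check alone cannot catch.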

klose86 · Aug 8, 2006

Thank you very much, cpjust. Your idea about atoi() and the zero result helped me a lot.
Now, back to the original issue of the thread. I'm posting the function I wrote to scan the file and tokenize it. As you can see, I read char by char and save the chars in an array to build a token. When it comes to operators, I just check whether the char is one of the operators I care about, and if it is, I save that char and break the while loop. The problem is that it works correctly ONLY if the array is empty.
Why is that? Because if it's not empty and I get, for example, "var1= a", the array would contain [v,a,r,1,=] and then the loop breaks, saving "var1=" in the array. However, if it is empty, it means it's reading something like "var1 =a", so the first element would be "=", the loop breaks, and the array contains only that char, which is what I'm looking for.

I'm not sure what I should do to fix the problem of the non-empty array. If you understood me and have any ideas, I'd appreciate the help, as always.

Thank you very much.

Code:

void tokenizar_file(char *file_name) {
	FILE *f;
	char ch;
	char array[MAXSTR];
	int numchar;				/* total of characters in the word */
	char* temp;				/* for allocating the tokens */

	if ((f = fopen(file_name, "r")) == NULL) {
		printf("Error opening %s\n", file_name);
		exit(1);
	}
	while ((ch = (char)getc(f)) != EOF) {	/* char by char */
		numchar = 0;

		while (ch != ' ' && ch != '\n' && ch != '\t' && ch != EOF) {
			if ( ( (ch == '+') || (ch == '=') || (ch == ')') || (ch == '(') ) && numchar == 0 ) {
				/* in this case the array is empty */
				array[numchar++] = ch;
				break;
			}
			else if ( ( (ch == '+') || (ch == '=') || (ch == ')') || (ch == '(') ) && numchar != 0 ) {
				/* in this case the array is not empty */
				/* don't know what to do here about the operator ch */
				break;
			}

			array[numchar++] = ch; /* I save the char in the array */
			ch = (char)getc(f);
		}
		array[numchar] = '\0';

		if (numchar != 0) {	/* I copy the chars of the array into temp */
			temp = (char *)malloc(strlen(array)+1);
			strcpy(temp,array);
			InsertOnList(&tokens, temp, numLinea); /* here I insert the token on the linked list */
		}
	}
	fclose(f);
}
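
One possible way to handle the non-empty case is to push the operator back onto the stream with ungetc(), so the current token ends cleanly and the operator is read again as the start of the next one. The sketch below is a stand-alone illustration, not a drop-in replacement: read_token and is_single_op are invented names, only the single-character operators from the code above are handled, and the linked-list insertion is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The one-character operators checked in the original loop. */
static int is_single_op(int ch)
{
    return ch == '+' || ch == '=' || ch == '(' || ch == ')';
}

/*
 * Read the next token from f into tok (tok_size bytes).
 * Returns 1 if a token was read, 0 at end of file.
 * Tokens longer than tok_size - 1 are silently truncated.
 */
int read_token(FILE *f, char *tok, size_t tok_size)
{
    size_t n = 0;
    int ch;     /* int, so EOF stays distinct from every byte value */

    assert(f != NULL && tok != NULL && tok_size > 1);

    /* Skip leading whitespace. */
    do {
        ch = getc(f);
    } while (ch == ' ' || ch == '\t' || ch == '\n');

    while (ch != EOF && ch != ' ' && ch != '\t' && ch != '\n') {
        if (is_single_op(ch)) {
            if (n == 0)
                tok[n++] = (char)ch;    /* the operator is the whole token */
            else
                ungetc(ch, f);          /* leave it for the next call */
            break;
        }
        if (n + 1 < tok_size)
            tok[n++] = (char)ch;
        ch = getc(f);
    }
    tok[n] = '\0';
    return n > 0;
}
```

On a file containing "var1= (a)", successive calls return "var1", "=", "(", "a" and ")", which also covers the closing-parenthesis case mentioned earlier. Two-character operators such as ">=" would need one extra look-ahead step on top of this.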

cpjust · Aug 9, 2006

Two things I found that probably aren't related to the problem you're having:

1. In 2 places you use

Code:

ch = (char)getc(f)

but getc() returns an int, not a char. It's probably not very likely that you have 0xFF chars in your file, but if you do, they'd be mistaken for EOF.

2. You aren't checking if malloc() returns NULL.
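
Both points can be illustrated in isolation. This is only a sketch with made-up helper names (count_bytes, copy_token); how a failed allocation should be reported is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count the bytes remaining in f. ch is an int, so a 0xFF byte in the
 * file is not confused with EOF.
 */
long count_bytes(FILE *f)
{
    long n = 0;
    int ch;

    assert(f != NULL);
    while ((ch = getc(f)) != EOF)
        n++;
    return n;
}

/*
 * Return a heap copy of token, or NULL if malloc() fails.
 * The caller is responsible for free()ing the result.
 */
char *copy_token(const char *token)
{
    size_t len;
    char *copy;

    assert(token != NULL);
    len = strlen(token) + 1;        /* include the terminating NUL */
    copy = malloc(len);
    if (copy == NULL)
        return NULL;                /* let the caller decide what to do */
    memcpy(copy, token, len);
    return copy;
}
```

In the tokenizer, the result of copy_token would be checked before it is handed to the list insertion, and the read loop would declare ch as int and cast to char only when storing into the array.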

I'll try to look at the code more closely when I get some time.

SamBones · Aug 9, 2006

Have you looked at lex and yacc (and flex and bison)? What you're doing is called lexical analysis, and that's what these utilities do. You might be able to look at the source for flex to see how it does it.

You can Google this for more info...

http://www.google.com/search?hl=en&lr=&q=lex+yacc+flex+bison&btnG=Search
