Dynamic Not-Quite Template System - Regex

Status
Not open for further replies.

Borvik

Programmer
Jan 2, 2002
1,392
US
I'm sure that subject probably confused some of you, though I can't think of another quick way to describe it.

It's not a template system in the sense of building page templates, like most template systems out there, but it shares some of the same parsing.

I am basically creating a dynamic question and answer system, where the question, answer, and possibly an object definition are stored in a database.

I've got most of it working, but have come across a road-block as I'm trying to implement a way for the definition to include a function call.

Here are the rules I've come up with so far:
Variables are all random numbers
Variables are identified by $.
Variables in the answer definition must be defined in the question or they are ignored.
Variables are defined with a $ in front of the name and curly brackets surrounding a number {#} (0 to #) or two numbers {#,#} (# to #).
Function definitions are identified by a # in front.
Functions may have multiple parameters.
Function parameters are within parentheses.

The answer needs to go through eval for mathematical evaluation questions to work. This is the LAST thing that gets run:
Code:
if( substr($a, -1) != ';' )
	$a .= ';';
$a = '$pDataResult = '.$a;
eval($a);
$a = $pDataResult;

Here is an example of what I'm doing (stuff that works):
Code:
$q = 'What is $a{1,9} + $b{1,9}?';
$a = '$a + $b';
That ends up after being parsed as:
parsed a: mt_rand(1, 9)
parsed b: mt_rand(1, 9)
$q = 'What is 4 + 3?'
$a = '7'

Code:
$q = 'What is $a{10,18} - $b{1,9}?';
$a = '$a - $b';
$q = 'What is 13 - 6?'
$a = '7'
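
For anyone following along, the variable step above could be sketched roughly like this. This is not the poster's actual parser; the helper names fillVariables and fillAnswer are made up for illustration, and leaving unknown answer variables untouched is one assumed reading of "they are ignored".

```php
<?php
// Hypothetical helper: replaces each $name{max} or $name{min,max} definition
// in the question with a random value and records the values by name.
function fillVariables($question, &$vars)
{
    $vars = array();
    return preg_replace_callback(
        '/\$([a-zA-Z_]\w*)\{(\d+)(?:,(\d+))?\}/',
        function ($m) use (&$vars) {
            if (isset($m[3]) && $m[3] !== '') {
                // {#,#} means "from # to #"
                $min = (int)$m[2];
                $max = (int)$m[3];
            } else {
                // {#} means "from 0 to #"
                $min = 0;
                $max = (int)$m[2];
            }
            $vars[$m[1]] = mt_rand($min, $max);
            return (string)$vars[$m[1]];
        },
        $question
    );
}

// Hypothetical helper: substitutes known variables into the answer definition.
// Variables that were never defined in the question are left as they are.
function fillAnswer($answer, array $vars)
{
    return preg_replace_callback(
        '/\$([a-zA-Z_]\w*)/',
        function ($m) use ($vars) {
            return isset($vars[$m[1]]) ? (string)$vars[$m[1]] : $m[0];
        },
        $answer
    );
}

$q = fillVariables('What is $a{1,9} + $b{1,9}?', $vars);
$a = fillAnswer('$a + $b', $vars);
echo $q, "\n"; // e.g. "What is 4 + 3?" (values are random)
echo $a, "\n"; // e.g. "4 + 3", ready for the eval step
```

The answer string still has to go through the eval step from earlier to become a single number.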

Those examples work fine. Here is where it gets complicated.
Code:
$q = 'What is the #DynamicQ::num2RandOrd($a{1,6}) word of the following sentence?';
$a = '#DynamicQ::RegExMatch(parameters here)';
$o = 'This is a very common sentence.';
Parsing the question works just fine; the question ends up being something like:
What is the 4th word of the following sentence?
or
What is the fourth word of the following sentence?

I haven't even started on the regex to find the nth word yet; I'll worry about the parsing first.

I would like to accurately find the function name (including the ::, though the function doesn't have to have it) and the parameters.

Here's what I have currently for parsing the functions:
Code:
	function Parse4Funcs($txt){
		$funcs = array();
		preg_match_all('/#([a-zA-Z0-9\:]*?)\((.*?)\)/', $txt, $matches);
		for($i = 0; $i < count($matches[1]); $i++){
			$args = explode(',', $matches[2][$i]); // split() was removed in PHP 7
			$funcs[$matches[1][$i]] = $args;
		}
		return $funcs;
	}
That ends up with an array with the function name as the key and the parameters as a subarray, for example:
Code:
Array(
  'DynamicQ::num2RandOrd' => Array(
      4
  )
)

Thanks.
 
is the pattern you are searching for not just
Code:
$pattern = '/\\s*((#[^\(]*)(\([^\)]*\)))/';
 
Not quite.

Basically what I end up with needs to accurately locate and separate the arguments. This does not need to be done in 1 regular expression.

BTW - I'm using preg_match_all (not ereg), in case that changes anything (read reports preg could be faster than ereg) - not locked into it though.

Here are the results with the original string. I added a second parameter in quotes, where the end of the function could be miscalculated by the regular expression:

$a = '#DynamicQ::RegExMatch($a, ")")';
results:
Code:
Array
(
    [0] => Array
        (
            [0] => #DynamicQ::RegExMatch($a, ")
        )

    [1] => Array
        (
            [0] => #DynamicQ::RegExMatch($a, ")
        )

    [2] => Array
        (
            [0] => #DynamicQ::RegExMatch
        )

    [3] => Array
        (
            [0] => ($a, ")
        )

)
Notice how the arguments are missing the full string containing the parenthesis. Commas inside a string could also throw off the separation of the arguments.
 
not sure i quite grasp the issue yet. but how about this? feed the input subject to parseAnswer.

Code:
function parseAnswer($answer){
	$c = 0;
	$pattern = '/\\s*(#[^\(]*)(\(\))';
	preg_match_all($pattern, $answer, $matches);
	foreach($matches as $match){
		$answers[$c] = $match[1];
		preg_replace_callback('/(\'|")(.*?)\\1', 'strip_commas', $match[2]);
		$params = explode (',', $match[2]);
		foreach ($params as $param){
			$answers[$c]['params'][] = str_replace('~~~', ',', $param);
		}
	}
	return $answers;
}

function strip_commas($matches){
	$m = str_replace(',', '~~~', $match[2]);
	return $match[1].$m.$match[1];
}
 
That didn't quite work either.

Here's what I need out of the given string.
Function Name
Function Parameters

By the time the functions are parsed and run, the variables have already been replaced.

Example case:
Original: What is the #DynamicQ::num2RandOrd($a{1,6}) word of the following sentence?
String: What is the #DynamicQ::num2RandOrd(1) word of the following sentence?

The function parser needs to return the function and all its parameters. The easiest way I can think of right now is to return this as an associative array with the function name as the key and the parameters as a subarray. Here is a line of code that would produce the structure obtained from the above string:
Code:
$ret['DynamicQ::num2RandOrd'] = array(1);
I would then be able to use a for each to run each of these using call_user_func_array.

Actually, now that I'm thinking about it, it would probably be better to have the whole matching function (parameters and all) as the key, with the function name as the first item in the subarray and the parameters as a second subarray:
Code:
$ret['#DynamicQ::num2RandOrd(1)'] = array('DynamicQ::num2RandOrd', array(1));
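
A rough sketch of how that structure could be consumed with call_user_func_array is below. The DynamicQ class here is only a stand-in written for this example (the real num2RandOrd is not shown in the thread), and runCalls is a made-up name.

```php
<?php
// Stand-in for the real class, for illustration only; handles 1..6.
class DynamicQ
{
    public static function num2RandOrd($n)
    {
        $ord = array(1 => '1st', '2nd', '3rd', '4th', '5th', '6th');
        return $ord[(int)$n];
    }
}

// Hypothetical helper: $calls maps the full matched text to
// array(function name, array of parameters).
function runCalls($text, array $calls)
{
    foreach ($calls as $full => $call) {
        list($name, $params) = $call;
        // Turn 'Class::method' into an array('Class', 'method') callable.
        $callable = strpos($name, '::') !== false ? explode('::', $name) : $name;
        $result = call_user_func_array($callable, $params);
        // Keying on the full text lets one str_replace cover repeated calls.
        $text = str_replace($full, $result, $text);
    }
    return $text;
}

$ret = array();
$ret['#DynamicQ::num2RandOrd(1)'] = array('DynamicQ::num2RandOrd', array(1));
echo runCalls('What is the #DynamicQ::num2RandOrd(1) word of the following sentence?', $ret), "\n";
```

With the stand-in class this prints "What is the 1st word of the following sentence?".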

Now the problem is in getting it to work on a string like the following: 'My answer is: #DynamicQ::RegExMatch(",", ")") - and that is final'
That should have a structure similar to:
Code:
$ret['#DynamicQ::RegExMatch(",", ")")'] = array('DynamicQ::RegExMatch', array('","', '")"'));

I would imagine that some of the regex patterns would be similar to patterns used to separate comma delimited files, though I wouldn't have a clue as to how to make the pattern work properly.
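
One CSV-like way to split the parameter list, sketched under some assumptions: each argument is either a quoted string (with backslash escapes) or a bare value, and the helper name splitArgs is invented for this example. Instead of splitting on commas, it matches whole arguments, so commas and parentheses inside quotes never get looked at as separators.

```php
<?php
// Hypothetical helper: returns the arguments of a parameter list,
// keeping quoted strings (and anything inside them) intact.
function splitArgs($args)
{
    // Alternatives, in order: a double-quoted string, a single-quoted string
    // (both allowing backslash escapes), or a bare run without commas/quotes.
    $pattern = '/"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'|[^,\'"\s][^,\'"]*/';
    preg_match_all($pattern, $args, $m);
    return array_map('trim', $m[0]);
}

print_r(splitArgs('",", ")"'));
print_r(splitArgs('$a, "Testing, all occurences", 3'));
```

The first call gives the two quoted strings "," and ")" as separate items; the second gives $a, the full quoted sentence, and 3. This only handles the argument list, so finding where the list ends is still a separate problem.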
 
here you go. you can play with the function to change the array shape to be as you wish.

if there is a danger of the function being found twice or more, then make the function recursive with a static variable for the offset.

Code:
<?php
$b = parseAnswer('My answer is: #DynamicQ::RegExMatch(",", ")") - and that is final');
echo "<hr/><pre>";
print_r($b);
function parseAnswer($answer){
	$c = 0;
	$pattern = '/\\s*(#[^\(]*)(\((.*)\))/';
	preg_match($pattern, $answer, $matches);
	$answers[$c]['function'] = $matches[0];
	$matches[3] = preg_replace_callback('/(\'|")(.*?)\\1/', 'strip_commas', $matches[3]);
	$params = explode (',', $matches[3]);
	foreach ($params as $param){
		$answers[$c]['params'][] = str_replace('~~~', ',', $param);
	}
	return $answers;
}

function strip_commas($matches){
	$m = str_replace(',', '~~~', $matches[2]);
	return $matches[1].$m.$matches[1];
}

gives

Code:
Array
(
    [0] => Array
        (
            [function] =>  #DynamicQ::RegExMatch(",", ")")
            [params] => Array
                (
                    [0] => ","
                    [1] =>  ")"
                )

        )

)
 
Thanks jpadie,

That does indeed appear to work, though I managed to break it.

If there are two functions in the same string, the greediness of ".*" shows up and it captures everything from the first opening parenthesis to the last closing one.

I've got a thought - based on the preg_replace_callback you used.

Can a regular expression search for a match that excludes two exact characters in succession? Example: [^\(] excludes a single open parenthesis; would [^d\(] exclude a d followed by a parenthesis? So "phpd(test)" would retrieve "phpd" with the first one, but the second one would retrieve "php".

If that does work out, the function could then use the same method of CSV separation you implemented to automatically escape the parentheses, and then run a regex on the escaped string.

Let me know if my theory is sound.
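
A small check of that theory, as a sketch: a character class only ever describes one character, so [^d\(] means "neither d nor (", not "d followed by (". The "phpd(test)" example happens to give "php" either way because it contains only one d. Excluding a two-character sequence needs a negative lookahead instead. The sample string below is chosen to show the difference.

```php
<?php
$s = 'add(test)';

// [^\(] stops at the first "(".
preg_match('/^[^\(]*/', $s, $m1);

// [^d\(] is still a single-character class: it stops at the first "d" OR "(".
preg_match('/^[^d\(]*/', $s, $m2);

// Negative lookahead: take any character unless "d(" starts at this position.
preg_match('/^(?:(?!d\().)*/', $s, $m3);

echo $m1[0], "\n"; // add
echo $m2[0], "\n"; // a
echo $m3[0], "\n"; // ad
```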
 
Well I managed to break that example.

If there are two functions in the same string, the greediness of ".*" shows up and it captures everything from the first opening parenthesis to the last closing one.
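
To make the greediness problem concrete, here is a small sketch comparing the greedy and lazy quantifiers on a string with two calls. The function names #first and #second are placeholders, not anything from the real system.

```php
<?php
$s = 'My answer is: #first(1, 2) - and #second(3) is final';

// Greedy: (.*) runs to the LAST ")" in the string, swallowing both calls.
preg_match_all('/#([^\(]*)\((.*)\)/', $s, $greedy);

// Lazy: (.*?) stops at the FIRST ")", so each call is matched on its own.
// It still breaks as soon as a quoted argument contains ")".
preg_match_all('/#([^\(]*)\((.*?)\)/', $s, $lazy);

print_r($greedy[0]); // one match: "#first(1, 2) - and #second(3)"
print_r($lazy[0]);   // two matches: "#first(1, 2)" and "#second(3)"
```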

Based on my thought, I was trying to figure out how your functions/regexp worked in combination with studying the php.net page when I think I stumbled onto a solution using a "lookbehind assertion."

After playing with trying to get it working for quite some while, I think I have a viable solution.

Here is the output along with the original string.
Code:
My answer is: #DynamicQ::RegExMatch($a, "\)", "Testing, all occurences") - and #that("\)", "More\"testing") is final
Array
(
    [ #DynamicCaptcha::RegExMatch(#argvar{#object}, "\)", "Testing, all occurences")] => Array
        (
            [function] => DynamicCaptcha::RegExMatch
            [params] => Array
                (
                    [0] => #argvar{#object}
                    [1] =>  \)
                    [2] =>  Testing, all occurences
                )

        )

    [ #that("\)", "More\"testing")] => Array
        (
            [function] => that
            [params] => Array
                (
                    [0] => \)
                    [1] =>  More"testing
                )

        )

)

I need the full definition so that after running the function it can easily replace what needs to be replaced, in case the same function was used more than once in the same string.

I tweaked it further to remove the starting/end quotes as it's already a string. It also takes care to remove the escape character from the proper quotes in those strings.

Here is the code I ended up with (the printing in parseAnswer was for debugging purposes):
Code:
function parseAnswer($answer){
	echo "$answer\n";
	$tmpS = str_replace(chr(0), chr(0).chr(0), $answer);
	$tmpS = preg_replace_callback('/(\'|")(.*?)(?<!\\\)\\1/', 'escapeEndParen', $tmpS);
	$pattern = '/\\s*#([^\(]*)\((.*?(?<!\x00))\)/';
	preg_match_all($pattern, $tmpS, $matches);
	$matches = repairArray($matches, ')');
	$ret = array();
	for( $i = 0; $i < count($matches[0]); $i++ ){
		$ret[$matches[0][$i]]['function'] = $matches[1][$i];
		$ret[$matches[0][$i]]['params'] = parseParameters($matches[2][$i]);
	}
	print_r($ret);
	return $ret;
}

function parseParameters($params){
	$tmpS = str_replace(chr(0), chr(0).chr(0), $params);
	$tmpS = preg_replace_callback('/(\'|")(.*?)(?<!\\\)\\1/', 'escapeCommas', $tmpS);
	$p = preg_split('/(?<!\x00)\,/', $tmpS);
	$p = repairArray($p, ',');
	foreach($p as $key => $value)
		$p[$key] = preg_replace_callback('/(\'|")(.*?)(?<!\\\)\\1/', 'removeExteriorParens', $p[$key]);
	return $p;
}
function escapeEndParen($matches){
	$m = str_replace(')', chr(0).')', $matches[2]);
	return $matches[1].$m.$matches[1];
}
function escapeCommas($matches){
	$m = str_replace(',', chr(0).',', $matches[2]);
	return $matches[1].$m.$matches[1];
}
function removeExteriorParens($matches){
	$m = str_replace('\\'.$matches[1], $matches[1], $matches[2]);
	return $m;
}
function repairArray($arr, $chr){
	foreach($arr as $key => $value){
		if(is_array($value))
			$arr[$key] = repairArray($value, $chr);
		else{
			$arr[$key] = str_replace(chr(0).$chr, $chr, $arr[$key]);
			$arr[$key] = str_replace(chr(0).chr(0), chr(0), $arr[$key]);
		}
	}
	return $arr;
}

Now the only thing that breaks it is having a function as a parameter of another function like the following line:
Code:
#arrayVal(#trimArray(#preg_split('/[^\w]/', #argvar{#object})), 3)

Any idea on that variation?
 
you can indeed use lookbehind/lookaheads. but you're entering a level of complexity that does not often return a benefit.

from what i have seen you'd be better off using a lexer/tokenizer solution.
 
Yes, it is getting rather complex - though I want to do it right.

How could I do this with a lexer/tokenizer solution? I know what tokenizing is, but how you parse with it escapes me (I was in the zone and lost it now - I'm sure you know how that feels). I never really heard of a lexer before - how might that work? Would that be better than a tokenizer solution?
 
i treat lexer and tokenizer as pretty much the same thing or two facets of the same exercise. although i am sure i'm wrong. ..

tokenizers divide up strings by tokens (e.g. a space). lexers enable the parsing of tokenized content by reference to pre-written grammar rules.
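
As a rough illustration of the tokenizing half, the sketch below splits a definition into typed tokens with one regular expression. The token names (func, str, close, comma, text) and the function name tokenize are invented for this example. In the more common usage, "lexer" and "tokenizer" both refer to this step, and a separate parser then applies the grammar rules to the token list.

```php
<?php
// Hypothetical tokenizer: returns a list of array(type, text) pairs.
function tokenize($src)
{
    $pattern = '/(?<func>#[A-Za-z_][\w:]*\()'                          // "#name("
             . '|(?<str>"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\')' // quoted string
             . '|(?<close>\))'                                         // ")"
             . '|(?<comma>,)'                                          // ","
             . '|(?<text>[^#,()"\']+|.)/s';                            // anything else
    preg_match_all($pattern, $src, $matches, PREG_SET_ORDER);

    $tokens = array();
    foreach ($matches as $m) {
        // Exactly one named group is non-empty for each match.
        foreach (array('func', 'str', 'close', 'comma', 'text') as $type) {
            if (isset($m[$type]) && $m[$type] !== '') {
                $tokens[] = array($type, $m[$type]);
                break;
            }
        }
    }
    return $tokens;
}

foreach (tokenize('#arrayVal(#trimArray("a, b)"), 3)') as $t) {
    echo $t[0], ': ', $t[1], "\n";
}
```

Because the quoted string is consumed as a single token, the comma and parenthesis inside it never reach the code that counts brackets or separates parameters.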

 
Gotcha.

I've been searching in the meantime, and think I may have found something to help me make one. I'll post back with how it goes after I find out.
 
Well, I think I've got a working solution. It took me a little while to figure out how the whole stack concept would work with function calling (the example I found didn't show function calls, just simple math), but I think I got it worked out.

Here's the code if anyone is interested:
Code:
class funcParser{
	var $funcs;
	var $openBracketCount;
	var $script;
	var $position;
	var $lastToken;
	
	var $object;
	
	function __construct($object){
		$this->funcs = array();
		$this->object = $object;
	}
	
	function parse($script){
		$this->script = $script;
		$this->getFuncs();
		return $this->runFuncs();
	}
	
	function getc(){
		return substr($this->script, $this->position++, 1);
	}
	
	function runFuncs(){
		$t = array();
		$str = '';
		while( count($this->funcs) ){
			$obj = array_pop($this->funcs);
			if( get_class($obj) == 'funcStart' ){
				// get rest of function from $t;
				$params = array();
				while( get_class($o2 = array_pop($t)) != 'funcEnd' ){
					$params[] = $o2->txt;
				}
				if( strpos($obj->txt, '::') !== false )
					$aFunc = explode('::', $obj->txt); // split() was removed in PHP 7
				else
					$aFunc = $obj->txt;
				foreach($params as $key => $param){
					if( trim($param) == '@object' )
						$params[$key] = $this->object;
					else{
						if( @eval($param) !== false )
							$params[$key] = eval($param);
					}
				}
				$result = call_user_func_array($aFunc, $params);
				$temp = new funcText($result);
				array_push($this->funcs, $temp);
				while( count($t) ){
					$r = array_pop($t);
					array_push($this->funcs, $r);
				}
			}else{
				array_push($t, $obj);
			}
			if( count($this->funcs) == 0 ){
				//nothing left - reassemble
				while( count($t) ){
					$txt = array_pop($t);
					$str .= $txt->txt;
				}
			}
		}
		if( @eval($str) !== false )
			$str = eval($str);
		return $str;
	}
	
	function getFuncs(){
		$this->position = 0;
		$this->openBracketCount = 0;
		$f = $this->getToken();
		while( $f !== null ){
			if( $f !== -1 ){
				if( get_class($f) == 'funcStart' )
					$this->openBracketCount++;
				elseif( get_class($f) == 'funcEnd' )
					$this->openBracketCount--;
				
				$this->lastToken = $f;
				array_push($this->funcs, $f);
			}
			$f = $this->getToken();
		}
		if( $this->openBracketCount > 0 )
			die("Parse Error: Expecting ')'.");
		elseif( $this->openBracketCount < 0 )
			die("Parse Error: Too many ')'.");
	}
	
	function getToken(){
		$c = 0;
		// ignore whitespace
		while (($c = $this->getc()) == "\n" || $c == "\r");
		if ( $this->position == strlen($this->script) + 1 ) return null;
		//echo "\$c = $c, Position: ".$this->position.'/'.strlen($this->script)."\n";
		if( $c == '#' ){
			// start of function - get rest of function
			$nextchar = $this->getc();
			if( $nextchar == '#' ) return -1;
			while( preg_match('/[a-zA-Z0-9\:\_]/', $nextchar) ){
				$c .= $nextchar;
				$nextchar = $this->getc();
			}
			if( $nextchar != '(' ){
				die("Parse Error: Malformed function definition \"$c$nextchar\"");
			}
			$funcName = trim($c, '#');
			return new funcStart($funcName);
		}elseif($c == ','){
			// parameter separator - do nothing.
			return -1;
		}elseif($c == ')'){
			return new funcEnd();
		}elseif($c == "'" || $c == '"'){
			$endQ = $c;
			$nextchar = $this->getc();
			if( $this->openBracketCount > 0 ){
				while(1){
					while( $nextchar != $endQ ){
						$c .= $nextchar;
						$nextchar = $this->getc();
						if( $this->position == strlen($this->script) + 1 )
							break;
					}
					if( $this->position == strlen($this->script) + 1 )
						break;
					if( substr($c, -1) != '\\' ){
						break;
					}
				}
				$quotedParam = trim($c, $endQ);
				while( $nextchar != ',' && $nextchar != ')' && $this->position == strlen($this->script) + 1 )
					$nextchar = $this->getc();
				if( $nextchar == ')' )
					$this->position--;
				return new paramDef($quotedParam);
			}else{
				while( $nextchar != "'" && $nextchar != '"' && $nextchar != '#' ){
					$c .= $nextchar;
					$nextchar = $this->getc();
					if ( $this->position == strlen($this->script) + 1 ) break;
				}
				$this->position--;
				return new funcText($c);
			}
		}else{
			$nextchar = $this->getc();
			if( $this->openBracketCount > 0 ){
				while( $nextchar != ',' && $nextchar != ')' ){
					$c .= $nextchar;
					$nextchar = $this->getc();
				}
				if( $nextchar == ')' ){
					$this->position--;
				}
				return new paramDef($c);
			}else{
				while( $nextchar != "'" && $nextchar != '"' && $nextchar != '#' ){
					$c .= $nextchar;
					$nextchar = $this->getc();
					if ( $this->position == strlen($this->script) + 1 ) break;
				}
				$this->position--;
				return new funcText($c);
			}
		}
		return -1;
	}
}
class funcText{
	var $txt;
	function __construct($txt){
		$this->txt = $txt;
	}
}
class paramDef{
	var $txt;
	function __construct($txt){
		$this->txt = $txt;
	}
}
class funcStart{
	var $txt;
	
	function __construct($name){
		$this->txt = $name;
	}
}
class funcEnd{}
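
For comparison, here is a much smaller sketch of the same innermost-first idea, written as a recursive parser instead of an explicit token stack. It is not a drop-in replacement for the class above: it has no @object or ## handling and no eval step, the class name MiniEval is invented, and the whitelist of callable names is an added assumption so that arbitrary functions cannot be called from stored text.

```php
<?php
// Hypothetical sketch: evaluates nested "#name(arg, ...)" calls, innermost first.
class MiniEval
{
    private $s;
    private $i = 0;
    private $allowed;

    public function __construct($s, array $allowed)
    {
        $this->s = $s;
        $this->allowed = $allowed; // whitelist of callable names (an assumption)
    }

    // Copies plain text through and replaces each call with its result.
    public function run()
    {
        $out = '';
        while ($this->i < strlen($this->s)) {
            if ($this->s[$this->i] === '#') {
                $out .= $this->call();
            } else {
                $out .= $this->s[$this->i++];
            }
        }
        return $out;
    }

    // Parses "#name(args)" starting at the current offset and runs it.
    private function call()
    {
        if (!preg_match('/\G#([A-Za-z_][\w:]*)\(/', $this->s, $m, 0, $this->i)) {
            throw new Exception('Malformed function call at offset ' . $this->i);
        }
        $name = $m[1];
        if (!in_array($name, $this->allowed, true)) {
            throw new Exception('Function not allowed: ' . $name);
        }
        $this->i += strlen($m[0]);
        $args = array();
        while (true) {
            // Skip whitespace between arguments.
            while ($this->i < strlen($this->s) && ctype_space($this->s[$this->i])) {
                $this->i++;
            }
            if ($this->i >= strlen($this->s)) {
                throw new Exception("Parse error: expecting ')'");
            }
            $c = $this->s[$this->i];
            if ($c === ')') { $this->i++; break; }
            if ($c === ',') { $this->i++; continue; }
            $args[] = $this->arg($c);
        }
        return call_user_func_array($name, $args);
    }

    // Reads one argument: a nested call, a quoted string, or a bare value.
    private function arg($c)
    {
        if ($c === '#') {
            return $this->call(); // recursion evaluates the inner call first
        }
        if ($c === '"' || $c === "'") {
            $re = '/\G' . $c . '((?:[^' . $c . '\\\\]|\\\\.)*)' . $c . '/s';
            if (!preg_match($re, $this->s, $m, 0, $this->i)) {
                throw new Exception('Parse error: unterminated string');
            }
            $this->i += strlen($m[0]);
            return stripslashes($m[1]); // also drops backslashes before other characters
        }
        preg_match('/\G[^,)]*/', $this->s, $m, 0, $this->i);
        $this->i += strlen($m[0]);
        return trim($m[0]);
    }
}

$e = new MiniEval(
    'Result: #strtoupper(#str_replace("a,b", ")", "xa,by")) done',
    array('strtoupper', 'str_replace')
);
echo $e->run(), "\n"; // Result: X)Y done
```

The recursion plays the role of the stack: an inner call is fully evaluated and its return value becomes an argument of the outer call, so commas and parentheses in quoted arguments or in results never confuse the bracket matching.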
 