How to make the Type Tree parser non-greedy?

Status
Not open for further replies.

SeriesConsumer

Programmer
Feb 1, 2006
44
LU
Hello,

I have a very simple input file to parse, but so far I haven't managed to build a Type Tree that is capable of parsing it. The input looks like
[tt]
ED03_06
ABC09_07
FEX301_12
A303_03
_05_06
[/tt]
Each line starts with a symbolic name of arbitrary length (minimum 1 character) followed by two digits indicating the month, a fixed underscore sign and finally two more digits indicating the year.

The problem is the symbol field: I don't know a way of making it non-greedy, i.e. prevent it from consuming the month too. I already set month and year to fixed size. To represent the underscore I introduced a separate field with a value restriction and an inclusion rule for just the character '_'. Using component rules also didn't lead to success.

For demonstration purposes I will include a small Perl script that does exactly what I want (using non-greedy regular expressions):
[tt]
while (<>) {
    chomp;
    if (/^(.+?)(\d\d)_(\d\d)$/) {
        my ($symbol, $month, $year) = ($1, $2, $3);
        print "symbol=$symbol, month=$month, year=$year\n";
    } else {
        print "can't parse $_\n";
    }
}
[/tt]
Does anyone have a clue how this behaviour can be achieved using Mercator 6.7?
 
You are trying to define a group with no way to tell where the first element ends. "FEX3" and "_" seem to mean the first element could contain anything of any length.

You might want to use the script to pre-process the data and add a terminator to the first element.



BocaBurger
<===========================||////////////////|0
The pen is mightier than the sword, but the sword hurts more!
 
You are right, the symbol field can contain virtually anything, esp. digits, which prevents me from defining value and/or character restrictions. Currently the field is at least two characters long (mostly up to four) and contains capital letters and (seldom) a digit. But I don't want to rely on those characteristics, as they might change daily and, above all, silently: new input data might come in, without prior notification, that contains new symbols I can't think of now.

If I get you right you are saying that my problem can't be solved using Mercator Type Trees? That's bitter.
 
Type trees need to have pre-defined types. Best bet is to have a pre-process procedure that you control, to add a terminator that will allow the types to be distinguished. Then the first element would be a text (s) terminated by something that can't appear in the data.
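A minimal sketch of such a pre-processing step, written in Python purely for illustration. The `|` terminator and the function name `add_terminator` are assumptions, not anything Mercator prescribes; pick a terminator that really cannot occur in your symbols. It relies only on the fact stated above that every line ends in two digits, an underscore and two digits.

```python
import re

# Hypothetical terminator; assumed never to occur in the symbol field.
TERMINATOR = "|"

# The last five characters of every line: MM_YY.
TAIL = re.compile(r"[0-9]{2}_[0-9]{2}")


def add_terminator(line: str) -> str:
    """Insert TERMINATOR between the symbol and the month."""
    symbol, tail = line[:-5], line[-5:]
    if not symbol or not TAIL.fullmatch(tail):
        raise ValueError(f"can't parse {line!r}")
    if TERMINATOR in symbol:
        raise ValueError(f"terminator already present in {line!r}")
    return f"{symbol}{TERMINATOR}{tail}"


if __name__ == "__main__":
    for record in ["ED03_06", "FEX301_12", "_05_06"]:
        print(add_terminator(record))
```

With the terminator in place, the first element can be defined as a text item ending at that terminator, followed by the fixed-size month and year.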

FYI, in your example data, one data string has a first element that is only one character long.



BocaBurger
 
You know the size of the month and year part - it's 5 characters. The month is the 2 characters preceding '_'.
If you define the record as a single field, you can extract the symbolic name as follows

=left(field,size(word(field,"_",1))-2)

You can get the other bits in a similar way.
 
Ah I missed this record _05_06

=left(field,size(field)-5)

Even easier.
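For comparison, here is the same fixed-width idea as a small Python sketch (the function name `split_record` is just for illustration). It does no validation; it simply assumes the last five characters are always month, underscore and year.

```python
from typing import Tuple


def split_record(field: str) -> Tuple[str, str, str]:
    """Split a record into (symbol, month, year) by position only."""
    symbol = field[:-5]    # same idea as left(field, size(field) - 5)
    month = field[-5:-3]   # the two characters before the underscore
    year = field[-2:]      # the last two characters
    return symbol, month, year


if __name__ == "__main__":
    for record in ["ABC09_07", "A303_03", "_05_06"]:
        print(split_record(record))
```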
 
BocaBurger, I agree with you that maybe it would be best (i.e. simplest solution) to have some pre-processing, but
a) this would be very hard to integrate into our existing EAI platform (not to mention a violation of our strategy) and
b) it would also be a bit silly to "tune" a $100K software by some Perl script.
What I just don't understand: obviously the parser is capable of performing backtracking to obtain the best match. But under which circumstances is backtracking activated? It seems as if it only works for partitioned or choice groups. So I could introduce a partitioned group for symbol of length 2, length 3 and so on (the group would also contain the month and year fields). Having such a group at top level will make the parser backtrack. But the disadvantages are so obvious that I didn't dare to implement it. So far I have not found a way to activate backtracking for a field.

The second idea I had was to express the constraint
SIZE(symbol) = SIZE(line) - 5
somehow in a component rule. But that didn't work either.
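To make clear what I mean, here is the behaviour I am after as a rough Python sketch (not Mercator; the name `parse_by_trial` and the `max_symbol_len` limit are made up for illustration). It tries each candidate symbol length in turn, the way one partition per length would, and accepts the first one whose remainder is exactly month, underscore, year, which is the same as requiring SIZE(symbol) = SIZE(line) - 5.

```python
import re
from typing import Optional, Tuple

# What must remain after the symbol: MM_YY and nothing else.
REST = re.compile(r"([0-9]{2})_([0-9]{2})")


def parse_by_trial(line: str, max_symbol_len: int = 32) -> Optional[Tuple[str, str, str]]:
    """Try symbol lengths 1, 2, 3, ... until the remainder fits MM_YY."""
    for n in range(1, min(max_symbol_len, len(line)) + 1):
        match = REST.fullmatch(line[n:])
        if match:
            return line[:n], match.group(1), match.group(2)
    return None  # no candidate length worked


if __name__ == "__main__":
    for record in ["ED03_06", "FEX301_12", "05_06"]:
        print(record, "->", parse_by_trial(record))
```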
 
Janhes, it is of course no problem to determine the symbol part of a line using map rules. But that implies having one more output card for "post parsing" instead of directly processing the input by the Type Tree. However, due to the lack of alternative sensible solutions I actually have done this ;-)
 
Prob missed something here but..... defined a simple type tree with your data. Seems to validate. May need some tweaking but....

file
-record (0:S) {terminator = "<NL>"}
--optional_group1 (0:1)
---text (0:S) {size=1 include = A-Z}
---num1 (0:S) {size=1}
--optional_group2 (1:S) {initiator = "_"}
---num2 (1:1) {size=2}
 
I think the num1 element might cause a problem, as would a symbolic name that consists of digits or includes digits before the last position.

BocaBurger
 
Eyetry, I fear that your Type Tree does not properly reflect the characteristics of the data to be read: as the symbol field can contain an arbitrary number of digits anywhere within it, it is hard to distinguish it from the month field. It would be much easier to parse from right to left: first come two year digits, then an underscore, then two month digits, and the rest is simply the symbol. I thought of REVERSEBYTE() already but still see no way to use it successfully.
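The right-to-left idea, sketched in Python just to illustrate it (this is not a Mercator solution, and `parse_right_to_left` is a made-up name): reverse the line, read the two fixed-size fields and the underscore from the front, take whatever is left as the symbol, then reverse each piece back.

```python
import re
from typing import Optional, Tuple

# On the reversed line the fixed-size fields come first: YY _ MM, then the symbol.
REVERSED = re.compile(r"([0-9]{2})_([0-9]{2})(.+)", re.DOTALL)


def parse_right_to_left(line: str) -> Optional[Tuple[str, str, str]]:
    """Parse (symbol, month, year) by reading the reversed line left to right."""
    match = REVERSED.fullmatch(line[::-1])
    if match is None:
        return None
    year_rev, month_rev, symbol_rev = match.groups()
    # Each captured piece is still reversed, so flip it back.
    return symbol_rev[::-1], month_rev[::-1], year_rev[::-1]


if __name__ == "__main__":
    for record in ["A303_03", "FEX301_12", "_05_06"]:
        print(record, "->", parse_right_to_left(record))
```

On the reversed line the variable-length field comes last, so a greedy match on it does no harm.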
 
So, if the first field is present will you always have a min of 3 bytes of which the last two will always be numeric?

 
No, all three fields (symbol, month and year) are always mandatory. Originally I assumed that the symbol field must contain at least one character, but for practical purposes one can assume it's always two or more characters.
 