
How do I parse such a complicated file?

Status
Not open for further replies.

AfroJoe

IS-IT--Management
Aug 8, 2010
7
CA
Hi Guys,

I posted this before, but I guess my post got lost in cyberspace. :( I'm extremely new to AWK. I have a scenario where I need to extract variables and invoice information into an array. Without further ado, can someone please point me in the right direction? Here's the input file:

X, Y, Z Company 50681 08/04/10

07/01/10 3065 2,782.50 0.00 2,782.50
07/01/10 3067 984.38 0.00 984.38
07/01/10 3069 1,007.80 0.00 1,007.80

4,774.68 0.00 4,774.68


08/04/10 50681 ******4774.68

FOUR THOUSAND SEVEN HUNDRED SEVENTY-FOUR AND 68/100----

X, Y, Z Company
123 Main St
Toronto ON M8Y 3H8




As you can tell, there's quite a bit of textual information and it's quite difficult to grab a dynamic list of invoices. Any ideas?
 

And which information do you need to "grab"?

----------------------------------------------------------------------------
The person who says it can't be done should not interrupt the person doing it. -- Chinese proverb
 
I need to grab everything from what I posted. The invoice list, shown here with three entries, is dynamic:

07/01/10 3065 2,782.50 0.00 2,782.50
07/01/10 3067 984.38 0.00 984.38
07/01/10 3069 1,007.80 0.00 1,007.80

So everything listed in my previous message has to be placed in either a variable or an array.
 
So, basically I want to grab and put into variables the following from the above text:

Variables:
X, Y, Z Company
50681
08/04/10

Dynamic Array:
07/01/10 3065 2,782.50 0.00 2,782.50
07/01/10 3067 984.38 0.00 984.38
07/01/10 3069 1,007.80 0.00 1,007.80

More Variables:
4,774.68
0.00
4,774.68
08/04/10
50681
******4774.68
FOUR THOUSAND SEVEN HUNDRED SEVENTY-FOUR AND 68/100----
X, Y, Z Company
123 Main St
Toronto
ON M8Y 3H8
 

I assume there are several invoices in the file. Is there a way to know what separates each invoice?


 
Good assumption: there are several invoices in the file. However, there is no EOF or EOR character. The next invoice basically starts right after the postal code (M8Y 3H8), so that's what I was going to use as a termination point (sequence: letter, number, letter, space, number, letter, number). In other words, once a line ends with that sequence, move on to the next invoice... It's crazy, I know. Maybe you have a better way? :)
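
The postal-code idea above can be sketched in awk roughly as follows. This is only a sketch: `find_invoice_ends` is a made-up name, and it assumes each invoice ends with a line whose last token is a Canadian-style postal code in upper case (letter, digit, letter, space, digit, letter, digit). It prints the line number of every such line.

```shell
# Hypothetical helper: print the line number of each line that ends in a
# Canadian-style postal code, i.e. a candidate "end of invoice" marker.
find_invoice_ends() {
  awk '
    # letter digit letter, one space, digit letter digit, optional trailing spaces
    /[A-Z][0-9][A-Z] [0-9][A-Z][0-9] *$/ { print NR }
  ' "$@"
}
```

Running `find_invoice_ends inputfile` on the sample would report the line holding "Toronto ON M8Y 3H8". An address line that happens to end in the same pattern would also be reported, so this is a heuristic rather than a true record separator.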
 
Assuming the formatting of each invoice is consistent and that they are separated by some blank lines, something like this may do the trick:


Code:
awk '
        BEGIN {
                do {
                        # skip blank lines; r>0 stops the loops at end of input
                        do { r=getline } while (r>0 && $0 ~ /^$/)
                        if (r<=0) break
                        # header line: company name, number, date
                        date=$NF; $NF=""
                        number=$(NF-1); $(NF-1)=""
                        company=$0
                        print "company:",company, date, number
                        do { r=getline } while (r>0 && $0 ~ /^$/)
                        print "invoices: "
                        do { print; r=getline } while (r>0 && $0 !~ /^$/)
                        do { r=getline } while (r>0 && $0 ~ /^$/)
                        print "subtotal: " $0
                        do { r=getline } while (r>0 && $0 ~ /^$/)
                        print "total: " $0
                        do { r=getline } while (r>0 && $0 ~ /^$/)
                        print "words: " $0
                        do { r=getline } while (r>0 && $0 ~ /^$/)
                        print "address:"
                        do { print; r=getline } while (r>0 && $0 !~ /^$/)
                } while (r>0)
        }
' inputfile

I haven't assigned everything to variables yet, but it should be a good starting point for you.

Annihilannic.
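
As an alternative to the getline loops, awk's paragraph mode may be worth a look. This is a sketch of a different technique, not the code posted above: setting RS to the empty string makes each blank-line-separated block one record, and FS="\n" makes each line of the block a field. It assumes every invoice consists of exactly six blocks in the order shown in the sample (header, invoice lines, totals, cheque line, amount in words, address), and that blank lines are truly empty. `parse_invoices` is a made-up name.

```shell
# Hypothetical helper: label each block of an invoice using paragraph mode.
parse_invoices() {
  awk '
    BEGIN { RS = ""; FS = "\n" }      # one record per block, one field per line
    {
      part = (NR - 1) % 6             # assumption: six blocks per invoice
      if (part == 0) {
        print "header: " $1
      } else if (part == 1) {
        for (i = 1; i <= NF; i++) print "invoice " i ": " $i
      } else if (part == 2) {
        print "totals: " $1
      } else if (part == 3) {
        print "cheque: " $1
      } else if (part == 4) {
        print "words: " $1
      } else {
        for (i = 1; i <= NF; i++) print "address " i ": " $i
      }
    }
  ' "$@"
}
```

Usage would be `parse_invoices inputfile`. Because the invoice list is a single record, its length is simply NF, so a dynamic number of invoice lines needs no special handling.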
 
Great!! I'm going to give it a shot right now.
 
Awesome code there, Annihilannic!

It worked flawlessly. I'm having trouble with the invoices portion of the script, though: I'm trying to assign the invoices to an array as they are pulled line by line. Any idea on that?

 
Something like this:

Code:
                        # surrounding lines, shown for context
                        print "company:",company, date, number
                        do { r=getline } while (r>0 && $0 ~ /^$/)
                        # new: collect the invoice lines into arrays
                        i=0
                        do {
                                i++
                                # field numbers as posted; the sample lines have five
                                # fields (date, invoice number, three amounts), so
                                # adjust the numbers to suit
                                item_date[i]=$1
                                item_cost[i]=$2
                                item_tax[i]=$3
                                item_total[i]=$4
                                r=getline
                        } while (r>0 && $0 !~ /^$/)
                        print "invoices: "
                        for (j=1;j<=i;j++) {
                                print j,item_date[j],item_cost[j],item_tax[j],item_total[j]
                        }
                        # surrounding line, shown for context
                        do { r=getline } while (r>0 && $0 ~ /^$/)

Annihilannic.
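
Since the sample invoice lines carry five fields and the amounts contain thousands separators, a variation along these lines might be useful for checking the totals. It is only a sketch under assumptions: the names `sum_invoices`, `inv_date`, `inv_number` and `inv_total` are made up, the second field is guessed to be the invoice number and the fifth the line total, and invoice lines are recognised as five-field lines starting with an mm/dd/yy date.

```shell
# Hypothetical helper: store invoice lines in arrays and add up the last column.
sum_invoices() {
  awk '
    # assumption: invoice lines have five fields and start with mm/dd/yy
    NF == 5 && $1 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ {
      n++
      inv_date[n]   = $1
      inv_number[n] = $2
      amount = $5
      gsub(/,/, "", amount)           # "2,782.50" -> "2782.50" so it adds up as a number
      inv_total[n] = amount
      sum += amount
    }
    END {
      for (j = 1; j <= n; j++) print j, inv_date[j], inv_number[j], inv_total[j]
      printf "sum %.2f\n", sum
    }
  ' "$@"
}
```

On the three sample lines this should come to 4774.68, which matches the total printed on the invoice. Without the gsub, awk would read "2,782.50" as the number 2 and the sum would be wrong.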
 
Again, the array is more than what I expected, Annihilannic!! I was going to have a separate array for each invoice entry and you already beat me to it (i.e. item_date=$1, item_cost=$2, item_tax=$3, item_total=$4).

Now everything is in variables and I can generate an output file!! By the way, what editor do you use to get syntax highlighting? It's so much easier to read. Right now I'm using DOS Editor, and it just isn't pretty. Is there something Windows-based you can recommend?
 
I wrote my own Perl script to add TGML markup to awk scripts for posting on this site, so it's not exactly what you need.

I know vim can do it; but vim is an acquired taste... especially if you're not a Unix/Linux user.

I use Programmer's Notepad under Windows. It supports scheme files, which let you define syntax highlighting for any language. There doesn't seem to be an existing one for AWK that I can see, but it would be easy to make one.

Many other programmers' editors have similar facilities, though, and you should be able to find one that does it. The Komodo IDE is another likely option (Lite version available for free).

Annihilannic.
 