Tek-Tips is the largest IT community on the Internet today!


Help with file processing code

Status
Not open for further replies.

JamesOwen

Programmer
Nov 8, 2011
12
GB
Hi guys,

Can I please get some help with this code?

I have an XML feed file, a rapidly changing temporary file, and I need to capture the content of this file as soon as the data arrives.

Example of the data

[date+time], message=[DATA= "<?xml version="1.0"?><data changeMsg><NAME="John Smith"><Age="23"><D.O.B="11-10-1988"><Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0"?><data changeMsg><NAME="Emy Williams"><Age="23"><D.O.B="01-05-1988"><Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0"?><data changeMsg><NAME="Jack Adam"><Age="66"><D.O.B="24-07-1945"><Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0"?><data changeMsg><NAME="Charlie Daniel"><Age="38"><D.O.B="15-08-1973"><Gender="Male">"
[date+time], message=[DATA= "<?xml version="1.0"?><data changeMsg><NAME="Ruby James"><Age="38"><D.O.B="11-03-1973"><Gender="Female">"
[date+time], message=[DATA= "<?xml version="1.0"?><data changeMsg><NAME="Sophie Thomas"><Age="20"><D.O.B="12-09-1991"><Gender="Female">"

Required data output

8:30,Male,23,1
8:31,Female,23,1
8:32,Female,30,4
8:33,Male,50,10

Time is current time.

This is the ksh and awk code that I have so far, but it doesn't do what I need it to do. Can I please get help with it?

All I want the code to do is run for 2 minutes, process the counts, write them to the output, then do the same process again and again.

Code:

awk 'BEGIN { INTERVAL=120; "date +%s" | getline sec;
    NEXT=sec+120; }

{
    if ( sec >= NEXT )
    {
        printf( "\nSummary\n" );
        for ( x in agcount )
            printf( "%s,%d\n", x, agcount[x] ) | "sort";

        NEXT=sec+120;
    }

    gsub( ">", "" );   # strip unneeded junk and make "foo bar" easy to capture
    gsub( " ", "~" );
    gsub( "<", " " );

    for ( i = 1; i <= NF; i++ )   # snarf up each name=value pair
    {
        if ( split( $(i), a, "=" ) == 2 )
        {
            gsub( "\"", "", a[2] );
            gsub( "~", " ", a[2] );
            values[a[1]] = a[2];
        }
    }

    #gcount[values["Gender"]]++;   # collect counts
    #acount[values["Age"]]++;
    agcount[values["Gender"] "," values["Age"]]++;

    printf( "%s %s %s %s\n", values["NAME"], values["Age"], values["D.O.B"], values["Gender"] );
}' input-file

Will anyone be able to help me with this?

Any help would be greatly appreciated.

James
 
Sorry, just to add: I can't use the cron scheduler or gawk.

Thanks

James
 
In what way does it not "do what I need to do"? Error messages? Partial data? Incorrect output format...?

Annihilannic
[small]tgmlify - code syntax highlighting for your tek-tips posts[/small]
 
Yes, it doesn't loop for two minutes, and also it doesn't print any output.
 
Each time you process the file, i.e. every 2 minutes, will it contain only new data, or will it still contain what was processed the previous time?

Does that even matter or do you just want to process the entire contents of the file each time, regardless of whether it is new or old data or a mixture of both?

Also your sample data contains "date+time". What does it look like in reality? Do you want to pull the timestamp out of that field or just generate it at the time the data is processed?

The simplest way to loop every 2 minutes (approximately, but very close) would be to put a ksh while loop around your awk script, with a sleep 120, but in this scenario you may have to deal with the new vs old data as mentioned above.
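That suggestion could be sketched like this (hedged: `process.awk` is a hypothetical file name for the awk program posted above, and `input-file` is the feed file):

```shell
#!/bin/ksh
# Minimal sketch of the loop suggested above: wrap the existing awk
# script in a ksh while loop with a sleep between passes.
while true; do
    awk -f process.awk input-file   # one processing pass over the file
    sleep 120                       # wait (roughly) 2 minutes, then go again
done
```

Note that each pass re-reads the whole file, which is exactly the new-vs-old-data issue mentioned above.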

Annihilannic
 
Every two minutes it should contain new data, and it shouldn't process the old data again.

It also shouldn't process the entire file at once, only the content from the last 2 minutes.

Date and time should be created within the process.

I have tried to put a sleep 120 in, but it doesn't work. Is this something you will be able to help with please?

Thank you again
 
Just a couple of questions from myself, as I'm slightly confused about your requirements:

1) If you're on Unix, then why can't you use cron to schedule your script to run?

2) Do you mean that the application producing the xml output is creating a [date+time] but you have just hidden it in your example above? Or do you mean that your script that processes the xml file should replace the string [date+time] with a real time? And if so, what format and what time? ... the date & time you start to process the record, the date & time you finish processing it, or some point in between?

I would hope the application would provide the date & time in the xml, as this would help you track the last record processed, allowing you to start at the next record. Otherwise I see nothing unique in your file to identify what you have or have not processed!

3) You say the file is continuously changing, so what happens to the file? Does it roll over at a specific file size, time of day, or number of records ... does it just get deleted after x number of records? Without knowing that, it's difficult to know how best you could code your processing method,
i.e. if it rolls, then how will you notify your script that it's changed and start looking at the new file stream?

That's just a few questions ......

The way I would expect (prefer) to process it, assuming it has a date & time stamp and could be run under cron, is:

1) set a "start time" (update a file if necessary with this time)
2) grep the xml file for all records with a date+time from that "start time" up to "start time" + 2 minutes
3) change the start time, with the new time being equal to the time of the last record read from the file
4) process the records found and summarise to your desired format
5) at some point (need not be 2 minutes later, but could be) start the process again ..... read in the "start time", process records, reset "start time" ..........

Cron would be best for triggering this, as you can check to confirm it's not still processing, and if so wait and try again.

You could just run it under a while loop, but again you have to have some other process to be sure that the process itself remains running (you would need a keepalive process) and/or risk the process stalling and you losing data.
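A rough sketch of steps 1-5 above, under loud assumptions (each record starts with an epoch-seconds timestamp in field 1 — the real [date+time] format is unknown, so the awk comparison would need adjusting — and the file names are made up):

```shell
#!/bin/ksh
STATE=./start_time                  # hypothetical "start time" state file
FEED=./file_to_process.xml          # hypothetical feed copy to process

[ -f "$STATE" ] || date +%s > "$STATE"    # step 1: seed the "start time"

while true; do
    start=$(cat "$STATE")
    end=$((start + 120))                  # 2-minute window

    # step 2: pull records whose timestamp falls inside the window
    awk -v s="$start" -v e="$end" '$1 >= s && $1 < e' "$FEED" > window.tmp

    echo "$end" > "$STATE"                # step 3: advance the start time

    # step 4: summarise window.tmp here (e.g. the Gender,Age counting awk)
    wc -l < window.tmp

    sleep 120                             # step 5: start the process again
done
```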

Laurie.
 
My preferred approach would be to continuously "tail -f" the file and process records as they arrive rather than doing so every 2 minutes; however, as tarn mentioned, this may present challenges when the file (presumably, hopefully?) is periodically rotated or truncated. When that happens your process would either have to be restarted or be smart enough to rewind to the beginning of the file.

tarn, I think he has answered your question 2 already in his latest post, i.e. the script itself should supply them at the time of processing.

Annihilannic
 
Hi guys,

Thank you for the help.

@tarn

Yes, the messages will have a date and time stamp, but the time in the output example is the current time, meaning whenever the messages were processed.

I am new to programming; would you please be able to help me with examples for steps 1 to 5?

Thanks

James
 
tarn, I think he has answered your question 2 already in his latest post, i.e. the script itself should supply them at the time of processing.

Well, yes, probably, but it was not very clear whether the string "[date+time]" exists in the file or if James is reading the example with some tool that cannot translate/format the date as presented in the file. I'd be surprised if the data file was not providing a timestamp, but who knows (I'd want to fix that problem first, but that's just me ;) )

Assuming that there is no date, I agree that you tail the file into another file for processing and place the date/time in with each record. Then simply process as I suggested in my last post, based on date/time values, to track the progress of the auditing.

Then, as I think you agree, there has to be a parallel process to monitor for if and when the file rolls over and HUP the process.

Not sure I can provide much more (without writing it myself).

Other than, I guess, that maybe James should break down his original script into smaller processes and identify why either he gets no output OR why the loop timer does not work, but not try to debug both at once.

Laurie.
 
@Annihilannic

I have tried to use "tail -f", but the only problem with this is that as I get new data, the old data is gone. This is a rapidly changing file and I need to process the data as soon as it arrives.
@tarn
I have tried to break the code into parts, but I am still not getting the 2-minute looping to work and then write to the output file.

Guys, can I please get help with the code.

Thank you all.
 
JamesOwen said:
I have tried to use "tail -f", but the only problem with this is that as I get new data, the old data is gone. This is a rapidly changing file and I need to process the data as soon as it arrives.

Sorry, I don't understand what you mean here... how will the old data be "gone"? Is it because the file is overwritten? And if that's the case, surely waiting 2 minutes between each processing cycle will mean even more data will be missed?

Annihilannic
 
Sorry, I don't understand what you mean here... how will the old data be "gone"? Is it because the file is overwritten? And if that's the case, surely waiting 2 minutes between each processing cycle will mean even more data will be missed?

Yes, the file is overwritten, and this is why I was thinking of using a pipe.

 
Sounds like a good idea... have you tried that to see if the application is happy writing to a pipe file? You could just run a simple test using cat to pull the data from the pipe and make sure it continues to work as normal.

For example:

Code:
# stop application
mv logfile.xml logfile.xml.processed
mknod logfile.xml p
nohup cat <logfile.xml >>logfile.xml.processed &
# restart application

If that's all fine then you can replace your cat with a more intelligent script.

Annihilannic
 
OK, having read your first posting, I see that your xml is a "feed" file, which I assume is going "into" the application and as such being consumed, rather than an output file which builds over time. I think what we both meant with the tail was that you tail -f your <feed>.xml file into a new file, as in:

Code:
 # tail -f <feed>.xml | tee file_to_process.xml

This way you will not lose anything, as everything that is written to <feed>.xml will be written into file_to_process.xml.

You can then think about how, in parallel, you can process that file and not worry about the feed file.

This should also remove our concerns about file rollover, as that is unlikely to be the case: now that we understand the file you want to process is the "input file", we can assume it will be a constant feed source input file rather than an application output logging file.

The obvious issue here is that you will not get an exact "real-time" date stamp with each record line, and depending on the method you use to add, for instance, a time-stamp, the best would be a few milliseconds after the record arrives. You have not indicated the TPS (transactions per second) you expect to flow through your feed file, so we should not try to guess.

If you care not about the time each record arrives, but only about a timestamp as you re-process the parallel "file_to_process.xml" contents, then you could use a "row number" to keep track of what you have already processed.

Just number your records with:

Code:
-bash-3.00$ cat -n file_to_process.xml
     1  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
     2  Emy Williams     23      01-05-1988      Female   17-Jan-2012-09:38:11
     3  Jack Adam        66      24-07-1945      Male     17-Jan-2012-09:38:11
     4  Charlie Daniel   38      15-08-1973      Male     17-Jan-2012-09:38:11
     5  Ruby James       38      11-03-1973      Female   17-Jan-2012-09:38:11
     6  Sophie Thomas    20      12-09-1991      Female   17-Jan-2012-09:38:11

I have a long one-liner that would process your xml data with the output above (not including numbering, instead adding a date/time to the records), but that requires the use of nawk, which you say you do not have access to.

But I will post it here anyway just for a reference:

Code:
nawk -F"[<|>:@]" 'BEGIN{ "date +%d-%b-%Y-%T" | getline dateVal;{ pdate=dateVal}};  $0~/changeMsg/ {print $6,$8,$10,$12,pdate}' file_to_process.xml | nawk -F\" '{print $2"\t",$4"\t",$6"\t",$8"\t",$9}'

I really don't have time to build a test harness / feed generator to simulate your feed.xml (the above was just me passing some dead time while waiting to do an overnight system restart [Yawn]).

I would hope that you could now however see:

1) That you can collect your feed.xml in parallel by piping it through "tee", so you will not lose anything... if it flows through feed.xml it will tee to file_to_process.xml

2) You can process your tee'd file in parallel, and if necessary add simple line numbers by piping it through cat -n

That would allow you to post-process your file via another script using the last processed line/record number +1 as the point from where to start on the next iteration.

3) With a pre-processed file like the nawk processing I demo above, it should be easy to take that output and post-process it via another reporting script, which could be run in a simple "while true" and "sleep" script if you really wanted to chomp your report into specific time windows ...

Code:
#This simple while will just provide clues on how you can loop with a sleep delay, only ceasing when you send a break .. (no smart reporting here, just juggling the columns).....

$ while true; do cat file_to_process.xml | awk -F'\t' '{print $5",",$4","$2}' >> out.log; sleep 3; done

# Returns:
  17-Jan-2012-09:38:11,  Male, 23
  17-Jan-2012-09:38:11,  Female, 23
  17-Jan-2012-09:38:11,  Male, 66
  17-Jan-2012-09:38:11,  Male, 38
  17-Jan-2012-09:38:11,  Female, 38
  17-Jan-2012-09:38:11,  Female, 20
  17-Jan-2012-09:38:11,  Male, 23
  17-Jan-2012-09:38:11,  Male, 23
# Into the out.log

# Until you send a break
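The line-number tracking mentioned in point 2 could be sketched like this (hedged: the checkpoint file name and sleep interval are made up, and the summarising step is left as a placeholder):

```shell
#!/bin/ksh
# Sketch: remember the last line number processed and start at the next
# one each pass. file_to_process.xml is the tee'd copy built up by tail -f.
CHECKPOINT=./last_line
[ -f "$CHECKPOINT" ] || echo 0 > "$CHECKPOINT"

while true; do
    last=$(cat "$CHECKPOINT")
    total=$(wc -l < file_to_process.xml)
    if [ "$total" -gt "$last" ]; then
        # emit only the unprocessed lines, then move the checkpoint forward
        awk -v n="$last" 'NR > n' file_to_process.xml > new_lines.tmp
        echo "$total" > "$CHECKPOINT"
        # ... summarise new_lines.tmp here ...
    fi
    sleep 120
done
```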

These are just some of the ways to process files and only random examples that can provide some clues. There are probably 101 more different ways to do the same thing (and much easier if you have the rolling feed file to test with).

I also recommend buying a copy of "UNIX Shells by Example"; it's a fantastic bible and so much more tactile than the Internet.

Good luck

Laurie.

 
Guys,

Thanks for helping.

I have tried your suggestions; the only problem is that I still can't loop for only 2 minutes, then pause the process to write to another file, then start from the same place again without processing the data that has already been processed.

@tarn

Thank you for recommending this book, I will definitely buy it.

James
 
Can we see your code to see how you are doing the 2 minute pauses?

Annihilannic
 

This part of my code should be the part which pauses the code for 2 minutes and also does the counting.

Code:
awk 'BEGIN { INTERVAL=120;    "date +%s"|getline sec;
    NEXT=sec+120;}

    {
        if(sec >= NEXT)
        {
           printf( "\nSummary\n" );
           for( x in agcount )
              printf( "%s,%d\n", x, agcount[x] ) | "sort";

           NEXT=sec+120;
        }
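One likely culprit in that fragment: sec is set once in BEGIN and never refreshed, so the sec >= NEXT test can never become true. A hedged sketch of a fix, re-reading the clock on every record (the parsing and counting lines are elided as a comment):

```shell
awk 'BEGIN { INTERVAL = 120
             "date +%s" | getline sec; close("date +%s")
             NEXT = sec + INTERVAL }
{
    # re-read the clock on every input line; in the original, sec was set
    # once in BEGIN and never updated, so sec >= NEXT was never true
    "date +%s" | getline sec; close("date +%s")
    if (sec >= NEXT) {
        printf("\nSummary\n")
        for (x in agcount) printf("%s,%d\n", x, agcount[x]) | "sort"
        close("sort")            # flush this interval through sort
        NEXT = sec + INTERVAL
    }
    # ... field parsing and agcount[...]++ as in the original script ...
}' input-file
```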
 
OK, here you go James, maybe this will get you started .....

Remember this is downright DIRTY and probably not the way to code it, but ...............

Code:
-bash-3.00$ cat tilly.sh
#!/bin/bash

SOURCE=new_file_to_process.xml
LINES_TO_PROCESS=3
TIMER=10
start1=0
stop1=0
#
echo " this will provide a dirty start of a script that will read "
echo "lines from your file to process and keep looping until you"
echo "break, it prints lines to process then sleeps 10 seconds"
echo " then starts where it left off at the next record"
#
while true; do
awk 'NR=="'$start1'",NR=="'$stop1'"' $SOURCE | awk '{print $1, $5",", $4","$2}'
echo ""
echo "----------- Need to do some calculation ------------"
echo "           ~ `date` ~"
echo "            And make your report output             "
echo ""
# Recalculate start and end line numbers for the loop
start1=`echo $stop1+1|bc`
stop1=`echo $stop1+$LINES_TO_PROCESS|bc`
# pause for a period before looping through new records
sleep $TIMER;
done

###########################################################
# Source file for testing is like this .......
# No real need for the line numbers in the file, but it just shows the script is processing
# the lines we expect it to ........
#
#cat new_file_to_process.xml
#     1  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#     2  Emy Williams     23      01-05-1988      Female   17-Jan-2012-09:38:11
#     3  Jack Adam        66      24-07-1945      Male     17-Jan-2012-09:38:11
#     4  Charlie Daniel   38      15-08-1973      Male     17-Jan-2012-09:38:11
#     5  Ruby James       38      11-03-1973      Female   17-Jan-2012-09:38:11
#     6  Sophie Thomas    20      12-09-1991      Female   17-Jan-2012-09:38:11
#     7  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#     8  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#     9  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#    10  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#    11  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#    12  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#    13  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#    14  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#    15  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
#    16  John Smith       23      11-10-1988      Male     17-Jan-2012-09:38:11
############################################################

And now when it's running .... send a break to end the loop ..

In reality you need a keep-alive to ensure your while loop keeps running .....

Code:
-bash-3.00$ ksh ./tilly.sh
 this will provide a dirty start of a script that will read
lines from your file to process and keep looping until you
break, it prints lines to process then sleeps 10 seconds
 then starts where it left off at the next record

----------- Need to do some calculation ------------
           ~ Fri Jan 20 21:00:24 GMT 2012 ~
            And make your report output

1 11-10-1988, 23,John
2 01-05-1988, 23,Emy
3 24-07-1945, 66,Jack

----------- Need to do some calculation ------------
           ~ Fri Jan 20 21:00:34 GMT 2012 ~
            And make your report output

4 15-08-1973, 38,Charlie
5 11-03-1973, 38,Ruby
6 12-09-1991, 20,Sophie

----------- Need to do some calculation ------------
           ~ Fri Jan 20 21:00:44 GMT 2012 ~
            And make your report output

7 11-10-1988, 23,John
8 11-10-1988, 23,John
9 11-10-1988, 23,John

----------- Need to do some calculation ------------
           ~ Fri Jan 20 21:00:55 GMT 2012 ~
            And make your report output

10 11-10-1988, 23,John
11 11-10-1988, 23,John
12 11-10-1988, 23,John

----------- Need to do some calculation ------------
           ~ Fri Jan 20 21:01:05 GMT 2012 ~
            And make your report output

13 11-10-1988, 23,John
14 11-10-1988, 23,John
15 11-10-1988, 23,John

----------- Need to do some calculation ------------
           ~ Fri Jan 20 21:01:15 GMT 2012 ~
            And make your report output

^C-bash-3.00$

Enjoy .......

Laurie.
 
@tarn, thank you, this is a good place to start, but the problem is that I need to do some counts like the ones in my sample output.

Code:
8:30,Male,23,1
8:31,Female,23,1
8:32,Female,30,4
8:33,Male,50,10

Also, I am not sure where you are parsing the xml messages in this code; can you please help with this?

Thank you again
James
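For reference, the counting step could be sketched like this against the pre-processed, whitespace-separated file shown earlier in the thread (hedged: the field positions $4 = age and $6 = gender are assumptions based on that sample, and date +%H:%M stands in for the "current time" column):

```shell
# Count Gender,Age pairs and prefix the current time, one summary per run.
# Field positions ($4 = age, $6 = gender) are assumed from the sample file.
awk '{ agcount[$6 "," $4]++ }
     END {
         "date +%H:%M" | getline now; close("date +%H:%M")
         for (k in agcount) print now "," k "," agcount[k]
     }' file_to_process.xml | sort
```

Run inside the 2-minute loop against only the new lines, this would produce one time,Gender,Age,count block per interval, matching the required output format.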

 