Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

tcpflow data parsing with awk to insert into mysql 1

Status
Not open for further replies.

tehiznogud

Programmer
Mar 1, 2008
3
BA
Hi all!

First of all I would like to greet you all :) I am a newbee to Awk and to problem I am solving as well, but I am trying hard to complete the task at hand :) I need kind of conceptual help rather than "coding" help, I could probably do the whole job with manual but I am not sure if I am aiming in the right direction with solution to my problem. I've been reading Awk User's Guide for about a week, and I am becoming comfortable with it. Unfortunately, I have had no previous experience with bash or anything similar to Awk - all I did was some c++ in the school and some pretty basic PHP later on. Even so, I have no problem with reading more, I just need some pointers for the logics I am trying to employ. All help is greatly appreciated!

Now, let me explain my trouble:

I am using slackware 12.4 machine to sniff CDR information from my Voice switch. Tcpflow collects data from particular port, and pipes it to an awk script. All data is in the form of opening line, 2-5 lines of information, and closing line. Each record is separated by an empty line (in an ideal world, I am not yet 100% sure that EACH record separator line is blank). So, records look like following when tcpflow output is reviewed:
Code:
...

session_opening string
some_data_1
some_data_2
some_data_3
some_data_4
session_closing string

session_opening string
some_data_1
some_data_2
some_data_3
some_data_4
session_closing string

...
After I collect data with tcpflow, I pipe that data to an Awk script that has a loop where lines are examined and prepared for sql injection. Basically, it is series of conditions like this:
Code:
{
 if (/A/) {							
  sql = parse_a($0)			
  query = (query sql)
 }								
 if (/B/) {							
  sql = parse_b($0)			
  query = (query sql)
 }								
 if (/C/) {							
  sql = parse_c($0)			
  query = (query sql)
  print ("mysql --user=user --password=pass -e " query) | "/bin/sh"
  query = ""
  system("")
 }								
}
[small]Functions for parsing are separated and irrelevant - they just parse string and extract numbers, returning just a string in format "INSERT INTO database.table ..."[/small]

Now when I have accomplished to insert few strings into database (about 50.000 records in about 4 hours that script was running) I found out that script would break in peak time, when there is a lots of packets arriving via tcpflow (like, 3,4,5,10 records per second). From my observations I concluded that two records get concatenated in the query string and that simply breaks MySQL. Whether this unpredicted concatenation happens because of slow MySQL response when it is called from shell with -e switch, or because of the slow script that catches up with itself - I am not sure.

Here is where you guys come into picture :) I have thought about different ways to solve my problem. One that I am really considering using looks like this:

1. capture raw data with tcpdump -C -W switches, so tcpdump would rotate -W number of files of -C size;

2. have a small separate awk script that will list tcpdump directory with ls -altr and compare one before latest tcpdump file with information that has already been put into the database (should I read timestamp of the last record from the database, or just store this information into small separate temporary file?). If one-before-last file is older than last record written to the database, go to sleep for one minute or whatever. Otherwise, if file is newer, goto

3. call tcpflow on tcpdump's file, and pipe output to original parser - therefore creating one larger file to be later inserted into database, thus ensuring proper insert and enabling better error handling - from logging "broken" insert data to mailing/smsing myself with relevant information...


So, after a day of thinking, this approach seems logical to me atm - this way I would handle files and not stream. However, after a few readings of the AWK manual I still suspect that there are other and much better ways to handle streams that need to be inserted into databases, and that I was missing something big. Could you please discuss this with me little further and give me your 2 cents on this?

Much obliged!!!

Sincerely yours,

iznogud
 
Thanks for that suggestion - I certainly will use LOAD DATA INFILE - site says it is up to 20 times faster than regular INSERT INTO statement...

seems like noone had anything to say about my problems... i just hope that it is because i am on the right track, and not because my intro post is so long :)

Thanks once again hoinz for this suggestion and for taking time to think about my question.

cheers all!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top