Joining and sorting multiple text files 2

cdj02 · Jun 2, 2012

Hello, I run systems that collect measurement data and dump it into delimited text files. I'm trying to join these files and sort them by Date, Target_Name, and site so that i can plot the data. Currently, i have the following code that does work to join the files into a single file called All_CST.cst but I'm new to Perl and I'm having trouble with sorting. I know how to sort simple lists in Perl but re-ordering more complex data like I have in these logfiles is a bit beyond me at the moment (and I'd really like to avoid having to use MS Excel to do my sorting). I'd appreciate any help in pointing me in the right direction on sorting this type of data. Thanks in advance for your time.

My Script
#!/usr/bin/perl
open (LOGFILEOUT, "> ALL_CST.cst");

@files = <*.cst,*>;

$totallines=0;

foreach(@files){
if ($_ =~ /ALL_CST.cst/){next;};
$filename=$_;
open (LOGFILEIN, "< $filename");
while (<LOGFILEIN>){
print LOGFILEOUT "$_ \n";
}

}

The Data files are here:

http://www.mediafire.com/?bu2xsk1652cajxv

http://www.mediafire.com/?hqlie3gjcpwzhi4

http://www.mediafire.com/?1k4p4wc2z479ovk

feherke · Jun 2, 2012

Hi

Sorting huge amount of data is not a trivial task. Based on the line separators used in the log files I suppose you are on a Unix-like operating system. In that case I would start with [tt]sort[/tt] :

Code:

sort -k20 -k16 -k1n *.cst,* > ALL_CST.cst

Or to skip the header lines :

Code:

tail -qn+2 *.cst,* | sort -k20 -k16 -k1n > ALL_CST.cst

But there is a big problem : the date format. If it would be ISO date, it could be sorted as string, but this one need abit of preprocessing :

Code:

[gray]# if 04/12/2012-14:10:37 means 2012 April 12[/gray]
tail -qn+2 *.cst,* | awk '[teal]![/teal][blue]NF[/blue][teal]{[/teal][COLOR=chocolate]next[/color][teal]}[/teal][navy]$20[/navy][teal]=[/teal][COLOR=chocolate]substr[/color][teal]([/teal][navy]$20[/navy][teal],[/teal][purple]7[/purple][teal],[/teal][purple]4[/purple][teal])[/teal][green][i]"/"[/i][/green][COLOR=chocolate]substr[/color][teal]([/teal][navy]$20[/navy][teal],[/teal][purple]1[/purple][teal],[/teal][purple]5[/purple][teal])[/teal][COLOR=chocolate]substr[/color][teal]([/teal][navy]$20[/navy][teal],[/teal][purple]11[/purple][teal])[/teal]' | sort -k20 -k16 -k1n > ALL_CST.cst

[gray]# if 04/12/2012-14:10:37 means 2012 December 04[/gray]
tail -qn+2 *.cst,* | awk '[teal]![/teal][blue]NF[/blue][teal]{[/teal][COLOR=chocolate]next[/color][teal]}[/teal][navy]$20[/navy][teal]=[/teal][COLOR=chocolate]substr[/color][teal]([/teal][navy]$20[/navy][teal],[/teal][purple]7[/purple][teal],[/teal][purple]4[/purple][teal])[/teal][COLOR=chocolate]substr[/color][teal]([/teal][navy]$20[/navy][teal],[/teal][purple]3[/purple][teal],[/teal][purple]4[/purple][teal])[/teal][COLOR=chocolate]substr[/color][teal]([/teal][navy]$20[/navy][teal],[/teal][purple]1[/purple][teal],[/teal][purple]2[/purple][teal])[/teal][COLOR=chocolate]substr[/color][teal]([/teal][navy]$20[/navy][teal],[/teal][purple]11[/purple][teal])[/teal]' | sort -k20 -k16 -k1n > ALL_CST.cst

This changes the data format. If you need it unchanged, then better add another column at the end of each line, for sorting purpose only.

As you asked in the Perl forum, here is a Perl solution too :

Code:

perl -nlae '[navy]$h[/navy][teal]=[/teal][navy]$_[/navy] [b]if[/b]$[teal].==[/teal][purple]1[/purple][teal];([/teal][navy]$d[/navy][teal]=[/teal][navy]$F[/navy][teal][[/teal][purple]19[/purple][teal]])=~[/teal][b]s[/b][fuchsia],(../..)/(....),$2/$1,[/fuchsia][teal];[/teal][b]push[/b][navy]@d[/navy][teal],[[/teal][navy]@F[/navy][teal],[/teal][navy]$d[/navy][teal]][/teal][b]if[/b][navy]@F[/navy][teal]&&[/teal][navy]$_[/navy] ne[navy]$h[/navy][teal];[/teal]END[teal]{[/teal][b]print[/b] [b]join[/b][green][i]"\t"[/i][/green][teal],[/teal][b]splice[/b]@[teal]{[/teal][navy]$_[/navy][teal]}[/teal][teal],[/teal][purple]0[/purple][teal],-[/teal][purple]1[/purple][b]for[/b] [b]sort[/b][teal]{[/teal][navy]$a[/navy][teal]->[-[/teal][purple]1[/purple][teal]][/teal][b]cmp[/b][navy]$b[/navy][teal]->[-[/teal][purple]1[/purple][teal]]||[/teal][navy]$a[/navy][teal]->[[/teal][purple]15[/purple][teal]][/teal][b]cmp[/b][navy]$b[/navy][teal]->[[/teal][purple]15[/purple][teal]]||[/teal][navy]$a[/navy][teal]->[[/teal][purple]0[/purple][teal]][/teal][green][i]<=>[/i][/green][navy]$b[/navy][teal]->[[/teal][purple]0[/purple][teal]][/teal][teal]}[/teal][navy]@d[/navy][teal]}[/teal]' *.cst,* > ALL_CST.cst

This one interprets the date as mm/dd/yyyy. All header lines are skipped. Note that field separators are replaced with tab characters.

This one preserves the original format, but may fail sooner :

Code:

perl -nlae '[navy]$h[/navy][teal]=[/teal][navy]$_[/navy] [b]if[/b]$[teal].==[/teal][purple]1[/purple][teal];([/teal][navy]$d[/navy][teal]=[/teal][navy]$F[/navy][teal][[/teal][purple]19[/purple][teal]])=~[/teal][b]s[/b][fuchsia],(../..)/(....),$2/$1,[/fuchsia][teal];[/teal][b]push[/b][navy]@d[/navy][teal],[[/teal][navy]@F[/navy][teal][[/teal][purple]0[/purple][teal],[/teal][purple]15[/purple][teal]],[/teal][navy]$d[/navy][teal],[/teal][navy]$_[/navy][teal]][/teal][b]if[/b][navy]@F[/navy][teal]&&[/teal][navy]$_[/navy] ne[navy]$h[/navy][teal];[/teal]END[teal]{[/teal][b]print[/b] [navy]$_[/navy][teal]->[[/teal][purple]3[/purple][teal]][/teal][b]for[/b] [b]sort[/b][teal]{[/teal][navy]$a[/navy][teal]->[[/teal][purple]2[/purple][teal]][/teal][b]cmp[/b][navy]$b[/navy][teal]->[[/teal][purple]2[/purple][teal]]||[/teal][navy]$a[/navy][teal]->[[/teal][purple]1[/purple][teal]][/teal][b]cmp[/b][navy]$b[/navy][teal]->[[/teal][purple]1[/purple][teal]]||[/teal][navy]$a[/navy][teal]->[[/teal][purple]0[/purple][teal]][/teal][green][i]<=>[/i][/green][navy]$b[/navy][teal]->[[/teal][purple]0[/purple][teal]][/teal][teal]}[/teal][navy]@d[/navy][teal]}[/teal]' *.cst,* > ALL_CST.cst

As both Perl codes load all data into the memory, they will hit memory allocation error at a certain file size. To avoid it a lot of extra work is needed.

Feherke.
[link feherke.github.com/]

http://feherke.github.com/

[/url]

cdj02 · Jun 2, 2012

Thanks for the speedy response!

The data is generated on a unix system but I pull the files remotely to my Windows based laptop and process it with strawberry perl. I'd be ok with processing the files line by line if that is a better way to do it. As i said, i'm new to perl so slurping a whole file at a time seemed easiest to me. If there is a better way to do this please let me know. I'm typically grabbing only a sample of 10 files at a time so I would think I'd be ok memory wise (an assumption on my part). At the most extreme I can think I might ever need would be a grab of 50 files. Based on the file sizes I've shown you (and without knowing my laptop specs), do you think I'd be likely to run into trouble with memory allocation?

btw- the date format is "04/12/2012-14:10:37 means 2012 April 12" and i have no problem transposing the date format to whatever is easiest to work with. I'm thinking I'm going to need to use the time from the date stamp for sorting as well to get things in the right order.

I'm going to give the code snippets you've shown me a try. If you can think of a better way to do the combined join/sort for these files please let me know. Thanks again for your time.

feherke · Jun 2, 2012

Hi

cdj02 said:
I'd be ok with processing the files line by line if that is a better way to do it.

Well, the sorting part could be hardly done that way, so there will be a moment anyway when all the data will be in memory.

Sorry, I can not estimate the memory requirement for that amount of files. I just mentioned the allocation error as a weak point of the concept. If you will have problems, my first suggestion would be to remove the columns containing nothing else than "NA".

cdj02 said:
I'm thinking I'm going to need to use the time from the date stamp for sorting as well to get things in the right order.

Parsing a string into time I would expect to be slower than just rearranging substrings. ( I mean, using [tt]substr()[/tt]. I used the regular expression based mm/dd <-> yyyy swapping just because got bored by typing [tt]substr()[/tt] while used it in the Awk code earlier in my post... If you notice speed problems, we can change that part. )

Feherke.
[link feherke.github.com/]

http://feherke.github.com/

[/url]

cdj02 · Jun 2, 2012

Feherke, I'm testing out this line

Code:

 perl -nlae '$h=$_ if$.==1;($d=$F[19])=~s,(../..)/(....),$2/$1,;push@d,[@F,$d]if@F&&$_ ne$h;END{print join"\t",splice@{$_},0,-1for sort{$a->[-1]cmp$b->[-1]||$a->[15]cmp$b->[15]||$a->[0]<=>$b->[0]}@d}' *.cst,* > ALL_CST.cst ]

but I'm getting Bareword found near "* > ALL_CST" <Missing operator before ALL_CST?> and a syntax error near "*."

I think it doesn't like the ",*" at the end of the "*.cst,*". In my original code I had to put <> around the "*.cst,* to get perl to accept it but that doesn't seem to work here.

any thoughts on what I'm missing?

cdj02 · Jun 2, 2012

I don't know where that ] came from at the end of the code snippet in my post above but it doesn't belong there....sorry if it caused any confusion.

feherke · Jun 3, 2012

Hi

There are years since I used Perl on Windows. Mostly CygWin's Perl, rarely ActivePerl, never Strawberry Perl. So I have no proper idea how and in what circumstances that works.

The code I wrote certainly works on Linux and I am sure it would work perfectly on CygWin. I usually recommend installing it to everybody. It provides the usual GNU tools ( and much more ) compiled into native Windows executables, so they are fast and functional. As CygWin has Bash too, that expands the *.cst,* wildcard so the script receives the list of file names as parameter.

In case you still prefer Strawberry Perl, save this in a file :

Perl:

[gray]#!/usr/bin/perl[/gray]

[b]foreach[/b] [navy]$name[/navy] [teal]([/teal][green][i]<*.cst,*>[/i][/green][teal])[/teal] [teal]{[/teal]
  [b]open[/b] FIL[teal],[/teal][green][i]"<$name"[/i][/green][teal];[/teal]
  [green][i]<FIL>[/i][/green][teal];[/teal]
  [b]while[/b] [teal]([/teal][b]chomp[/b][teal]([/teal][navy]$line[/navy][teal]=[/teal][green][i]<FIL>[/i][/green][teal]))[/teal] [teal]{[/teal]
    [b]next[/b] [b]unless[/b] [navy]$line[/navy][teal];[/teal]
    [navy]@field[/navy][teal]=[/teal][b]split[/b] [green][i]/\s+/[/i][/green][teal],[/teal][navy]$line[/navy][teal];[/teal]
    [teal]([/teal][navy]$date[/navy][teal]=[/teal][navy]$field[/navy][teal][[/teal][purple]19[/purple][teal]])=~[/teal][b]s[/b][fuchsia],(../..)/(....),$2/$1,[/fuchsia][teal];[/teal]
    [b]push[/b] [navy]@data[/navy][teal],[[/teal][navy]@field[/navy][teal],[/teal][navy]$date[/navy][teal]];[/teal]
  [teal]}[/teal]
  [b]close[/b] FIL[teal];[/teal]
[teal]}[/teal]

[navy]@data[/navy][teal]=[/teal][b]sort[/b] [teal]{[/teal] [navy]$a[/navy][teal]->[-[/teal][purple]1[/purple][teal]][/teal] [b]cmp[/b] [navy]$b[/navy][teal]->[-[/teal][purple]1[/purple][teal]][/teal] [teal]||[/teal] [navy]$a[/navy][teal]->[[/teal][purple]15[/purple][teal]][/teal] [b]cmp[/b] [navy]$b[/navy][teal]->[[/teal][purple]15[/purple][teal]][/teal] [teal]||[/teal] [navy]$a[/navy][teal]->[[/teal][purple]0[/purple][teal]][/teal] [green][i]<=>[/i][/green] [navy]$b[/navy][teal]->[[/teal][purple]0[/purple][teal]][/teal] [teal]}[/teal] [navy]@data[/navy][teal];[/teal]

[b]open[/b] FIL[teal],[/teal][green][i]">ALL_CST.cst"[/i][/green][teal];[/teal]
[b]foreach[/b] [navy]$one[/navy] [teal]([/teal][navy]@data[/navy][teal])[/teal] [teal]{[/teal]
  [b]print[/b] FIL [b]join[/b][teal]([/teal][green][i]"\t"[/i][/green][teal],[/teal][b]splice[/b] @[teal]{[/teal][navy]$one[/navy][teal]}[/teal][teal],[/teal][purple]0[/purple][teal],-[/teal][purple]1[/purple][teal]),[/teal][green][i]"\n"[/i][/green][teal];[/teal]
[teal]}[/teal]
[b]close[/b] FIL[teal];[/teal]

Then run it from the directory where the .cst files are. Needs no parameter.

Feherke.
[link feherke.github.com/]

http://feherke.github.com/

[/url]

prex1 · Jun 4, 2012

cdj02, you didn't specify the typical size of your actual log files; if they were as small as the examples, then even thousands of them wouldn't be difficult to process.
Also you didn't describe the exact sorting procedure: the field site appears to be a dot separated list of two integers, so how to sort it? And in sorting by 'Date, Target_Name, and site', you mean in that order or first by site, then by Target_Name and finally by Date? Or what else? And in which order (asc or desc)?
This is a sketch of the (untested) code I would use for this task, disregarding the remark above concerning site.

Perl:

my($header,@lines,$site,$target,$date,@sites,@targets,@dates,@days);
for $name(<*.cst,*>){
  open FIL,"<$name";
  <FIL>;
  $header=$_ unless $header;
  while(<FIL>){
    push @lines,$_;
    ($site,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,undef,$target,undef,undef,undef,$date)=split;
    push @sites,$site;
    push @targets,$target;
    @days=split/\D/,$date;
    push @dates,join('',$days[2],$days[1],$days[0],$days[3],$days[4],$days[5]);
  }
  close FIL;
}

open FIL,">ALL_CST.cst";
for(sort{$sites[$a] cmp $sites[$b] || $targets[$a] cmp $targets[$b] || $dates[$a] cmp $dates[$b]}0..$#lines){
  print FIL $lines[$_];
}
close FIL;

Franco

http://www.xcalcs.com

: Online engineering calculations

http://www.megamag.it

: Magnetic brakes for fun rides

http://www.levitans.com

: Air bearing pads

cdj02 · Jun 5, 2012

Prex1 - The example sizes are the typical file sizes I'm working with and as for sorting by site, the site represents XY coordinates of the measurements. I was thinking I'd remove the comma to make it easier to sort. Ultimately, I only need to match up the measurements from file to file so I can chart them.

Feherke - I haven't tried CygWin but I'll give it a try today.

Thanks again for the help!

cdj02 · Jun 5, 2012

prex1 - your code works perfectly and sorts exactly the way I wanted. The only odd thing it does is that it adds a blank line at the top of the file for each .cst file input. I ran my test with 10 files and it put 10 blank lines at the top.

prex1 · Jun 5, 2012

There's a blank line at the end of your files [mad]

.
Simply add a line after the while (as feherke did [blush]

):

Perl:

  while(<FIL>){
    [COLOR=red]next unless $_;[/color]
    ...

If you want the header on the sorted file, just add the line (that I forgot) after the open:

Perl:

open FIL,">ALL_CST.cst";
[COLOR=red]print FIL $header;[/color]
...

Franco

http://www.xcalcs.com

: Online engineering calculations

http://www.megamag.it

: Magnetic brakes for fun rides

http://www.levitans.com

: Air bearing pads

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Joining and sorting multiple text files 2

cdj02

Technical User

feherke

Programmer

cdj02

Technical User

feherke

Programmer

cdj02

Technical User

cdj02

Technical User

feherke

Programmer

prex1

Programmer

cdj02

Technical User

cdj02

Technical User

prex1

Programmer

Similar threads

Part and Inventory Search

Sponsor