Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Count pipes in delimited file (embedded CRLF's)

Status
Not open for further replies.

bigbalbossa

Programmer
Mar 21, 2002
87
US
Alright folks...i didn't think this would be too difficult, but it's turning into quite a chore.

I have 7GB pipe delimited file that seems to have line feeds in the wrong place. There should be 18 pipes per line, however, when i run a script that does a count i get:

There are 1 line(s) with 12 pipes
There are 24 line(s) with 2 pipes
There are 6 line(s) with 20 pipes
There are 3 line(s) with 14 pipes
There are 4 line(s) with 4 pipes
There are 45873912 line(s) with 18 pipes
There are 2 line(s) with 0 pipes
There are 23 line(s) with 16 pipes
There are 126 line(s) with 19 pipes
There are 1 line(s) with 13 pipes
There are 1 line(s) with 5 pipes

Is there a way to remove all CRLF's, count 18 pipes, then add a CRLF?

Any ideas are much appreciated.
 
You need to look at some of the lines that have less or more pipes than you want to see exactly what is happening. Just swapping linefeeds for pipes and vice versa in an attempt to clean the data is probably a very bad idea.
However, you asked for a script to do the job so here goes:
Code:
#!/usr/bin/perl -w
use strict;
$|=1;
my $record    = "";
my $pipecount = 18;
while(<>) {
  chomp;
  if($record) {
    $record .= "|$_";
  } else {
    $record = $_;
  }
  my $count = $record =~ tr/|/|/;
  if($count == $pipecount) {
    print "$record\n";
    $record = "";
    next;
  }
  next if($count < $pipecount);
  $record =~ s/^((?:[^|]+\|){$pipecount}[^|]+)\|//o;
  print "$1\n";
}
print "$record\n" if($record);
Be careful not to lose your original data, keep a copy at all times in case this produces output that you do not like.



Trojan.
 
How about something like this?
Code:
my $delims_per_line = 5;

{local $/ = "\|";
 while (!eof DATA) {
    my @temp;
    foreach (1..$delims_per_line) {
        last if eof DATA;
        push(@temp, scalar <DATA>);
    }
    map {s/\n//g} @temp;
    print @temp, "\n";
 }
}

__DATA__
one-one|one-two|one-three|
one-four
|one-five|two-one|
two-two|two-three|two-four|
two-five|
 
How about this? Obviously i am aware i am posting an awk solution! ... just seems very apt in this situation:-

Code:
awk -F\| 'NF % 4 != 0 { print }' 7_GIG.txt

This will allow you, from the command line, to display any rogue lines - as Trojan very wisely suggests


Kind Regards
Duncan
 
also, i should add, awk will (should) just eat its way through your file - without altering it in any way - and have no problem with the large file size.


Kind Regards
Duncan
 
Duncan, can you do a walk-through of your awk command? I like the idea.

Thanks for the tips everybody.

 
Hi bigbalbossa

The awk example i gave above will work - all you need to do is change the name of 7_GIG.txt and off you go (assuming you are in the correct directory)

Or do you want me to explain how it works?


Kind Regards
Duncan
 
Sorry - i am being a bit of a moron

Code:
 awk -F\| 'NF % [b][red]18[/red][/b] != 0 { print }' 7_GIG.txt

This should work - i.e. i have changed it to look for the non-existence of 18 columns - created by the pipes


Kind Regards
Duncan
 
Code:
awk
-F\| [blue]set the input field separator to a pipe (requires escaping)[/blue]
'    [blue]opening single quote - required[/blue]
NF   [blue]number of fields[/blue]
% 18   [blue]modulus 18 - i.e. divide by 18[/blue]
!= 0   [blue]does [b]the calculation above[/b] NOT equal zero?[/blue]
{ print }   [blue]only print if the calculation returned true - i.e. there are NOT 18 fields[/blue]
'    [blue]closing single quote - required[/blue]

modulus:-
if any calculation of the number of fields divided by 18 returns a remainder, then the condition returns true - as it is a negative condition

i.e. if there were 19 pipes - this calculation would return true

20 % 18 != 0


Kind Regards
Duncan
 
Duncs,
Why not just test for "== 18"?
By using "% 18" it would allow a line with 36 fields to be considered acceptable or worse a line with zero records.
Just a thought!
;-)


Trojan.
 
This is good guys, but if a line has no pipes, how can i get rid of them?
 
Trojan/Duncan,

Thanks for the help...your awk solution did the trick. This was the final code:

awk -F\| 'NF % 18 == 1 { print }' head_original.txt > temp1

This gave me all records that didn't have 18 pipes, however, it still kept lines that had no pipes...running this took care of the issue:

awk -F\| 'NF != 1 { print }' head_original.txt > temp1

All is well...thanks again guys.
 
[tt]awk -F\| 'NF % 18 != 0 { print }' 7_GIG.txt[/tt]

Code:
awk -F\| 'NF != 18' 7_GIG.txt

However, the o.p. said "18 pipes per line". That would make 19 fields.
 
Ducan -> 1 star
Trojan -> 1 star

Thanks guys! I've got to work w/ AWK more.
 
Hi Scanner18

Thank you for your point - you are quite right to highlight this

However, with this script, that is in total about 25 characters in length, i don't think it would take a rocket scientist to tweak it to work. It is pretty evident that the 18 bit is the bit to change...

I wasn't really trying to provide the answer in this instance - just the method. I would imagine that it is well within the capability of everyone in this forum to be able to take a cursory glance at a very basic awk script and alter it accordingly


Kind Regards
Duncan
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top