
Irregular Seg Fault?!

Status
Not open for further replies.

nokona13

Programmer
Dec 7, 2003
I'm getting a segmentation fault at variable times in my program. I've had this problem before with other basic formatting scripts on similar files. Basically this program is just reading in a giant text file: ~300 lines, each with 3000-5000 numbers separated by white space. This little script is just supposed to make sure that none of the lines have identical, non-zero data after the first six "id" fields. It usually crashes near the end of the first loop (comparing person 0 to persons 1-299), but it often makes it to the early part of the second loop and sometimes further. I was getting a warning about using undefined variables when splitting the @file array lines, since the last line was blank. I thought maybe I was overflowing some sort of warning buffer, since I fixed that and got one file to run all the way through, but other files still aren't working!

-------------------------
use diagnostics;

open (IN, "$ARGV[0]");


@file = <IN>;

$waste = pop(@file); # was getting warning about undef last line
$waste = 0;

$len = @file;

$numIdent = 0;
for ($i = 0; $i < $len - 1; $i++) {
print "Comparing person $i of $len\n";
for ($j = $i + 1; $j < $len; $j++) {
$t1 = $file[$i];
$t2 = $file[$j];

@line1 = split(/\s+/, $t1);
@line2 = split(/\s+/, $t2);
$len1 = @line1;
$len2 = @line2;
if ($len1 != $len2) {
$message = "*** ABORT ***\nPersons " . $line1[0] . " " .
$line1[1] . " and " . $line2[0] . " " . $line2[1] .
" have different " . "number of markers";
die "$message\n";
}
$ident = identical($t1, $t2);
if ($ident == 1) {
$message = "Persons " . $line1[0] . " " . $line1[1] . " and " .
$line2[0] . " " . $line2[1] . " are identical and non-zero";
print "$message\n";
$numIdent++;
}

# @line1 = ();
# @line2 = ();
}
}

print "\n$numIdent total pairs of non-zero persons were identical\n";

close (IN);
-------------------------

GREATLY appreciate any advice! Thanks!

Matt
 
This little script is just supposed to make sure that none of the lines have identical, non-zero data after the first six "id" fields.
I'm a little confused by this, since it sounds like you only want to check for duplicates within each line, but looking at your code, it looks like you're checking for duplicates within the whole file.

Here's my advice:
0. Use strict and warnings and declare your variables with my.
1. Declare a hash to use as a counter.
2. Read the file line by line using while instead of slurping it into an array.
3. Chomp each line to get rid of the trailing newline.
4. Use a regex to get rid of the first 6 values on each line that you want to ignore.
5. Split the remaining line on whitespace.
6. Use grep to collect non-zero values from the split into an array.
7. Loop over the array and use each element as a key in the counter hash. Increment the value for that hash key by 1.
8. When you're finished reading the file, loop over the counter hash. For any value in the hash that is greater than 1, the key is a duplicate number in the file.
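A rough sketch of the steps above, for illustration only. The sub name count_values and the sample lines are made up, and it works on an in-memory list instead of a filehandle so that it runs on its own:

```perl
use strict;
use warnings;

# Hypothetical helper: tally every non-zero value found after the
# first $ignore whitespace-separated fields of each line.
sub count_values {
    my ($ignore, @lines) = @_;
    my %count;
    for my $line (@lines) {
        chomp(my $data = $line);                  # step 3
        $data =~ s/^(?:\S+\s+){$ignore}//;        # step 4: drop the id fields
        # steps 5-7: split, keep non-zero values, count each one
        $count{$_}++ for grep { $_ != 0 } split ' ', $data;
    }
    return \%count;
}

# Made-up sample data: six id fields, then three data values per line.
my $count = count_values(6,
    "A 1 0 0 1 2 3 0 4\n",
    "B 2 0 0 2 2 3 5 0\n",
);

# step 8: report values seen more than once
for my $value (sort { $a <=> $b } keys %$count) {
    print "$value occurs $count->{$value} times\n" if $count->{$value} > 1;
}
```

Note that this counts individual values across the whole file. It does not compare whole lines with each other.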
HTH


 
Yeah, it's looking for duplicate lines in the whole file. I tried introducing "use strict;" but I got all kinds of package warnings. Do I have to put main:: in front of every single line if I use strict?

A couple questions about your advice:

1) Reading the file line by line (and not saving each as an element in an array), means I have to keep closing and opening the file to compare line 1 to all the following lines, then compare line 2 to all the following lines, no? If I save them all in an array, does that not do the same thing as slurping and then looping through and chomping each line (something I clearly should have thought of doing)?

2) How does grep work on an array like that?

3) Your hash suggestion is really neat, but I don't think it's appropriate for this particular data. The datapoints are all small integers, with different ranges in different files but generally from 1-2 through 1-10 or so. Knowing there's a crapload of 1's isn't informative. I need to know if the exact sequence of 3000 integers is identical on two different lines.

Thanks for your help!!
 
Do I have to put main:: in front of every single line if I use strict?
No, declare your variables with my.
1) Reading the file line by line (and not saving each as an element in an array), means I have to keep closing and opening the file to compare line 1 to all the following lines, then compare line 2 to all the following lines, no?
Clearly at least the values you care about (or some information about them) must be saved. But operating on one line at a time will be much less memory-intensive than slurping a huge file into an array.
2) How does grep work on an array like that?
perldoc -f grep
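In short, grep evaluates a block once per element (with the element in $_) and keeps the elements for which the block is true. A minimal sketch with made-up data:

```perl
use strict;
use warnings;

my @fields = split ' ', "1 0 2 4 0 5";

# List context: grep returns the elements that pass the test.
my @nonzero = grep { $_ != 0 } @fields;

# Scalar context: grep returns how many elements passed.
my $how_many = grep { $_ != 0 } @fields;

print "@nonzero ($how_many values)\n";
```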
Knowing there's a crapload of 1's isn't informative. I need to know if the exact sequence of 3000 integers is identical on two different lines.
Hmm, the plot thickens. That wasn't clear to me from your original post. You say you want to exclude any zero values. Would you consider the following 2 sequences identical, then? With the zeros removed, are they the same?
1 2 3 0 0 4 5 6 0 7 8 9
1 2 3 0 4 5 6 7 0 0 8 9


I'll have to think about this.

 
Sorry this wasn't clear in earlier posts. Two lines are considered identical iff there are ANY non-zero values in the whole string of 3000 integers AND the values at each point in the array are identical.

So:

0 0 0 0 0 0
0 0 0 0 0 0

are not identical, and:

0 0 0 0 0 0
0 0 1 0 0 0

are also not identical, but:

1 0 2 4 0 5
1 0 2 4 0 5

and

0 0 1 0 0 0
0 0 1 0 0 0

are both considered identical...
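The rule can be checked against those four examples with a small sketch. The sub name nonzero_identical is made up here, and it assumes the six id fields have already been stripped so that only the data values remain:

```perl
use strict;
use warnings;

# Hypothetical helper: true only if both lines hold the same value at
# every position AND at least one of those values is non-zero.
sub nonzero_identical {
    my ($x, $y) = @_;
    my @a = split ' ', $x;
    my @b = split ' ', $y;
    return 0 unless @a == @b;          # different lengths can never match
    my $nonzero = 0;
    for my $i (0 .. $#a) {
        return 0 if $a[$i] != $b[$i];  # first mismatch settles it
        $nonzero = 1 if $a[$i] != 0;
    }
    return $nonzero;
}

my @verdicts = (
    nonzero_identical("0 0 0 0 0 0", "0 0 0 0 0 0"),
    nonzero_identical("0 0 0 0 0 0", "0 0 1 0 0 0"),
    nonzero_identical("1 0 2 4 0 5", "1 0 2 4 0 5"),
    nonzero_identical("0 0 1 0 0 0", "0 0 1 0 0 0"),
);
print "@verdicts\n";
```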

The thing is, I have a little function (identical(...)) that I'm already calling in the code above that just loops from 6 to the end of the arrays (after making sure they're the same size), comparing each value, and returning 1 if they're non-zero identical and 0 otherwise. I've done enough print statement debugging to be pretty sure it's working correctly. My problem is that this is all causing seg faults still. Here's my latest code:


#!/usr/bin/perl

use diagnostics;
use warnings;
use strict;

my $fn = $ARGV[0];
open (IN, "$fn");
my @file = <IN>;
close (IN);

my $waste = pop(@file);
$waste = 0;

my $len = @file;

for (my $k = 0; $k < $len; $k++) {
chomp ($file[$k]);
}


my $numIdent = 0;
for (my $i = 0; $i < $len - 1; $i++) {
print "Comparing person $i of $len\n";
for (my $j = $i + 1; $j < $len; $j++) {
print "In loop $i : $j\n";
my $t1 = $file[$i];
my $t2 = $file[$j];

my $ident = identical($t1, $t2);
if ($ident == 1) {
my $message = "Persons on lines " . ($i+1) . " and " . ($j+1) .
" are identical and non-zero";
print "$message\n";
$numIdent++;
}
}
}

print "\n$numIdent total pairs of non-zero persons were identical\n";



sub identical {

my $t1 = $_[0];
my $t2 = $_[1];


if ($t1 eq "") {
die "First argument to identical(...) is blank\n";
}
if ($t2 eq "") {
die "Second argument to identical(...) is blank\n";
}

my @a1 = split (/\s+/, $t1);
my @a2 = split (/\s+/, $t2);

my $len1 = @a1;
my $len2 = @a2;

if ($len1 != $len2) {
my $message = "*** ABORT ***\nPersons " . $a1[0] . " " .
$a1[1] . " and " . $a2[0] . " " . $a2[1] .
" have different " . "number of markers";
die "$message\n";
}
my $nonzero = 0;
for (my $j = 6; $j < $len1; $j++) {
if ($a1[$j] != $a2[$j]) {
return 0;
}
elsif (($a1[$j] != 0) || ($a2[$j] != 0)) {
$nonzero = 1;
}
}
if ($nonzero) {
return 1;
}
else {
return 0;
}
}
 
To conserve memory, you might want to try using the Tie::File module. It will let you treat the file as an array without actually reading the whole file into memory at once.
You should be able to use the following in your existing program without making any other changes. (after installing Tie::File of course)

Code:
use Fcntl 'O_RDWR';
use Tie::File;

and replace the line my @file = <IN>; with:
Code:
tie my @file, "Tie::File", $fn, autochomp => 0, mode => O_RDWR;
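
A self-contained sketch of the same idea, for anyone who wants to try it without touching a real data file. The temporary file and its contents are made up, and it opens the file read-only with O_RDONLY (an assumption here, chosen because the script never writes to the file) and leaves autochomp at its default:

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use File::Temp 'tempfile';
use Tie::File;

# Write a small made-up data file to work on.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "id1 1 0 2\nid2 0 0 0\nid3 1 0 2\n";
close $fh or die "Cannot close $path: $!";

# Each array element is one line of the file, fetched on demand.
tie my @rows, 'Tie::File', $path, mode => O_RDONLY
    or die "Cannot tie $path: $!";

my $n      = @rows;      # number of lines in the file
my $second = $rows[1];   # second line, newline already removed
print "$n lines; second line is '$second'\n";

untie @rows;
```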
 
Two lines are considered identical iff there are ANY non-zero values in the whole string of 3000 integers AND the values at each point in the array are identical.
At each point in the array before or after the removal of zero values? If it's after, then
1 2 3 0 0 4 5 6 0 7 8 9
1 2 3 0 4 5 6 7 0 0 8 9


are identical, no? But if it's before, they're not, since
the numbers 4-7 don't occur in the same places in the original sequences.

 
Sorry again, Mike.

1 2 3 0 0 4 5 6 0 7 8 9
1 2 3 0 4 5 6 7 0 0 8 9

Are not identical. Each point is a reference to a distinct place in a large vector. 0's don't shift out or anything.

Chaz, thanks for the new trick. I've wondered if there was an easy way to get array-style access to a file without reading a giant file into memory. Unfortunately I'm still getting a seg fault after about 250 to 500 comparisons (regardless of the number of lines in the individual file).
 
While I was waiting for your answer, I came up with this, which lets you specify whether or not to keep the zeros. On the command line, enter your input filename and a 1 or 0 to keep or not keep the zeros. (Based on your last post, you want to keep them, so enter a 1.) Of course I haven't tested it with a file as huge as what you're using, but it seems to give correct results with smaller files.
Code:
#!perl
use strict;
use warnings;

my ($filename, $keepzeros) = (shift, shift);
unless (defined($filename) && defined($keepzeros)) {
    die qq(You must specify filename and whether to keep zeros (1/0).\n);
}
chomp($filename, $keepzeros);
open(FH, $filename) || die qq(Can't open "$filename" for read.\n);

my @arr;
my $ignoren = 6;  #non-blank fields to ignore at start of lines

while (<FH>) {
    chomp;

    # get rid of first $ignoren non-blank fields
    s/^(\S+\s+){$ignoren}//;

    # remove zeros if not wanted
    if (!$keepzeros) {
        1 while s/\b0+\s+//g;
    }

    # skip line if blank
    /^\s*$/ && next;

    # replace multiple whitespace with single space
    s/\s\s+/ /g;

    push @arr, $_;
}
close(FH) || die qq(Can't close "$filename" after read.\n);

my @dups;
for (my $i=0; $i<@arr-1; $i++) {
    for (my $j=$i+1; $j<@arr; $j++) {
        if (identical($arr[$i], $arr[$j])) {
            push @dups, [ ($i, $j) ];
        }
    }
}

unless (@dups) {
    print "No duplicate lines were found.\n";
} else {
    for (@dups) {
        print "line $_->[0] duplicates line $_->[1]\n";
    }
}

sub identical {
    my ($x, $y) = @_;
    $x eq $y;
}
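
Since identical is just a string comparison, the double loop could also be swapped for a hash lookup. This is a different technique from the script above, sketched under a few assumptions: the sub name find_duplicates and the sample lines are made up, all-zero lines are skipped, and each later duplicate is reported only against the first line with that content (not every pair):

```perl
use strict;
use warnings;

# Hypothetical helper: returns [first_index, later_index] pairs for
# lines whose data (after $ignore id fields) is identical and non-zero.
sub find_duplicates {
    my ($ignore, @lines) = @_;
    my (%first_seen, @dups);
    for my $n (0 .. $#lines) {
        chomp(my $key = $lines[$n]);
        $key =~ s/^(?:\S+\s+){$ignore}//;     # drop the id fields
        next unless $key =~ /[1-9]/;          # skip blank and all-zero lines
        $key = join ' ', split ' ', $key;     # normalise whitespace
        if (exists $first_seen{$key}) {
            push @dups, [ $first_seen{$key}, $n ];
        } else {
            $first_seen{$key} = $n;
        }
    }
    return @dups;
}

my @dups = find_duplicates(6,
    "a b c d e f 1 0 2\n",
    "a b c d e g 0 0 0\n",
    "x b c d e f 1  0 2\n",
    "y b c d e g 0 0 0\n",
);
print "line $_->[0] duplicates line $_->[1]\n" for @dups;
```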
HTH


 
Mike, you are the MAN! Worked perfectly!

I tend to think too much like a C programmer in perl, and I forget how incredibly easy (and fast) it is to just manipulate whole lines like that to set up really easy, non-looping comparisons. Not only does this work, but it's almost instantaneous, whereas on the one file my program worked on, it took like 8 minutes.

Thanks again, I REALLY appreciate it!!
 
Oops...

If I still have your attention, I need it to ignore identical lines that are all zeros. I tried adding this to sub identical:

if ($x =~ /^(0\s)+$/) {
return 0;
}
else {
return ($x eq $y);
}

And that causes a seg fault. The lines can't be too big for regex stuff since it's done in the top of the program. Any ideas why this would cause it to start segfaulting?
 
Okay, so I solved this problem, but if you have any insights into why the above caused a seg fault, I'd still be interested to learn.

I solved the problem by putting this into the while (<FH>) loop.

# skip lines with only zeros
/^(0\s)+$/ && next;

Does the quadratic number of regex statements executed when you put it inside a double loop cause some sort of overflow with perl?
 
I solved the problem by putting this into the while (<FH>) loop.

# skip lines with only zeros
/^(0\s)+$/ && next;
I think I'd have gone with
Code:
!/[1-9]/ && next;
What if your numbers are zero-filled?
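
To see where the two tests differ, here is a small comparison on made-up lines. The sub names are hypothetical; the patterns are the two from this thread:

```perl
use strict;
use warnings;

# The pattern from the earlier post: every field must be a single 0
# followed by one whitespace character.
sub all_zero_strict { my ($line) = @_; return $line =~ /^(0\s)+$/ ? 1 : 0 }

# The suggested alternative: no digit from 1 to 9 anywhere in the line.
sub all_zero_loose  { my ($line) = @_; return $line !~ /[1-9]/ ? 1 : 0 }

for my $line ("0 0 0 ", "0 0 0", "00 000 0", "0 0 1 0") {
    printf "%-10s strict=%d loose=%d\n", "'$line'",
        all_zero_strict($line), all_zero_loose($line);
}
```

The strict pattern only matches when the last 0 is followed by whitespace, and it misses zero-filled fields such as 00. The loose test treats all of the first three lines as all-zero.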

No insight into your seg fault. I'm not sure I've ever seen a seg fault with Perl, and I've been using it for about 7 years. (If I have, it was so long ago and so infrequent that I don't remember it.) Does this happen when you run other Perl programs, too?

I didn't try to reproduce the seg fault, and I confess I didn't read your code carefully, as I knew the problem could be solved with a simpler approach.

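P. S. You can compare 2 arrays in Perl using eq, as long as you interpolate them into strings first. (A bare @$x eq @$y puts both arrays in scalar context and only compares their element counts.) I was originally working on an array-driven approach rather than a string-driven one, and identical wouldn't have been much different at all, just that the args would have been 2 array refs instead of 2 strings, e.g.
Code:
sub identical {
    # Array version: "@$x" joins the elements with spaces
    my ($x, $y) = @_;
    "@$x" eq "@$y";
}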
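
A quick check of how eq behaves with arrays, using made-up values. An array used directly as an operand of eq is evaluated in scalar context (its element count), while an array interpolated in double quotes becomes its elements joined by spaces:

```perl
use strict;
use warnings;

my @p = (1, 0, 2);
my @q = (3, 4, 5);

# Scalar context: compares "3" with "3", the two element counts.
my $same_length = (@p eq @q) ? 1 : 0;

# Interpolated: compares "1 0 2" with "3 4 5", the actual contents.
my $same_content = ("@p" eq "@q") ? 1 : 0;

print "same length: $same_length, same content: $same_content\n";
```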
The string-driven approach was simpler, though, so I went with that. Glad it worked out. Thanks for the star.

 