Irregular Seg Fault?!

nokona13 · Mar 28, 2005

I'm getting a segmentation fault at variable times in my program. I've had this problem before, on other basic formatting scripts on similar files. Basically this program is just reading in a giant text file, ~300 lines with 3000-5000 numbers on each line separated by white space. This little script is just supposed to make sure that none of the lines have identical, non-zero data after the first six "id" fields. It usually crashes near the end of the first loop (comparing person 0 to persons 1-299), but often makes it to the early part of the second loop and sometimes further. I was getting a warning about using undefined variables when splitting the @file array lines, since the last line was blank. I thought maybe I was overflowing some sort of warning buffer or something, since I fixed that and got one file to run all the way through, but other files still aren't working!

-------------------------
use diagnostics;

open (IN, "$ARGV[0]");

@file = <IN>;

$waste = pop(@file); # was getting warning about undef last line
$waste = 0;

$len = @file;

$numIdent = 0;
for ($i = 0; $i < $len - 1; $i++) {
    print "Comparing person $i of $len\n";
    for ($j = $i + 1; $j < $len; $j++) {
        $t1 = $file[$i];
        $t2 = $file[$j];

        @line1 = split(/\s+/, $t1);
        @line2 = split(/\s+/, $t2);
        $len1 = @line1;
        $len2 = @line2;
        if ($len1 != $len2) {
            $message = "*** ABORT ***\nPersons " . $line1[0] . " " .
                $line1[1] . " and " . $line2[0] . " " . $line2[1] .
                " have different " . "number of markers";
            die "$message\n";
        }
        $ident = identical($t1, $t2);
        if ($ident == 1) {
            $message = "Persons " . $line1[0] . " " . $line1[1] . " and " .
                $line2[0] . " " . $line2[1] . " are identical and non-zero";
            print "$message\n";
            $numIdent++;
        }

        # @line1 = ();
        # @line2 = ();
    }
}

print "\n$numIdent total pairs of non-zero persons were identical\n";

close (IN);
-------------------------

GREATLY appreciate any advice! Thanks!

Matt

mikevh · Mar 28, 2005

This little script is just supposed to make sure that none of the lines have identical, non-zero data after the first six "id" fields.

I'm a little confused by this, since it sounds like you only want to check for duplicates within each line, but looking at your code, it looks like you're checking for duplicates within the whole file.

Here's my advice:

Code:

0. Use strict and warnings and declare your variables with my.
1. Declare a hash to use as a counter.
2. Read the file line by line using while instead of slurping it into an array.
3. Chomp each line to get rid of the trailing newline.
4. Use a regex to get rid of the first 6 values on each line that you want to ignore.
5. Split the remaining line on whitespace.
6. Use grep to collect non-zero values from the split into an array.
7. Loop over the array and use each element as a key in the counter hash.  Increment the value for that hash key by 1.
8. When you're finished reading the file, loop over the counter hash.  For any value in the hash that is greater than 1, the key is a duplicate number in the file.
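
The steps above could be sketched roughly like this. It is only an illustration under a few assumptions: the two sample lines stand in for the real file (so a for loop replaces the while over a filehandle), and the names %count, $ignore and @sample are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %count;        # step 1: counter hash (name is illustrative)
my $ignore = 6;   # number of leading id fields to drop

# Made-up sample data; each line starts with 6 id fields.
my @sample = (
    "f1 p1 0 0 1 1 3 0 7 7\n",
    "f2 p2 0 0 2 1 0 4 9 3\n",
);

for my $line (@sample) {                    # step 2: one line at a time
    chomp $line;                            # step 3: drop the newline
    $line =~ s/^(?:\S+\s+){$ignore}//;      # step 4: strip the id fields
    my @vals    = split ' ', $line;         # step 5: split on whitespace
    my @nonzero = grep { $_ != 0 } @vals;   # step 6: keep non-zero values
    $count{$_}++ for @nonzero;              # step 7: count each value
}

# step 8: any value counted more than once is a duplicate
my @dups = sort { $a <=> $b } grep { $count{$_} > 1 } keys %count;
print "duplicates: @dups\n";
```

Note that this counts individual values across the whole file, which is what the steps describe; it does not compare whole lines against each other.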

HTH

nokona13 · Mar 28, 2005

Yeah, it's looking for duplicate lines in the whole file. I tried introducing "use strict;" but I got all kinds of package warnings. Do I have to put main:: in front of every single line if I use strict?

A couple questions about your advice:

1) Reading the file line by line (and not saving each as an element in an array), means I have to keep closing and opening the file to compare line 1 to all the following lines, then compare line 2 to all the following lines, no? If I save them all in an array, does that not do the same thing as slurping and then looping through and chomping each line (something I clearly should have thought of doing)?

2) How does grep work on an array like that?

3) Your hash suggestion is really neat, but I don't think appropriate for the particular data. The datapoints are all small integers, with different ranges in different files but generally from 1-2 through 1-10 or so. Knowing there's a crapload of 1's isn't informative. I need to know if the exact sequence of 3000 integers is identical on two different lines.

Thanks for your help!!

mikevh · Mar 28, 2005

Do I have to put main:: in front of every single line if I use strict?

No, declare your variables with my.
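
A minimal sketch of what that looks like, with made-up data and variable names: under use strict every variable needs a declaration, and my supplies it without any package prefix.

```perl
use strict;
use warnings;

my @file = ("a b c\n", "d e f\n");   # lexical array, no main:: needed
my $len  = @file;                    # array in scalar context gives its size

for (my $i = 0; $i < $len; $i++) {   # loop counters get "my" too
    chomp $file[$i];
}
print "$len lines, first is '$file[0]'\n";
```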

1) Reading the file line by line (and not saving each as an element in an array), means I have to keep closing and opening the file to compare line 1 to all the following lines, then compare line 2 to all the following lines, no?

Clearly at least the values you care about (or some information about them) must be saved. But operating on one line at a time will be much less memory-intensive than slurping a huge file into an array.

2) How does grep work on an array like that?

perldoc -f grep
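
As a quick illustration (the sample string is made up): grep evaluates its block once per element with $_ set to that element. In list context it returns the elements for which the block is true; in scalar context it returns how many matched.

```perl
use strict;
use warnings;

my @fields  = split ' ', "1 0 2 4 0 5";
my @nonzero = grep { $_ != 0 } @fields;   # list context: matching elements
my $n_zero  = grep { $_ == 0 } @fields;   # scalar context: number of matches

print "nonzero: @nonzero\n";
print "zeros:   $n_zero\n";
```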

Knowing there's a crapload of 1's isn't informative. I need to know if the exact sequence of 3000 integers is identical on two different lines.

Hmm, the plot thickens. That wasn't clear to me from your original post. You say you want to exclude any zero values. Would you consider the following 2 sequences identical, then? With the zeros removed, are they the same?
1 2 3 0 0 4 5 6 0 7 8 9
1 2 3 0 4 5 6 7 0 0 8 9

I'll have to think about this. [ponder]

nokona13 · Mar 28, 2005

Sorry this wasn't clear in earlier posts. Two lines are considered identical iff there are ANY non-zero values in the whole string of 3000 integers AND the values at each point in the array are identical.

So:

0 0 0 0 0 0
0 0 0 0 0 0

are not identical, and:

0 0 0 0 0 0
0 0 1 0 0 0

are also not identical, but:

1 0 2 4 0 5
1 0 2 4 0 5

and

0 0 1 0 0 0
0 0 1 0 0 0

are both considered identical...

The thing is, I have a little function (identical(...)) that I'm already calling in the code above that just loops from 6 to the end of the arrays (after making sure they're the same size), comparing each value, and returning 1 if they're non-zero identical and 0 otherwise. I've done enough print statement debugging to be pretty sure it's working correctly. My problem is that this is all causing seg faults still. Here's my latest code:

#!/usr/bin/perl

use diagnostics;
use warnings;
use strict;

my $fn = $ARGV[0];
open (IN, "$fn");
my @file = <IN>;
close (IN);

my $waste = pop(@file);
$waste = 0;

my $len = @file;

for (my $k = 0; $k < $len; $k++) {
    chomp ($file[$k]);
}

my $numIdent = 0;
for (my $i = 0; $i < $len - 1; $i++) {
    print "Comparing person $i of $len\n";
    for (my $j = $i + 1; $j < $len; $j++) {
        print "In loop $i : $j\n";
        my $t1 = $file[$i];
        my $t2 = $file[$j];

        my $ident = identical($t1, $t2);
        if ($ident == 1) {
            # parentheses needed: "." and "+" have the same precedence
            my $message = "Persons on lines " . ($i+1) . " and " . ($j+1) .
                " are identical and non-zero";
            print "$message\n";
            $numIdent++;
        }
    }
}

print "\n$numIdent total pairs of non-zero persons were identical\n";

sub identical {

    my $t1 = $_[0];
    my $t2 = $_[1];

    if ($t1 eq "") {
        die "First argument to identical(...) is blank\n";
    }
    if ($t2 eq "") {
        die "Second argument to identical(...) is blank\n";
    }

    my @a1 = split (/\s+/, $t1);
    my @a2 = split (/\s+/, $t2);

    my $len1 = @a1;
    my $len2 = @a2;

    if ($len1 != $len2) {
        my $message = "*** ABORT ***\nPersons " . $a1[0] . " " .
            $a1[1] . " and " . $a2[0] . " " . $a2[1] .
            " have different " . "number of markers";
        die "$message\n";
    }
    my $nonzero = 0;
    for (my $j = 6; $j < $len1; $j++) {
        if ($a1[$j] != $a2[$j]) {
            return 0;
        }
        elsif (($a1[$j] != 0) || ($a2[$j] != 0)) {
            $nonzero = 1;
        }
    }
    if ($nonzero) {
        return 1;
    }
    else {
        return 0;
    }
}

chazoid · Mar 28, 2005

To conserve memory, you might want to try using the Tie::File module. It will let you treat the file as an array without actually reading the whole file into memory at once.
You should be able to use the following in your existing program without making any other changes. (after installing Tie::File of course)

Code:

use Fcntl 'O_RDWR';
use Tie::File;

and replace the line my @file = <IN>; with:

Code:

tie my @file, "Tie::File", $fn, autochomp => 0, mode => O_RDWR;

mikevh · Mar 28, 2005

Two lines are considered identical iff there are ANY non-zero values in the whole string of 3000 integers AND the values at each point in the array are identical.

At each point in the array before or after the removal of zero values? If it's after, then
1 2 3 0 0 4 5 6 0 7 8 9
1 2 3 0 4 5 6 7 0 0 8 9

are identical, no? But if it's before, they're not, since
the numbers 4-7 don't occur in the same places in the original sequences.

nokona13 · Mar 28, 2005

Sorry again mike.

1 2 3 0 0 4 5 6 0 7 8 9
1 2 3 0 4 5 6 7 0 0 8 9

Are not identical. Each point is a reference to a distinct place in a large vector. 0's don't shift out or anything.

Chaz, thanks for the new trick. I've wondered if there was an easy way to get array-style access to a file without reading a giant file into memory. Unfortunately I'm still getting a seg fault after about 250 to 500 comparisons (regardless of the number of lines in the individual file).

mikevh · Mar 28, 2005

While I was waiting for your answer, I came up with this, which lets you specify whether or not to keep the zeros. On the command line, enter your input filename and a 1 or 0 to keep or not keep the zeros. (Based on your last post, you want to keep them, so enter a 1.) Of course I haven't tested it with a file as huge as what you're using, but it seems to give correct results with smaller files.

Code:

#!perl
use strict;
use warnings;

my ($filename, $keepzeros) = (shift, shift);
unless (defined($filename) && defined($keepzeros)) {
    die qq(You must specify filename and whether to keep zeros (1/0).\n);
}
chomp($filename, $keepzeros);
open(FH, $filename) || die qq(Can't open "$filename" for read.\n);

my @arr;
my $ignoren = 6;  #non-blank fields to ignore at start of lines

while (<FH>) {
    chomp;

    # get rid of first $ignoren non-blank fields
    s/^(\S+\s+){$ignoren}//;

    # remove zeros if not wanted
    if (!$keepzeros) {
        1 while s/\b0+\s+//g;
    }

    # skip line if blank
    /^\s*$/ && next;

    # replace multiple whitespace with single space
    s/\s\s+/ /g;

    push @arr, $_;
}
close(FH) || die qq(Can't close "$filename" after read.\n);

my @dups;
for (my $i=0; $i<@arr-1; $i++) {
    for (my $j=$i+1; $j<@arr; $j++) {
        if (identical($arr[$i], $arr[$j])) {
            push @dups, [ ($i, $j) ];
        }
    }
}

unless (@dups) {
    print "No duplicate lines were found.\n";
} else {
    for (@dups) {
        print "line $_->[0] duplicates line $_->[1]\n";
    }
}

sub identical {
    my ($x, $y) = @_;
    $x eq $y;
}
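
For comparison, here is a variation on the same idea; it is not part of the script above. Once each line has been reduced to a normalized string, a hash keyed on that string can find repeats in a single pass instead of comparing every pair. The sample @arr contents and the names %first_seen and @dups are assumptions made for this sketch, and it also skips all-zero lines, which the script above does not do.

```perl
use strict;
use warnings;

# Hypothetical input: lines already stripped of id fields and normalized.
my @arr = (
    "1 0 2 4 0 5",
    "0 0 1 0 0 0",
    "1 0 2 4 0 5",
    "0 0 0 0 0 0",
    "0 0 0 0 0 0",
);

my %first_seen;   # normalized line => index where it first appeared
my @dups;         # pairs of [first index, later index]

for my $i (0 .. $#arr) {
    next unless $arr[$i] =~ /[1-9]/;          # skip lines with no non-zero digit
    if (exists $first_seen{ $arr[$i] }) {
        push @dups, [ $first_seen{ $arr[$i] }, $i ];
    } else {
        $first_seen{ $arr[$i] } = $i;
    }
}

print "line $_->[0] duplicates line $_->[1]\n" for @dups;
```

One difference to keep in mind: each repeat is reported only against its first occurrence, so three identical lines give two pairs rather than the three the nested loops would report.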

HTH

nokona13 · Mar 28, 2005

Mike, you are the MAN! Worked perfectly!

I tend to think too much like a C programmer in perl, and I forget how incredibly easy (and fast) it is to just manipulate whole lines like that to set up really easy, non-looping comparisons. Not only does this work, but it's almost instantaneous, whereas on the one file my program worked on, it took like 8 minutes.

Thanks again, I REALLY appreciate it!!

nokona13 · Mar 28, 2005

Oops...

If I still have your attention, I need it to ignore identical lines that are all zeros. I tried adding this to sub identical:

if ($x =~ /^(0\s)+$/) {
    return 0;
}
else {
    return ($x eq $y);
}

And that causes a seg fault. The lines can't be too big for regex stuff since it's done in the top of the program. Any ideas why this would cause it to start segfaulting?

nokona13 · Mar 28, 2005

Okay, so I solved this problem, but if you have any insights into why the above caused a seg fault, I'd still be interested to learn.

I solved the problem by putting this into the while (<FH>) loop.

# skip lines with only zeros
/^(0\s)+$/ && next;

Does the quadratic number of regex statements executed when you put it inside a double loop cause some sort of overflow with perl?

mikevh · Mar 28, 2005

I solved the problem by putting this into the while (<FH>) loop.

# skip lines with only zeros
/^(0\s)+$/ && next;

I think I'd have gone with

Code:

!/[1-9]/ && next;

What if your numbers are zero-filled?
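
To make the difference concrete, here is a small comparison of the two tests on a few made-up lines. The sample strings are assumptions, chosen to show a zero-filled value and a line with no trailing whitespace.

```perl
use strict;
use warnings;

# Made-up lines: trailing space, no trailing space, zero-filled, one non-zero.
my @lines = ("0 0 0 ", "0 0 0", "00 000 0 ", "0 0 1 ");

my %verdict;   # line => "anchored/no-digit" decision for each test
for my $line (@lines) {
    # Original test: every field must be exactly "0" followed by whitespace.
    my $anchored = ($line =~ /^(0\s)+$/) ? "skip" : "keep";
    # Suggested test: skip the line if no digit 1-9 appears anywhere.
    my $no_digit = ($line !~ /[1-9]/)    ? "skip" : "keep";
    $verdict{$line} = "$anchored/$no_digit";
    printf "%-12s anchored=%s  no-1-9=%s\n", "'$line'", $anchored, $no_digit;
}
```

The anchored pattern only skips the first line; it keeps the zero-filled line and also the all-zero line that lacks whitespace after its last field. The character-class test skips all three and keeps only the line containing a 1.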

No insight into your seg fault. I'm not sure I've ever seen a seg fault with Perl, and I've been using it for about 7 years. (If I have, it was so long ago and so infrequent that I don't remember it.) Does this happen when you run other Perl programs, too?

I didn't try to reproduce the seg fault, and I confess I didn't read your code carefully, as I knew the problem could be solved with a simpler approach.

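P. S. You can compare 2 arrays in Perl using eq, as long as you interpolate them into strings first; a bare @$x eq @$y puts both arrays in scalar context and only compares their element counts. I was originally working on an array-driven approach rather than a string-driven one, and identical wouldn't have been much different at all, just that the args would have been 2 array refs instead of 2 strings, e.g.

Code:

sub identical {
    # Array version: interpolation joins the elements with spaces
    my ($x, $y) = @_;
    "@$x" eq "@$y";
}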
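
A quick check of how eq treats arrays, using made-up data: in scalar context an array yields its element count, so eq on two bare arrays compares their lengths, while interpolating them into strings compares the space-joined contents.

```perl
use strict;
use warnings;

my @a = (1, 0, 2);
my @b = (5, 5, 5);   # same length as @a, different contents
my @c = (1, 0, 2);   # same contents as @a

my $len_only = (@a eq @b)     ? 1 : 0;   # compares "3" eq "3"
my $a_vs_b   = ("@a" eq "@b") ? 1 : 0;   # compares "1 0 2" eq "5 5 5"
my $a_vs_c   = ("@a" eq "@c") ? 1 : 0;   # compares "1 0 2" eq "1 0 2"

print "bare eq: $len_only, interpolated a/b: $a_vs_b, a/c: $a_vs_c\n";
```

The interpolated form is safe here because whitespace-split fields cannot themselves contain spaces; for arbitrary strings an element-by-element loop would be the more careful choice.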

The string-driven approach was simpler, though, so I went with that. Glad it worked out. Thanks for the star.
