Reg Ex - how to match pattern EXCEPT for duplicates

leveetated · Mar 8, 2010

Hello regex experts,

I have text files in which I am conducting a regex search. I want to find all instances of two consecutive strings that end in the number 1, *except* for those that are duplicates.

Each string is a 13 character alphanumeric string followed by a DOS line break. It doesn't matter what the first 12 chars are, as long as the last character is a 1.

For example, I want to find this:

3FEC9AAE0A1D1
FBF9A13681971

But NOT this:

FBF9A13681971
FBF9A13681971

And NOT this:

CB9121B9E03F0
6750583E86A11

Any suggestions much appreciated. With thanks,

Lee.

Zhris · Mar 8, 2010

Hello,

To understand this better: can each text file contain any number of lines? And you want to go through them line by line and, if two consecutive strings differ (in their first 12 characters) but both end in 1, return those strings? If so, what should happen when there are 3 consecutive strings? Or does each text file contain just 2 lines?

If I'm not understanding correctly, could you provide a larger data example with the expected output?

Chris

leveetated · Mar 8, 2010

Yes, each .txt file can contain any number of lines, and I want to go through line by line and find at *least* 2 consecutive strings ending in 1, but no two consecutive strings may be duplicates.

Below I inserted [Yes] or [No] to indicate what I'm looking for.

Sample file contents 1:

E30E6FA609370
42D48BD54AC71 [No]
42D48BD54AC71 [No]
E713B786FE2A2
DDF30369DEDB2
42D48BD54AC71 [No]

Sample file contents 2:

D34E8631E7790
1B5B4CBCABEC1 [Yes]
7453E189A3071 [Yes]
6D3FDACD5F721 [Yes]
93F7560F388D1 [Yes]
F607FC8A5ECB0
EA6A769AA9C31 [Yes]
21760A9236941 [Yes]
381BAA86E7051 [Yes]
19F5C59D32A00
5372473F13DD1 [Yes]
861EF12254FE1 [Yes]
A5BA1E1D3C0E0
80CF6DD2C3CB1 [No]

With thanks,

Lee.
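
Read as a rule, the [Yes]/[No] marks above seem to say: a line counts when it ends in 1 and has a neighbour, directly before or after, that also ends in 1 and is not identical to it. That reading is an assumption (the samples don't cover a duplicate sitting inside a longer run), and `is_pair` and `mark_lines` are made-up names. Lines are assumed to have their line breaks already stripped. A rough Perl sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: true if both lines end in 1 and are not identical.
sub is_pair {
    my ($x, $y) = @_;
    return defined $x && defined $y
        && $x =~ /1$/ && $y =~ /1$/ && $x ne $y;
}

# Returns one flag (1 or 0) per input line, in input order.
sub mark_lines {
    my @lines = @_;
    my @flags;
    for my $i (0 .. $#lines) {
        my $prev = $i > 0       ? $lines[$i - 1] : undef;
        my $next = $i < $#lines ? $lines[$i + 1] : undef;
        # Flag the line if it forms a qualifying pair with either neighbour.
        my $hit = is_pair($prev, $lines[$i]) || is_pair($lines[$i], $next);
        push @flags, $hit ? 1 : 0;
    }
    return @flags;
}

# Sample file contents 1 from the post above.
my @sample = qw(
    E30E6FA609370 42D48BD54AC71 42D48BD54AC71
    E713B786FE2A2 DDF30369DEDB2 42D48BD54AC71
);
my @flags = mark_lines(@sample);
print "$sample[$_] ", ($flags[$_] ? "[Yes]" : "[No]"), "\n" for 0 .. $#sample;
```

For sample 1 nothing is flagged: the only adjacent lines ending in 1 are duplicates of each other, and the last line has no neighbour ending in 1.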

Zhris · Mar 8, 2010

Hello,

Something like this may work for you (mostly untested):

Code:

#! /usr/bin/perl
use strict;
#####
my %hash;
my $n = 1;
#####
#Read in data
while (<DATA>) {
	#Remove new line characters
	chomp;
	#Get last character
	my $last_char = substr $_, -1;
	#If last character is 1 then push the line onto this group's array in the hash
	if ($last_char eq '1') {
		push(@{$hash{$n}}, $_);
	}
	#Else add 1 to the hash index as the consecutive group has ended
	else {
		$n++;
	}
}
#####
#Dump hash
foreach (sort {$a <=> $b} %hash) {
	my $array_count = @{$hash{$_}};
	if ($array_count > 1) {
		foreach (@{$hash{$_}}) {
			print "$_, ";
		}
	}
	print "\n";
}
#####
__DATA__
D34E8631E7790
1B5B4CBCABEC1
7453E189A3071
6D3FDACD5F721
93F7560F388D1
F607FC8A5ECB0
EA6A769AA9C31
21760A9236941
381BAA86E7051
19F5C59D32A00
5372473F13DD1
861EF12254FE1
A5BA1E1D3C0E0
80CF6DD2C3CB1

Which produces the following output:

Code:

1B5B4CBCABEC1, 7453E189A3071, 6D3FDACD5F721, 93F7560F388D1,
EA6A769AA9C31, 21760A9236941, 381BAA86E7051,
5372473F13DD1, 861EF12254FE1,

Chris

leveetated · Mar 8, 2010

Chris, this is good - thanks so much. Will give it a try.

All the best,

Lee.

Zhris · Mar 8, 2010

Hello,

Before you do: it's not completely correct. I haven't accounted for identical consecutive lines, and there's also an undefined array reference error at the end of the program. I'm just making a couple of changes.

Chris

Zhris · Mar 8, 2010

I can't seem to figure out why an undefined array reference error occurs; however, it isn't affecting the output, and I'm sure you may be able to figure it out yourself.

The following script accounts for identical consecutive lines. Have a look through __DATA__ and check that the script prints the lines you expect, as your requirements may not be fulfilled with particular datasets. Note that the script removes all identical consecutive lines, however many times they repeat. The new script also ignores lines that are not exactly 13 characters from A-Z (capitals) or 0-9.

Code:

#! /usr/bin/perl
use strict;
#####
my (%hash, $line_carry);
my $n = 1;
#####
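while (<DATA>) {
	chomp;
	#Skip lines that are not exactly 13 characters from A-Z or 0-9
	next unless $_ =~ m/^[A-Z0-9]{13}$/;
	#Duplicate of the previous line: drop that line from the current group (if it was added) and start a new group
	if ($_ eq $line_carry) {
		pop(@{$hash{$n}});
		$n++;
		next;
	}
	my $last_char = substr $_, -1;
	#Line ends in 1: add it to the current group
	if ($last_char eq '1') {
		push(@{$hash{$n}}, $_);
	}
	#Otherwise the consecutive group has ended
	else {
		$n++;
	}
	#Remember this line for the duplicate check on the next one
	$line_carry = $_;
}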
#####
foreach (sort {$a <=> $b} %hash) {
	my $array_count = @{$hash{$_}};
	if ($array_count > 1) {
		print "@{$hash{$_}}\n";
	}
}
#####
__DATA__
D34E8631E7790
1B5B4CBCABEC1
7453E189A3071
6D3FDACD5F721
93F7560F388D1
F607FC8A5ECB0
EA6A769AA9C31
21760A9236941
381BAA86E7051
19F5C59D32A00
5372473F13DD1
861EF12254FE1
A5BA1E1D3C0E0
80CF6DD2C3CB1
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
E713B786FE2A2
DDF30369DEDB2
42D48BD54AC71

Chris
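
On the undefined array reference error: it most likely comes from `foreach (sort {$a <=> $b} %hash)`. In list context `%hash` flattens to both keys and values, so the loop also visits the array references themselves. Once stringified they are not keys of the hash, and dereferencing the resulting undef under `use strict` is fatal. Since the references sort after the small group numbers, the real groups are printed first, which would explain why the output looks right. A minimal sketch of the output loop using `keys` instead (`format_groups` and the small test hash are illustrative only, not part of the scripts above):

```perl
use strict;
use warnings;

# Hypothetical helper: one output line per group holding 2 or more entries.
sub format_groups {
    my ($groups) = @_;    # hash ref: group number => array ref of lines
    my @out;
    # keys %$groups yields only the group numbers; a bare %$groups would
    # also yield the array refs stored as values.
    foreach my $n (sort { $a <=> $b } keys %$groups) {
        my @lines = @{ $groups->{$n} };
        push @out, "@lines" if @lines > 1;
    }
    return @out;
}

my %hash = (
    2 => [qw(1B5B4CBCABEC1 7453E189A3071)],
    5 => [qw(80CF6DD2C3CB1)],
);
print "$_\n" for format_groups(\%hash);
```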

stevexff · Mar 9, 2010

Doing this in a regex is a tall order, so I've adopted a programmatic approach instead. I've tried to simplify it as much as possible, and I think this does what you want. It even has a regex in the mix, as requested [wink]

Perl:

use strict;
use warnings;

my @data = <DATA>;
chomp @data;

for (my $i = 1; $i < @data; $i++) {
        next if grep {/[^1]$/} @data[$i -1, $i];
        next if $data[$i -1] eq $data[$i];
        print join("\n", "===", @data[$i -1, $i]), "\n";
}

__DATA__
D34E8631E7790
1B5B4CBCABEC1
7453E189A3071
6D3FDACD5F721
93F7560F388D1
F607FC8A5ECB0
EA6A769AA9C31
21760A9236941
381BAA86E7051
19F5C59D32A00
5372473F13DD1
861EF12254FE1
A5BA1E1D3C0E0
80CF6DD2C3CB1
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
E713B786FE2A2
DDF30369DEDB2
42D48BD54AC71

gives

Code:

===
1B5B4CBCABEC1
7453E189A3071
===
7453E189A3071
6D3FDACD5F721
===
6D3FDACD5F721
93F7560F388D1
===
EA6A769AA9C31
21760A9236941
===
21760A9236941
381BAA86E7051
===
5372473F13DD1
861EF12254FE1
===
80CF6DD2C3CB1
42D48BD54AC71

The pairs of values can overlap each other, i.e. three consecutive values ending in '1' produce two hits.

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)
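
For anyone who still wants a single regex: one possible approach is to slurp the text into one string and put the whole test in a lookahead, with a negative lookahead on a backreference to reject the duplicate. This is only a sketch under assumptions: the `[A-Z0-9]` character class is borrowed from the script earlier in the thread, `find_pairs` is a made-up name, and line breaks may be DOS (`\r\n`) or Unix (`\n`).

```perl
use strict;
use warnings;

# Hypothetical helper: returns [first, second] for every qualifying pair.
sub find_pairs {
    my ($text) = @_;
    my @pairs;
    # The whole pattern sits in a lookahead, so each match is zero-width
    # and the search resumes on the very next line (overlapping pairs).
    #   ([A-Z0-9]{12}1)   first line, captured as \1
    #   (?!\1\r?$)        the second line must not repeat the first
    #   ([A-Z0-9]{12}1)   second line, also ending in 1
    while ($text =~ /^(?=([A-Z0-9]{12}1)\r?\n(?!\1\r?$)([A-Z0-9]{12}1)\r?$)/mg) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my $text = join "\r\n", qw(
    D34E8631E7790 1B5B4CBCABEC1 7453E189A3071
    F607FC8A5ECB0 42D48BD54AC71 42D48BD54AC71
);
print "$_->[0] / $_->[1]\n" for find_pairs($text);
```

As with the loop above, three consecutive distinct values ending in 1 give two overlapping hits. Whether this is easier to maintain than the programmatic version is debatable.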
