Reg Ex - how to match pattern EXCEPT for duplicates

Status
Not open for further replies.

leveetated

Programmer
Jan 24, 2008
46
US
Hello regex experts,

I have text files in which I am conducting a regex search. I want to find all instances of two consecutive strings that end in the number 1, *except* for those that are duplicates.

Each string is a 13-character alphanumeric string followed by a DOS line break. It doesn't matter what the first 12 characters are, as long as the last character is a 1.

For example, I want to find this:

3FEC9AAE0A1D1
FBF9A13681971

But NOT this:

FBF9A13681971
FBF9A13681971

And NOT this:

CB9121B9E03F0
6750583E86A11

Any suggestions much appreciated. With thanks,

Lee.
 
Hello,

To make sure I understand: each text file may contain any number of lines, and you want to go through it line by line and return any two consecutive strings that differ in their first 12 characters but both end in 1? If so, what happens when there are 3 consecutive strings? Or does each text file contain just 2 lines?

If I'm not understanding correctly, could you provide a larger data example with the expected output?

Chris
 
Yes, each .txt file can contain any number of lines, and I want to go through line by line and find at *least* 2 consecutive strings ending in 1, but any two consecutive strings may not be duplicates.

Below I inserted [Yes] or [No] to indicate what I'm looking for.

Sample file contents 1:

E30E6FA609370
42D48BD54AC71 [No]
42D48BD54AC71 [No]
E713B786FE2A2
DDF30369DEDB2
42D48BD54AC71 [No]


Sample file contents 2:

D34E8631E7790
1B5B4CBCABEC1 [Yes]
7453E189A3071 [Yes]
6D3FDACD5F721 [Yes]
93F7560F388D1 [Yes]
F607FC8A5ECB0
EA6A769AA9C31 [Yes]
21760A9236941 [Yes]
381BAA86E7051 [Yes]
19F5C59D32A00
5372473F13DD1 [Yes]
861EF12254FE1 [Yes]
A5BA1E1D3C0E0
80CF6DD2C3CB1 [No]


With thanks,

Lee.
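
The rule described above can be sketched as a small Perl routine. This is only a sketch under stated assumptions: `flag_lines` is a hypothetical helper name, the sample is a shortened selection of lines from the second file, and a repeated line still counts as a hit when its other neighbour is a different string ending in 1, which the thread does not settle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one flag per input line (1 = "Yes", 0 = "No").
sub flag_lines {
    my @lines = @_;
    s/\r?\n\z// for @lines;              # strip DOS or Unix line endings
    my @flags = (0) x @lines;
    for my $i (1 .. $#lines) {
        my ($prev, $cur) = @lines[$i - 1, $i];
        # Both neighbours must be 13 characters long and end in 1 ...
        next unless $prev =~ /\A[A-Z0-9]{12}1\z/
                 && $cur  =~ /\A[A-Z0-9]{12}1\z/;
        # ... and must not be duplicates of each other
        next if $prev eq $cur;
        @flags[$i - 1, $i] = (1, 1);
    }
    return @flags;
}

# Shortened selection of lines from sample file 2, with DOS line breaks
my @sample = map { "$_\r\n" } qw(
    D34E8631E7790 1B5B4CBCABEC1 7453E189A3071 F607FC8A5ECB0 80CF6DD2C3CB1
);
print join(' ', flag_lines(@sample)), "\n";    # prints "0 1 1 0 0"
```

Stripping `\r?\n` explicitly, rather than relying on `chomp`, keeps the last-character test working when a DOS-format file is read on a system whose native line ending is a bare newline.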
 
Hello,

Something like this may work for you (mostly untested):

Code:
#! /usr/bin/perl
use strict;
#####
my %hash;
my $n = 1;
#####
#Read in data
while (<DATA>) {
	#Remove new line characters
	chomp;
	#Get last character
	my $last_char = substr $_, -1;
	#If last character = 1 then push the line into hash in an array context
	if ($last_char == 1) {
		push(@{$hash{$n}}, $_);
	}
	#Else add 1 to the hash index as the consecutive group has ended
	else {
		$n++;
	}
}
#####
#Dump hash
foreach (sort {$a <=> $b} %hash) {
	my $array_count = @{$hash{$_}};
	if ($array_count > 1) {
		foreach (@{$hash{$_}}) {
			print "$_, ";
		}
	}
	print "\n";
}
#####
__DATA__
D34E8631E7790
1B5B4CBCABEC1
7453E189A3071
6D3FDACD5F721
93F7560F388D1
F607FC8A5ECB0
EA6A769AA9C31
21760A9236941
381BAA86E7051
19F5C59D32A00
5372473F13DD1
861EF12254FE1
A5BA1E1D3C0E0
80CF6DD2C3CB1

Which produces the following output:

Code:
1B5B4CBCABEC1, 7453E189A3071, 6D3FDACD5F721, 93F7560F388D1,
EA6A769AA9C31, 21760A9236941, 381BAA86E7051,
5372473F13DD1, 861EF12254FE1,

Chris
 
Chris, this is good - thanks so much. Will give it a try.

All the best,

Lee.
 
Hello,

Before you do: it's not completely correct. I haven't accounted for identical consecutive lines, and there's also an undefined array reference error at the end of the program. I'm just making a couple of changes.

Chris
 
I can't seem to figure out why an undefined array reference error occurs; however, it isn't affecting the output, and I'm sure you may be able to figure it out yourself.

The following script accounts for identical consecutive lines. Have a look through __DATA__ and check that the script prints the lines you expect, as your requirements may not be fulfilled with particular datasets. Note that the script removes all identical consecutive lines, not just those that are repeated once. The new script also ignores lines that don't consist of exactly 13 characters from A-Z (capitals) or 0-9.

Code:
#! /usr/bin/perl
use strict;
#####
my (%hash, $line_carry);
my $n = 1;
#####
while (<DATA>) {
	chomp;
	next unless $_ =~ m/^[A-Z0-9]{13}$/;
	if ($_ eq $line_carry) {
		pop(@{$hash{$n}});
		$n++;
		next;
	}
	my $last_char = substr $_, -1;
	if ($last_char == 1) {
		push(@{$hash{$n}}, $_);
	}
	else {
		$n++;
	}
	$line_carry = $_;
}
#####
foreach (sort {$a <=> $b} keys %hash) {
	my $array_count = @{$hash{$_}};
	if ($array_count > 1) {
		print "@{$hash{$_}}\n";
	}
}
#####
__DATA__
D34E8631E7790
1B5B4CBCABEC1
7453E189A3071
6D3FDACD5F721
93F7560F388D1
F607FC8A5ECB0
EA6A769AA9C31
21760A9236941
381BAA86E7051
19F5C59D32A00
5372473F13DD1
861EF12254FE1
A5BA1E1D3C0E0
80CF6DD2C3CB1
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
E713B786FE2A2
DDF30369DEDB2
42D48BD54AC71

Chris
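
The undefined array reference error mentioned in this thread very likely comes from a final loop written as `foreach (sort {$a <=> $b} %hash)`. In list context a hash flattens to its keys and its values, so the loop also visits the array references and then uses each one as a key that does not exist. The sketch below shows the effect on a hypothetical miniature of the grouping hash; `groups_of_two_or_more` is an illustrative name, not something from the scripts above.

```perl
use strict;
use warnings;

# Hypothetical miniature of the grouping hash built by the scripts
my %hash = (
    2 => [ '1B5B4CBCABEC1', '7453E189A3071' ],
    5 => [ '80CF6DD2C3CB1' ],
);

# Sorting %hash itself yields keys AND values: the array references sort
# after the small numeric keys, and looking one up as a key returns undef.
my @mixed = sort { $a <=> $b } %hash;
print scalar(@mixed), " items\n";    # prints "4 items"

# Sorting only the keys avoids the stray lookups.
sub groups_of_two_or_more {
    my ($groups) = @_;
    return map  { "@{ $groups->{$_} }" }
           grep { @{ $groups->{$_} } > 1 }
           sort { $a <=> $b } keys %$groups;
}

print "$_\n" for groups_of_two_or_more(\%hash);
# prints "1B5B4CBCABEC1 7453E189A3071"
```

Because the references compare as large numbers, they land at the end of the sorted list, which would explain why the expected output appears before the error does.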
 
Doing this in a regex is a tall order, so I've adopted a programmatic approach instead. I've tried to simplify it as much as possible, and I think this does what you want. It even has a regex in the mix, as requested [wink]
Perl:
use strict;
use warnings;

my @data = <DATA>;
chomp @data;

for (my $i = 1; $i < @data; $i++) {
        next if grep {/[^1]$/} @data[$i -1, $i];
        next if $data[$i -1] eq $data[$i];
        print join("\n", "===", @data[$i -1, $i]), "\n";
}

__DATA__
D34E8631E7790
1B5B4CBCABEC1
7453E189A3071
6D3FDACD5F721
93F7560F388D1
F607FC8A5ECB0
EA6A769AA9C31
21760A9236941
381BAA86E7051
19F5C59D32A00
5372473F13DD1
861EF12254FE1
A5BA1E1D3C0E0
80CF6DD2C3CB1
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
42D48BD54AC71
E713B786FE2A2
DDF30369DEDB2
42D48BD54AC71
gives
Code:
===
1B5B4CBCABEC1
7453E189A3071
===
7453E189A3071
6D3FDACD5F721
===
6D3FDACD5F721
93F7560F388D1
===
EA6A769AA9C31
21760A9236941
===
21760A9236941
381BAA86E7051
===
5372473F13DD1
861EF12254FE1
===
80CF6DD2C3CB1
42D48BD54AC71
The pairs of values can overlap each other, i.e. three consecutive values ending in '1' produce two hits.

Steve
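
For completeness, a single regex can get fairly close if the whole file is held in one string. The sketch below assumes the data fits in memory and that each value is exactly 13 characters from A-Z or 0-9; `find_pairs` is a hypothetical helper name. It uses a backreference inside a negative lookahead to reject duplicates, and keeps the second line inside a lookahead so that pairs may overlap as described above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [first, second] pairs found in a block of text.
sub find_pairs {
    my ($text) = @_;
    my @pairs;
    while ($text =~ /^([A-Z0-9]{12}1)\r?\n          # a line ending in 1
                     (?=                            # followed by (not consumed)
                       ((?!\1\r?$)[A-Z0-9]{12}1)    # a different line ending in 1
                       \r?$
                     )/mgx) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

# Shortened selection of lines from the data above, with DOS line breaks
my $text = join '', map { "$_\r\n" } qw(
    D34E8631E7790 1B5B4CBCABEC1 7453E189A3071 6D3FDACD5F721
    F607FC8A5ECB0 42D48BD54AC71 42D48BD54AC71
);
print "$_->[0] $_->[1]\n" for find_pairs($text);
# prints:
#   1B5B4CBCABEC1 7453E189A3071
#   7453E189A3071 6D3FDACD5F721
```

Only the first line of each pair is consumed by the match, so the next search starts at the second line and can pair it with the line after it.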

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 