similar consecutive lines

Status
Not open for further replies.

mikeboz

MIS
May 18, 2000
41
US
I need to determine if similar consecutive lines exist.
For example, in the data below:
lines 3,4
lines 6,7

12345r#1
11234r#1
11123r#1.mi
11123r#1
12355r#1
54321r#1
54321r#1.MI
12555r#1

How would you do it?
 
Something like this [based on MY understanding of what 'similar' is]:

nawk -f sim.awk sim.txt

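#--------------------- sim.awk
BEGIN {
    FS = "#"
}

# Remember the first line; there is nothing to compare it with yet.
NR == 1 { prev = $0; next }

{
    # Field 1 of the previous line, i.e. the text before the first "#".
    split(prev, prevA, FS)
    prevPAT = prevA[1]

    if ($1 == prevPAT) {
        print prev; print
    }
    prev = $0
}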
vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
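
For reference, a minimal sketch of what the field-based script above prints for the sample data from the question. The shell function name similar_by_field is made up for this illustration, and plain awk is assumed to behave like nawk here.

```shell
# Hypothetical wrapper; the awk body is the field-comparison script from the post above.
similar_by_field() {
    awk '
    BEGIN { FS = "#" }
    NR == 1 { prev = $0; next }
    {
        # Compare the text before the first "#" on this line and the previous one.
        split(prev, prevA, FS)
        if ($1 == prevA[1]) { print prev; print }
        prev = $0
    }'
}

# Sample data from the original question.
printf '%s\n' '12345r#1' '11234r#1' '11123r#1.mi' '11123r#1' \
              '12355r#1' '54321r#1' '54321r#1.MI' '12555r#1' | similar_by_field
```

This should print lines 3,4 and 6,7 of the sample. Each matching pair is printed on its own, so in a run of three similar lines the middle one would appear twice.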
 
I think vgersh's example is the best, but if you need
order-independent (ordinally unbiased) char matching you can do that too.

Code:
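function a_match(line1, line2,    p, pp, n1, n2, arr1, arr2, matches) {
    # Counts every pair of positions (p, pp) holding the same character,
    # regardless of order. The parameters after the gap act as locals.
    matches = 0
    # Splitting on "" into single characters is a common awk extension.
    n1 = split(line1, arr1, "")
    n2 = split(line2, arr2, "")

    if (!n1 || !n2) {
        return -1
    }
    for (p = 1; p <= n1; p++) {
        for (pp = 1; pp <= n2; pp++) {
            if (arr1[p] == arr2[pp]) {
                matches++
            }
        }
    }
    return matches
}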

Then you just set a threshold value for a match.
ex:
Code:
BEGIN {
    threshold = 4
}
# Compare each line with the one before it.
NR > 1 && a_match(prev, $0) >= threshold {
    print prev, "is similar to", $0
}
{ prev = $0 }
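
To get a feel for what threshold to pick, here is a rough sketch that prints the pair count for two pairs of lines from the sample data. It is only an illustration: the wrapper name pair_score is made up, and the function walks the strings with substr() instead of split(s, a, ""), on the assumption that not every awk supports splitting on an empty string.

```shell
# Hypothetical helper: prints the a_match-style score for two strings.
pair_score() {
    awk -v a="$1" -v b="$2" '
    # Counts every pair of positions holding the same character.
    function a_match(line1, line2,    p, pp, matches) {
        matches = 0
        if (line1 == "" || line2 == "") return -1
        for (p = 1; p <= length(line1); p++)
            for (pp = 1; pp <= length(line2); pp++)
                if (substr(line1, p, 1) == substr(line2, pp, 1))
                    matches++
        return matches
    }
    BEGIN { print a_match(a, b) }'
}

pair_score '11123r#1.mi' '11123r#1'   # lines 3,4 of the sample: prints 20
pair_score '12355r#1' '54321r#1'      # lines 5,6 of the sample: prints 10
```

Repeated characters multiply the count (the four 1s on each side of the first pair contribute 16 on their own), so a threshold of 4 would also flag the second, unrelated pair. The threshold probably needs to scale with line length for this kind of data.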

You could also do a strtok()-like thing: tokenise
the target string and use awk's regexp system against
the consecutive string, much like vgersh's example.
 
marsd,
great idea - particularly the LAST paragraph!!!!
have a start ;)

something like that:

#-------------------- sim.awk
BEGIN {
FS="#"
# to use a static thresholding
threshold=4
}

NR ==1 { prev=$0; next }

{
# to use 'dynamic' thresholding based on the length of the first field
# threshold=length(prev);
if( match(prev, "^" substr(prev, 1, threshold) ".*/") == 1)
print prev; print;
}
prev=$0
}
vlad
 
Nice responses, folks, thanks.
Could you explain this line for me?

(prev, "^" substr(prev, 1, threshold)

thanks
 
sorry:

match(prev, "^" substr(prev, 1, threshold) ".*/")
 
Sorry, it should actually be as below. My bad.

#-----------------
BEGIN {
FS="#"
# to use a static thresholding
threshold=4
}

NR ==1 { prev=$0; next }

{
# to use 'dynamic' thresholding based on the length of the first field
# threshold=length(prev);
if( match($0, "^" substr(prev, 1, threshold) ".*") == 1) {
print prev; print;
}
prev=$0
}


#---------------------
if( match($0, "^" substr(prev, 1, threshold) ".*") == 1)

<From 'man nawk'>:
match(s,ere)
Return the position, in characters, numbering from 1,
in string s where the extended regular expression ere
occurs, or zero if it does not occur at all.

</From 'man nawk'>
Here 'match' returns 1, in other words the beginning of the string (the RE is anchored with "^", so the only other possible result is 0). We are matching the ENTIRE current record [$0] against a dynamically built RE based on the content of the PREVIOUS record/line [prev]. The RE consists of:
"^" - beginning of the record
substr(prev, 1, threshold) - substring of the previous record [prev], starting at position 1 and 'threshold' characters long
".*" - followed by ANY characters

If 'match' returns 1 [the FIRST position of the current record], we print the previous record [prev] AND the current record [a bare 'print' implicitly prints the entire current record].
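
One thing to watch with the dynamically built RE: the prefix taken from prev is interpreted as a regular expression, so characters such as "+", "(" or "." in the data are treated as metacharacters and can change the result or make awk complain about the pattern. A possible alternative, sketched below under the assumption that a plain prefix comparison is what is wanted, compares the two substrings directly. The function name similar_by_prefix and the test lines are made up for illustration.

```shell
# Hypothetical wrapper: compares the first N characters literally (default N=4).
similar_by_prefix() {
    awk -v threshold="${1:-4}" '
    # substr() returns strings, so this is a literal comparison, not a regex match.
    NR > 1 && substr($0, 1, threshold) == substr(prev, 1, threshold) {
        print prev; print
    }
    { prev = $0 }'
}

# Lines with regex metacharacters in the prefix are compared as plain text.
printf '%s\n' 'a+b(1' 'a+b(2' 'zzzz' | similar_by_prefix 4
```

Here the first two lines are printed as a similar pair. With the sample data from the question and a threshold of 4, this should still report lines 3,4 and 6,7.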

hth

vlad
 