How to extract multiple matches (iteration)

Status
Not open for further replies.

qra

Technical User
Nov 6, 2009
PL
Hi.

I'm trying to extract telephone numbers from many files and save the result to a tab-delimited file. The output file has 3 columns per record: telephone number, fax number, cellphone number.

My input looks like this (it's an HTML page, and this is the part of it that contains the numbers):
Code:
<div class='phones'>
 <p class='tel'>
                         Tel/fax: 81 852 50 16</p>
 <p class='tel'>
                         Tel.: 81 852 50 01</p>
 <p class='tel'>
                         Tel.: 81 852 50 35</p>

                         Kom.: 605 340 234<br />   
</div>

Desired output (4 records with 3 columns each):
Code:
81 852 50 01, 81 852 50 35 # 81 852 50 16 # 605 340 234
81 852 50 01               # 81 852 50 16 # 
                           # 81 852 50 16 # 
81 852 50 01, 81 852 50 35 #              # 605 340 234
The problem is that sometimes there is one telephone number and sometimes three; when there are several, I want to put them in one column, comma-delimited. Sometimes there is only a Tel/fax: entry, so the telephone and fax numbers are the same.

I ended up with this (I realise that it is a completely wrong solution) and have no clue how to change it (some kind of iteration?):
Code:
BEGIN {
  RS="</span>|</div>"
  OFS="\t"
}
FNR==1 && NR!=1 {
  printrecord()
}

## Only telephone number
/<div class='phones'>.*<p class='tel'>.*Tel.:/ {
  d["Tel"]=$0
  gsub(/<div class='phones'>|<\/div>.*/,"",d["Tel"])
  d["tel_stac"]=d["Tel"]
  gsub(/.*<p class='tel'>|<\/p>.*/,"",d["tel_stac"])
  gsub(/.*Tel.: /,"",d["tel_stac"])
}
## Telephone and fax 
/<div class='phones'>.*<p class='tel'>.*Tel\/fax:/ {
  d["Tel"]=$0
  gsub(/<div class='phones'>|<\/div>.*/,"",d["Tel"])
  d["tel_stac"]=d["Tel"]
  gsub(/.*<p class='tel'>|<\/p>.*/,"",d["tel_stac"])
  gsub(/.*Tel\/fax: /,"",d["tel_stac"])
  d["tel_fax"]=d["tel_stac"]
}

## Cellphone number
/Kom.:/ {
  d["Tel"]=$0
  gsub(/.*Kom.:|<br \/>.*/,"",d["Tel"])
  d["tel_kom"]=d["Tel"]
  gsub(/.*Kom.: /,"",d["tel_kom"])
}

END {
  printrecord()
}

function printrecord()
{
  for (f in d) {
    gsub(/^ +| +$/,"",d[f])
    gsub(/&nbsp;/," ",d[f])
    gsub(/&lt;/,"<",d[f])
    gsub(/&gt;/,">",d[f])
    gsub(/&amp;/,"&",d[f])
    gsub(/&raquo;/,"",d[f])
  }

  print d["tel_stac"],d["tel_fax"],d["tel_kom"]

  for (f in d) delete d[f]
}

Any ideas?
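
One possible direction, as a rough sketch rather than a definitive answer: instead of one rule per label, loop over each line with match() and pull out every "label: number" pair, appending each Tel.: hit to a comma-delimited list. The awk program is wrapped in a shell function only so it can be run directly; the function name extract_phones and the variable names are made up for this illustration. It assumes each number consists only of digits and spaces and, following the desired output above, puts a Tel/fax: number in the fax column only.

```shell
#!/bin/sh
# Sketch: print "tel<TAB>fax<TAB>cell" once per input file (or once for stdin).
# extract_phones is a hypothetical helper name, not part of any tool.
extract_phones() {
  awk '
    BEGIN { OFS = "\t" }

    # A new file starts: flush the record collected from the previous one.
    FNR == 1 && NR != 1 { printrecord() }

    {
      line = $0
      # Iterate over every "label: number" pair on the line. match() sets
      # RSTART and RLENGTH; after each hit, drop the consumed text and retry.
      while (match(line, /(Tel\/fax|Tel\.|Kom\.): *[0-9][0-9 ]*/)) {
        hit  = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)

        sep   = index(hit, ":")
        label = substr(hit, 1, sep - 1)
        num   = substr(hit, sep + 1)
        gsub(/^ +| +$/, "", num)          # trim surrounding spaces

        if (label == "Tel/fax")   fax  = num
        else if (label == "Kom.") cell = num
        else                      tel  = (tel == "" ? num : tel ", " num)
      }
    }

    END { if (NR > 0) printrecord() }

    function printrecord() {
      # Assumption: Tel/fax fills the fax column only, as in the desired output.
      print tel, fax, cell
      tel = fax = cell = ""
    }
  ' "$@"
}
```

Usage would be something like `extract_phones page1.html page2.html > phones.tsv` (file names are placeholders). If a lone Tel/fax: number should also show up in the telephone column, adding `if (tel == "") tel = fax` at the top of printrecord() would do that.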
 