Hi.
I'm trying to extract telephone numbers from many files and save the result to a tab-delimited file. The output file has 3 columns per record: telephone number, fax number, cellphone number.
The problem is that sometimes there is one telephone number and sometimes there are three; in that case I want to put all of them in one column, comma-delimited. Sometimes there is only Tel/fax:, so the telephone and fax numbers are the same.
My input looks like this (it's an HTML page, and this is the part of it with the numbers):
Code:
<div class='phones'>
<p class='tel'>
Tel/fax: 81 852 50 16</p>
<p class='tel'>
Tel.: 81 852 50 01</p>
<p class='tel'>
Tel.: 81 852 50 35</p>
Kom.: 605 340 234<br />
</div>
Desired output (4 records with 3 columns each):
Code:
81 852 50 01, 81 852 50 35 # 81 852 50 16 # 605 340 234
81 852 50 01 # 81 852 50 16 #
# 81 852 50 16 #
81 852 50 01, 81 852 50 35 # # 605 340 234
I ended up with this (I realise it is a completely wrong solution), and I have no clue how to change it (some kind of iteration?):
Code:
BEGIN {
    RS="</span>|</div>"
    OFS="\t"
}
FNR==1 && NR!=1 {
    printrecord()
}
## Only telephone number
/<div class='phones'>.*<p class='tel'>.*Tel.:/ {
    d["Tel"]=$0
    gsub(/<div class='phones'>|<\/div>.*/,"",d["Tel"])
    d["tel_stac"]=d["Tel"]
    gsub(/.*<p class='tel'>|<\/p>.*/,"",d["tel_stac"])
    gsub(/.*Tel.: /,"",d["tel_stac"])
}
## Telephone and fax
/<div class='phones'>.*<p class='tel'>.*Tel\/fax:/ {
    d["Tel"]=$0
    gsub(/<div class='phones'>|<\/div>.*/,"",d["Tel"])
    d["tel_stac"]=d["Tel"]
    gsub(/.*<p class='tel'>|<\/p>.*/,"",d["tel_stac"])
    gsub(/.*Tel\/fax: /,"",d["tel_stac"])
    d["tel_fax"]=d["tel_stac"]
}
## Cellphone number
/Kom.:/ {
    d["Tel"]=$0
    gsub(/.*Kom.:|<br \/>.*/,"",d["Tel"])
    d["tel_kom"]=d["Tel"]
    gsub(/.*Kom.: /,"",d["tel_kom"])
}
END {
    printrecord()
}
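function printrecord()
{
    for (f in d) {
        gsub(/^ +| +$/,"",d[f])
        ## Decode common HTML entities
        gsub(/&nbsp;/," ",d[f])
        gsub(/&lt;/,"<",d[f])
        gsub(/&gt;/,">",d[f])
        ## A bare & in the replacement means "the matched text", so escape it
        gsub(/&amp;/,"\\&",d[f])
        gsub(/&raquo;/,"",d[f])
    }
    print d["tel_stac"],d["tel_fax"],d["tel_kom"]
    for (f in d) delete d[f]
}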
Any ideas?
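One possible direction, as a rough sketch rather than a definitive answer: read the input line by line (default RS), append every Tel.: number to a comma-separated string, and print one record per file. The names extract_phones, number and emit are made up for illustration. The sketch assumes each label sits on the same line as its number (as in the sample), that each file yields exactly one record, and that a Tel/fax: number goes only into the fax column, as the desired output suggests.

```sh
#!/bin/sh
# Hypothetical helper: prints one tab-separated line (tel, fax, cell)
# for each input file given as an argument.
extract_phones() {
    awk '
    # Return the number that follows the label on this line (assumed layout).
    function number(s) {
        sub(/^[^:]*:[ \t]*/, "", s)   # drop "Tel.: ", "Tel/fax: " or "Kom.: "
        sub(/[ \t]*<.*$/, "", s)      # drop trailing markup such as </p> or <br />
        sub(/[ \t\r]+$/, "", s)       # drop trailing whitespace
        return s
    }
    # Print the collected record and reset the three columns.
    function emit() {
        print tel, fax, cell
        tel = fax = cell = ""
    }
    BEGIN { OFS = "\t" }
    FNR == 1 && NR > 1 { emit() }     # a new file starts: flush the previous one
    /Tel\/fax:/ { fax = number($0); next }
    /Tel\.:/ {                        # may occur several times per file
        n = number($0)
        tel = (tel == "" ? n : tel ", " n)
        next
    }
    /Kom\.:/ { cell = number($0) }
    END { if (NR > 0) emit() }
    ' "$@"
}

# Example call (file names are placeholders):
#   extract_phones page1.html page2.html > phones.tsv
```

Accumulating into a string avoids the need for a separate pattern per number count. Note that a completely empty input file would produce no record at all in this sketch, since the flush is triggered by the first line of the next file.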