extract a specific expression from a file

Status
Not open for further replies.

michael3

Programmer
Aug 8, 2001
26
US
Hi,

I am trying to extract all combined words starting with tbf0_ (for example, tbf0_thrd_pty_grp_exclsns) from a file like the following.

CETP_GExRpt:77:# tbf0_thrd_pty_grp_exclsns
CETP_GExRpt:100:FROM tbf0_thrd_pty_grp_exclsns
CETP_PatEdUpd:166: DELETE from tbf0_drug_ped_ptrn
CETP_PatEdUpd:183: UPDATE tbf0_drug_ped_ptrn
CETP_PatEdUpd:203: DELETE FROM tbf0_drug_ped_txt
CETP_PatEdUpd:220: UPDATE tbf0_drug_ped_txt
CETP_PatEdUpd:239: DELETE FROM tbf0_drug_ped_txt_cd
CETP_PatEdUpd:257: UPDATE tbf0_drug_ped_txt_cd
CETP_PatEdUpd:274:update tbf0_drug_ped_txt
CETP_PatEdUpd:295:update tbf0_drug_ped_txt
CETP_StFormUpd:82:#Add new states to tbf0_medicaid_st_form.
adr7.ctl:4:INTO TABLE TBF0_DRUG_COST
cardinal.ctl:4:INTO TABLE TBF0_DRUG_COST

From the above input file, I need
tbf0_thrd_pty_grp_exclsns,
tbf0_drug_ped_txt_cd,
tbf0_drug_ped_txt, etc
to appear in my output only once (no duplicates). Can anyone give me a hint? Thanks a lot.

Michael
 
Thanks, lancer73.
But that won't work. Actually, my input file is the output of: % grep -in tbf0_ *, so every line contains tbf0_xxxxxx somewhere in the line.

I want only the tbf0_xxxxx, not the whole line;
and if the same tbf0_xxxxx occurs several times, I want to see it only once in my output. I do not care whether the results are sorted.
That is to say, from the input given above, I need the output to be:

tbf0_thrd_pty_grp_exclsns,
tbf0_drug_ped_ptrn
tbf0_drug_ped_txt_cd,
tbf0_drug_ped_txt, etc
TBF0_DRUG_COST

Thanks. Michael


 
Assuming " " is your field separator,


nawk -f getPattern.awk myFile.txt

#---------------- getPattern.awk
BEGIN {
pattern="tbf0_"
}

$0 ~ pattern {
for (i=1; i <= NF; i++)
if ( $i ~ pattern )
arr[$i]++
}

END {
for (i in arr)
print i;
}

#---------------- getPattern.awk

vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
 
Oops, forgot about the case insensitivity:

#---------------- getPattern.awk
BEGIN {
pattern="tbf0_"
}

tolower($0) ~ pattern {
for (i=1; i <= NF; i++)
if ( tolower($i) ~ pattern )
arr[$i]++
}

END {
for (i in arr)
print i;
}

#---------------- getPattern.awk

vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
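
The field-based scripts above print whole whitespace-separated fields, so a field such as tbf0_medicaid_st_form. keeps its trailing period. What follows is only a sketch of an alternative: it pulls out just the matching substring, assuming table names consist only of letters, digits, and underscores, and assuming an awk that has match(), substr(), and tolower() (nawk, gawk, or another POSIX awk; on Solaris, call nawk instead of awk). The function name extract_tbf0 is made up for illustration.

```shell
# Hypothetical helper: print each distinct tbf0_ name found in the input,
# matching case-insensitively and dropping surrounding punctuation.
# Assumption: names contain only letters, digits, and underscores.
extract_tbf0() {
    awk '
    {
        line = $0
        # Repeatedly find the next tbf0_ name in what is left of the line.
        while (match(tolower(line), /tbf0_[a-z0-9_]*/)) {
            name = substr(line, RSTART, RLENGTH)
            if (!(name in seen)) {   # first time this exact spelling appears
                seen[name] = 1
                print name
            }
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

Called as extract_tbf0 myFile.txt, it prints each name once, in the order first seen. Upper- and lower-case spellings are kept as written and counted as different names, so TBF0_DRUG_COST and tbf0_drug_cost would both appear.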
 
Maybe this will work if the field is always last on the line.

awk '{print $NF}' | sort -u

CaKiwi
 
Hi:

Sounds to me like you didn't want anything ending with a period either:


nawk ' {

$0 ~ /tbf0_/
{
for (i=1; i<=NF; i++)
if(($i ~ /^tbf0*/ && $i !~ /\.$/) ||
($i ~ /^TBF0*/ && $i !~ /\.$/))
printf("%s\n", $i)
}
} ' d.file |uniq

Regards,

Ed
 
Thank you all for the help, but I still can't get it to work.

==========================================================
Tried vlad's 1st solution (since I don't have tolower()), but I get a syntax error:

devced01:/home/capimprv1/dbl/scripts/extract>cat ext
#---------------- getPattern.awk
BEGIN {
pattern="tbf0_"
}
$0 ~ pattern {
for (i=1; i <= NF; i++)
if ( $i ~ pattern )
arr[$i]++
}
END {
for (i in arr)
print i;
}

devced01:/home/capimprv1/dbl/scripts/extract>awk -f ext tstfile
awk: syntax error near line 6
awk: bailing out near line 6
devced01:/home/capimprv1/dbl/scripts/extract>

========================================================
Thanks, CaKiwi. But tbf0_xxxx can be anywhere in the line.

Thanks, ALL.
Michael

 
Thanks, olded. It almost works, but it seems that uniq didn't work. I would prefer to see only one "tbf0_drug_ped_txt".

michael

==================================
devced01:/home/capimprv1/dbl/scripts/extract>} ' tst |uniq <
tbf0_thrd_pty_grp_exclsns
tbf0_drug_ped_ptrn
tbf0_drug_ped_txt
tbf0_drug_ped_txt_cd
tbf0_drug_ped_txt
TBF0_DRUG_COST
 
If you are on Solaris, use nawk instead of awk for vlad's solution.

CaKiwi
 
Michael:

I'd trade the uniq command for sort -u:

nawk '
$0 ~ /tbf0_/ || $0 ~ /TBF0_/ {
    for (i=1; i<=NF; i++)
        if(($i ~ /^tbf0_/ && $i !~ /\.$/) ||
           ($i ~ /^TBF0_/ && $i !~ /\.$/))
            printf("%s\n", $i)
}
' d.file | sort -u


Sorry about that!

Ed
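
uniq only collapses duplicate lines that are adjacent, which is why the second tbf0_drug_ped_txt survived in Michael's run; sort -u avoids the problem by sorting first. If the original order matters, a common awk idiom keeps the first occurrence of each line without sorting. This is a minimal sketch; the function name dedupe_keep_order is hypothetical, and on Solaris nawk would be needed in place of awk.

```shell
# Hypothetical helper: drop repeated lines while keeping first-seen order.
# seen[$0]++ evaluates to 0 the first time a line appears, so !seen[$0]++
# is true only then, and awk's default action prints the line.
dedupe_keep_order() {
    awk '!seen[$0]++' "$@"
}

# Sample in the same shape as the uniq output above:
# the duplicate is not adjacent to its first occurrence.
printf '%s\n' tbf0_drug_ped_txt tbf0_drug_ped_txt_cd tbf0_drug_ped_txt |
    dedupe_keep_order
```

The sample prints tbf0_drug_ped_txt and tbf0_drug_ped_txt_cd, each once. The trade-off against sort -u is memory: the awk array holds every distinct line.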
 
Thank you all. They work great now.

have a good one

michael
 
Well, I don't really have an answer in ksh, but here's a Perl script that should work:

#BEGIN perl

@greps = `grep -in tbf0_ *`;
foreach(@greps) {
@line = split /\s+/;
foreach(@line) {
print "$_\n" if /tbf0_/i;
}
}

#END perl

You could execute that and pipe it to sort -u to discard duplicates, or call it from a script.

Sorry, I can't think of anything in shell script.
 
If tbf0_ is always lower case and space is the field delimiter, you can try the following.

Run this command:

prog.sh filename | sort -u

where prog.sh is the following:
#!/usr/bin/sh
for i in `cat "$1"`
do
echo "$i" | awk '$1 ~ /^tbf0_/ {print $1}'
done

and filename is your file.
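
The loop in prog.sh starts one awk process per word. The same idea can probably be written as a single pipeline: put each word on its own line, keep the words that start with tbf0_ in either case, then sort out the duplicates. The sketch below assumes names are delimited by spaces or tabs and uses only the standard tr, grep, and sort; the function name list_tbf0_words is hypothetical.

```shell
# Hypothetical helper: list each distinct word starting with tbf0_ (any case).
# Assumption: words are separated by spaces or tabs.
list_tbf0_words() {
    # $1 is the input file
    tr -s ' \t' '\n\n' < "$1" |   # one word per line, blank runs squeezed
        grep -i '^tbf0_' |        # keep words that start with tbf0_
        sort -u                   # sort and drop duplicates
}
```

Like prog.sh, this keeps whole words, so a word such as tbf0_medicaid_st_form. retains its trailing period, and a name glued to the grep prefix (file:82:tbf0_...) is not matched.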

 