extract a specific expression from a file

michael3 · Jul 19, 2002

Hi,

I am trying to extract every identifier starting with tbf0_ (for example, tbf0_thrd_pty_grp_exclsns) from a file like the following:

CETP_GExRpt:77:# tbf0_thrd_pty_grp_exclsns
CETP_GExRpt:100:FROM tbf0_thrd_pty_grp_exclsns
CETP_PatEdUpd:166: DELETE from tbf0_drug_ped_ptrn
CETP_PatEdUpd:183: UPDATE tbf0_drug_ped_ptrn
CETP_PatEdUpd:203: DELETE FROM tbf0_drug_ped_txt
CETP_PatEdUpd:220: UPDATE tbf0_drug_ped_txt
CETP_PatEdUpd:239: DELETE FROM tbf0_drug_ped_txt_cd
CETP_PatEdUpd:257: UPDATE tbf0_drug_ped_txt_cd
CETP_PatEdUpd:274:update tbf0_drug_ped_txt
CETP_PatEdUpd:295:update tbf0_drug_ped_txt
CETP_StFormUpd:82:#Add new states to tbf0_medicaid_st_form.
adr7.ctl:4:INTO TABLE TBF0_DRUG_COST
cardinal.ctl:4:INTO TABLE TBF0_DRUG_COST

From the above input file, I need
tbf0_thrd_pty_grp_exclsns,
tbf0_drug_ped_txt_cd,
tbf0_drug_ped_txt, etc
to appear in my output only once (no duplicates). Can anyone give me a hint? Thanks a lot.

Michael

lancer73 · Jul 19, 2002

grep " tbf0_" $file | sort -u

michael3 · Jul 19, 2002

Thanks, lancer73.
But that won't work. My input file is actually the output of: % grep -in tbf0_ *, so every line contains tbf0_xxxxxx somewhere in the line.

I want only the tbf0_xxxxx token, not the whole line;
and if the same tbf0_xxxxx appears more than once, I want to see it only once in my output. I don't care whether the results are sorted.
That is to say, from the input given above, I need this output:

tbf0_thrd_pty_grp_exclsns,
tbf0_drug_ped_ptrn
tbf0_drug_ped_txt_cd,
tbf0_drug_ped_txt, etc
TBF0_DRUG_COST

Thanks.
Michael

vgersh99 · Jul 19, 2002

Assuming " " is your field separator,

nawk -f getPattern.awk myFile.txt

#---------------- getPattern.awk
BEGIN {
pattern="tbf0_"
}

$0 ~ pattern {
for (i=1; i <= NF; i++)
if ( $i ~ pattern )
arr[$i]++
}

END {
for (i in arr)
print i;
}

#---------------- getPattern.awk vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+

vgersh99 · Jul 19, 2002

Oops, forgot about the case insensitivity:

#---------------- getPattern.awk
BEGIN {
pattern="tbf0_"
}

tolower($0) ~ pattern {
for (i=1; i <= NF; i++)
if ( tolower($i) ~ pattern )
arr[$i]++
}

END {
for (i in arr)
print i;
}

#---------------- getPattern.awk vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
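The scripts above print the whole whitespace-separated field, so a field such as "tbf0_medicaid_st_form." keeps its trailing period. As a rough sketch of an alternative, the function below pulls out only the identifier itself with match() and prints each one the first time it is seen. The function name extract_tbf0 is made up for illustration, and the sketch assumes an awk that provides match(), RSTART and RLENGTH (nawk or any POSIX awk; the old Solaris /usr/bin/awk does not). Case is handled with bracket expressions, so tolower() is not needed, but TBF0_X and tbf0_x are still treated as two different names.

```shell
# Hypothetical helper: print each distinct tbf0_ identifier found in the
# named files (or on standard input when no file is given).
extract_tbf0() {
    awk '
    {
        line = $0
        # Find the next identifier on the line; the bracket expressions
        # make the prefix match case-insensitive without tolower().
        while (match(line, /[tT][bB][fF]0_[A-Za-z0-9_]*/)) {
            name = substr(line, RSTART, RLENGTH)
            if (!(name in seen)) {      # print a name only the first time
                seen[name] = 1
                print name
            }
            # Continue searching after the identifier just found.
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

It could be run as: extract_tbf0 myFile.txt. Output comes out in first-seen order, so no sort step is required.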

CaKiwi · Jul 19, 2002

Maybe this will work if the field is always last on the line.

awk '{print $NF}' | sort -u

CaKiwi

olded · Jul 19, 2002

Hi:

Sounds to me like you didn't want anything ending with a period either:

nawk '
# Print each whitespace-separated field that starts with tbf0_ or TBF0_
# and does not end with a period.
{
    for (i = 1; i <= NF; i++)
        if (($i ~ /^tbf0_/ || $i ~ /^TBF0_/) && $i !~ /\.$/)
            print $i
}' d.file | uniq

Regards,

Ed

michael3 · Jul 19, 2002

Thank you all for the help, but I still can't get it to work.

==========================================================
I tried vlad's first solution (since I don't have tolower()), but I get a syntax error:

devced01:/home/capimprv1/dbl/scripts/extract>cat ext
#---------------- getPattern.awk
BEGIN {
pattern="tbf0_"
}
$0 ~ pattern {
for (i=1; i <= NF; i++)
if ( $i ~ pattern )
arr[$i]++
}
END {
for (i in arr)
print i;
}

devced01:/home/capimprv1/dbl/scripts/extract>awk -f ext tstfile
awk: syntax error near line 6
awk: bailing out near line 6
devced01:/home/capimprv1/dbl/scripts/extract>

========================================================
Thanks, CaKiwi. But tbf0_xxxx can be anywhere in the line.

Thanks, ALL.
Michael

michael3 · Jul 19, 2002

Thanks, olded. It almost works, but it seems that uniq didn't remove the duplicate. I'd prefer to see "tbf0_drug_ped_txt" only once.

michael

==================================
devced01:/home/capimprv1/dbl/scripts/extract>} ' tst |uniq <
tbf0_thrd_pty_grp_exclsns
tbf0_drug_ped_ptrn
tbf0_drug_ped_txt
tbf0_drug_ped_txt_cd
tbf0_drug_ped_txt
TBF0_DRUG_COST

CaKiwi · Jul 19, 2002

If you are on Solaris, use nawk instead of awk for vlad's solution.

CaKiwi

olded · Jul 19, 2002

Michael:

I'd trade the uniq command for sort -u:

nawk '
# Only look at lines that mention the prefix in either case.
/tbf0_/ || /TBF0_/ {
    for (i = 1; i <= NF; i++)
        if (($i ~ /^tbf0_/ || $i ~ /^TBF0_/) && $i !~ /\.$/)
            print $i
}' d.file | sort -u

Sorry about that!

Ed
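The reason for the swap is that uniq only collapses duplicate lines that sit next to each other, while sort -u sorts first and so removes duplicates wherever they occur. A small sketch of the difference, using a made-up sample() function that feeds three names from this thread with the repeated one deliberately not adjacent:

```shell
# Sample data: the two "tbf0_drug_ped_txt" lines are separated by another name.
sample() {
    printf '%s\n' tbf0_drug_ped_txt tbf0_drug_ped_txt_cd tbf0_drug_ped_txt
}

# uniq compares only neighbouring lines, so all three lines survive.
sample | uniq

# sort -u sorts the input first, so the repeated name is printed once.
sample | sort -u
```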

michael3 · Jul 19, 2002

Thank you all. They work great now.

have a good one

michael

lancer73 · Jul 19, 2002

Well, I don't really have an answer in ksh, but here's a Perl script that should work:

#BEGIN perl

@greps = `grep -in tbf0_ *`;
foreach(@greps) {
@line = split /\s+/;
foreach(@line) {
print "$_\n" if /tbf0_/i;  # /i also matches TBF0_, like grep -i
}
}

#END perl

You could run that and pipe the output to sort -u to discard duplicates, or call it from a script.

Sorry, I can't think of anything in shell script.

malko8 · Jul 20, 2002

If tbf0_ is always lower case and space is the field delimiter, you can try the following.

Run this command:

prog.sh filename | sort -u

where prog.sh is the following:
#!/usr/bin/sh
for i in `cat $1`
do
echo $i | awk '$1 ~ /^tbf0_/ {print $1}'
done

and filename is your file.
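The same word-by-word idea can also be sketched as a single pipeline, which avoids starting one awk process per word. The function name list_tbf0 is invented for this example. It assumes that words are separated by spaces or tabs, that the installed tr understands the \t escape, and that grep supports -i (the original poster's grep does, since the input came from grep -in).

```shell
# Hypothetical helper: list each distinct word in file $1 that starts
# with tbf0_ (in either case).
list_tbf0() {
    # 1. tr puts every word on its own line (runs of blanks are squeezed).
    # 2. grep keeps only the words that begin with the prefix, ignoring case.
    # 3. sort -u removes duplicates, adjacent or not.
    tr -s ' \t' '\n' < "$1" | grep -i '^tbf0_' | sort -u
}
```

It could be run as: list_tbf0 filename. Like the loop above, it keeps any punctuation attached to a word, so "tbf0_medicaid_st_form." would be listed with its period.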
