Search for words in 1 file that match a list in another file

Status
Not open for further replies.

dodge20

MIS
Jan 15, 2003
1,048
US
I have a list of cities in a text file called city.txt. Each city is on its own line, such as:

city1
city2
city3
city4

I have another file with city names scattered throughout it. This file is a press release, so it isn't in any specific format. I would like to search the press release for any cities that match the cities listed in city.txt, and output the matches in the format
'city1','city2','city3'
I would also like to avoid duplicates. Is this possible? I have something similar for MS Word and Excel, but would like to have this on Unix and don't know where to begin.

Thanks

It might help, so I will go ahead and post the Word macro.
Code:
Sub findcities()
    Dim a As New Collection, xlApp As Object, xlWkb As Object, str1
    Dim WorkBookName, SheetName, ColumnNumber
    WorkBookName = "C:\cities.xls"
    SheetName = "Sheet1"
    ColumnNumber = 1
    Set xlApp = CreateObject("Excel.Application")
    Set xlWkb = xlApp.Workbooks.Open(WorkBookName)
    ' Walk every used row of column A (&HFFFFEFBE is the value of xlUp)
    For i = 1 To xlWkb.sheets(SheetName).Range("A65536").End(&HFFFFEFBE).Row
        str1 = Trim(xlWkb.sheets(SheetName).Cells(i, ColumnNumber).Value)
        If ActiveDocument.Content.Find.Execute(FindText:=str1, MatchWholeWord:=True) Then
            ' The collection key rejects duplicates; the resulting error is ignored
            On Error Resume Next: a.Add str1, CStr(str1): On Error GoTo 0
        End If
    Next i
    xlWkb.Close 0: xlApp.Quit: str1 = ""
    ' Write the matches as 'city1','city2',... and open the result in Notepad
    Open "C:\Cities.txt" For Output As #1
    For i = 1 To a.Count
        str1 = str1 & "'" & a.Item(i) & "',"
    Next i
    Print #1, Left(str1, Len(str1) - 1)
    Close #1
    Shell "notepad.exe c:\cities.txt", vbMaximizedFocus
End Sub



Dodge20
 
A starting point.
Create a file named cities.awk with the following contents:
Code:
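# First file (the city list): remember the first word of each line
NR==FNR{
  a[$1];next
}
# Second file: count every whitespace-separated word that is in the list
{ for(i=1;i<=NF;++i)
    if($i in a)
      ++b[$i]
}
# Print each matched word once, quoted and comma-separated
END{
  for(c in b)
    printf "%s'%s'",(j++ ? "," : ""),c
  printf "\n"
}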
And now to get the results:
awk -f /path/to/cities.awk /path/to/city.txt /path/to/input

Hope This Helps, PH.
 
That looks great, except for cities that have spaces in their names. Is there a whole-word match option on Unix like the one in VB? Also, probably not a big deal, but cities may occasionally be listed in all caps, so case sensitivity might be a problem.

Dodge20
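
On the whole-word and case questions: grep has -w (whole words) and -i (ignore case), so one possible sketch is below. It assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX), and the function name find_cities_grep is made up for illustration. Matches are upper-cased so that "Sibley" and "SIBLEY" collapse into one entry, which means the output no longer keeps the spelling used in city.txt.

```shell
#!/bin/sh
# find_cities_grep CITY_FILE PRESS_FILE   (hypothetical helper name)
# Prints the cities found in PRESS_FILE as 'A','B',...
#   grep -F  treat each line of CITY_FILE as a fixed string, not a regex
#   grep -w  match whole words only
#   grep -i  ignore case
#   grep -o  print only the matched text, one match per line (GNU/BSD)
#   tr       upper-case every match so case variants become identical
#   sort -u  sort and drop duplicates
#   sed      wrap each name in single quotes
#   paste    join the lines with commas
find_cities_grep() {
    grep -o -i -w -F -f "$1" "$2" |
        tr '[:lower:]' '[:upper:]' |
        sort -u |
        sed "s/.*/'&'/" |
        paste -s -d, -
}
```

It would be called as: find_cities_grep city.txt press.txt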
 
Another solution with awk
Code:
#!/usr/bin/awk -f
#Awk Program : city.awk

NR == FNR {
   city[++cities_count] = $0;
   pattern[cities_count] = "(^|[^a-z-])" tolower($0) "([^a-z-]|$)";
   count[cities_count] = 0;
   next;
}
{
   $0 = tolower($0);
   for (c=1; c<=cities_count; c++)
      if ($0 ~ pattern[c])
         ++count[c];
}
END {
   for (c=1; c<=cities_count; c++)
      if (count[c])
         list = list (list ? ",'" : "'") city[c] "'";
   print list;
}
Output:
Code:
$ cat city.txt
paris
bordeaux
biscarrosse plage
BISCA
lyon
nice
$ cat press.txt
Marc va à Bisca aujourd'hui car la plage de Biscarrosse Plage est la plus belle de toute la côte.
Il est parisien mais habite Bordeaux en ce moment.
$ awk -f city.awk city.txt press.txt
'bordeaux','biscarrosse plage','BISCA'
$

Jean-Pierre.
 
That seems to print the cities a bunch of times. I think it is very close though.

Dodge20
 
This is the output for aigles's code:

Code:
'ACKLEY','ALDEN','IOWA FALLS','IOWA CITY','IOWA CITY','IOWA CITY','IOWA CITY','IOWA CITY','IOWA CITY','SIBLEY','ANKENY','ANKENY','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES','DES MOINES'

And the output for PHV's code:
Code:
'DES','IOWA','SIBLEY','ANKENY','ALDEN'

Dodge20
 
Here is the test press release I am using.

This is a test of hometowners. SIBLEY is on city
as well as IOWA FALLS, and ALDEN and ACKLEY, but
IOWA FALLS and SIBLEY shouldn't show up twice.

I hope DES MOINES and ANKENY Show up Also. IOWA CITY is a test for a city with to words in the name.




Dodge20
 
Provided the city.txt file contains no duplicates, aigles's code works fine for me.
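
If cleaning up city.txt by hand is not practical, the duplicates could instead be skipped while the list is read. The following is only a sketch of that idea, a variation on aigles's script wrapped in a shell function. The names find_cities_awk, seen, found and q are made up for illustration. It treats names that differ only in case as the same city, and it prints the spelling from city.txt in list order.

```shell
#!/bin/sh
# find_cities_awk CITY_FILE PRESS_FILE   (hypothetical helper name)
# Prints the cities found in PRESS_FILE as 'A','B',... with each city
# listed once, even when CITY_FILE repeats a name.
find_cities_awk() {
    awk -v q="'" '
    # First file: the city list. Skip blank lines and repeated names.
    NR == FNR {
        key = tolower($0)
        if (key == "" || (key in seen))
            next
        seen[key] = 1
        city[++n] = $0
        # Same boundary test as the original script: the name must not
        # be glued to another letter or a hyphen on either side.
        pattern[n] = "(^|[^a-z-])" key "([^a-z-]|$)"
        next
    }
    # Second file: the press release, compared in lower case.
    {
        line = tolower($0)
        for (c = 1; c <= n; c++)
            if (line ~ pattern[c])
                found[c] = 1
    }
    # Print every city that was seen at least once.
    END {
        for (c = 1; c <= n; c++)
            if (found[c])
                list = list (list ? "," : "") q city[c] q
        print list
    }' "$1" "$2"
}
```

It would be called as: find_cities_awk city.txt press.txt. City names are still used as regular expressions here, so a name containing characters such as "." is matched loosely.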
 
You are correct. I didn't take into consideration that different states can have the same city names. I eliminated the duplicates and it works great!

Thanks both of you!

Dodge20
 