Searching a database of patterns

Daedelus · Sep 27, 2002

We maintain a database of hardware weights consisting of a text file with lines following the format
<part number> <part description> <weight> <other info ...>

for example:
27489-001 switchlight 0.05 ...

The problem is that our database is massive (over 156,000 entries so far) and not versatile. For example, the -001 in the entry above specifies a pin configuration and what text is written on the switch. There are presently ~150 different configurations, none of which affect the weight of the part. But we have to have a separate entry for all of them, and new ones are added regularly.

I would like to be able to include wirecards in the database:

27489-* switchlight 0.05 ...

and have it match any specific number searched for. Unfortunately "grep" and all other search utilities I have seen allow wildcards in the search item, but not in the database being searched. I can't be the first person to need something like this. Does anyone know of a search engine that allows wildcards in the database? It would need to be fast. So far the only scheme I have come up with would be prohibitively slow.

Anyone know of a solution?

stanhubble · Sep 27, 2002

First opinion --> use a real database

Secondly, you can use a character or pattern as a wild card just not the same character as the os.

jad · Sep 27, 2002

grep '^27489-\*' <filename>

will find everything that matches 27489-* at the beginning of each line the \ makes the * a literal *, not a regular expression

however stanhubbles first suggestion is correct ... get a database, if at all possible ... and if possible split up the part number into 2 fields ... you can always concatenate them together afterwards.

jamisar · Sep 27, 2002

mask the metachar!

grep '\*' file print all entry marked with '*'
the same
sed -ne '/\*/p' file ------------ jamisar
fox camantis in deserto.

Daedelus · Sep 27, 2002

PLEASE! I KNOW HOW TO USE WILDCARDS!!!!
That is NOT what I am asking!

Let me see if I can explain it to you this way:

One example: I have a program that goes through a datafile which includes amongst its data hardware items and their weights. The program's purpose is to check the hardware weight to make sure it is accurate, or to provide it if the weight is not already in the datafile. It does this by searching the hardware database for the correct weight. So, if it comes across in the datafile the hardware item

27489-164

it needs to search the hardware database for 27489-164 and find the weight. Currently it can only do this if there is an entry

27489-164 switchlight 0.05 ...

What I need is a search capability that will match 27489-164 as the search string against a database entry of

27489-* switchlight 0.05 ...

grep will not do this. sed will not do this. awk will not do this. look will not do this.

Do not suggest replacing the 164 with a star (or other wildcard - it does not matter what wildcard scheme is used, I only used * as an example because most people are familiar with it!). 27489- is only one example too. There are a vast variety of hardware naming schemes in use, and the program has no way of knowing what part of the hardware part number to replace with a wildcard. It can only search using the exact string. A "break it up into two words" scheme for the database base does not work either. Here are some other part numbers it needs to handle:

MS27484T10F35S (weight depends on the T,10, and S, but not F or 35)
S3106-3 (needs exact match)
MS21042-3N (weight depends on -3, but not N)

As you can see there is no pattern as to what to search for. This is why I need the wildcards in the database!

I need a database/search engine that will do this, and can be launched invisibly by a UNIX script, and give its output as a UNIX datastream. (When I asked for suggestions on what to use, what is the point of telling me to use a "real database"? That is what I am asking for! If you know of one that will work, WHAT IS it? If not, then why are you making the comment????)

Now, does anyone have any USEFUL suggestions?

jad · Sep 27, 2002

i get it now ... you want to match a non-regexp with a stored as is possible regexp ...

Code:

#!/bin/sh
#
# usage: <this script> <datafile> <search string>
cat $1 | awk '{ instr = length(&quot;'$2'&quot;) -1;
if(index($0,&quot;*&quot;) < instr && index($0,&quot;*&quot;) > 0) {     instr = index($0,&quot;*&quot;);     if (substr($0, 0, instr) == substr(&quot;'$2'&quot;, 0, instr)) print $0; } else {     if (substr($0, 0, instr) == &quot;'$2'&quot;)         print $0; }'
# and cross your fingers.

jad · Sep 27, 2002

by the way i've got '*' as the wildcard ... i guess you could choose any string, or combination ...

my script above is simplified, you'll have to do a little extra work to do better regexps.

if you were using oracle ... you could use the '%' wildcard and compare with 'like' in a select which you could use on either side to compare regexps. i'm sure you should be able to do this on most others tho.

Daedelus · Sep 27, 2002

Yeah, thats what I want (never was very good at techspeak).

Your approach will work (I think), but it is also too slow for such a large database for the uses I would like to put it too.

The approach I mentioned as also being to too slow is a variant:

awk '{if (match(part, $1)) print}' part=$2 $1

which would allow me the full range of awk regular expressions, but takes forever on a 150,000+ line file.

I can't believe that I am the only one to need something like this. That is why I was hoping someone knows of a database/search engine that would do it. Thanks

jamisar · Sep 27, 2002

based on your example, can something like this help?

in a file called 4sed
---------------
/^.*T[0-9]*[0-9].*S[ ]*[ ].*/{
p
n
}
/^S[0-9]*[0-9]-[0-9][ ]*[ ].*/{
p
n
}
/.*-[0-9][^N][ ]*[ ].*/{
p
n
}
-----------------
sed -nf 4sed inputfile

------------ jamisar
vox clamantis in deserto.

stanhubble · Sep 27, 2002

Sorry for sounding snippet in suggesting a database rather than a 150,000 line textfile. There are many out there (depends on os/platform/$/expertise etc). The options of a database product would allow for multiple indexes and aliases and pattern matching in a program.

Another option would be agrep, A Unix "grep" program with approximate matching capabilities.

http://www.columbia.edu/acis/documentation/ghttp/agrep.html

http://webglimpse.org/

Daedelus · Sep 27, 2002

Stan, sorry about being overly sensitive myself. There are many out there. Unfortunately there are so many, and the particular requirements I have are difficult to locate in the documentation, so I was hoping someone would already know a database with the appropriate capabilities. To summarize, I need a searchable database structure which meets the following:
1) the keys that are searched through must have some sort of wild card capability. This is important: the wildcards occur in the database NOT the search string.
2) it must be able to work invisibly when called from a Unix script, and must provide its output as a Unix datastream.
3) it needs to be fast, preferably averaging less than a second per search on 100,000+ entries.

Does anyone know of such a beast? (Do you even know of one that does the 1st?)

Jamisar: Thats fine if those were the only naming schemes out there, but we get parts from hundreds of companies, each with their own ideas about how to name things (actually most of these part numbers come from military standards, but there are hundreds (or thousands) of those too). I can't come up with patterns for searching all of them, or if I did, I would need a database of these patterns to search for the one I need, so that I can use it to search my hardware database, which just brings me back to needing a search capability with patterns in the database rather than in the search string. But thanks for the interest.

Daedelus · Sep 27, 2002

I forgot to mention: we currently run on RS6000s running AIX 4.3. But the plan is to phase these out within the next year, replacing them with IBM Intellistations running Windows, using ReflectionX for our UNIX applications (which is most of what we do).

MikeLacey · Sep 30, 2002

Daedelus,

I suggest you have a look at MySQL, a free product known for three things.

1 - It's fast
2 - It's simple
3 - It's robust

Would seem to meet your requirements. Mike
________________________________________________________________

"Experience is the comb that Nature gives us, after we are bald."

Is that a haiku?
I never could get the hang
of writing those things.

jamisar · Sep 30, 2002

based on jad's exemple, assuming it's what you want, i propose you this ( feel free to adapt it ) c-code,
on my solaris sys, need 1 sec to scan 160000 entries:
to compile: gcc -o progname srcname.c
------------------------
#include <stdio.h>

#define JOKER "joker"
#define MaxBuff 256

int main(int argc, char **argv)
{
extern char *basename();
int norm_cmp();
char buff[MaxBuff+1];
char save[MaxBuff+1];
char *myname;
int found;
FILE *inp;

myname = basename(*argv++);

if(--argc != 2)
exit(printf("USAGE: %s inputfile search-string\n",myname));

if(!(inp = fopen(*argv++,"r&quot

))
exit(printf("%s, cannot open inputfile\n",myname));

while(fgets(buff,MaxBuff,inp)){

/*
attention: could be a substring !!
check character buff[found-1] if found >0
check character buff[found+strlen(*argv)]
norm_cmp("aaa","baab&quot

returns -1
norm_cmp("aaa","aaab&quot

returns 0
norm_cmp("aaa","baaa&quot

returns 1
*/
/* look for *argv in buff */
if((found = norm_cmp(*argv,buff)) >= 0)
printf("%s found at pos %d in %s",*argv,found,buff);

/* look for jokerstring in buff */
if(1 >(found = norm_cmp(JOKER,buff))) continue;

/* isole initial part of buff in save, cut joker.* out */
strcpy(save,buff); save[found] = 0;

/* look for initial part of buff in *argv */
if((found = norm_cmp(save,*argv)) >= 0)
printf("%s found at pos %d in %s",save,found,buff);
}
fclose(inp);
exit(0);
}

int norm_cmp(char *uno, char *due)
{
int len, max, pos, tmp;

if(!*uno || !*due) return(-1);

len = strlen(uno)-1; max = strlen(due)-1;

if(len >max) return(-1);

for(pos = tmp = 0;

{
/* goto first occurance of char '*uno' in 'due' */
while(*(due+pos) && *uno != *(due+pos)) ++pos;

/* not found, the string 'uno' is longer as 'due+pos' */
if(pos+len >max) break;

/* starting at 'due+pos' check backward for string 'uno' */
for(tmp = len; tmp >0; --tmp) if(*(uno+tmp) != *(due+pos+tmp)) break;

/* the string 'uno' was entirely found at 'due+pos' */
if(!tmp) return(pos);

/* restart to the last char found skipp sure not matching chars */
pos += 1+len-tmp;
}
/* not found */
return(-1);
} /* end norm_cmp */
vox clamantis in deserto.

Daedelus · Sep 30, 2002

Thanks for the suggestion Mike, and thanks for the program Jamisar! I will look into MySQL for the long run, since I agree with everyone that we need to move to a true database. But convincing the heirarchy, or my more stodgy coworkers of this will take time - probably a LOT of time! Until then, I will see if I can adapt the program to our particular needs. (Or more accurately to what I personally think needs to be done!) The biggest problem there is going to be convincing the IT people to let me compile it. (All the compilers have been restricted to them.) While I have studied C on my own, I've never had the chance to use it. So my skills in that area are not very good. I doubt I could have written it myself.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Searching a database of patterns

Daedelus

Technical User

stanhubble

MIS

jad

Programmer

jamisar

Programmer

Daedelus

Technical User

jad

Programmer

jad

Programmer

Daedelus

Technical User

jamisar

Programmer

stanhubble

MIS

Daedelus

Technical User

Daedelus

Technical User

MikeLacey

MIS

jamisar

Programmer

Daedelus

Technical User

Similar threads

Part and Inventory Search

Sponsor