Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations derfloh on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

VFP like GOOGLE…

Status
Not open for further replies.

Nro

Programmer
May 15, 2001
337
CA
Hello.

The other day, I was using a very popular search engine and like always, I white a couple of words into the text box and instantly received a couple of hundred answers.

Then I continue working on one of my project. It’s for a chemical company, and one of the requirements is to be able to search parts of a chemical name inside a 100,000 bank of name. They never search the entire word exactly, but a word of two. For example, if they white ACID in the textbox, the result is

Chromic Acid Flakes
Chromic Acid Flakes
Fluoboric Acid
Sulfuric Acid 1.800 Tech

The quick and dirty line of code to do that is the same I use for decades:

Code:
SELECT pro_name FROM pro_pro1 WHERE "acid" $ LOWER(pro_name) INTO TABLE ttRes

The result came from an indexed table on a network, and took about 10 to 20 seconds to finish. My question is: is it possible to achieve in VFP something like popular search engines?
My poor result is based on 100,000 records, but these engines search over gazillions records. I can’t use any soundex() or algorithm that transform the whole field. It has to search for one particular word in a string.

Thanks in advance

Nro
 
Hello
Wow, I’m out of the office for a day or two, and fox guru’s are always here to help. Who said VFP is dead…
Now, the first premise of this, start with a question from one of my customer
Customer : Why, when I search in google, I end up with tons of results, base on one or two word, and in your system, if I try to search for the same words, It take a while to get a result.
I told him that if I had found the answer, I’d be rich by now.

What I’m looking for is to be able to search a specific word (as “ACID”) in a name, but not only that: a series of letters in a word. Very fast.

Again, my applications already work, but the speed is not there.

I already heard about PhdBase, and I received good comments, even if I never tried it myself.

As for the speed of the network, it’s not an issue. On network or local BD, what I’m looking for is the Rusmore optimisation. I think I understand how to use Rusmore, so when it’s full, the speed is at top, even if the network is the bottle neck (it’s always easy to blame the net).

Now, to make things clear, here what I propose to my users. When the open the search box, they have 3 choices:

1 – Perfect match. Example: If the user ask for the perfect match for a product number, this code is fully optimized (no indexes set, but one index on Upper(Pro_Code) and DELETED(), set exact is off)
Code:
 SELECT * FROM Pro_Pro1 WHERE UPPER(Pro_Code) = "8016OLD" INTO CURSOR crsProd

2 – Beginning of the field. Example: If the user ask for the first letters or a word in product name, this code is fully optimized (no indexes set, but one index on Upper(Pro_Name) and DELETED())
Code:
 SELECT * FROM Pro_Pro1 WHERE UPPER(Pro_Name) = "ACID" INTO CURSOR crsProd
Rushmore optimization is full, and the result is
Code:
Acid Cleaner 3-A
Acid Cleaner 2-A
Acid Cleaner 4-A
Acid Caps
Acid Copper Solution A-15
Acid Copper Solution A-16

3 –Now, the third option, any part of field. Example: If the user ask for some letters or a word in product name, this code is partly optimized (no indexes set, but one index on Upper(Pro_Name) and DELETED())
Code:
 SELECT * FROM Pro_Pro1 WHERE "SOLUT" $ UPPER(Pro_Name) INTO CURSOR crsProd
Rushmore optimization is partial, and the result is
Code:
Calcium Chloride Solution FCC                               
Copper Sulphate Solution
Ferric Chloride 42 BE Solution                              
Nickel Bromide Solution                                     
XXX - NICKEL PLATING SOLUTION

The third solution is often the one the user takes because they want more results than less.

Olaf, your suggestion looks pretty good, but as you said, the word table will tend to be huge, because it’s not only the chemical word in this case, but the French, Spanish and Italian description as well.

I have no problem with overnight procedure that creates references tables (à la data warehouse) because text searching it’s a very common problem try to improve in my applications.

Thanks again for your support.

Nro
 
Nro,

A couple of points ....

First, I don't see how option 3 can be even partially optimised. If "SOLUT" can be anywhere in Pro_Name, it doesn't matter what indexes you have. VFP has to look at every record, come what may.

But a more important point is that you say you want to search for "a series of letters in a word" -- as opposed to distinct words. In that case, you can't use a solution that involves indexing individual words (either PHDBase or some home-grown replacement for it).

Personally, I would do more or less what you are doing, but make it simpler. Give them a single textbox and a checkbox. In the textbox, let them enter whatever they like.

The checkbox enables a "substring search" (but think of a more user-friendly name for it).

When the checkbox is .F. (the default), do a straight "begins with" search. It doesn't matter if the search term matches the entire field or just the first few characters. It's the same code, and the same user interface (provided you have SET ANSI and SET EXACT set properly.) This is option 1 and option 2 combined.

If the checkbox is .T., use your $ or LIKE operator to do a substring search. Put a little warning on the form to say that this will be much slower.

What I'm saying is that you have to make it as simple as possible for the user. Don't give 'em too many options.

Hope this helps.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro tips, advice, training, consultancy
Custom software for your business
 
Salut

Mike, the rushmore optimisation showplan (sys(3054)) will show partial, because it use the deleted() tag, but in fact, you’re right, there is no optimization at all (because I keep a small amount of deleted record in my tables by packing them once a month).

As for your suggestion, it’s exactly what I propose to the user: a textbox, with an option box with 3 options, “Exact search”, “Beginning of field” and “Any part of field”.

I know that I can’t use an index engine to search some letters in a field, but for some word in a description, I think it’s possible. Again, it’s for improving search.

Thanks for your help.

Nro
 
Nro, yes, it's gonna be huge, but I also said it's gonna grow fast at first, and then will grow slower and slower. Even if you break it down to syllables, you'll be surprised how few you will get, the tabResults table will start growing more, as it will relate more and more result records to already known words/syllables.

Plus you can limit the size of the data, as a word or syllable connected to almost all data is worthless to filter relevant data.

Bye, Olaf.
 
Go to and download the plain text version of "War and Peace", then use this to make a fulltext index based on syllables instead of words:

Code:
#Define ccBaseDir 'C:\my\fulltextindex\'

t0 = Seconds()
lcWarAndPeace = Filetostr("C:\my\fulltextindex\2600.txt")
* double line feeds (CRLF) end a pragraph, sorten each end-of-paragraph to CRCR
lcWarAndPeace = Strtran(lcWarAndPeace ,Chr(13)+Chr(10)+Chr(13)+Chr(10),Chr(13)+Chr(13))
* remaining single CRLFs are not wanted, we want running text - no CRLFs, only to end paragraphs
lcWarAndPeace = Strtran(lcWarAndPeace ,Chr(13)+Chr(10),'')
* NOw change CRCR to single CR, this is used with ALINES as parsechar
lcWarAndPeace = Strtran(lcWarAndPeace ,Chr(13)+Chr(13),Chr(13))
? 'changing text to running text', Seconds()-t0

* create an array of paragraphs, finally a table tabWarAndPeace.dbf
t0=Seconds()
lnParagraphs = Alines(laParagraphs,lcWarAndPeace ,1+2+4,Chr(13))
Close Databases All
Close Tables All

Create Cursor curWarAndPeace (iid I Autoinc, mParagraph M)
For Each lcParagraph In laParagraphs FoxObject
   Insert Into curWarAndPeace (mParagraph) Values (lcParagraph)
Endfor
? 'inserting paragraphs into a cursor', Seconds()-t0

t0=Seconds()
Create Database (ccBaseDir+"WarAndPeaceIndex.dbc")
Copy To (ccBaseDir+"tabWarAndPeace.dbf") Database WarAndPeaceIndex
? 'write cursor to table', Seconds()-t0

t0=Seconds()
Create Table (ccBaseDir+"tabResults") (SyllableID I, ParagraphID I)
Index On SyllableID  Tag xfSyllabID
Index On ParagraphID Tag xfParagrID Additive
Index On BinToC(SyllableID)+BinToC(ParagraphID) Tag xKeyPair Collate "MACHINE" Additive

Create Table (ccBaseDir+"tabSyllables") (SyllableID I Autoinc, Primary Key SyllableID Tag xpSyllabID, cSyllable C(20))
Index On Lower(cSyllable) Tag xSyllable Additive

* minimum syllable length (later also minimum search term length
#Define cnMinLen 3
* maximum syllable length
#Define cnMaxLen 20

Local lnStart, lnLen, lcSyllable

Select 0
Use tabWarAndPeace In 0
Scan
   For lnStart = 1 To Len(mParagraph)-cnMinLen+1
      For lnLen = cnMinLen To cnMaxLen
         lcSyllable = Lower(Alltrim(Substr(mParagraph,lnStart,lnLen)))
         If Len(lcSyllable)<cnMinLen Or ' ' $ lcSyllable
            Loop
         Endif
         If Not Seek(lcSyllable,"tabSyllables","xSyllable")
            Insert Into tabSyllables (cSyllable) Values (lcSyllable)
         Endif
         If Not Seek(BinToC(tabSyllables.SyllableID)+BinToC(tabWarAndPeace.iid),"tabResults","xKeyPair")
            Insert Into tabResults Values (tabSyllables.SyllableID, tabWarAndPeace.iid)
         Endif
      Endfor
   Endfor
Endscan

* Flush to disc:
FLUSH
Close Tables All
? 'create index on syllables', Seconds()-t0

t0=Seconds()
Select tabWarAndPeace.mParagraph; 
  From tabSyllables;
  inner Join tabResults On tabResults.SyllableID = tabSyllables.SyllableID;
  inner Join tabWarAndPeace On tabWarAndPeace.iID = tabResults.ParagraphID;  
  Where Lower(tabSyllables.cSyllable) = "fox";
  into Cursor curFoxInWarAndPeace nofilter
? 'find "fox" with syllables index', Seconds()-t0

t0=Seconds()
Select tabWarAndPeace.mParagraph; 
  From  tabWarAndPeace;
  Where "fox" $ Lower(mParagraph);
  into Cursor curFoxInWarAndPeace2 nofilter
? 'find "fox" without syllables index', Seconds()-t0

Heres some statistics:

Original plain text: 3212 KB
Read into Table of Paragraphs: 3647 KB
Syllable table: ~33 MB
Results Table (keypairs connecting paragraphs and syllables): 133 MB

The search result with using $ takes about 2 to 4 times as long as searching for a syllable.

There are 16 paragraphs in War And Peace containing "fox":


...The weather was already growing wintry and morning frosts congealedan earth saturated by autumn rains. The verdure had thickened and itsbright green stood out sharply against the brownish strips of winter ryetrodden down by the cattle, and against the pale-yellow stubble of thespring buckwheat. The wooded ravines and the copses, which at the end ofAugust had still been green islands amid black fields and stubble, hadbecome golden and bright-red islands amid the green winter rye. Thehares had already half changed their summer coats, the fox cubs werebeginning to scatter, and the young wolves were bigger than dogs. It wasthe best time of the year for the chase. The hounds of that ardent youngsportsman Rostov had not merely reached hard winter condition, but wereso jaded that at a meeting of the huntsmen it was decided to give thema three days' rest and then, on the sixteenth of September, to go ona distant expedition, starting from the oak grove where there was anundisturbed litter of wolf cubs...

..."A perfect picture! How he chased a fox out of the rank grass by theZavarzinsk thicket the other day! Leaped a fearful place; what a sightwhen they rushed from the covert... the horse worth a thousand rublesand the rider beyond all price! Yes, one would have to search far tofind another as smart."...

...Facing him lay a field of winter rye, there his own huntsman stood alonein a hollow behind a hazel bush. The hounds had scarcely been loosedbefore Nicholas heard one he knew, Voltorn, giving tongue at intervals;other hounds joined in, now pausing and now again giving tongue. Amoment later he heard a cry from the wooded ravine that a fox had beenfound, and the whole pack, joining together, rushed along the ravinetoward the ryefield and away from Nicholas...

...He saw the whips in their red caps galloping along the edge of theravine, he even saw the hounds, and was expecting a fox to show itselfat any moment on the ryefield opposite...

...The huntsman standing in the hollow moved and loosed his borzois, andNicholas saw a queer, short-legged red fox with a fine brush going hardacross the field. The borzois bore down on it.... Now they drew closeto the fox which began to dodge between the field in sharper and sharpercurves, trailing its brush, when suddenly a strange white borzoi dashedin followed by a black one, and everything was in confusion; the borzoisformed a star-shaped figure, scarcely swaying their bodies and withtails turned away from the center of the group. Two huntsmen galloped upto the dogs; one in a red cap, the other, a stranger, in a green coat...

...The huntsmen got the fox, but stayed there a long time without strappingit to the saddle. Their horses, bridled and with high saddles, stoodnear them and there too the dogs were lying. The huntsmen waved theirarms and did something to the fox. Then from that spot came the sound ofa horn, with the signal agreed on in case of a fight...

...Nicholas dismounted, and with Natasha and Petya, who had ridden up,stopped near the hounds, waiting to see how the matter would end. Out ofthe bushes came the huntsman who had been fighting and rode towardhis young master, with the fox tied to his crupper. While still at adistance he took off his cap and tried to speak respectfully, but he waspale and breathless and his face was angry. One of his eyes was black,but he probably was not even aware of it...

..."A likely thing, killing a fox our dogs had hunted! And it was my graybitch that caught it! Go to law, indeed!... He snatches at the fox! Igave him one with the fox. Here it is on my saddle! Do you want a tasteof this?..." said the huntsman, pointing to his dagger and probablyimagining himself still speaking to his foe...

..The facts were that Ilagin, with whom the Rostovs had a quarrel and wereat law, hunted over places that belonged by custom to the Rostovs, andhad now, as if purposely, sent his men to the very woods the Rostovswere hunting and let his man snatch a fox their dogs had chased...

...Instead of an enemy, Nicholas found in Ilagin a stately and courteousgentleman who was particularly anxious to make the young count'sacquaintance. Having ridden up to Nicholas, Ilagin raised his beavercap and said he much regretted what had occurred and would have the manpunished who had allowed himself to seize a fox hunted by someone else'sborzois. He hoped to become better acquainted with the count and invitedhim to draw his covert...

...The serfs all dispersed. "Uncle" lifted Natasha off her horse and takingher hand led her up the rickety wooden steps of the porch. The house,with its bare, unplastered log walls, was not overclean--it did not seemthat those living in it aimed at keeping it spotless--but neither wasit noticeably neglected. In the entry there was a smell of fresh apples,and wolf and fox skins hung about...

...The valet brought a woman's fox-lined cloak...

..."Now then, you foxes!" said another, laughing at some militiamen who,stooping low, entered the battery to carry away the wounded man...

..."Where is he?" he inquired. And as he spoke he saw a young man cominground the corner of the house between two dragoons. He had a long thinneck, and his head, that had been half shaved, was again covered byshort hair. This young man was dressed in a threadbare blue cloth coatlined with fox fur, that had once been smart, and dirty hempen convicttrousers, over which were pulled his thin, dirty, trodden-down boots.On his thin, weak legs were heavy chains which hampered his irresolutemovements...

...Half an hour later he was driving with his fast horses across theSokolniki field, no longer thinking of what had occurred but consideringwhat was to come. He was driving to the Yauza bridge where he had heardthat Kutuzov was. Count Rostopchin was mentally preparing the angry andstinging reproaches he meant to address to Kutuzov for his deception. Hewould make that foxy old courtier feel that the responsibility for allthe calamities that would follow the abandonment of the city and theruin of Russia (as Rostopchin regarded it) would fall upon his dotingold head. Planning beforehand what he would say to Kutuzov, Rostopchinturned angrily in his caleche and gazed sternly from side to side...

...Pierre rushed to the wing, but the heat was so great that heinvoluntarily passed round in a curve and came upon the large housethat was as yet burning only at one end, just below the roof, and aroundwhich swarmed a crowd of Frenchmen. At first Pierre did not realizewhat these men, who were dragging something out, were about; but seeingbefore him a Frenchman hitting a peasant with a blunt saber and tryingto take from him a fox-fur coat, he vaguely understood that looting wasgoing on there, but he had no time to dwell on that idea...

Bye, Olaf.
 
Nro,

As for your suggestion, it's exactly what I propose to the user: a textbox, with an option box with 3 options, "Exact search", "Beginning of field" and "Any part of field".

Glad to see we're thinking along the same lines.

However, you only need to give them two options: "Exact search" and "Beginning of field" are really the same thing. Your code will be the same, and -- more importantly -- the user's perception of the search will be the same. So their only decision is whether or not to do "any part of field".

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro tips, advice, training, consultancy
Custom software for your business
 
Olaf,

That's an impressive piece of work (I mean your code is impressive, not War and Peace). But it's no less than we've come to expect from you.

However, I'm puzzled by your definition of syllable. If I've followed it right, you are taking a syllable to mean any substring between 3 and 20 characters in length. Is that what you intended?

That aside, all you've got to do now is to find a way of supporting inflected forms, such as "fox" and "foxes".

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro tips, advice, training, consultancy
Custom software for your business
 
Hello Mike,

thanks, and yes, your observing this right. As the intention is to replace the $ operator with this fulltext index, it's not the traditional meaning of syllables. I index on every substring, but I am limiting it at least to minimum 3 and maximum 20 length.

To tweak performance, you might modify the minimum to 4 and the maximum to 10 or so. Think about how short or long search 'terms' users want to use. That's not necessarily syllables in the language definition though. You could search for 'ox' to find oxygens or oxi-something.

There should be ways of tweaking a little more out of this. At least creating the index does only take aout 5 minutes or so. Time well spent, if it accelrates many searches of many users.

Bye, Olaf.
 
you could also throuw out any 'syllable' with some other char in it. For now I exclude syllables with a space in them, eg stringparts starting in one word and ending in the next one. You could also throw out anything else but letters.

And there's a little error in thowing out the line feeds not ending a pragraph:

* remaining single CRLFs are not wanted, we want running text - no CRLFs, only to end paragraphs
lcWarAndPeace = Strtran(lcWarAndPeace ,Chr(13)+Chr(10),'')

must be changed to

* remaining single CRLFs are not wanted, we want running text - no CRLFs, only to end paragraphs
lcWarAndPeace = Strtran(lcWarAndPeace ,Chr(13)+Chr(10),' ')

to replace these line feeds with a space, othewise two words may be merged to one. You can spot that in the cited paragraphs sometimes.

Bye, Olaf.
 
Hi.
When the user select “Exact search”, it will find the exact code. I pad the code with spaces()
Code:
lcLot = UPPER(PADR("3873",FSIZE("Lt_Number")))
SELECT * FROM Q_Lot WHERE UPPER(Lt_Number) = lcLot INTO CURSOR crsLot

If 3873 exist, the search box disappears and he can continue to work. If he choose “Beginning of field”, and search for the string (ex. Lot number), it will give him all the match for the lot beginning with 3873
Code:
lcLot = UPPER("3873")
SELECT * FROM Q_Lot WHERE UPPER(Lt_Number) = lcLot INTO CURSOR crsLot

The result will be
Code:
3873
3873A
3873B

But not

Code:
123873A
A3873
B3873A

If the result is greater than 1 line, I give them a grid where they choose the right lot number. Depending of the nature of the field, I set the option box to 1 if it’s a code (invoice / order number), 2 if the result is usually more than 1 result (lot number, phone number) and 3 if it’s a description (chemical name…).

Olaf, it’s great work, as usual. I’ll have to test it, and like I said, it possible to create the indexed table overnight. My only fear is that the table will be huge. Remember 2go limits.
Thanks
Nro
 
Nro,

When the user select "Exact search", it will find the exact code. I pad the code with spaces()

If you SET ANSI ON, you don't need to do that. Both types of search will behave the same.

However, I see what you mean about searching for 3873 and 3873A. What I had in mind was perhaps more suitable for things like names and cities.

it possible to create the indexed table overnight.

Does that mean the table is fairly static? That it doesn't get updated very often?

If so, another option might be to store a local copy of the table on the users' hard drives. You could then go back to your original brute force search, but search the local copy rather than the server. You could also have a mechanism that refreshes the local copy on the first search each day.

I don't know if that would be fast enough, but it would at least eliminate any delay due to the network.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro tips, advice, training, consultancy
Custom software for your business
 
Olaf:
This brings back some unpleasant memories, Tolstoy’s War and Peace… Had to study it in grammar school, talk about a “dry” book, but the folks at Senior Cambridge were sadists, I have always suspected… Anybody else in the UK subjected to this torture
 
Sorry to remind you of such bad times. Wasn't it Drew Speediedemonstrating the speed of VFP string functions with War and Piece? I never read it or had to read it, but it came to mind as a very long text, valid for such a test.

There's plenty other books at the project gutenberg site. The bible, old & new testament is longer for example.

Bye, Olaf.
 
Actually, that was Steve Black. He did a terrific session on Text that used the War and Peace file.

Tamar
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top