
What is the best command for searching in strings?


farzad321 (Programmer)
Sep 14, 2019
Hi sir, we have a table with 250,000 records, and in it a field named DESC1, type Character(150). Now I need to search for a contract number in DESC1.

What is the fastest way to find the answer in this table?

1- Set filter to at(variable,'table') > 0 or set index for ...
2- Set filter to (variable $ DESC1) = .t. or set index for ...
3- Create local view with this condition and open the view
4- Select * for ( search condition )
Or any other commands.

Running locally, all of those methods are good and fast enough, but on the network with 4 clients online the system has a speed problem.
Thanks for HELP
FARZAD
 
Hello Farzad,

Do you want to search for a string that forms the first few characters of DESC1? For example, if DESC1 contains "REDWOOD OAK BOOKSHELF 2 METRES", you want to search for "REDWOOD" or "REDWOOD OAK"?

Or do you want to search for a string that is anywhere within DESC1? Such as "OAK" or "BOOK" or "2 M"?

If the latter, then I am afraid it will always be slow. There is no way of optimising the search.

But if the search string is the first few characters of DESC1, then the important thing is to make sure you have an index on DESC1. In that case, it won't make a great deal of difference which of your four methods you use. My preference would always be SQL SELECT. This would be simpler and probably more efficient than the other methods, but all four of them would be much faster than if there is no index.

Now, having said that, an index on a 150-character field is not a good idea. If you know you are always going to search on, say, the first eight characters, then index on LEFT(Desc1, 8). And if you want to make it case-insensitive, index on UPPER(LEFT(Desc1, 8)), and make sure your search term is also converted to upper case.

You will also need to SET EXACT OFF in that case.
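
To make that concrete, here is a minimal sketch of the index-and-SEEK approach; the table name, tag name and search value are invented for illustration:

Code:
* Build a case-insensitive index on the first 8 characters of DESC1
USE mytable                          && hypothetical table name
INDEX ON UPPER(LEFT(Desc1, 8)) TAG desc1_8

* With SET EXACT OFF, SEEK matches on just the leading characters
SET EXACT OFF
lcSearch = UPPER("REDWOOD")          && uppercase to match the index expression
SEEK lcSearch ORDER TAG desc1_8
IF FOUND()
   ? Desc1
ENDIF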

Mike
__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
The simple lesson from this is to store the contract number in a separate field. You pay for not storing data normalized.

Whether you'd index on At() or $ won't matter much, because you could only index for one specific search value. You can't index on value $ field and then SEEK value to get the records with that value somewhere in Desc1, because an index expression can't contain parameters.

And, as you know, indexing Desc1 won't help you either, unless the contract number is at the start of Desc1.

If I were in your situation, I'd extract the contract numbers into a new field. Single values are what can be indexed and then sorted by, or SEEKed.

A condition without an index will have to go through all of Desc1, just like the WHERE condition of a view will. And preparing an index for one contract number will not find another. Indexes only help if you don't need to create them for each value you search, because then the generation of the index costs just as much time as searching without one, plus the time necessary to write the index file or CDX tag. So that's not a good idea.

If you want a fast search, then you have to extract the contract numbers. That way you only need to go through all the data once, have it stored separately, and can profit from a normal index. Do they have a common format, like 10-digit numbers? Or alphanumeric? If there is any pattern, you can look for it and extract those matches.

If you stored this data in MSSQL (also the Express/LocalDB editions) you could profit from a full-text index, which VFP does not have. Using one also isn't as simple as it sounds. If you asked in a SQL Server forum whether to store contract texts in text fields and later search them for a certain contract number, I guess that even though the texts are only 150 chars they'd recommend storing the number separately, despite the redundancy. You could even remove the redundancy while you're at it: replace the contract numbers in the Desc1 field with a placeholder like <<contractnumber>> and then merge the contract number from the separate field back into the text.

Just out of curiosity, what do these 150-character contract texts look like? Is it always roughly the same text, just with varying names of the parties or things involved and varying contract numbers? Because that would suggest contract template texts.

Chriss
 
Chris said:
The simple lesson from this is to store the contract number in a separate field.

Farzad, I hadn't noticed that you will be searching on a contract number. In that case, you should definitely store it in a separate field, as Chris mentioned. And also make sure there is an index on that field. That will give you a huge gain in performance compared to how you are doing it now.

If there is always only one instance of each contract number, the easiest way to find it is to use SEEK.

If there could be more than one instance of a contract number, and you want to find them all in the same query, then go with SQL SELECT. This will let you get all the results in a cursor or array. But in that case, please ignore what I said about SET EXACT. It doesn't apply to SQL SELECT.
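
A hedged sketch of both options, assuming the contract number has already been extracted into an indexed field; the field, tag, table and cursor names here are all made up:

Code:
* One instance per contract number: SEEK on the index
SET ORDER TO TAG contractno
SEEK lcContractNo
IF FOUND()
   ? Desc1
ENDIF

* Possibly several instances: SQL SELECT collects them all in a cursor
SELECT * FROM payments ;
   WHERE ContractNo == lcContractNo ;
   INTO CURSOR c_hits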

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
If I'm searching in strings, which is inevitable from time to time, I tend to use the m.Variable $ UPPER(myTable.myField) variant.
It's OK with small datasets, but it performs badly with filters which extract a small number of records from a large dataset.

One case from 'recent' history where I used a combination of an index and a sequential search was trying to compare file names between a live database and a backup. File names can be up to about 250 characters, including the path, and indexes do not stretch that far, so I indexed the first 200 (I think) characters of the path and file name, used SEEK to find the first match on the leftmost 200 characters, and then did a sequential search from there until the first 200 characters changed.

In the real world, even on a structure with many millions of documents, the sequential part has not exceeded 50 skips (I put an alert into the code so it would be logged if that level was exceeded) - this approach has been running for some years.
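
A rough sketch of that seek-then-scan pattern; the field, tag and variable names are invented, and VFP index keys are limited to 240 bytes, hence the LEFT(..., 200):

Code:
* Index on the first 200 characters of the file name
INDEX ON LEFT(FileName, 200) TAG fn200
SET ORDER TO TAG fn200
SET EXACT OFF

lcKey = LEFT(lcWanted, 200)
SEEK lcKey                           && jump to the first candidate
* Walk forward sequentially while the leading characters still match
DO WHILE NOT EOF() AND LEFT(FileName, LEN(lcKey)) == lcKey
   IF ALLTRIM(FileName) == ALLTRIM(lcWanted)
      EXIT                          && exact match found
   ENDIF
   SKIP
ENDDO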

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.
 
Thank you Mike and Chris,
This field DESC1 is a description of money paid for some contract. Usually there is one contract number, but in some cases two or more, even five contract numbers paid at once, and a number can be at the beginning of the field or even at the end. Therefore I can't use another field for it, or I would have to reorganize the program and use a one-to-many child table for the contract numbers.
Unfortunately I must use this method.
As a result I think the best way is SQL SELECT.

thank you again for your help
 
In addition, when a client uses a 1 Gbit connection the search speed is acceptable, but on other clients that use a 100 Mbit LAN we have a problem.
 
Example of my records:

record 1 paid for cn 8023298
record 2 paid for cn 2196409 - 2196410 / 2196411
record 3 paid for cn 4830021 - 1260932
record 4 paid for cn 6071198 / 4930028 89236711
and so on
 
You can try all the options and see for yourself; my gut feeling is they'll vary by single percentages, about 1-2%, from each other. Considering that an index only helps with one value, it even saves disk wear not to create them at all. Creating an index is a full table scan operation, so it can only accelerate searches that make use of the index multiple times. You'd be doing a full table scan to use an index once, which does not prevent a full table scan. An index built for one specific CN could be persisted, but it will only help find that same CN again later, and even shorter filtered indexes will accumulate to more than double the data size due to organizational metadata.

farzad321 said:
paid for 2196409 - 2196410 / 2196411
It looks like you can extract all CNs, at least as long as you don't shorten a three-CN range like this to 2196409 - 2196411, which would make searching for 2196410 fruitless.

If CNs are all numbers you can find all matches with the help of a regexp:
Code:
LOCAL oRegExp, oMatches, oMatch

* VBScript regular expressions via COM
oRegExp = CREATEOBJECT("vbscript.regexp")
oRegExp.Global = .T.        && find all matches, not just the first
oRegExp.Pattern = '\d+'     && one or more digits
oMatches = oRegExp.Execute('paid for cn 60711982 / 4930028 89236711')
CLEAR
? TEXTMERGE('<<oMatches.Count>> CNs found:')
FOR EACH oMatch IN oMatches
   ? oMatch.Value
ENDFOR

If you only want to extract numbers as a CN when they are at least 5 digits long, the pattern becomes \d{5,}.

Instead of ? oMatch.Value you can do an INSERT into a 1:n detail table, and have an id and cn in there.

Looking for a CN then becomes SELECT * FROM CNs LEFT JOIN table ON CNs.id = table.id WHERE CNs.cn = '1234567', and an index on CNs.cn as well as on the ids will make this a fast and exhaustive search for all records related to a CN (and probably also others).
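
Putting those two steps together, a minimal sketch; the payments and CNs table names and the id field are assumptions to adapt to your schema:

Code:
* Fill the 1:n detail table with one record per CN found in Desc1
* (payments and CNs are hypothetical table names)
FOR EACH oMatch IN oMatches
   INSERT INTO CNs (id, cn) VALUES (payments.id, oMatch.Value)
ENDFOR

* With index tags on CNs.cn and the id fields, this lookup is fast
SELECT payments.* ;
   FROM CNs LEFT JOIN payments ON CNs.id = payments.id ;
   WHERE CNs.cn = '1234567' ;
   INTO CURSOR c_hits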

Once you have these records, you might expand to other CNs related to the same table ids, and expand the set until you have all transactions for the group of CNs, then verify the total of payments against the total billed amount of all the CNs, which might balance to 0, need another installment, or turn out overpaid.

It's not that hard to do, and you can then add a field stating that the Desc1 field was parsed and the extended 1:n data already exists. If Desc1 can change, that means adding to the detail data, updating it or deleting it; in the simplest case, delete all detail records for the updated table id and create the new set of records. Whatever suits you better.

There unfortunately is no magic that turns a 100 Mbit LAN into a 1 Gbit LAN. But there is data normalization.

Chriss
 
Hi Farzad,

All the suggestions will speed things up for you when you need to repeat your search many times. You didn't say whether that's the case.

But if you only want to extract the record(s) for a particular contract number just once (or perhaps just a few times), a simple, brute-force (and slower) method could be easier:

Code:
nContractNo = 123   && or any number
* TRANS() renders the number as a string for the substring test
SELECT * FROM MyDbf WHERE TRANS(nContractNo) $ Desc1

You can just view the result or add the INTO clause to put the result into a cursor.

HTH

Steve

 
Given that your DESC1 field contains multiple contract numbers, you are not going to be able to optimise it. Whatever method of searching you choose will be slow. That's because you can't create an index on multiple substrings within a longer field.

The ideal solution would be to normalise your data. This would involve having a "master" table, which would be similar to your existing table, and a "detail" table that would contain one record for each contract. The detail table would include a field which links back to the corresponding master record. That way, you could index the contract number field, which would give you a very fast search.

But I realise it is probably not practical to change the data model in that way. So how about this for an alternative:

Given that in most cases there is only one contract number, set up a separate field to contain the first (or only) contract number in each case. Create an index on that field. Then, when you want to do a search, search for the contents of that field. Because of the index, that should be fast. In many cases, it should find the required record straight away. If it doesn't, you can then do a substring search on the entire DESC1 field, using (variable $ DESC1) or whatever other method you like.

The point is that the substring search will be slow, and there will be no way of speeding it up. But because you will only be doing it in a minority of cases, you will still see an improvement in overall efficiency.
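
A small sketch of that two-stage search, assuming the first CN has been copied into an indexed field; FirstCN, payments and c_hits are hypothetical names:

Code:
* Stage 1: fast, index-optimizable lookup on the extracted first CN
SELECT * FROM payments WHERE FirstCN == lcCN INTO CURSOR c_hits
IF _TALLY = 0
   * Stage 2: fall back to the slow substring scan over all of DESC1
   SELECT * FROM payments WHERE lcCN $ Desc1 INTO CURSOR c_hits
ENDIF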

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Farzad,

Sorry, I didn't mention a possible problem with the SELECT statement I suggested:

If your contract numbers are only three digits, as in my 123 example, you will retrieve unwanted records with contract numbers such as 1234, 56123, etc. because they all contain "123". The problem lessens if the contract numbers in Desc1 are 6 digits or more, or contain leading zeros (e.g. "000123").

Steve

 
The way I see it, on the business and use-case level, there are contracts between Farzad or his company and customers, suppliers or other parties, so a normalized data structure should have:

1. Customers and suppliers, or in general, contract parties.
2. Contracts.
3. An n:m relation table between parties and contracts, usually with just two parties per contract, maybe only one if Farzad or his company is always the other contract party, which then makes this n:m relation table obsolete, as a contract can then have a partyid foreign key.
4. Payments, which come from a party and can be for multiple contracts (perhaps identified by "reason for payment"), or which are paid aggregated as monthly rates.
And then, from the DESC1 field of the payments table, you also feed a 5th table:
5. Contractpayments, joining together a pair of contractid and paymentid. In this relation table you'd have 2 or 3 records if a payment is about 2 or 3 contracts, and the single records should split the payment amount of the single payment record, assigning the partial amount to each contract so that the sum is balanced. There should be ways to do this without redundancy and the problem of imbalance. Another table, a ledger table, could be helpful to do this bookkeeping precisely, because it's one thing what payments are reported by banks, and another how you do your internal bookkeeping for the contracts.

And then you can easily query from the point of view of contracts (which payments have been made) or from the point of view of parties (what they have paid for which contracts), etc. That's the way to design the data for that part of your application.
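
A minimal sketch of that schema in VFP DDL; all names and field sizes are assumptions, and the PRIMARY KEY clauses require the tables to belong to a database container:

Code:
* Parties, contracts, payments, plus the two relation tables
CREATE TABLE parties (partyid I PRIMARY KEY, name C(60))
CREATE TABLE contracts (contractid I PRIMARY KEY, cn C(10), partyid I)
CREATE TABLE payments (paymentid I PRIMARY KEY, amount Y, desc1 C(150))
CREATE TABLE contractparties (contractid I, partyid I)
CREATE TABLE contractpayments (contractid I, paymentid I, partamount Y)

* The index that makes CN lookups fast
SELECT contracts
INDEX ON cn TAG cn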

Chriss
 
Hi to all,
I read all the methods and ways of gaining some speed to solve the search problem.
My friend Chris, this is software for paying and receiving money from passengers at a travel agency, which then sends a master report to an accounting system to control each contract number individually.
Here we just search for one contract to see all the money transfers (because a passenger may pay two or three times, or may get his money back), or money is received for two contracts in one bill.
As a result, I think I will extract the first CN from each record, save it to a new field and index it, then search first in this new field and then do a slower search in the DESC field.
This is the best way I can do it, as I read from Mike's post.
Again thank you very much to everyone for good Ideas.
FARZAD
 
Farzad,

If you only extract the first CN, you won't find any CN that's not first in its record, so that's a bad idea. It may cover the 80% case of one payment for one CN in one record, but it won't cover all records containing a given CN if it appears as a secondary CN in another record. It's up to you to estimate what percentage of cases that covers, but if you need the second search most of the time anyway, then this just adds to the time you have to invest.

If you extract one CN you can also extract all, I showed you how and I also gave you the SQL query you finally need to find all records related to a CN, after you have this.

You have 250,000 records. This is not a hobby system. So do what you already said yourself you should do; don't go with half-baked solutions.
Sorry to say so, Mike, but though covering most of the cases could be helpful in a situation with no obvious solution, we do have an obvious solution here.

Chriss
 
Besides all that, do you really need a single CN search?

If one of your final goals is sending out reminders (snail mail or email), one for each customer with a summary of contracts and due payments, then you'd go through all the data anyway, so there'd be no need for a single-CN search.

Besides, you cannot have the ideal result of the total paid per CN anyway, because you take in payments for multiple CNs, too, and don't split them. That's a fault in the data you can't mend just by normalizing what you have; it should have been better from the start. Don't get me wrong, it's perhaps a nice service to customers that they can make one payment for multiple CNs, and it's not impossible to split such a payment across the several CNs in a meaningful way that gives you a remaining due amount or covers them fully. For that, a per-customer view would help much more than a per-CN view. It would also help to find wrong CNs that got mixed into one payment but belong to contracts of different customers, and to find the right CN where there was a typo, two neighboring digits switched, for example.

You can get there, adding one detail table for that would already be a good start to go in that direction.

Chriss
 
Chris (and Farzad),

I stand by my suggestion, that is, catering for the 80% of cases where the searched-for number is the first (or only) contract number, and resorting to a substring search in the remaining cases. My point was that this would be faster overall than having to do a substring search in every case.

I agree that it's not an ideal solution. It would be far better to normalise the data, with a separate contract number table which has a foreign key to the main table. Farzad hasn't indicated if that's possible. If it is, that would be the way to go.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,

But how can you be sure the 80% is really covered? Not talking about whether it's 80% or even more: even if you find one record and the payment is equal to what the contract says, can you be sure there is no secondary payment (as unlikely as it may be for someone to pay more) for the same CN, where it's not the first CN in the description field and is thus overlooked? Paying twice has already happened to me before, and I was glad about the foresight of the shop that paid me back.

If you want all the data for one CN, then an 80% solution doesn't deliver it, even for payments that at first glance balance out with the contracts, because you never know whether there's a record with the CN as a secondary CN.

Okay, I'll stop here. I'm not saying you won't get what you want without full normalization, but even if the data comes from a system not under your control, or just "comes in that way", you can always process the data the way your system needs. Then questions like the performance of At() vs the $ operator don't matter anymore; it doesn't help much to gain 1%, it helps to have the data in a form where an optimized query gets at your 1-3 records fast.

The culprit here isn't Xbase code vs SQL either; it's having data that can't be searched without full table scans. That's always a sign to do something about it, even if it means changing a lot of other features. For a transition phase you can store redundant data; the detail table does not need to replace the DESC1 field in phase 1 of the change, only the feature you optimize with it needs to know and care about the detail table at first. But if you don't start the transition to a better database design, you just put off technical debt, and it grows over time, makes systems unmaintainable and finally breaks them if you don't address it.

Chriss
 
Some years ago, I wrote a Bible study program. In the Bible there are roughly 32,000 verses. I wanted the user to be able to do a free-form search for any word, combination of words, or phrase in the verses and display the matching verses. The verse table contained the verse text in a memo field, which is not indexable and would not help anyway, since the search could be anywhere in the text string. To do this, I used the ATC() function to perform the comparison in a SCAN-ENDSCAN loop. The results were collected in a cursor. I am able to search for complex phrases/combinations in seconds over the 32,000 records.

In your case the table is on the network, so you do not want to do a SCAN-ENDSCAN on it directly; it would take too long. Instead, I would try to do a SELECT of the network table into a temporary cursor and then do the search on that. This way the network is queried only once, to get the table (with the minimum fields needed) into a cursor that is local to the PC. Conceptually, this would be something like:

Code:
LOCAL lcSearchText
lcSearchText = "your search text"    && whatever the user is looking for

* Collect the matching keys here
CREATE CURSOR c_results (keyfield I)

* One network round trip: pull only the fields needed into a local cursor
SELECT keyfield, textfield FROM networktable INTO CURSOR c_search   && these should be your fields

* Case-insensitive substring search over the local cursor
SELECT c_search
SCAN FOR ATC(lcSearchText, c_search.textfield) > 0
   INSERT INTO c_results (keyfield) VALUES (c_search.keyfield)
ENDSCAN
USE IN SELECT('c_search')

The cursor c_results contains the record keys that matched your text.
 