
Seeking with long strings

Status
Not open for further replies.

Bryan - Gendev

Programmer
Jan 9, 2011
408
AU
I have two tables containing fields that hold full paths, including the file name.

I need to find whether a file in a folder (full path) is in the smaller list of full paths.

I was previously given a 'seek' routine that tests for success and, if the value is not found, tries another method. The 240-character limit is not exceeded - typically a path would be

P:\_Test_Projects\Workshops\exhibits\002799 Bareholl -Leonard (NZ) =Marcelda Selwyn 1901.png

which I think is 92 characters.

It only works sometimes - some matches are missed.

My users aren't happy.

I'm wondering if a lookup in an array would be more effective.

Any comments?

Thanks

GenDev



 
What is your question?

I was previously given a 'seek' routine with some testing of success and if not found try another method.

What does this mean?

The 240chr limit is not exceeded

I'm not aware of any 240-char limit. What are you talking about?

It only works sometimes - some matches are missed.

Do we get examples or should we guess?

There's no way anyone can help you here. Sorry.
 
Dan,

There is a well-known 240-character limit on index keys, and SEEK only searches indexes. So this question fits together and makes sense to me.

Are you using an IDX index file or a CDX index tag of the table? IDX files don't auto-update with new records.

This MUST work - it's not a 'should', it MUST work, as you don't exceed the limits.

And an array would be about the worst thing you could do - what makes you think putting paths in an array makes them faster to search?
Cursors are already in memory, if that is your argument. And you can't index an array; ASCAN can only do a full linear scan. Awful idea - forget it. Watch Men in Black ;)

Bye, Olaf.
 
GenDev,

I agree with Olaf that moving the search to an array doesn't make much sense - and it won't necessarily solve your problem.

However, without seeing your code, it's difficult to say why it's going wrong.

A couple of possibilities:

- Does the index expression include an UPPER() - and, if so, are you also wrapping your search term in an UPPER()? If you take no special action, the seek will be case-sensitive, which might not be what you want.

- Are you also wrapping the search term in an ALLTRIM()? If not, the problem might be caused by trailing spaces.

If neither of the above helps, please give us more information about where the search term is coming from and how you are using it.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips, training, consultancy
 
I would check that you are not having a case issue.

Use an expression like Upper(MyField) for the index key,
and do your search with SEEK(Upper(myValue)).

I would also see if I couldn't 'drop' a part of the path - even if only the drive letter - so that the path/drive letter become relative to (say) "P:\_Test_Projects\Workshops\"

Then you only need search on:

"EXHIBITS\002799 BAREHOLL -LEONARD (NZ) =MARCELDA SELWYN 1901.PNG"

I have a routine that compares about 2-3 million files between two sites each evening. Some of the paths and file names exceed the 240-character limit, so I not only use the relative technique, I also index only the first 230 characters and then use a DO WHILE loop to search all the files that have that prefix... it works really well and is tremendously fast.
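A minimal sketch of that prefix technique, assuming a hypothetical table FileList with a character field FullPath (your names will differ):

Code:
   * Index only the first 230 characters of the path (stays under the key limit)
   SELECT FileList
   INDEX ON UPPER(LEFT(FullPath, 230)) TAG PathPfx

   * Seek the prefix, then walk the matching records comparing the full value
   cTarget = UPPER(ALLTRIM(cSearchString))
   lFound  = .F.
   SET ORDER TO PathPfx
   IF SEEK(LEFT(cTarget, 230))
      DO WHILE NOT EOF() AND LEFT(UPPER(ALLTRIM(FullPath)), 230) == LEFT(cTarget, 230)
         IF UPPER(ALLTRIM(FullPath)) == cTarget
            lFound = .T.
            EXIT
         ENDIF
         SKIP
      ENDDO
   ENDIF

The DO WHILE only touches the handful of records sharing the 230-character prefix, so it stays fast even on millions of rows.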


Good luck.



Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.
 
You have gotten a lot of good advice above already.

If I needed to do a similar lookup, I'd use a SEEK, but I would also ensure that the SET EXACT was set to ON so that 'near' values were not found by mistake.

Something like:
Code:
   * --- Somewhere, either manually or elsewhere in the code ---
   * --- Build an Index on the desired field ---
   USE MyTable EXCL
   SELECT MyTable
   INDEX ON Path_File TAG Path_File
Somewhere else in the code where the 'look up' needs to occur.
Code:
   * --- Set the VFP EXACT Parameter ---
   SET EXACT ON 

   * --- Now use the table and the desired Index ---
   USE MyTable IN 0
   SELECT MyTable

   * --- Get the length of the 'look for' field ---
   nLenField = LEN(MyTable.Path_File)
   
   * --- Somehow acquire the text string to search for ---
   cSearchString = 'P:\_Test_Projects\Workshops\exhibits\002799 Bareholl -Leonard (NZ) =Marcelda Selwyn 1901.png'

   * --- And use it to build a SEEK value ---
   cSeekKey = PADR(ALLTRIM(cSearchString),nLenField)
   SELECT MyTable
   SET ORDER TO Path_File
   IF SEEK(cSeekKey)
      * --- Fully Pathed Filename Found ---
      * <do whatever>
   ELSE
      * --- Fully Pathed Filename NOT Found ---
      * <do whatever>
   ENDIF

You can either leave the SET EXACT ON or you can return it to whatever previous value as you need.

And NO, do not bother going to the use of an array.

NOTE the SEEK would be case-sensitive (possibly leading to missed 'finds' depending on how the data was entered) so you might want to modify things as such:

Build the INDEX so that it is uppercase:
Code:
   INDEX ON UPPER(Path_File) TAG Path_File
and the associated seek key would then be:
Code:
   cSeekKey = PADR(ALLTRIM(UPPER(cSearchString)), nLenField)

Good Luck,
JRB-Bldr
 
Olaf is correct about the 240 character limit and I apologize for my error.

However, the length of the value you posted here has nothing to do with the index key length. What's the length of the field?

What is the exact index expression? If Trim() was used in creating the expression, behavior may be unpredictable.

We still need way more information to figure out what the actual problem is, much less guessing at a solution.
 
Dan is right that the field lengths are important. If the two tables have different field lengths and you SEEK an untrimmed value while the EXACT setting is ON, you won't find anything, even if there is a match when comparing the two trimmed values.

Bye, Olaf.
 
"...that compares about 2-3 million files between two sites each evening - some of the paths and file names exceed the 240 char limit, so I not only use the relative technique I only index the first 230 characters then use a do while loop to search all the files that have that prefix... it works really well and is tremendously fast."

Why not use a simple SQL SELECT statement? No looping needed, and SQL will almost always be faster than hand coding.
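For what it's worth, something along these lines is what I mean - a sketch assuming hypothetical tables SmallList and BigList, each with a FullPath field:

Code:
   * Find every path in the smaller table that also exists in the big one
   SELECT s.FullPath ;
      FROM SmallList s ;
      WHERE UPPER(ALLTRIM(s.FullPath)) IN ;
         (SELECT UPPER(ALLTRIM(b.FullPath)) FROM BigList b) ;
      INTO CURSOR Matched

Note that wrapping the compared fields in UPPER()/ALLTRIM() prevents Rushmore from optimizing via an index, so on large tables this can be slower than a hand-coded SEEK.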

Lion Crest Software Services
Anthony L. Testi
President
 
How would a SQL statement be quicker?

It couldn't use an index - there couldn't be one - to let Rushmore help.

Am I missing something?

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.
 
Actually, that is such a good suggestion, I HAVE to try it!

[batman mode on]
Off to the VFP code...
[batman mode off]

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.
 
BTW
Here is a speed trick that I have used in the past. Add a new column to your data: a checksum (see SYS(2007) for more information). That is, alongside the full ~240-character text, store an integer checksum, and make sure there is a CDX tag on it.

Now when you do the SQL SELECT, do it not via the text but via the checksums. Truth be told, there is a small chance that matches will be made that are not true matches, so you should run a second SQL SELECT on the first set of results to remove any false matches.

With large data sets this method can yield very significant speed improvements. Of course, code needs to be added to update the checksum whenever the main text changes. (I am not suggesting the checksums be recalculated before each SQL SELECT; I am saying do it during the normal running of the code.)
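A hedged sketch of the two-pass approach - table and field names are hypothetical, and SYS(2007) with the extra arguments is assumed to return a CRC32 value as a string:

Code:
   * One-time setup: store a numeric checksum of each path and index it
   ALTER TABLE BigList ADD COLUMN nCheck N(10, 0)
   REPLACE ALL nCheck WITH VAL(SYS(2007, UPPER(ALLTRIM(FullPath)), 0, 1)) IN BigList
   SELECT BigList
   INDEX ON nCheck TAG nCheck

   * Pass 1: join on the indexed numeric checksum (fast)
   SELECT b.FullPath, s.FullPath AS SmallPath ;
      FROM BigList b ;
      JOIN SmallList s ON b.nCheck = s.nCheck ;
      INTO CURSOR Candidates

   * Pass 2: weed out checksum collisions by comparing the actual text
   SELECT * FROM Candidates ;
      WHERE UPPER(ALLTRIM(FullPath)) == UPPER(ALLTRIM(SmallPath)) ;
      INTO CURSOR TrueMatches

SmallList would need the same nCheck column, maintained the same way.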

Lion Crest Software Services
Anthony L. Testi
President
 
I had been wondering about using some kind of ZIP encoding (?) to see if that gave an advantage.

The checksum might be a way...

I am testing your SQL statement idea as we speak.



Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.
 
Another idea/suggestion is to break the long file path/name into smaller chunks:

Drive_Ltr C(1)
Dir_lvl1 C(?20?)
Dir_lvl2 C(?20?)
DIR_lvl# AS many as you want
Dir_REMAIN C(?100?)
File_Name C(?100?)


So the file:
P:\_Test_Projects\Workshops\exhibits\002799 Bareholl -Leonard (NZ) =Marcelda Selwyn 1901.png

Could be broken into
Drive_LtR: P
Dir_lvl1: _Test_Projects
Dir_lvl2: Workshops
Dir_REMAIN: exhibits
File_Name: 002799 Bareholl -Leonard (NZ) =Marcelda Selwyn 1901.png

One can add checksum fields as needed/wanted (maybe only on Dir_REMAIN and File_Name). But I think that with several indexed fields, speed would be fine.

It also makes it easier to find duplicate file names on different drives/folders.
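For the splitting itself, VFP's path functions plus GETWORDNUM() can do most of the work - a sketch using the example path above (the field layout is as proposed; the variable names are mine):

Code:
   cFull  = [P:\_Test_Projects\Workshops\exhibits\002799 Bareholl -Leonard (NZ) =Marcelda Selwyn 1901.png]
   cDrive = LEFT(JUSTDRIVE(cFull), 1)            && "P"
   cFile  = JUSTFNAME(cFull)                     && the .png file name
   cDirs  = SUBSTR(JUSTPATH(cFull), 4)           && path minus the leading "P:\"
   cLvl1  = GETWORDNUM(cDirs, 1, [\])            && "_Test_Projects"
   cLvl2  = GETWORDNUM(cDirs, 2, [\])            && "Workshops"
   cRest  = GETWORDNUM(cDirs, 3, [\])            && "exhibits"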

Lion Crest Software Services
Anthony L. Testi
President
 
Well

I don't know exactly how long it normally takes, but with a SQL select statement it took two hours to compare the files in the real world.

I will know the normal duration on Tuesday morning.

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.
 
Ah,

I DO know how long the process took on Friday, and there shouldn't have been any major changes since then... and the SQL SELECT takes about four times as long. The seek on the first 200 characters followed by a DO WHILE loop for the balance takes about 30 minutes.

A bit of hand coding can indeed be faster (where an index can't help) than a SQL select!

B-)


Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.
 
Thanks everyone for some thought-provoking ideas.

I now have it working (added UPPER() to both sides).

Regards

GenDev
 
SQL select statement it took two hours
WOW!
Was the data on a local drive?

Did you do it with the checksum idea? I bet comparing 2-3 million-record tables would be done in under 5 minutes, most likely under 1 minute (assuming the tables are on a local computer, there is plenty of free RAM, and the CPU is reasonably fast).

Lion Crest Software Services
Anthony L. Testi
President
 
Good luck GenDev - sometimes it's the simplest of things.

MrDataGuy, actually, I re-ran the same routine again last night - by accident - with the old (hand-coded) routine. This time there were almost certainly no new files (not many people in construction are working at that time on a Sunday/Monday morning).

The hand crafted code version took less than 10 minutes.

It's not one SQL statement - it's a search repeated many, many times for matching file names. The data is 'not quite' local: it is held on a mapped drive to a closely coupled server (optical link between them).

I think that using a checksum might help, but as the very large majority of the file names are of a sensible length (i.e. less than 200 characters), the overhead of double-checking where two files have the same checksum is going to be very similar to mine where they have the same prefix (the first 200 or so characters).

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.
 