Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Search with / without didractis signs. 2

Status
Not open for further replies.

Koen Piller

Programmer
Jun 30, 2005
841
NL
Hi,

I would like to have a search procedure which will find names even if they have diactric signs,

So when I search for Aäron, I would like to find Aaron Cohen and also Aäron Baruch,
an when I search for Aaron, I would like to find Aaron Cohen and also Aäron Baruch
and the best thing would be to have a setting which would find only the Aäron Baruch for a search on Aäron

I cant think of anything else than to put an extra field in my dbf where the names are without the diactric signs, also believe there should be a procedure to accomplish this, only how?

Stay healthy,

Koen
 
...SET ORDER TO machName and Jöel is sorted after July, as ö comes after z in ASCII/ANSI order.

I think you could use sys(15), when adapting the european variable to the Windows 1252 or other codepages, but at the same time it's just doing what chrtran can also do, so you're on the right track.

Another variant would be SOUNDEX(), but that surely also doesn't work every time. Both Joel and Jöel give the same Soundex() value, though, and in general it will find other similar names than just removing diacritics.

Bye, Olaf.

Olaf Doschke Software Engineering
 
An issue with SET COLLATE is that not all European languages have the same collation sequence. Olaf, you suggested setting it to DUTCH, which might well be right for Koen, but if he also has customers (or whatever) from Portugal or Iceland, I think the collation might be somewhat different (I'm not going to take the trouble to test that).

I wonder if setting it to GENERAL would be better? (According to the Help, that covers "English, French, German, Modern Spanish, Portuguese, and other Western European languages".)

Also, I disagree with SOUNDEX(). That would find many records that were not at all similar to the one required.

Mike


__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Mike,
I am sorry, maybe I dont understand, but

Code:
lcAccents = CHR(192) + CHR(193) + CHR(194) + CHR(195) + CHR(196) + CHR(197) + ;  && Cap A
  CHR(200) + CHR(201) + CHR(202) + CHR(203) + ;                                  && Cap E
  CHR(224) + CHR(225) + CHR(226) + CHR(227) + CHR(228) + CHR(229) + ;            && l.c. a
  CHR(232) + CHR(233) + CHR(234) + CHR(235)                                      && l.c. e
  
lcMask = "AAAAAAEEEEaaaaaaeeee"
LOCATE FOR ALLTRIM(CHRTRAN(fullname, lcAccents, lcMask)) = "Aaron" 
BROWSE normal

does not show the record with "Aäron", it shows the record with "Aaron"

 
Griff,

Your function Anglaise
with
Code:
DO findstring WITH "Fullname","Aäron"
Finds Aaron. What I was looking for a function which would return both "Aaron" and "Aäron" when user searches for "Aaron", I believe I shall have to make an extra lookfor field (filled with a stored procedure) of the field content without the diactric sign if available my search proceduree should than look in this lookfor and return the original field. Afterall just an extra field and one simple SP.

If you have a better idea please let me know.

Thanks sofar,

Stay healthy,

Koen
 
If you are looking for an exact match in preference to a close one, do a search with the diactric and no ANGLISIZE first, then if it fails do one with the ANGLISIZE.
You could have indexes on both while only storing one version, to speed it up.



Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.
 
Koen,

you would need to at least index on the function applied to the name. Queries with ANGLISIZE(searchterm)=ANGLISIZE(field) would be optimized by such an index, that should be the base idea. And if it turns out that ANGLISIZE(searchterm) <> searchterm and the user also turns on an option to search literally, search the normal way. And maybe union both results.

Mike, no we don't disagree on Soundex, while it works for some names, I said "that surely also doesn't work every time". But what it is is a totally different strategy especially in the case of names to find some similar to what you heard someone tell you. I have inherited code that actually did use Soundex in a customer search form, you at least had the option to search by Soundex, too. I know it is used, as it's working somewhat for that case, but yes, surely diacritics usually also sound different from the normal letter, so Soundex has different results for them. I just remember regarding German umlauts it typically just has the same value whether you use ÄÖÜ or AOU.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Like this:

Code:
FUNCTION FINDSOMETHING
	PARAMETERS m.STRING
	PRIVATE m.STRING,m.OLDRECNO,m.OLDORD,m.OLDAREA,m.FLAG
	m.STRING = UPPER(m.STRING)
	m.FLAG = .F.
	m.OLDAREA = SELECT()
	SELECT MYTABLE
	m.OLDRECNO = RECNO()
	m.OLDORD = ORDER()
	SET ORDER TO EXACTMATCH && index on UPPER(fullname)
	SEEK (m.STRING)
	IF !FOUND()
		SET ORDER TO ANGLIMATCH && index on UPPER(ANGLISIZE(fullname))
		SEEK (ANGLISIZE(m.STRING))
		IF FOUND()
			m.OLDRECNO = RECNO()
			m.FLAG = .T.
		ENDIF
	ELSE
		m.OLDRECNO = RECNO()
		m.FLAG = .T.
	ENDIF
	SET ORDER TO (m.OLDORD)
	IF m.OLDRECNO > 0 .AND. m.OLDRECNO <= RECCOUNT()
		GOTO m.OLDRECNO
	ENDIF
	SELECT (m.OLDAREA)
	RETURN(m.FLAG)

** edited code to tidy up after use.

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.
 
Koen said:
What I was looking for a function which would return both "Aaron" and "Aäron" when user searches for "Aaron"

The way to handle that would be to do a SELECT rather than a LOCATE.

I'll answer your other point presently.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Olaf,

It's interesting what you said about SOUNDEX(). One of the very first Foxpro (Foxbase?) applications I wrote incorporated a SOUNDEX() search. The users often had to deal with customers on the phone, and I thought this would help them if they didn't hear their names properly. The language didn't natively support a SOUNDEX function at that time, so I wrote one myself. I thought I was being very clever.

The result: the users hated it. I never understood why, but they all complained bitterly, and so I decided to take it out. What I should have done, of course, was to make it a user option (like in the application you inherited) - and not to make it the default. Ah, well - too late now.

Regarding the user of SOUNDEX() with diacriticals. As I understand it, one of the first things that SOUNDEX does is to strip out all the vowels, except if one is the first letter of the string. Provided you can tell it that accented vowels are still vowels, no other special action is needed. But the function I wrote wasn't that sophisticated.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Koen said:
does not show the record with "Aäron", it shows the record with "Aaron"

Isn't that because "Aaron" comes before "Aäron" in alphabetical order? The LOCATE will always find the first record that matches the condition. You then have to CONTINUE to find the others.

If you want to find them all (in other words, you want to end up with a set of records that contain "Aäron", "Aaron" and other variations), then you should use a SELECT:

Code:
SELECT * FROM TheTable WHERE <condition>

where [tt]<condition>[/tt] is the same condition that you were previously using with LOCATE.

Mike

__________________________________
Mike Lewis (Edinburgh, Scotland)

Visual FoxPro articles, tips and downloads
 
Hi,
Thanks for all the input, I will settle for my initial thoughts, create an extra index record without any diactric sighn. I can than use a select, locate, find and even the awful, set('filter')
Stat healthy,
Koen
 
If you can work with an extra field that's computed from ANGLISIZE(fullname) or UPPER(ANGLISIZE(fullname)), then you can also just add the index on that expression. An index has the advantage to be in sync with the fullname automatically, whenever you change that.

Just a thought. But I see how you could perhaps cope better with an extra field. First you can see it, you can choose to change it with any update of the other field, but could also overrule it for some cases and store some exceptions differntly.

One thing is for sure: You will use up the space for it, whether in the DBF, the CDX or FPT file.

I don't know what's in your way now we have settled on the ANGLISIZE() function your demand becomes quite simple

> So when I search for Aäron, I would like to find Aaron Cohen and also Aäron Baruch,
>and when I search for Aaron, I would like to find Aaron Cohen and also Aäron Baruch

These both become the same search, where you preprocess your search term AND the field, so you match FOR/WHERE ANGLISIZE(field)=ANGLISIZE(searchterm). Rushmore will help find the index on ANGLISIZE(field) to optimize that, no matter if LOCATE or SQL-Select.

>and the best thing would be to have a setting which would find only the Aäron Baruch for a search on Aäron
Well, and that just is the usual search field=searchterm.

And thinking ab that in detail the extra field will need the space for the extra field and for fast locates and queries you index both fields, that's about 4x storage of the original field. Alternatively, you only keep the original field and have indexes on the field itself and ANGLISIZE(field). That's only about 3x the size of the pure data of the original field.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Here's a test about Locate, you seem to prefer that over SQL.

Code:
Local lcSearched, loResult1, loResult2

Clear
Close Tables All
* usual default settings (only necessary, if your settings differ):
Set Exact Off
Set Optimize On

* off by default and for development (to  see the deleted rows marked deleted, for example)
* but typically on for applications:
Set Deleted On

Create Cursor crsTestdata (cName V(20))
Insert Into crsTestdata Values ('Aäron Baruch')
Insert Into crsTestdata Values ('Aaron Cohen')
Insert Into crsTestdata Values ('Jöel')
Insert Into crsTestdata Values ('Joel')
Index On cName Tag xName
Index On Anglisize(cName) Tag aName
Set Order To

* Locate 'Aäron' diacritics insnsitive...
lcSearched = 'Aäron'
* ...means diacritics also have to be removed from the search term
lcSearched = Anglisize(m.lcSearched)
Select crsTestdata
Locate For Anglisize(cName) = m.lcSearched
? 'Locate result 1:'
Do While Found()
   ? cName
   Continue
Enddo

* Locate with 'Aäron' with diacritics...
lcSearched = 'Aäron'
Select crsTestdata
* ...means you can't used in the Anglisized data or index
Locate For cName = m.lcSearched
? 'Locate result 2:'
Do While Found()
   ? cName
   Continue
Enddo

Function Anglisize()
   Lparameters tcString
   tcString = Chrtran(m.tcString,"ÀÁÂÃÄÅÆ","AAAAAAA")
   tcString = Chrtran(m.tcString,"Ç","C")
   tcString = Chrtran(m.tcString,"ÈÉÊË","EEEE")
   tcString = Chrtran(m.tcString,"ÌÍÎÏÐÑ","IIIIII")
   tcString = Chrtran(m.tcString,"ÒÓÔÕÖ×Ø","OOOOOOO")
   tcString = Chrtran(m.tcString,"ÙÚÛÜÝÞß","UUUUUUU")

   tcString = Chrtran(m.tcString,"àáâãäåæ","aaaaaaa")
   tcString = Chrtran(m.tcString,"ç","c")
   tcString = Chrtran(m.tcString,"èéêë","eeee")
   tcString = Chrtran(m.tcString,"ìíîïðñ","iiiiii")
   tcString = Chrtran(m.tcString,"òóôõö÷ø","ooooooo")
   tcString = Chrtran(m.tcString,"ùúûüýþÿ","uuuuuuu")
   Return m.tcString
EndFunc

I don't see a problem, the SQL situation differs a bit in that it depends on ANSI ON/OFF and when you prefer VFPs string comparison methods to be able to find partial search terms, then you also prefer ANSI OFF.

But what you will never have is same code, just switch a setting like ansi, exact, optimize and have the insensitive or sensitive result. Because you can never find anything with diacritics in the anglisized data or index, of course, it can't go into a superposition state of being diacritics insensitive and sensitive.

A common query for both would need the condition
Code:
Anglisize(cName) = m.lcSearched Or cName = m.lcSearched

That would then depend on whether you anglisize the search term or not. Do this preprocessing step or not, that's your "setting" then, but it would mean the insensitive search also searches within the data with diacritics, the OR part would not be necessary. So I'd prefer to switch both the search term preprocessing and the choice of the specific code for sensitive or insensitive search.. The sensitive search is always the simpler one.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Olaf,

I think it was me who introduced the locate, and it was only because it seemed the easiest way to show how it might be done without referencing index tags.



Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.
 
I mean, there is one more idea, but it's not just impractical, it's also technically impossible to make use of:

You could of course combine the anglisized and non-anglisized names in one index tag, say left 20 characters anglisized and right 20 characters non anglisized. But you can't make Foxpro use just the left or right half of the index expressions,/index nodes, An expression like RIGHT(indexexpression,20)=searchterm will not make VFP use an index on indexexpression which is just part of the overall condition, because there is no way to use the index on the whole expression to determine the rows that will match just the right half without first computing the index on just the right half. So if at all, VFP would need to compute a temp index based on the one it has, and that's less effective, you'd still be best off with two separate indexes. And then also need two expressions as conditions, there's no way around this, it's just like the saying with the cake: You can't have the cake and eat it at the same time.

I wonder, I think even if VFP had a new mode of "use only half the index expression" the index tree itself is oriented about the whole expression, the values that are similar on the right side are scattered in the tree with different parent nodes, so there would be no good way to even implement such a mode of half index usage. You would at least also need a completely different index structure than a binary tree.

It's far simpler to have two indexes.

New index types would be a good VFP extension overall, because other databases offer easier ways for compound indexes that don't require you to think of an index expression pulling together data from several columns. Such compound indexes of MSSQL work like a subset of the table. When the query optimizer finds all the columns with conditions on them in an index and the result also doesn't want to output columns not in the index, the SQL engine doesn't even need to fetch the data pages of the table and makes use of the index as if it was the table for such a query.

So, one quintessence of this also is, you'd likely have an advantage when using an SQL backend for the kind of queries you have in mind. And it wouldn't even require very special knowledge about indexing these server databases, the usual indexes, also btree based, just have other node composition and other usage. In part, because the tables themselves are stored as betrees, too, once you have a clustered index, because that's the actual meaning of "clustered", the sibling rows of the table are clustered in tree nodex of a betree, just like an index organises its data.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Code:
 lcAccents = CHR(192) + CHR(193) + CHR(194) + CHR(195) + CHR(196) + CHR(197) + ;[indent][/indent]&& Cap A
  CHR(200) + CHR(201) + CHR(202) + CHR(203) + ;                                 [indent][/indent]&& Cap E
  CHR(204) + CHR(205) + CHR(206) + CHR(207) + ;                                  && Cap I
  CHR(210) + CHR(211) + CHR(212) + CHR(213) + CHR(214)+ ;                        && Cap O
  CHR(217) + CHR(218) + CHR(219) + CHR(220) + ;                                  && Cap U
  CHR(224) + CHR(225) + CHR(226) + CHR(227) + CHR(228) + CHR(229) + ;            && l.c. a
  CHR(232) + CHR(233) + CHR(234) + CHR(235) + ;                                  &&lc e         
  CHR(236) + CHR(237) + CHR(238) + CHR(239) + ;                                  &&lc i 
  CHR(242) + CHR(243) + CHR(244) + CHR(245) + CHR(246) + ;                       &&lc o
  CHR(249) + CHR(250) + CHR(251) + CHR(252)                                      &&lc u 

lcMask = "AAAAAAEEEEIIIIOOOOOUUUUaaaaaaeeeeiiiioooooiiiiuuuu"

select fullname from *filename*  where ALLTRIM(UPPER(CHRTRAN(fullname, lcAccents, lcMask))) = UPPER("Aaron")
*or 
brow FOR ALLTRIM(CHRTRAN(fullname, lcAccents, lcMask)) = "Aaron"   
*or
list FOR ALLTRIM(CHRTRAN(fullname, lcAccents, lcMask)) = "Aaron"   

use

 
You're right, Griff,

the original question only talks about a search procedure, any search procedure. In SQL you can use the same conditions in WHERE instead of FOR. So the logic can be transferred to SQL in the same manner.

I have to admit, though, that I have a puzzling behavior of my SQL example no matter if I set ANSI OFF or ON and no matter if I index on padded or non padded expressions. I figured out half of the problem were fully optimizable queries causing filter cursors that then cause an error about a variable going out of scope after the test PRG ends. That's of course easy to fix with NOFILTER clause in the SQL. But I am at a stage where I only get a result when the index is not used and no result when the index is reported to optimize the query by Sys(3054).

The query is simple to construct after the locate:
Code:
Select * From crsTestdata Where Anglisize(cName) = m.lcSearched
* or
Select * From crsTestdata Where cName = m.lcSearched

Again, the query for both sensitive and insensitive search, depending on whether you keep diacritics in lcSearched or not is then using both conditions ORed:

Code:
Select * From crsTestdata Where Anglisize(cName) = m.lcSearched OR cName = m.lcSearched

The results are mutually exclusive, if you search including diacritics in lcSearched, obviously Anglisize() will never match, if you search with diacritics removed from lcSearched, the cName = m.lcSearched only finds names that have no diacritics in their original value and they are also available in the index on Anglisize(cName), so those matches don't contribute more to the result in that case.

In case you ask for a view definition that can be used for both cases that would fit, but you're always doing double the conditions you actually need to check for the specific case of sensitive vs insensitive search, the major effect is on either the first or the second condition, but Foxpro works on both and wastes time on one of them. It may not be much time wasted when there are indexes on both cName and Anglisize(cName), but there is time wasted even in optimized conditions.

Bye, Olaf.

Olaf Doschke Software Engineering
 
You could test the searched for data to see if it equalled it's Anglisized self, then decide whether to do one search or two.

But the code above (FINDSOMETHING) would be faster than a fast thing with indexes and two searches are not going to hurt anyone, I would think they would be faster than a SQL call to be honest as there is no need for SQL to work out which indexes to use - it's in the code.

Regards

Griff
Keep [Smile]ing

There are 10 kinds of people in the world, those who understand binary and those who don't.

I'm trying to cut down on the use of shrieks (exclamation marks), I'm told they are !good for you.
 
Okay, Griff, but then it comes to many more SQL views with other conditions aside from the name search and then you'd really like SQL to take care of using its Rushmore optimization fully.

And views usually also are the reason to ask for one way to do things to not need to duplicate all view definitions and maintain double the number of views. I think there would be a solution in maintaining a set of views for the usual diacritics sensitive queries and a generator transforming them into the second set of views with diacritics insensitive onditions.

Or you make use of the ability of macro substitution in View, not just parameterization of them.

Bye, Olaf.

Olaf Doschke Software Engineering
 
One more completely new thought: VFP is optimizing boolean algebra and stops evaluation of partial conditions when it knows it doesn't matter, ie when it needs to find the overall result of condition1 OR condition2 it's sufficient condition1 is .T., condition2 then doesn't need any time wasted on it. Vice versa with AND conditions, For the case of condition1 AND condition2, VFP knows it doesn't help to evaluate condition2, when condition1 already is .F.

That doesn't mean you can cherry-pick just the advantage, if condition1 is .f. in the or case, the other part must be evaluated, to see whether it turns the overall result .t.
If condition1 is .t. in the AND case, this doesn't unbind you from checking condition2 to also be .t.

The way to make use of that is to combine your actual query conditions with switch variables used as view parameters, which activate or deactivate the evaluation of the conditions that way. Every original partial condition then becomes a condition (?switch AND originalcondition) which VFP only evaluates if switch is .T., but in case switch is .F. you have to be aware this results in .F. and this has to fit into the rest of the boolean algebra. It's not necessarily easy to incorporate that. Especially if your "wiring" of conditions already is quite complex.

Bye, Olaf.

Olaf Doschke Software Engineering
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top