soundex for arabic language

Status
Not open for further replies.

yahiadal

Programmer
Sep 19, 2006
Hi all
I need a Soundex algorithm that supports the Arabic language. All I have found is a PHP class, but I have no experience with PHP to translate the class into VFP. Any help will be appreciated. Following is the PHP class:

<?php
// ----------------------------------------------------------------------
// Copyright (C) 2006 by Khaled Al-Shamaa.
// ----------------------------------------------------------------------
// LICENSE

// This program is open source product; you can redistribute it and/or
// modify it under the terms of the GNU General Public License (GPL)
// as published by the Free Software Foundation; either version 2
// of the License, or (at your option) any later version.

// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.

// To read the license please visit
// ----------------------------------------------------------------------
// Class Name: Arabic Soundex
// Filename: ASoundex.class.php
// Original Author(s): Khaled Al-Sham'aa <khaled.alshamaa@gmail.com>
// Purpose: Arabic soundex algorithm takes an Arabic word as input
// and produces a character string which identifies a set of words
// that are (roughly) phonetically alike.
// ----------------------------------------------------------------------

class ASoundex {
var $asoundexCode = array('/ا|و|ي|ع|ح|ه/',
'/ب|ف/',
'/خ|ج|ز|س|ص|ظ|ق|ك|غ|ش/',
'/ت|ث|د|ذ|ض|ط|ة/',
'/ل/',
'/م|ن/',
'/ر/'
);

var $aphonixCode = array('/ا|و|ي|ع|ح|ه/',
'/ب/',
'/خ|ج|ص|ظ|ق|ك|غ|ش/',
'/ت|ث|د|ذ|ض|ط|ة/',
'/ل/',
'/م|ن/',
'/ر/',
'/ف/',
'/ز|س/'
);

var $transliteration = array('ا' => 'A',
'ب' => 'B',
'ت' => 'T',
'ث' => 'T',
'ج' => 'J',
'ح' => 'H',
'خ' => 'K',
'د' => 'D',
'ذ' => 'Z',
'ر' => 'R',
'ز' => 'Z',
'س' => 'S',
'ش' => 'S',
'ص' => 'S',
'ض' => 'D',
'ط' => 'T',
'ظ' => 'Z',
'ع' => 'A',
'غ' => 'G',
'ف' => 'F',
'ق' => 'Q',
'ك' => 'K',
'ل' => 'L',
'م' => 'M',
'ن' => 'N',
'ه' => 'H',
'و' => 'W',
'ي' => 'Y'
);
var $len;
var $lang;
var $code;

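// Named __construct so it still runs as the constructor on PHP 8, where a method named after the class is an ordinary method.
function __construct($len=4, $lang='en', $code='soundex'){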
$this->len = $len;
$this->lang = $lang;
$this->code = $code;
}

/**
* @return String : the calculated soundex/phonix numeric code
* @param String : the word that we want to encode
* [soundex|phonix] : defines the mapping code to be used in this conversion
* @desc mapCode : method to create the soundex/phonix numeric code for a given word
* @author Khaled Al-Shamaa
*/
function mapCode($word){
$encodedWord = $word;

if($this->code == 'phonix'){ $map = $this->aphonixCode; }else{ $map = $this->asoundexCode; }

foreach($map as $code=>$condition){
$encodedWord = preg_replace($condition, $code, $encodedWord);
}
$encodedWord = preg_replace('/\D/', '0', $encodedWord);

return $encodedWord;
}

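function trimRep($word){
// Collapse runs of the same character, e.g. "5533" becomes "53".
$chars = preg_split('//',$word);
$cleanWord = ''; // initialized to avoid undefined-variable notices
$lastChar = '';

foreach($chars as $char){
if($char != $lastChar){ $cleanWord .= $char; }
$lastChar = $char;
}

return $cleanWord;
}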

function soundex($word){
list($dump, $soundex, $rest) = preg_split('//',$word,3);

if($this->lang == 'en'){ $soundex = $this->transliteration[$soundex]; }

$encodedRest = $this->mapCode($rest);
$cleanEncodedRest = $this->trimRep($encodedRest);

$soundex .= $cleanEncodedRest;

$soundex = preg_replace('/0/', '', $soundex);

$totalLen = strlen($soundex);
if($totalLen > $this->len){
$soundex = substr($soundex, 0, $this->len);
}else{
$soundex .= str_repeat('0', $this->len - $totalLen);
}

return $soundex;
}
}
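
For readers who do not know PHP, the logic of the class can be sketched in Python roughly as follows. This is a minimal, unofficial port and not Khaled Al-Shamaa's code: the function names (`map_code`, `trim_rep`, `arabic_soundex`) are made up for this sketch, returning all zeros for an empty word is my own choice, and it works on Unicode characters where the PHP works on the bytes of the string.

```python
# Letter groups copied from $asoundexCode / $aphonixCode in the PHP class.
# The position of a group in the list is the digit its letters map to.
ASOUNDEX_GROUPS = ["ا و ي ع ح ه", "ب ف", "خ ج ز س ص ظ ق ك غ ش",
                   "ت ث د ذ ض ط ة", "ل", "م ن", "ر"]
APHONIX_GROUPS = ["ا و ي ع ح ه", "ب", "خ ج ص ظ ق ك غ ش",
                  "ت ث د ذ ض ط ة", "ل", "م ن", "ر", "ف", "ز س"]

# First-letter transliteration copied from $transliteration.
TRANSLITERATION = {
    "ا": "A", "ب": "B", "ت": "T", "ث": "T",
    "ج": "J", "ح": "H", "خ": "K", "د": "D",
    "ذ": "Z", "ر": "R", "ز": "Z", "س": "S",
    "ش": "S", "ص": "S", "ض": "D", "ط": "T",
    "ظ": "Z", "ع": "A", "غ": "G", "ف": "F",
    "ق": "Q", "ك": "K", "ل": "L", "م": "M",
    "ن": "N", "ه": "H", "و": "W", "ي": "Y",
}


def map_code(word, groups):
    """Replace each letter by its group digit; other non-digits become '0'."""
    digit_of = {ch: str(d) for d, group in enumerate(groups) for ch in group.split()}
    return "".join(
        digit_of.get(ch, ch if ch in "0123456789" else "0") for ch in word
    )


def trim_rep(code):
    """Collapse runs of the same character, e.g. '5533' -> '53'."""
    out = []
    for ch in code:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def arabic_soundex(word, length=4, lang="en", code="soundex"):
    """Sketch of ASoundex::soundex(); parameters mirror the PHP constructor."""
    if not word:
        return "0" * length  # assumption: the PHP class does not define this case
    first, rest = word[0], word[1:]
    if lang == "en":
        first = TRANSLITERATION.get(first, "")
    groups = APHONIX_GROUPS if code == "phonix" else ASOUNDEX_GROUPS
    result = first + trim_rep(map_code(rest, groups))
    result = result.replace("0", "")  # drop the vowel/unknown placeholder
    return result[:length].ljust(length, "0")
```

With these tables, `arabic_soundex("محمد")` comes out as `M530`. The order of the steps matters: repeated digits are collapsed before the zeros are removed, so two equal consonants separated by a vowel are both kept.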

thank you
yahya
 
Thanks for that feedback. I think we have some other Arabic members to whom that'll be helpful. You might consider posting this as a FAQ. Just click on FAQ (you find it in the head section). On the FAQs page, scroll all the way down to the "Write A FAQ" form and post a new FAQ, maybe in the category "String Commands" or "Useful Functions & Procedures".

Put up the code in code tags this way: [code]your code here[/code], not as an attachment. I think attachments are only kept for a few months and are then removed. Besides, normal threads get closed and can receive no comments after that; they can only be referenced by their thread id. FAQs allow readers to send comments to you, which may even end up as a business proposal, for example to integrate this into some software.

Posting a ZIP might also be helpful; in that case think about using a cloud drive. That even lets you keep control and update the code. The same goes for FAQ text, though: you can edit your FAQ after posting, and posted code is easier to trust. ZIPs might contain malware, spyware or ransomware, not only because you might put that in, but because anyone hacking the cloud drive might add it to public downloads.

Most important, perhaps: FAQs stay in sight, while threads sink over time. The forum search of course helps to find both old posts and FAQs, but this qualifies as a FAQ even though it was seldom asked. My criterion is simply whether something is not often found and good to find if you have the same problem, without being too obscure or special, and this is.

Bye, Olaf.
 
Thanks to you, Mr. Olaf and Mr. Mike, and to everyone who assisted with or inspired the proposed solution.

Yahya
 
I have finished adding this article to the FAQ section.

Yahya
 
If you look into your mail, you'll see a recommendation to announce the FAQ in a new thread. You can do so by posting [tt]faq 184-7907[/tt] (without a space between faq and the faq id), which automatically expands to a link to faq184-7907.

You chose a OneDrive download. That's fine. Though I said it might not be trusted as much as simply posted code, it is easier to download and can be checked for viruses etc. Also, since you posted a PRG and not a ZIP, it can be inspected with the VFP editor before execution. A good choice and a compromise with both advantages.

By the way: ZIPs are not executed either, but they can exploit buffer overflows in well-known zip software like WinZip, WinRAR or 7-Zip to run malware even upon mere inspection of a ZIP archive. This is not just theory, see a case of last year at
A good idea is to offer a SHA1 or MD5 checksum of the file, so any future downloader can calculate the same checksum of the downloaded file before opening it in any other way, and so verify that no hacker changed your original upload. To make a changed file appear as the original upload, an attacker would then have to change both the OneDrive file and the checksum posted in the Tek-Tips FAQ.
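
As an illustration of the checksum idea, here is a minimal Python sketch using the standard `hashlib` module. The function name `file_checksum` and its parameters are hypothetical; any tool that computes SHA1 or MD5 would do the same job.

```python
import hashlib


def file_checksum(path, algorithm="sha1", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)  # e.g. "sha1" or "md5"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex, algorithm="sha1"):
    """Compare a download against the checksum posted by the uploader."""
    return file_checksum(path, algorithm) == published_hex.strip().lower()
```

The uploader would run `file_checksum` once on the original PRG and post the result; a downloader runs `matches_published` on the downloaded file before opening it.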

Bye, Olaf.
 
Hi Mr. Olaf
If I post the code directly, the Arabic letters will not show correctly. That's why I chose to upload the PRG file as it is, so the Arabic shows correctly in Windows with Arabic support.
Thank you
Yahya
 
Ah, yes, I forgot that you'd have a codepage translation through posting. I would guess that copy & paste would still work if a user has Arabic Windows; your posting here also has the Arabic letters, so that part of the copy & paste into the forum works.

Anyway, it is always easier to have a separate file than to copy code out of a post. The only thing you don't get right away is an overview of how the code works.

Bye, Olaf.
 
Hi mr. Olaf
The supplied program, when run, gives demo results of applying its function to sample names.
The program generates a table where every Arabic letter has its phonetic English equivalent. When we call the function with an Arabic name, the name is translated to its English phonetic equivalent, on which we call the standard VFP SOUNDEX function to get the soundex code.
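
The two-step approach described here (transliterate, then run an ordinary Soundex) can be sketched in Python as below. This is not the posted PRG. As assumptions, the `AR2EN` table simply reuses the transliteration array from the PHP class as a stand-in for the contents of ar2en.dbf, and `soundex` is a plain American Soundex written for this sketch, which may differ in detail from VFP's SOUNDEX function.

```python
# Stand-in for ar2en.dbf: values taken from the PHP $transliteration array.
AR2EN = {
    "ا": "A", "ب": "B", "ت": "T", "ث": "T",
    "ج": "J", "ح": "H", "خ": "K", "د": "D",
    "ذ": "Z", "ر": "R", "ز": "Z", "س": "S",
    "ش": "S", "ص": "S", "ض": "D", "ط": "T",
    "ظ": "Z", "ع": "A", "غ": "G", "ف": "F",
    "ق": "Q", "ك": "K", "ل": "L", "م": "M",
    "ن": "N", "ه": "H", "و": "W", "ي": "Y",
}

SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        SOUNDEX_CODES[_ch] = _digit


def transliterate(name):
    """Map each Arabic letter to its Latin equivalent; other characters pass through."""
    return name.translate(str.maketrans(AR2EN))


def soundex(name):
    """Classic 4-character American Soundex (a sketch, not VFP's implementation)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # vowels reset prev, so a repeated code is kept after them
    return (result + "000")[:4]


def arabic_name_soundex(name):
    """Transliterate first, then apply the ordinary Soundex."""
    return soundex(transliterate(name))
```

With this table, `transliterate("محمد")` gives `MHMD`, which the Soundex sketch turns into `M300`. Adapting the approach to another language only means swapping the mapping table.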
By the way, I use Oracle VirtualBox to install different versions of Windows with different language support for testing purposes.
Yahya
 
Virtual Box is a good thing, I could do the same, as I still have lots of license keys due to having been MVP for a few years. Anyway I couldn't operate on any other Windows versions than German and English Windows. And those languages English and German can easily be combined in one Windows.

Bye, Olaf.
 
I also work on English Windows (Win 10) with Arabic language support installed:
-in Control Panel, Language, Add a language, choose Arabic (Lebanon)
Now, since VFP is not Unicode aware, you have to:
-in Control Panel, Region, Administrative, Language for non-Unicode programs, Change system locale, Arabic (Lebanon)
After that VFP will be able to see and use Arabic letters.
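
Why the system locale matters for a non-Unicode program can be illustrated outside VFP with a small Python sketch. It only uses Python's built-in `cp1256` and `cp1252` codecs; the helper names are hypothetical, and it says nothing about how VFP itself handles code pages internally.

```python
def encoded_sizes(text):
    """Byte length of the same text in Windows-1256 and in UTF-8."""
    return {
        "cp1256": len(text.encode("cp1256")),
        "utf-8": len(text.encode("utf-8")),
    }


def fits_code_page(text, code_page):
    """True if every character of text exists in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def round_trip(text, code_page="cp1256"):
    """Encode to single bytes and decode again, as an ANSI program would see it."""
    return text.encode(code_page).decode(code_page)
```

In code page 1256 each Arabic letter is a single byte, which is what a single-byte program expects, while a Western code page such as 1252 has no slot for those letters at all, so `fits_code_page("محمد", "cp1252")` is false.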

By the way my supplied program can be adapted to other languages by changing the character mapping defined in the file ar2en.dbf
yahya
 
The only thing holding me back is that, after setting the locale to Arabic, I would have a hard time setting it back to German or English.

By the way, character mappings are also done for collations. Also see SYS(15), which from its description works quite like CHRTRAN. I never used it, though, and it is also recommended to use COLLATIONS instead.

And while at SYS functions, I see SYS(2300) could enable setting the Arabic codepage without changing the whole system locale.

Code:
SYS(2300,1256,1)
MODIFY COMMAND ...php_soundex.prg as 1256
I tried, and unfortunately it still gives the "Code page number is invalid" error.

Anyway, in regard to anything related to internationalization, always see Steven Black. Most steps mentioned here for the most problematic Asian languages also apply to other foreign language settings.

Bye, Olaf.
 
Hi Mr. Olaf

SYS(2300,1256,1)
MODIFY COMMAND ...php_soundex.prg as 1256
worked fine on my system configured as mentioned previously.
I'll see Steven Black pages.
Thank you
yahya
 
Well, your system doesn't need that SYS call anyway. It already supports codepage 1256 out of the box, and it may even be your default codepage in VFP.
You don't need any of these hints; they are for people using a Windows other than Arabic.

Bye, Olaf.
 
I am not using Arabic Windows. I use English Windows, same as you, but with the Arabic language added and the system locale for non-Unicode programs set to Arabic, as I mentioned in a previous post.
You can try it in a virtual machine if you want.
For testing I removed Arabic for non-Unicode programs and set it back to English, rebooted, and voila, codepage 1256 is no longer available. Switch back to Arabic for non-Unicode programs, restart, and codepage 1256 comes back.
thank you
yahya
 
I am using German Windows. In principle it should work, but I'm not yet eager to find out. I trust your code works, and the hints are intended for those needing Arabic support; I think they will have enough information by now.

Thank you.

Bye, Olaf.
 