perl: remove non-english sentence

Status
Not open for further replies.

diera

Programmer
Mar 21, 2011
28
DE
Hi,

I would like to filter out all the sentences in my dataset that are not in English. I have already looked at removing non-ASCII characters, but some other languages use the same characters as English.

Do you know of any Perl library or dictionary for filtering out non-English words?
Any help is much appreciated.


-DATA-
11.10.2009 12.09 @ariel di sini kabar sihat saje
12.10.2009 17:47 start learning at mid night again. Did I stop learning? sure not. I'll learn until die.
12.10.2009 17:48 RT ggapj e tinha PiruliCópteros!! quem não teve?? #twittesuainfancia
12.10.2009 21:29 @anaelisa04 coooomédia meemo em? =D
13.10.2009 14:47 Reply w/Answer for Chance to Win a FREE Phoenix Bat: Who holds the MLB record for most home runs hit in opening day games?
13.10.2009 14:57 @alexforrestitv some treatment are good apparently in the region.Hope you will come back there to show something more cheerful? :)
13.10.2009 15:18 QUER #AUMENTAR SEUS #SEGUIDORES RÁPIDO? Siga o @MeusFollowers. Leia as instruções e em poucos minutos você terá centenas de seguidores!!

 
Perl:
#!/usr/bin/perl -w

use strict;
use warnings;

my %english = ();

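# Load the system word list into a hash for fast lookups
open( my $dictionary, '<', '/usr/share/dict/words' )
	or die "Can't open dictionary: $!";
while ( my $word = <$dictionary> )
{
	chomp $word;
	$english{ lc $word } = 1;
}
close $dictionary;

while ( my $line = <DATA> )
{
	chomp $line;
	next unless $line =~ /\S/;	# skip blank lines

	my $total = 0;
	my $finds = 0;

	foreach my $word ( split /\s+/, lc $line )
	{
		# Trim leading/trailing punctuation and digits so that "learning?"
		# matches "learning" and the date/time stamps are not counted as words
		$word =~ s/^[^a-z]+|[^a-z]+$//g;
		next unless length $word;

		$total++;
		$finds++ if exists $english{ $word };
	}

	print "<<< $finds find(s) >>>\t$line\n";

	print "\t... at least half of the words were found in the dictionary\n\n" if $total && $finds >= $total / 2;
}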

__DATA__
11.10.2009 12.09 @ariel di sini kabar sihat saje
12.10.2009 17:47 start learning at mid night again. Did I stop learning? sure not. I'll learn until die.
12.10.2009 17:48 RT ggapj e tinha PiruliCópteros!! quem não teve?? #twittesuainfancia
12.10.2009 21:29 @anaelisa04 coooomédia meemo em? =D
13.10.2009 14:47 Reply w/Answer for Chance to Win a FREE Phoenix Bat: Who holds the MLB record for most home runs hit in opening day games?
13.10.2009 14:57 @alexforrestitv some treatment are good apparently in the region.Hope you will come back there to show something more cheerful? :)
13.10.2009 15:18 QUER #AUMENTAR SEUS #SEGUIDORES RÁPIDO? Siga o @MeusFollowers. Leia as instruções e em poucos minutos você terá centenas de seguidores!!

Kind Regards
Duncan
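
To drop the non-English lines rather than only report counts, the same dictionary test can be wrapped in a function and used as a filter. This is a minimal sketch, not part of the post above: the names is_english and $threshold are hypothetical, and a tiny inline word list stands in for the real dictionary file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if at least $threshold of the words in $line
# are keys of the %$dict hash. Tokens that contain no letters are ignored.
sub is_english
{
	my ( $line, $dict, $threshold ) = @_;
	$threshold = 0.5 unless defined $threshold;

	my ( $total, $finds ) = ( 0, 0 );
	foreach my $word ( split /\s+/, lc $line )
	{
		$word =~ s/^[^a-z]+|[^a-z]+$//g;	# trim punctuation, digits, @ and #
		next unless length $word;
		$total++;
		$finds++ if exists $dict->{ $word };
	}
	return $total > 0 && $finds / $total >= $threshold;
}

# Stand-in word list (assumption); in practice load it from the dictionary file.
my %english = map { $_ => 1 }
	qw( start learning at mid night again did i stop sure not until die );

my @lines = (
	'12.10.2009 17:47 start learning at mid night again. Did I stop learning?',
	'12.10.2009 21:29 @anaelisa04 coooomédia meemo em? =D',
);

# Print only the lines that pass the test
print "$_\n" for grep { is_english( $_, \%english ) } @lines;
```

The threshold of one half mirrors the check in the script above; short tweets with many names or hashtags may need a lower value.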
 
Hi

Nice idea, Duncan. But could you specify the package that provides that file? Thank you.

( The systems I checked show quite different situations: Frugalware - no /usr/share/dict/; Debian - /usr/share/dict/words is a broken symlink; Ubuntu - /usr/share/dict/ exists but is empty. )

Feherke.
feherke.github.com/
 
Hi,

Thank you, Duncan. Like feherke, I would like to ask: can you specify the package?

I have run the code, and it seems the file does not exist:

'Can't open dictionary: No such file or directory at C:....'

Thank you very much.
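
The error above means the word list is not at the hard-coded path. On Debian-family systems the file is usually supplied by a word-list package such as wamerican, and Windows does not ship one at all, so a list has to be obtained separately there. The following is a minimal sketch that takes the path from the command line and falls back to a few candidate locations. The helper name find_dictionary and the candidate paths are assumptions for illustration, not something from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first readable plain file from a list of
# candidate paths, or undef if none of them exists.
sub find_dictionary
{
	my @candidates = @_;
	foreach my $path ( @candidates )
	{
		return $path if defined $path && -f $path && -r $path;
	}
	return;
}

# A path given on the command line wins; the others are common locations
# (assumed here, adjust them for your system).
my $dictionary = find_dictionary(
	$ARGV[0],
	'/usr/share/dict/words',
	'/usr/share/dict/american-english',
	'words.txt',
);

if ( defined $dictionary )
{
	print "Using dictionary: $dictionary\n";
}
else
{
	print "No word list found; pass its path as the first argument\n";
}
```

With this in place, the script could be run as, for example, perl script.pl C:\path\to\words.txt, where both the script name and the path are placeholders.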

 