Suggestions for a spellchecker

Jurafsky · Nov 17, 2012

I've written a small spellchecker.
The script works like this:

It reads the text line by line and compares each word
against a dictionary.

When it finds a word that isn't in the dictionary, it pushes the word
onto an array and generates one or more suggestions for it.

Here is my problem:

I would like to let the user choose the correct word
from the suggested words. Something like this:

We found the word "wlak" in your text which isn't correct.
The suggested possibilities are:
1. walk
2. work

Type the number of the word you want, or 0 if the correct word isn't listed.

Then I would like to substitute the chosen word into the original text (writing the result to a new .txt file).

How can I do this?

prex1 · Nov 17, 2012

I would like to go to the moon.
How can I do this?

http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

Jurafsky · Nov 18, 2012

LoL

I know, it's impossible to help without seeing the code. Here is my code.

Perl:

use diagnostics;
use warnings;

my ($file_dictionary, $word, $line, $line1, $alph, $elt, $w, $transposition, $letter1, $letter2);
my (@word, @altered_word, @filedictionary, @filetext, @dictionary, @addition, @replacement, @transposition, @removal);


$file_dictionary = "lexique.txt";
$file_text = "texte.txt";

#I create an array for the dictionary
open (L, "<", $file_dictionary);
while (defined( $line1 = <L>)) {
	chomp($line1);
	@filedictionary = split (/\s/, $line1);
	push (@dictionary, @filedictionary);
	}
	
#I create an array for the text	
open (T, "<", $file_text);
while (defined( $line = <T>)) {
	chomp($line);
	@filetext = split (/(\s|\pP)/, $line);
	for ($i = 0; $i < @filetext; $i++) {
		if (!grep(/^$filetext[$i]$/, @dictionary)) {
		push (@word, $filetext[$i]);
		}
	}
}

#then I create an array for each word 
foreach $w(@word) {
@altered_word = split (//, $w);

#I create an array for the dictionary
open (L, "<", $file_dictionary);
while (defined( $line1 = <L>)) {
	chomp($line1);
	@filedictionary = split (/\s/, $line1);
	push (@dictionary, @filedictionary);
	}

#first operation --> "palrer" will be "parler"
for (my $i=0; $i < $#altered_word ; $i++)
	{
		@transposition = @altered_word;
		$letter1 = $transposition[$i];
		$letter2 = $transposition[$i+1];
		$transposition[$i] = $letter2;
		$transposition[$i+1] = $letter1;
		
		$transposition = join "", @transposition;
		if (grep(/^$transposition$/, @dictionary))
		{
			print "post transposition : $transposition\n";
		}

	}
	
foreach $elt (0 .. $#altered_word) {
#second operation --> parller will be parler

		@removal = @altered_word;
		splice(@removal, $elt, 1);
		$removal = join "", @removal;
		if (grep(/^$removal$/, @dictionary))
		{
			print "post enlevement : $removal\n";
		}

#third operation --> parer will be parler

	foreach $alph('a' .. 'z') {
	
	@addition = @altered_word;
	splice(@addition, $elt, 0, $alph);
	
	$addition = join "", @addition;
		if (grep(/^$addition$/, @dictionary)) {
			print "post addition : $addition\n";
			}

#last operation  : mancer will be manger
		
	@replacement = @altered_word;
	splice(@replacement, $elt, 1, $alph);
	$replacement = join "", @replacement;
		if (grep(/^$replacement$/, @dictionary)) {
			print "post replacement : $replacement\n";
			}
		}
	}
}

TEXT
French Dictionary

prex1 · Nov 18, 2012

So I see that you essentially compare the words in the text with the words in the dictionary in the following steps:
-the word as is
-the word with each pair of adjacent characters swapped
-the word with each single character deleted
-the word with one character added at every possible place (but parer changed to parler is a bad example, as parer exists in French!)
-the word with each single character replaced by another one.
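To make those four operations concrete, here is a minimal sketch that collects the candidates in one place instead of printing them as they are found. The names [tt]edits1[/tt] and [tt]suggestions[/tt] are mine, not from your script, and I assume the dictionary is held in a hash keyed by word. Unlike your loops, it also tries adding a letter after the last character.

```perl
use strict;
use warnings;

# Return every string one edit away from $word: adjacent transpositions,
# deletions, insertions and replacements over the letters a..z.
# Note: 'a' .. 'z' contains no accented letters, so accented French words
# cannot be produced by insertion or replacement here.
sub edits1 {
    my ($word) = @_;
    my @letters = ('a' .. 'z');
    my %seen;
    my $len = length $word;

    for my $i (0 .. $len) {
        my $head = substr($word, 0, $i);
        my $tail = substr($word, $i);

        # insertion: one letter added at position $i (including the end)
        $seen{ $head . $_ . $tail } = 1 for @letters;

        next if $i == $len;    # nothing left to delete, replace or swap

        # deletion: drop the character at position $i
        $seen{ $head . substr($tail, 1) } = 1;

        # replacement: change the character at position $i
        $seen{ $head . $_ . substr($tail, 1) } = 1 for @letters;

        # transposition: swap the characters at positions $i and $i + 1
        if ($i < $len - 1) {
            $seen{ $head
                 . substr($tail, 1, 1)
                 . substr($tail, 0, 1)
                 . substr($tail, 2) } = 1;
        }
    }
    delete $seen{$word};    # the word itself is not a correction
    return keys %seen;
}

# Keep only the candidates that exist in the dictionary hash, sorted.
sub suggestions {
    my ($word, $dict) = @_;
    my @found = sort grep { exists $dict->{$_} } edits1($word);
    return @found;
}
```

With a dictionary containing walk and work, [tt]suggestions('wlak', $dict)[/tt] returns only walk, because work is more than one edit away from wlak.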
A few notes on your code:
-the dictionary is built twice (and the second time it is re-read for every misspelled word)
-you should close the files you open once you have finished reading them
-[tt]grep(/^$filetext[$i]$/, @dictionary)[/tt] is better written (faster) as [tt]$filetext[$i] ~~ @dictionary[/tt]
-this code

Code:

while (defined( $line1 = <L>)) {
  chomp($line1);
  @filedictionary = split (/\s/, $line1);
  push (@dictionary, @filedictionary);
}

can be written more compactly as

Code:

while (<L>) {
  push @dictionary, split;
}

This will be a little faster, but there is an important difference: [tt]split[/tt] without arguments splits [tt]$_[/tt] on runs of whitespace and ignores leading whitespace, so it does not create empty entries when multiple spaces are encountered (this is likely what you want). In your code the closest equivalent would be [tt]split (/\s+/, $line1)[/tt], which still returns an empty first entry if the line starts with a blank.
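For illustration, here is a minimal sketch of the dictionary loading with the notes above applied. Two things in it are my own assumptions and not taken from your script: the helper name [tt]load_dictionary[/tt], and the use of a hash instead of an array, so that each lookup is a single [tt]exists[/tt] test. The hash also avoids relying on [tt]~~[/tt], which was marked experimental from Perl 5.18 on.

```perl
use strict;
use warnings;

# Read a whitespace-separated word list from an open filehandle into a
# hash, so that each later lookup is one "exists" test, not a full scan.
sub load_dictionary {
    my ($fh) = @_;
    my %dict;
    while (my $line = <$fh>) {
        # split ' ' ignores leading whitespace and runs of blanks, and
        # swallows the trailing newline, so no chomp is needed
        $dict{$_} = 1 for split ' ', $line;
    }
    return \%dict;
}

# The file name is the one from the posted script; adjust as needed.
my $file_dictionary = 'lexique.txt';
if (-e $file_dictionary) {
    # lexical filehandle, checked open, explicit close
    open(my $fh, '<', $file_dictionary)
        or die "Cannot open $file_dictionary: $!";
    my $dict = load_dictionary($fh);
    close $fh;
    printf "%d distinct words loaded\n", scalar keys %$dict;
}
```

A word is then checked with [tt]exists $dict->{$word}[/tt], and the dictionary only needs to be read once, before the loop over the text.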
In essence I think that what you are trying to do is a gigantic task (not like going to the moon, but...), unless of course this is just a divertissement.
Concerning your question, I think a possible strategy would be to retain the punctuation (but what about [tt]guillemets[/tt], apostrophes and ...?) in your text array (possibly using [tt]split /\b/[/tt], though this will also retain the blanks) and then skip those entries during the analysis. At the end your text is rebuilt with a [tt]join[/tt] of the text array.
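A rough sketch of that strategy follows. All the sub names ([tt]tokenize[/tt], [tt]correct_line[/tt], [tt]ask_user[/tt]) are hypothetical, the dictionary is assumed to be a hash of lowercase words, and instead of [tt]split /\b/[/tt] I split on runs of non-letters with a capturing group, which keeps blanks and punctuation as separate tokens. An apostrophe counts as a separator here, so l'arbre becomes three tokens; whether that is right for French is something you would have to decide.

```perl
use strict;
use warnings;

# Split a line into tokens while keeping the separators: the capturing
# group makes split return the non-letter runs as well as the words.
# For accented letters the input should be decoded (for example read
# through an :encoding layer) so that \pL sees characters, not bytes.
sub tokenize {
    my ($line) = @_;
    return grep { length } split /(\PL+)/, $line;
}

# Rebuild a line, replacing each unknown word by whatever $choose returns.
# $suggest->($word) returns the candidate list; $choose->($word, @cands)
# returns the replacement, or undef to leave the word alone.
sub correct_line {
    my ($line, $dict, $suggest, $choose) = @_;
    my @tokens = tokenize($line);
    for my $tok (@tokens) {
        next unless $tok =~ /\pL/;            # skip blanks and punctuation
        next if exists $dict->{ lc $tok };    # known word
        my @cands = $suggest->($tok);
        my $pick  = $choose->($tok, @cands);
        $tok = $pick if defined $pick;        # $tok aliases the array element
    }
    return join '', @tokens;
}

# Interactive chooser: numbered menu on STDOUT, answer read from STDIN.
sub ask_user {
    my ($word, @cands) = @_;
    return undef unless @cands;
    print qq{We found the word "$word" in your text, which isn't in the dictionary.\n};
    print "The suggested possibilities are:\n";
    printf "%d. %s\n", $_ + 1, $cands[$_] for 0 .. $#cands;
    print "Type the number of the correct word, or 0 to keep the original: ";
    my $answer = <STDIN>;
    return undef unless defined $answer && $answer =~ /^\s*(\d+)\s*$/;
    return $1 >= 1 && $1 <= @cands ? $cands[ $1 - 1 ] : undef;
}
```

To produce the new .txt, you would call [tt]correct_line[/tt] on each (chomped) line of the input with [tt]\&ask_user[/tt] as the chooser and print the result, followed by a newline, to a second filehandle opened for writing.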
Good luck
