grammar checker help

Status
Not open for further replies.

grammarian

Programmer
Jun 23, 2006
14
US
Hi there,
I'm a C/C++ and VB programmer and I have a program in mind that I would like to build with the help of Perl. I want to ask you guys if what I want to do is doable.

I want to make a grammar checker.
So I was reading some snippets in Perl and I wanted to know if this was possible. I thought I'd try to write something in Perl that would correct English subjunctives, i.e. wherever it found "If it was" or "if he was" or "if his brother was", it would make it "If it were" and "if he were" and "if his brother were". So that's what I've written:

if ($Text =~ m/(If|if)..*(were){0,0}(was)/g) {
or
if ($Text =~ m/[(If)(if)]..*(were){0,0}(was)/g) {

(from what I understand, this means: find every "If" or "if" that is not followed by "were" before being followed by "was" and that is at least one character apart from the "If" or "if")

Is that right?

And then I don't know what to put inside the braces to make the substitution happen.
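
A minimal sketch of one way the substitution could be written, not a tested grammar rule. Two notes on the patterns above: `(were){0,0}` matches zero repetitions, so it excludes nothing, and `[(If)(if)]` is a character class that matches a single character from the set `(`, `)`, `I`, `f`, `i`. The helper name `fix_subjunctive` and the one-to-three-word limit on the subject are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite "if <subject> was" as "if <subject> were".
# The subject is limited to one to three words, an arbitrary cut-off
# chosen only to keep the example small.
sub fix_subjunctive {
    my ($text) = @_;
    # $1 keeps "if" (in its original case) plus the subject words;
    # the lazy quantifier stops at the first "was" it can reach.
    $text =~ s/\b(if\s+(?:\w+\s+){1,3}?)was\b/${1}were/gi;
    return $text;
}

print fix_subjunctive("If it was up to me, and if his brother was here."), "\n";
```

A pattern like this also rewrites indicative uses such as "I asked if he was home", so it is probably better suited to flagging candidates than to substituting blindly.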

And then I would continue to turn grammar rules into Perl syntax. Do you think this is doable? Do you think there is any grammar rule that would be impossible or maybe too complicated to program in Perl?

Thanks.
 
Not that I am a language expert, but that looks like a terrible way to go about trying to correct grammar. That is just too simple and will not work very well.
 
Thanks, but I'm looking for more analytical critiques, not opinions.
 
English is a very complex language, and to do any sort of non-trivial grammatical analysis you would need to have a good knowledge of linguistic theory and experience in artificial-intelligence programming. Also, Perl is powerful, but it's not traditionally associated with AI; that's more the preserve of Prolog and company. You ask if it's doable - my opinion, sorry analytical critique, is no, at least not for a non-specialist, regardless of the programming language used. Is there any grammar rule too complicated? Yes, absolutely, and that's not counting the near-infinitude of valid usages that are not in the rulebook.
 
Yes, English is a complex language, but what language isn't?
I have this "knowledge of linguistic theory" (not in my head, but in my friend's head), and we're trying to do a grammar checker without relying on A.I. I want to do it solely based on string manipulation, since I'm good at it and I think it can be done. No offense, but you needn't tell me whether it's hard or not. I just wanted to know if Perl is up for the job, and apparently (from what I've been reading) it is.

I've gone over Prolog superficially and I don't think it can do much for me. I'm now starting to think Perl is perfect for this job.

Let me try to be clearer, please.

Let's take the prepositions. A pronoun that follows a preposition must be an object pronoun, never a subject pronoun.
So the grammar checker would have to look for every "on I" and "to he" and "about she" and turn those into "on me" and "to him" and "about her".
I think you'll agree that's easy with Perl.
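
A minimal sketch of the preposition rule described above, assuming a tiny hand-made word list. The names `%object_form` and `fix_pronoun_case` are hypothetical, and the three prepositions are just the ones mentioned in the post.

```perl
use strict;
use warnings;

# Subject pronoun => object pronoun (illustrative subset).
my %object_form = (i => 'me', he => 'him', she => 'her', we => 'us', they => 'them');

my $prep = join '|', qw(on to about);
my $pron = join '|', keys %object_form;

# Hypothetical helper: replace a subject pronoun that directly follows
# one of the listed prepositions with its object form.
sub fix_pronoun_case {
    my ($text) = @_;
    # /e evaluates the replacement as code so the hash lookup can run.
    $text =~ s/\b($prep)\s+($pron)\b/"$1 " . $object_form{lc $2}/gie;
    return $text;
}

print fix_pronoun_case("She talked to he and wrote about she."), "\n";
```

The pattern only sees two adjacent words, so it cannot tell whether the pronoun really belongs to the preposition or starts a new clause.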

Moving on. Now, prepositions can never be followed by a non-gerund verb. So I'd have a list of all the verbs in English and scan the text for prepositions and then maybe use look-ahead to check whether the word right after the preposition is a verb not ending in -ing. If it were, I would add "**prep. followed by gerund" so that the user would know about the mistake.
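
A rough sketch of that marking rule, assuming two short invented word lists in place of "all the verbs in English". The helper name `mark_prep_plus_verb` is hypothetical. The word "to" is left out of the preposition list on purpose, because before a base verb it is usually the infinitive marker ("to go").

```perl
use strict;
use warnings;

# Tiny illustrative word lists; a real checker would need far larger ones.
my @prepositions = qw(about after before by of without);
my @base_verbs   = qw(go eat leave read write);

my $prep_re = join '|', @prepositions;
my $verb_re = join '|', @base_verbs;

# Hypothetical helper: append the marker after a listed preposition that
# is directly followed by a base-form verb from the list.
sub mark_prep_plus_verb {
    my ($text) = @_;
    # The trailing \b keeps "read" from matching inside "reading".
    $text =~ s/\b((?:$prep_re)\s+(?:$verb_re))\b/$1 **prep. followed by gerund/gi;
    return $text;
}

print mark_prep_plus_verb("He left without eat and before reading."), "\n";
```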

The preposition AT is not used with days and dates, so whenever I found "at [any day of the week]" or "at [ordinal number] of [month of the year]", I would mark it as a mistake.

So, what I'm asking is this, is there a way in Perl to do at least those things above? E.g. can I really ask for Perl to look for "at [any day of the week]" and not having to tell him specifically to look for "at Monday", "at Tuesday", "at Wednesday", etc?

Also, can you think of a grammar rule that YOU couldn't possibly write in Perl? Please tell me which rule. I know English has complicated rules. I don't care for that. What I worry about is whether English has any rules that cannot be parsed by Perl.


P.S. I was surfing around reading about Perl and saw something that can be applied to grammar checking. This is it:

The poster wants to put a space after any comma (punctuation, actually, but for simplicity, let's say comma) that is not nestled between two digits.

s/(?<=,          # after a comma,
    (?!          # but not matching
      (?<=\d,)   # digit-comma before, AND
      (?=\d)     # digit afterward
    )
  )/ /gx;        # substitute a space

That's the kind of thing I want to do. Perl amazes me because in C or whatever I would have to write much more than one line to put a space after every punctuation mark not nestled between digits. It would be a pain. It is a pain just to make room for that space in the string, let alone to analyze the string.
 
Thanks, but I'm looking for more analytical critiques, not opinions.

Well, maybe if you are only trying to correct subjunctives it could work. I don't know enough about subjunctives to really know, but at least that narrows the scope of what you are trying to do.
 
Perhaps you could try using the Parse::RecDescent module. Here are some good articles that indicated to me this might help you with your task:



Here is a sample piece of basic (very basic) code I hacked together.


Code:
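use strict;
use warnings;
use Parse::RecDescent;

# Grammar: a listed preposition followed directly by a subject pronoun.
my $parser = Parse::RecDescent->new(q(
    startrule : prepositions pronouns
    prepositions : "on" | "to" | "about"
    pronouns : "i" | "he" | "she"
));

# startrule returns a defined value when the input matches the grammar.
print "Bad Grammar\n" if defined $parser->startrule("on i");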

I have never used this module before so I can't offer any further insight, but perhaps it is a start.

Raklet
 
So, what I'm asking is this, is there a way in Perl to do at least those things above? E.g. can I really ask for Perl to look for "at [any day of the week]" and not having to tell him specifically to look for "at Monday", "at Tuesday", "at Wednesday", etc?

From what you have explained the answer is no because you are only looking to do simple pattern matching. So for just the above you would have to find the pattern "at" and see if it's followed by something it should not be followed by.

Code:
my $text = 'At tuesday and at friday we will have a meeting at 1:00 PM.'; 
$text =~ s/\s+(at)\s+(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day/case($1). ' ' .ucfirst($2). 'day'/ieg;
print $text; 

sub case {
   my $L = shift;
   return ('On') if $L eq ucfirst($L);
   return ('on');
}
 
For my example the code should have been:

Code:
my $text = 'At tuesday and at friday we will have a meeting at 1:00 PM.';
$text =~ s/(at)\s+(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day/case($1). ' ' .ucfirst($2). 'day'/ieg;
print $text;

sub case {
   my $L = shift;
   return ('On') if $L eq ucfirst($L);
   return ('on');
}
 
look for every ... "about she" and turn those into ... "about her"
How about: "Whatever I talked about she was interested in"
whenever I found "at [any day of the week]" ... I would mark it as a mistake
How about: "We'll meet at Friday night's party"
 
Raklet:

Hi there. Thanks for the links, they seem very useful, and I like the syntax.
Thanks for the example too. :)

KevinADC:

Hi Kevin. I understand the spirit of what you're saying. I have no authority to argue much further on that since we're talking specifically about Perl's capacities now and obviously everybody here knows about them more than I do.

Nevertheless I came to find a GPLed grammar checker written in Perl called CoGroo. Unfortunately, like the website itself puts it, CoGroo is ready to check texts in Portuguese only. But they plan to extend it to languages like English and Spanish.

Visiting their website and clicking on "area of tests" and then "Implemented rules" I found, you guessed it, the implemented rules. If you hover the cursor over the numbers in the left column, a tooltip box appears, showing the "Expressao Regular", which I think means "regular expression". Well, hovering over the very first number, 52, I can see that (at least it seems) they could successfully do something of what I was talking about. The regular expression reads "(a|bunch|of|portuguese|words)" V_ "-(another|bunch)". Well, if that V_ means what I think it means (verb), then Perl is the perfect language for the job, because that's the kind of thing I wanted to be able to do. Am I making myself clear on that? Please let me know.

Also, hovering over "53" I can see "por" "que" DET_ N_ V_ "-(another|bunch)". N is probably "noun", and DET is probably "determiner". If they can really do that, I could write the Days of the week rule simply as PREP_ DAYWEEK_.
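
A rough sketch of how a rule written as `PREP_ DAYWEEK_` could be turned into a plain Perl regex. This is only an illustration of the idea, not how CoGroo works: the table `%token_class`, the helper `compile_rule`, and the class contents are all assumptions. It also shows that the days of the week do not have to be spelled out one pattern at a time.

```perl
use strict;
use warnings;

# Hypothetical word classes, each a precompiled pattern.
my %token_class = (
    PREP_    => qr/(?:at|on|in)/i,
    DAYWEEK_ => qr/(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day/i,
);

# Hypothetical helper: turn a rule such as "at DAYWEEK_" into one regex.
# Known class names expand to their pattern; anything else is matched
# as a literal (case-sensitive) word.
sub compile_rule {
    my ($rule) = @_;
    my @parts;
    for my $token (split ' ', $rule) {
        push @parts, exists $token_class{$token}
            ? $token_class{$token}
            : quotemeta $token;
    }
    my $body = join '\s+', @parts;
    return qr/\b$body\b/;
}

my $at_day = compile_rule('at DAYWEEK_');
my $sample = 'We will meet at Tuesday.';
print "flagged: $1\n" if $sample =~ /($at_day)/;
```

A class like `V_` would need a verb list or a tagger behind it, which is where most of the work would be.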

Click here to visit the website translated to English by google:

TonyGroves:

Hi there, Groves.

How about: "Whatever I talked about she was interested in"
Good question. I would have to check for "about" followed by "she" and that would only be wrong if the "she" were followed by a gerund verb, since the construction "I was worried ABOUT HER BEING upset" is perfectly correct, but that's not the case with "I was worried ABOUT SHE BEING upset".

How about: "We'll meet at Friday night's party"
Good one too. In this case "at" is a preposition of specific address, not a preposition of days and dates. I'll have to come up with a way to solve this. I plan on using a very large English corpus to look for patterns in which parts of speech come before or after a valid "at Friday" instance, for example.

I appreciate your examples. But I think that someday someone will have to come up with a complete English grammar checker, and I cannot see why it wouldn't be laudable for anyone to take the first steps.
 
Good luck, you have ambition. It is probably a 2 to 3 year project to get a good beta version running if you start from scratch, and that is if you can get a few good coders and language experts helping you.
 
You may also want to check out this site:


It is a grammar checking engine for Irish and is written completely in Perl. The author says that the program is extensible and can be used to develop engines for other languages. There is a developer's guide and extensive documentation on the website.
 
Thanks, KevinADC.
And Raklet, thanks for the website. That's very much like the CoGroo I found. I downloaded the packages already.

Thank you.
 
grammarian,

I cannot see why it wouldn't be laudable for anyone to take the first steps.

Perhaps then you should applaud those who long ago did just that.

Have you seen the Microsoft Word grammar checker? Do you suppose that it just 'fell' together? You don't think it represents an astronomical investment in time and mental energy, both by linguists and programmers? And it is far from perfect. I don't work for Microsoft and have absolutely no vested interest in promoting Mikeysoft, but there is a point to be made.

If, on the other hand, you meant that you should be lauded for taking the first step, then, in that case ...
(ROTFL Smiley withheld to protect the young.)
Excuse me, I don't want to seem insensitive while I'm on camera.

The language is full of just such examples as others pointed out.
Code:
Everyone I talked to she called.
Your pronoun case-corrector will change that to:
Code:
Everyone I talked to her called.
This is assuming your pronoun object case-corrector follows your subject case-corrector. But in that case, your system will also change
Code:
All to me seems interesting.
to ...
Code:
All to I seems interesting.
and then if you follow with checking the third-person singular rule, will afterward change to
Code:
All to I seem interesting.
and if then you fire the preposition-object rule, you'll get
Code:
All to me seem interesting
...which is grammatically correct, but changes the meaning of the sentence. These are the kinds of things that will routinely happen if you only use the simplicity of Perl regular-expression type sequential filtering. Of course, the comma that most would put in the first two sentences might help, but most text is riddled with nonstandard practice. How will you deal with that? The results will make for an interesting study in NLP techniques, but they may not be useful for practical text processing. Long before you get to the point where Perl starts becoming competent for any serious grammar-checking, it will be loaded with rules to handle exceptions and be too slow to wait for.

All modern language processing of any value uses a variety of techniques, including probabilistic "guesstimating" that some structure is likely to be the rule rather than the exception. Perl regular expressions will only take you so far. They won't get you anywhere near the grammar checker in Microsoft Word.

I don't want to disappoint you, or discourage you from continuing to study these techniques, but you're reinventing a stone wheel.

Do a Google on the Viterbi algorithm and go in that direction if you want to take your ideas to the next level. This is something that you could experiment with in Perl and go rather a long way---even produce some useful results. There are many useful language processing tasks that can be done very nicely in Perl. A limited grammar checker is a doable project, but it will be huge if it is to be useful to any degree. If it wasn't fully automatic, but gave the user a simple, single-keystroke interface to accept or reject suggested changes (or open a menu for more options), that could be the difference between a useful program and a gibberish generator.
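
One common use of the Viterbi algorithm in language processing is picking the most likely sequence of part-of-speech tags for a sentence under a hidden Markov model. The sketch below is a toy version of that, assuming a four-tag set and a handful of words. Every probability in it is invented for illustration, and the names `viterbi`, `%start`, `%trans` and `%emit` are hypothetical.

```perl
use strict;
use warnings;

# Toy model: every number below is invented for illustration.
my @tags  = qw(PRON VERB PREP NOUN);
my %start = (PRON => 0.6, VERB => 0.1, PREP => 0.1, NOUN => 0.2);
my %trans = (    # $trans{previous tag}{next tag}
    PRON => { PRON => 0.05, VERB => 0.80, PREP => 0.10, NOUN => 0.05 },
    VERB => { PRON => 0.30, VERB => 0.05, PREP => 0.35, NOUN => 0.30 },
    PREP => { PRON => 0.40, VERB => 0.05, PREP => 0.05, NOUN => 0.50 },
    NOUN => { PRON => 0.10, VERB => 0.50, PREP => 0.30, NOUN => 0.10 },
);
my %emit = (     # $emit{tag}{word}
    PRON => { she => 0.5, her => 0.5 },
    VERB => { talked => 0.6, called => 0.4 },
    PREP => { to => 0.7, about => 0.3 },
    NOUN => { party => 1.0 },
);

# Return the most likely tag sequence for the given words.
sub viterbi {
    my @words = @_;
    return () unless @words;
    my (@score, @back);
    for my $t (0 .. $#words) {
        for my $tag (@tags) {
            # Words the model has never seen get a tiny probability.
            my $e = $emit{$tag}{ lc $words[$t] } // 1e-6;
            if ($t == 0) {
                $score[0]{$tag} = $start{$tag} * $e;
                next;
            }
            # Find the best previous tag leading into this one.
            my ($best_prev, $best) = ('', -1);
            for my $prev (@tags) {
                my $p = $score[$t - 1]{$prev} * $trans{$prev}{$tag};
                ($best_prev, $best) = ($prev, $p) if $p > $best;
            }
            $score[$t]{$tag} = $best * $e;
            $back[$t]{$tag}  = $best_prev;
        }
    }
    # Start from the best final tag and follow the back-pointers.
    my ($last) = sort { $score[-1]{$b} <=> $score[-1]{$a} } @tags;
    my @path = ($last);
    for (my $t = $#words; $t > 0; $t--) {
        unshift @path, $back[$t]{ $path[0] };
    }
    return @path;
}

print join(' ', viterbi(qw(she talked to her))), "\n";
```

A real tagger would estimate these tables from a tagged corpus and would normally add log-probabilities rather than multiply raw ones, to avoid underflow on long sentences.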

You might also search for Earley parser, and CYK parser, to name two popular methods for determining the parse tree for a sentence. You could implement these in Perl also, but for practical use you'd by then be looking to move to C.

First you need the parse tree for a sentence (Well, the most likely best one of an exponential number of possible ones, anyway) before it makes any sense to 'check' the grammar. What you describe looks at only very local relationships--bigrams, to be specific. Adjacency has its place, but that alone won't get you very far here.
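
To make "bigrams" concrete, here is a small sketch that lists every pair of adjacent words in a sentence. The helper name `bigrams` and the simple word pattern are assumptions. A rule such as "preposition followed by subject pronoun" only ever looks at one of these pairs at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lowercase words and return every
# adjacent pair (bigram) as a two-element array reference.
sub bigrams {
    my ($text) = @_;
    my @words = map { lc $_ } $text =~ /[A-Za-z']+/g;
    return map { [ @words[ $_, $_ + 1 ] ] } 0 .. $#words - 1;
}

print join(', ', map { "($_->[0] $_->[1])" } bigrams("Everyone I talked to she called.")), "\n";
```

The pair (to she) appears in the list, but nothing in it says that "she" is the subject of "called", which is the information a parse would supply.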

Prolog is handy for breadboarding ideas best represented as recursions (like parse trees), but too cumbersome for all but some relatively simple real-world applications. Perl will help you learn how to do things, but unless your application is relatively limited, will soon prove too bulky. You likely won't be able to do any industrial-strength NLP grammar checking in Perl. You are talking about an extremely expensive task, in case you don't yet know that.

Press on for the sake of the understanding you'll gain, by all means, but don't entertain any illusions about the territory you're charting. It's already well-mapped insofar as anything you've thus far described. You haven't left your own backyard yet. (I hate it when people tell me that.)

--torandson

 
Yes, I've seen MS word grammar checker. It represents all that you said and yes, it is far from perfect. But it is going to remain far from perfect for a while since no one can help them (it's proprietary).

I don't mean I should be applauded for my initiative. Sorry if I conveyed that sense.

Thanks for the examples, I'll put them to use.

By NLP do you mean "neuro-linguistic programming"?
If so, then no, the results won't make for interesting studies in NLP, because NLP is pseudoscience. PSEUDOSCIENCE.

Then what is the MSword grammar checker using? If not Perl, then what? That's one of my previous questions. Tell me what's the best language on which to write a grammar checker, please.

Yes, I don't want it to be fully automatic. It will have to get input from the user, ask him some questions on what he meant.

I don't entertain any illusions about anything, rest assured. Don't worry about me, I'll be fine, thank you. :)

And thanks for the insights. I'm geared toward a non-parse-trees approach, since I think it can be done without all that bullshit. Remember: I want a program that will ask the user for input, not a fully automatic one. And I think I can do it without parse trees.

My goal here is to make a COMPLETE grammar checker, one that checks for every grammar rule and aspect. I don't care if my program has to ask the user about ALL the supposed "corrections" it makes. Really. I just don't want to have to leaf through my Grammar's thousands of pages anymore. I need a program that analyses everything for me and then I'll go along with it through its "corrections" and tell it what it got right and what it got wrong. By the end of it I'll have a grammatically correct text without ever leafing through a page.
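
A minimal sketch of that "suggest, then let the user decide" workflow, assuming a single illustrative rule. The rule table `@rules` and the helpers `suggestions` and `apply_accepted` are hypothetical names. The callback passed to `apply_accepted` stands in for whatever prompt or keystroke the real program would use.

```perl
use strict;
use warnings;

# Hypothetical rule table: a name, a pattern, and a sub that builds the
# suggested replacement from the first capture group.
my @rules = (
    {
        name    => 'at + day of the week',
        pattern => qr/\bat\s+((?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)\b/i,
        fix     => sub { "on " . ucfirst lc $_[0] },
    },
);

# Collect suggestions without changing the text.
sub suggestions {
    my ($text) = @_;
    my @found;
    for my $rule (@rules) {
        my $re = $rule->{pattern};
        while ($text =~ /$re/g) {
            push @found, {
                rule   => $rule->{name},
                offset => $-[0],                  # start of the match
                length => $+[0] - $-[0],
                found  => substr($text, $-[0], $+[0] - $-[0]),
                fix    => $rule->{fix}->($1),
            };
        }
    }
    return @found;
}

# Apply only the suggestions the callback accepts, working from right to
# left so that earlier offsets stay valid.
sub apply_accepted {
    my ($text, $accept, @found) = @_;
    for my $s (sort { $b->{offset} <=> $a->{offset} } @found) {
        substr($text, $s->{offset}, $s->{length}) = $s->{fix} if $accept->($s);
    }
    return $text;
}

my $draft = "We met at monday. We'll meet at Friday night's party.";
my @found = suggestions($draft);
# Stand-in for the user's answers: accept the first case, reject the second.
print apply_accepted($draft, sub { $_[0]{found} =~ /monday/i }, @found), "\n";
```

Because nothing is changed until a suggestion is accepted, a false alarm such as "at Friday night's party" costs the user one rejection instead of corrupting the text.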

Now, you shouldn't entertain any illusions on the truthfulness of NLP. ;)
 
grammarian,

"NLP" does not mean what you guessed. It simply means "Natural Language Processing." "Natural Language" simply means a human language, like English or Latin, as opposed to 'artificial' languages, such as the language of nested parentheses ( () ( () ) ) or the language of palindromes using the letters "a" and "b" such as aabbaa, abba, ababaabaababa, etc.

MS Word grammar checker must be using C or machine language to be as fast as it is. And some pretty sophisticated probabilistic techniques as well, I would think.

And cars could be made of wood without bothering with iron or rubber or any of that BS, too, I'm sure, provided they were also intended for a very limited use. I must say, your approach is amusing, if nothing less. Don't forget to let us all know when it's ready to be marketed, okay? (Heavyhanded? Yes. But you will understand why I must force myself to stop laughing after you have gotten to the point where you realize more completely the few potential applications for and many limitations of your brainchild.)

If you want speed of processing, use C and write your own regular expression engine, but you may spend more time writing the code than it ever saves you. If you have nothing but time, go for it.

If you want to develop something quickly, write it in Perl. The bigger and more complex it is, the slower it will be. If you're going to need to 'approve' of each item as it finds them, the user interaction will be your biggest speed hurdle. If you only check for a few grammatical errors and let a lot of things slide, it could even be useful, to a limited degree.

If you're doing this to learn about language processing and computational linguistics, have at it either way!

If you're doing this to save time on future book reports, save yourself a couple dozen years and just go buy a blue pencil. Then hire a secretary.

Can't afford a secretary? Not a valid argument here. I once went gold panning up in the mother lode. Spent two weeks doing backbreaking work for less than 1/20 oz of gold dust---worth less than $20 at recent prices.

Had I swept floors at $5/hr for two weeks, I could have bought a whole ounce of gold doing much less work.

(Of course, it wouldn't have been as much FUN! But that's another issue.)

--torandson
 