morphological and syntactic analysis on a text 1

Status
Not open for further replies.

Jurafsky

Programmer
Oct 27, 2012
Hi all,
I'm a newbie.
I'm trying to do an exercise, but something isn't working yet.
I have a text in French and I would like to analyse it:

"viens demain de bon matin"

I would like to analyse the text using arrays of endings. Some endings are shared between word classes (for example, "ain" is both an adjectival and a nominal ending, so "demain" should be displayed as "demain NOM ADJ").

How can I do this?

Here is my code.

Perl:
#!/usr/bin/perl
use warnings;
use diagnostics;
use Data::Dumper;
$Data::Dumper::Terse = 1;
$Data::Dumper::Indent = 0;
 
 #scalars and arrays
 
 my ($m, $file, $suf_adv, $line);
 my (@suf_nom, @suf_adj, @suf_verb);
 my %punt;


@suf_nom = qw(ard ain);
@suf_adj = qw(eux ain);
@suf_verb = qw(est iser ifier eter iller ouiller);
$suf_adv = "ment" ;
%punt = (
	' '   => 1,
	''    => 1,
	','   => 1,
	"'"   => 1,
	'.'   => 1,
	'?'   => 1,
	';'   => 1,
	'!'   => 1,
	'-'   => 1,
	':'   => 1,
	'?'   => 1
);

$line = "viens me trouver demain de bon matin, ... \n";

@words = split (/(\pP|\pS|\s)/, $line);
	
foreach $w(@words) {
	if ($w =~ m/$suf_nom[$_]/ and length($w) >= 6 and !exists $punt{$w}) {
		print "$w NOM\n";
	}
	elsif ($w =~ m/$suf_adj[$_]/ and length($w) >= 6 and !exists $punt{$w}) {
		print "$w ADJ\n";
	}
	elsif ($w =~ m/$suf_verb[$_]/ and length($w) >= 6 and !exists $punt{$w}) {
		print "$w V\n";
	}
	else {
		print "NC\n";
	}
}


I'd like to get this output:
viens
me
trouver
demain ADJ NOM
de
bon
matin
 
I tried this:

Jurafsky.pl
Code:
#!/usr/bin/perl
use strict;
use warnings;

# define arrays
my @suf_nom = qw(ard ain);
my @suf_adj = qw(eux ain);
my @suf_verb = qw(est iser ifier eter iller ouiller);
my @suf_adv = qw(ment);
my @punt = split ("", ",'.?;!-:?");

my $line = "viens me trouver demain de bon matin, ... \n";

# find punctuation characters and replace them with space
my @punt_found = ();
foreach my $p (@punt) {
  if ($line =~ /[$p]/) {
    # add to array
    push(@punt_found, $p);
    # replace punctuation with space
    $line =~ s/[$p]/ /g;
  }
}

# split line to array by space
my @words = split (/\s+/, $line);

foreach my $word (@words) {
  my @word_class = ();
  my $pattern = "";

  push(@word_class, $word);

  # check for NOM
  foreach $pattern (@suf_nom) {
    if ($word =~ /$pattern/) {
      push(@word_class, "NOM");
    }
  }

  # check for ADJ
  foreach $pattern (@suf_adj) {
    if ($word =~ /$pattern/) {
      push(@word_class, "ADJ");
    }
  }

  # check for VERB
  foreach $pattern (@suf_verb) {
    if ($word =~ /$pattern/) {
      push(@word_class, "VERB");
    }
  }

  # check for ADV
  foreach $pattern (@suf_adv) {
    if ($word =~ /$pattern/) {
      push(@word_class, "ADV");
    }
  }

  # print result
  printf "%s\n", join(" ", @word_class);
}

print "\nPunctuation characters found: ";
printf "%s\n", join(" ", @punt_found);

Output:
Code:
C:\Work>perl Jurafsky.pl
viens
me
trouver
demain NOM ADJ
de
bon
matin

Punctuation characters found: , .
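
One caveat about the suffix checks in the script: `/$pattern/` matches the ending anywhere in the word, so a word such as "maintenant" would also be tagged NOM ADJ because it contains "ain". A minimal sketch of one way around this, assuming the endings should only match at the end of a word. The `%suffixes` table, the `@order` list and the `classify` sub are hypothetical names for illustration, not part of the script above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table: word class => list of endings (same endings as in this thread)
my %suffixes = (
    NOM  => [qw(ard ain)],
    ADJ  => [qw(eux ain)],
    VERB => [qw(est iser ifier eter iller ouiller)],
    ADV  => [qw(ment)],
);

# Classes are checked in a fixed order so the output is stable
my @order = qw(NOM ADJ VERB ADV);

# Return the list of classes whose ending matches at the END of the word
sub classify {
    my ($word) = @_;
    my @classes;
    foreach my $class (@order) {
        # \Q...\E quotes the ending, the trailing $ anchors it to the end of the word
        if (grep { $word =~ /\Q$_\E$/ } @{ $suffixes{$class} }) {
            push @classes, $class;
        }
    }
    return @classes;
}

foreach my $word (qw(demain maintenant heureux)) {
    print join(" ", $word, classify($word)), "\n";
}
```

With these three sample words it prints "demain NOM ADJ", "maintenant" and "heureux ADJ". Whether anchoring is wanted depends on the exercise; the length check (at least 6 characters) from the original question is left out here.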
 
Now I see that the character '?' appears twice in the array @punt. Delete one of them.
 
It's perfect! I used the hash "punt" for the punctuation. Is there a way to print the result outside the first foreach loop?
 
In my case @punt isn't a hash, it's just an array that contains the punctuation characters. I don't need a hash.
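
Both approaches can work for this task. A rough sketch of the difference, assuming single-character punctuation tokens; the names `@punt_list`, `%punt_hash` and `$punt_re` are illustrative only. A hash gives a direct `exists` test for a whole token, while an array of characters is handy for building a character class:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @punt_list = split //, ",'.?;!-:";          # array of punctuation characters
my %punt_hash = map { $_ => 1 } @punt_list;    # same characters as hash keys

# Build one character class from the array; quotemeta escapes '-' and similar
my $class   = join "", map { quotemeta($_) } @punt_list;
my $punt_re = qr/[$class]/;

foreach my $token ("demain", ",", ".") {
    my $by_hash  = exists $punt_hash{$token} ? "yes" : "no";
    my $by_regex = $token =~ $punt_re        ? "yes" : "no";
    print "'$token': hash=$by_hash regex=$by_regex\n";
}
```

The hash test only succeeds when the whole token is a punctuation character, whereas the regex also matches punctuation inside a longer string, which is what the replace-with-space step relies on.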

Jurafsky said:
Is there a way to print the result apart from the first foreach loop?
What result? I don't understand what you mean.
 
You probably mean something like this:

Code:
#!/usr/bin/perl
use strict;
use warnings;

# define arrays
my @suf_nom = qw(ard ain);
my @suf_adj = qw(eux ain);
my @suf_verb = qw(est iser ifier eter iller ouiller);
my @suf_adv = qw(ment);
my @punt = split ("", ",'.?;!-:?");

my $line = "viens me trouver demain de bon matin, ... \n";

# find punctuation characters and replace them with space
my @punt_found = ();
foreach my $p (@punt) {
  if ($line =~ /[$p]/) {
    # add to array
    push(@punt_found, $p);
    # replace punctuation with space
    $line =~ s/[$p]/ /g;
  }
}

# split line to array by space
my @words = split (/\s+/, $line);

my @all_word_classes = ();
foreach my $word (@words) {
  my @word_class = ();
  my $pattern = "";

  push(@word_class, $word);

  # check for NOM
  foreach $pattern (@suf_nom) {
    if ($word =~ /$pattern/) {
      push(@word_class, "NOM");
    }
  }

  # check for ADJ
  foreach $pattern (@suf_adj) {
    if ($word =~ /$pattern/) {
      push(@word_class, "ADJ");
    }
  }

  # check for VERB
  foreach $pattern (@suf_verb) {
    if ($word =~ /$pattern/) {
      push(@word_class, "VERB");
    }
  }

  # check for ADV
  foreach $pattern (@suf_adv) {
    if ($word =~ /$pattern/) {
      push(@word_class, "ADV");
    }
  }

  # add reference to array
  push(@all_word_classes, \@word_class);
}

# print results
print "Results:\n";
print "--------\n";
foreach my $word_class (@all_word_classes) {
  # dereference and print
  printf "%s\n", join(" ", @{$word_class});
}
print "\nPunctuation characters found: ";
printf "%s\n", join(" ", @punt_found);

As you can see, in the main foreach loop I only store the array reference \@word_class in the array @all_word_classes. So, after the foreach loop, @all_word_classes contains references to all the individual @word_class arrays (one for every word processed). I can then dereference them somewhere else and print all their elements.
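
The same idea in isolation, as a small sketch. The two sample words and their tags are hard-coded purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @all_word_classes;

foreach my $word (qw(demain matin)) {
    # a fresh array is created on every iteration ...
    my @word_class = ($word);
    push @word_class, "NOM", "ADJ" if $word eq "demain";   # hard-coded for the demo
    # ... so each stored reference points to its own array
    push @all_word_classes, \@word_class;
}

# later, outside the loop: dereference each stored array and print it
foreach my $ref (@all_word_classes) {
    print join(" ", @{$ref}), "\n";
}
```

This works because `my @word_class` inside the loop body creates a new array each time round. If the array were declared once outside the loop, every stored reference would point to the same array.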

Output:
Code:
C:\Work>perl Jurafsky.pl
Results:
--------
viens
me
trouver
demain NOM ADJ
de
bon
matin

Punctuation characters found: , .
 
If you want the output for your line
Code:
viens me trouver demain de bon matin, ...
to be
Code:
viens
me
trouver
demain NOM ADJ
de
bon
matin
, PUNT
. PUNT
. PUNT
. PUNT
then first delimit every punctuation character with spaces and then split the string into an array.
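
A minimal sketch of that last step, under the assumption that the punctuation set is the one from the @punt list in this thread. The suffix checks are left out, so "demain" is printed without NOM ADJ here; the variable names `$spaced` and `@tokens` are only illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "viens me trouver demain de bon matin, ... \n";

# surround every punctuation character with spaces
# (the character class is an assumption based on the @punt list in this thread)
(my $spaced = $line) =~ s/([,'.?;!:-])/ $1 /g;

# split on whitespace; the special ' ' pattern also skips leading whitespace
my @tokens = split ' ', $spaced;

foreach my $token (@tokens) {
    if ($token =~ /^[,'.?;!:-]$/) {
        print "$token PUNT\n";
    }
    else {
        print "$token\n";   # the NOM/ADJ/VERB/ADV suffix checks would go here
    }
}
```

Because each punctuation character becomes its own token, the three dots of "..." are reported as three separate PUNT lines.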
 