Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Substitute (swap) pairs of words

Status
Not open for further replies.

Kirsle

Programmer
Jan 21, 2006
1,179
US
Hey,

How would the regular expression look that would swap or substitute multiple-word pairs, but not undo itself?

For example:

Code:
# Swap 1st and 2nd person pronouns
my %person = (
   'I am'    => 'you are',
   'you are' => 'I am',
   'I\'m'    => 'you\'re',
   'you\'re' => 'I\'m',
   'your'    => 'my',
   'my'      => 'your',
);
my $str = "I think you want me to say I'm not your scalar.";

foreach my $key (keys %person) {
   $str =~ s/$key/$person{$key}/ig;
}

# The output SHOULD be...
you think I want you to say you're not my scalar

# What it probably WOULD be...
I think you want me to say I'm not your scalar

Because, s/I am/you are/ig and s/you are/I am/ig will contradict each other. Is there a way to just swap them without undoing itself on a reverse substitution? Maybe something like this?

Code:
$str =~ tr/(I am)(you are)/(you are)(I am)/;

I wanna find out how to do this because I'm working on an A.I. system for chatterbots that has substitutions (things like "what's = what is, you're = you are") and person substitutions (swap 1st and 2nd person pronouns), and the way I'm currently doing it is to split the string into an array of words, and replace individual words, like...

Code:
my @words = split(/\s+/, $str);
foreach my $w (@words) {
   foreach my $key (keys %person) {
      if ($w eq $key) {
         $w = $person{$key};
      }
      push (@new, $w);
   }
}
$str = join(" ",@new);

But that only works on single-word substitutions. Ideally it should work for multi-word ones too.

Any help will be appreciated. :)

-------------
Cuvou.com | My personal homepage
Project Fearless | My web blog
 
Maybe first change the multiple words to single words, like "I am" to "I'm". Then see how well the array method works. I can't see any way using a hash table and a regexp to try and enforce all those rules on a string would work. You have no control over which rule in the hash table is evaluated first, that right there has the potential to kill the whole process I would think. Here it is with some more rules thrown in:

Code:
use strict;
use warnings;
my %contractions = (
   'I am'    => q{I'm},
   'you are' => q{you're},
);
my %person = (
   'I am'    => 'you are',
   'you are' => 'I am',
   'I\'m'    => 'you\'re',
   'you\'re' => 'I\'m',
   'your'    => 'my',
   'my'      => 'your',
   'I'       => 'you',
   'you'     => 'I',
   'me'      => 'you'
);
my $str = "I think you want me to say I am not your scalar.";
print "Before: $str\n";
foreach my $key (keys %contractions) {
   $str =~ s/\b$key\b/$contractions{$key}/ig;
}
print "After: $str\n";
my @words = split(/\s+/, $str);
foreach my $w (@words) {
   $w = $person{$w} || $w;
   push (@new, $w);
}
#}
my $newstr = join(" ",@new);
print "After: $newstr";

output:

Before: I think you want me to say I am not your scalar.
After: I think you want me to say I'm not your scalar.
After: you think I want you to say you're not my scalar.


------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
The double words should be removed from %person although leaving them in will have no affect:

Code:
my %person = (
   'I\'m'    => 'you\'re',
   'you\'re' => 'I\'m',
   'your'    => 'my',
   'my'      => 'your',
   'I'       => 'you',
   'you'     => 'I',
   'me'      => 'you'
);

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
This should work, as the string is only traversed once from left to right, meaning that the same part of the string won't be replaced twice:
Code:
my $match = join '|', keys %person;

$str =~ s/\b(\Q$match\E)\b/$person{$1}/ig;
 
I suppose you know that your goal, like many other attempts to manipulate natural languages, is not an easy one...
First you should separate multiple words from single ones and replace them in two successive operations (this applies more specifically to single words that are contained into multiple ones). Also, if you can have double words that are contained in a triple word (not in your examples above), you need to replace first triple words, then doubles, then singles, and so on.
Another point is that, just as when swapping two variables you need an intermediate operation ([tt]$tmp=$first; $first=$second; $second=$tmp;[/tt]), you'll need an intermediate step also here. I used the trick of making replacements surrounded by underscores (it is a way of marking them as 'already replaced'), but I'm sure that other methods may be used. I chose the underscores because, as you will understand from the code, a 'word' character is required to stop further replacements; of course, however, there can't be underscores in your text...
Another trick is to use precompiled regular expressions: should be faster, but that's a minor point here.
Code:
# Swap 1st and 2nd person pronouns
use strict;
use warnings;
my%multiple=(qr/\bI am\b/i => '_you are_',
  qr/\byou are\b/i => '_I am_',
  qr/\bI\'m\b/i => '_you\'re_',
  qr/\byou\'re\b/i => '_I\'m_'
);
my%single=(qr/\byour\b/i => '_my_',
  qr/\bmy\b/i => '_your_',
  qr/\bI\b/i => '_you_',
  qr/\byou\b/i => '_I_',
  qr/\bme\b/i => '_you_'
);
my$str="I think you want me to say I'm not your scalar.";
for my$key(keys%multiple){
  $str=~s/$key/$multiple{$key}/g;
}
for my$key(keys%single){
  $str=~s/$key/$single{$key}/g;
}
$str=~tr/ _/ /s;
$str=~s/^\s+//;
$str=~s/\s+$//;
$str=ucfirst$str;
print$str,"\n";
Output:
You think I want you to say you're not my scalar.

Franco
: Online engineering calculations
: Magnetic brakes for fun rides
: Air bearing pads
 
I eventually solved this problem by something along the lines of...

Code:
sub _personSub {
  my ($self,$string) = @_;

  # Substitute each of the sorted person sub arrays in order,
  # using a one-way substitution algorithm (read: base13).
  foreach my $pattern (@{$self->{sortlist}->{person}}) {
    print "sub $pattern\n";
    my $result = $self->{person}->{$pattern};
    $result =~ tr/A-Za-z/N-ZA-Mn-za-m/;

    $string =~ s/$pattern/<per>$result<rep>/ig;
    print "str: $string\n";
  }

  # Now rot13-decode what's left.
  while ($string =~ /<per>(.+?)<rep>/i) {
    my $rot13 = $1;
    $rot13 =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    $string =~ s/<per>(.+?)<rep>/$rot13/i;
  }

  return $string;
}

where...

Code:
$self = {
   person => {
      'you are' => 'I am',
      'I am'    => 'you are',
      'you\'re' => 'I\'m',
      'i\'m'    => 'you\'re',
      'you'     => 'me',
      'me'      => 'you',
      'my'      => 'your',
      'your'    => 'my',
      'i'       => 'you',
   },
   sortlist => {
      person => [
        "you are",
        "i am",
        "you're",
        "your",
        "i'm",
        "you",
        "my",
        "i",
      ],
   },
};

What I did was added a subroutine that, given an array, would sort the list and store it under $self->{sortlist}, like so:

Code:
sub sortList {
  my ($self,$name,@list) = @_;

  # If a sorted list by this name already exists, delete it.
  if (exists $self->{sortlist}->{$name}) {
    delete $self->{sortlist}->{$name};
  }

  # Initialize the sorted list.
  $self->{sortlist}->{$name} = [];

  # Track by number of words.
  my $track = {};

  # Loop through each item in the list.
  foreach my $item (@list) {
    # Count the words.
    my @words = split(/\s+/, $item);
    my $cword = scalar(@words);

    # Store this by group of word counts.
    if (!exists $track->{$cword}) {
      $track->{$cword} = [];
    }
    push (@{$track->{$cword}}, $item);
  }

  # Sort them.
  my @sorted = ();
  foreach my $count (sort { $b <=> $a } keys %{$track}) {
    print "sorting by $count words\n";
    my @items = sort { length $b <=> length $a } @{$track->{$count}};
    push (@sorted,@items);
  }

  print "sorted list:\n"
    . join("\n",@sorted) . "\n";

  # Store this list.
  $self->{sortlist}->{$name} = [ @sorted ];
  return 1;
}

So... it groups arrays up by the number of words in each pattern, then sorts each array from the most words to the least where each array itself is sorted by length. And then when it actually goes to run the substitutions, it ROT13 encodes the results and puts a little marker around it, and after doing all that, it loops through and ROT13 decodes all the text between those markers.

Kinda a messy solution, but it gets the job done...

Code:
// person test
+ {weight=500}say *
- Umm... "<person>"

...

You> say you are a robot so you don't have to type, but I do because I'm a human
sub you are
str: <per>V nz<rep> a robot so you do not have to type but i do because i am a human
sub i am
str: <per>V nz<rep> a robot so you do not have to type but i do because <per>lbh ner<rep> a human
sub you're
str: <per>V nz<rep> a robot so you do not have to type but i do because <per>lbh ner<rep> a human
sub your
str: <per>V nz<rep> a robot so you do not have to type but i do because <per>lbh ner<rep> a human
sub i'm
str: <per>V nz<rep> a robot so you do not have to type but i do because <per>lbh ner<rep> a human
sub you
str: <per>V nz<rep> a robot so <per>V<rep> do not have to type but i do because <per>lbh ner<rep> a human
sub my
str: <per>V nz<rep> a robot so <per>V<rep> do not have to type but i do because <per>lbh ner<rep> a human
sub i
str: <per>V nz<rep> a robot so <per>V<rep> do not have to type but <per>lbh<rep> do because <per>lbh ner<rep> a human
Bot> Umm... "I am a robot so I do not have to type but you do because you are a human"

-------------
Cuvou.com | My personal homepage
Project Fearless | My web blog
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top