Perl: string comparison

Status
Not open for further replies.

diera

Programmer
Mar 21, 2011
28
DE
Hi all,

I would like to extract the messages that contain question indicators, such as a question mark or the strings 'what', 'where', 'who', 'how' at the beginning of the sentence.

Here is my code:

Code:
use strict;
#use warnings;
use diagnostics;

open( INFILE, "tweets.data" )
  or die("Can not open input file: $!");

open MYFILE, ">qtweet.txt";
select MYFILE;

while (<INFILE>)
{
    /^\S(.*)$/ or die "Invalid input:  $_";

    my $tweet = $1;
    if ( $_ =~ /\?/ ) {
        print "$_";
    }
    elsif ( $_ =~ m/^what/ ) {
        print "$_";
    }
    elsif ( $_ =~ m/^where/ ) {
        print "$_";
    }
    else {
        print "";
    }
}

The output that I got:

-result-
1. @thekaitling Where are we going to go?
2. @schmeenarf Where are you going??
3. Half Moon Bay, home away from home. Windy, foggy, le tout industry here for Strategy 2009. Where else do AMD, Intel, & VIA share a stage?
4. @Crazycanuckblog Challah? wow, that's amazing. I wouldn't even know where to begin:)
5. this is where you say "hello world" i guess. reflexive, is it not?

Any help is much appreciated.
 
Hi

diera said:
What do you mean by "result"? Is that the current result or the desired result? Anyway, please post the matching fragment from tweets.data too.

A description of your intentions would also be helpful. I mean, add some comments to your code. For example, explain what should happen with that $tweet variable.

In the meantime, note that regular expressions are case sensitive by default. You may want to use the i flag.
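
A minimal sketch of what the i flag changes, run against a few made-up sample strings. The helper name starts_with_where is hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text starts with "where" in any letter case.
sub starts_with_where {
    my ($text) = @_;
    return $text =~ /^where/i ? 1 : 0;
}

for my $text ('Where are we going to go?', 'where are you going??', 'I know where') {
    # Without the i flag only the all-lowercase spelling matches.
    my $sensitive   = $text =~ /^where/          ? 'match' : 'no match';
    my $insensitive = starts_with_where($text)   ? 'match' : 'no match';
    print "$text | /^where/: $sensitive | /^where/i: $insensitive\n";
}
```

The ^ anchor still ties the match to the start of the string, so the flag only affects letter case, not position.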


Feherke.
 
hi,

My intention is to extract all the question tweets. There are also tweets that do not contain '?' but are actually questions.

Code:
use strict;
#use warnings;
use diagnostics;

open( INFILE, "tweets.data" )
  or die("Can not open input file: $!");

open MYFILE, ">qtweet.txt";
select MYFILE;

while (<INFILE>)
{
    /^\S(.*)$/ or die "Invalid input:  $_";

    my $tweet = $1;
    if( $_ =~/\?/ ) {  #extract all tweet that contain ? at the end of sentence.
        print "$_";
    }
    elsif ($_=~m/^what/) { #extract all tweet that contain 'what, where, who' at the start of sentence.
        print "$_";
    }
    elsif ($_=~m/^where/i) {
        print "$_";
    }
    else {
        print "";
    }
}

Another problem is that the dataset I use does not have a consistent structure (see numbers 2 and 5). How should I handle the error when a line contains only the tweet? The standard structure is:

Userid date time tweet

-tweet data-
1. RedHotCopy 2009-09-25 17:45:05 @clydetombaugh You kinda, sorta asked my question to LeBron bout how he stays so grounded. Cool! He's righteous (and hot!)
2.
3. sbennaeim 2009-08-25 18:59:33 I just setup @EzineArticles to tweet my newly published articles:
4. fantasticEDcure 2009-10-08 05:58:28 Herbs For Erectile Dysfunction -
5.
6. medpagetoday 2009-09-25 14:31:27 Sen. Finance meeting until noon today. Will take up amendments on the public plan starting Tuesday. #healthreform
7. mrbazha 2009-10-04 23:38:21 Centrale termice Vaillant Atmo TEC PLUS VUW 280-5H VUW 280-5H | Centrale Termice

The result above is the current output, but I'm sorry, all of those tweets contain '?'. Another problem is that even when a tweet contains '?', it is not necessarily a question.

Example:
1. @townsync "Where is Harringay?" North London -
2. @Crazycanuckblog Challah? wow, that's amazing. I wouldn't even know where to begin:)
3. MTM Birkin Edition Upgrade For The Bentley Continental GT, GTC, GT Speed, GTC Speed.

Your help is very much appreciated.
 
Hi

diera said:
Code:
if( $_ =~/\?/ ) {  #extract all tweet that contain ? at the end of sentence.
That will match any question mark, not only at the end.
diera said:
Code:
elsif ($_=~m/^what/) { #extract all tweet that contain 'what, where, who' at the start of sentence.
That will match any "what" at the beginning of the string, not at the beginning of the sentence. Note that a regular expression has no notion of a sentence.
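
As a rough sketch of tighter patterns: the first test below only accepts question marks at the very end of the text, and the second looks for a question word as the first word after any leading @mentions. The helper name looks_like_question and the exact word list are assumptions for illustration, and this is only a heuristic, not a real sentence parser.

```perl
use strict;
use warnings;

# Hypothetical helper: a rough "looks like a question" test for tweet text.
sub looks_like_question {
    my ($text) = @_;

    # One or more question marks as the last non-space characters.
    return 1 if $text =~ /\?+\s*$/;

    # A question word as the first word, skipping any leading @mentions.
    return 1 if $text =~ /^\s*(?:\@\w+\s+)*(?:what|where|who|how)\b/i;

    return 0;
}

for my $text (
    q{@thekaitling Where are we going to go?},
    q{@Crazycanuckblog Challah? wow, that's amazing. I wouldn't even know where to begin:)},
    q{how do I set this up},
) {
    printf "%-3s %s\n", (looks_like_question($text) ? 'yes' : 'no'), $text;
}
```

A tweet with a question mark in the middle, such as the "Challah?" one, is no longer picked up, while a question without any question mark is still caught by the second test.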
diera said:
Userid date time tweet

-tweet data-
1. RedHotCopy 2009-09-25 17:45:05 @clydetombaugh You kinda, sorta asked my question to LeBron bout how he stays so grounded. Cool! He's righteous (and hot!)
There seem to be more fields than you specified. What does that mean? Are the order numbers actually present in the file or not? What are the field separators, tabs or multiple spaces?

In the meantime I would go with something like this:
Code:
while (<INFILE>) {
  next unless m/^(\d+)\. (\w+) +([\d-]+ [\d:]+) +(.+)/;
  my $tweet = $4;  # keep a copy: the next successful match resets $1..$4
  if ($tweet =~ m/(?:what|where|who).*\?/i) {
    print "$tweet\n";
  }
}

Feherke.
 
hi feherke,

The numbering is not part of the dataset; I only added it here to tell the entries apart. The real dataset only has

userid date time tweet #delimited by tab

But I just realized that some lines contain only the tweet :(
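
One possible way to cope with both kinds of line, as a sketch only: try to match the full record first and fall back to treating the whole line as tweet text. The helper name parse_line is hypothetical, and the pattern assumes tab-separated fields with the date and time separated by either a tab or a single space, since the exact layout is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split one input line into its fields.
# Assumed layout: userid TAB date (TAB or space) time TAB tweet.
sub parse_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # strip the line ending

    if ($line =~ /^(\S+)\t(\d{4}-\d\d-\d\d)[\t ](\d\d:\d\d:\d\d)\t(.*)$/) {
        return { user => $1, date => $2, time => $3, tweet => $4 };
    }

    # No userid/date/time prefix: treat the whole line as the tweet.
    return { user => undef, date => undef, time => undef, tweet => $line };
}

for my $line (
    "RedHotCopy\t2009-09-25\t17:45:05\tYou kinda, sorta asked my question\n",
    "Where are we going to go?\n",
) {
    my $rec = parse_line($line);
    my $who = defined $rec->{user} ? $rec->{user} : '(no user)';
    print "$who: $rec->{tweet}\n";
}
```

With this approach the loop never has to die on a short line; the question tests can simply be run against the tweet field of whatever comes back.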
 