Parse data

Status
Not open for further replies.

cjtucker1976

Technical User
Sep 7, 2006
72
US
Many years ago we had a Perl script created that parsed an AP news feed, grabbed just the headlines, and created a clean newswire.html file. It has been working for over 10 years. The way we receive the feed has changed and our script is no longer working. I am a newbie, so any help is appreciated. Below is the Perl script, followed by the text file it works on.

SCRIPT:
use Cwd;
$curr= cwd();
# Test
# print "Current working directory is ", $curr, "\n\n";

$PresentTime = time;

# Begin WHYY content
print "WHYY Content";

chdir "d:\\signfiles" ;

$now = localtime;
$tday = substr($now,0,3) ;
$tmo = substr($now,4,3) ;
$tdate = substr($now,8,2) ;
$ttime = substr($now,11,5) ;
$contentfile = $tday. "_". $tmo. "_". $tdate. ".txt" ;
$apchk = "start" ;
$spchk = "specs" ;
#open(CONFILELIST, $contentfile) || die "cannot opendir. $!";
open(CONFILELIST, $contentfile) or open (CONFILELIST, "whyydef.txt");
print "In whyy content";
$timeskip = "no" ;
$skip = "yes" ;
$spec = "no" ;
$goodtogo = "no" ;
while (<CONFILELIST>) {
$a = $_ ;
$headline = substr($a,0,5) ;
$headline =~ tr/A-Z/a-z/ ;
if ($apchk eq $headline) {
$timeskip = "yes" ;
$chktime = substr($a,6,5) ;
if ($ttime ge $chktime) {
$skip = "no" ;
$spec = "no" ;
@pcontent = "" ;
$ct = 0 ;
} else {
$skip = "yes" ;
}
}elsif ($spchk eq $headline) {
$timeskip = "yes" ;
$chktime = substr($a,6,5) ;
if ($ttime ge $chktime) {
$skip = "no" ;
$spec = "yes" ;
@pcontent = "" ;
$ct = 0 ;
} else {
$skip = "yes" ;
}
}
if ($skip eq "no") {
if ($timeskip eq "no") {
$pcontent[$ct] = $a ;
$ct = $ct + 1 ;
}
}
$timeskip = "no" ;
$oldchktime = $chktime
}

close CONFILELIST;

chdir "C:\\aperl\\bin" ;

open(FILELIST, "newswire.txt") or $spec = "yes" ;
print " :$spec: ";
if ($spec eq "yes") {
unlink ("newswire.html") ;
open (HTML, ">newswire.html") or die "Cannot create newswire.html. $!";
print" Once more in content";
goto FINISHLINE ;
}
# End WHYY content
print "End WHYY Content";

#open(FILELIST, "newswire.txt") || die "cannot opendir. $!";

$chk = "^AP Top Headlines" ;
$chkus = "^AP Top U.S. News" ;
$chkyes = "no" ;
$apchk = "AP" ;
$brkchk = "\^" ;
#print $chk ;
$fg = "not" ;
$skip = "no" ;
$newsct = 0 ;
while (<FILELIST>) {
#check for news content
$goodtogo = "yes" ;
$headline = "" ;
$a = $_ ;
#$a =~ s/\W// ; #This gets rid of the ^ on other lines
$headline = substr($a,11,2) ;
#print $headline ;
#$wait = <STDIN> ;
if ($apchk eq $headline) {
$fg = "not" ;
#print $fg ;
}
$grab = substr($a,0,17) ;
if ($grab eq $chk) {
$chkyes = "yes" ;
$grab =~ s/\W// ; #Delete First Character of line
#print HTML "... " ;
#print HTML $grab ;
unlink ("newswire.html") ;
open (HTML, ">newswire.html") or die "Cannot create newswire.html. $!";
print HTML "The Top Headlines From WHYY" ; #add at " to include the time
@timeparts = localtime(time) ;
#print HTML $timeparts[2], ":", $timeparts[1] ;
$fg = "yes" ;
$skip = "yes" ;
} elsif ($grab eq $chkus) {
if ($chkyes eq "no") {
$grab =~ s/\W// ; #Delete First Character of line
print HTML "... " ;
print HTML $grab ;
unlink ("newswire.html") ;
open (HTML, ">newswire.html") or die "Cannot create newswire.html. $!";
print HTML "The Top Headlines From WHYY" ; #add at " to include the time
@timeparts = localtime(time) ;
#print HTML $timeparts[2], ":", $timeparts[1] ;
$fg = "yes" ;
$skip = "yes" ;
}
}
if ($grab ne $chk) {
if ($fg eq "yes") {
$brk = substr($a,0,1) ;
if ($skip eq "no") {
if ($brk eq $brkchk) {
print HTML "... " ;
}
#} else {
#print HTML " " ;
#}

if ($brk eq $brkchk) {
$a =~ s/\W// ; #Delete First Character of line
}
print HTML $a ;
}
} elsif ($grab ne $chkus) {
if ($chkyes eq "no") {
if ($fg eq "yes") {
$brk = substr($a,0,1) ;
if ($skip eq "no") {
if ($brk eq $brkchk) {
print HTML "... " ;
}
#} else {
#print HTML " " ;
#}

if ($brk eq $brkchk) {
$a =~ s/\W// ; #Delete First Character of line
}
print HTML $a ;
}
$skip = "no" ;
}

}

}
$skip = "no" ;
}

}

print HTML "... " ;

close FILELIST;

#close HTML;



FINISHLINE:
if ($goodtogo eq "no") {
unlink ("newswire.html") ;
open (HTML, ">newswire.html") or die "Cannot create newswire.html. $!";
}
$nct = 0 ;
for ($nct =0; $nct <= $ct; $nct++) {
print HTML $pcontent[$nct] ;
print HTML " " ;
}

close HTML;
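
For reference, the script above builds the name of the day's content file (Day_Mon_DD.txt) and the current HH:MM by cutting fixed positions out of the localtime string. The same values can probably be produced from localtime's list fields, which avoids counting character offsets. This is only a sketch: content_file_name and hhmm are made-up helper names, not part of the original script.

```perl
use strict;
use warnings;

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Hypothetical helper: build the same "Day_Mon_DD.txt" name the script
# assembles with substr(), but from localtime's list fields.
sub content_file_name {
    my ($epoch) = @_;
    my ($mday, $mon, $wday) = (localtime($epoch))[3, 4, 6];
    # %2d pads single-digit days with a space, as the localtime string does.
    return sprintf("%s_%s_%2d.txt", $DAYS[$wday], $MONTHS[$mon], $mday);
}

# Hypothetical helper: zero-padded "HH:MM", so the "ge" string comparison
# used in the script orders times the same way a numeric comparison would.
sub hhmm {
    my ($epoch) = @_;
    my ($min, $hour) = (localtime($epoch))[1, 2];
    return sprintf("%02d:%02d", $hour, $min);
}

my $now = time;
print content_file_name($now), " ", hhmm($now), "\n";
```

Note that on single-digit days the file name contains a space (for example "Sat_Sep_ 7.txt"), because the original substr() approach behaves that way and the sketch keeps it.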


Example of NEWSWIRE TEXT File:

Niall Horan (One Direction) is 20. Actor Mitch Holleman (``Reba'') is 18.
^
'' _ Charlotte Bronte (BRAWN'-tee), English author (1816-1855).
^
(Above Advance for Use Friday, Sept. 13)
^
Copyright 2013, The Associated Press. All rights reserved.
^

AP-WF-09-13-13 0401GMT<
0403-----
r a BC-US--People-Kidman 09-13 0501
^*1402< ^AP-US-People-Kidman,118<
^Kidman says she's OK but shaken after collision<
^AP Photo NYET105<
^Eds: APNewsNow. With AP Photos.<

Calvin Klein event, she said she was ok.

Kidman added: ``I'm up, I'm walking around, but I was shaken.''

AP-WF-09-13-13 0411GMT<


0401-----
r a BC-US-TEC--Twitter-IPO-T 09-13 0916
^BC-US-TEC--Twitter-IPO-Tweet Facts,492<
^Tweetable facts about Twitter's IPO<
^AP Photo NYBZ147<
^Eds: With AP Photos.<
^By SCOTT MAYEROWITZ=
^AP Business Writer=

a Tweet.

e limit of tweets.

a planned IPO.


announcement tweet, 7,872 people retweeted the message.


_ The public offering comes at a time of heightened investor interest
in the IPO market _ 131 IPOs have priced so far this year.

_ Is (at)Twitter trying to avoid (at)Facebook's May 2012 IPO (hash)fa
il? Well, company is keeping details secret for now. (hash)TwitterIPO

_ The company hasn't said if it makes a profit or how much revenue it
takes in. (hash)FadOrFuture? Wonder if (at)WarrenBuffett will buy stock.

_ Most of Twitter's revenue comes from advertising. (at)eMarketer est
imates $582.8 million this year, up from $288.3 million in 2012.

_ Compare: In latest quarter, Facebook had $1.6 billion in ad revenue
. By 2015, Twitter's annual ad revenue is expected to hit $1.3 billion.

_ 2013 (hash)Superbowl performance by (at)Beyonce had 268 million twe
ets per minute, more than any other event in past two years.

_ Not everybody on (at)Twitter is who they claim to be. (at)United Ai
rlines CEO Jeff Smisek has to put up with (at)FakeUnitedJeff

_ Sometimes even missing zoo animals get their own Twitter accounts.
And they can be funny. Just read (at)BronxZoosCobra



.''


AP-WF-09-13-13 0411GMT<


0407-----
r a BC-US--NuclearSpending 09-13 0578
^*1110< ^AP-US-Nuclear-Spending,130<
^Nation's bloated nuclear spending comes under fire<
^AP Photo LA104, LA103<
^Eds: APNewsNow. Will be expanded. With AP Photos.<
^By JERI CLAUSING and MATTHEW DALY=
^Associated Press=


sitive nuclear bomb-making facilities doesn't work.

ms that include a redesign to raise the roof so equipment can fit inside.

tic budget increases for nuclear contractors.

uld be overhauled.
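
Going only by the sample above, each story seems to have a slug line ending in a comma, a number and "<" (for example "^AP-US-People-Kidman,118<"), with the headline on the next line between "^" and "<". Under that assumption, a minimal sketch for pulling out the headlines could look like this. extract_headlines and format_newswire are hypothetical names, and the rule is inferred from three sample stories, so it may not hold for the whole feed.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original script.
# Assumption (inferred only from the sample above): each story has a slug line
# ending in ",<number><", and the next line of the form "^text<" is its headline.
sub extract_headlines {
    my @lines = @_;
    my @headlines;
    my $after_slug = 0;
    for my $line (@lines) {
        $line =~ s/\s+\z//;                     # drop trailing newline and spaces
        if ($line =~ /,\d+<\z/) {               # e.g. "^AP-US-People-Kidman,118<"
            $after_slug = 1;
            next;
        }
        if ($after_slug) {
            # Keep the text between the leading "^" and the trailing "<".
            push @headlines, $1 if $line =~ /^\^(.+)<\z/;
            $after_slug = 0;
        }
    }
    return @headlines;
}

# Join headlines roughly the way the old script did: a title, then "... " separators.
sub format_newswire {
    my @headlines = @_;
    return join(" ... ", "The Top Headlines From WHYY", @headlines) . " ... ";
}

# Small demo using lines copied from the sample feed.
my @sample = (
    "0403-----\n",
    "r a BC-US--People-Kidman 09-13 0501\n",
    "^*1402< ^AP-US-People-Kidman,118<\n",
    "^Kidman says she's OK but shaken after collision<\n",
    "^AP Photo NYET105<\n",
    "0401-----\n",
    "^BC-US-TEC--Twitter-IPO-Tweet Facts,492<\n",
    "^Tweetable facts about Twitter's IPO<\n",
);
print format_newswire(extract_headlines(@sample)), "\n";
```

To use it on the real file, the lines read from newswire.txt would be passed to extract_headlines and the result of format_newswire printed to newswire.html.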


 