
Processing Big Text File with no Line break

Status
Not open for further replies.

akv003

Programmer
May 23, 2008
4
IN
Hi,
I need to process a large text document (1 GB) that has no line breaks or separator characters (i.e. the whole file content is on a single line).

I tried using Perl's read function with a fixed-size buffer, but it splits the last word in the buffer, which I do not want because I am doing some additional processing for word replacement.

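Here is my program:
my ($data, $n);
while (($n = read FILEHANDLER, $data, 255) != 0) {
    # if the buffer ends inside a word, move the file position back to the start of that word
    if ($data =~ /(\w+)$/)
    {
        seek FILEHANDLER, (tell(FILEHANDLER) - length($1)), 0;
    }
    # ... do further processing
}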


Can someone help me write this program in Perl?
 
What do you mean by "breaks the last word"?
And what exactly do you want to get?
I would use a variable-length buffer, something along the lines of:
Code:
$data='';
$n=read(FILEHANDLER,$data,255);
$cursor=0;
while($n){
    #do your processing here up to the last non word character
    #while keeping updated the position you arrived at in $cursor
    #if everything happens to be processed, $cursor should be equal to length$data
  if($cursor<length($data)){
    $data=substr($data,$cursor);
      #keeps the unprocessed tail for the next pass
  }else{
    $data='';
  }
  $cursor=0;
  $n=read(FILEHANDLER,$data,255,length$data);
    #appends the next block after the kept tail
}
if(length$data){
    #terminate your processing here
}

Franco
 
I *think* he means he wants to process the file in chunks rather than reading it all into memory at once, but he doesn't want to cut halfway through a word. So we'd process it in approximately 1K chunks, rather than exactly 1K chunks, for example.
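
One way to get "approximately sized" chunks could be sketched as follows. This is only an illustration, not code from the original poster: the helper name read_chunk and the in-memory sample text are assumptions. It reads a fixed number of characters and then keeps reading one character at a time until the chunk ends on a non-word character or the file ends.

```perl
use strict;
use warnings;

# Hypothetical helper: read about $size characters from $fh, then keep
# reading one character at a time until the chunk ends on a non-word
# character (or the file ends), so that no word is cut in half.
sub read_chunk {
    my ($fh, $size) = @_;
    my $chunk = '';
    my $n = read($fh, $chunk, $size);
    return unless $n;                  # end of file (or read error)
    while ($chunk =~ /\w\z/) {         # chunk still ends inside a word
        my $got = read($fh, $chunk, 1, length $chunk);
        last unless $got;              # end of file reached mid-word
    }
    return $chunk;
}

# Demo on an in-memory filehandle standing in for the real file.
my $text = 'The USA army must now have learnt that nuclear weapons are meant only as a threat';
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
while (defined(my $chunk = read_chunk($fh, 42))) {
    $chunk =~ s/\bnuclear\b/atomic/g;  # safe: every chunk holds whole words only
    print $chunk;
}
close($fh);
print "\n";
```

Because each chunk starts at the beginning of a word and ends after a non-word character (or at end of file), a word-boundary match inside a chunk never sees half a word. A file with one extremely long run of word characters would still make a chunk grow without limit.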

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)
 
Thanks, Franco and Steve, for the prompt and useful replies!

Steve is right: I do mean approximate chunks, so that no word gets split. Also, the file is larger than 1 GB, so I cannot hold the whole thing in a buffer at once.

Let's take a real-world example: I need to replace all occurrences of "nuclear" with "atomic". Here is what the sample text file contains:
"The USA army must now have learnt that nuclear weapons are meant only as a threat of last resort to save the nation."

For example, if I define a buffer size of 42, the read will return
"The USA army must now have learnt that nuc", so the replacement code will not work:
$_ =~ s/nuclear/atomic/g;

The same thing will happen with a buffer size of 255 or 1024. I do not know which word will be split, so I need a cursor that resumes reading from an offset, or something similar.
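
To make the split concrete, here is a minimal sketch that reproduces it. The in-memory filehandle is only a stand-in for the real file, and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $text = 'The USA army must now have learnt that nuclear weapons are meant only as a threat of last resort to save the nation.';
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

my $buffer;
read($fh, $buffer, 42);                       # fixed-size read of 42 characters
print "buffer: [$buffer]\n";                  # the buffer ends in "nuc"

my $count = ($buffer =~ s/nuclear/atomic/g);  # false when nothing was replaced
print "replacements: ", ($count || 0), "\n";  # prints 0: the word was split
close($fh);
```

The substitution finds nothing in this buffer, and the next buffer would start with "lear", so the word is missed in both.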



 
This is along the lines I suggested above (and it will work even with $datalen=1; !)
Code:
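use strict;
use warnings;
my($data,$n,$datalen,$sear,$repl);
$datalen=42;
$sear='nuclear';
$repl='atomic';
$n=read(DATA,$data,$datalen);
while($n){
  $data=~s/(?<!\w)\Q$sear\E(?=\W)/$repl/g;
    #replaces full words only; the lookahead requires a following non word char,
    #so a word cut off at the end of the buffer waits for the next pass
  ($n)=($data=~m/\W/gc);
    #finds the last non word char (pos is left just after it)
  if(defined(pos$data)){
    print substr($data,0,pos($data));
    $data=substr($data,pos$data);
      #keeps the unfinished word for the next pass
  }
  $n=read(DATA,$data,$datalen,length$data);
}
if(length$data){
  $data=~s/\b\Q$sear\E\b/$repl/g;
    #the remainder is the last word of the file
  print$data;
}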
__DATA__
The USA army must now have learnt that nuclear weapons are meant only as a threat of last resort to save the nation

Franco
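
For use on a real file, the same buffering idea could be wrapped in a sub that works on any pair of filehandles. The following is a sketch under assumptions, not a tested production routine: the name replace_words and its parameter order are made up for illustration, and the search term is assumed to consist of word characters only.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the buffering idea above: copy $in to $out,
# replacing whole-word occurrences of $search with $replace. Only the
# current block plus one unfinished word is held in memory.
sub replace_words {
    my ($in, $out, $search, $replace, $buflen) = @_;
    my $data = '';
    while (read($in, $data, $buflen, length $data)) {
        # Cut after the last non-word character; the tail is an unfinished word.
        next unless $data =~ /\A(.*\W)/s;
        my $head = $1;
        $data = substr($data, length $head);
        $head =~ s/\b\Q$search\E\b/$replace/g;
        print {$out} $head;
    }
    # Whatever is left is the final word of the file.
    $data =~ s/\b\Q$search\E\b/$replace/g;
    print {$out} $data;
    return;
}

# Demo with in-memory filehandles and a deliberately tiny buffer.
my $text = 'The USA army must now have learnt that nuclear weapons are meant only as a threat of last resort to save the nation';
my $result = '';
open(my $in,  '<', \$text)   or die "cannot open input: $!";
open(my $out, '>', \$result) or die "cannot open output: $!";
replace_words($in, $out, 'nuclear', 'atomic', 1);
close($in);
close($out);
print "$result\n";
```

With real files, the two open calls would take file names instead of scalar references, and a much larger buffer (for example 65536) would be the natural choice. A file containing one enormous run of word characters would still force the buffer to grow, since the sub never cuts inside a word.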
 