Processing Big Text File with no Line break

akv003 · May 23, 2008

Hi,
I need to process a large (1 GB) text document that has no line breaks or separator characters (i.e. the entire file content is on a single line).

I tried using Perl's read function with a fixed-size buffer, but it breaks the last word of each chunk. I do not want words to be broken, because I am doing some additional processing for word replacement.

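Here is my program:
my ($data, $n);
while (($n = read FILEHANDLER, $data, 255) != 0) {
    # if the chunk ends with word characters, move the file position
    # back to the start of that (possibly incomplete) word
    if ($data =~ /(\w+)$/)
    {
        seek FILEHANDLER, (tell(FILEHANDLER) - length($1)), 0;
    }
    # ..do further processing
}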

Can someone help me write this program in Perl?

prex1 · May 23, 2008

What do you mean by "breaks the last word"?
And what exactly do you want to get?
I would use a variable-length buffer, something along the lines of:

Code:

$n=read(FILEHANDLER,$data,255);
$cursor=0;
while($n){
    #do your processing here up to the last non word character
    #while keeping updated the position you arrived at in $cursor
    #if everything happens to be processed, $cursor should be equal to length$data
  if($cursor<length($data)){
    $data=substr($data,$cursor);
      #keep the unprocessed tail so the next read is appended to it
  }else{
    $data='';
  }
  $cursor=0;
  $n=read(FILEHANDLER,$data,255,length$data);
}
if(length$data){
    #terminate your processing here
}

Franco

http://www.xcalcs.com : Online engineering calculations
http://www.megamag.it : Magnetic brakes for fun rides
http://www.levitans.com : Air bearing pads

stevexff · May 24, 2008

I *think* he means he wants to process the file in chunks rather than reading it all into memory at once, but he doesn't want to cut halfway through a word. So we'd process it in approximately 1K chunks, rather than exactly 1K chunks, for example.
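The "approximately 1K chunks" idea can be sketched as a small reader that hands back each chunk only up to its last non-word character and carries the unfinished word over into the next read. This is only an illustration under stated assumptions: the sub name next_safe_chunk, the tiny chunk size of 10 and the in-memory filehandle are all made up for the demo and do not come from the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the next chunk of roughly $size bytes that ends
# on a non-word character, or undef once the input is exhausted.
# $$tail_ref holds the unfinished word carried over between calls.
sub next_safe_chunk {
    my ($fh, $size, $tail_ref) = @_;
    my $buf = $$tail_ref;
    $$tail_ref = '';
    while (1) {
        my $n = read($fh, $buf, $size, length $buf);   # append to what we have
        die "read failed: $!" unless defined $n;
        if ($n == 0) {                                 # end of file: flush the rest
            return length $buf ? $buf : undef;
        }
        if ($buf =~ /\A(.*\W)(\w*)\z/s) {              # split after the last non-word char
            $$tail_ref = $2;
            return $1;
        }
        # no non-word character yet: keep reading until the word is complete
    }
}

# Demo on an in-memory filehandle with a deliberately tiny chunk size.
my $text = 'The USA army must now have learnt that nuclear weapons are a last resort.';
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my $tail = '';
my @chunks;
while (defined(my $chunk = next_safe_chunk($fh, 10, \$tail))) {
    push @chunks, $chunk;
}
close $fh;
print "[$_]\n" for @chunks;
```

Each printed chunk ends on a space or punctuation mark, so a word-level substitution can be applied to every chunk independently. For a real 1 GB file a much larger chunk size (for example 64 KB) would presumably be a better choice.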

Steve

"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::PerlDesignPatterns)

akv003 · May 28, 2008

Thanks, Franco and Steve, for the prompt and useful replies!

Steve is right: I do mean approximate chunks, so that no word gets broken. The file is also larger than 1 GB, so I cannot hold the whole thing in a buffer.

Here is a real-world example. I need to replace every "nuclear" with "atomic", and the sample text file contains:
"The USA army must now have learnt that nuclear weapons are meant only as a threat of last resort to save the nation."

For example, if I define a buffer size of 42, the first read returns
"The USA army must now have learnt that nuc", so the replacement code will not work:
$_ =~ s/nuclear/atomic/g;

The same thing will happen with a buffer size of 255 or 1024. I do not know which word will be split, so I need a cursor that reads from an offset value, or something similar.
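The failure is easy to reproduce. The following is a minimal sketch, assuming an in-memory filehandle stands in for the real file and using the sample sentence with a chunk size of 42 (the length of the truncated string quoted above); the variable names are chosen for the demo only.

```perl
use strict;
use warnings;

my $text = 'The USA army must now have learnt that nuclear weapons are meant '
         . 'only as a threat of last resort to save the nation.';

# An in-memory filehandle stands in for the large file (assumption for the demo).
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";

my $out = '';
while (read($fh, my $chunk, 42)) {
    # The first chunk ends in "nuc" and the second starts with "lear",
    # so the pattern never sees the whole word.
    $chunk =~ s/nuclear/atomic/g;
    $out .= $chunk;
}
close $fh;

my $replaced = $out =~ /atomic/ ? 'yes' : 'no';
print "replaced: $replaced\n";
```

With this input the script prints "replaced: no", and the output is byte for byte identical to the input, which is exactly the problem described above.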

prex1 · May 29, 2008

This is along the lines I suggested above (and it will work even with $datalen=1; !)

Code:

use strict;
use warnings;
my($data,$n,$datalen,$sear,$repl);
$datalen=42;
$sear='nuclear';
$repl='atomic';
$n=read(DATA,$data,$datalen);
while($n){
  $data=~s/\b\Q$sear\E(?=\W)/$repl/g;
    #replaces full words only; the lookahead skips a word sitting at the very
    #end of the buffer, because it may continue in the next chunk
  ($n)=($data=~m/\W/gc);
    #finds the last non word char
  if(defined(pos$data)){
    print substr($data,0,pos($data));
    $data=substr($data,pos$data);
  }
  $n=read(DATA,$data,$datalen,length$data);
}
if(length$data){
  $data=~s/\b\Q$sear\E\b/$repl/g;
    #what is left is the last word of the file
  print$data;
}
__DATA__
The USA army must now have learnt that nuclear weapons are meant only as a threat of last resort to save the nation
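The claim that the chunk size does not matter can be checked mechanically. The sketch below is not the code above but a hypothetical wrapper around the same idea (the sub name replace_in_chunks, the test sentence and the in-memory filehandles are assumptions for illustration); it runs the chunked replacement for every chunk size from 1 to 80 and compares each result with a whole-string substitution.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the chunked approach: copies $in to $out,
# replacing whole-word occurrences of $search, reading $size bytes at a time.
sub replace_in_chunks {
    my ($in, $out, $size, $search, $repl) = @_;
    my $data = '';
    while (read($in, $data, $size, length $data)) {
        # Only the text up to the last non-word character is safe to process;
        # the unfinished word after it stays in $data for the next read.
        if ($data =~ /\A(.*\W)(\w*)\z/s) {
            my ($safe, $tail) = ($1, $2);
            $safe =~ s/\b\Q$search\E\b/$repl/g;
            print {$out} $safe;
            $data = $tail;
        }
    }
    $data =~ s/\b\Q$search\E\b/$repl/g;   # whatever is left is the final word
    print {$out} $data;
}

my $text = 'nuclear power, nuclear weapons; thermonuclear is left alone. nuclear';
(my $expected = $text) =~ s/\bnuclear\b/atomic/g;

my @bad;   # chunk sizes whose output differs from the whole-string result
for my $size (1 .. 80) {
    open(my $in, '<', \$text) or die "cannot open input: $!";
    my $result = '';
    open(my $out, '>', \$result) or die "cannot open output: $!";
    replace_in_chunks($in, $out, $size, 'nuclear', 'atomic');
    close $out;
    close $in;
    push @bad, $size if $result ne $expected;
}
print @bad ? "mismatch at sizes: @bad\n" : "all chunk sizes agree\n";
```

Because only complete words are ever passed to the substitution, the result does not depend on where the reads happen to cut the text, and "thermonuclear" is left untouched thanks to the \b anchors.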

Franco
