
how best to split a large file into smaller sized chunks


MiMoore (Programmer)
Aug 25, 2005
hi.
Just looking at a problem where I want to split a large XML file into smaller chunks.
Are there any recommended ways / Perl utilities for handling this?

I would need to take into account the opening and closing tags of an XML element,
i.e. if the chunk-size threshold fell in the middle of an XML element, I would need to cut the chunk at the end of the previous element so that the chunk could be processed correctly.

Then start the next chunk at the next element.

To use an analogy, it would be like creating a chunk, but never in the middle of a line or paragraph.

Any suggestions would be appreciated.

Thanks.
 
Set the input record separator ($/) to the closing tag of your XML element, and just read in the file as normal.


Along these lines
Code:
$|="</xml>";
open FH, "<text.xml";
while (<FH>) {
  $file_title=&parse_filename($_);
  open OUT, ">$file_title.xml";
  print OUT $_;
  close OUT;
}
close FH;

Don't forget to test the begin and end cases (the XML prologue lands in the first chunk, and the last read may not end with the separator)
--Paul

Spend an hour a week on CPAN, helps cure all known programming ailments ;-)
 
Is the problem that the XML file is simply too big to read into memory and process? If so, there are techniques which allow you to process it serially. Check out for an excellent comparison of XML parsing techniques.
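
For example (a minimal sketch, not from the original thread): XML::Twig parses serially and can purge each element once it has been handled, so memory use stays flat. Assumptions here: the repeating element is <record> and the file is big.xml.

Code:
use strict;
use warnings;
use XML::Twig;

# Handle one <record> at a time; purge releases everything
# already processed, so the in-memory tree never grows large.
my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ( $t, $elt ) = @_;
            print $elt->text, "\n";   # stand-in for the real processing
            $t->purge;                # discard the parsed element
        },
    },
);
$twig->parsefile('big.xml');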

Yours,

fish


"As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs."
--Maurice Wilkes
 
Thanks for the reply.

Yes, the XML is very large, and when it is passed to the Perl XML parser, CPU and memory utilisation are unacceptably high and sometimes the process runs out of memory.

The solution I was after was to split the file and pass each piece separately to the parser. This was done by creating many temp files (split on line count) and then processing all the files in that temp directory.

I would like a cleaner solution: create the first chunk, process it, remove it, then create the second chunk, and so on, rather than creating all the temp files up front and taking up unnecessary temp disk space.
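
Putting that together with the $/ trick above, a hedged sketch of the create-process-remove loop (assumptions: each record ends with </record>, the input is big.xml, and parser.pl is a hypothetical stand-in for the real parser):

Code:
use strict;
use warnings;

$/ = "</record>";   # assumption: every record ends with this tag

open my $in, '<', 'big.xml' or die "Cannot open big.xml: $!";
my $n = 0;
while ( my $chunk = <$in> ) {
    next unless $chunk =~ /<record/;    # skip trailing whitespace after the last record
    my $tmp = sprintf "chunk_%04d.xml", ++$n;
    open my $out, '>', $tmp or die "Cannot write $tmp: $!";
    print {$out} $chunk;
    close $out;
    # process this chunk, then delete it before creating the next one
    system( 'perl', 'parser.pl', $tmp ) == 0
        or warn "parser failed on $tmp: $?";
    unlink $tmp or warn "Cannot remove $tmp: $!";
}
close $in;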


 
Did you look at the input record separator ($/) solution posted above?

Spend an hour a week on CPAN, helps cure all known programming ailments ;-)
 
Not sure if you can use Tie::File for what you are doing, but it's designed for working with large text files.
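
For what it's worth, a minimal Tie::File sketch (big.xml is an assumed filename): the file is presented as an array of lines and read lazily, so nothing is slurped into memory.

Code:
use strict;
use warnings;
use Tie::File;
use Fcntl 'O_RDONLY';

# The array is backed by the file; lines are fetched on demand.
tie my @lines, 'Tie::File', 'big.xml', mode => O_RDONLY
    or die "Cannot tie big.xml: $!";

print scalar(@lines), " lines\n";      # line count without loading the file
print "$_\n" for @lines[ 0 .. 4 ];     # peek at the first five lines

untie @lines;

Note, though, that Tie::File is line-oriented, so you would still have to find the element boundaries yourself.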
 