Split files based on text within the file

Status
Not open for further replies.

IMAUser

Technical User
May 28, 2003
121
CH

I have an Oracle program that produces a huge XML file made up of several concatenated documents. Every new type within the file has
<?xml version="1.0" encoding="UTF-8"?> as its first line. A few lines after that starter line there is another line:
<index_name>US_RPI</index_name>

Now the question: what I need to be able to do is something like this:
-- my_awk_script.aw

OUTFILE=/tmp/my_out_file.lst
if text is /..xml.version/   -- look for <?xml version; this marks the beginning of a new type
{
    if $OUTFILE exists
        new_file_name=`grep index_name in $OUTFILE | get the text between > and <`
        mv $OUTFILE /tmp/$new_file_name
    fi
    -- Start the $OUTFILE afresh with new data lines
    print $0 > $OUTFILE
}
else
{
    -- Since this line is not a new type, keep appending to the existing file
    print $0 >> $OUTFILE
}

I hope I was clear enough. The hash key seems to have gone bonkers on my keyboard, hence I have used -- for comments.

Can I get some pointers, please?

Thanks
 
Try this and adapt it:
[tt]
awk -v OUTFILE=/tmp/my_out_file.lst '

#
# Every line containing "<?xml version" starts a new document: reset the
# buffer and memorize the following lines until the "<index_name>" line.
#

/<\?xml version/ {
    memo = 1
    LinesCount = 0
}

#
# Memorize phase: add the line to Lines[]
#

memo {
    Lines[++LinesCount] = $0
}

#
# Normal phase: print the line to the current output file
#

! memo {
    print $0 >> OUTFILE
}

#
# Memorize phase and the line contains "<index_name>": get the new output
# file name, write the memorized lines to it, then switch to normal phase
#

memo && /<index_name>/ {
    sub(/^[^>]*>/, "")
    sub(/<.*$/, "")
    close(OUTFILE)
    OUTFILE = "/tmp/" $0
    for (iline = 1; iline <= LinesCount; iline++)
        print Lines[iline] >> OUTFILE
    memo = 0
}
' input_file
[/tt]

Jean Pierre.
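
To try the approach on a small input before running it against the real file, something like the following sketch may help. It assumes every `<?xml version` line starts a new document. The wrapper name `split_xml`, the `outdir` variable, and the sample elements (`<index>`, `<value>`, `UK_CPI`) are made up for the illustration and are not from the original data. Output goes to a scratch directory rather than /tmp directly.

```shell
# Hypothetical wrapper: split INPUT ($1) into one file per document under
# OUTDIR ($2), naming each file after the text inside its <index_name> line.
split_xml() {
    awk -v outdir="$2" '
        /<\?xml version/   { memo = 1; n = 0 }   # new document: start buffering
        memo               { buf[++n] = $0 }     # hold lines until the name is known
        !memo && out != "" { print >> out }      # name known: append to current file
        memo && /<index_name>/ {
            name = $0
            sub(/^[^>]*>/, "", name)             # strip through the first ">"
            sub(/<.*$/, "", name)                # strip from the next "<" to the end
            if (out != "") close(out)            # finished with the previous file
            out = outdir "/" name
            for (i = 1; i <= n; i++) print buf[i] >> out
            memo = 0
        }
    ' "$1"
}

# Build a tiny sample with two concatenated documents (invented data).
tmp=$(mktemp -d)
printf '%s\n' \
    '<?xml version="1.0" encoding="UTF-8"?>' \
    '<index>' \
    '<index_name>US_RPI</index_name>' \
    '<value>1</value>' \
    '</index>' \
    '<?xml version="1.0" encoding="UTF-8"?>' \
    '<index>' \
    '<index_name>UK_CPI</index_name>' \
    '<value>2</value>' \
    '</index>' > "$tmp/input.xml"

mkdir "$tmp/out"
split_xml "$tmp/input.xml" "$tmp/out"
ls "$tmp/out"
```

The scratch directory should end up with one file per document, each holding the header line and everything up to the next header. Because the script appends with >>, rerunning it against a directory that already has output will add to the existing files, so start from an empty directory.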
 

Thanks very much, it works great with a few changes.

But can you please explain what is happening here and what sub() is all about? I just don't get the pattern matching and the way it extracts the text from within
<index_name>US_RPI</index_name>


sub(/^[^>]*>/,"");
sub(/<.*$/,"");

Many Thanks
 
[tt]sub(/^[^>]*>/,"");[/tt]

Removes from $0 all chars from the beginning up to and including the first '>'
^ Beginning of string
[^>] Any char except '>'
* Zero or more times
> Char '>'

[tt]sub(/<.*$/,"");[/tt]

Removes from $0 all chars from the first '<' to the end.
< Char '<'
. Any single character
* Zero or more times
$ End of string
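
The two calls can be tried in isolation on the sample line from the question. This is only a small sketch; the helper name `extract_index_name` is made up for the illustration.

```shell
# Hypothetical helper: print the text between the tags of an <index_name> line.
extract_index_name() {
    printf '%s\n' "$1" | awk '{
        sub(/^[^>]*>/, "")   # drop everything up to and including the first ">"
        sub(/<.*$/, "")      # drop everything from the first remaining "<" to the end
        print
    }'
}

extract_index_name '<index_name>US_RPI</index_name>'   # prints US_RPI
```

After the first sub() the line is US_RPI</index_name>, so the first '<' left for the second sub() is the one that opens the closing tag. Leading whitespace before the opening tag is harmless, because [^>]* also matches spaces and tabs.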


Jean Pierre.
 

Thanks Jean,
It makes more sense now and will be very helpful in future.

Thanks for all your help on this.
 