Split files based on text within the file

IMAUser · Feb 23, 2004

I have an Oracle program producing a huge XML file which is really several documents concatenated together, with every new type within the file having
<?xml version="1.0" encoding="UTF-8"?> as its first line. A few lines after that starter line there is another line such as
<index_name>US_RPI</index_name>

Now the question: what I need to be able to do is something like
-- my_awk_script.aw

OUTFILE=/tmp/my_out_file.lst
if text is /..xml.version/ --look for <?xml version, this marks the beginning of a new type
{
if $OUTFILE exists
new_file_name=`grep index_name in $OUTFILE | get the text within > < `
mv $OUTFILE /tmp/$new_file_name
fi
-- Start the $OUTFILE afresh with new data lines
print $0 > $OUTFILE
}
else
{
-- Since this line is not a new type keep appending to the existing file
print $0 >> $OUTFILE
fi
}

I hope I was clear enough. The hash key seems to have gone bonkers on my keyboard, hence I have used -- for comments.

Can I get some pointers, please?

Thanks

aigles · Feb 23, 2004

Try and adapt:
[tt]
awk -v OUTFILE=/tmp/my_out_file.lst '

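#
# When a line contains "<?xml version", a new document starts: this line and
# the following ones are memorized until the "<index_name>" line is found.
# The marker recurs once per document, so it is tested on every line.
#

/<\?xml version/ {
memo = 1
}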

#
# If Memorize phase, add line to Lines[]
#

memo {
Lines[++LinesCount] = $0;
}

#
# If Normal phase, print line to OUTPUT file
#

! memo {
print $0 >> OUTFILE;
}

#
# If memorize phase and line contains "<index_name", get the new output file name
# and write memorized lines to it. Set normal phase (no memo)
#

memo && /<index_name>/ {
sub(/^[^>]*>/,"");
sub(/<.*$/,"");
close(OUTFILE);
OUTFILE = "/tmp/" $0;
for (iline=1; iline<=LinesCount; iline++)
    print Lines[iline] >> OUTFILE;
LinesCount = 0;   # empty the buffer so the next document starts clean
memo = 0;
}
' input_file
[/tt]
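
To try the idea safely before pointing it at the real file, here is a small self-contained sketch. It is a compact variant of the same buffering approach, not the exact script above, and several things in it are assumptions made for illustration: the sample data, the second index name UK_RPI, the ".xml" suffix on the output names, and the use of a temporary working directory instead of /tmp.

```shell
# Work in a throwaway directory so nothing in /tmp is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two concatenated documents, modelled on the description in the question.
# UK_RPI is a made-up second index name.
cat > input_file <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<report>
<index_name>US_RPI</index_name>
<value>1</value>
</report>
<?xml version="1.0" encoding="UTF-8"?>
<report>
<index_name>UK_RPI</index_name>
<value>2</value>
</report>
EOF

awk '
/<\?xml version/ {           # a new document starts: buffer until its name is known
    if (out != "") close(out)
    out = ""; n = 0
}
out == "" {                  # still buffering
    buf[++n] = $0
    if ($0 ~ /<index_name>/) {
        name = $0
        sub(/^[^>]*>/, "", name)
        sub(/<.*$/, "", name)
        out = name ".xml"    # the suffix is an assumption, drop it if not wanted
        for (i = 1; i <= n; i++) print buf[i] > out
    }
    next
}
{ print > out }              # name known: write straight to the current file
' input_file

ls
```

Each output file should hold one complete document, starting with its own <?xml line. Note that if the same index name occurs twice in the input, the later document replaces the earlier file in this sketch.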

Jean Pierre.

IMAUser · Feb 25, 2004

Thanks very much, it works great with a few changes.

But can you please explain what's happening here and what sub() is all about? I just don't get the pattern matching and the way it extracts the text from within
<index_name>US_RPI</index_name>

sub(/^[^>]*>/,"");
sub(/<.*$/,"");

Many Thanks

aigles · Feb 25, 2004

[tt]sub(/^[^>]*>/,"");[/tt]

Removes from $0 everything from the beginning of the line up to and including the first '>'.
^ Beginning of string
[^>] Any single character except '>'
* Zero or more times
> Char '>'

[tt]sub(/<.*$/,"");[/tt]

Removes from $0 everything from the first remaining '<' to the end of the line.
< Char '<'
. Any single character
* Zero or more times
$ End of string
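
As a quick check of the two substitutions, the sample line from the question can be fed through them on its own. This is only a sketch: the shell variable names `line` and `name` are illustrative and not part of the script above.

```shell
# Sample line taken from the question; the variable names are illustrative only.
line='<index_name>US_RPI</index_name>'

name=$(printf '%s\n' "$line" | awk '{
    sub(/^[^>]*>/, "")   # strips "<index_name>", leaving "US_RPI</index_name>"
    sub(/<.*$/, "")      # strips "</index_name>", leaving "US_RPI"
    print
}')

echo "$name"
```

With no third argument, sub() works on $0, which is why the script can use the modified line directly as the file name.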

Jean Pierre.

IMAUser · Feb 25, 2004

Thanks Jean,
It makes more sense now and will be very helpful in future.

Thanks for all your help on this.
