Guest_imported
Hi,
I've written a script that strips all the HTML tags from an XML file.
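BEGIN {
    RS = ""    # paragraph mode: records are separated by blank lines
}
{
    # Keep only what lies between <text> and </text>
    gsub(/^.*<text>/, "", $0)
    gsub(/<\/text>.*$/, "", $0)
    gsub(/<\/?p>/, "", $0)       # drop <p> and </p> tags
    gsub(/&quot;/, "", $0)       # drop &quot; entities
    gsub(/[ ][ ]+/, " ", $0)     # collapse runs of spaces
    print $0
}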
What should I add to my script so that each XML file is automatically saved as a text file? For example, news135.xml and news653.xml should be saved as news135.txt and news653.txt.
The problem is that I've got hundreds of such XML files, so I was thinking of using a wildcard on the command line (gawk -f script.awk *.xml). Each file should be saved as a separate file, but I don't know how to do that with gawk.
Can someone help me with this?
Febri
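One possible approach, as a minimal sketch rather than a definitive answer: awk sets the built-in variable FILENAME to the file currently being read, and FNR restarts at 1 for each new file, so the output name can be derived at the first record of every file and used with print's ">" redirection. The function name xml_to_txt and the variable out are made up for illustration, and the &quot; pattern is an assumption about what the entity rule in the script is meant to match. The sketch calls plain awk because it only uses POSIX features; gawk can be substituted.

```sh
# Hypothetical helper (the name xml_to_txt is not from the original post).
# Applies the tag-stripping rules to every file given and writes each
# result to a .txt file next to its .xml source.
xml_to_txt() {
    awk '
    BEGIN { RS = "" }                   # paragraph mode, as in the original script

    FNR == 1 {                          # first record of each input file
        if (out != "") close(out)       # finish the previous output file
        out = FILENAME
        if (!sub(/\.xml$/, ".txt", out))  # news135.xml -> news135.txt
            out = FILENAME ".txt"       # no .xml suffix: append, never overwrite input
    }

    {
        gsub(/^.*<text>/, "")
        gsub(/<\/text>.*$/, "")
        gsub(/<\/?p>/, "")
        gsub(/&quot;/, "")              # assumed meaning of the entity rule
        gsub(/[ ][ ]+/, " ")
        print > out                     # goes to the per-file output
    }
    ' "$@"
}
```

It would be called as `xml_to_txt *.xml`. The same two additions (the FNR == 1 rule and `print > out` in place of `print $0`) should also work when placed in script.awk and run with gawk -f script.awk *.xml. Calling close() on the previous output matters with hundreds of inputs, since each redirection otherwise keeps its file open.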