Hello everyone,
The extract below is the output of a pipeline in which I run HTML pages through sed to clean up the white space.
/* --- SAMPLE --- */
...
<a href='/books/filebooks_gen_cbooks=41386.html'>
<b>L'école de la chair</b></a>
(Nikutai No Gakka)<br />
<span class="fs11">
1963<br />
Auteur: Yukio Mishima<br />
[/color]<!-- /fs11 -->
</div></div></td></tr>
...
...
<a href='/books/filebooks_gen_cbooks=71697808.html'>
<b>La fin des temps</b></a>
<span class="fs11">
1985<br />
<br />
[/color]<!-- /fs11 -->
</div></div></td></tr>
...
...
__________________________________________
Each record starts with an "<a href=" tag, followed by a line containing the title of a book, but the line beginning with "Auteur:" is not always present.
How should I write my awk script to get the following result:
/* --- Desired output --- */
1: L'école de la chair, Yukio Mishima
2: La fin des temps, #n/a
--
41386.html
71697808.html
/* End output */
I started to write this:
Code:
awk '$0 ~ /books\/filebooks_gen_cbooks=[0-9]+\.html\047>$/ {
    # keep only the file id, e.g. 41386.html
    gsub(/(^.*=|\047\/?|>)/,"")
    print
    # the title is on the next line: strip the tags
    getline
    gsub(/<[^>]*>/,"")
    print
}
# extract author
/^Auteur: */ {
    gsub(/(de *|<[^>]*>)/,"")
    gsub(/,.*$/,"")
    print
}
# print record separator (blank line)
/<!--.* -->/ {
    print ""
}'
but I can't see how to write the rest of the script to get the desired result.
Thank you in advance for your help.
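
One possible way to complete the script, as a minimal sketch rather than a definitive answer: store the id, title and author of each record in arrays, then print everything in an END block. The function name extract_books and the array names are made up for illustration, and the sketch assumes the input looks exactly like the sample (title on the line right after the "<a href=" line, author on a line starting with "Auteur:").

```shell
# Hypothetical helper: reads the cleaned HTML on stdin, prints the report.
extract_books() {
    awk '
    # Record start: the link line holds the file id.
    /filebooks_gen_cbooks=[0-9]+\.html/ {
        n++
        id = $0
        sub(/^.*cbooks=/, "", id)        # drop everything before the id
        sub(/\.html.*$/, ".html", id)    # drop the closing quote and ">"
        file[n] = id
        author[n] = "#n/a"               # default when no author line follows
        # Assumption: the title is always on the very next line.
        if ((getline line) > 0) {
            gsub(/<[^>]*>/, "", line)    # strip the tags around the title
            title[n] = line
        }
        next
    }
    # Author line, only meaningful once a record has started.
    n && /^Auteur: */ {
        a = $0
        sub(/^Auteur: */, "", a)
        gsub(/<[^>]*>/, "", a)
        author[n] = a
    }
    END {
        for (i = 1; i <= n; i++)
            printf "%d: %s, %s\n", i, title[i], author[i]
        print "--"
        for (i = 1; i <= n; i++)
            print file[i]
    }'
}
```

It would be used at the end of the existing pipeline, or on a saved copy of the cleaned output (cleaned.txt is a placeholder name): extract_books < cleaned.txt. Unlike the attempt above, it does not try to strip "de" or anything after a comma from the author line, so those rules would have to be added back if the real data needs them.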