List output from HTML pages 1

clemhoff · Nov 24, 2009

Hello everyone,

The extract below is the output of a pipeline that uses sed to strip white space from HTML pages.

/* --- SAMPLE --- */
...
<a href='/books/filebooks_gen_cbooks=41386.html'>
<b>L'école de la chair</b></a>
(Nikutai No Gakka)<br />

<span class="fs11">
1963<br />
Auteur: Yukio Mishima<br />
[/color]
</div></div></td></tr>
...
...
<a href='/books/filebooks_gen_cbooks=71697808.html'>
<b>La fin des temps</b></a>

<span class="fs11">
1985<br />
<br />
[/color]
</div></div></td></tr>
...
...
__________________________________________

Each record has an "<a href=" tag followed by a line containing the title of a book, but the line beginning with "Auteur:" is not always filled in.
How can I write my awk script to get the following result?

/* --- Desired output --- */
1: L'école de la chair, Yukio Mishima
2: La fin des temps, #n/a
--
41386.html
71697808.html
/* End output */

I started to write this:

Code:

awk '$0 ~ /books\/filebooks_gen_cbooks=[0-9]+\.html\047>$/ {
		gsub(/(^.*=|\047\/?|>)/,"")
		print
		getline
		gsub(/<[^>]*>/,"")
		print
	};
	
# extract author
	/^Auteur: */ {
		gsub(/(de *|<[^>]*>)/,"")
		gsub(/,.*$/,"")
		print
	}
	
# print record separator (blank line)
	/<!--.* -->/{
		print ""
	}'

but I can't see how to write the rest of the script to get the desired result.

Thank you in advance for your help.

feherke · Nov 24, 2009

Hi

I would write it like this:

Code:

awk '
BEGIN {
  a=1
}
/filebooks_gen_cbooks/ {
  if (!a) print "#n/a"
  gsub(/.*=|\047>/,"")
  s=s$0"\n"
  getline
  gsub(/<[^>]*>/,"")
  printf "%d: %s, ",++n,$0
  a=0
}
/^Auteur: */ {
  sub(/Auteur: /,"")
  gsub(/<[^>]*>/,"")
  print
  a=1
}
END {
  if (!a) print "#n/a"
  print "--"
  print s
}'
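
To see what the two gsub calls in this script do on their own, here is a minimal sketch that applies each one to a single line taken from the sample. The shell variable names (line, file, title) are only for illustration and are not part of the script.

```shell
#!/bin/sh
# Illustration only: these variable names are assumptions for the sketch.
line="<a href='/books/filebooks_gen_cbooks=41386.html'>"

# ".*=" greedily removes everything up to the last "=", and "\047>" removes
# the closing apostrophe (octal 047) plus ">", leaving only the file name.
file=$(printf '%s\n' "$line" | awk '{ gsub(/.*=|\047>/, ""); print }')

# "<[^>]*>" removes every HTML tag, leaving the bare title.
title=$(printf '%s\n' "<b>La fin des temps</b></a>" | awk '{ gsub(/<[^>]*>/, ""); print }')

printf '%s | %s\n' "$file" "$title"
```

Writing the apostrophe as \047 keeps it from closing the single-quoted awk program in the shell.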

Feherke.

http://free.rootshell.be/~feherke/

clemhoff · Nov 24, 2009

Hi Feherke,

Thank you for the script, it works perfectly (as usual!).
One question though: why the if statement line in the "END" procedure? Could you help me understand?

feherke · Nov 24, 2009

Hi

clemhoff said:
why the if statement line in the "END" procedure?

Because we keep hoping to find an author for the previous link until either the next link is found or the input ends.
The two if (!a) checks print the default "#n/a" string for those two cases. We can only know for sure that the last entry had no author when there is no more input.
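
Both cases can be seen by running the first script on a small input where one title without an author is followed by another link and another one comes last. This is only a sketch: the input is a cut-down version of the sample, and the third record (123.html, "Titre inconnu") is invented for the illustration.

```shell
#!/bin/sh
# Hypothetical input: records 1 and 3 have no "Auteur:" line.
# The third record is made up for this sketch.
tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT
cat > "$tmp" <<'EOF'
<a href='/books/filebooks_gen_cbooks=71697808.html'>
<b>La fin des temps</b></a>
1985<br />
<a href='/books/filebooks_gen_cbooks=41386.html'>
<b>L'école de la chair</b></a>
1963<br />
Auteur: Yukio Mishima<br />
<a href='/books/filebooks_gen_cbooks=123.html'>
<b>Titre inconnu</b></a>
EOF

out=$(awk '
BEGIN { a = 1 }                 # a=1: no title is waiting for an author
/filebooks_gen_cbooks/ {
  if (!a) print "#n/a"          # case 1: next link found, previous title had no author
  gsub(/.*=|\047>/, "")         # keep only the file name
  s = s $0 "\n"                 # collect file names for the END block
  getline                       # the following line holds the title
  gsub(/<[^>]*>/, "")
  printf "%d: %s, ", ++n, $0
  a = 0                         # a title is now waiting for its author
}
/^Auteur: */ {
  sub(/Auteur: /, "")
  gsub(/<[^>]*>/, "")
  print
  a = 1                         # the author completed the pending line
}
END {
  if (!a) print "#n/a"          # case 2: input ended while a title was still waiting
  print "--"
  print s
}' "$tmp")

printf '%s\n' "$out"
```

Here the "#n/a" on the first line comes from the check inside the link rule, and the one on the third line comes from the check in END.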

Of course, if that comment is guaranteed to appear after each title/author pair, we can simplify the script:

Code:

awk '
/filebooks_gen_cbooks/ {                                  
  gsub(/.*=|\047>/,"")
  s=s$0"\n"
  getline             
  gsub(/<[^>]*>/,"")
  printf "%d: %s, ",++n,$0
  a="#n/a"
}                   
/^Auteur: */ {
  sub(/Auteur: /,"")
  gsub(/<[^>]*>/,"")
  a=$0
}
/<!--.* -->/ {
  print a
}
END {               
  print "--"
  print s                      
}'
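
A quick way to try this variant is to feed it records that each end with an HTML comment. The sketch below assumes a comment of the form "<!-- end -->"; the real comment text is not shown in the sample, so that line is a stand-in, and the pattern only needs "<!--" and " -->" on the same line.

```shell
#!/bin/sh
# Hypothetical input: the "<!-- end -->" lines stand in for the real
# comment that closes each record; their wording is an assumption.
tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT
cat > "$tmp" <<'EOF'
<a href='/books/filebooks_gen_cbooks=41386.html'>
<b>L'école de la chair</b></a>
Auteur: Yukio Mishima<br />
<!-- end -->
<a href='/books/filebooks_gen_cbooks=71697808.html'>
<b>La fin des temps</b></a>
<br />
<!-- end -->
EOF

out=$(awk '
/filebooks_gen_cbooks/ {
  gsub(/.*=|\047>/, "")
  s = s $0 "\n"
  getline
  gsub(/<[^>]*>/, "")
  printf "%d: %s, ", ++n, $0
  a = "#n/a"                    # default until an author line is seen
}
/^Auteur: */ {
  sub(/Auteur: /, "")
  gsub(/<[^>]*>/, "")
  a = $0                        # remember the author, print it later
}
/<!--.* -->/ {
  print a                       # the comment closes the record
}
END {
  print "--"
  print s
}' "$tmp")

printf '%s\n' "$out"
```

Because the comment marks the end of every record, the author (or the default) is printed exactly once per record and no check is needed in END.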

Feherke.

http://free.rootshell.be/~feherke/

clemhoff · Nov 24, 2009

Indeed, the comment tag is at the end of each record, but I'll keep the first script... just in case!

Thanks for your help.
