List output from HTML pages

Status
Not open for further replies.

clemhoff

Technical User
Jul 10, 2007
19
FR
Hello everyone,

The extract below is the output of a pipeline that uses sed to strip white space from HTML pages.

/* --- SAMPLE --- */
...
<a href='/books/filebooks_gen_cbooks=41386.html'>
<b>L'école de la chair</b></a>
(Nikutai No Gakka)<br />

<span class="fs11">
1963<br />
Auteur: Yukio Mishima<br />
</span><!-- /fs11 -->
</div></div></td></tr>
...
...
<a href='/books/filebooks_gen_cbooks=71697808.html'>
<b>La fin des temps</b></a>


<span class="fs11">
1985<br />
<br />
</span><!-- /fs11 -->
</div></div></td></tr>
...
...
__________________________________________

Each record has an "<a href=" tag followed by a line containing the title of a book, but the line beginning with "Auteur" is not always present.
How can I write my awk script to get the following result?

/* --- Desired output --- */
1: L'école de la chair, Yukio Mishima
2: La fin des temps, #n/a
--
41386.html
71697808.html

/* End output */


I started to write this:
Code:
awk '$0 ~ /books\/filebooks_gen_cbooks=[0-9]+\.html\047>$/ {
		gsub(/(^.*=|\047\/?|>)/,"")
		print
		getline
		gsub(/<[^>]*>/,"")
		print
	};
	
# extract author
	/^Auteur: */ {
		gsub(/(de *|<[^>]*>)/,"")
		gsub(/,.*$/,"")
		print
	}
	
# print record separator (blank line)
	/<!--.* -->/{
		print ""
	}'

but I can't see how to write the rest of the script to get the desired result.

Thank you in advance for your help.


 
Hi

I would write it like this :
Code:
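awk '
BEGIN {
  a=1    # 1 = previous entry is complete, 0 = author still pending
}
# link line: starts a new entry
/filebooks_gen_cbooks/ {
  if (!a) print "#n/a"    # previous entry had no author
  gsub(/.*=|\047>/,"")    # keep only the file name, e.g. 41386.html
  s=s$0"\n"               # collect the file names for the final list
  getline                 # the title is on the next line
  gsub(/<[^>]*>/,"")      # strip the HTML tags
  printf "%d: %s, ",++n,$0
  a=0
}
/^Auteur: */ {
  sub(/Auteur: /,"")
  gsub(/<[^>]*>/,"")
  print
  a=1
}
END {
  if (!a) print "#n/a"    # last entry had no author
  print "--"
  print s
}'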

Feherke.
 
Hi Feherke,

Thank you for the script; it works perfectly (as usual!).
One question though: why the if statement in the "END" procedure? Could you help me understand?


 
Hi

clemhoff said:
why the if statement line in the "END" procedure?
Because we keep hoping to find an author for the previous link until either
  • the next link is found, or
  • the input ends.
The two if (!a) tests print the default "#n/a" string for those two cases. Only when there is no more input can we know for sure that the last entry had no author.
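To see both cases in action, here is a minimal sketch that runs the script from my previous post on a made-up input. The titles, the author name, the file names 1.html to 3.html, the temporary file and the result variable are all placeholders of mine, not taken from the real pages. Entry 1 is closed by the next link, entry 3 by the end of the input:

```shell
# Hypothetical input: entries 1 and 3 have no "Auteur:" line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href='/books/filebooks_gen_cbooks=1.html'>
<b>First</b></a>
<a href='/books/filebooks_gen_cbooks=2.html'>
<b>Second</b></a>
Auteur: Someone<br />
<a href='/books/filebooks_gen_cbooks=3.html'>
<b>Third</b></a>
EOF

result=$(awk '
BEGIN { a=1 }                      # nothing is pending before the first link
/filebooks_gen_cbooks/ {
  if (!a) print "#n/a"             # case 1: next link found, no author seen
  gsub(/.*=|\047>/,"")
  s=s$0"\n"
  getline
  gsub(/<[^>]*>/,"")
  printf "%d: %s, ",++n,$0
  a=0
}
/^Auteur: */ {
  sub(/Auteur: /,"")
  gsub(/<[^>]*>/,"")
  print
  a=1
}
END {
  if (!a) print "#n/a"             # case 2: input ended, no author seen
  print "--"
  print s
}' "$sample")
rm -f "$sample"

printf '%s\n' "$result"
```

With this input the list part should read "1: First, #n/a", "2: Second, Someone" and "3: Third, #n/a". Removing either if (!a) line leaves one of the two "#n/a" strings out.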

Of course, if that comment is guaranteed to appear after each title/author pair, we can simplify it:
Code:
awk '
/filebooks_gen_cbooks/ {
  gsub(/.*=|\047>/,"")
  s=s$0"\n"
  getline
  gsub(/<[^>]*>/,"")
  printf "%d: %s, ",++n,$0
  a="#n/a"    # default until an author line is seen
}
/^Auteur: */ {
  sub(/Auteur: /,"")
  gsub(/<[^>]*>/,"")
  a=$0
}
# the closing comment marks the end of a record
/<!--.* -->/ {
  print a
}
END {
  print "--"
  print s
}'
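As a quick check, the simplified version can be run on a trimmed copy of the sample from the first post. This is only a sketch: the temporary file and the result variable are my additions, and the input keeps just the lines the script actually looks at.

```shell
# Trimmed copy of the sample from the first post (layout lines left out).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href='/books/filebooks_gen_cbooks=41386.html'>
<b>L'école de la chair</b></a>
(Nikutai No Gakka)<br />
1963<br />
Auteur: Yukio Mishima<br />
<!-- /fs11 -->
<a href='/books/filebooks_gen_cbooks=71697808.html'>
<b>La fin des temps</b></a>
1985<br />
<br />
<!-- /fs11 -->
EOF

result=$(awk '
/filebooks_gen_cbooks/ {
  gsub(/.*=|\047>/,"")             # keep only the file name
  s=s$0"\n"
  getline                          # the title is on the next line
  gsub(/<[^>]*>/,"")
  printf "%d: %s, ",++n,$0
  a="#n/a"                         # default until an author line is seen
}
/^Auteur: */ {
  sub(/Auteur: /,"")
  gsub(/<[^>]*>/,"")
  a=$0                             # remember the author, print it later
}
/<!--.* -->/ {
  print a                          # the closing comment ends the record
}
END {
  print "--"
  print s
}' "$sample")
rm -f "$sample"

printf '%s\n' "$result"
```

This should reproduce the desired output from the first post. Note that the line is only finished when the comment is met, so the result depends entirely on that comment being present.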

Feherke.
 
Indeed, the comment tag is at the end of each record. But I'll keep the first script... just in case!
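To illustrate what "just in case" guards against, here is a small sketch with made-up input in which the first record has lost its closing comment. The titles, author name, file names, temporary file and result variable are placeholders I invented for the test; the script is the simplified version from the previous post.

```shell
# Hypothetical input: the first record has an author but no closing comment.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href='/books/filebooks_gen_cbooks=1.html'>
<b>First</b></a>
Auteur: Someone<br />
<a href='/books/filebooks_gen_cbooks=2.html'>
<b>Second</b></a>
<!-- /fs11 -->
EOF

result=$(awk '
/filebooks_gen_cbooks/ {
  gsub(/.*=|\047>/,"")
  s=s$0"\n"
  getline
  gsub(/<[^>]*>/,"")
  printf "%d: %s, ",++n,$0
  a="#n/a"                         # overwrites an author not yet printed
}
/^Auteur: */ {
  sub(/Auteur: /,"")
  gsub(/<[^>]*>/,"")
  a=$0
}
/<!--.* -->/ {
  print a                          # never reached for the first record
}
END {
  print "--"
  print s
}' "$sample")
rm -f "$sample"

printf '%s\n' "$result"
```

Here the first line of the output comes out as "1: First, 2: Second, #n/a": the author of the first record is lost and the two entries run together. The first script does not rely on the comment, so it is not affected by this.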

Thanks for your help.

 