Extract value from HTML file using shell script

Status
Not open for further replies.

Tina_9841

Programmer
Oct 17, 2020
2
US
I am new to Unix and am trying to extract some values from an HTML file. Using a shell script, I want to extract value1, value2, and value3 from the file below.

Using these values, I have to compare against a database and confirm that the document is correct.


<html>
<body>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;">value1 is &#58; 10000</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;"> value2 is &#58; 10001</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font><br></font></div>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;">Value3 is &#58; 10002</font></div>
</body>
</html>

Thanks
 
Hi

Your requirement is quite brief, and parsing HTML is a complex issue. I will keep it simple and show you the simplest way that HTML could be processed: just remove all tags.
Code:
bash-5.0$ sed -r 's/<[^>]*>//g' Tina_9841.html


value1 is &#58; 10000

 value2 is &#58; 10001


Value3 is &#58; 10002

And from that it is simple to extract the relevant information into an associative array:
Code:
bash-5.0$ declare -A data
bash-5.0$ while read name is colon value; do [[ "$name" ]] && data["$name"]="$value"; done <<< "$( sed -r 's/<[^>]*>//g' Tina_9841.html )"

Then take whatever you need from the associative array:
Code:
bash-5.0$ echo "${data[value2]}"
10001
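
For shells without associative arrays, the same lookup can be sketched with only sed and awk. This is a minimal sketch under assumptions, not part of the answer above: the sample file path, the variable names cleaned and value2, and the decoding of &#58; (the HTML numeric character reference for a colon) are illustrative choices, and the style attributes are shortened.

```shell
#!/bin/sh
# Hypothetical sample file standing in for Tina_9841.html (style attributes shortened).
html_file="${TMPDIR:-/tmp}/sample_$$.html"
cat > "$html_file" <<'EOF'
<html>
<body>
<div style="text-align:center;"><font style="color:#000000;">value1 is &#58; 10000</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font style="color:#000000;"> value2 is &#58; 10001</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font style="color:#000000;">Value3 is &#58; 10002</font></div>
</body>
</html>
EOF

# Remove every tag, decode &#58; to ":", trim leading blanks, drop empty lines.
# No -r or -E switch is used; these are basic regular expressions.
cleaned=$(sed -e 's/<[^>]*>//g' -e 's/&#58;/:/g' -e 's/^ *//' -e '/^$/d' "$html_file")

# Look up one value by its exact name. Names are case-sensitive: "Value3", not "value3".
value2=$(printf '%s\n' "$cleaned" | awk -F' is : ' '$1 == "value2" { print $2 }')

printf '%s\n' "$cleaned"
echo "value2=$value2"
rm -f "$html_file"
```

With the sample above, cleaned should hold three lines of the form "name is : number", and the awk step picks one of them by name.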


Feherke.
feherke.github.io
 
Thanks Feherke,

For some reason the -r option throws an illegal option error. I am using Unix on the AIX platform.
The document I have to validate isn't big content-wise. Is there a way to search for the string 'value1 is &#58;' (for some reason the colon appears as &#58; in my case) and extract the content up to the next '<', which is 10000?

Thanks
 
Hi

Oh, that is actually left over from an earlier idea and is no longer needed. Just remove the -r switch and it should work.

(In the GNU sed implementation, the -E, -r, and --regexp-extended switches indicate that extended regular expressions are used instead of the default basic regular expressions. The regular expression features I have used so far are common to both, so the switch is not needed.)
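
The search-and-extract approach asked about earlier (find 'value1 is &#58;' and take everything up to the next '<') can be sketched with basic regular expressions only, so no -r or -E switch is involved. This is a sketch under assumptions: the function name extract_value and the sample file are hypothetical, the style attributes are shortened, and each label is assumed to appear at most once per line.

```shell
#!/bin/sh
# Hypothetical sample file standing in for the real document (style attributes shortened).
html_file="${TMPDIR:-/tmp}/sample_$$.html"
cat > "$html_file" <<'EOF'
<html>
<body>
<div style="text-align:center;"><font style="color:#000000;">value1 is &#58; 10000</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font style="color:#000000;"> value2 is &#58; 10001</font></div>
<div style="text-align:center;"><font style="color:#000000;">Value3 is &#58; 10002</font></div>
</body>
</html>
EOF

# extract_value LABEL FILE
# Prints the text between "LABEL is &#58; " and the next "<".
# -n together with the p flag prints only lines where the substitution matched.
extract_value() {
    sed -n "s/.*$1 is &#58; *\([^<]*\)<.*/\1/p" "$2"
}

value1=$(extract_value value1 "$html_file")
value2=$(extract_value value2 "$html_file")
value3=$(extract_value Value3 "$html_file")   # capital V, as in the document

echo "value1=$value1 value2=$value2 value3=$value3"
rm -f "$html_file"
```

The label is matched literally and case-sensitively, so the third lookup has to use "Value3". The extracted values can then be compared against the database with whatever client is available.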


Feherke.
feherke.github.io
 