Extract value from HTML file using shell script

Tina_9841 · Oct 17, 2020

I am new to Unix and am trying to extract some values from an HTML file. Using a shell script, I want to extract value1, value2 and value3 from the file below.

Using these values, I have to compare against a database and confirm that the document is right.

<html>
<body>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;">value1 is : 10000</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;"> value2 is : 10001</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font><br></font></div>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;">Value3 is : 10002</font></div>
</body>
</html>

Thanks

feherke · Oct 17, 2020

Hi

Your requirement is quite brief, and parsing HTML is a complex issue. I will keep it simple and show you the simplest way that HTML could be processed: just remove all tags.

Code:

bash-5.0$ sed -r 's/<[^>]*>//g' Tina_9841.html


value1 is &#58; 10000

 value2 is &#58; 10001


Value3 is &#58; 10002

From that it is simple to extract the relevant information into an associative array:

Code:

bash-5.0$ declare -A data
bash-5.0$ while read name is colon value; do [[ "$name" ]] && data["$name"]="$value"; done <<< "$( sed -r 's/<[^>]*>//g' Tina_9841.html )"

Then you take whatever you need from the associative array:

Code:

bash-5.0$ echo "${data[value2]}"
10001
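
Since the stated goal is to check the extracted values against a database, here is a minimal sketch of the comparison step only. The helper name compare_value is made up for illustration, and the expected value is a hard-coded placeholder, because the thread does not say how the database is queried.

```shell
# compare_value NAME ACTUAL EXPECTED  (hypothetical helper, not from the thread)
# Prints a one-line verdict and returns non-zero on a mismatch.
compare_value() {
    if [ "$2" = "$3" ]; then
        echo "$1: OK"
    else
        echo "$1: MISMATCH (document has '$2', expected '$3')"
        return 1
    fi
}

actual="10001"      # stands in for the value extracted from the HTML
expected="10001"    # placeholder for whatever the database query returns
compare_value value2 "$actual" "$expected"
```

With the associative array from above, the call would be compare_value value2 "${data[value2]}" "$expected". Note that the keys keep the spelling found in the file, so the third value is stored under Value3 with a capital V.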

Feherke.
feherke.github.io

Tina_9841 · Oct 18, 2020

Thanks Feherke,

For some reason the -r option is throwing an illegal option error. I am using Unix on the AIX platform.
The document I have to validate isn't big content-wise. Is there a way to search for the text 'value1 is :' (for some reason the colon appears as &#58 in my case) and extract the content up to the next '<', which is 10000?

<body>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;">value1 is : 10000</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;"> value2 is : 10001</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font><br></font></div>
<div style="text-align:center;"><font style="background-color:rgb(255,255,255, 0.0);color:#000000;font-family:'Times New Roman',sans-serif;font-size:12pt;font-weight:400;line-height:120%;">Value3 is : 10002</font></div>
</body>
</html>

Thanks

feherke · Oct 18, 2020

Hi

Oh, that is actually left over from an earlier idea and is not needed anymore. Just remove the -r switch and it should work.
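
In case the shell on the AIX box has no bash-style associative arrays either (an assumption, the thread does not say which shell is in use), the same idea can be sketched with plain sed and a read loop only. The sample file and the helper name list_values are made up for illustration; the styles are shortened and the colon is written as the entity &#58; as reported above.

```shell
# Sample file shaped like the one in this thread (styles shortened).
sample="${TMPDIR:-/tmp}/tina_sample_$$.html"
cat > "$sample" <<'EOF'
<html>
<body>
<div style="text-align:center;"><font style="color:#000000;">value1 is &#58; 10000</font></div>
<div><font><br></font></div>
<div style="text-align:center;"><font style="color:#000000;"> value2 is &#58; 10001</font></div>
<div style="text-align:center;"><font style="color:#000000;">Value3 is &#58; 10002</font></div>
</body>
</html>
EOF

# list_values FILE (hypothetical helper): strip every tag with sed, no -r,
# then split each remaining line into words and print name=value.
list_values() {
    sed 's/<[^>]*>//g' "$1" |
    while read name is colon value; do
        # Lines that held only tags are empty now, skip them.
        if [ -n "$name" ]; then
            echo "$name=$value"
        fi
    done
}

list_values "$sample"
# prints:
# value1=10000
# value2=10001
# Value3=10002
```

The read splits on whitespace, so the leading space before value2 does no harm, and the words "is" and "&#58;" simply land in throwaway variables.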

( In the GNU sed implementation the -E, -r and --regexp-extended switches indicate that extended regular expressions are used instead of the default basic regular expressions. But the regular expression features I have used so far are common to both, so the switch is not needed. )
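
As for searching for one label and taking everything up to the next '<', a rough sketch with basic regular expressions only could look like this. It assumes the colon really is stored as &#58; followed by a space, as in the stripped output earlier; the helper name extract_value and the sample file are invented for the example.

```shell
# Two sample lines shaped like the ones in this thread (styles shortened).
sample="${TMPDIR:-/tmp}/tina_extract_$$.html"
cat > "$sample" <<'EOF'
<div style="text-align:center;"><font style="color:#000000;">value1 is &#58; 10000</font></div>
<div style="text-align:center;"><font style="color:#000000;">Value3 is &#58; 10002</font></div>
EOF

# extract_value LABEL FILE (hypothetical helper)
# Finds "LABEL is &#58;" and prints what follows, up to the next '<'.
# -n plus the p flag prints only the lines where the substitution matched.
extract_value() {
    sed -n "s/.*$1 is &#58; *\([^<]*\)<.*/\1/p" "$2"
}

extract_value value1 "$sample"    # prints 10000
extract_value Value3 "$sample"    # prints 10002
```

The label is pasted into the regular expression as is, so it is case-sensitive (value3 would find nothing here) and should not contain characters that are special to sed.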

Feherke.
feherke.github.io
