Hello everyone,
I'm new to this forum and to awk programming. I'm from Poland, so please forgive my English.
I'm looking for a fast way to extract data from over 1,000,000 html files (around 30 kB each) and save it to a txt file with tab as the delimiter. Each html file holds a lot of information about one company that I want to extract (company name, phone and fax, address, www, email, keywords, etc.), so one html file will become one record (one row in the txt file).
The data I want to extract is easy to spot. For example, the company name always sits between h1 tags:
<h1 class="wiz_tyt">Euroco Sp. z o.o.</h1>
Other fields:
<b>Adres:</b> Pomorska 17/60, 03-101 Warszawa</span>
<b>Telefon:</b> 22 6767697</span>
<b>Fax:</b> (022) 870-66-12</span>
<b>Nip:</b> 1132483993 </span>
<b>Strona <a href=" rel="nofollow" target="_blank">www.euroco.pl</a>
<b>Forma własności:</b> Sp. z o.o.</span>
<b>Zatrudnienie:</b> 11-50</span>
<b>Numer KRS:</b> <i>0000156847</i>
<b>Godziny otwarcia:</b> poniedziałek-piątek: 09:00-17:00</span>
The email field is a little different: it is "encoded" in JavaScript. I just need to extract the lg0 and lg1 variables:
<b>E-mail:</b> <script type="text/javascript">pr_1 = "mai";pr_2 = "lto:";lg0 = "biuro";at = "@";lg1 = "euroco.pl"; document.write('<a href=\"'+pr_1+pr_2+lg0+at+lg1+'\">'); document.write( lg0 + at + lg1 + '<\/a>' ); </script></span>
And last, the keywords, which appear both in the title attribute and between the <a></a> tags:
<b>Słowa kluczowe:</b> <a href="/0,adaptacje,indeks.html" title="adaptacje" style="color:#888;">adaptacje</a>, <a href="/0,eksport,indeks.html" title="eksport" style="color:#888;">eksport</a>, <a href="/0,import,indeks.html" title="import" style="color:#888;">import</a>, <a href="/0,inwestycje,indeks.html" title="inwestycje" style="color:#888;">inwestycje</a>, <a href="/0,nieruchomo%B6ci,indeks.html" title="nieruchomości" style="color:#888;">nieruchomości</a>, <a href="/0,remonty,indeks.html" title="remonty" style="color:#888;">remonty</a>, <a href="/0,us%B3ugi+bdowlane,indeks.html" title="usługi bdowlane" style="color:#888;">usługi bdowlane</a></span>
Desired output in the txt file for that data:
Euroco Sp. z o.o. [TAB] Pomorska 17/60, 03-101 Warszawa [TAB] 22 6767697 [TAB] (022)870-66-12 [TAB] 1132483993 [TAB] [TAB] biuro@euroco.pl [TAB] Sp. z o.o. [TAB] 11-50 [TAB] 0000156847 [TAB] poniedziałek-piątek: 09:00-17:00 [TAB] adaptacje eksport import inwestycje nieruchomości remonty usługi budowlane [END OF LINE]
Many fields are optional. The company name and address are always present, but fields such as email and fax aren't in every file.
I wrote a script in Brown Recluse (a programmable spider), but it's too slow when working on files. Using its modified RegExp, it was something like this:
ry.Mask = 'Telefon:</b>(.*?)</span>';
if ry.Match then tel = Trim(Decode(ry.Value[1]));
How can I do that using awk? I'd prefer to put the awk code in a separate file, not on the command line. The only thing I know so far is the command to run it:
awk -f program.awk *.* > result.txt
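For illustration, one possible shape for program.awk is sketched below. It is only a minimal sketch under assumptions, not a tested solution for the real files: it assumes each field starts and ends on a single line of the html, that labels look exactly like the samples above, and that 12 columns in the order of the desired output are wanted. The helper names (clean, grab, jsvar, set, emit) are made up for this example. Labels with Polish letters are matched by a plain-ASCII fragment so the file encoding should not matter.

```shell
# Write the awk program to its own file (POSIX awk features only).
cat > program.awk <<'EOF'
# program.awk: sketch that prints one tab-separated row per input html file.

# Strip html tags, squeeze runs of whitespace, trim both ends.
function clean(s) {
    gsub(/<[^>]*>/, "", s)
    gsub(/[ \t\r]+/, " ", s)
    sub(/^ /, "", s)
    sub(/ $/, "", s)
    return s
}

# Find `start` on the current line, skip past the next `skip`,
# cut at `stop` (if present) and clean the rest. "" when not found.
function grab(start, skip, stop,    s, i) {
    i = index($0, start); if (i == 0) return ""
    s = substr($0, i + length(start))
    i = index(s, skip);   if (i == 0) return ""
    s = substr(s, i + length(skip))
    i = index(s, stop);   if (i > 0) s = substr(s, 1, i - 1)
    return clean(s)
}

# Value of a JavaScript assignment written as: name = "value"
function jsvar(name,    s) {
    if (!match($0, name " = \"[^\"]*\"")) return ""
    s = substr($0, RSTART, RLENGTH)
    sub(/^[^"]*"/, "", s)   # drop everything up to the opening quote
    sub(/"$/, "", s)        # drop the closing quote
    return s
}

# Keep only the first non-empty value seen for column k.
function set(k, v) { if (v != "" && f[k] == "") f[k] = v }

# Print the collected row (12 columns, assumed order) and reset it.
function emit(    i, row) {
    row = f[1]
    for (i = 2; i <= 12; i++) row = row "\t" f[i]
    print row
    for (i = 1; i <= 12; i++) f[i] = ""
}

# First line of every file except the first: flush the previous file.
FNR == 1 && NR > 1 { emit() }

{
    set(1,  grab("<h1", ">", "</h1>"))
    set(2,  grab("<b>Adres:", "</b>", "</span>"))
    set(3,  grab("<b>Telefon:", "</b>", "</span>"))
    set(4,  grab("<b>Fax:", "</b>", "</span>"))
    set(5,  grab("<b>Nip:", "</b>", "</span>"))
    set(6,  grab("<b>Strona", "</b>", "</span>"))   # guess: stays empty if the label has no </b>
    lg0 = jsvar("lg0"); lg1 = jsvar("lg1")
    if (lg0 != "" && lg1 != "") set(7, lg0 "@" lg1)
    set(8,  grab("<b>Forma w", "</b>", "</span>"))
    set(9,  grab("<b>Zatrudnienie:", "</b>", "</span>"))
    set(10, grab("<b>Numer KRS:", "</b>", "</span>"))
    set(11, grab("<b>Godziny otwarcia:", "</b>", "</span>"))
    kw = grab("owa kluczowe:", "</b>", "</span>")   # ASCII tail of the keywords label
    gsub(/,/, "", kw)                                # keywords separated by spaces
    set(12, kw)
}

END { if (NR > 0) emit() }
EOF
```

It would be run with the same command as above. Since the files do not appear to be UTF-8, prefixing the command with LC_ALL=C may help awk treat the Polish letters as plain bytes, and may also be faster. Note that the sketch keeps the text of the link tags for keywords rather than the title attribute, and leaves a column empty whenever a label is missing.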
Three example html files are attached. Can anyone give me some advice on how to do this?
Greetings,
Paul