Multiple html files to tab delimited txt (data extraction question)

qra · Nov 7, 2009

Welcome everyone.

I'm new to this forum and to awk programming. I'm from Poland, so please forgive my English.

I'm looking for a fast way to extract data from over 1,000,000 HTML files (around 30 kB each) and save it to a txt file with tab as the delimiter. Each HTML file holds information about one company that I want to extract (company name, phone and fax, address, www, e-mail, keywords, etc.), so one HTML file will be one record (one row in the txt file, if you like).

The data I want to extract is easy to spot. E.g. the company name is always between h1 tags:

<h1 class="wiz_tyt">Euroco Sp. z o.o.</h1>

Other fields:
Adres: Pomorska 17/60, 03-101 Warszawa
Telefon: 22 6767697
Fax: (022) 870-66-12
Nip: 1132483993 
Strona www: <a href="http://www.euroco.pl" rel="nofollow" target="_blank">www.euroco.pl</a>
Forma własności: Sp. z o.o.
Zatrudnienie: 11-50
Numer KRS: 0000156847
Godziny otwarcia: poniedziałek-piątek: 09:00-17:00

The e-mail field is a little different: it is "encoded" in JavaScript. I just need to extract the lg0 and lg1 variables:

E-mail: <script type="text/javascript">pr_1 = "mai";pr_2 = "lto:";lg0 = "biuro";at = "@";lg1 = "euroco.pl"; document.write('<a href=\"'+pr_1+pr_2+lg0+at+lg1+'\">'); document.write( lg0 + at + lg1 + '<\/a>' ); </script>

And last, the keywords, which appear both in the title attribute and between the <a></a> tags:

Słowa kluczowe: <a href="/0,adaptacje,indeks.html" title="adaptacje" style="color:#888;">adaptacje</a>, <a href="/0,eksport,indeks.html" title="eksport" style="color:#888;">eksport</a>, <a href="/0,import,indeks.html" title="import" style="color:#888;">import</a>, <a href="/0,inwestycje,indeks.html" title="inwestycje" style="color:#888;">inwestycje</a>, <a href="/0,nieruchomo%B6ci,indeks.html" title="nieruchomości" style="color:#888;">nieruchomości</a>, <a href="/0,remonty,indeks.html" title="remonty" style="color:#888;">remonty</a>, <a href="/0,us%B3ugi+bdowlane,indeks.html" title="usługi bdowlane" style="color:#888;">usługi bdowlane</a>

Desired output in txt file for that data:

Euroco Sp. z o.o. [TAB] Pomorska 17/60, 03-101 Warszawa [TAB] 22 6767697 [TAB] (022)870-66-12 [TAB] 1132483993 [TAB] http://www.euroco.pl [TAB] biuro@euroco.pl [TAB] Sp. z o.o. [TAB] 11-50 [TAB] 0000156847 [TAB] poniedziałek-piątek: 09:00-17:00 [TAB] adaptacje eksport import inwestycje nieruchomości remonty usługi budowlane [END OF LINE]

Many fields are optional. The company name and address are always present, but fields such as e-mail, www or fax aren't in every file.

I wrote a script in Brown Recluse (a programmable spider), but it's too slow when working on files. Using their modified RegExp it was something like this:

ry.Mask = 'Telefon:(.*?)';
if ry.Match then tel = Trim(Decode(ry.Value[1]));

How can I do that using awk? I'd prefer to put the awk code in a separate file, not on the command line. All I know so far is the command to run it:

awk -f program.awk *.* > result.txt

Three example HTML files are attached. Can anyone give me some advice on how to do this?

greetings
Paul

feherke · Nov 7, 2009

Hi

Supposing the files are as in your example:
- each piece of information is on a separate line
- each piece of information fits on a single line

Code:

#!/usr/bin/awk -f

BEGIN {
  OFS="\t"
}

FNR==1 && NR!=1 {
  printrecord()
}

/<h1/ {
  gsub(/<[^<>]+>/,"")
  d["Name"]=$0
  next
}

{
  gsub(/<[^<>]+>/,"")
  if (c=index($0,":")) d[substr($0,1,c-1)]=substr($0,c+1)
}

END {
  printrecord()
}

function printrecord()
{
  for (f in d) {
    gsub(/^ +| +$/,"",d[f])
    gsub(/&nbsp;/," ",d[f])
    gsub(/&lt;/,"<",d[f])
    gsub(/&gt;/,">",d[f])
    gsub(/&amp;/,"\\&",d[f])   # a bare & in the replacement would mean the matched text
  }

  split(d["E-mail"],a,/"/)
  d["E-mail"]=a[6] "@" a[10]
  gsub(/, */," ",d["S#owa kluczowe"])

  print d["Name"],d["Adres"],d["Telefon"],d["Fax"],d["Nip"],d["Strona www"],d["E-mail"],d["Forma w#asno#ci"],d["Zatrudnienie"],d["Numer KRS"],d["Godziny otwarcia"],d["S#owa kluczowe"]

  for (f in d) delete d[f]
}

Tested with gawk and mawk.
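
The colon-splitting idea in the script above can be tried in isolation. The following is a minimal sketch, not part of the posted script: the wrapper name fields_to_tsv is hypothetical, the input is three lines taken from the sample in the first post, and only four columns are printed. It suggests that a field missing from the input (here Fax) should simply come out as an empty column.

```shell
#!/bin/sh
# fields_to_tsv is a hypothetical wrapper name used only for this sketch.
fields_to_tsv() {
  awk 'BEGIN { OFS = "\t" }
  {
    gsub(/<[^<>]+>/, "")              # strip HTML tags from the current line
    c = index($0, ":")                # position of the first colon, 0 if none
    if (c) {
      v = substr($0, c + 1)           # text after the colon is the value
      gsub(/^ +| +$/, "", v)          # trim spaces from both ends
      d[substr($0, 1, c - 1)] = v     # text before the colon is the key
    }
  }
  END { print d["Adres"], d["Telefon"], d["Fax"], d["Nip"] }'
}

printf '%s\n' 'Adres: Pomorska 17/60, 03-101 Warszawa' \
              'Telefon: 22 6767697' \
              'Nip: 1132483993 ' | fields_to_tsv
```

The printed line has two consecutive tabs where the Fax value would be, so the column count stays the same for every record.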

Sorry for the Polish-specific characters. You will have to replace the sharp ( # ) characters inserted by my ( not so clever ) editor.

Feherke.

http://free.rootshell.be/~feherke/

qra · Nov 8, 2009

Hi!

Thanks for the very fast response! Thanks a lot for the script, but I don't understand exactly how it works. And it doesn't do what I meant. Maybe I described it wrongly. What your script actually does is strip the HTML tags, and that's not the point. I will describe this with an example.

The 3 htm files I included in the first post will be the test data. So let's say we have these 3 files in the directory pages/. The command to start the script will be:

gawk -f program.awk pages/*.* > output.txt

After this, the output file has 3 lines of plain text (each htm file ends up on one line). I will put one line here and color in red the information that I want to keep in this file; the rest is trash.

« przejdź do wynikówEuroco Sp. z o.o.Adres: Pomorska 17/60, 03-101 WarszawaTelefon: 22 6767697Fax: (022) 870-66-12Nip: 1132483993 E-mail: pr_1 = "mai";pr_2 = "lto:";lg0 = "biuro";at = "@";lg1 = "euroco.pl"; document.write(''); document.write( lg0 + at + lg1 + '' ); Strona www: www.euroco.plForma własności: Sp. z o.o.Zatrudnienie: 11-50Numer KRS: 0000156847KRS »Godziny otwarcia: poniedziałek-piątek: 09:00-17:00Słowa kluczowe: adaptacje, eksport, import, inwestycje, nieruchomości, remonty, usługi bdowlane Raport biznesowy (KRS) / cut rest of line - trash data /

So how can I filter that plain text file to get an output file where each line contains only the data I want? The desired output file looks like this (tab as delimiter):

Euroco Sp. z o.o. Pomorska 17/60, 03-101 Warszawa 22 6767697 (022)870-66-12 1132483993 http://www.euroco.pl biuro@euroco.pl Sp. z o.o. 11-50 0000156847 poniedziałek-piątek: 09:00-17:00 adaptacje eksport import inwestycje nieruchomości remonty usługi budowlane [END OF LINE]

I need this data file structure to import it easily into MS Access, so that every line is one record in the database and each piece of information on a line goes into a separate column, like this:

Company Name | Address | Phone | Fax | WWW | http://www.e| biuro@ |...

I can't explain this any other way. I think it can be done in two passes: first strip the HTML tags (as your script already does), and then run another script on that output file to filter the data.

I'm impressed with the speed at which awk works; that's why I'm begging for help with this.

If it's not a big problem for you, can you comment your code? I always try to understand how a program/script works.

Thanks for help and Your time!
Greetings
Paul

feherke · Nov 8, 2009

Hi

For the sample input you posted, my script produces the same line of tab-separated values as in your desired output. Unless you give us details about which awk implementation you are using, how you are running the script, what exactly the input files look like and what exactly my script produces, there is no way to help you further.

Anyway, here is the commented script :

Code:

#!/usr/bin/awk -f

BEGIN {                     # before processing the input
  OFS="\t"                  # set the OFS to tab character
}

FNR==1 && NR!=1 {           # if this is the first record of an input file, but not the overall first record
  printrecord()             # call the printrecord() function
}

/<h1/ {                     # if current record contains h1 tag
  gsub(/<[^<>]+>/,"")       # strip HTML tags
  d["Name"]=$0              # put current record in array d at key Name
  next                      # skip the remaining code
}

{                           # for every record
  gsub(/<[^<>]+>/,"")       # strip HTML tags
  if (c=index($0,":"))      # if current record contains colon, put its position in variable c
    d[substr($0,1,c-1)]=substr($0,c+1) # put the substring after the colon in array d at key given by text before the colon
}

END {                       # after processing the input
  printrecord()             # call the printrecord() function
}

function printrecord()      # define function printrecord()
{
  for (f in d) {            # for each key of array d
    gsub(/^ +| +$/,"",d[f]) # trim spaces from both ends
    gsub(/&nbsp;/," ",d[f]) # replace &nbsp; character entity with space
    gsub(/&lt;/,"<",d[f])   # replace &lt; character entity with less-than
    gsub(/&gt;/,">",d[f])   # replace &gt; character entity with greater-than
    gsub(/&amp;/,"\\&",d[f]) # replace &amp; character entity with ampersand (a bare & in the replacement would mean the matched text)
  }

  split(d["E-mail"],a,/"/)  # split the value of array d's E-mail element on double quotes and put it in array a
  d["E-mail"]=a[6] "@" a[10] # put array a's 6th and 10th element into array d at key E-mail
  gsub(/, */," ",d["S#owa kluczowe"]) # replace commas followed by any number of spaces with one space

  # print array d's elements with the specified keys, separated with the OFS
  print d["Name"],d["Adres"],d["Telefon"],d["Fax"],d["Nip"],d["Strona www"],d["E-mail"],d["Forma w#asno#ci"],d["Zatrudnienie"],d["Numer KRS"],d["Godziny otwarcia"],d["S#owa kluczowe"]

  for (f in d)              # for each key of array d
    delete d[f]             # delete array d's element at the given key
}
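
Why the 6th and 10th elements are the ones that matter in the E-mail step can be checked on its own. This is a minimal sketch under assumptions: email_from_js is a hypothetical wrapper name, the input is a shortened copy of the tag-stripped E-mail text from the sample, and the guard on the element count is an addition that the script above does not have.

```shell
#!/bin/sh
# email_from_js is a hypothetical wrapper name used only for this sketch.
email_from_js() {
  awk '{
    n = split($0, a, /"/)                # quoted values land in the even-numbered elements
    if (n >= 10) print a[6] "@" a[10]    # a[6] holds lg0, a[10] holds lg1
  }'
}

echo 'pr_1 = "mai";pr_2 = "lto:";lg0 = "biuro";at = "@";lg1 = "euroco.pl";' | email_from_js
```

For the sample line this prints biuro@euroco.pl. A line without the quoted variables prints nothing, whereas an unguarded split would leave a lone "@" in the column.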

Feherke.

http://free.rootshell.be/~feherke/

qra · Nov 8, 2009

Hi

Thanks again for the fast reply. The attached package contains:

- run.bat (the command that runs the script)
- out.txt (the output I'm receiving on my computer)
- test/ (directory with the 65 html files I'm working on)
- program.awk (your script)

E:\>gawk --version
GNU Awk 3.1.6
Copyright (C) 1989, 1991-2007 Free Software Foundation.

Operating system: Windows XP Pro SP3

Running command:
E:\>awk -f program.awk test/*.* > out.txt

No matter whether I use awk, gawk or pgawk, the output is always the same.

Got this build from:

http://gnuwin32.sourceforge.net/packages/gawk.htm

So that seems to be the problem, I think: the awk implementation under Windows :/

feherke · Nov 8, 2009

Hi

Feherke said:
Supposing the files are as in your example:
- each piece of information is on a separate line
- each piece of information fits on a single line

Those HTML files are not like your sample input, so my script has no reason to work on them.

However, it can be adapted with some minor changes:

Code:

BEGIN {
  RS="</span>|</div>"        # changed
  OFS="\t"
}

FNR==1 && NR!=1 {
  printrecord()
}

/<h1/ {
  d["Name"]=$0
  gsub(/.*<h1[^>]*>|<\/h1>.*/,"",d["Name"])   # changed
}

{
  gsub(/<[^<>]+>/,"")
  if (c=index($0,":")) d[substr($0,1,c-1)]=substr($0,c+1)
}

END {
  printrecord()
}

function printrecord()
{
  for (f in d) {
    gsub(/^ +| +$/,"",d[f])
    gsub(/&nbsp;/," ",d[f])
    gsub(/&lt;/,"<",d[f])
    gsub(/&gt;/,">",d[f])
    gsub(/&amp;/,"\\&",d[f])   # a bare & in the replacement would mean the matched text
  }

  if (d["E-mail"]) {         # changed
    split(d["E-mail"],a,/"/)
    d["E-mail"]=a[6] "@" a[10]
  }
  gsub(/, */," ",d["Słowa kluczowe"])

  print d["Name"],d["Adres"],d["Telefon"],d["Fax"],d["Nip"],d["Strona www"],d["E-mail"],d["Forma własności"],d["Zatrudnienie"],d["Numer KRS"],d["Godziny otwarcia"],d["Słowa kluczowe"]

  for (f in d) delete d[f]
}

Just a personal curiosity: why did you choose awk?

Feherke.

http://free.rootshell.be/~feherke/

qra · Nov 10, 2009

Hi

That script works like a charm :]. Awesome. But I have a question about a modification. I noticed that the company name in the <h1> tags sometimes has a graphic logo, and then the h1 tag looks like this:

Code:

<h1 class="wiz_tyt"><div style="height: 50px; float: left;"><img src="http://m.onet.pl/_m/c97f80d5b468399024bb57854c70a1ea,0,1.gif" alt="Brava-Ex" style="margin-right: 8px; float:left; "/></div>Brava-Ex</h1>

In this example the company name is Brava-Ex. How can I modify your script to extract the name from this? Right now the Name field contains this:

Code:

<div style="height: 50px; float: left;"><img src="http://m.onet.pl/_m/c97f80d5b468399024bb57854c70a1ea,0,1.gif" alt="Brava-Ex" style="margin-right: 8px; float:left; "/>

And a second question. The address always contains the street, then after a comma and a space the postal code (always in XX-XXX format) and the city name. How can I split this into 3 columns? Example:

Wolbromska 53, 03-680 Warszawa

The postal code always has the XX-XXX format, where X=[0-9], and it is always preceded by a comma and a space. The city name always comes after the postal code (sometimes it has 2-3 words). Can you modify your script that way? (With comments on the new lines, if that's not a big problem for you, please.)

I've chosen AWK because of its C-like language form and because it uses RegExp. And I just didn't know how to do this (in what way). I'm always looking for UNIX tools to solve problems, because they are the best and fastest. I solved my problem with Brown Recluse (C-like form with regexp, a programmable web spider), but it's not a tool for working on hundreds of thousands of files; it's far too slow. So I gave AWK a try. How would you deal with this problem? Can you point me to some other ways to do this? Other tools or something?

Thanks A LOT for Your patience
Greetings
Paul

feherke · Nov 10, 2009

Hi

Paul said:
company name in <h1> tags sometimes has graphic logo

Actually the problem is not the graphic, but the div which contains it. When I found out that the HTML is not wrapped into lines, I changed the RS (record separator) to "</span>|</div>", so that closing span and div tags end records. So such an h1's content is read in as two separate records and processed in two separate steps:

Code:

NR==61 <h1 class="wiz_tyt"><div style="height: 50px; float: left;"><img src="http://m.onet.pl/_m/c97f80d5b468399024bb57854c70a1ea,0,1.gif" alt="Brava-Ex" style="margin-right: 8px; float:left; "/>

NR==62 Brava-Ex</h1>
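
The record split shown in the listing above can be reproduced on a shortened h1. This is a minimal sketch under assumptions: show_records is a hypothetical wrapper name, the HTML is cut down from the Brava-Ex example, and a regular expression in RS is an extension that gawk and mawk support but not every awk.

```shell
#!/bin/sh
# show_records is a hypothetical wrapper name used only for this sketch.
show_records() {
  awk 'BEGIN { RS = "</span>|</div>" }   # closing span and div tags end a record
       { print "NR==" NR " " $0 }'
}

printf '%s' '<h1 class="wiz_tyt"><div><img alt="Brava-Ex"/></div>Brava-Ex</h1>' | show_records
```

Two records come out: the first ends just before the closing div tag and still holds the opening h1, the second holds the name and the closing h1.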

The quick & dirty solution is to identify the record containing the company name not by the opening h1 tag, but by the closing one:

Code:

/<\/h1/ {                    # changed: match the closing tag
  d["Name"]=$0
  gsub(/.*<h1[^>]*>|<\/h1>.*/,"",d["Name"])
}
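
The closing-tag rule can be tried on both shapes of h1 seen in this thread. This is a minimal sketch under assumptions: company_name is a hypothetical wrapper name, the second input is a shortened version of the Brava-Ex markup, and the regular expression in RS needs gawk or mawk.

```shell
#!/bin/sh
# company_name is a hypothetical wrapper name used only for this sketch.
company_name() {
  awk 'BEGIN { RS = "</span>|</div>" }
  /<\/h1/ {                                  # the record that holds the closing h1 tag
    name = $0
    gsub(/.*<h1[^>]*>|<\/h1>.*/, "", name)   # drop everything up to <h1 ...> and from </h1> on
    print name
  }'
}

printf '%s' '<h1 class="wiz_tyt">Euroco Sp. z o.o.</h1>' | company_name
printf '%s' '<h1 class="wiz_tyt"><div><img alt="Brava-Ex"/></div>Brava-Ex</h1>' | company_name
```

The first call prints Euroco Sp. z o.o. and the second prints Brava-Ex, because the record with the logo never contains the closing tag and is skipped by the rule.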

Paul said:
How to split this on 3 columns.

The usual way is to use the split() function for that. As the separators differ, two calls are required. Then you have to change the print too, to output d["Street"], d["Code"] and d["City"] instead of d["Adres"].
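
A portable version of that splitting might look like the sketch below. It is only an illustration under assumptions: split_address is a hypothetical wrapper name, and it uses one split() call plus index() instead of two split() calls, so that a city name of several words stays in one piece.

```shell
#!/bin/sh
# split_address is a hypothetical wrapper name used only for this sketch.
split_address() {
  awk 'BEGIN { OFS = "\t" }
  {
    n = split($0, part, /, /)               # the last piece is "code city"
    if (n < 2) { print $0, "", ""; next }   # no ", " found: leave code and city empty
    street = part[1]
    for (i = 2; i < n; i++)                 # re-join a street that itself contains ", "
      street = street ", " part[i]
    sp = index(part[n], " ")                # the first space ends the postal code
    print street, substr(part[n], 1, sp - 1), substr(part[n], sp + 1)
  }'
}

echo 'Wolbromska 53, 03-680 Warszawa' | split_address
```

For the example address this prints the street, the code and the city separated by tabs. Inside the script above, the same three pieces would be stored in d["Street"], d["Code"] and d["City"].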

But since you have GNU awk, I suggest using its powerful gensub() function, which handles captured groups in regular expressions. So just insert this before the print:

Code:

d["Adres"]=gensub(/(.+), ([[:digit:]-]+) (.+)/,"\\1" OFS "\\2" OFS "\\3",1,d["Adres"])

Feherke.

http://free.rootshell.be/~feherke/

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!
