Tek-Tips is the largest IT community on the Internet today!

Using an external script in awk

Status
Not open for further replies.

theo67

Technical User
Jul 17, 2008
31
DE
Hello to all,

I cannot find a solution for this:

I have a file, file1.txt, whose lines look like this:
4567;9803298ß;38840 (thousands and thousands of lines)

I also have an external script, "dosomething".
Now I need to do the following steps:
1. Open the file.
2. Cut the first 4 digits from each line.
3. Use them as a parameter to call the external script, like: dosomething xxxx
4. Use the result of "dosomething xxxx" to replace the first 4 digits of that line.
5. Print the line to the output.

I am working with awk and I cannot find a way to use this external script in it...

Can someone PLEASE help?

Thanks!
 
Have a look at the getline function in your awk's documentation or man page:
command | getline var
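
As a minimal sketch of that form: here expr is only a stand-in chosen for illustration, it is not the poster's dosomething, and the sample line is made up.

```shell
# Minimal sketch of "command | getline var" inside awk.
# expr is a stand-in for the external program (an assumption for this demo).
result=$(printf '4567;a;b\n' | awk -F';' '{
  cmd = "expr " $1 " + 1"   # build the command line from the first field
  cmd | getline out         # read the first line of its output into out
  close(cmd)                # release the pipe once the value is read
  print out
}')
echo "$result"
```

With this stand-in, the sample line prints 4568.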

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Hi

Sorry, this is an off-topic answer, but my curiosity has reached its limit.

Why do you need an Awk solution for this ?

Do not take it personally; this is not strictly about you or your question. Over the years I have seen people come and ask for Awk solutions in cases where I cannot imagine why.

For example, I would solve your problem like this, definitely without involving Awk :
Code:
while IFS=';' read -r begin end; do
  echo "$( dosomething "$begin" );$end"
done < file1.txt
The above works in Bash, Dash, MKsh.

Feherke.
feherke.github.com/
 
Hi feherke,

I think you are right... I am a newbie, so I have a lot to learn!!
And I am glad every time I get an idea from people who have experience like you!!

Thank you very much for the answer. I used it and it is exactly what I need...

Theo
 
Hmm, I realise now that it is very slow...
My file is 80 MB and the script has been running for 3 hours....

I tested it with a small file and it worked fine, but I did not realise that it would take so long with my original file...
 
Hi

I am afraid there is not much to optimize in that code. But some strategies may help.

Maybe caching ? Previously you wrote :

Theo said:
4567;9803298ß;38840 (thousands and thousands of lines)

(...)

2. Cut the first 4 digits

Given the huge number of lines and the shortness of the codes, is it possible that the 4-digit codes are not unique ? In that case we could run dosomething only once for a given code and save its output, then reuse that saved output later without running dosomething again.

Maybe parallelising ? Some versions of xargs and make are able to execute tasks in parallel. This is especially useful if dosomething has idle times during its run or you have a multicore processor. But even if not, running multiple dosomething processes at the same time should help. Of course, if the order of the output matters, this becomes a bit more complicated, but bearable.
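
A rough sketch of the xargs variant, under several assumptions: -P is an extension found in GNU and BSD xargs (not in POSIX), the dosomething created below is a hypothetical stand-in that adds 1 to its argument, and the sample data and map.txt name are made up. It runs the program once per distinct code, in parallel, then joins the results back in the original line order.

```shell
# Hypothetical stand-in for the real dosomething binary: prints its argument plus 1.
workdir=$(mktemp -d)
printf '#!/bin/sh\nexpr "$1" + 1\n' > "$workdir/dosomething"
chmod +x "$workdir/dosomething"
PATH="$workdir:$PATH"; export PATH

# Made-up sample data in the same layout as file1.txt.
printf '4567;a;1\n1234;b;2\n4567;c;3\n' > "$workdir/file1.txt"

# 1. Run dosomething once per distinct code, up to 4 processes at a time.
#    Each process writes one "code;result" line; their order does not matter.
cut -d';' -f1 "$workdir/file1.txt" | sort -u |
  xargs -P 4 -n 1 sh -c 'printf "%s;%s\n" "$1" "$(dosomething "$1")"' sh \
  > "$workdir/map.txt"

# 2. Load the map first, then rewrite the data file in its original order.
result=$(awk -F';' 'NR == FNR { d[$1] = $2; next } { print d[$1] substr($0, 5) }' \
  "$workdir/map.txt" "$workdir/file1.txt")
echo "$result"
rm -rf "$workdir"
```

Because the lookup table is complete before the data file is read, the output order stays the same as the input order even though the dosomething runs finish in any order.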

So give us some details on those codes and dosomething's activity.


Feherke.
 
Hi Feherke and thank you so much for your help!

The first field (the first 4 digits) is not unique.
"dosomething" is a binary; it takes this number and calculates a new one. The new number always depends only on the input. That means, e.g., "dosomething 4567" always gives 9878 as output.

Is this what you needed to know?
 
What about this ?
Code:
awk -F';' '{
 if(!d[$1])"dosomething "$1 | getline d[$1]
 print d[$1] substr($0,5)
}' file1.txt

Hope This Helps, PH.
 
Hi

The simplest version :
Code:
cache='/tmp/dosomething.cache'
mkdir -p "$cache"  # the cache directory must exist before the first write

while IFS=';' read -r begin end; do
  [[ -f "$cache/$begin" ]] || dosomething "$begin" > "$cache/$begin"
  echo "$(< "$cache/$begin" );$end"
done < file1.txt
This creates a separate file for each code. Fast to write, fast to read, may be slow to search, but this probably depends on the filesystem used too.

Regarding that search slowness, I would just start the script, wait until there are a few thousand files in the cache directory, then do a [[ -f '/tmp/dosomething.cache/4567' ]] ( or any other code ) from the command line and see whether it takes whole seconds. If yes, tell us. Then we will look for other storage tricks ( for example separate subdirectories based on the first character ) or alternatives ( for example an SQLite database ).

One thing to note :
man bash said:
BUGS
It's too big and too slow.

If you have Ksh, use that instead. ( On Linux you will probably find the public domain ( pdksh ) or MirOS ( mksh ) implementation. They are also faster. )

If you have Dash, use that instead. But Dash has only what POSIX specifies, so the above code will need a minor rewrite.
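
One way that minor rewrite might look, as a sketch only: [[ ]] becomes [ ] and $(< file) becomes $(cat file). The dosomething function, the temporary paths and the sample data are stand-ins invented so the sketch can run on its own.

```shell
# Hypothetical stand-in for the real binary: prints its argument plus 1.
dosomething() { expr "$1" + 1; }

cache=$(mktemp -d)   # the thread uses /tmp/dosomething.cache
input=$(mktemp)      # made-up sample data instead of file1.txt
printf '4567;a;1\n1234;b;2\n4567;c;3\n' > "$input"

result=$(
  while IFS=';' read -r begin end; do
    # [ ] and cat are plain POSIX, unlike [[ ]] and $(< file).
    [ -f "$cache/$begin" ] || dosomething "$begin" > "$cache/$begin"
    printf '%s;%s\n' "$(cat "$cache/$begin")" "$end"
  done < "$input"
)
echo "$result"
rm -rf "$cache" "$input"
```

The extra cat process per line costs a little, so this trades some speed for portability.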

Feherke.
 
Hi

Thinking again, my concern was exaggerated. Even if there are thousands of lines, there will be no more than 10000 code pairs. So search speed cannot be an issue.

Even more, storage cannot be an issue either. I mean, once the number of dosomething runs is reduced to the minimum, PHV's Awk code should also be fast. ( With one minor glitch : a close() after the getline would avoid running out of available file handles. )

Feherke.
 
Good catch, Feherke.
Code:
awk -F';' '{
 if(!d[$1]){cmd="dosomething "$1;cmd | getline d[$1];close(cmd)
 print d[$1] substr($0,5)
}' file1.txt

Hope This Helps, PH.
 
Oops, sorry for the typo:
Code:
awk -F';' '{
 if(!d[$1]){cmd="dosomething "$1;cmd | getline d[$1];close(cmd)}
 print d[$1] substr($0,5)
}' file1.txt

Hope This Helps, PH.
 
@PHV
As output I get my input file without the first "column" :)
 
Oooh, sorry... I should refresh the page before posting :)
 
Hi PHV,
Using your code, I stopped the script after a few minutes and opened the output file. I see the whole line but without the first "column"... That's the position where the 4 digits should be...
 
Hi feherke,

I ran your 2nd version for a few seconds and stopped it. Should I now type only this on the command line:

[[ -f '/tmp/dosomething.cache/4567' ]]

???
 
Hi

Theo said:
should i now type in the commandline only:

[[ -f '/tmp/dosomething.cache/4567' ]]
Yes. You will see no output, only the exit code will be set. ( echo $? shows the exit code of the previous command, but that is irrelevant now. ) The key point was to see whether a simple check for the file is affected by the huge number of filesystem entries in that directory.
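
A small sketch of what that exit code looks like. It uses a temporary directory as a stand-in for the cache and the portable [ ] test instead of [[ ]]; the file names are made up for the demo.

```shell
dir=$(mktemp -d)      # temporary stand-in for /tmp/dosomething.cache
touch "$dir/4567"     # pretend this code is already cached

found=0
[ -f "$dir/4567" ] || found=$?      # stays 0: the file exists
missing=0
[ -f "$dir/0000" ] || missing=$?    # becomes 1: there is no such file

echo "found=$found missing=$missing"
rm -rf "$dir"
```

Typing the test alone and then echo $? at the prompt shows the same 0 or 1.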

But as I mentioned in the next post, given that the cache directory will never have more than 10000 files, my concern was exaggerated.

Feherke.
 
And this ?
Code:
awk -F';' '{
 if(d[$1]==""){cmd="dosomething "$1;cmd | getline d[$1];close(cmd)}
 print d[$1] substr($0,5)
}' file1.txt

Hope This Helps, PH.
 
[shocked] @PHV WOW 12 seconds!!!!! and the file was ready!!

A big THANKS to all for your help!!!!!!! Those are the moments when I realise all the things I can NOT do [wink]
 
PHV, is it possible for you to explain your code to me a little?
I am not sure about it...
awk -F';' sets the field separator to ; (up to here OK) [blush]
But for the rest I can only guess what it "could" mean...
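
For anyone reading along with the same question, here is a commented sketch of that last script. The dosomething created here is a hypothetical stand-in that just adds 1 to its argument, so the sketch can run on its own; how the real binary behaves is not known from the thread, and the sample lines are made up.

```shell
# Hypothetical stand-in for the real dosomething binary: prints its argument plus 1.
workdir=$(mktemp -d)
printf '#!/bin/sh\nexpr "$1" + 1\n' > "$workdir/dosomething"
chmod +x "$workdir/dosomething"
PATH="$workdir:$PATH"; export PATH

result=$(printf '4567;a;1\n1234;b;2\n4567;c;3\n' | awk -F';' '{
  # d[] is the cache: it maps a code (first field) to the program output.
  if (d[$1] == "") {            # nothing cached yet for this code
    cmd = "dosomething " $1     # command line for this code
    cmd | getline d[$1]         # first output line becomes the cached value
    close(cmd)                  # close the pipe so file handles do not run out
  }
  # Print the cached result, then the rest of the line from character 5 on,
  # which starts at the first ";" because the code is 4 characters long.
  print d[$1] substr($0, 5)
}')
echo "$result"
rm -rf "$workdir"
```

In this sample the stand-in runs only twice for three lines, because the second 4567 is served from d[]. The same effect on a large file with repeated codes would explain the drop in run time reported above.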
 