
find command with awful performance


TSch

Hello everybody,

I wrote a little script that's supposed to monitor our print server. It checks each print job in the print queue for several states (Delivery Problem, Failed, Aborted).

For every state I perform a separate find command and store the output in a variable.

e.g.

Code:
chk_1b=$(ssh spooler "find /data/SpoolIn -name job.history -exec grep -l FAIL {} \; | wc -l" | awk '{print $1}')
chk_1c=$(ssh spooler "find /data/SpoolIn -name job.history -exec grep -l ABOR {} \; | wc -l" | awk '{print $1}')

The SpoolIn directory contains a subdirectory for each print job, and every print job subdirectory itself contains one subdirectory for each step performed within the print job. The job state is recorded in the job.history file.

So there's quite a large directory structure for the find command to search: it can contain more than 5,000 subdirectories, plus several files within each subdirectory.
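Roughly, the tree looks like this (the job numbers and step names here are just made up to illustrate the nesting):
Code:
/data/SpoolIn/14033481/step_01/job.history
/data/SpoolIn/14033481/step_02/job.history
/data/SpoolIn/14033482/step_01/job.history
/data/SpoolIn/14033482/step_02/job.history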

Now - as you might have guessed already ;-) - all those find commands take quite a long time to complete ...

Is there any way to improve the script so it performs better and the search takes less time?

Best Regards,
Thomas
 
Hi

Please post a job.history sample. Your Awk implementation and the shell you use would also be helpful in picking an optimal solution.

And after all, what is your ultimate goal? Just 4 shell variables, each with the count of a status?

One thing in the meantime: [tt]-exec[/tt]ing a separate [tt]grep[/tt] instance for each found file is a bad idea. Either use [tt]+[/tt] instead of [tt];[/tt] or pipe to [tt]xargs[/tt]. Both methods will run each [tt]grep[/tt] instance with as many arguments as possible:
Code:
ssh spooler "find /data/SpoolIn -name job.history -exec grep -l FAIL {} \[highlight]+[/highlight] | wc -l" | awk '{print $1}'

[gray]# or[/gray]

ssh spooler "find /data/SpoolIn -name job.history [highlight]| xargs[/highlight] grep -l FAIL | wc -l" | awk '{print $1}'
By the way, are you sure the Awk part is necessary there? When counting standard input, my [tt]wc[/tt] outputs only the number.
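For example (the counts here are invented), reading a pipe [tt]wc[/tt] prints only the number, while given a file name it appends the name as well:
Code:
$ grep -l FAIL */job.history | wc -l
3
$ wc -l job.history
12 job.history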

Feherke.
 
I know this is in the scripting forum, but Perl is an alternative solution, using File::Find like in this example:

Code:
#!/usr/bin/perl

use File::Find;

sub findit {
    my @files = ();
    my @dirs = ();
    open DF,"df -k | grep -v ':/' | awk '{print $6}' | tail +2l|" or die;
    while (<DF>) {
        ($fs, $blocks, $used, $avail, $pct, $mnt) = split;
        print "$fs $mnt\n";
        find(sub {
            if ($_ =~ /\.rhosts$/) {
                push(@files, [$File::Find::dir . "/$_", (stat($_))[7]]);
            }
        }, "$mnt");
    }
    close(DF);
    map {
        print "$$_[0]\n";
        open(F, $$_[0]);
        while (<F>) {
            print;
        }
        close(F);
        print "|\n";
    } sort {
        $$a[1] <=> $$b[1]
    } @files;
}
&findit;
exit(0);
 
Hi

And what does that code do? There seems to be a serious contradiction:
blarneyme said:
Code:
    open DF,"df -k | grep -v ':/' | awk '{[red]print $6[/red]}' | tail +2l|" or die;
    while(<DF>){
        ([red]$fs,$blocks,$used,$avail,$pct,$mnt[/red]) = split;


Feherke.
 
If the [tt]job.history[/tt] file is at a consistent depth in the directory tree, something like this can be much faster than a [tt]find[/tt].

Code:
ssh spooler "ls -1 /data/SpoolIn/*/*/job.history 2>/dev/null | xargs grep -l FAIL | wc -l"

 
Hi everyone,

here's an example:

Code:
07:25:09.637 18.04.12
CREATE NEW -> PROCESSING job: 14033481

07:26:25.439 18.04.12
PROCESSING -> FAILED job: 14033481

08:33:23.478 18.04.12
FAILED -> REPRINTED job: 14034875

I'm using ksh under AIX.

The directory depth is not always the same, because a subdirectory is created for each step as well as another subdirectory for each substep of the print job, and each of those directories contains a job.history ...

The idea is to script an ASCII monitor that scans for the states mentioned above and gives me a "red light" if there are failed/aborted jobs, a "yellow light" if there are delivery problems, or both lights if both occur. So far it works perfectly. The only problem is the performance.

The ultimate goal is to get a screen full of monitoring windows for several systems' states and thus an overview of the landscape's condition ...
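In rough ksh terms the light logic looks something like this (chk_1a here just stands for the delivery-problem count; the other two are the counts from my first post):
Code:
lights=""
[ "$chk_1b" -gt 0 -o "$chk_1c" -gt 0 ] && lights="RED"          # failed or aborted jobs
[ "$chk_1a" -gt 0 ] && lights="$lights YELLOW"                  # delivery problems
[ -z "$lights" ] && lights="GREEN"                              # nothing wrong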

Best Regards
Thomas
 
Hi
[ul]
[li]How many job.history files are there typically?[/li]
[li]Do they change over time? If yes, how frequently?[/li]
[li]Are you running this check periodically? If yes, how frequently?[/li]
[/ul]
I ask these because I am thinking about telling [tt]find[/tt] to consider only files newer than a certain age. Not sure whether this can bring any improvement, just thinking.
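For illustration only, the age restriction could use a timestamp file left behind by the previous run (the stamp file name is made up):
Code:
# examine only job.history files modified since the previous check
find /data/SpoolIn -name job.history -newer /tmp/spoolmon.stamp | xargs grep -l FAIL | wc -l
touch /tmp/spoolmon.stamp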
[ul]
[li]Is the text you quoted above a single line?[/li]
[/ul]
I ask because a big simplification would be to use a single [tt]find[/tt] and to [tt]grep[/tt] for all of "FAIL\|ABOR\|whatever" at once. The [tt]-o[/tt] switch would restrict the output to the matched word, but sadly, if there are multiple matches in a line, it outputs them all, even if [tt]-m[/tt] is specified. Or at least GNU [tt]grep[/tt] does. If your [tt]grep[/tt] behaves differently (I mean, if [tt]echo 'foo foo' | grep -o -m 1 'foo'[/tt], with or without the [tt]-m[/tt] option, outputs a single "foo"), then tell us. But this would not be an issue if there were never more than one matching word per line, in which case this would work:
Code:
find /data/SpoolIn -name job.history | xargs grep -o -m 1 -h 'FAIL\|ABOR' | sort | uniq -c
This will output a list like:
Code:
      2 ABOR
      3 FAIL
A similar thing could be achieved using Sed too:
Code:
find /data/SpoolIn -name job.history | xargs sed -n '/FAIL\|ABOR/s/.*\(FAIL\|ABOR\).*/\1/p' | sort | uniq -c
[ul]
[li]
Feherke said:
And after all, what is your ultimate goal? Just 4 shell variables, each with the count of a status?
[/li]
[/ul]
I asked that because, as you can see, a simple list can be produced more easily; but if you definitely need separate shell variables, Awk would work better:
Code:
eval $(find /data/SpoolIn -name job.history | xargs awk 'FNR==1{f=0}!f&&match($0,/FAIL|ABOR/){s[substr($0,RSTART,RLENGTH)]++;f=1}END{for(i in s)printf"chk_1%s=%d\n",i,s[i]}')
This will create shell variables like [tt]chk_1ABOR=2[/tt] and [tt]chk_1FAIL=3[/tt].
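Note that a variable is created only for states that actually occur, so when using them later it is safer to default them to zero:
Code:
echo "aborted: ${chk_1ABOR:-0}  failed: ${chk_1FAIL:-0}"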


Feherke.
 