
Program slows as workload grows and grows


MrCBofBCinTX (Technical User)
I have a small program that scans Apache log files for certain "evil" patterns and then blocks those IP addresses from the whole server (not just Apache).
But as these log files grow and grow, the CPU usage keeps going up.
When I restart Apache after clearing the log files, the problem goes away until they grow large again.

I don't want the scanning to lag too far behind, since many pages are produced with mod_perl and PostgreSQL values.

Should I be changing how the program gets its data, or doing something inside the program itself?

Any suggestions?
 
Hi

It would really be necessary to know how, and how many, log files are processed.

I suppose the script always reads the entire file, processing the already-seen lines again and again. In this case, after processing all lines I would save the current position returned by [tt]tell[/tt] into a file, then on the next processing I would read that value back, move to that position using [tt]seek[/tt] and process only the lines that arrived since the previous processing.
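
For example, a minimal sketch of that (file names are made up, error handling kept simple) :

[code]
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $log   = '/var/log/apache2/access.log';   # made-up paths, adjust to yours
my $state = '/var/tmp/scanlog.pos';

# read the offset saved by the previous run, defaulting to 0
my $pos = 0;
if (open my $sf, '<', $state) {
    $pos = <$sf> + 0;
    close $sf;
}

# if the log was rotated or truncated meanwhile, start from the beginning
my $size = -s $log;
$pos = 0 if !defined $size or $pos > $size;

open my $fh, '<', $log or die "Cannot open $log: $!";
seek $fh, $pos, SEEK_SET;

while (my $line = <$fh>) {
    # scan_line($line);   # your pattern matching / blocking goes here
}

# remember where we stopped, for the next run
open my $out, '>', $state or die "Cannot write $state: $!";
print $out tell $fh;
close $out;
close $fh;
[/code]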

If you always have to process the entire log file, I would use the above [tt]tell[/tt] / [tt]seek[/tt] approach to [tt]insert[/tt] the new lines into a database, then do the intrusion detection queries in the database.
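
Roughly like this; you mentioned PostgreSQL, so I assume DBD::Pg, and the [tt]access_log[/tt] table with an epoch [tt]ts[/tt] column is just an example :

[code]
use strict;
use warnings;
use DBI;

# example connection; database, user and table names are made up
my $dbh = DBI->connect('dbi:Pg:dbname=weblog', 'scanner', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

my $ins = $dbh->prepare(
    'INSERT INTO access_log (ts, ip, request) VALUES (?, ?, ?)');

# for each new line obtained with the tell / seek loop above :
# $ins->execute($ts, $ip, $request);

# then the detection becomes a query, e.g. flag very chatty clients
my $suspects = $dbh->selectcol_arrayref(
    'SELECT ip FROM access_log
      WHERE ts > extract(epoch from now()) - 300
      GROUP BY ip HAVING count(*) > 100');

# block_ip($_) for @$suspects;   # hypothetical blocking routine
[/code]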

But of course, these are just theories. They may or may not match your task.


Feherke.
feherke.github.io
 
Just the Apache error and access logs. There is no need to scan anything but the latest entries as they come in.

Your idea seems like a good one, I will try it.
I thought I might get a better answer by asking here :)
 
Hi

MrCBofBCinTX said:
Just the Apache error and access logs.
Obviously my question was too brief. By "how many" I was thinking of rotated logs.

MrCBofBCinTX said:
No need to scan anything but the latest entries as they come off.
Another way to achieve that is to use two sets of log files:
[ul]
[li]the original one in common log format, that you will keep untouched[/li]
[li]another one formatted for easier parsing [sup](*)[/sup], that one you will process and after each processing [tt]truncate[/tt] to empty[/li]
[/ul]
[small](*) Why log the date as "[27/Jul/2013:20:53:00 +0000]" when "1374947580" is faster to parse?[/small]
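
The processing side is then trivial; the path is an example, and on Apache 2.4 something like [tt]LogFormat "%{sec}t %a %r" scan[/tt] with a second [tt]CustomLog[/tt] could write such a file :

[code]
use strict;
use warnings;

my $log = '/var/log/apache2/scan.log';   # example path for the second log

# open read-write so the same handle can be truncated afterwards
open my $fh, '+<', $log or die "Cannot open $log: $!";

while (my $line = <$fh>) {
    my ($epoch, $ip, $request) = split ' ', $line, 3;
    # scan $request, decide about $ip ...
}

# empty the file so the next round only sees fresh entries
# ( lines written between the last read and the truncate are lost,
#   so keep this window short )
truncate $fh, 0;
close $fh;
[/code]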


Feherke.
feherke.github.io
 
I decided to use File::Tail.
It has several different parameters, and it lets me keep using a debugging mode I have that tests new patterns by reading in the whole file if I want to.
(And reversing the blocked IPs if something is screwed up.)
Luckily I never make mistakes :(
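
In case it helps anyone else, the skeleton comes out something like this (the path and the pattern handling are placeholders; [tt]tail => -1[/tt] is the whole-file debug mode):

[code]
use strict;
use warnings;
use File::Tail;

my $tail = File::Tail->new(
    name        => '/var/log/apache2/access.log',   # placeholder path
    tail        => 0,    # start at end of file; -1 reads the whole file first
    interval    => 2,    # initial seconds between checks
    maxinterval => 10,   # ceiling the module backs off toward
);

while (defined(my $line = $tail->read)) {
    # block_ip($1) if $line =~ /some-evil-pattern/;   # hypothetical handler
}
[/code]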

I like to restart the server fairly often and look through the error log to see if any new bad bots are showing up to be added to the special friends list.
 