
Finding redundant website files

Status
Not open for further replies.

Stretchwickster

Programmer
Apr 30, 2001
Hi there,

Unfortunately, I've inherited a large site which I believe to have many unused files. Could anyone recommend the most effective way to obtain an accurate listing of such files?

Your advice would be much appreciated.

Clive

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"To err is human, but to really foul things up you need a computer." (Paul Ehrlich)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get the best answers from this forum see: faq102-5096
 
Hi

I know two ways. Both involve recursively following all the links on the site and then examining the contents of the document root directory locally.

For recursively following all the links, the best option is a site mirroring tool; I prefer [tt]wget[/tt].

1) Ask the file system. Works for both static and dynamically generated documents. Needs to have access time updating enabled.
[ol]
[li]Recursively request all documents[/li]
[li]List all files in the document root not accessed in the last n minutes[/li]
[/ol]
Code:
wget -m -nd --delete-after -q [green][i][URL unfurl="true"]http://example.com/[/URL][/i][/green]
find [green][i]/var/www/example/[/i][/green] -type f -amin +10
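Whether access times get updated is controlled by the filesystem mount options on the server. A quick sanity check could look like this (only a sketch, assuming a Linux box with GNU coreutils; the [tt]index.html[/tt] path is just a guess, pick any file under the document root):
Code:
# show the mount options; "noatime" means access times are never updated on reads
mount

# or test one file directly: read it and see whether its access time changes
stat -c '%x  %n' /var/www/example/index.html
cat /var/www/example/index.html > /dev/null
stat -c '%x  %n' /var/www/example/index.html
Note that on filesystems mounted with [tt]relatime[/tt] the access time is only updated in certain cases (for example when the stored access time is more than a day old), so the quick test above may show no change even though method 1 would still largely work.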

2) Compare the lists. Works only for static documents. Needs extra processing if directory aliases are used.
[ol]
[li]Recursively request all documents and create a list of files[/li]
[li]List all files in the document root[/li]
[li]Compare the two lists[/li]
[/ol]
Code:
wget -m -nd --delete-after -nv [green][i][URL unfurl="true"]http://example.com/[/URL][/i][/green] 2>&1 | sed '/URL:/!d;s/.*\/\///;s/ *\[[[:digit:]\/]*\] *//g;s/"//g;/\/->/s/-> *//;s/->.*//' | sort -u > needed.txt
find [green][i]/var/www/example/[/i][/green] -type f | sort > exists.txt
diff --line-format="%L" needed.txt exists.txt
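One part of that extra processing: [tt]wget[/tt] reports URLs while [tt]find[/tt] prints filesystem paths, so the two lists usually have to be reduced to a common, document-root-relative form before the comparison means anything. A rough sketch, reusing the example host and document root from above ([tt]comm -13[/tt] prints only the lines unique to the second file, i.e. files on disk that the crawl never requested):
Code:
sed 's|^example.com/*||' needed.txt | sort -u > needed-relative.txt
sed 's|^/var/www/example/*||' exists.txt | sort -u > exists-relative.txt
comm -13 needed-relative.txt exists-relative.txt
Query strings, URL-encoded file names and aliased directories will still need a manual pass.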

Feherke.
 
Sorry for the delay in replying - many thanks for the suggestions, Feherke.

feherke said:
Needs to have access time updating enabled.
Is this a setting in wget or something I need to enable on the server?

Clive

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"To err is human, but to really foul things up you need a computer." (Paul Ehrlich)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get the best answers from this forum see: faq102-5096
 
At least for checking for broken links, I found that [link Google's site map seems to work really well. I realize there could still be plenty of unused files beyond the linked ones, but it could be a start, and it isn't a bad tool regardless.

Well, it may not actually be called "Google Site Map"; I may be thinking of their "developer tools" set... either way, take a look here (just create an account if you don't already have a Google account):
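If you would rather stay on the command line, [tt]wget[/tt] can do a rough broken-link check as well. Just a sketch (the exact log format varies between wget versions, so the grep pattern may need adjusting):
Code:
wget --spider -r -o spider.log http://example.com/
grep -B 2 '404 Not Found' spider.log
Newer wget versions also print a "Found N broken links" summary with the offending URLs at the end of the spider run.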

--

"If to err is human, then I must be some kind of human!" -Me
 
Oops. I left a partial [link tag in there, which I meant to go back and delete. [blush]

--

"If to err is human, then I must be some kind of human!" -Me
 
