
Finding redundant website files

Status
Not open for further replies.

Stretchwickster

Programmer
Apr 30, 2001
Hi there,

Unfortunately, I've inherited a large site which I believe to have many unused files. Could anyone recommend the most effective way to obtain an accurate listing of such files?

Your advice would be much appreciated.

Clive

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"To err is human, but to really foul things up you need a computer." (Paul Ehrlich)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get the best answers from this forum see: faq102-5096
 
Hi

I know two ways. Both involve recursively following all the links on the site and then examining the contents of the document root directory locally.

For recursively following all the links, the best option is a site mirroring tool; I prefer [tt]wget[/tt].

1) Ask the file system. Works for both static and dynamically generated documents. Needs to have access time updating enabled.
[ol]
[li]Recursively request all documents[/li]
[li]List all files in the document root not accessed in the last n minutes[/li]
[/ol]
Code:
wget -m -nd --delete-after -q [green][i][URL unfurl="true"]http://example.com/[/URL][/i][/green]
find [green][i]/var/www/example/[/i][/green] -type f -amin +10
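Whether access times get updated is controlled by the filesystem mount options on the server. A quick sanity check could look like this (only a sketch, assuming a Linux box with GNU coreutils; the [tt]index.html[/tt] path is just a guess, pick any file under the document root):
Code:
# show the mount options; "noatime" means access times are never updated on reads
mount

# or test one file directly: read it and see whether its access time changes
stat -c '%x  %n' /var/www/example/index.html
cat /var/www/example/index.html > /dev/null
stat -c '%x  %n' /var/www/example/index.html
Note that on filesystems mounted with [tt]relatime[/tt] the access time is only updated in certain cases (for example when the stored access time is more than a day old), so the quick test above may show no change even though method 1 would still largely work.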

2) Compare the lists. Works only for static documents. Needs extra processing if directory aliases are used.
[ol]
[li]Recursively request all documents and create a list of files[/li]
[li]List all files in the document root[/li]
[li]Compare the two lists[/li]
[/ol]
Code:
wget -m -nd --delete-after -nv [green][i][URL unfurl="true"]http://example.com/[/URL][/i][/green] 2>&1 | sed '/URL:/!d;s/.*\/\///;s/ *\[[[:digit:]\/]*\] *//g;s/"//g;/\/->/s/-> *//;s/->.*//' | sort -u > needed.txt
find [green][i]/var/www/example/[/i][/green] -type f | sort > exists.txt
diff --line-format="%L" needed.txt exists.txt
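One part of that extra processing: [tt]wget[/tt] reports URLs while [tt]find[/tt] prints filesystem paths, so the two lists usually have to be reduced to a common, document-root-relative form before the comparison means anything. A rough sketch, reusing the example host and document root from above ([tt]comm -13[/tt] prints only the lines unique to the second file, i.e. files on disk that the crawl never requested):
Code:
sed 's|^example.com/*||' needed.txt | sort -u > needed-relative.txt
sed 's|^/var/www/example/*||' exists.txt | sort -u > exists-relative.txt
comm -13 needed-relative.txt exists-relative.txt
Query strings, URL-encoded file names and aliased directories will still need a manual pass.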

Feherke.
 
Sorry for the delay in replying - many thanks for the suggestions, Feherke.

feherke said:
Needs to have access time updating enabled.
Is this a setting in wget or something I need to enable on the server?

Clive

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"To err is human, but to really foul things up you need a computer." (Paul Ehrlich)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To get the best answers from this forum see: faq102-5096
 
At least for checking for broken links, I found that [link Google's site map seems to work really well. I realize there could still be plenty of unused files beyond the linked ones, but it could be a start, and it isn't a bad tool regardless.

Well, it may not actually be called "Google Site Map"; I may be thinking of their "developer tools" set... either way, take a look here (just create an account if you don't already have a Google account):
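If you would rather stay on the command line, [tt]wget[/tt] can do a rough broken-link check as well. Just a sketch (the exact log format varies between wget versions, so the grep pattern may need adjusting):
Code:
wget --spider -r -o spider.log http://example.com/
grep -B 2 '404 Not Found' spider.log
Newer wget versions also print a "Found N broken links" summary with the offending URLs at the end of the spider run.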

--

"If to err is human, then I must be some kind of human!" -Me
 
Oops. I left a partial [link tag in there, which I meant to go back and delete. [blush]

--

"If to err is human, then I must be some kind of human!" -Me
 
