Filling cache

csteinhilber

Programmer
We're running a content management system on a Solaris web server (iPlanet). The CMS is built to cache pages off once all the elements are assembled from a database (essentially a static page at that point). The caching occurs on the first page hit to that URL. What we're looking to do is build up the cache automatically, as much as possible, so that the first user to the page doesn't incur the performance hit.

We've already built the hooks that would be able to fire off the cache building process... the question is what to use during the process itself.

Right now, we've started with wget. We assemble a list of pages that need to be hit (cached) and pass that to wget via the -i or --input-file= argument. It's working alright... but it's darned slow. So we're looking for alternatives.

We really don't need something that will save the downloaded file, nor do we need it to traverse (spider) links within the page (or process the download in any way, actually)... in fact, if it didn't need to actually download the file, all the better. All we need it to do is pretend to be a browser (being able to set a particular user agent is a must) and hit each URL in a list, so that the CMS can do its thing and cache the page.

Anybody know of any possible solutions/alternatives to wget for this process? While httrack is multi-threaded and probably somewhat faster at downloads than wget, it's actually slower for us because it processes each downloaded page for other links to traverse (and it doesn't appear that you can simply pass it a list of URLs, like you can with wget).

Any comments/input would be greatly appreciated.

TIA!
-Carl
 
wget --spider --input-file=filename seems to do just what you want, and quickly, on my P233... it doesn't appear to attempt to follow any links. Presumably you would need to turn on recursion to make it follow links, but obviously you don't want that.

Annihilannic.
 
We were running --spider... but it wasn't actually hitting the pages in such a way that the webserver thought it was a true request.

I don't pretend to know a great deal about HTTP... but I think --spider just sends a HEAD request to ask the server whether the page exists or not; it doesn't actually ask for the page to be served. Thus our CMS never sees it as a page hit, and the cache for that page is never built.

I remove --spider, and the cache starts filling up.

Unless someone has another explanation why --spider wouldn't work.
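
If that's right, the difference comes down to a HEAD request versus a GET request. Here's a minimal sketch to see it for yourself, using plain java.net.HttpURLConnection; the URL below is only a placeholder for one of our CMS pages:

Code:
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadVsGet
{
	public static void main (String args[]) throws Exception
	{
		// placeholder -- substitute one of the CMS pages
		URL url = new URL ("http://www.example.com/some/page.html");

		// HEAD: the server returns only the status line and headers, no body --
		// roughly what wget --spider sends
		HttpURLConnection head = (HttpURLConnection) url.openConnection ();
		head.setRequestMethod ("HEAD");
		System.out.println ("HEAD status: " + head.getResponseCode ());
		head.disconnect ();

		// GET (the default method): a full page request, which is what the CMS needs to see
		HttpURLConnection get = (HttpURLConnection) url.openConnection ();
		System.out.println ("GET status:  " + get.getResponseCode ()
			+ ", length: " + get.getContentLength ());
		get.disconnect ();
	}
}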



-Carl
 
If you have java and javac, you may test this code:
Code:
// package

import java.io.*;
import java.net.*;

/**
	PrintUrl

	@author Stefan Wagner
	@date Fri Oct  1 02:30:17 CEST 2004

*/
public class PrintUrl
{
	/** Build the URL from the given string and print its contents. */
	public PrintUrl (String surl)
	{
		URL url = getUrl (surl);
		printUrl (url);
	}

	/** Fetch the page and print its contents line by line. */
	public void printUrl (URL url)
	{
		BufferedReader br = null;
		try
		{
			br = new BufferedReader (new InputStreamReader (url.openStream ()));
			// readLine() returns null at end of stream; ready() is not a reliable end-of-stream test
			String line;
			while ((line = br.readLine ()) != null)
			{
				System.out.println (line);
			}
		}
		catch (IOException ioe)
		{
			System.err.println ("ioe: " + ioe.getMessage ());
		}
		finally
		{
			try { if (br != null) br.close (); } catch (IOException ignore) { }
		}
	}

	public URL getUrl (String name)
	{
		URL url = null;
		try
		{
			url = new URL ("http://" + name);
		}
		catch (MalformedURLException mue)
		{
			System.err.println ("mue: " + mue.getMessage ());
		}
		return url;
	}

	/** Entry point: expects the URL (without http://) as the only argument. */
	public static void main (String args[])
	{
		String url = null;
		if (args.length != 1)
		{
			usage ();
			System.exit (1);
		}
		url = args[0];
		new PrintUrl (url);
	}

	/** Print usage instructions. */
	public static void usage ()
	{
		System.out.println ("Usage:\tjava PrintUrl URL ");
		System.out.println (" i.e.:\tjava PrintUrl home.arcor.de/hirntrom/index.html");
		System.out.println (" note:\tomit '[URL unfurl="true"]http://'[/URL] in the url.");
	}
}
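
Since you said being able to set a particular user agent is a must, and you already assemble a list of URLs, here is a variation along the same lines that may be closer to what you need. It's only a rough sketch: it assumes the list file has one URL per line (without http://), and the user-agent string is just an example to replace with whatever your CMS should see.

Code:
// package

import java.io.*;
import java.net.*;

/**
	WarmCache -- request every URL in a list so the CMS builds its cache.
*/
public class WarmCache
{
	public static void main (String args[]) throws IOException
	{
		if (args.length != 1)
		{
			System.out.println ("Usage:\tjava WarmCache urllist.txt");
			System.exit (1);
		}

		BufferedReader list = new BufferedReader (new FileReader (args[0]));
		String name;
		while ((name = list.readLine ()) != null)
		{
			name = name.trim ();
			if (name.length () == 0) continue;
			try
			{
				URLConnection con = new URL ("http://" + name).openConnection ();
				// pretend to be a browser; substitute the user agent your CMS expects
				con.setRequestProperty ("User-Agent", "Mozilla/5.0 (cache warmer)");
				InputStream in = con.getInputStream ();
				byte[] buf = new byte[4096];
				while (in.read (buf) != -1)
				{
					// read and discard the body -- all that matters is that
					// the server serves the full page
				}
				in.close ();
			}
			catch (IOException ioe)
			{
				System.err.println (name + ": " + ioe.getMessage ());
			}
		}
		list.close ();
	}
}

If that is still too slow, most of the time is probably spent waiting on the server, so splitting the list across a few threads or a few copies of the program should speed it up roughly in proportion.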

seeking a job as java-programmer in Berlin:
 