
Sitemap Generator 3


audiopro
Can anyone recommend a sitemap generator script?
I know there is the Google 500-page one, but I need to add some extra functionality and would rather not re-invent the wheel by rewriting a front end that already exists.

Keith
 
Well, the protocol only has a handful of attributes; it's a cinch to write one...


Using the HTML::Template module, I can generate a dynamic sitemap with just a few lines...
Code:
        # Start template
        my $template = HTML::Template->new(
            global_vars       => 1,
            type              => 'filename',
            source            => DIR_TO_DOCS . '/templates/sitemap.xml',
            die_on_bad_params => 0,
        ) or die "Cannot open sitemap.xml Template file: $!";

        # Add variables to Template
        $template->param( 'domain' => $domain[0]{'Domain'} );

        my $html = $template->output;

        # Write the generated XML out to the member's web folder
        if ( !&writeHTML( "$_[0]", "sitemap", $html, ".xml" ) ) {
            die "Error creating sitemap.xml";
        }
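For reference, the sitemap.xml template the above fills in might look something like this. This is only a sketch following the sitemaps.org protocol; the page names are invented for illustration, and only the 'domain' variable comes from the code above.
Code:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://<TMPL_VAR NAME='domain'>/index.html</loc>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://<TMPL_VAR NAME='domain'>/contact.html</loc>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>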

The code for creating the file is also simple...
Code:
###########################
# Write HTML Code to file #
###########################

sub writeHTML {

    # $_[0] = AR No.
    # $_[1] = PageID
    # $_[2] = String
    # $_[3] = file extension

    # Build file string
    my $file = DIR_TO_WEB . "/" . $_[0] . "/" . $_[1] . $_[3];

    # Write to file
    open( CATAL, ">$file" ) or return 0;
    flock( CATAL, 2 );                      # 2 = exclusive lock (LOCK_EX)
    binmode(CATAL);
    truncate( CATAL, length( $_[2] ) );
    seek( CATAL, 0, 0 );
    print CATAL $_[2];
    close(CATAL);

    # Test for file, return true/false
    return -f $file ? 1 : 0;
}
OK, this is for a specific sitemap which has a fixed number of pages; the only thing that changes is the domain name.

But globbing a directory, adding the pages to an array and looping over it in a template is 30 minutes of coding for a pro like you, surely ;-)
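As a rough sketch of that idea (assuming you add a <TMPL_LOOP NAME='pages'> block to the sitemap template; DIR_TO_WEB is the constant from above, while $domain and the 'pages'/'loc' names are invented here):
Code:
# Sketch only: glob the web folder and build one hashref per page
my @pages;
for my $file ( glob( DIR_TO_WEB . '/*.html' ) ) {
    my ($name) = $file =~ m{([^/\\]+)$};              # keep just the file name
    push @pages, { loc => "http://$domain/$name" };   # one entry per <url>
}

# The template loops over them with something like:
#   <TMPL_LOOP NAME='pages'>
#     <url><loc><TMPL_VAR NAME='loc'></loc></url>
#   </TMPL_LOOP>
$template->param( 'pages' => \@pages );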


"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!
 
Thanks 1DMF.
I have never used HTML::Template, so it is off into the darkened room to get my head round it.

Not sure what is going on there, especially where 'AR No' comes from, but I will have a dig at it before asking any more questions. I am also a bit unsure whether this code follows actual links.

I have quite a number of domains with lots of redundant files, images and outdated pages. I need a way of listing everything that is on the domain and then mapping everything that is actually used; I can then dump the unused files. Is a sitemap the best option?

Keith
 
Firstly, we are talking about an XML sitemap for search engine inclusion, aren't we?

How is your code meant to know if the html page is used or not?

HTML::Template is an awesome module for keeping your design/HTML separate from your Perl code.

It uses a simple tag-based system to populate a pre-designed template with variables; you can loop over arrays and even use if/else statements.
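For example, a minimal sketch of the tag syntax (the variable names here are invented purely for illustration):
Code:
<TMPL_IF NAME='logged_in'>
    <p>Welcome back, <TMPL_VAR NAME='username'>!</p>
<TMPL_ELSE>
    <p>Please log in.</p>
</TMPL_IF>

<ul>
<TMPL_LOOP NAME='menu'>
    <li><a href="<TMPL_VAR NAME='link'>"><TMPL_VAR NAME='label'></a></li>
</TMPL_LOOP>
</ul>
On the Perl side you would feed it with something like $template->param( logged_in => 1, username => 'Keith' ) and $template->param( menu => [ { link => '/index.html', label => 'Home' } ] ).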

Any help you need just ask.

As for the ARNo, sorry, that is working code and so has some vars specific to our system.

Basically, as it points to a member's white-labelled website, the path to the hosting folder includes their membership number (ARNo).

So that line ends up as...

'c:\path to website main folder' (DIR_TO_WEB) \ 'membership number' (ARNo) \ 'file name' (sitemap) . 'extension' (.xml)

hope that makes more sense.



"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!
 
Hi

1DMF said:
How is your code meant to know if the html page is used or not?
That is why such tools usually request the pages recursively over HTTP rather than just reading the file system locally.

(Sorry, the solutions I used are all off-topic: I set [tt]lynx[/tt] or [tt]wget[/tt] to crawl the site, then parsed their output.)


Feherke.
 
A bit of background -
All this started when a client brought in an SEO company, who complained that there were a lot of broken links on the website. The website is almost completely dynamic, so dodgy links are possible but unlikely. It turns out the dodgy links they are referring to are in files which are not used on the site but are used for admin and testing. They have obviously just made a local list of files and created a sitemap from that.
I thought a better method would be to crawl all the links on the site and create an actual map of it, plus a list of all the files and images used, rather than just a list of files, which seems to be what an XML sitemap is.

This particular site has been redone a number of times and there are still files around from the first version - I am not very tidy.

Is it feasible to write such a crawler or am I wasting my time attempting such a thing?

Keith
 
You want to spider the pages, check whether the URLs are dead and, if they are, report what page they are on?

Then no, a sitemap is not the tool for the job.

You need to use the WWW::Mechanize and HTML::TokeParser::Simple modules to perform that task.
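A rough sketch of that approach, assuming a placeholder start page (www.example.com is not real, and this is only one way of wiring the two modules together):
Code:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TokeParser::Simple;
use URI;

# Placeholder start page - substitute your own site
my $start = 'http://www.example.com/index.html';

my $mech    = WWW::Mechanize->new( autocheck => 0 );   # fetches the page
my $checker = WWW::Mechanize->new( autocheck => 0 );   # probes the links

$mech->get($start);
die "Cannot fetch $start: " . $mech->status unless $mech->success;

# Walk the anchor tags in the fetched page
my $html   = $mech->content;
my $parser = HTML::TokeParser::Simple->new( \$html );

while ( my $token = $parser->get_tag('a') ) {
    my $href = $token->get_attr('href') or next;
    next if $href =~ /^(?:mailto:|javascript:|#)/i;

    # Resolve relative links against the page they were found on
    my $url = URI->new_abs( $href, $start )->as_string;

    $checker->get($url);
    printf "%-6s %s (linked from %s)\n",
        $checker->success ? 'OK' : 'BROKEN', $url, $start;
}
To spider a whole site you would push each internal page found onto a queue and repeat, keeping a %seen hash so the same URL is not visited twice.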

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!
 
So put the module files in your cgi-bin and add the following lines at the top of your code...

Code:
# Set path to user modules
use FindBin qw($Bin);
use lib "$Bin";
where there is a will, there is a way ;-)

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!
 
I am trying HTML::Template but am stuck on the first, very simple example. After some messing about, I managed to get it to run without config errors, but it prints nothing except my 'ok' check.

Template
Code:
<html>
<head>
<title>Test Template</title>
</head>
<body>
My Home Directory is <TMPL_VAR NAME=HOME>
<p>
My Path is set to <TMPL_VAR NAME=PATH>
</body>
</html>

Script
Code:
#!/bin/perl -w

use FindBin qw($Bin);
use lib "$Bin";

use HTML::Template; 

# open the html template 
my $template = HTML::Template->new(filename => '../school/test.tmpl');
# fill in some parameters 
$template->param(HOME => $ENV{HOME});
$template->param(PATH => $ENV{PATH}); 
# send the obligatory Content-Type and print the template output 
print "Content-Type: text/html\n\n";

$template->output; 
print "ok";

I get the same results with or without the FindBin statement.

Keith
 
The way I use it is as follows...

HTML Template....
Code:
<html>
<head>
<title>Test Template</title>
</head>
<body>
<p>My Home Directory is <tmpl_var name='HOME'></p>
<p>My Path is set to <tmpl_var name='PATH'></p>
</body>
</html>

PERL....
Code:
#!/bin/perl -w
use FindBin qw($Bin);
use lib "$Bin";

use HTML::Template;

# Start template
my $template = HTML::Template->new(
    global_vars       => 1,
    type              => 'filename',
    source            => '../school/test.tmpl',
    die_on_bad_params => 0,
) or die "Cannot open '../school/test.tmpl' Template file: $!";

# add vars

$template->param( 'HOME' => $ENV{HOME} );
$template->param( 'PATH' => $ENV{PATH} ); 

# send the obligatory Content-Type and print the template output 
print "Content-Type: text/html\n\n";

print $template->output; 

exit();


You were pretty close, just a few changes. The FindBin lines are only required if your host doesn't have the template module installed; in that case you just put the Template.pm module file in the CGI-BIN and use those two lines ;-)

just holla if you need more help :)



"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!
 
Thanks 1DMF, that has got me going in the right direction.
The strange thing is that it now does not work unless I comment out the FindBin command.
My ISP does not have the module installed; I proved it by renaming my local copy of the module and got a config error.
On with simple example 2 - I am sure it will be worth it in the end.

Keith
 
Hmm, very odd. I have to use the FindBin so it picks up the folder the script is running in as a module repository.

Perhaps they have locked it down so you can't do this, or it is already part of the environment string.

Glad you're on the right track. Once you get your head round the template module you'll never look back; it's awesome for keeping your HTML and Perl code separate. You can loop over arrays and even arrays within arrays, but cross that bridge when you get to it; I found that a bit tricky to get my head round at first.
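As a taste of the 'arrays within arrays' idea, here is a minimal sketch using an inline template (the variable names are invented purely for illustration):
Code:
use HTML::Template;

# Inline template: an outer loop of categories, each with an inner loop of pages
my $tmpl_text = <<'HTML';
<TMPL_LOOP NAME='categories'>
  <h2><TMPL_VAR NAME='title'></h2>
  <ul>
  <TMPL_LOOP NAME='pages'>
    <li><TMPL_VAR NAME='url'></li>
  </TMPL_LOOP>
  </ul>
</TMPL_LOOP>
HTML

my $template = HTML::Template->new( scalarref => \$tmpl_text, global_vars => 1 );

# Each category is a hashref; its 'pages' key holds another array of hashrefs
$template->param( categories => [
    { title => 'News',  pages => [ { url => '/news/1.html' }, { url => '/news/2.html' } ] },
    { title => 'About', pages => [ { url => '/about.html' } ] },
] );

print $template->output;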

I use the template module for everything now :)


"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!
 
I have managed to get this fully working and it works well for most URLs. There is a problem with some redirected URLs where it just locks up on the 'get' call.
Has anyone experienced this and, if so, created a workaround?

Code:
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->get( "$Page" );

Keith
 
If I enter just a domain into a browser, it automatically opens the index page on that domain. How do I find the extension of the index page?

index.htm, index.html, index.shtml etc.

At the moment I manually paste in the domain name including the extension; it would be nice to just point to the domain.

Keith
 
I think this may depend on processing that happens on the target web server. Short of doing a GET on the URL and seeing what comes back, I'm not sure you can find it out.

Steve

[small]"Every program can be reduced by one instruction, and every program has at least one bug. Therefore, any program can be reduced to one instruction which doesn't work." (Object::perlDesignPatterns)[/small]
 
There is a list of default documents built into the web server configuration which it will try when someone doesn't specify a page. So it might have html, htm, php and shtml configured, and if you don't put anything after the domain it will try index.html, index.htm, index.php and index.shtml. You can always type the file name in directly to see whether that file actually exists. .html and .htm are the two most common, but .php is catching up.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Keith,

have you tried
Code:
$mech->get( "$Page" ) || die "can't get URL";

For the default page this is difficult to work out; I don't know if there is a parameter in the Mechanize module that exposes the file name.

You could just try a few 'defaults' kept in a hash or array (see the sketch further down).


The default page depends on the server settings and could be any of, but is not restricted to, the following...

index.htm, index.html, index.asp, index.aspx, index.php, index.shtml, default.htm, default.html, default.asp, default.php, default.shtml

However, you could tell the server to load as the default...

my_default_page.cfm


as it is totally configurable via the web server.
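A sketch of that 'try a few defaults' idea mentioned above (the candidate list is only a guess, and this simply tests which of the common index files exists, which is not necessarily the one the server serves first):
Code:
# Assumed: $domain holds the bare domain name, e.g. 'example.com'
my @candidates = qw(
    index.htm index.html index.shtml index.php
    default.htm default.html default.asp default.php
);

my $probe = WWW::Mechanize->new( autocheck => 0 );

for my $name (@candidates) {
    $probe->get("http://$domain/$name");
    if ( $probe->success ) {
        print "Default document candidate found: $name\n";
        last;
    }
}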

"In complete darkness we are all the same, only our knowledge and wisdom separates us, don't let your eyes deceive you."

"If a shortcut was meant to be easy, it wouldn't be a shortcut, it would be the way!
 