
Developing a specific kind of search engine on a MySQL database

Maccaday (Technical User), Dec 9, 2003

Sorry about the length of this post, but I'm trying to be clear about my needs so that I can get the answer to a specific question.

I am trying to set up a site that is a different type of search engine. Instead of a top-down approach of redirecting users to other websites based on conventional search criteria, I'm looking at actually cataloguing the pages of different websites, and structuring them in a directory format (a bit like the Open Directory Project, but using computers rather than people for most of the cataloguing process), which is essentially a bottom-up approach.

Basically, items will be catalogued by having direct access to data on other people's servers, and by various scripts/algorithms. Initially I will be organising information on commercial sites with affiliate programs, so they should allow direct (albeit restricted) access to their databases. For each site included in our catalogue, I will need to know information about what I'm cataloguing.

For example, if I am cataloguing a poster of the Mona Lisa from AllPosters.com, I will need to store information such as the artist, the price of the poster, what artistic style it is etc., so that a user on my site could get to it via a section on posters, or on Renaissance art, or on Leonardo da Vinci. I will also want to make sure that posters from other online stores are correctly placed alongside the one from AllPosters.com when a user of the site requests posters of the Mona Lisa - I don't want a live search for posters with the name Mona Lisa as such; I want the grouping to have already been catalogued. This means that accurate price comparisons could be made on any item.
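
To make that concrete, here is a very rough sketch of how the catalogue tables might look in MySQL (all table and column names here are just illustrative, nothing is decided yet):

    -- Sketch only: one canonical record per catalogued item,
    -- with one offer row per store that sells it.
    CREATE TABLE items (
        item_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
        title    VARCHAR(255) NOT NULL,   -- e.g. 'Mona Lisa poster'
        artist   VARCHAR(255),            -- e.g. 'Leonardo da Vinci'
        style    VARCHAR(100),            -- e.g. 'Renaissance'
        PRIMARY KEY (item_id),
        KEY idx_title (title)
    );

    CREATE TABLE stores (
        store_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
        name     VARCHAR(100) NOT NULL,   -- e.g. 'AllPosters.com'
        PRIMARY KEY (store_id)
    );

    CREATE TABLE offers (
        item_id  INT UNSIGNED NOT NULL,
        store_id INT UNSIGNED NOT NULL,
        price    DECIMAL(8,2) NOT NULL,
        url      VARCHAR(255) NOT NULL,   -- the item's page on the store
        PRIMARY KEY (item_id, store_id)
    );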

As I see it, there will need to be three scripts/algorithms. The first will extract the relevant information from a site that is to be catalogued, and return structured information that will then be put into the database. (This will probably differ from site to site, though some sites will have similar scripts, and generic scripts will be written for this process.) These will then pass on their results to a second algorithm.

The second algorithm will then insert the relevant data into the database, including such things as URLs, image locations, and various site-structure information (such as how the information can be reached through the directory structure). Comparisons will be made with items already catalogued, so that the same item from different stores can be correctly placed together (this may require some human checking as well, though hopefully not much).
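
For example, using the illustrative tables above, the second algorithm might first check whether an equivalent item has already been catalogued, and a price comparison then becomes a simple join:

    -- Does an equivalent item already exist? (In reality the matching
    -- will need to be fuzzier than an exact title/artist comparison.)
    SELECT item_id FROM items
    WHERE title = 'Mona Lisa poster' AND artist = 'Leonardo da Vinci';

    -- If so, only a new offers row is needed; all offers for the item
    -- can then be compared directly (42 stands for whatever item_id
    -- was found above):
    SELECT s.name, o.price, o.url
    FROM offers o JOIN stores s ON s.store_id = o.store_id
    WHERE o.item_id = 42
    ORDER BY o.price;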

The third script is the search engine itself. This will not just return a page like a conventional search engine, but might redirect a user to an individual page if there is a very high chance that it is the page the user wanted (e.g. if David Blaine was entered, then the user would be redirected to .../david_blaine.html). If there are several possible results (a search for pool might yield results relating to swimming pools, but also to the game), then all results related to the former could be found through links on a page called swimming_pool.html, and the latter on pool.html. These two possibilities would be listed on one page, along with some of their sub-categories underneath the main headings, and any other possible derivations of the word pool.
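
One possible way of backing this (again, just a sketch with invented names) is a lookup table mapping search terms to already-catalogued pages, with a weight saying how likely each page is to be the one the user meant:

    CREATE TABLE term_pages (
        term   VARCHAR(100) NOT NULL,
        page   VARCHAR(255) NOT NULL,   -- e.g. 'swimming_pool.html'
        weight FLOAT NOT NULL DEFAULT 1,
        PRIMARY KEY (term, page)
    );

    -- One dominant match -> redirect straight to that page;
    -- several matches -> build a disambiguation page from the list.
    SELECT page, weight FROM term_pages
    WHERE term = 'pool'
    ORDER BY weight DESC;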

My question is this: Is MySQL suitable for such a task (my gut instinct is yes)? And more importantly, what language should I use for the search engine script?

Obviously, being a search engine, it needs to be quick at retrieving results from a database. I am intending to have the rest of the site in PHP (Linux/Apache/MySQL/PHP/Turck MMCache is my current idea for the setup) - Turck MMCache speeds up PHP processing by caching compiled scripts. I'm thinking that Perl might be quicker as a search engine script, but I just don't know. Is there anyone with experience of writing search engines who can make some suggestions?

Considering that there will potentially be results catalogued for anything you could think of, how should I structure the tablespace of the search engine part of the database? This tablespace could potentially reach terabytes in size. Should I have just one table with all the possible results, or would it be more efficient to somehow split up the results into multiple tables? If so, are there any suggestions for how I should split up the tables?

One possibility that I've thought of is to have one table for each search result (so potentially tens of millions of tables), with a list of all the tables being kept in the data dictionary of the database. Each table could then contain data about what should be shown for each page, including the links to other pages. If a search did not correspond to a page that had already been structured, then suggested results pages could be generated from pages with similar names.

Also, would Perl be better than PHP for the other two algorithms/scripts, which will input data into the database?

Any responses to these questions will be gratefully received. Thanks in advance.
 
I love MySQL. It is very good at what it does, and I'm using it on my current project. But it would be a big mistake to use it on a database that could be terabytes in size.

MySQL has all the basic capabilities required for a database, but just barely. It is meant to be used on small and medium-sized databases, not large databases. You can tell this is true by noting that it has:

1. no way to do parallel queries
2. no way to distribute a single table across multiple spindles, unless you use MERGE tables (but, personally, I think MERGE tables create more problems than they solve)
3. no clustering capabilities
4. row-locking only for InnoDB tables

When you get databases measured in terabytes, the primary criterion for how successful you will be in getting a decent response time is the number of CPUs you can get searching for your record on as many different hard-disk spindles as possible. MySQL does not play well in a distributed environment (it can't run queries in parallel).

Oracle does. MySQL is eating Oracle's lunch for small databases, but there is still a reason why Oracle is such a force in the industry, and that's because it can distribute large amounts of data across many processors and many spindles. Why would you spend so much money on Oracle if MySQL did everything as well? Because MySQL can't handle super-large databases elegantly. The best you can do with MySQL is get yourself a multiprocessor motherboard and gobs of RAM and pray that you never need more horsepower.

As far as Perl vs PHP goes, Perl is an excellent language to write your spider in. I don't like it personally, because I find Perl code hard to decipher, but you have to pick the right language for the job and Perl does spiders well. I haven't used PHP, but I'm sure that would work too.

As far as your "search engine" language goes, I think it is a mistake to embed business logic with presentation (HTML), but I wouldn't discourage it if you just want to make a prototype. That's why I don't use PHP; I like to separate my HTML generation from my database operations. I don't have enough experience on the LAMP side to recommend a good search-engine language.

As far as the number of DB tables you should have: the only time one should consider having more than 1000 tables in a database is if those tables are temp tables that are dynamically generated in RAM and automatically go away as soon as the connection is dropped. It's just too much work to keep more than 1000 tables working. On my largest job (3 million lines of C code) we only had 350 tables, but that was more than enough. You're asking for admin headaches if you design for more tables.

Best to have fewer tables, but be able to distribute each table over many spindles. Oracle even allows you to have one table on one computer, and another table located on a completely different computer. It handles large data sets well.
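
To give you an idea (this is only a sketch, and all the names are made up), Oracle lets you declare one logical table whose partitions live in different tablespaces, and those tablespaces can sit on different disks:

    CREATE TABLE search_results (
        result_id NUMBER        NOT NULL,
        term      VARCHAR2(100) NOT NULL,
        page      VARCHAR2(255) NOT NULL
    )
    PARTITION BY RANGE (result_id) (
        PARTITION p1 VALUES LESS THAN (10000000) TABLESPACE ts_disk1,
        PARTITION p2 VALUES LESS THAN (20000000) TABLESPACE ts_disk2,
        PARTITION p3 VALUES LESS THAN (MAXVALUE) TABLESPACE ts_disk3
    );

Queries against that table can then be serviced by scanning the partitions in parallel, which is exactly what you can't get out of MySQL.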

Food for thought.

BTW, Oracle costs an arm and a leg, but you will end up spending the money anyway if you truly will be working with a humongous database. The question is whether you want to spend the money on a pre-packaged database or whether you want to write your own.
 
Thanks for your response.

Due to the initial costs, I wouldn't be able to afford to purchase Oracle, but I may well end up doing so in the long run - the site would be generating money from an early stage, when a large database wasn't yet required, so I should be able to migrate before the size was too big for MySQL.

I am anticipating having many hundreds or possibly thousands of computers linked up via various load-balancing and redirecting servers, so that the individual databases containing the content of the 'site' could in fact be relatively small. The load-balancing servers would need to hold a lot of information about which servers were hosting which data. This way I wouldn't need to search through huge databases each time, but would be redirecting towards an appropriate server with a smaller database. Each smaller database would merely be a replicated part of the whole database. Even if the tables in the whole database were huge, the replicated parts wouldn't be.
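
As a sketch of what the redirecting servers might consult (the table and host names are invented), a small directory database could map each section of the catalogue to the servers that hold a replica of it:

    CREATE TABLE shard_map (
        section     VARCHAR(100) NOT NULL,  -- e.g. 'posters'
        server_host VARCHAR(100) NOT NULL,  -- e.g. 'db7.example.com'
        PRIMARY KEY (section, server_host)
    );

    -- The load-balancer picks any server that holds the section.
    SELECT server_host FROM shard_map WHERE section = 'posters';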

The only possibly very large database would be the one with the actual search results on, though I'm not sure that that is necessarily going to be all that big now. I could probably get away with not having many installations of Oracle, even if I spread the database over large numbers of servers, with the rest being MySQL databases.

I have a few more questions:

1) Could I have one installation of Oracle which searched not only its own database but also other databases - could I, say, have one Oracle installation that also searched tables in a MySQL installation?

2) MySQL is quoted as having a theoretical maximum database size of 8TB, depending on your system. How is this achieved in practice? i.e. how does MySQL deal with tables that are spread over several disks, or doesn't it? Would you need to set up a virtual drive within the OS, which made several disks look like one?

3) If you have an installation of Oracle, does that installation allow you to install one database on as many physical machines as you like, or do you have to pay per processor? - I have a feeling that it's the latter, isn't it?

Thanks for your reply.
 
1. No, Oracle will not search MySQL databases. You would have to write a program that sometimes connects to Oracle, and other times connects to MySQL, depending on where your data resides. Oracle can automatically access data on different spindles/machines transparently, so that your programs wouldn't have to know where the data actually was, but MySQL can't. Oracle's coolest features won't extend to MySQL databases.

2. 8TB: that depends on the maximum file size supported by your operating system. MySQL is file-based, meaning you can make your database as big as what can fit into one file on your OS. The question is, since one MySQL file can only reside on one hard-drive spindle (unless you use RAID or MERGE tables), would you ever want an 8TB database on one spindle? It would take forever to search. For large databases the biggest concern is finding ways to distribute your data over as many spindles as possible. What is possible in theory is not necessarily what you would want to do in practice.

MySQL does _not_ allow tables to be distributed over several disks -- unless you use MERGE tables -- and I would think long and hard before I committed to using MERGE tables. (The cons of using MERGE tables are listed in the MySQL manual.)
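
For reference, a MERGE table is declared something like this (a sketch only; the table names are made up, and the underlying MyISAM tables must have identical structure):

    CREATE TABLE results_a (
        id   INT NOT NULL,
        page VARCHAR(255) NOT NULL,
        KEY idx_id (id)
    ) TYPE=MyISAM;

    CREATE TABLE results_b (
        id   INT NOT NULL,
        page VARCHAR(255) NOT NULL,
        KEY idx_id (id)
    ) TYPE=MyISAM;

    -- The MERGE table presents the underlying tables as one;
    -- their data files can live on different disks.
    CREATE TABLE results_all (
        id   INT NOT NULL,
        page VARCHAR(255) NOT NULL,
        KEY idx_id (id)
    ) TYPE=MERGE UNION=(results_a, results_b);
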
I've never set up a virtual drive, but, again, you have to ask yourself what payoff that would give you: if the virtual drive was only able to sequentially access the super-large file (even if it was distributed over several disks), then you haven't bought anything except slower searches. There's no intelligence in virtual drives. That is, there's no daemon behind a virtual drive saying, "Oh, I know this virtual drive contains a common database file, and if somebody wants to search it I'll just perform that search on each of the spindles in parallel." On the other hand, there is such a program that can spawn parallel searches to multiple machines, and it's called Oracle.

A couple of years ago the SCO system I was working on had a 4GB max file size, so guess what? The maximum size of one of our Oracle tables was 4GB. (We obtained a patch from SCO to fix this, and we upgraded our version of Oracle to allow large file sizes.) MySQL would have had the exact same limit, too. I believe Windows 95 and 98 have upper limits of 4GB file sizes as well, for what it's worth, except there's no patch for those OSs, not that you'd want one.


3. I'm not sure what Oracle's licensing terms are -- I just make them go. In my experience DB licenses are per processor, and if what I'm told is correct, the license for the last company I worked for that used Oracle was +/- $100,000.00. (We used a 4-processor machine.)


Don't let me discourage you from using MySQL for the first phase of the project, especially if you don't have the cash. Just don't expect MySQL to scale to TB levels unless you are willing to manually partition your data across machines. I don't think it's possible for anybody to do this manually and still keep everything in sync and easy to maintain, but you could prove me wrong.
 
Thanks.

I have some questions regarding how data is stored in a DB on MySQL and on Oracle.

1) Is each table stored as a separate file, or is the whole database saved as a file - does this depend on table types used?

2) What specifically are tablespaces, and how are they saved (i.e. as one big file, or as different tables)?

Also:

3) Do you think it would be fairly feasible to use many MySQL installations holding cut-down, replicated databases, with redirection and load-balancing based on the requested URL? This would keep each database small, as mentioned in my previous post. The distribution would then be done by the load-balancers rather than by the database itself.

If you are interested in the project that I'm working on, I have copied the post that I made in a different forum (on Perl and also on Apache - the same person replied to separate points in both forums, so I combined them into one). Feel free to read and comment on it.

Thanks again for your responses.
 
Siberian, thanks for your messages. I'll respond to both of them here, rather than separately. The details of my project are as follows:

I am trying to catalogue the internet in a different way than it already is. It's kind of like a hybrid of a search engine (which has the quantity of the net) and the various directories such as the Open Directory Project (which essentially have everything categorised, and are therefore more qualitative). The two combined do not exist yet, as such - for several related reasons:

1) Computers aren't 'smart' enough to be able to sort everything well enough without some kind of human intervention (they're not even close to passing a reduced Turing Test).

2) There are so many sites on the net with different setups, that in order to properly catalogue the information on that site, you need a different sorting algorithm for each site (though there will be significant overlap between many sites).

3) The task is just too immense to be done by humans alone - see how few pages the ODP has compared with the main indexes of the major search engines.

The site that I want will not only catalogue links to other websites, but will also contain information detailed on those sites (with their permission of course), such as information about a product that they sell. If the same item exists on two different sites, then they will be dealt with together. Some sites offer poor or cut-down versions of this, where they compare items in different online stores, but I'm looking at accurately identifying any item in any online store, in any language.

It would almost be a stepping stone to Xanadu, which I'm sure you're aware of. It wouldn't be a proper attempt to implement Xanadu, since the cross-referencing would not be done between the sites that I would catalogue - at least not initially. (In fact I don't think that that can work properly anyway, because if you want an article on Abraham Lincoln, then there will not just be one article, but hundreds of thousands). However, for each item that is included in our site, it will be cross-referenced within our site, and it is hoped that later on people will not only submit their sites to our search engine, but also incorporate back-references to relevant pages on our site within their articles. This would provide a framework for Xanadu to work.

There is the major problem of how to set this up, how to get people to use it, and how to make it financially viable.

The way I am looking to finance it is through reselling mobile content, commissions from affiliate programs, and incorporation in conventional multi-level marketing schemes, along with promoting our site by multi-level marketing and making money for charity.

Mobile content speaks for itself, and there are a lot of individuals that make a lot of money from it online.

By having a high-quality environment to accurately and comprehensively compare the prices of any item that you can buy online, it would hopefully be a useful resource that fills a current gap (I know that there are sites that offer the idea, but I haven't yet come across one that does a truly good job in all countries - some do an OK job in the US alone). We would then receive commissions on the sales that went through our site.

I've worked out a way that we can be a member of every reputable multi-level marketing scheme in the world, and potentially earn money from them all (though I won't go into details). There are over 500 that I've found, so even just a small amount from each sums up to a decent amount.

I would incorporate a multi-level marketing program of our own, so that people would be encouraged to tell others about our site. They wouldn't need to have a website to do so, and could download business cards from our site to promote it.

When people make money for themselves from the site, they will automatically generate money for charity, and they will get to choose which charity(ies) it goes to. Guidestar in the US offer a facility where you can donate money online to any registered charity in the US, and they will allow you to incorporate this service into your site and will host it on their servers. Users of our site can then decide that they want the money that they have generated for charity to go to their local school, or to Cancer Research. For other countries, the money will go to UN projects in the Third World, unless there is an already-established framework for donating money to charities online.

The site itself, if it comes off, will almost certainly eventually be the largest site on the net, since it will be trying to catalogue all the other ones. This in itself poses hardware problems. I am intending on setting up a 'share a server' scheme, whereby unused webspace can be used to store data by our site. Many users have unused webspace that is given to them by their ISP, or people that have websites that do not even approach the capacity that they have purchased.

Most of these servers will lie on fast connections (though not all of course), and could be used to store data on. Some of the data could be just images or other media files (which would make up the vast majority of the space needed), whereas others could store the non-sensitive data in the databases, assuming that a compatible system was set up (which would probably mean that the servers would need to have LAMP).

The search engines would reside on central servers for speed, but the pages could be stored potentially anywhere by having cut-down replications of tables in the master database, and by being properly redirected to a server that contained the information. The major difficulty with the architecture is controlling where the requests go - the servers that do all the redirecting will have to contain a lot of information about where all the various servers are that contain the page contents. Of course there will be master servers where data will be written, and these will need to contain the whole database (though images could still probably just be referenced).
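
As a sketch of how one of those cut-down content servers might be set up (the host names are invented), standard MySQL replication would point each replica at a master, with the tables it actually serves restricted in its configuration:

    -- On a replica, point it at the master server:
    CHANGE MASTER TO
        MASTER_HOST = 'master.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'secret';
    START SLAVE;

    -- Limiting the replica to just the tables it serves is done in
    -- its my.cnf, e.g. with replicate-do-table=catalogue.offers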

How am I going to set all this up, considering there is just myself working on it at the moment? Well, I am going to make it an Open Source project, with an incentive for people to help me. Since the site will potentially be bringing in money, I shall say that anyone that helps out can gain a percentage of the revenue of the site, for life, depending on what they do, and where the money has come in from. For instance, if someone helps me set up the mobile phone content part, then they can gain a percentage of all the money that comes in from mobile content sales. Likewise with all other areas that will be directly making money.

I am intending on starting some academic open projects as well, but these are unlikely to generate a lot of income (if any), so the rewards will be worked out partly by the field in which that person is operating - so if they help out in the section on maths, then they will receive a percentage of all the money that comes in relating to maths products, including maths books, stationery etc. As for setting up and maintaining the whole structure of the site, I'd offer percentages of all the money that came in from the whole site. You get the picture. In any case, I won't be paying anything out until it comes in.

Since it would be Open Source, all parts of the site would be freely downloadable, so that people could use features included on the site. Everyone that contributes to the site in any way would be credited on the site. Anyone that writes code for the site in any language must annotate it properly, so that it can be reused easily, and there would be a section of the site devoted to explaining every aspect of how the site was pieced together, and how to use the code that you can download.

Of course security is a concern on such a project, so people would only have easy access to non-sensitive data.

It's a mammoth project, and could only really be achieved by many people coming together - as is the spirit of Open Source.

I have only found out about XHTML, CSS, JavaScript, Perl, PHP, MySQL, Apache, Linux (I haven't even tried it yet, though I did know about it), and the various other things that I'm intending on using, in my research for the project, but I'm getting there. I don't want the thing to be a failure because there wasn't enough planning, so I'm doing my research so that I know how things are going to be done. I don't need to know how to do them yet, but I really do want to know how they CAN be done, and how I probably will do them eventually.

Having been working on this now for about 6 months (essentially full time), I'm at a stage where I know what I want to ask people to do. I'm currently in the process of setting up a site that will detail all the different aspects of the project, roughly how I anticipate setting them up, and what incentives there are for helping me out. I anticipate that this will probably take me about 6 months to set up properly, since there is quite a lot of work that will go into just this initial site - especially considering I've never even used a database.

However, I have done a lot of reading in preparation, and am confident that I'll be able to set it up alright. As for the ERDs, they are all essentially in my head and contained within the various piles of notes that I have - I just need to write them up. Since there are potentially going to be hundreds or thousands of people working on this (it could be one, but I hope not!), I need to be extremely clear about how I'm intending on setting it up.

I'm probably going to write the whole site up in UML, and I might even post the relevant parts with each section that I put on my initial site, to make it clearer what I'm trying to achieve. Just because of the size and scope of the project, this will be a task in itself - I have yet to use UML as well!

It's all essentially new to me, and the learning curve is steep, but it's a lot of fun, and I'm learning a lot. I anticipate spending a lot of time asking questions in forums. They really are an invaluable resource to people like myself.

Thanks again for all your responses. They have been very helpful. Any further suggestions that you can make would be gratefully received, as always.
 
