
Content Index / Search Hardware Setup 1


NATD (IS-IT--Management)
Sep 18, 2007
Hi,

I'm starting to read up on Content Indexing, as we've had the licences sitting here for a while and it's time I actually used them.

I've already got the hardware I need to use (i.e. it's got to suffice), which is as follows:

Server: Dell 1955 with dual Xeon E5320, 8GB RAM, running W2K3 Enterprise. Dual 2Gb FC HBAs (they are actually 4Gb, but a design flaw in my Dell 8th-gen blade chassis prevents them from running at 4Gb). Local storage is 2 x 15k 74GB SAS. Dual gigabit Ethernet.

Additional storage: 3 * 15K 300GB FC disks spare in my HDS SAN.

At this point, I have about 400GB of document and email data which I want to have indexed. Given that I'd normally use those local disks (as RAID 1) for booting the server and its swapfile, I'm assuming their remaining space is of no use to me for the actual indexes. What I'm wondering now is how best to configure those 3 disks in my SAN, which really boils down to understanding whether the index data itself is important.

Can it easily be rebuilt if I use RAID 0 and have a disk failure at some point? Books Online does say RAID 0 is preferable for performance, but is it assuming I'm protecting that data in some other way?

Or is there a much better way to use my available resources?

 
Are we to assume that you are referring to SIMPANA 7.0 Content Indexing? This is completely different to pre-7.0 CI. Are we also to assume this is to be a single-node deployment you wish to configure on a 64-bit system?

I think having the Admin node, the Indexing node and the websearch all on one box is asking a lot, but that really depends on the volume of data and the number of simultaneous searches that you expect to run.

I think RAID 5 makes more sense.
In a CommVault deployment I think RAID 0 only fits for spool-copy backup-to-disk (B2D).

Are you intending to enable lemmatization?
The index process will consume disk space equal to three times the size of the data ingested before the index is built. Once complete, the index will shrink to 80% of the ingested data size with lemmatization enabled. If 400GB of data will be ingested for indexing, 1.2TB of disk space will be required while the index builds; the final product will consume only 320GB. Lemmatization (searching and indexing beyond the root of a word, e.g. run, running, ran) is enabled by default. A substantial space saving will be realized with slim (no lemmatization) indexing.
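If it helps to see those multipliers in one place, here is a rough sizing sketch in Python using exactly the figures above (3x during the build, ~80% of ingested size once complete with lemmatization). The numbers are from this post rather than official documentation, so treat them as estimates:

def index_space_estimate(ingest_gb, build_multiplier=3.0, final_ratio=0.80):
    """Rough sizing using the multipliers quoted above (estimates, not official figures)."""
    return {
        "peak_build_gb": ingest_gb * build_multiplier,  # scratch space while the index builds
        "final_index_gb": ingest_gb * final_ratio,      # steady-state size with lemmatization on
    }

print(index_space_estimate(400))  # -> {'peak_build_gb': 1200.0, 'final_index_gb': 320.0}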

Good luck!

---------------------------------------
EMCTA & EMCIE - Backup & Recovery
Legato & Commvault Certified Specialist
MCSE
 
Thanks for the detailed reply.

You made the right assumptions re version and that I'm going single node. The hardware is 64-bit, but I understand from Books Online that I should use 32-bit Windows regardless. I'll go back and see if I read the pre-requisites wrong.

While we hadn't gone through a paid engagement, my advice from CV when purchasing the licensing was that a single well-specced host would perform well for my scale; time will tell, I suppose. If it doesn't, I could rustle up some additional, slightly older hardware for the less demanding roles.

I hadn't seen the note about lemmatization needing the 3x space initially, so I'll have to give that some thought. I may have the space to loan it temporarily.

Your comment on RAID 5 is at odds with Books Online. Is that just your general feel, or have you a specific reason for thinking RAID 0 will underperform? Or is it that you feel redundancy is needed?

The only question not addressed was whether or not my index data is valuable, which is again a factor in whether to use RAID 0 or not.



 
Well...
Content indexing deployments scale on the following axes:
Content Volume: The amount of content to be handled by the system, measured in the number of objects. An object is a single document or file. When indexing email, the email itself is an object and each attachment (if any) is also an object (thus an email with two attachments is indexed as three objects).

Query Rate: The number of queries per second the system must support.

Once you know how many objects you have and how many concurrent searches you expect, you will have a better understanding of the scale you are deploying at (see the sketch after the tiers below):

Small: Supporting up to 15,000,000 objects
Medium: Supporting between 15,000,000 and 45,000,000 objects
Large: Supporting more than 45,000,000 objects
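
To put rough numbers on that, here is a small Python sketch of the object counting and the tiers above. The cut-offs and the email/attachment rule are as quoted in this post; the function names are just for illustration:

def email_objects(attachment_count):
    """An email is one object, plus one object per attachment."""
    return 1 + attachment_count

def deployment_scale(total_objects):
    """Classify against the small/medium/large tiers quoted above."""
    if total_objects <= 15_000_000:
        return "Small"
    if total_objects <= 45_000_000:
        return "Medium"
    return "Large"

# Example: 2 million emails averaging one attachment each = 4 million objects
print(deployment_scale(2_000_000 * email_objects(1)))  # -> Small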


For Western languages, and for better performance, slim indexing is recommended rather than full lemmatization.

Disk IO is the primary performance factor. Improvements in indexing speed are derived from larger numbers of smaller drive heads within a single Content Indexing storage cloud. RAID 5/6 is the baseline storage recommendation for the index store, and the RAID should be configured to allow the fastest possible IO within the Content Indexing cloud.

You will need to protect the index data somehow.
Do you plan on using an iDA to protect the data_fixml & data_indexfixml data? You should find that a FS iDA is included in the licensing for this purpose. So I guess if you are protecting this data using a backup, then RAID 0 would be fine.

Hope this fills the missing space!

---------------------------------------
EMCTA & EMCIE - Backup & Recovery
Legato & Commvault Certified Specialist
MCSE
 
Hi

Your hardware specs look fine. To figure out whether you are OK on a single node, it's important to know how many objects you have to index and how heavy you expect the load on the websearch to be; you can always move the websearch to a different IIS box.

A couple of points to add to the above comments due to the new version of FAST (4.3):

1) You are correct - there is no support for 64-bit as yet
2) It is correct that temp space for indexing needs to be around 3x the total data to be indexed, but the actual stored data will be around 25-30% (not 80%, and lemmatization isn't a factor anymore)
3) It is always recommended to separate the indexing nodes from the Admin node, as it makes it easier to grow the cloud if required (as well as for resource balancing)
4) You should look at how many objects you are actually going to be indexing - for a single node (inc. websearch) the recommendation is to aim for no more than 10 million objects (a standalone indexing node can handle ~15 million objects, but it is now possible to increase this to 30 million with some modifications, due to FAST 4.3); a quick sanity check is sketched after this list
5) RAID 0 should be absolutely fine if you aren't too concerned about having a bit of downtime to reindex if a disk goes - you get a filesystem iDA included as standard for the Admin + Index node, but you would need to purchase a SQL iDA to protect the DB2 db
6) At a minimum, you should ensure that you back up the fixml, which would make recovery easier in the event of a DR (you would then just need to run another indexing operation)
7) To reduce the storage required, you could always filter which file types you want to index
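
To make points 2 and 4 concrete, here is a rough Python sketch of that sanity check. The average object size is a made-up placeholder you would replace with your own data profile; the 3x temp space, 25-30% stored size and 10 million object guideline are the figures quoted above, not official limits:

def single_node_check(data_gb, avg_object_kb, single_node_limit=10_000_000):
    """Estimate the object count from data volume and compare it to the single-node guideline."""
    estimated_objects = int(data_gb * 1024 * 1024 / avg_object_kb)
    return {
        "estimated_objects": estimated_objects,
        "within_single_node": estimated_objects <= single_node_limit,
        "temp_space_gb": data_gb * 3,                         # ~3x while indexing (point 2)
        "stored_index_gb": (data_gb * 0.25, data_gb * 0.30),  # ~25-30% once built (point 2)
    }

# avg_object_kb=200 is purely a placeholder; profile your own file/email mix
print(single_node_check(400, avg_object_kb=200))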

As always, the recommendations are top end and dependent on your data etc. You will probably find that with only 400GB of file & email data to index you'll have no problems with the single node (depending on what that 400GB is made up of).

You already have the licenses, so if I were in your position I would just install it and see how you go.



Birky
CommVault Certified Engineer
 
That certainly helps. I'll get on with setting it up and testing, and read the rest of the docs.
 
