DW Storage Hardware buying criteria?

Status
Not open for further replies.

pmorrow (IS-IT--Management) · Mar 4, 2002 · US
Looking for help in understanding what a storage vendor should consider as the most important buying criteria for storage hardware to support data warehousing applications.

It seems capacity, scalability and interoperability (to work with multiple systems from which one would want to gather data) are the most important criteria. Would you agree? How would other characteristics like I/O speed, availability, reliability rank in your opinion for DW needs in comparison to requirements for other applications?

Thanks for the help,

Peter
 
I'm used to answering this question from a client's point of view, not a vendor's, but the answer is probably the same...

Capacity and scalability are no-brainers. You have to be able to store the needed data, but most storage vendors handle this well, so capacity alone probably won't set a single vendor apart. Interoperability is a little more interesting, however. Most DW implementations I've seen rely on a single server (or cluster) to do ETL, and a single server (almost always of the same type) to respond to requests for data (end-user queries). There's a lot of software competition in those two areas, but we haven't seen any particular standouts in the hardware arena. HP, Sun, Microsoft, IBM, etc. all seem to be on equal footing, so a DW solution will probably stick with one provider for systems, which may make interoperability less important.

I think the variables that matter far more have to do with the physical storage itself: I/O, RAID configuration, latency, redundancy, and so on. People don't often consider the differences between traditional transactional systems and DW systems with respect to these factors. For example, OLTP systems traditionally require the ability to write small pieces of data to a database quickly; committing a single transaction is the core operation. An OLAP system, however, emphasizes fast retrieval of large amounts of data. Trading off time spent storing data (especially data that won't change, such as archival data) for faster retrieval rates makes sense. From a storage point of view this means technologies such as mirrored drives (slower writes, faster reads) are preferable to write-optimized striping schemes. The specifics always change, of course, depending on whether you deploy a DW on a relational DB that does lots of on-the-spot aggregation, or on pre-aggregated multidimensional objects. Lots of variables, but the point is that the environment is different, and that needs to be considered.
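To make that read-vs-write tradeoff concrete, here's a rough back-of-envelope model. The numbers and the model itself are my own simplification (idealized disks, no controller overhead), not anything from a vendor spec sheet:

```python
# Idealized relative throughput of basic RAID layouts, to show why a
# read-heavy DW favors mirroring over write-optimized configurations.
# Assumes perfect parallelism; real arrays will differ.

def raid_throughput(level, disks, single_mb_s=40.0):
    """Return (read_MB_s, write_MB_s) for an idealized array."""
    if level == 0:   # striping: reads and writes both span all disks
        return disks * single_mb_s, disks * single_mb_s
    if level == 1:   # mirroring: read from any copy, write to every copy
        return disks * single_mb_s, single_mb_s
    if level == 5:   # striping + parity: ~4 I/Os per small write
        return (disks - 1) * single_mb_s, (disks - 1) * single_mb_s / 4
    raise ValueError("unmodeled RAID level")

# A 4-disk mirror set: reads scale with spindle count, writes don't --
# a fine fit for a query-heavy warehouse.
print(raid_throughput(1, 4))   # (160.0, 40.0)
```

The point of the toy model is just that the read/write asymmetry of the layout should match the read/write asymmetry of the workload.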

These issues affect capacity and scalability too, but I'd rather see a well-configured medium-sized store than terabytes of poorly configured media. Large banks of small disks, for example, allow more granular control of RAID parameters than a few larger drives.
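A quick sketch of what I mean by granularity (the capacities below are illustrative, not from any particular product): the same raw terabytes carved from many small disks yield several independent RAID groups you can tune per workload, while a handful of big disks yield only one:

```python
# How many independent, separately tunable RAID groups a disk pool
# supports. More groups = finer-grained control over RAID parameters.

def raid_groups(total_tb, disk_tb, disks_per_group):
    disks = round(total_tb / disk_tb)   # how many spindles the pool buys
    return disks // disks_per_group     # complete groups you can form

# Same ~2.4 TB of raw capacity, two ways (hypothetical drive sizes):
print(raid_groups(2.4, 0.073, 4))   # 33 small 73 GB drives -> 8 groups
print(raid_groups(2.4, 0.6, 4))     # 4 large 600 GB drives -> 1 group
```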

Another important consideration is bandwidth. Being able to pull large quantities of data off disk quickly is useless if it can't get across the network or through some other bottleneck. That's another strike against SANs in my opinion, though eventually you have to hit a network to reach users anyway. Aggregating data at the storage level helps significantly, but other hardware approaches, say clustered or remotely mirrored storage, could have a stronger impact. Has anyone implemented hardware solutions for remote mirroring? I can't bring any to mind.
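The bottleneck argument is easy to put numbers on. A pipeline moves data only as fast as its slowest hop, so a fast array behind a slow link is wasted money (the figures below are assumptions for illustration):

```python
# End-to-end throughput is the minimum of every hop along the path.

def effective_mb_s(*hops_mb_s):
    """Throughput of a pipeline is limited by its slowest link."""
    return min(hops_mb_s)

def transfer_minutes(gigabytes, *hops_mb_s):
    """Minutes to move a result set across all hops."""
    return gigabytes * 1024 / effective_mb_s(*hops_mb_s) / 60

# A 200 MB/s array feeding a ~12 MB/s (100 Mbit) LAN: the LAN wins.
print(round(transfer_minutes(10, 200, 12), 1))  # ~14.2 minutes for 10 GB
```

Ten gigabytes takes the same fourteen-odd minutes whether the array does 200 MB/s or 2000, which is the whole point.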

Availability and reliability are red herrings to some extent. Downplaying them would be madness from a marketing standpoint, but in reality the DW environment does differ on these points. In most cases DW data is partitioned at some level over time, and old data rarely changes. Should the hardware anticipate this? Can we sacrifice reliability and speed for cost if older data is rarely used and can be recovered from any backup? Maybe. I'm a fan of the efficient way IBM's mainframes tend to differentiate between online and offline storage, but it's a trend people are moving away from, probably because the cost savings aren't that great any more with drive prices plummeting. This needs to be decided case by case.
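The partitioning idea boils down to a simple policy. Here's a hedged sketch, with tier names and the cutoff entirely made up by me, of letting partition age pick the storage class:

```python
# Time-partitioned warehouse data: recent partitions stay on fast
# storage; old, rarely-touched partitions move to a cheaper tier.
# Tier names and the 12-month cutoff are illustrative assumptions.

from datetime import date

def storage_tier(partition_month, today=date(2002, 3, 1), hot_months=12):
    """Old partitions rarely change, so cheaper/slower storage is fine."""
    age = ((today.year - partition_month.year) * 12
           + today.month - partition_month.month)
    return "online-fast" if age < hot_months else "nearline-cheap"

print(storage_tier(date(2002, 1, 1)))   # recent -> "online-fast"
print(storage_tier(date(1999, 6, 1)))   # archival -> "nearline-cheap"
```

In practice this is exactly the online/offline split those mainframe setups do in hardware, expressed as a rule you could apply per partition.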

Hmmm... have I rambled enough on this answer? Probably. Reply if I can cloud the issue any further.

 
