I am about to start helping on a data-mining project where the starting data weighs in at 350 GB. That is the compressed size on a set of backup tapes, so I presume the uncompressed size will be greater.
I do not need the data-mining scripts to be fast, but I do need rapid view access to the recordsets. I am hopefully not going to be deploying expensive servers on this project either: it's hopefully not going to be a big job.
Does anyone have hands-on experience using MySQL with data of this size? Will it cope on low-to-mid-spec machines?
Bear in mind:
The first few tasks will be removing columns that are obviously irrelevant to our purpose.
The next few will involve an analysis of which further columns we can delete.
Then comes a cleansing exercise involving simplification of various columns.
Then a transformation of the remaining recordset.
Lastly, a single query across the entire recordset that is intended to result in a single number. (The data is several years of financial data from which we are trying to derive an index that will be maintained monthly going forward. Depending on where we end up, we might well decide to recalculate the index on the fly per additional base record: it all depends on how long the generation query takes.)
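In MySQL terms, the plan looks roughly like the following sketch. All table and column names here are made up purely for illustration; the real schema is still on the tapes:

```sql
-- Steps 1-2: drop columns judged irrelevant to the index
ALTER TABLE raw_data
  DROP COLUMN legacy_flag,
  DROP COLUMN internal_notes;

-- Step 3: cleansing, e.g. simplifying a text column in place
UPDATE raw_data
SET region = UPPER(TRIM(region));

-- Step 4: transformation of the remaining recordset into a working table
CREATE TABLE working_set AS
SELECT account_id, trade_date, region, amount
FROM raw_data;

-- Step 5: the single-number query, here imagined as a weighted average
SELECT SUM(amount * weight) / SUM(weight) AS index_value
FROM working_set;
```

Whether that last full-table aggregate is quick enough to re-run per additional base record is exactly the open question.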
thanks in advance for any insights you may have
Justin