I have a class library that does calculations (usually AVERAGE) on (sometimes) large sets of data. The nature of the application is that it normally does several calculations (5-50 or more depending on user requests) on the same set of data, so the basic implementation is that I grab all the data out into a DataTable and then the library takes arguments and uses the .Compute() method of the DataTable to return results. In most cases, I'm doing an average of a column on a subset of rows. Subsequent calls do averages on different subsets of rows.
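To make the pattern concrete, here's a stripped-down sketch of the current approach (the class, column, and filter names below are made up for illustration; the real library builds the filter from the user's request):

using System;
using System.Data;

// Illustrative sketch only - real names/types differ.
class AverageCalculator
{
    private readonly DataTable _data;   // filled once from a single database query

    public AverageCalculator(DataTable data)
    {
        _data = data;
    }

    // Each of the 5-50 calculations per request ends up as a call like this,
    // with a different row filter over the same DataTable.
    public decimal AverageOf(string column, string rowFilter)
    {
        object result = _data.Compute("AVG(" + column + ")", rowFilter);
        return Convert.ToDecimal(result);
    }
}

// e.g. calc.AverageOf("Amount", "Region = 'West' AND Year = 2009");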
This all works fine until the DataTable grows to 10k+ rows. Once I pass that threshold (or thereabouts), the performance goes into the toilet. It can literally take minutes to pull reports that would take < 10 seconds on a dataset half the size.
The problem is that I don't want a single request to result in 50+ queries to my database, yet I must do 50+ calculations. I can add application servers pretty easily, but I only have one database server. That's my scarce resource, so I want to minimize its work (even though it's better at doing aggregation than my app servers are) and put my app servers to work.
I am well aware of the limitations of the DataTable and I am also well aware that if I run these same queries directly against the database, I get much better performance (orders of magnitude better). I'm just not sure how to overcome this.
So my question is: does anyone have an idea how to pull out possibly large sets of data and do the calculations on the application servers, so as not to overload my database server, while avoiding this huge performance penalty I get when working with large DataTables? I'm willing to give up some performance on smaller sets of data if necessary (the tradeoff would be worth it, I'm sure, if I can figure out how to optimize the top end).
I have a small benchmarking application that I'm working with, and I have yet to find any method that out-performs DataTable.Compute(). I'm working under the assumption that I need to start with a DataTable because I will need to access the data several times (maybe I'm hosed right there - that's why I'm asking this question).
Here is what I've tried thus far (rough sketches of each follow the list):
1) Manually call DataTable.Select() and get a DataRow[] object - brute force iterate and do the average
2) Use LINQ against an EnumerableRowCollection<DataRow> pulled from the DataTable and call .Average()
3) Brute force iterate over the DataTable myself and do the average
4) Call DataTable.CreateDataReader() and brute force iterate on the DataReader to do the average
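For reference, here is roughly what those alternatives look like in my benchmark. Again, the column names, types, and filter values are illustrative only; assume using System; using System.Data; using System.Linq; and a populated DataTable named data.

// #1: DataTable.Select() + manual loop
DataRow[] rows = data.Select("Region = 'West' AND Year = 2009");
decimal sum = 0;
foreach (DataRow row in rows)
    sum += Convert.ToDecimal(row["Amount"]);
decimal avg1 = rows.Length > 0 ? sum / rows.Length : 0;

// #2: LINQ over the DataTable (needs a reference to System.Data.DataSetExtensions)
decimal avg2 = data.AsEnumerable()
    .Where(r => r.Field<string>("Region") == "West" && r.Field<int>("Year") == 2009)
    .Average(r => r.Field<decimal>("Amount"));

// #3: brute-force loop over data.Rows with the filter check written by hand
// (same shape as #1, minus the Select() call)

// #4: DataTableReader + manual loop
decimal sum4 = 0;
int count4 = 0;
using (DataTableReader reader = data.CreateDataReader())
{
    int amount = reader.GetOrdinal("Amount");
    int region = reader.GetOrdinal("Region");
    int year = reader.GetOrdinal("Year");
    while (reader.Read())
    {
        if (reader.GetString(region) == "West" && reader.GetInt32(year) == 2009)
        {
            sum4 += reader.GetDecimal(amount);
            count4++;
        }
    }
}
decimal avg4 = count4 > 0 ? sum4 / count4 : 0;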
Of all of these methods, #1 results in the performance numbers closest to .Compute() (and it's pretty darn close), #2 is the worst (almost twice as slow), and #3 and #4 fall solidly between the two.
Does anyone out there have an idea on how to improve upon .Compute()?
Many thanks -
Paul Prewett
The answer to getting answered -- faq855-2992