
PLEASE HELP, I am stuck!!!

Status
Not open for further replies.

drepac

Programmer
Jan 9, 2006
4
CA
Hi everybody and happy new year to all of you.

I am asking for your help because I am really not getting anywhere with my project. So you guys are my last hope.

I am a newbie to data mining and I am doing my best to learn the basics so that I can get started with my project.

I guess my problem is that I am new to data mining and have no clue how to get started. My approach so far has been to read books about data mining and machine learning so that I can understand and compare the numerous algorithms out there, and then find the one(s) that apply to my case.

The first problem with this approach is that I have not found any good resources (the subjects are either treated superficially or made overly complex and hard to follow). The second problem is that it is incredibly time consuming. So I am wondering whether I should continue down this path or proceed differently.

I am sure that a lot of you have been in the same position and that some of you have struggled with this problem just like me. So I am hoping that you can suggest a method that would help me get started.

The project I am working on is in the field of agriculture. Its objective is to find the best values for all the parameters that affect the outcome (the amount of meat produced) of an animal production operation (could be dairy, poultry, pork, etc.).

So, as I said, the approach is to run one or more algorithms on historical data for a certain type of production (poultry, for example) and find the operating conditions that maximize the growth of the animals (weight) while minimizing production costs. A few examples of the questions this project is trying to answer: When, and for how long, should the barns be lit? When, and how much, should we feed the animals? What is the best operating temperature set point? When, and how much, cooling or heating should be done?

As you may have noticed, all these questions concern the optimization of the operating conditions and, most importantly, the reduction of operating costs. Huge amounts (tens of GB) of historical data on these operating conditions are to be used for this purpose.

PS: I am trying to use the Weka learning environment (Java-based and open source).

I hope that you guys will be kind enough to help me work my way through this. I would appreciate your help and advice, and I thank in advance all of you who took the time to read this lengthy post.

Cheers.
 
Microsoft Data Mining will help you with the model building; it comes free with SQL Server Analysis Services. Remember that you are building a predictive model and not necessarily trying to uncover the truth. If the data mining of a clothing manufacturer indicates a strong seasonal peak in the purchase of raincoats in July, it may or may not mean that July is a rainy month. It might also be retailers stocking up for a future rainy season. All that matters is that you have identified July as a significant predictor in raincoat sales, and so your model would take that into account.

In my mind, this kind of model building means applying statistics such as Analysis of Variance (ANOVA) to a dataset and then using the explained variance of each metric as part of a predictive model (correlations, cross-correlations, autocorrelation, etc.).

-------------------------
The trouble with doing something right the first time is that nobody appreciates how difficult it was - Steven Wright
 
Hi johnherman and thanks for the quick reply.

I did not understand the analogy between the raincoat example you mentioned and the optimization problem I am trying to solve. I would appreciate it if you could elaborate on this and maybe explain how ANOVA would help me solve my problem. Also, is there a good resource you would recommend to help me get familiar with ANOVA?

thanks again,

Cheers.
 
I think my point regarding the raincoats is that although you may find that increasing (say) calcium in the dietary mix increases weight, it may not be a direct cause of weight gain. For instance, calcium might cause water retention or prevent the metabolism of fat. Nevertheless, increased levels of calcium can be used to explain (ANOVA) or predict (model building) weight gain.

Actually, what most data mining software does is gather statistics (ANOVA, regression, correlation, etc.) on a subset of the data, then "test" the solution against another part or parts of the data (validating the model). This is a permissible activity if the entire history of your data can be considered "normal", so the first thing you need to do is test your data population for normality (if the data mining tool does not do this).
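To make the "build on one part, test on another" idea concrete, here is a minimal sketch in plain Python. It is not what any particular data mining tool does internally; the function names, the single-input straight-line model, and the made-up feed/weight numbers are all assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predicted and observed values."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)


def holdout_validate(records, test_fraction=0.3, seed=0):
    """Fit on one part of the data and measure the error on the held-out part."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    model = fit_line([r[0] for r in train], [r[1] for r in train])
    error = mean_abs_error(model, [r[0] for r in test], [r[1] for r in test])
    return model, error


# Hypothetical records: (feed amount, weight gain), with a small alternating wobble.
records = [(x, 5.0 + 2.0 * x + (0.5 if x % 2 else -0.5)) for x in range(20)]
model, error = holdout_validate(records)
print("intercept, slope:", model)
print("mean absolute error on held-out data:", error)
```

If the held-out error is much larger than the error on the data used for fitting, the model has probably memorized quirks of that subset rather than a usable relationship.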

The ANOVA (and data mining model) will describe the degree to which each factor contributes to explaining the variance (ANOVA) or the magnitude of each factor's contribution to the prediction of the optimal solution.
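As a rough illustration of "how much of the variance a factor explains", the sketch below computes a one-way ANOVA by hand for a single factor. The temperature set points and weight gains are invented numbers, and `one_way_anova` is a hypothetical helper, not a function from Weka or SQL Server.

```python
def one_way_anova(groups):
    """One-way ANOVA for a single factor.

    groups maps each factor level to the list of outcomes observed at that level.
    Returns (F statistic, eta squared), where eta squared is the share of the
    total variance explained by the factor.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0  # variation of the group means around the grand mean
    ss_within = 0.0   # variation of the observations around their group mean
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


# Hypothetical daily weight gain (g) observed at three temperature set points.
gain_by_setpoint = {
    "20C": [50, 52, 54],
    "24C": [58, 60, 62],
    "28C": [53, 55, 57],
}
f_stat, eta_squared = one_way_anova(gain_by_setpoint)
print(f"F = {f_stat:.2f}, explained share of variance = {eta_squared:.2f}")
# prints "F = 12.25, explained share of variance = 0.80"
```

Repeating this for each operating condition gives a first ranking of which factors matter most, keeping in mind the earlier caveat that explaining variance is not the same as proving cause.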

So, from a statistical point of view, it's ANOVA; from a data mining perspective, it's factoring, or factor analysis.
Hope this helps a little more.

 
It sounds to me like your problem is an optimization problem which may involve modeling, as opposed to a pure modeling problem. If your historic database is extensive enough, then finding the optimal combination of factors is simply a matter of searching for the optimal result.
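Searching the history for the best result could look something like the sketch below. Everything specific here is an assumption for illustration: the record fields (`temp_c`, `light_hours`, `weight_kg`, `cost`), the price per kg, and the idea of scoring each run as sale value minus cost.

```python
from collections import defaultdict


def best_conditions(records, price_per_kg, min_runs=2):
    """Return the combination of operating conditions with the best average margin.

    Each record is a dict with hypothetical keys: temp_c, light_hours,
    weight_kg (final weight) and cost (production cost for that run).
    Combinations seen fewer than min_runs times are ignored as too thin.
    """
    margins = defaultdict(list)
    for r in records:
        key = (r["temp_c"], r["light_hours"])
        margins[key].append(r["weight_kg"] * price_per_kg - r["cost"])

    averages = {
        key: sum(values) / len(values)
        for key, values in margins.items()
        if len(values) >= min_runs
    }
    if not averages:
        raise ValueError("no combination has enough historical runs")
    best = max(averages, key=averages.get)
    return best, averages[best]


# Made-up history of production runs.
history = [
    {"temp_c": 22, "light_hours": 16, "weight_kg": 2.0, "cost": 1.0},
    {"temp_c": 22, "light_hours": 16, "weight_kg": 2.2, "cost": 1.1},
    {"temp_c": 24, "light_hours": 18, "weight_kg": 2.5, "cost": 1.5},
    {"temp_c": 24, "light_hours": 18, "weight_kg": 2.4, "cost": 1.4},
    {"temp_c": 26, "light_hours": 20, "weight_kg": 2.6, "cost": 2.2},
]
best, margin = best_conditions(history, price_per_kg=2.0)
print("best (temp_c, light_hours):", best, "average margin:", round(margin, 2))
```

The `min_runs` threshold is there because a single lucky run should not be declared the optimum. With tens of GB of history the same grouping would normally be done in the database rather than in memory.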

If the historic database is not of sufficient volume or quality to do this, then a model might be constructed to suggest new combinations of inputs to try. It may be helpful to study "design of experiments" (also called "experimental design").
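One simple starting point from experimental design is the full factorial design: list every combination of factor levels, then see which ones the history never covered. The sketch below assumes two hypothetical factors and their levels; real designs often use fractional or other layouts to cut down the number of runs.

```python
from itertools import product


def untested_combinations(levels, tried):
    """List the full-factorial combinations that do not appear in past runs.

    levels maps each factor name to the list of values worth trying.
    tried is a list of dicts describing the runs already in the history.
    """
    names = list(levels)
    design = [dict(zip(names, combo)) for combo in product(*(levels[n] for n in names))]
    tried_keys = {tuple(run[n] for n in names) for run in tried}
    return [run for run in design if tuple(run[n] for n in names) not in tried_keys]


# Hypothetical factors and the combinations already present in the history.
levels = {"temp_c": [22, 24, 26], "light_hours": [16, 18]}
tried = [
    {"temp_c": 22, "light_hours": 16},
    {"temp_c": 24, "light_hours": 18},
]
for run in untested_combinations(levels, tried):
    print(run)
```

Here the full design has 3 x 2 = 6 combinations, so 4 remain to be tried.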


-Will Dwinnell
 