Balancing Data

Guest_imported · Jan 4, 2002

Hi everyone,

I hope someone can help me.
I'm not sure whether I should balance my data before fitting a predictive model or a decision tree. Imagine you have 10000 cases (customers): 1000 answered your first direct mail, 9000 did not. For your next direct mail you want to select only profitable customers (those who may answer your mail) with binary logistic regression or maybe a decision tree. The dependent variable is "Answer" vs. "No answer". Do you use your sample as it is (1000 customers vs. 9000 customers), or do you first balance it, for example by taking all 1000 customers who answered your first direct mail vs. a random sample from the 9000 customers who did not answer it?

Thank you very much for your help, best regards

Markus

Predictor · Jan 4, 2002

This exact question just came up over on the Data Mining Club on Yahoo! (http://clubs.yahoo.com/clubs/datamining).

The answer depends on what you want to accomplish. Taking the simple route, a modeling system might collapse to always predicting the majority class, and thereby claim a performance of 90% (as measured by simple, unweighted accuracy)! If this is not satisfactory, then we must be able to say why.
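To make the 90% figure concrete, here is a minimal sketch using the counts from the question (1000 "Answer", 9000 "No answer"). The variable names are only illustrative.

```python
# Labels from the question: 1 = "Answer", 0 = "No answer".
actual = [1] * 1000 + [0] * 9000

# A degenerate model that always predicts the majority class.
predicted = [0] * len(actual)

# Simple, unweighted accuracy.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall on the minority class: share of true responders the model finds.
found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = found / sum(actual)

print(f"accuracy = {accuracy:.2f}")  # 0.90
print(f"recall   = {recall:.2f}")    # 0.00
```

The model looks 90% accurate yet selects no responders at all, which is useless for choosing whom to mail.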

By saying that we are willing to accept a poorer naive accuracy, we are saying that correctly predicting an example of the minority class is more important than accidentally misclassifying an example of the majority class. To translate this into terms the modeling algorithm can understand, we may weight cases of the minority class more heavily than those of the majority class, either explicitly in our learning algorithm or by stratifying the data set ("balancing", as you say).
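The two options can be sketched roughly as follows, using the counts from the question. The equal-total-weight scheme and the customer identifiers are assumptions made for illustration; the right weights depend on your costs.

```python
import random

n_pos, n_neg = 1000, 9000   # "Answer" vs. "No answer" counts from the question
n_total = n_pos + n_neg

# Option 1: explicit case weights, chosen (as an assumption) so that both
# classes carry the same total weight in the learning algorithm.
weight_pos = n_total / (2 * n_pos)   # 5.0
weight_neg = n_total / (2 * n_neg)   # about 0.556

# Option 2: balance by undersampling the majority class.
rng = random.Random(0)               # fixed seed so the sample is repeatable
responders = [(f"cust_{i}", 1) for i in range(n_pos)]
non_responders = [(f"cust_{i}", 0) for i in range(n_pos, n_total)]
balanced = responders + rng.sample(non_responders, n_pos)
rng.shuffle(balanced)

print(weight_pos, round(weight_neg, 3), len(balanced))
```

Weighting keeps all 10000 cases, while undersampling discards 8000 of the majority cases, so weighting may be preferable if your algorithm supports case weights.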

It may be helpful for you to define a cost matrix, which gives the cost of every combination of predicted and actual class. Clearly specifying your cost structure will allow the construction of an appropriate weighting scheme.
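As a rough sketch of what a cost matrix might look like for the direct mail case: the mailing cost and response profit below are made-up numbers, and the function names are hypothetical. The idea is to mail a customer only when the expected cost of mailing is lower than the expected cost of skipping.

```python
# Hypothetical costs per customer, in arbitrary currency units (assumptions).
MAIL_COST = 1.0          # cost of sending one mail piece
RESPONSE_PROFIT = 20.0   # profit when a mailed customer answers

# cost[actual][action]: actual 1 = would answer, 0 = would not answer;
# action 1 = mail the customer, 0 = skip. A negative cost is a gain.
cost = {
    1: {1: MAIL_COST - RESPONSE_PROFIT, 0: 0.0},
    0: {1: MAIL_COST, 0: 0.0},
}

def expected_cost(p_answer, action):
    """Expected cost of an action given the predicted answer probability."""
    return p_answer * cost[1][action] + (1.0 - p_answer) * cost[0][action]

def best_action(p_answer):
    """Return the action (0 = skip, 1 = mail) with the lowest expected cost."""
    return min((0, 1), key=lambda action: expected_cost(p_answer, action))

for p in (0.02, 0.10, 0.50):
    print(p, "mail" if best_action(p) == 1 else "skip")
```

With these made-up numbers the break-even point is a predicted answer probability of 1/20 = 0.05, far below the 0.5 cut-off that unweighted accuracy implicitly uses.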

Predictor

qidynamics · Mar 12, 2002

An approach that has worked for me is balanced stratified sampling. Basically the idea is for the machine learning algorithm to learn the rules. The rules will be learned first on a test data set, then evaluated, then applied to the 'real world'. Since you are trying to learn the rules at this point, I would have a roughly equal number of cases from each type of instance (balancing). This way you will need fewer trees in the forest to create your rule base. I cannot overemphasize the need for your model evaluation to test for accuracy. No model is useful unless it passes this test. I would also point to Dorian Pyle's "Data Preparation for Data Mining".
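A minimal sketch of this workflow, using the counts from the original question. It assumes (the post does not say so) that the evaluation set is held out before balancing and keeps the original 10% response rate, so the measured accuracy reflects the 'real world' mix. The helper names and the 30% hold-out share are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the split is repeatable

# (customer id, label) pairs: 1 = "Answer", 0 = "No answer".
cases = [(i, 1) for i in range(1000)] + [(i, 0) for i in range(1000, 10000)]

def stratified_split(cases, holdout_pct, rng):
    """Hold out the same percentage of each class for evaluation."""
    train, holdout = [], []
    for label in (0, 1):
        group = [c for c in cases if c[1] == label]
        rng.shuffle(group)
        cut = len(group) * holdout_pct // 100
        holdout += group[:cut]
        train += group[cut:]
    return train, holdout

def balance(train, rng):
    """Undersample so both classes have equally many training cases."""
    pos = [c for c in train if c[1] == 1]
    neg = [c for c in train if c[1] == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)

train, holdout = stratified_split(cases, 30, rng)
balanced_train = balance(train, rng)

print(len(balanced_train), len(holdout))  # 1400 3000
```

The rules would be learned on `balanced_train` only, and the model's accuracy would then be checked on `holdout` before it is applied to new customers.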
