what to stratify by in data partition

Status
Not open for further replies.

toolmano

Technical User
Jun 14, 2007
2
NZ
Hi all

I am wondering what the general rule is for stratified sampling in the Data Partition node.

For the categorical variables, do you stratify by just the target variable, or by all categorical input variables that have underrepresented classes?

Some advice would be greatly appreciated.

Kind Regards
Tim
 
That depends on what one is trying to accomplish. Why are you stratifying the data?
 
Sorry, I will add some context. I am stratifying because I am setting up the sampling that splits the data into training and validation sets.

I have 5 categorical variables. One is binary (the classification target) and has a fairly even distribution across its two groups.

The others each have 4 options, some of them underrepresented, which is where the stratification comes in.

Does this seem like a valid approach?
 
In general, for train/test splitting, I try to stratify as much as possible within reason, and yes, I do stratify on the dependent variable. "Within reason" means: 1. I worry most about variables believed to be important, and 2. individual stratification cells should not become too small.

You don't give the actual distributions or observation counts, but I'd start with the dependent variable and whichever of the other variables you think are most important. Keep adding stratifying variables until the cell sizes are no longer reasonable, then stop.
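To make the "add variables until the cells get too small" idea concrete, here is a rough Python sketch using only the standard library. It is not what the Data Partition node does internally; the function names, the min_cell threshold of 10, the 70/30 split, and the toy data (with made-up variables "target", "region", and "grade") are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def choose_strata(rows, candidates, min_cell=10):
    """Add candidate variables in priority order while every observed
    cell still holds at least min_cell rows. Cells with zero rows are
    not detected here, since only observed combinations are counted."""
    chosen = []
    for var in candidates:
        trial = chosen + [var]
        cells = Counter(tuple(r[v] for v in trial) for r in rows)
        if min(cells.values()) < min_cell:
            break  # this variable makes some cell too small; stop adding
        chosen = trial
    return chosen

def stratified_split(rows, strata, train_frac=0.7, seed=0):
    """Split each stratification cell separately so that training and
    validation sets keep roughly the same mix of cells."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for r in rows:
        cells[tuple(r[v] for v in strata)].append(r)
    train, valid = [], []
    for members in cells.values():
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid

# Toy data (hypothetical): a balanced binary target, a 4-level input
# with small groups, and a second 4-level input.
rows = []
for target in (0, 1):
    for region, count in [("A", 100), ("B", 60), ("C", 25), ("D", 15)]:
        for k in range(count):
            rows.append({"target": target, "region": region, "grade": k % 4})

# Candidates are listed in assumed order of importance, target first.
strata = choose_strata(rows, ["target", "region", "grade"], min_cell=10)
print(strata)  # ['target', 'region']: adding 'grade' leaves cells under 10

train, valid = stratified_split(rows, strata)
print(len(train), len(valid))
```

In this toy case the smallest target-by-region cell has 15 rows, so both variables are kept, while adding the third variable would cut some cells down to 3 or 4 rows and is therefore skipped. What counts as a "reasonable" minimum cell size is a judgment call that depends on your observation counts.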
 