Hello,
I have a question pertaining to data partitioning, model training, and ultimately, scoring a data set (predictive modeling).
The heart of the question is this: if you have a population of 20,000 divided into training/validation/test partitions (40/30/30) for modeling purposes, is it incorrect to then use the resulting score code to score that same population of 20,000?
That was the way I did it, accidentally. So I went back and sampled the entire database (rather than using the very specific population of 20,000), reconstructed my modeling table, and went through the modeling/scoring process again. This time, I used my new score code to re-score the original 20,000, so that I could compare the results.
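For concreteness, here's roughly what the 40/30/30 partition amounts to, sketched in Python with pandas and scikit-learn. This is just an illustration of the split logic, not my actual process, and the file name is hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical modeling table of 20,000 records.
df = pd.read_csv("modeling_table.csv")

# Peel off the 40% training partition first...
train, rest = train_test_split(df, train_size=0.40, random_state=42)

# ...then split the remaining 60% evenly, giving 30% validation
# and 30% test of the original population.
validate, test = train_test_split(rest, train_size=0.50, random_state=42)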
I compared the scores for a sample of 100 records: I took the difference between each pair of scores, took the absolute value, and averaged the results. The mean absolute difference came out to 0.05, meaning that, on average, the two models' probability scores for the same record differ by about 5 percentage points (a record scored at 80% by one model might score anywhere around 75% to 85% with the other). So there was a difference, but it could be attributed to many things. All it really told me was that I needed to ask the question!
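For reference, the comparison I describe above boils down to something like this pandas sketch. The file names, the record_id key, and the score column names are all hypothetical:

import pandas as pd

# Hypothetical score outputs from the two modeling passes.
old = pd.read_csv("scores_original_model.csv")
new = pd.read_csv("scores_resampled_model.csv")

# Line up the two scores for each record, then sample 100 records.
merged = old.merge(new, on="record_id", suffixes=("_old", "_new"))
sample = merged.sample(n=100, random_state=42)

# Mean absolute difference between the paired scores.
mad = (sample["score_old"] - sample["score_new"]).abs().mean()
print(f"Mean absolute difference: {mad:.3f}")  # came out around 0.05 for me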
So, back to the question: what is considered "best practice" when it comes to scoring the same records your model was trained on?
Many thanks!