Hello,
I have a question pertaining to data partitioning, model training, and ultimately, scoring a data set (predictive modeling).
The heart of the question is this: if you have a population of 20,000 divided into training/validation/test partitions (40/30/30) for modeling purposes, is it incorrect to then use the resulting score code to score that same population of 20,000?
That was the way I did it, accidentally. So I went back and sampled the entire database (rather than using the very specific population of 20,000), reconstructed my modeling table, and went through the modeling/scoring process again. This time, I used my new score code to re-score the original 20,000, so that I could compare the results.
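For concreteness, here's roughly what the 40/30/30 partition amounts to, sketched in Python with pandas and scikit-learn. This is just an illustration of the split logic, not my actual process, and the file name is hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical modeling table of 20,000 records.
df = pd.read_csv("modeling_table.csv")

# Peel off the 40% training partition first...
train, rest = train_test_split(df, train_size=0.40, random_state=42)

# ...then split the remaining 60% evenly, giving 30% validation
# and 30% test of the original population.
validate, test = train_test_split(rest, train_size=0.50, random_state=42)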
I compared the scores for a sample of 100 records: I took the difference between each pair of scores, took the absolute value, and averaged the results. The mean absolute difference came out to 0.05, meaning that, on average, the two models' probability scores for the same record differ by about 5 percentage points (a record scored at 80% by one model might score anywhere around 75% to 85% with the other). So there was a difference, but it could be attributed to many things. All it really told me was that I needed to ask the question!
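For reference, the comparison I describe above boils down to something like this pandas sketch. The file names, the record_id key, and the score column names are all hypothetical:

import pandas as pd

# Hypothetical score outputs from the two modeling passes.
old = pd.read_csv("scores_original_model.csv")
new = pd.read_csv("scores_resampled_model.csv")

# Line up the two scores for each record, then sample 100 records.
merged = old.merge(new, on="record_id", suffixes=("_old", "_new"))
sample = merged.sample(n=100, random_state=42)

# Mean absolute difference between the paired scores.
mad = (sample["score_old"] - sample["score_new"]).abs().mean()
print(f"Mean absolute difference: {mad:.3f}")  # came out around 0.05 for me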
So, back to the question: what is considered "best practice" when it comes to scoring the same records your model was trained on?
Many thanks!