Stat Question

Status
Not open for further replies.

ecugrad

MIS
Apr 17, 2001
191
US
Question: I'm trying to run some statistical analysis. My sample consists of 20,000+ observations. The data comes from our workflow database, which tracks the time our employees take to complete an item. The data is also very noisy, because a person might be working on an object and forget to submit it, or there might be some type of system problem. I'm trying to find the mean and dispose of all outliers, but with such noisy data I'm having to run many, many iterations to do this. Does anybody know how I can do this in SAS, or how many iterations you need to run through to find a mean?

Mike
 
ecugrad,
I'm not a stats person, but my gut tells me that you could get results very easily using the SAS procs (freq, means, univariate).

In your case I would discuss your problem (noisy data etc.) with a statistician and try to formulate a plan to eliminate the outliers so that your data is workable. I use proc gplot to give me a picture of how my data falls. Outliers will not fall in the normal range. I can then write exclusion statements to have those outliers drop out of the data subset.
Using proc freq you can see all sorts of outlier info.
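
As a rough sketch of what such exclusion statements might look like (an illustration, not Klaz's actual code): the dataset name WORKFLOW, the variable MINUTES, and the 1st/99th percentile cutoffs are all assumptions made up for this example. The right cutoffs are exactly what you would settle with a statistician.

```sas
* Hypothetical names: WORKFLOW dataset, MINUTES = time to complete an item ;
* Assumed cutoffs: 1st and 99th percentiles ;
proc univariate data=WORKFLOW noprint ;
  var MINUTES ;
  output out=LIMITS pctlpts=1 99 pctlpre=P ;  * creates variables P1 and P99 ;
run ;

* Keep only the observations that fall inside the cutoffs ;
data WORKFLOW_TRIMMED ;
  if _n_ = 1 then set LIMITS ;  * P1 and P99 are retained on every row ;
  set WORKFLOW ;
  if P1 <= MINUTES <= P99 ;     * subsetting IF drops everything outside ;
run ;
```

Percentile cutoffs have the advantage that one huge value cannot drag them around the way it drags a mean.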

In short, you will probably have to massage your data before you get decent results, but hey, that's the thrill of programming in stats.

If you have a specific question feel free to post and I will try to answer it.
Klaz
 
There are lots of ways of doing this stuff. One is to flag the outlier based on the number of standard deviations it is away from the mean. In the example below, I used 1.5 standard deviations ...

You can cut and paste this code as a new little program, run it step by step and see the results.

You might want to play around with the value of 1.5, 2, etc. What I like about this is that the choice of 1.5 or 2 or whatever is completely independent of the type of data (e.g. winter temperatures, new car prices ...); it's all relative to the data it's being applied to.

Here's the code
---------------------------------------


* Quick macro to make a sample of 100 obs ;
* all between 0 and 1 ;
%macro makedata ;
%do i=1 %to 100 ;
RANDNUM = ranuni(1234) ; output ;
%end;
%mend makedata ;

data SAMPLE100 ;
%makedata ;
format RANDNUM 9.4 ;
run;

* Force observation #50 to be relatively huge ;
data SAMPLE100 ;
set SAMPLE100 ;
if _n_ = 50 then RANDNUM = 1000 ;
run;

* Identify the outlier ;
proc means data=SAMPLE100 noprint ;
var RANDNUM ;
output out=STATS (drop=_type_ _freq_)
std=STD_RANDNUM
mean=AVG_RANDNUM ;
run;

* Put these values into macro variables ;

data STATS ;
set STATS ;
call symput("STDDEV",STD_RANDNUM) ;
call symput("AVERAGE",AVG_RANDNUM) ;
run;

* Find the outliers ;
data SAMPLE100 ;
set SAMPLE100 ;
* If the value is more than 1.5 Standard Deviations ;
* from the mean, flag it as an outlier ;
if abs( (RANDNUM-&AVERAGE.) / &STDDEV ) > 1.5
then OUTLIER='Y' ;
else OUTLIER='N' ;
run;
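
Once the flag is set, getting the mean without the outliers is one more step. A minimal sketch, assuming the SAMPLE100 dataset and the OUTLIER flag created by the code above:

```sas
* Mean of RANDNUM after dropping the flagged observations ;
proc means data=SAMPLE100 n mean std ;
  where OUTLIER = 'N' ;
  var RANDNUM ;
run ;
```

With this sample data only observation #50 should be flagged, so the mean should fall from roughly 10.5 (with the 1000 included) to roughly 0.5. On real data, one pass may not be enough, because a very large outlier inflates the standard deviation and can hide smaller ones.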


Alan J. Volkert
Fleet Services
GE Commercial Finance Capital Solutions
(World's longest company title)
Eden Prairie, MN
 