Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chris Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Find data outlier 1

Status
Not open for further replies.

DotNetGnat

Programmer
Mar 10, 2005
5,548
IN
Guys,

I have the following tables shown below in the simplest possible format...

Code:
Table Name -> Route

RouteId | Elevation
__________________________
   1     | 200.25
   1     | 200.30
   1     | 200.27
   2     | 100.92
   2     | 100.85
   3     | 300.89
   4     | 200.23

Code:
Table Name -> RouteNameJoin

OrgRouteId | RouteId
_________________________________
     1000  |   1    
     1000  |   4    
     2000  |   2
     2000  |   3

The data here is only shown as sample and may not be correct...I want to find the data outliers in the Original Routes...There are multiple routes that form a original route...may be I need to find mean and standard deviation for the data set....but any ways I want the following result...

Code:
OriginalRouteId | RouteId | Elevation
______________________________________
    2000        |    3    |  300.89

Any suggestions...

-DNG

 
DNG,

Please pardon my dimness...I'm fuzzy on your algorithm. If you could explain how you get from your data to your result, in terms that an 8-year-old can understand, then I'm sure I can be more helpful.

[santa]Mufasa
(aka Dave of Sandy, Utah, USA)
[I provide low-cost, remote Database Administration services: www.dasages.com]
 
Dave,

Thanks for your attention to my question and sorry that I had not framed my question right...

Here is the scenario...

Each MainRoute or OrgRoute is comprised of smaller route segments...for example from the above data...MainRoute 1000 is comprised of routes 1 and 4..something like this diagrammatically...

.-----route 1----->.----route 4----->.
._______________MainRoute 1000______>.

here i gave only an example..but in some cases main route may have 50 smaller routes joined...

hope that is clear...next...

each smaller route has elevation readings or numbers taken at various places along the route as shown in table 1...something like below...


|||||||||||||||||
.----Route 1-------.

What I have to find is if there are any outliers or bad readings in the routes and show the result in the following format...

MainRoute | SmallerRoute | BadElevationReading

I was thinking that I need to first find mean and standard deviation for the elevation reading of each smaller route and then do the same for all the routes in a main route to find the bad data...

Hope I made clear...

-DNG
 
Hi,
What constitures an outlier?

How do we know, in your example, that 300.89 is one?




[profile]

To Paraphrase:"The Help you get is proportional to the Help you give.."
 
Turkbear,

Thanks for your reply. As I said that numbers I posted may be wrong...

I was thinking finding mean and distribution of reading and then assuming an outlier to be one that points above or below 3 standard deviations...

thanks

-DNG
 
I meant finding mean and standard deviation of the elevation readings and then assuming an outlier to be one of the readings that is above or below 3 standard deviations...

thanks

-DNG
 
Guys,

Also if there are any other easier and efficient ways to find outliers(bad data) in the situation I posted above, are very welcome...

Thanks

-DNG
 
Gnat,

IMHO this is a case of a badly defined requirement. For example, in the UK, one can use certain rules regarding elevation, e.g. the highest mountain cannot be above x feet above sea level.

In the USA, the limit is much higher, as you have those pimples on the landscape knowns as the rocky mountains, right?

So, if I go down monument valley in my British car, the elevation will be pretty much constant. If the road goes over a peak, and then drops back to what it was before, how can a computer system know that that is incorrect? My British car system must accept that the Yanks have higher mountains than we do, so an elevation of 10,000 feet is definitely possible, whilst on holiday in the good ol' US of A. If I am back in blightey, 10,000 feet is clearly drivel.

Similarly, a Swiss driver going from A to B might have to go over the Alps, and go up > 10,000 feet. Alternatively, he might use one of the Alpine tunnels, and his elevation would be constant, whilst going from A to B.

I do not believe that the weird and amazing surface of the planet is amenable to analysis by simple statistical tools.

Who said that stats work in this application? Why, and were they using herbal tobacco when they said it?

Regards

T

Grinding away at things Oracular
 
thargtheslayer,

I totally agree with what you said...i was just trying to devise an algorithm that gives some records that would be manually investigated again...

Its not that i am totally relying on the computer program...manual work is needed...

thanks

-DNG
 
Your approach strikes me as being a reasonable attempt at finding outliers. After all, in the real world a connected route rarely has a sudden spike in elevation followed by an equally sudden reversion to the mean. It could happen, but it would take something like a hiker walking on level ground, taking a break to climb a cliff, descending and continuing the hike on level ground. In that route only the elevation on top of the cliff would show up as an outlier.

And, fortunately for you, Oracle has a standard deviation function. I would try something like the following to identify possible outliers in the Route table.

Code:
select a.routeid, elevation, avg_route_elevation, deviation  from route a, 
  (select routeid, avg(elevation) avg_route_elevation, stddev(elevation) deviation from route
  group by routeid) b
where a.routeid = b.routeid
and abs(elevation - avg_route_elevation) > 3 * deviation
order by 1, 2

Similarly, candidates for outliers in an original route might be identified by

Code:
select b.orgrouteid, a.routeid, a.elevation, avg_route_elevation, deviation from route a, routenamejoin b,
  (select e.orgrouteid, avg(elevation) avg_route_elevation, stddev(elevation) deviation from route d, routenamejoin e
    where d.routeid = e.routeid
    group by e.orgrouteid) c
where a.routeid=b.routeid
  and b.orgrouteid=c.orgrouteid
  and abs(elevation - avg_route_elevation) > 3 * deviation
order by 1, 2

Please note that your sample data doesn't actually have any candidate outliers by this method. That's because original route 2000 has only three measured elevations, and so you get an extremely large standard deviation. You would have to have more measured elevations around 100 to be statistically confident that the 300.89 elevation is a real outlier.
 
Thanks Karluk,

you were right on the requirements...

I also had my queries written along the same lines as you have shown in the post...

I posted my question to get suggestions from your experts..

I have always got great suggestions and help from Dave and others on the forums...

thanks again...have a star for your efforts...

-DNG
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top