I have some data that looks like this:
Code:
region_code city zip
AAA Houston 77076
AAA Houston 77024
AAB Houston 77024
The region_code is entered manually by a multitude of different people. It should always be the same thing for each city, but some times it gets messed up.
I am trying to write a query that will let me find the cities that have more than one region code per city, and then update them based on which region code appears more than the other region code.
So far I have:
Code:
select city, count(*) as n
from (
select distinct pid, city, region_code
from customer_dim
) a
group by city
having count(*) > 1
order by count(*) desc
Which finds all the cities that have more than one region code assigned to them per city.
The below code is my attempt to find the max value for each city and return its region code, but of course it doesn't work because I am not doing the sub queries right. Can someone help?
Code:
select region_code, city
from (
select region_code, city, count(*) as n
from customer_dim
where city = 'HOUSTON'
group by region_code, city) a
join (
select max(n) as maxn
from (
select region_code, city, count(*) as n
from customer_dim
group by region_code, city) aa
) b
on a.city = b.city
and b.maxn = a.n