Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Identifying Social Security Numbers with reverse order digits 2

Status
Not open for further replies.

jimoo

Programmer
Jun 2, 2003
1,111
US
I am cleaning some legacy data and have noticed there are individual in the Database twice with the same 9 digits in their social security number (SSN) but 2 of the digits are reversed, probably because of a keypunch error.

I am looking for a way to identify people with the same 9 digits, but 2 are reversed. Alternately, if this is too much of puzzle, identify the the same 9 digits and 7 of 9 are in the same order.

Below is some samlple code to populate a test db.

Notice the pairs SSNs that start wit 12, 34, 67 all contain the same 9 digits, but 2 are revesed.

12.. has the 67 reversed.
34.. has the 56 reversed
67.. has the 12 reversed

These sets of records I'd like to identify with code.


CREATE TABLE #temp (
[SSN] [varchar](9))

INSERT into #temp (ssn) values ('123456789')
INSERT into #temp (ssn) values ('123457689')
INSERT into #temp (ssn) values ('234567890')
INSERT into #temp (ssn) values ('345678901')
INSERT into #temp (ssn) values ('346578901')
INSERT into #temp (ssn) values ('456789012')
INSERT into #temp (ssn) values ('567890123')
INSERT into #temp (ssn) values ('678901234')
INSERT into #temp (ssn) values ('678902134')
INSERT into #temp (ssn) values ('789012345')


select * from #temp

-- drop table #temp



Jim
 
Wouldn't you be better of just identifying those records where the people match (fisrtname, surname, dob, etc) but the SSN is different? My thinking here is you still won't know which is the correct SSN anyway (so will have to deal with it) so rather than a complex function to find transposed digits, you could run a simpler query?

I like work. It fascinates me. I can sit and look at it for hours...
 
Here's a query that may work for you.

Just to explain it first...

The first thing I do is to pivot the data so that a SSN of 222334444 looks like:

Code:
Position, Character
1         2
2         2
3         2
4         3
5         3
6         4
7         4
8         4
9         4

Now that I have the data like this, I can do interesting things with it. For example, I can join this data back to itself and compare the individual characters from each of the SSN's. I can also sum the individual digits.

Here's the code:

Code:
CREATE TABLE #temp (
[SSN] [varchar](9))

Declare @Temp Table(Position Int)

Insert Into @Temp Values(1)
Insert Into @Temp Values(2)
Insert Into @Temp Values(3)
Insert Into @Temp Values(4)
Insert Into @Temp Values(5)
Insert Into @Temp Values(6)
Insert Into @Temp Values(7)
Insert Into @Temp Values(8)
Insert Into @Temp Values(9)


INSERT into #temp(ssn) values ('123456789')
INSERT into #temp(ssn) values ('123457689')	
INSERT into #temp(ssn) values ('234567890')	
INSERT into #temp(ssn) values ('345678901')
INSERT into #temp(ssn) values ('346578901')
INSERT into #temp(ssn) values ('456789012')
INSERT into #temp(ssn) values ('567890123')
INSERT into #temp(ssn) values ('678901234')
INSERT into #temp(ssn) values ('678902134')
INSERT into #temp(ssn) values ('789012345')

; With Data As
(
    select  T.SSN, SUBSTRING(T.SSN, P.Position, 1) As C, P.Position
    from    #temp T
            Cross Join @Temp P
)
Select  A.SSN, B.SSN, Count(*)
From    Data As A
        Inner Join Data As B
          On A.SSN < B.SSN
          And A.Position = B.Position
          And A.C = B.C
Group By A.SSN, B.SSN
Having Count(*) > 6
       And Sum(Convert(int, A.C)) = Sum(Convert(Int, B.C))

Drop Table #Temp

If you remove the sum condition from the having clause, you may catch more than just transposition problems. For example, if one SSN is 111223333 and another is 11122233[!]4[/!]. The sum condition will remove it from the result set because the digits, when added together, don't match.



-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
George

This is excellent. It even shows me the 8 of 9, which I didn't ask for but will be equally important.

Great job and Thank You!



Jim
 
Something else you may want to consider: In a valid SSN, none of the groups can be all zeros, so
[ul]
[li][tt]000-01-2345[/tt][/li]
[li][tt]001-00-2345[/tt][/li]
[li][tt]012-34-0000[/tt][/li]
[/ul]
are all invalid (or at least they used to be).

Unfortunately, SSNs have no check digit, like a bank account number or a credit card number, so you can't validate them that way.

They were issued in a complicated order, and the Social Security Administration published a list of high-number ranges which used to make it easier to spot a bad number (or, at least one that hadn't been issued yet). I wrote a validation program for that way back when (it was in RPG on an AS/400).

However, that's now all out the window, because since June 25, 2011, they are now issuing numbers at random.

If you can be sure that all of the SSNs in your database were issued prior to that, you can use the group maximums to validate.

-- Francis
Aut viam inveniam aut faciam.
 
Thank you Francis, you shared a good idea to check the ranges of the SSN from the prior to 6/2011 way the SSN admininstration issued the numbers. All of the members in my db are adults. I will look at this as well.

A side note, George's solution works great for small number of records, but when I get over 10,000 or so it really slows and at almost 500,000 doesn't seem practical. That doesn't mean I am not using though. I just need to filter the data into smaller sets and maybe I'll just compare the most common.



Jim
 
You indicated that this was a "clean up" routine, where performance does not matter as much.

How many SSN's do you have, and what would you consider to be acceptable performance?



-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
Jimoo:

Consider the fact that adults as well as children can be issued a SSN (immigrants come to mind).

I don't work anymore at the place where I wrote that program, but here is an explanation of how they were issued before June 25, 2011.

-- Francis
Aut viam inveniam aut faciam.
 
George, I did not capture exact metrics, but what occurred was this.

When I selected just 1,000 records it ran in a few seconds, but when I got to 10,000 records it ran for almost 7 minutes. I increased it to just over 20,000 records and it ran for about 15 minutes before I encountered the following:

(20351 row(s) affected)
Msg 1101, Level 17, State 10, Line 94
Could not allocate a new page for database 'TEMPDB' because of insufficient disk space in filegroup 'DEFAULT'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.


It filled the tempdb. It appears it is exponential. I.E. 1x1 + 1+2 + 1+3, etc... until the cross tab becomes very large. I have over 450,000 records.

Therefore, I limited my request to hard match based on lastname, first initial, and year of birth. I also looped it 26 times and processed each beginning letter separately. The script took about 20 minutes to execute and produced about 800 hard matches.

Code:
USE MyDatabaseName


DECLARE @Temp TABLE(Position Int)

INSERT INTO @Temp Values(1)
INSERT INTO @Temp Values(2)
INSERT INTO @Temp Values(3)
INSERT INTO @Temp Values(4)
INSERT INTO @Temp Values(5)
INSERT INTO @Temp Values(6)
INSERT INTO @Temp Values(7)
INSERT INTO @Temp Values(8)
INSERT INTO @Temp Values(9)


DECLARE @i int
SELECT @i = 1

DECLARE @letter char;
SET  @letter = ''


WHILE (@i <= 26) 
BEGIN
	SELECT @letter =  CHAR(@i + 64)
	PRINT @letter

    
	CREATE TABLE #temp (
	[SSN] [varchar](9))
	
	-- FYI: letter a has approximately 3095
	INSERT INTO #temp
	SELECT ssn FROM customer WHERE 
	lastname + left(firstname,1) + str(year(birthdate)) 
		in (SELECT distinct lastname + left(firstname,1) + STR(YEAR(birthdate))
	FROM customer WHERE LEFT(lastname,1) = @letter
	GROUP by lastname + left(firstname,1) + STR(YEAR(birthdate))
	HAVING count(*) > 1) order by LastName, FirstName
      
	; WITH Data AS
	(
		SELECT  T.SSN, SUBSTRING(T.SSN, P.Position, 1) As C, P.Position
		FROM    #temp T
				Cross Join @Temp P
	)
	SELECT  A.SSN AS SSN1, B.SSN as SSN2, Count(*) AS Cnt, 0 AS reversed,
		SPACE(30) AS LastName1, SPACE(30) AS FirstName1, SPACE(30) AS MiddleName1,
		SPACE(30) AS LastName2, SPACE(30) AS FirstName2, SPACE(30) AS MiddleName2,
		GETDATE() AS birthdate1, GETDATE() AS birthdate2
	INTO #temp2
	FROM    Data AS A
			Inner Join Data AS B
			  ON A.SSN < B.SSN
			  AND A.Position = B.Position
			  AND A.C = B.C
	GROUP BY A.SSN, B.SSN
	HAVING Count(*) > 6
		   AND Sum(Convert(int, A.C)) = Sum(Convert(Int, B.C))

	UPDATE t1 
	SET  LastName1 = t2.LastName,
	 FirstName1 = t2.FirstName,
	 MiddleName1 = t2.MiddleName,
	 Birthdate1 = t2.Birthdate
	FROM #temp2 t1 inner join customer t2
	ON t1.ssn1 = t2.SSN

	UPDATE t1 
	SET  LastName2 = t2.LastName,
	 FirstName2 = t2.FirstName,
	 MiddleName2 = t2.MiddleName,
	 Birthdate2 = t2.Birthdate
	FROM #temp2 t1 inner join customer t2
	ON t1.ssn2 = t2.SSN

	-- insert possible dups into a table that can be analyzed later
	INSERT INTO dbo.zDupCleanup 
	SELECT * 
	FROM #temp2 
	WHERE SOUNDEX(firstname1) = SOUNDEX(firstname2)
	AND SOUNDEX(lastname1) = SOUNDEX(lastname2)  AND year(birthdate1) = year(birthdate2)


    DROP TABLE #temp
	DROP TABLE #temp2   
	
	SELECT @i = @i +1

	
END
PRINT 'All Done'





Make Table to hold Duplicates
Code:
USE [MyDatabaseName]


CREATE TABLE [dbo].[zDupCleanup](
	[SSN1] [varchar](9) NULL,
	[SSN2] [varchar](9) NULL,
	[Cnt] [int] NULL,
	[reversed] [int] NOT NULL,
	[LastName1] [varchar](30) NULL,
	[FirstName1] [varchar](30) NULL,
	[MiddleName1] [varchar](30) NULL,
	[LastName2] [varchar](30) NULL,
	[FirstName2] [varchar](30) NULL,
	[MiddleName2] [varchar](30) NULL,
	[birthdate1] [datetime] NOT NULL,
	[birthdate2] [datetime] NOT NULL
) ON [PRIMARY]


To test:
1. create a sample customer table with ssn, lastname, firstname, and birthdate.
2. Add records including some with 7 and 8 digits matching
3. Create z holding table
4. Run script


Jim
 
I noticed my INSERT STATEMENT uses SOUNDEX of firstname and the condition should be LEFT(firstname,1) similar to the query that select the initial data.

Change Insert below where to:

WHERE LEFT(firstname1,1) = LEFT(firstname2,1)


INSERT INTO dbo.zDupCleanup
SELECT *
FROM #temp2
WHERE SOUNDEX(firstname1) = SOUNDEX(firstname2)
AND SOUNDEX(lastname1) = SOUNDEX(lastname2) AND year(birthdate1) = year(birthdate2)




Jim
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top