
Normal distribution in anonymised data


SimonSellick
Programmer
Nov 3, 2003
Hi,

I have produced a masking process to anonymise personal data in a large database, for testing and similar purposes. The approach I have taken is a many-to-one mapping from source to masked values. I now want to apply a more realistic distribution of masked values, so that some of them appear only a few times while others appear more frequently.

As an example, suppose that I generated 2000 names from 20000 unique source names. Rather than getting roughly 10 instances of each generated name, I would like to get a few with only a single instance, more with 2-5 instances, many with say 20-30 instances, and so on; in other words, frequencies profiled with something like a bell curve.

I am using T-SQL Checksum() to create a 32-bit integer from the name string (or other value), then taking the result modulo the number of masked values available to pick one reasonably randomly but repeatably. What I think I need is a statistical function (preferably) or technique that divides a single interval (the range of a 32-bit integer in this case) into a specified number of low-high interval pairs, such that the pairs could be expected to catch randomly-spaced values in the original range with a statistically normal distribution of frequencies.

Any pointers would be very welcome, as would alternative approaches; I'm not wedded to using Checksum() or selecting via limits.
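
The sort of thing I am imagining, sketched here with made-up table and column names: give each masked value a weight drawn from something roughly normal (a sum of several RAND() calls approximates one, via the central limit theorem), then turn the running totals of those weights into cumulative low-high boundaries across the 32-bit range.
Code:
-- Sketch: divide the 32-bit integer range into 2000 intervals whose widths
-- are roughly normally distributed. Uniformly-spread Checksum() values then
-- land on each masked value with bell-curve frequencies.
create table #intervals (
    maskedid  int identity(1,1) primary key,
    weight    float,          -- roughly normal weight for this masked value
    lowbound  bigint null,    -- start of this value's checksum interval
    highbound bigint null     -- end of this value's checksum interval
)

declare @i as int
set @i = 0

-- A sum of 12 uniform RAND() values approximates N(6, 1).
while @i < 2000
begin
    insert into #intervals (weight)
    values (rand()+rand()+rand()+rand()+rand()+rand()
          + rand()+rand()+rand()+rand()+rand()+rand())
    set @i = @i + 1
end

declare @total as float
select @total = sum(weight) from #intervals

-- Convert cumulative weights into low-high boundaries over the int range.
update i
set lowbound = cast(-2147483648.0
        + (select coalesce(sum(c.weight), 0) from #intervals c where c.maskedid < i.maskedid)
          / @total * 4294967295.0 as bigint),
    highbound = cast(-2147483648.0
        + (select sum(c.weight) from #intervals c where c.maskedid <= i.maskedid)
          / @total * 4294967295.0 as bigint)
from #intervals i
Each source value would then map by a half-open range test (checksum(Name) >= lowbound and checksum(Name) < highbound) instead of by modulo.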
 
Huh? Since you are talking about SQL Server, why not create your "distribution" based on the number of values generated from the unique source, save it in a table, then use it when appropriate? That is, when a generated name is assigned/used, increase its counter, and don't assign/use the name once the maximum is reached.

Good luck.

MCP SQL Server 2000, MCTS SQL Server 2005, MCTS SQL Server 2008 (DBD, DBA)
 
As an alternative, I'm going to suggest what I've used previously in some companies when moving data from production to test/dev environments.

Client names and addresses:
change name from "john smith" to "client xxx"
change address from "timbuctu road" to "addr line 1 xxx"
and "mongolia" to "addr line 2 xxx"
where xxx is the client id of that client

and so on for all char fields that can contain personal data.
For PPSN numbers (or any other type of national id number) it is normally possible to do something similar based on the client id, even if the id field is a char field.
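
In T-SQL that boils down to something like the following (table and column names are illustrative only):
Code:
-- Illustrative only: overwrite personal char fields with values derived
-- from the key, assuming a Clients(ClientId, Name, AddrLine1, AddrLine2) table.
update Clients
set Name      = 'client ' + cast(ClientId as varchar(12)),
    AddrLine1 = 'addr line 1 ' + cast(ClientId as varchar(12)),
    AddrLine2 = 'addr line 2 ' + cast(ClientId as varchar(12))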

Date of birth is a harder one, as it often affects the rest of the application, so I tend to leave that one as is.

I think the above outlines a possible alternative, with the benefit that the masked value clearly identifies the id it relates to.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
I would just assign an integer to each of the 20,000 names, then pull 2000 random numbers between 1 and 20,000. You will get mostly unique values and a few multiples.

Simi
 
Some really interesting ideas there - thank you all.

Bug Slayer's technique is sort-of what I am doing: it is generating the normal distribution within the list of names (your 'maximum') that is tricky.

Unfortunately, I cannot use Frederico's appealing alternative as the data has to look 'real', or at least realistic.

Simi's idea of generating the sample first then applying it to the target range looks good although I'll have to think about how to make it repeatable.
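
(One thought on repeatability, if I've read the documentation right: seeding RAND() once at the start of the batch fixes the whole subsequent sequence for the session, so the same 2000 draws would come out on every run.)
Code:
-- Any fixed seed makes subsequent rand() calls deterministic for this session.
select rand(12345)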

Thanks again for the ideas.
 
-- Create the work table if it does not already exist.
if object_id('tempdb..#test') is null
    create table #test (mynum int)

delete #test

declare @x as int
set @x = 0

declare @mynum as int
set @mynum = 0

-- Draw 2000 random numbers between 1 and 20000.
while @x < 2000
begin
    set @mynum = cast(rand() * 20000 + 1 as int)
    insert into #test (mynum) values (@mynum)
    set @x = @x + 1
end

-- Show how many times each number was drawn.
select mynum, count(mynum) as draws
from #test
group by mynum
order by mynum

Simi
 
Simon, just to make my idea clearer, if your generated table looks like:
Code:
SampleTable(Id, Name, TotAssigned, MaxToAssign)  -- Id is the key
To randomly pick the next name to assign (probably looping while there is a Name with TotAssigned < MaxToAssign) you would do:
Code:
SELECT TOP 1 Id, Name FROM SampleTable WHERE TotAssigned < MaxToAssign ORDER BY NEWID()
Assign it, then update TotAssigned.
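
The surrounding loop might look something like this, where Targets(MaskedName) stands in for whatever table receives the masked names:
Code:
-- Sketch: hand out random names until every quota or every target row is used up.
DECLARE @Id int, @Name varchar(100)

WHILE EXISTS (SELECT 1 FROM SampleTable WHERE TotAssigned < MaxToAssign)
  AND EXISTS (SELECT 1 FROM Targets WHERE MaskedName IS NULL)
BEGIN
    -- Pick a random name that still has quota left.
    SELECT TOP 1 @Id = Id, @Name = Name
    FROM SampleTable
    WHERE TotAssigned < MaxToAssign
    ORDER BY NEWID()

    -- Give it to one unassigned target row.
    UPDATE TOP (1) Targets
    SET MaskedName = @Name
    WHERE MaskedName IS NULL

    -- Record the assignment against the quota.
    UPDATE SampleTable
    SET TotAssigned = TotAssigned + 1
    WHERE Id = @Id
END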

Good luck.

MCP SQL Server 2000, MCTS SQL Server 2005, MCTS SQL Server 2008 (DBD, DBA)
 
Just to ask, why the need to mask actual data with 'realistic' masked data? I get what you are trying to do; I'm just curious as to the why.

I work in a job where releasing Personally Identifiable Information (PII) is a concern, so this thread piqued my interest.

Thanks

John Fuhrman
 
Thanks again for the clarifications and code.

Simi - That cracks it, thank you!

BugSlayer - How to populate MaxToAssign? That is really the hub of my original question, but I was a bit long-winded about it. If I can do that across a sample table with a varying number of entries, the problem is solved. Looking at it this way, the question is really how to divide a bell curve into n equally-spaced samples with values between 0 and 1. I am doing the actual value updates in bulk SQL (not iteratively), so I have to get the frequencies right by pre-calculating.
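
One approach I'm considering (a sketch only, assuming SampleTable gains a float Weight column): give each name a roughly normal weight using the per-row trick RAND(CHECKSUM(NEWID())), which unlike a bare RAND() is re-evaluated for every row, then scale the weights so the quotas total the 20000 source rows.
Code:
-- A sum of four uniforms gives an approximate bell curve (central limit theorem).
UPDATE SampleTable
SET Weight = RAND(CHECKSUM(NEWID())) + RAND(CHECKSUM(NEWID()))
           + RAND(CHECKSUM(NEWID())) + RAND(CHECKSUM(NEWID()))

DECLARE @total AS float
SELECT @total = SUM(Weight) FROM SampleTable

-- CEILING guarantees every name is used at least once; the total may
-- overshoot 20000 slightly, which should be fine for masking.
UPDATE SampleTable
SET MaxToAssign = CEILING(Weight / @total * 20000)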

John - It is a solution to a 'political' objection; basically, so that uncleared testers can perform various types of testing on migrated data that looks broadly like the real data did but without being able to make the link between what they see and the real data.

It's been an interesting introduction to anonymisation techniques. Thanks all again.
 