
Normal distribution in anonymised data


SimonSellick
Programmer
Nov 3, 2003
Hi,

I have produced a masking process to anonymise personal data in a large database, for testing and similar purposes. The approach I have taken is a many-to-one mapping from source to masked values. I now want to apply a more realistic distribution of masked values, so that some of them appear only a few times while others appear more frequently.

As an example, suppose that I generated 2000 names from 20000 unique source names. Rather than getting roughly 10 instances of each generated name, I would like to get a few with only a single instance, more with 2-5 instances, many with say 20-30 instances, and so on; in other words, frequencies profiled with something like a bell curve.

I am using T-SQL Checksum() to create a 32-bit integer from the name string (or other value), then taking the result modulo the number of masked values available to pick one reasonably randomly but repeatably. What I think I need is a statistical function (preferably) or technique that divides a single interval (the range of a 32-bit integer in this case) into a specified number of low-high interval pairs, such that the pairs could be expected to catch randomly-spaced values in the original range with a statistically normal distribution of frequencies.

Any pointers would be very welcome, as would alternative approaches; I'm not wedded to using Checksum() or selecting via limits.
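
The sort of thing I am imagining, sketched here with made-up table and column names: give each masked value a weight drawn from something roughly normal (a sum of several RAND() calls approximates one, via the central limit theorem), then turn the running totals of those weights into cumulative low-high boundaries across the 32-bit range.
Code:
-- Sketch: divide the 32-bit integer range into 2000 intervals whose widths
-- are roughly normally distributed. Uniformly-spread Checksum() values then
-- land on each masked value with bell-curve frequencies.
create table #intervals (
    maskedid  int identity(1,1) primary key,
    weight    float,          -- roughly normal weight for this masked value
    lowbound  bigint null,    -- start of this value's checksum interval
    highbound bigint null     -- end of this value's checksum interval
)

declare @i as int
set @i = 0

-- A sum of 12 uniform RAND() values approximates N(6, 1).
while @i < 2000
begin
    insert into #intervals (weight)
    values (rand()+rand()+rand()+rand()+rand()+rand()
          + rand()+rand()+rand()+rand()+rand()+rand())
    set @i = @i + 1
end

declare @total as float
select @total = sum(weight) from #intervals

-- Convert cumulative weights into low-high boundaries over the int range.
update i
set lowbound = cast(-2147483648.0
        + (select coalesce(sum(c.weight), 0) from #intervals c where c.maskedid < i.maskedid)
          / @total * 4294967295.0 as bigint),
    highbound = cast(-2147483648.0
        + (select sum(c.weight) from #intervals c where c.maskedid <= i.maskedid)
          / @total * 4294967295.0 as bigint)
from #intervals i
Each source value would then map by a half-open range test (checksum(Name) >= lowbound and checksum(Name) < highbound) instead of by modulo.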
 
Huh? Since you are talking about SQL Server, why not create your "distribution" based on the number of values generated from the unique source, save it in a table, then use it when appropriate? That is, when a generated name is assigned/used, increase its counter, and don't assign/use the name once the maximum is reached.

Good luck.

MCP SQL Server 2000, MCTS SQL Server 2005, MCTS SQL Server 2008 (DBD, DBA)
 
As an alternative, I'm going to suggest what I've used previously in some companies when moving data from production to test/dev environments.

Client names and addresses:
change name from "john smith" to "client xxx"
change address from "timbuctu road" to "addr line 1 xxx"
and "mongolia" to "addr line 2 xxx"
where xxx is the client id of that client

and so on for all char fields that can contain personal data.
For PPSN numbers (or any other type of national id number) it is normally possible to do something similar based on the client id, even if the id field is a char field.
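
In T-SQL that boils down to something like the following (table and column names are illustrative only):
Code:
-- Illustrative only: overwrite personal char fields with values derived
-- from the key, assuming a Clients(ClientId, Name, AddrLine1, AddrLine2) table.
update Clients
set Name      = 'client ' + cast(ClientId as varchar(12)),
    AddrLine1 = 'addr line 1 ' + cast(ClientId as varchar(12)),
    AddrLine2 = 'addr line 2 ' + cast(ClientId as varchar(12))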

Date of birth is a harder one, as it often affects the rest of the application, so I tend to leave that one as is.

I think the above outlines a possible alternative, with the benefit that the masked value clearly identifies the id it relates to.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
I would just assign an integer to each of the 20,000 names, then pull 2000 random numbers between 1 and 20,000. You will get mostly unique values and a few multiples.

Simi
 
Some really interesting ideas there - thank you all.

Bug Slayer's technique is sort-of what I am doing: it is generating the normal distribution within the list of names (your 'maximum') that is tricky.

Unfortunately, I cannot use Frederico's appealing alternative as the data has to look 'real', or at least realistic.

Simi's idea of generating the sample first then applying it to the target range looks good although I'll have to think about how to make it repeatable.
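
(One thought on repeatability, if I've read the documentation right: seeding RAND() once at the start of the batch fixes the whole subsequent sequence for the session, so the same 2000 draws would come out on every run.)
Code:
-- Any fixed seed makes subsequent rand() calls deterministic for this session.
select rand(12345)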

Thanks again for the ideas.
 
-- Create the work table if it does not already exist.
if object_id('tempdb..#test') is null
    create table #test (mynum int)

delete #test

declare @x as int
set @x = 0

declare @mynum as int
set @mynum = 0

-- Draw 2000 random numbers between 1 and 20000.
while @x < 2000
begin
    set @mynum = cast(rand() * 20000 + 1 as int)
    insert into #test (mynum) values (@mynum)
    set @x = @x + 1
end

-- Show how many times each number was drawn.
select mynum, count(mynum) as draws
from #test
group by mynum
order by mynum

Simi
 
Simon, just to make my idea clearer, if your generated table looks like:
Code:
SampleTable(Id, Name, TotAssigned, MaxToAssign)  -- Id is the key
To randomly pick the next name to assign (probably looping while there is a Name with TotAssigned < MaxToAssign) you would do:
Code:
SELECT TOP 1 Id, Name FROM SampleTable WHERE TotAssigned < MaxToAssign ORDER BY NEWID()
Assign it, then update TotAssigned.
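
The surrounding loop might look something like this, where Targets(MaskedName) stands in for whatever table receives the masked names:
Code:
-- Sketch: hand out random names until every quota or every target row is used up.
DECLARE @Id int, @Name varchar(100)

WHILE EXISTS (SELECT 1 FROM SampleTable WHERE TotAssigned < MaxToAssign)
  AND EXISTS (SELECT 1 FROM Targets WHERE MaskedName IS NULL)
BEGIN
    -- Pick a random name that still has quota left.
    SELECT TOP 1 @Id = Id, @Name = Name
    FROM SampleTable
    WHERE TotAssigned < MaxToAssign
    ORDER BY NEWID()

    -- Give it to one unassigned target row.
    UPDATE TOP (1) Targets
    SET MaskedName = @Name
    WHERE MaskedName IS NULL

    -- Record the assignment against the quota.
    UPDATE SampleTable
    SET TotAssigned = TotAssigned + 1
    WHERE Id = @Id
END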

Good luck.

MCP SQL Server 2000, MCTS SQL Server 2005, MCTS SQL Server 2008 (DBD, DBA)
 
Just to ask, why the need to mask actual data with 'realistic' masked data? I get what you are trying to do; I'm just curious as to the why.

I work in a job where releasing Personally Identifiable Information (PII) is a concern, so this thread piqued my interest.

Thanks

John Fuhrman
 
Thanks again for the clarifications and code.

Simi - That cracks it, thank you!

BugSlayer - How to populate MaxToAssign? That is really the hub of my original question, but I was a bit long-winded about it. If I can do that across a sample table with a varying number of entries, the problem is solved. Looking at it this way, the question is really how to divide a bell curve into n equally-spaced samples with values between 0 and 1. I am doing the actual value updates in bulk SQL (not iteratively), so I have to get the frequencies right by pre-calculating.
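
One approach I'm considering (a sketch only, assuming SampleTable gains a float Weight column): give each name a roughly normal weight using the per-row trick RAND(CHECKSUM(NEWID())), which unlike a bare RAND() is re-evaluated for every row, then scale the weights so the quotas total the 20000 source rows.
Code:
-- A sum of four uniforms gives an approximate bell curve (central limit theorem).
UPDATE SampleTable
SET Weight = RAND(CHECKSUM(NEWID())) + RAND(CHECKSUM(NEWID()))
           + RAND(CHECKSUM(NEWID())) + RAND(CHECKSUM(NEWID()))

DECLARE @total AS float
SELECT @total = SUM(Weight) FROM SampleTable

-- CEILING guarantees every name is used at least once; the total may
-- overshoot 20000 slightly, which should be fine for masking.
UPDATE SampleTable
SET MaxToAssign = CEILING(Weight / @total * 20000)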

John - It is a solution to a 'political' objection; basically, so that uncleared testers can perform various types of testing on migrated data that looks broadly like the real data did but without being able to make the link between what they see and the real data.

It's been an interesting introduction to anonymisation techniques. Thanks all again.
 