SimonSellick
Programmer
Hi,
I have produced a masking process to anonymise personal data in a large database, for testing etc. The approach I have taken is a many-to-one mapping from source values to masked values. I now want to apply a more realistic distribution of masked values, so that some of them appear only a few times and others appear much more often.
As an example, suppose that I generate 2000 names from 20000 unique source names. Rather than getting roughly 10 instances of each generated name, I would like to get a few with only a single instance, more with 2-5 instances, many with say 20-30 instances, etc., so that the counts are profiled with something like a bell curve.
I am using T-SQL Checksum() to create a 32-bit integer from the name string (or other value), then taking the result modulo the number of masked values available to pick one reasonably randomly but repeatably. What I think I need is a statistical function (preferably) or technique that divides a single interval (the range of a 32-bit integer in this case) into a specified number of low-high interval pairs of unequal width. The widths should be chosen so that values spaced randomly across the original range fall into the intervals with something like a statistically normal distribution.
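To make the interval idea concrete, here is a minimal sketch in Python rather than T-SQL. It is only an illustration under some assumptions: zlib.crc32 stands in for Checksum() (which returns a signed int, so in T-SQL the value would first need shifting into a non-negative range), and bell_weights, interval_bounds, pick_index and the spread parameter are hypothetical names invented for this example.

```python
import bisect
import math
import zlib

HASH_RANGE = 2 ** 32  # zlib.crc32 yields an unsigned 32-bit value


def bell_weights(n, spread=3.0):
    """Relative weights for n slots, sampled from a normal curve.

    Hypothetical helper: the slots cover +/- `spread` standard deviations,
    so the middle slots get the largest weights and the edge slots the smallest.
    """
    mid = (n - 1) / 2
    sigma = n / (2 * spread)
    return [math.exp(-0.5 * ((i - mid) / sigma) ** 2) for i in range(n)]


def interval_bounds(weights, total=HASH_RANGE):
    """Exclusive upper bounds splitting [0, total) in proportion to weights."""
    scale = total / sum(weights)
    bounds, running = [], 0.0
    for w in weights:
        running += w * scale
        bounds.append(int(running))
    bounds[-1] = total  # guard against floating-point rounding at the top end
    return bounds


def pick_index(value, bounds):
    """Repeatably map a source value to the index of a masked value."""
    h = zlib.crc32(value.encode("utf-8"))  # stand-in for Checksum()
    # First interval whose upper bound is strictly greater than the hash.
    return bisect.bisect_right(bounds, h)


# Usage: 2000 masked names, selected by hash interval instead of modulo.
bounds = interval_bounds(bell_weights(2000))
masked_index = pick_index("Sellick", bounds)
```

With 2000 masked values, 20000 source values and spread=3.0, the widest intervals should catch roughly 20-odd source values each and the narrowest fewer than one on average, assuming the hash spreads values evenly. In a database the bounds could be stored as a lookup table of low-high pairs and joined on the hash value.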
Any pointers would be very welcome, as would alternative approaches; I'm not stuck on using Checksum() or on selecting via limits.