Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

CHECKSUM mistery 3

Status
Not open for further replies.

TheBugSlayer

Programmer
Sep 22, 2002
887
US
Hello all.

I am trying to understand why CHECKSUM is yielding
1)a different result for the same input values based on whether they are coming from a view or from a table
2)the same value for different input values from the same function.

Is the source of the input values factor into the CHECKSUM formula (view, table, etc)?
Does CHECKSUM makes some assumptions depending on whether one input parameter is empty(I convert to '' when NULL)?

My input, whether from the view, which feeds the table eventually, or from the table are:
Last Name, First Name, Number
COOK, BRENT,
BRAY, DAVID,
KHAN, AMY,
KHAN, ALI,


The CHECKSUM(LastName, FirstName, Number) function returns

2047633
2047633
9523382
9523382

respectively when calculated from the view that populates the table.
Why do COOK-BRAY and KHAN-KHAN have the same hash (2047633 and 9523382)?

The same function applied to the table populated from the view returns

2021169596
-394774451
1148953778
75211809

which is more correct in my opinion as all the values are different...I don't understand.

Your help is appreciated.


MCP SQL Server 2000, MCTS SQL Server 2005, MCTS SQL Server 2008 (DBD, DBA), Training HOTT cert SQL Server 2012 Business Intelligence (SSIS, SSAS).
 
You leave out the number and I can only recompute your below checksum values via select Checksum('COOK','BRENT'), Checksum('BRAY','DAVID'), Checksum('KHAN','AMY'), Checksum('KHAN','ALI')

If I take for granted what you say, you do compute two different checksums, once names only and once names+number.

CHECKSUM has the smallest range of values checksum functions of SQL Server offer and can have such "collisions". The more data you have, the surer you find two records having same checksum while being different. You can make this more secure by finally checking some single field being equal, which would fail in case of Last Name in your example, but you also could take much better checksums. What the checksum algorithm guarantees is, that one change of a letter or one swap of the adjacent letter generates a different checksum. Nothing more. Other hashing algorithms are guaranteeing less overlap, though all of them aggregate however long text or byte array to some constrant length number and if that would be collisionless you'd have found a holy grail of unbound compression rates of data.

Books Online said:
However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

Even more advanced hashes can never guarantee to be different for all data, only the whole data could guarantee that. But the guarantee to be totally different with just a slight bit flip of data and they guarantee to be so "random", that you can't just make changes as you like (for example embedding a virus in an executable), and then add some extra bytes to achieve the same hash. So you can't enforce a certain hash value, if you'd like to. You have to test hughe amounts of variations to get there and then still can detect the fraud with the file, if several hashes are computed and legnth changes, of course.

So one way to get magnitudes better results is computing two hashes with two different hash algorithms.

Bye, Olaf.
 
Thank you Federico and Olaf.

This is borderline ridiculous. If the input values look very different, the probability of their hash colliding should be very low! Olaf, in this case the numerical value is the same-coincidentally empty. Substitute with any numerical value and you get the same hash if you use the same for both pairs or colliding inputs; use a different value for each, they yield a different hash.

The reason I am using a hash key is just to save space, not necessary for indexing. I have a one-to-many relationship relationship between Provider and Practice. Providers repeated in the source system. I insert only one in the Provider table, with a column that's the hash key. Provider-Practice table contains ALL providers, so there are "duplicates". This table has the hash key of the one provider in the Provider table they relate to. Now I have no confidence in the CHECKSUM function because you can clearly see where the problem lies: a query to retrieve all practices a provider works for might return providers that have nothing to do with him. I have no choice but to change the columns data type and use HASHBYTES.

Now that you have a better context maybe you can suggest a better solution...

Thank you again!

MCP SQL Server 2000, MCTS SQL Server 2005, MCTS SQL Server 2008 (DBD, DBA), Training HOTT cert SQL Server 2012 Business Intelligence (SSIS, SSAS).
 
The CheckSum() is that way. It's using the "ridiculous" CRC32.
No Hash function will give you the 100% confidence about the comnbined key you want to replace with the hash, other hash function just will lower the chances of collisions.

You can define foreign key constraints on multiple fields, if you need that.

Bye, Olaf.
 
using MD5 will give you less chance of a collision (or, any of MD2 | MD4 | MD5 | SHA | SHA1 | SHA2_256 | SHA2_512 )

the hashbytes function takes an algorithm, and a single variable.

e.g.

select hashbytes('MD5','madonnac') : 0xD7CA39B7F7F21CB229DC3DA72E17C064
select hashbytes('MD5','COOK, BRENT') : 0x04268EF3BFC8F6D01006AE18874BED02
select hashbytes('MD5','COOK, BRENS') : 0x5D0DA79F689CEA4C438712490DB2EF7F
select hashbytes('MD5','COOK, BRENT ') : 0x9A51DEC8F9322280405638CF467319DB

similar names do not appear to create collisions, and you can reduce the chances further by using a longer hash:

select hashbytes('SHA2_512','madonnac') : 0xD79F915B06D2B19443E0B5A1F8DFC8AEE6FD459F40B4E88EFBA7522238A8A0E3FCF6DDC420910F7E9A66A72AC31B9167C161D85E47542DB931453CCF24646A6D
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top