The simplest checksum algorithm...

Status
Not open for further replies.

OrthoDocSoft

Programmer
May 7, 2004
291
US
Dear folks,

I am trying to get used to using "checksum" to determine if my data transfers to my database are correct and complete. Would someone please write the very, very, very simplest algorithm to illustrate how this is done?

Let's say we're going to write "Hello" to a record in a database. How would you "checksum" that?

Thanks,

Ortho

[lookaround] "you cain't fix 'stupid'...
 
>very, very, very simplest algorithm

Try the HashData function. It returns a hash of variable length (up to 256 bytes), which you can use as a checksum.

See the following sample function, which hashes a string.
___
Private Declare Function HashData Lib "shlwapi" (pbData As Any, ByVal cbData As Long, pbHash As Any, ByVal cbHash As Long) As Long
Private Function HashString(Text As String) As Long
    ' Len(HashString) is 4 (the size of a Long), so a 4-byte hash is requested
    HashData ByVal Text, Len(Text), HashString, Len(HashString)
End Function
___

The function returns a long integer (4-byte hash), and HashString("Hello") returns -1235047813 (&HB662AA7B).

You can use this function to hash any kind of data, including arrays.
 
Use a common hash algorithm like MD5; a VB6 module to include in your project can be found here.
Run the hash on the original data and on the new data. If the hashes match, your data is verified. There is no need to write your own hash method: MD5 runs very quickly and is designed for any type and any size of data.


Creator of - Game Reviews, Game Lists, and much more!
 
barryna,

In essence is this not the solution that Hypetia has already provided?

I concur that using one of the public hashing algorithms is probably a better idea. (In my opinion Microsoft's HashData function is flawed, since the algorithm is, to the best of my knowledge, not publicly documented and thus remains untested.)

There are, however, easier ways of generating an MD5 hash digest in VB than the linked article. We have covered it a number of times in this forum, but in summary we can do:

Code:
Public Function MD5HashDigest(strSource As String) As String

    With CreateObject("CAPICOM.HashedData")
        .Algorithm = 3 'CAPICOM_HASH_ALGORITHM_MD5
        .Hash StrConv(strSource, vbFromUnicode)
        MD5HashDigest = .Value
    End With

End Function
 
It depends on how secure you want to be.
The most basic way of deriving a checksum using only simple VB code is to add up the ASCII values of all the characters.
To make the checksum easier to handle, convert the sum to hex and use, say, the last 4 hex digits as the final checksum.
The chance of a corrupted bit producing the same value is pretty remote.
Something like this for illustration:
Code:
CheckSum=0
for a=1 to len(MyData)
   CheckSum=CheckSum + asc(Mid(MyData,a,1))
next
CheckSum=Right("0000" & Hex(CheckSum),4)
"Hello" gives a value of 01F4. Of course, so will "Hdmlo", but it would be bad luck if one byte increased and an adjacent byte decreased by the same amount.
The longer the data string, the better it would be.

Then, when you want to verify the data, compute the checksum of the data again and compare it with the original checksum, which you would have to store in another column.
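As a rough sketch of that verification step, the summing loop can be wrapped in a function and its result compared with the stored value. The names HexChecksum and DataIsIntact are made up for illustration, and the sketch assumes a single-byte code page so that Asc returns values from 0 to 255.

```vb
' Hypothetical helper: sums the character codes and returns the last
' four hex digits, left-padded with zeros.
Public Function HexChecksum(ByVal MyData As String) As String
    Dim a As Long
    Dim Total As Long
    For a = 1 To Len(MyData)
        ' Mask to 16 bits on every step so Total can never overflow a Long;
        ' this keeps exactly the digits that Right$(..., 4) would keep anyway.
        Total = (Total + Asc(Mid$(MyData, a, 1))) And &HFFFF&
    Next a
    HexChecksum = Right$("0000" & Hex$(Total), 4)
End Function

' Hypothetical helper: recomputes the checksum and compares it with the
' value that was stored alongside the data.
Public Function DataIsIntact(ByVal MyData As String, ByVal StoredChecksum As String) As Boolean
    DataIsIntact = (HexChecksum(MyData) = StoredChecksum)
End Function
```

HexChecksum("Hello") returns "01F4", so DataIsIntact("Hello", "01F4") is True. DataIsIntact("Hdmlo", "01F4") is also True, which is the weakness mentioned above.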

Of course if the string was huge like a complete novel or a picture, it would take a fair amount of time to add up all the letters.

A more involved and more robust method, still using only VB6 code, is the polynomial method (a CRC). It is well suited to data transmission, where the result is usually tacked onto the end of the string when it is sent.

Code:
Function CRC_Calc(CRC_Message As String) As String
'Calculates a 16 bit checksum of a String of any length.
    Dim Polynomial16 As Long
    Dim Y As Integer
    Dim Char_Text_DEC As Long
    Dim X As Integer
    Y = 1
    X = 0
    CRC_value = 0
    Polynomial16 = 33800 'Polynomial &H8408
        For Y = 1 To Len(CRC_Message)
            CRC_Text_Single = Mid$(CRC_Message, Y, 1)
            Char_Text_DEC = Asc(CRC_Text_Single)
            For X = 1 To 8
                LSB_CRC = CRC_value And &H1
                LSB_Char = Char_Text_DEC And &H1
                If LSB_CRC = 1 And LSB_Char = 1 Or LSB_CRC = 0 And LSB_Char = 0 Then
                    CRC_value = Fix(CRC_value / 2)
                    Char_Text_DEC = Fix(Char_Text_DEC / 2)
                ElseIf LSB_CRC = 0 And LSB_Char = 1 Or LSB_CRC = 1 And LSB_Char = 0 Then
                    CRC_value = Fix(CRC_value / 2)
                    Char_Text_DEC = Fix(Char_Text_DEC / 2)
                    CRC_value = Polynomial16 Xor CRC_value
                Else
                End If
            Next X
        Next Y
        If Len(Hex(CRC_value)) = 4 Then
            CRC_LO = Mid$(Hex(CRC_value), 3, 2)
            CRC_HI = Mid$(Hex(CRC_value), 1, 2)
        ElseIf Len(Hex(CRC_value)) = 3 Then
            CRC_STRING = "0" & Hex(CRC_value)
            CRC_LO = Mid$(CRC_STRING, 3, 2)
            CRC_HI = Mid$(CRC_STRING, 1, 2)
        ElseIf Len(Hex(CRC_value)) = 2 Then
            CRC_STRING = "00" & Hex(CRC_value)
            CRC_LO = Mid$(CRC_STRING, 3, 2)
            CRC_HI = Mid$(CRC_STRING, 1, 2)
        ElseIf Len(Hex(CRC_value)) = 1 Then
            CRC_STRING = "000" & Hex(CRC_value)
            CRC_LO = Mid$(CRC_STRING, 3, 2)
            CRC_HI = Mid$(CRC_STRING, 1, 2)
        ElseIf Len(Hex(CRC_value)) = 0 Then
            CRC_LO = "00"
            CRC_HI = "00"
        End If
    CRC_Calc = Chr("&h" & CRC_LO) & Chr("&h" & CRC_HI)
    'CRC_Calc = Chr(CRC_value Mod 256) & Chr(CRC_value \ 256) 'ALTERNATIVE METHOD
    
End Function
 
What makes anyone think they need checksums for this in the first place?

If main memory is that unreliable you're screwed before you start, checksums or not. Disk I/O already has integrity checks. IP has integrity checks. USB has integrity checks. DBMSs have integrity checks.

If you are having data corruption problems it is far more likely you simply have bad code screwing it up or have failed to manage concurrency properly... or you have had catastrophic failures during write operations. The latter will be detected at the DBMS level before you're ever going to find it by in-band data checksums.

Something seems very wrong if you are asking for something like this.


CRCs are pretty darned weak compared to something like an MD5 hash, and I don't think I've ever seen such a slow and convoluted way of calculating one before the code above. What's with all those undeclared variables, and those string operations, and worse yet Variant-returning operations, etc. anyway?

Or goofy things like:
Code:
ElseIf Len(Hex(CRC_value)) = 0 Then
... paths that will never be taken.
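
For comparison, the same bit-by-bit calculation can be written far more compactly. What follows is only a sketch under stated assumptions, not the original author's code: Crc16 is a made-up name, it keeps the &H8408 polynomial and the zero starting value from the function above, and it converts the string to ANSI bytes, which matches Asc only on single-byte code pages.

```vb
' Hypothetical compact rewrite of the CRC loop, using typed variables
' and integer arithmetic throughout (no string or Variant operations).
Public Function Crc16(ByVal Message As String) As Long
    Const POLY As Long = &H8408&        ' same polynomial as above (33800)
    Dim Bytes() As Byte
    Dim Crc As Long
    Dim i As Long
    Dim BitIndex As Long

    If Len(Message) = 0 Then Exit Function   ' empty input gives 0

    ' Assumption: ANSI conversion, to mirror the Asc() calls in the original
    Bytes = StrConv(Message, vbFromUnicode)

    For i = 0 To UBound(Bytes)
        Crc = Crc Xor Bytes(i)               ' fold the next byte into the low bits
        For BitIndex = 1 To 8
            If (Crc And 1&) = 1 Then
                Crc = (Crc \ 2) Xor POLY     ' shift right, then apply the polynomial
            Else
                Crc = Crc \ 2                ' shift right only
            End If
        Next BitIndex
    Next i

    Crc16 = Crc                              ' always in the range 0 to 65535
End Function
```

These parameters appear to correspond to the variant usually catalogued as CRC-16/KERMIT, whose published check value for "123456789" is &H2189, which gives an easy way to test the routine. The two-character form returned by the original can be rebuilt with Chr$(Crc16(s) Mod 256) & Chr$(Crc16(s) \ 256).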
 
The question at hand is not meant to be scrutinized. We don't know the extent of what he may be trying to accomplish. I do know from past experience if you are working with sensitive data, you want to make sure the process you ran worked as desired. A checksum or hash method can check that, instead of a human reviewing the end result for correctness. It doesn't mean there are hardware issues.


 
Yes it is goofy (it was written in the last century by persons unknown - who knows, it might even have been strongm!)
I didn't intend anyone to ever use either of my examples because I thought the original question was more about understanding how simple checksums could be generated rather than have an efficient code snippet to use.

Quote "the very, very, very simplest algorithm to illustrate how this is done?"

 
Well the simplest would be to simply sum the bytes (or characters, since we're dealing with 16-bit Unicode), possibly adding a modular operation to handle "wrap around" values as the sum grows large.

I'd avoid ANSI conversions, they're unfaithful crossing locales by nature. Such a "checksum" calculated using Locale A gives potentially different results when done in Locale B given the same inputs unless you luck out and only the 7-bit ASCII subset was used in the String.
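
A minimal sketch of that idea, summing the 16-bit character codes directly and wrapping at 65536. SimpleChecksum is a made-up name, and the 16-bit modulus is just one reasonable choice for handling the wrap-around.

```vb
' Hypothetical helper: adds up the UTF-16 code units of a string,
' wrapping at 65536 so the result always fits in 16 bits.
' No ANSI conversion is involved, so the result does not depend on locale.
Public Function SimpleChecksum(ByVal Text As String) As Long
    Dim i As Long
    Dim Total As Long
    For i = 1 To Len(Text)
        ' AscW returns a signed Integer; masking with &HFFFF& maps it to 0..65535
        Total = (Total + (AscW(Mid$(Text, i, 1)) And &HFFFF&)) Mod 65536
    Next i
    SimpleChecksum = Total
End Function
```

SimpleChecksum("Hello") returns 500 (&H1F4).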

>The question at hand is not meant to be scrutinized.

Sure it is. Poor practice is poor practice. Use of any kind of "checksum" here is redundant and adds nothing in the way of "security" (or integrity, which is what you probably meant).

About the only places you'd use them would be for things like serial data links with no error checking protocol, or hand-entry of something like account numbers containing check digits.
 
I thought you would have been into this one by now, so I was just checking if you were awake!

I think the "ElseIf Len(Hex(CRC_value)) = 0 Then" is there to pad out the "hex" string with 00, so you always return a 4-digit hex checksum string even if the data is blank.
 
Wow! That prompted a debate....

What I'm REALLY trying to do is to prove within my code that the data I THINK I wrote to a record is the data that I DID write.

What I'm actually doing is:

1) writing the data.
2) reading back the data I just wrote.
3) comparing the read-back to the original data sent.
4) if NOT the same, then I try to FIND the record by its "minimum ID". If I find it, I UPDATE that record and go back to number 2) above and re-confirm.
5) if I can't FIND the record by its "minimum ID" (10 tries, mind you), I REWRITE the whole record and go back to 2.
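
Steps 1 to 3 amount to a write, read-back and compare loop, which might be sketched roughly like this. WriteRecord and ReadRecord are hypothetical stand-ins (here they just use a module-level string in place of the database), and the find/update logic of steps 4 and 5 is left out.

```vb
Option Explicit

' Stand-in for the database record (assumption, for illustration only).
Private mStored As String

' Hypothetical write routine; a real one would raise an error if the link dropped.
Private Sub WriteRecord(ByVal Data As String)
    mStored = Data
End Sub

' Hypothetical read routine.
Private Function ReadRecord() As String
    ReadRecord = mStored
End Function

' Writes Data, reads it back and compares, retrying up to MaxTries times.
' Returns True as soon as one read-back matches what was sent.
Public Function WriteVerified(ByVal Data As String, _
                              Optional ByVal MaxTries As Long = 10) As Boolean
    Dim Attempt As Long
    Dim ReadBack As String

    On Error Resume Next                 ' treat a failed call as a failed attempt
    For Attempt = 1 To MaxTries
        Err.Clear
        WriteRecord Data                 ' step 1: write
        If Err.Number = 0 Then
            ReadBack = ReadRecord()      ' step 2: read back
            If Err.Number = 0 And ReadBack = Data Then   ' step 3: compare
                WriteVerified = True
                Exit Function
            End If
        End If
    Next Attempt
    ' Falls through with WriteVerified = False after MaxTries failed attempts
End Function
```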

This "works", but I'm sure there is a more elegant way to accomplish this.

The MAIN reason I do this is because I am writing data through wireless networks that fail, temporarily, from time to time. Usually just for seconds....

What do you all do?

Thanks,

Ortho.





 
So you are verifying the wireless part of it rather than the total storage of data.
Echoing back a simple checksum over just the wireless part does avoid having to resend back the whole data for verification.

Or: I had a similar problem with a dodgy network, so I put all the data into a "PropertyBag" and decoded it using a stream.
If you are happy your database will store OK, this has an advantage that the received data has to be complete before you even save it in a record and gives an error if not. You don't have to generate or save the checksum. You can then try to send a number of times until it is OK before being saved.

 
I proposed the HashData solution because of the "very, very, very simplest algorithm" requirement.

Although the algorithm is not documented (at least I am not aware of it), I don't mind using it for computing a simple one-way checksum for data integrity check. We are not using it for encryption, when the knowledge of actual algorithm and proof of its reliability is of acute importance.

At least, it is better and faster than adding or XORing byte values together in a loop.

As far as simplicity is concerned, strongm's MD5HashDigest function also looks a treat. Unfortunately, on my computer, running Win7 Pro 64-bit SP1, it throws error 429, "ActiveX component can't create object". I don't know which library is required for creating CAPICOM.HashedData object.

Whichever method you use, you should write your code so that the checksum of the data written to the database is computed by the computer hosting the database. You can accomplish this by wrapping your hash function in an ActiveX component and instantiating it remotely on the target computer. The data written to the database will then be available to the hash function locally (on the same machine), and the checksum will be computed faster.

There is no point in computing the checksum on your own computer (the one attempting to write the data), because in that case the whole of the data has to be read back over the network again. If you do that, then computing the checksum is useless; you can simply compare the written data to the original data, as you are currently doing.

Hope you understand the point.
 
Hypetia, under W7:
Code:
' W7: Add a reference to CAPICOM V 2.1 type library
Public Function MD5HashDigest(strSource As String) As String

    With New CAPICOM.HashedData
        .Algorithm = CAPICOM_HASH_ALGORITHM_MD5
        .Hash StrConv(strSource, vbFromUnicode)
        MD5HashDigest = .Value
    End With

End Function

>proof of its reliability is of acute importance.

Proof that it actually captures bit errors would be important in this case. And we don't have that evidence. Now, whilst Microsoft have a history of developing home-grown encryption that is somewhat flawed, I'm not saying it is a poor algorithm - we don't know that either - what I'm saying is that, given we have well-known alternatives that have been analysed to death we might prefer to use them when appropriate.
 
Well, further reading about MD5 reveals that it also has severe security vulnerabilities and is susceptible to collision attacks.

Although the hash can easily be compromised (as they say, "it is easy to generate MD5 collisions"), it is still widely used for data integrity checks and also for storing passwords.
 
The difference is we know the weaknesses of MD5. And with MD5 we can say with a very high level of confidence that a hash produced from a set of data is not going to be the same as a hash of that data with a few bits changed, where those bits have changed by chance (say due to transmission errors, which is what we are looking at here). If I wanted a cryptographically secure hash, however, I wouldn't recommend it.

And if you are running on W2K3 or later, and a high level of secure hashing is required, we can easily bump up to SHA-2:

Code:
Public Function HashDigest(strSource As String) As String

    With New CAPICOM.HashedData
        .Algorithm = CAPICOM_HASH_ALGORITHM_SHA_256 'or 384 or 512
        .Hash StrConv(strSource, vbFromUnicode)
        MD5HashDigest = .Value
    End With

End Function


 
>CAPICOM_HASH_ALGORITHM_SHA_256

Fair enough. You might want to change

MD5HashDigest = .Value
to
HashDigest = .Value
 