## CHECKSUM weakness explained

The built-in CHECKUM function in SQL Server is built on a series of 4 bit left rotational xor operations. See here in a previous forum post http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832 for more explanation.

Today, I wanted to see how often a collision (or false positive) can occur.

Let's take a very simple CHECKSUM value, for example 123. Decimal 123 is "01111011" in binary representation.

Since CHECKSUM function rotates the iterative checksum value 4 bits to the left (same thing as multiplying by 16), how many permutations of two characters returns the same CHECKSUM value of 123? The answer is 16 permutations.

Let's investigate by writing down the solution of this two character replacement.

abcd 1011 Second character

0000 aBCD First character

--------------

0000 0111 1011

Before we do anything more, let's refresh our mind on how XOR works. 0 and 0 returns 0, 0 and 1 returns 1, 1 and 0 return 1 and finally 1 and 1 return 0.

_ _|_0_|_1_|

_0_|_0_|_1_|

_1_|_1_|_0_|

Why did I then write the mysterious abcd and aBCD in the formula above?

Because the resulting bit sequence should be 0111, and only the same bits xor'ed gives 0. And the reason to use lower and upper characters for remaining bits are also due to xor. A result of 1 means the two bits has to be the opposite of each other; 0 xor 1 gives 1 and 1 xor 0 also gives 1 as result.

So "a" can have 2 possible values, and so can "b", "c", and "d" have. It gives 16 permutations in total.

See here

Bitseq1 Bitseq2 xor

0000 0111 0111

0001 0110 0111

0010 0101 0111

0011 0100 0111

0100 0011 0111

0101 0010 0111

0110 0001 0111

0111 0000 0111

1000 1111 0111

1001 1110 0111

1010 1101 0111

1011 1100 0111

1100 1011 0111

1101 1010 0111

1110 1001 0111

1111 1000 0111

This table above can be applied for any number of character for the first or last character.

Can this be applied onto a 3 character replacement? Of course, and in the same way. Still using the original decimal value of 123, the used characters are these

abcd 1011 Third character

xxxx aBCD Second character

0000 xxxx First character

-------------------

0000 0000 0111 1011

As you now can see, there are 16 times 16 possible (256) permutations giving the same CHECKSUM value!

And for four characters, the map looks like this

abcd 1011 Fourth character

yyyy aBCD Third character

xxxx yyyy Second character

0000 xxxx First character

------------------------

0000 0000 0000 0111 1011

As you now can see, there are 16 times 16 times 16 permutations (4096) giving the same CHECKSUM value

This means I can with any two characters create same original checksum value 16 times. With any three characters (with a total of 65536 permuations), I can create 256 combinations resulting in the same decimal 123 checksum value.

Let's now see how the likelyness of getting a collision (or a false positive) is.

Character 256^(c-1) 16^(c-1)

replacment Permutations Collisions

2 256 16

3 65536 256

4 16777216 4096

5 4294967296 65536

6 1099511627776 1048576

7 281474976710656 16777216

From the table above, we can deduct that the number of collisions are the square root of the number possible permutations. This can be interpreted as "if you double the number of records, and calculate a checksum value, the likelyness of getting a collision (false positive) is quadrupled."

Triple the number of records and you have 9 times more chance of getting a collision (false positive).

The reason I don't give more than 7 character example is that for the eight character, the leftmost 4 bits are rotated and xored to the lower part of the iterative CHECKSUM value. It doesn't matter really, because then the flushed bits must match the final "1011" in this example, so the other part still has 16 permutations.