Peter Larsson Blog

Patron Saint of Lost Yaks

Excerpt from The Compression Session

This Tuesday I had the opportunity to meet Kalen Delaney and hear her talk about the new compression algorithms in SQL Server 2008.
For those of you who have never met Kalen, I can tell you she is a sweet lady, knowledgeable and interesting to listen to.

With SQL Server 2005 SP2, the new vardecimal storage format arrived, along with a few stored procedures to estimate the potential savings. This turned out to be a subset of the ROW and PAGE compression available in SQL Server 2008, Enterprise Edition. You can choose ROW compression alone if you want, but if you choose PAGE compression you also get ROW compression.
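If you want to try this yourself, SQL Server 2008 ships a stored procedure that estimates the savings before you rebuild anything. A minimal sketch, assuming an existing example table dbo.Orders:

-- Estimate how much space PAGE compression would save on dbo.Orders
-- (dbo.Orders is just an assumed example table)
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'Orders',
    @index_id         = NULL,    -- NULL = all indexes
    @partition_number = NULL,    -- NULL = all partitions
    @data_compression = 'PAGE';  -- or 'ROW' / 'NONE'

-- Enable ROW compression alone...
ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = ROW);

-- ...or PAGE compression, which includes the ROW step:
ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = PAGE);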

Example: Consider these records on a page (sizes are shown without the overhead for storing varchar data).

           Col1      Col2
Record1    ABCD      CHOCO   4+5 bytes =  9 bytes
Record2    ABCRD     CHALK   5+5 bytes = 10 bytes
Record3    ABCDE     CHORD   5+5 bytes = 10 bytes
Record4    DE        CHOOSE  2+6 bytes =  8 bytes
Record5    QWERTY    DE      6+2 bytes =  8 bytes

                                       = 45 bytes

What happens first is that every page gets a new type of record, an anchor record. For each column, the anchor record holds the value that gives the best common prefix across the records on the page.

          Col1      Col2
AncRec    ABCDE     CHO      5+3 bytes =  8 bytes 

Record1   4         3CO      1+3 bytes =  4 bytes
Record2   3RD       2ALK     3+4 bytes =  7 bytes
Record3   5         3RD      1+3 bytes =  4 bytes
Record4   0DE       3OSE     3+4 bytes =  7 bytes
Record5   0QWERTY   0DE      7+3 bytes = 10 bytes
                                       = 40 bytes

As you can see, the prefix compression step of PAGE compression looks at the complete page, finds the longest common prefix for each column, and puts it in an anchor record. Then each column of every record begins with a byte that says how many characters of the anchor prefix match. With a better example I could have shown that the savings can be as much as 80 or 90 percent.
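To make the arithmetic concrete, here is a small T-SQL sketch that reproduces the 40-byte figure. This is not SQL Server's internal code; the temp table #Page and the helper dbo.CommonPrefixLen are made up for illustration.

-- Helper: how many leading characters do two strings share?
CREATE FUNCTION dbo.CommonPrefixLen(@a varchar(100), @b varchar(100))
RETURNS int
AS
BEGIN
    DECLARE @i int = 0;
    WHILE @i < LEN(@a) AND @i < LEN(@b)
          AND SUBSTRING(@a, @i + 1, 1) = SUBSTRING(@b, @i + 1, 1)
        SET @i += 1;
    RETURN @i;
END;
GO

CREATE TABLE #Page (RecId int, Col1 varchar(10), Col2 varchar(10));
INSERT INTO #Page VALUES
    (1, 'ABCD',   'CHOCO'),
    (2, 'ABCRD',  'CHALK'),
    (3, 'ABCDE',  'CHORD'),
    (4, 'DE',     'CHOOSE'),
    (5, 'QWERTY', 'DE');

-- Per column: 1 byte for the match count + the characters beyond the match.
-- The anchor record (ABCDE, CHO) itself costs 5 + 3 = 8 bytes.
SELECT 8 + SUM( 1 + LEN(Col1) - dbo.CommonPrefixLen(Col1, 'ABCDE')
              + 1 + LEN(Col2) - dbo.CommonPrefixLen(Col2, 'CHO') )
       AS BytesAfterPrefixCompression  -- returns 40
FROM #Page;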

What happens next? SQL Server 2008 then tries to apply a dictionary algorithm. It looks at all the new values for the records in the page and builds a dictionary (currently limited to roughly 300 items); a small sketch of this step follows the table below.

Dict    D1 = 0DE, D2 = 3RD

          Col1      Col2
AncRec    ABCDE     CHO      5+3 bytes =  8 bytes 

Record1   4         3CO      1+3 bytes =  4 bytes
Record2   D2        2ALK     1+4 bytes =  5 bytes
Record3   5         D2       1+1 bytes =  2 bytes
Record4   D1        3OSE     1+4 bytes =  5 bytes
Record5   0QWERTY   D1       7+1 bytes =  8 bytes
                                       = 32 bytes
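Continuing the same sketch, the dictionary candidates fall out if you collect the prefix-compressed remainders and keep those that occur more than once:

-- Collect the remainders ('4', '3RD', '0DE', ...) from both columns and
-- keep the values that repeat; those become dictionary entries
WITH Remainders AS (
    SELECT CAST(dbo.CommonPrefixLen(Col1, 'ABCDE') AS varchar(10))
           + SUBSTRING(Col1, dbo.CommonPrefixLen(Col1, 'ABCDE') + 1, 10) AS Token
    FROM #Page
    UNION ALL
    SELECT CAST(dbo.CommonPrefixLen(Col2, 'CHO') AS varchar(10))
           + SUBSTRING(Col2, dbo.CommonPrefixLen(Col2, 'CHO') + 1, 10)
    FROM #Page
)
SELECT Token, COUNT(*) AS Occurrences  -- returns 0DE and 3RD, twice each
FROM Remainders
GROUP BY Token
HAVING COUNT(*) > 1;

Replacing each of the four occurrences of 0DE and 3RD (3 bytes apiece) with a 1-byte dictionary token saves 8 bytes, which is exactly the step from 40 down to 32 bytes.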

As you can see, the space has gone from 45 bytes down to 32 bytes, almost 30 percent savings! The tradeoff, of course, is a small overhead of CPU usage to compress and decompress the data. The pages stay compressed in the buffer pool as well; rows are decompressed only when they are actually read.
There is also a threshold. Pages are always compressed and then evaluated; if the savings are not big enough, the page stays uncompressed, even if you have chosen compression.
And the compression only occurs when a page is full. When a page split occurs, the new page inherits the same anchor record.

These are the essentials of Kalen's Compression Session, a new chapter in her next book!

The new BACKUP compression then applies a kind of ZIP compression, which also deals with suffixes, so the savings on backups can be even greater!
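Backup compression is just an option on the BACKUP statement. A minimal example, assuming a database named Sales and a local backup path:

-- Compress the backup as it is written (SQL Server 2008 Enterprise)
BACKUP DATABASE Sales
TO DISK = N'C:\Backup\Sales.bak'
WITH COMPRESSION;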

Legacy Comments


Casual Observer
2008-10-09
re: Excerpt from The Compression Session
Nice explanation. Thank you :)

Saggi Neumann
2008-10-13
re: Excerpt from The Compression Session
Hey mate,

"As you can see, the ROW compression step looks at the complete page" - did you mean prefix compress (the 2nd phase in page compression)?

Cheers,
S. Neumann