Excerpt from The Compression Session
This tuesday I had the opportunity to meet Kalen Delaney and hear her talk about the new compression algorithms in SQL Server 2008.
For those of you that never have met Kalen, I can tell she is a sweet lady, knowledgable and interesting to listen to.
With SQL Server 2005 SP2, the new VARDECIMAL datatype arrived, with a few stored procedures to calculate eventual savings. This turned out to be a subset for the ROW and PAGE compressions available in SQL Server 2008, Enterprise Edition. You can have ROW compression only if you want, but if you choose PAGE compression you also get ROW compression.
Example: See this PAGE records (sizes are shown without the overhead for storing varchar data).
Col1 Col2
Record1 ABCD CHOCO 4+5 bytes = 9 bytes
Record2 ABCRD CHALK 5+5 bytes = 10 bytes
Record3 ABCDE CHORD 5+5 bytes = 10 bytes
Record4 DE CHOOSE 2+6 bytes = 8 bytes
Record5 QWERTY DE 5+2 bytes = 7 bytes
= 45 bytes
What happens first is that every page gets a new type of record, an anchor record. The anchor record is the longest most frequent data for every column.
Col1 Col2
AncRec ABCDE CHO 5+3 bytes = 8 bytes
Record1 4 3CO 1+3 bytes = 4 bytes
Record2 3RD 2ALK 3+4 bytes = 7 bytes
Record3 5 3RD 1+3 bytes = 4 bytes
Record4 0DE 3OSE 3+4 bytes = 7 bytes
Record5 0QWERTY 0DE 7+3 bytes = 10 bytes
= 40 bytes
As you can see, the ROW compression step looks at the complete page and looks for the longest common data and puts that in an anchor record. Then the first byte of every record and columns get a value for how many characters of the prefix that match. With a better example I could have shown that the savings could be as much as 80 or 90 percent.
What happens next? SQL Server 2008 then tries to apply a dictionary algorithm. It looks at all the new values for the records in the page and creates a dictionary (up to about 300 items for now).
Dict D1 = 0DE, D2 = 3RD
Col1 Col2
AncRec ABCDE CHO 5+3 bytes = 8 bytes
Record1 4 3CO 1+3 bytes = 4 bytes
Record2 D2 2ALK 1+4 bytes = 5 bytes
Record3 5 D2 1+1 bytes = 2 bytes
Record4 D1 3OSE 1+4 bytes = 5 bytes
Record5 0QWERTY D1 7+1 bytes = 8 bytes
= 32 bytes
As you can see, the space has gone from 45 bytes down to 32 bytes, almost 25% savings! The tradeoff, of course, is a small overhead of CPU usage to compress and decompress the data on disk. In memory the pages are always decompressed.
There is also a threshold. Pages are always compressed but later evaluated if the savings are big enough to stay compressed. If the savings are not big enough, the page stays uncompressed, even if you have choosen compression.
And the compression only occurs when a page is full. When a pagesplit occurs, the new page inherits same anchor record.
This is the essentials of Karen's Compression Session, a new chapter in her next book!
The new BACKUP compressions later applies a kind of ZIP compression which also deals with suffixes, so the savings on backups would be even more!
Legacy Comments
Casual Observer
2008-10-09 |
re: Excerpt from The Compression Session Nice explanation. Thank you :) |
Saggi Neumann
2008-10-13 |
re: Excerpt from The Compression Session Hey mate, "As you can see, the ROW compression step looks at the complete page" - did you mean prefix compress (the 2nd phase in page compression)? Cheers, S. Neumann |