For a recent SSIS package at work I needed to determine the distinct values in a partition, the number of rows in each partition and the file group name on which each partition resided in order to come up with a grouping mechanism. Of course sys.partitions comes to mind for some of that but there are a few other tables you need to link to in order to grab the information required.
The table I’m working on contains 8.8 billion rows. Finding the distinct partition keys from this table was not a fast operation. My original solution was to create a temporary table, grab the distinct values for the partitioned column, then update via sys.partitions for the rows and the $partition function for the partitionid and finally look back to the sys.filegroups table for the filegroup names. It wasn’t pretty, it could take up to 15 minutes to return the results. The primary issue is pulling distinct values from the table. Queries for distinct against 8.8 billion rows don’t go quickly.
A few beers into a conversation with a friend and we ended up talking about work which led to a conversation about the task described above. The solution was already built in SQL Server, just needed to pull it together.
The first table I needed was sys.partition_range_values. This contains one row for each range boundary value for a partition function. In my case I have a partition function which uses dayid values. For example July 4th would be represented as an int, 20130704. This table lists out all of the dayid values which were defined in the function. This eliminated the need to query my source table for distinct dayid values, everything I needed was already built in here for me. The only caveat was that in my SSIS package I needed to create a bucket for any dayid values that were out of bounds for my function. For example if my function handled 20130501 through 20130704 and I had day values of 20130401 or 20130705 in my table, these would not be listed in sys.partition_range_values. I just created an “everything else” bucket in my ssis package just in case I had any dayid values unaccounted for.
To get the number of rows for a partition is very easy. The sys.partitions table contains values for each partition. Easy enough to achieve by querying for the object_id and index value of 1 (the clustered index)
The final piece of information was the filegroup name. There are 2 options available to get the filegroup name, sys.data_spaces or sys.filegroups. For my query I chose sys.filegroups but really it’s a matter of preference and data needs. In order to bridge between sys.partitions table and either sys.data_spaces or sys.filegroups you need to get the container_id. This can be done by joining sys.allocation_units.container_id to the sys.partitions.hobt_id. sys.allocation_units contains the field data_space_id which then lets you join in either sys.data_spaces or sys.file_groups.
The end result is the query below, which typically executes for me in under 1 second. I’ve included the join to sys.filegroups and to sys.dataspaces, and I’ve just commented out the join sys.filegroups.
As I mentioned above, this shaves a good 10-15 minutes off of my original ssis package and is a really easy tweak to get a boost in my ETL time. Enjoy.