Life as a SQL Swiss Army Knife

...or maybe I'm just the guy with the shovel behind the horses in the parade

Sunday, July 14, 2013

Fetching Partition Information

For a recent SSIS package at work I needed to determine the distinct values in a partition, the number of rows in each partition, and the filegroup each partition resided on, in order to come up with a grouping mechanism. Of course sys.partitions comes to mind for some of that, but there are a few other tables you need to link to in order to grab all of the information required.

The table I’m working on contains 8.8 billion rows, and finding the distinct partition keys in it was not a fast operation. My original solution was to create a temporary table, grab the distinct values for the partitioned column, update it via sys.partitions for the row counts and the $partition function for the partition ids, and finally look to the sys.filegroups table for the filegroup names. It wasn’t pretty, and it could take up to 15 minutes to return the results. The primary issue is pulling distinct values from the table; distinct queries against 8.8 billion rows don’t go quickly.

A few beers into a conversation with a friend, we ended up talking about work, which led to the task described above. The solution was already built into SQL Server; I just needed to pull it together.

The first table I needed was sys.partition_range_values. This contains one row for each range boundary value of a partition function. In my case I have a partition function which uses dayid values; for example, July 4th would be represented as an int, 20130704. This table lists out all of the dayid values defined in the function, which eliminated the need to query my source table for distinct dayid values. The only caveat was that in my SSIS package I needed to create a bucket for any dayid values that were out of bounds for the function. For example, if my function handled 20130501 through 20130704 and I had day values of 20130401 or 20130705 in my table, those would not be listed in sys.partition_range_values, so I created an “everything else” bucket just in case any dayid values were unaccounted for.
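The boundary values are sitting right there waiting to be queried. Something along these lines pulls them out (the function name pf_DayId is a stand-in for your own):

--List the boundary (dayid) values defined by a partition function.
--pf_DayId is an illustrative name; substitute your own function.
SELECT prv.boundary_id, prv.value AS dayid
FROM sys.partition_range_values prv
INNER JOIN sys.partition_functions pf
    ON pf.function_id = prv.function_id
WHERE pf.name = 'pf_DayId'
ORDER BY prv.boundary_id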

Getting the number of rows for each partition is very easy: the sys.partitions table already contains a row count for every partition. Easy enough to grab by querying for the table’s object_id and an index_id of 1 (the clustered index).
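For example, something like this (the table name is a placeholder):

--Row count for each partition of a table's clustered index.
--dbo.MyPartitionedTable is a placeholder name.
SELECT partition_number, rows
FROM sys.partitions
WHERE object_id = OBJECT_ID('dbo.MyPartitionedTable')
    AND index_id = 1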

The final piece of information was the filegroup name. There are two options available to get it, sys.data_spaces or sys.filegroups; for my query I chose sys.filegroups, but really it’s a matter of preference and data needs. In order to bridge between sys.partitions and either sys.data_spaces or sys.filegroups you need the container_id, which you get by joining sys.allocation_units.container_id to sys.partitions.hobt_id. sys.allocation_units contains the field data_space_id, which then lets you join in either sys.data_spaces or sys.filegroups.

The end result is the query below, which typically executes for me in under 1 second. I’ve included the joins to both sys.filegroups and sys.data_spaces, with the sys.data_spaces join commented out.

As I mentioned above, this shaves a good 10-15 minutes off of my original SSIS package and is a really easy tweak to get a boost in my ETL time. Enjoy.

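In sketch form it looks something like this (dbo.MyPartitionedTable is a placeholder, and the boundary_id join assumes a RANGE LEFT function; use partition_number - 1 for RANGE RIGHT):

--Boundary value, row count and filegroup for each partition.
SELECT p.partition_number,
       prv.value AS dayid,              --boundary value from the function
       p.rows,
       fg.name AS filegroup_name
FROM sys.partitions p
INNER JOIN sys.indexes i
    ON i.object_id = p.object_id
    AND i.index_id = p.index_id
INNER JOIN sys.allocation_units au
    ON au.container_id = p.hobt_id
    AND au.type = 1                     --IN_ROW_DATA, where container_id = hobt_id
INNER JOIN sys.filegroups fg
    ON fg.data_space_id = au.data_space_id
--INNER JOIN sys.data_spaces ds
--    ON ds.data_space_id = au.data_space_id
INNER JOIN sys.partition_schemes ps
    ON ps.data_space_id = i.data_space_id
LEFT JOIN sys.partition_range_values prv
    ON prv.function_id = ps.function_id
    AND prv.boundary_id = p.partition_number   --RANGE LEFT mapping
WHERE p.object_id = OBJECT_ID('dbo.MyPartitionedTable')
    AND p.index_id = 1
ORDER BY p.partition_number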

posted @ Sunday, July 14, 2013 5:38 PM | Feedback (0) | Filed Under [ SSIS SQL Server Partition ]

Monday, April 16, 2012

The 5 stages of reviewing bad TSQL

I'm working with an app team that is light on TSQL expertise this week and couldn't help but draw a parallel to the five stages of grief.

  1. Denial: There’s nothing wrong with the code. SQL Server has a bug in it. There is a network problem.
  2. Anger: You’re doing what in your code?! Why on earth are you doing that? That’s crazy.
  3. Bargaining: Fine, you can keep your cursor, but let’s speed things up a bit.
  4. Depression: Ugh, this is so horrible I’m never going to be able to fix all of it.
  5. Acceptance: OK, we’re screwed and we know we’re screwed. This is going to hurt…

posted @ Monday, April 16, 2012 3:56 PM | Feedback (0) |

Thursday, September 02, 2010

How to tell what your insert or update is doing

Lately I’ve been working on a data migration SSIS package for a customer, and some of it is just straightforward Execute SQL Tasks against some very large tables. I pull 10 million rows into a staging area, massage some data into the proper format, and insert it into its final production resting place. Easy enough. I’m testing it on a very low-powered 32-bit server and it takes a while; at one point today I started wondering whether it was locked up or still running.

I’ve used the sys.partitions table before as an alternate method of getting row counts for a table, since it already has that information handy. This time I queried it while the insert was running and was able to see exactly what was happening. The code below will walk you through an easy example; it works the same in 2005 and 2008.

First, just create a throwaway table:

Create table LoadTest
(
    IDVal int identity(1,1),
    CustomerName varchar(200),
    CustomerCity varchar(200)
)


Now let’s create a looping insert that gives us enough time to look at things. You can just stop it when you’ve had enough. It’s nothing exciting here, I’m just looping 2 million times and doing an insert... like I said, stop it when you’ve had enough:

--Row-by-row insert, deliberately slow so there's time to watch it
declare @counter int
set @counter = 2000000
While @counter > 0
BEGIN
    Insert into LoadTest (CustomerName, CustomerCity)
    values ('Mike Femenella', 'Matthews')
    SET @counter = @counter - 1
END

While your loop is running, open another query window against the same database and run this:

SELECT sp.index_id, sp.rows, si.name, si.type_desc
FROM sys.partitions sp
INNER JOIN sys.indexes si
    ON sp.object_id = si.object_id
    AND sp.index_id = si.index_id
WHERE object_name(sp.object_id) = 'LoadTest'

Since the table is not indexed you’ll see the row count as the heap is populated; hit F5 a few times and you can watch the rows field increment. If you have doubts that your insert statement is actually running, this is a good way to check that it’s alive, and you can also get an idea of its progress if you have a guess as to how many rows “should” be there.

--Unindexed, you see the heap being populated.

index_id    Rows     name    type_desc
0           24252    NULL    HEAP


So that was mildly interesting and sort of cool. Of course we can take it a step further (and I always love doing that). Go ahead and stop the loop when you have a million or so rows in there; you need a good amount of data in the LoadTest table so the next insert runs long enough to query sys.partitions while it works. I was letting my loop run while writing the first part of this post, so I have 2 million rows in LoadTest.

Let’s create a mirror image of our LoadTest table, and since I’m at the end of my work week and feeling very creative I’m going to call it LoadTest_Copy. Go ahead and hit F5 to create the table... award-winning material here, isn’t it?

Create table LoadTest_Copy
(
    IDVal int,
    CustomerName varchar(200),
    CustomerCity varchar(200)
)

GO
Create Clustered index ixc_IDVal on LoadTest_Copy(IDVal)
Create index ix_CustomerName on LoadTest_Copy(CustomerName)
Create index ix_CustomerCity on LoadTest_Copy(CustomerCity)

GO

In another window, modify your sys.partitions query and change the name from LoadTest to LoadTest_Copy. Run it now and you get 3 rows: one for the clustered index and one for each of the nonclustered indexes. Somewhat more realistic and more interesting.

SELECT sp.index_id, sp.rows, si.name, si.type_desc
FROM sys.partitions sp
INNER JOIN sys.indexes si
    ON sp.object_id = si.object_id
    AND sp.index_id = si.index_id
WHERE object_name(sp.object_id) = 'LoadTest_Copy'


Results:

index_id    Rows    name               type_desc
1           0       ixc_IDVal          CLUSTERED
2           0       ix_CustomerName    NONCLUSTERED
3           0       ix_CustomerCity    NONCLUSTERED

Now, do an insert from LoadTest into LoadTest_Copy. Start it, then flip over to your sys.partitions query and run it while the insert is happening.
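The insert itself is nothing fancy; something like this does the job:

--Copy everything across so all three indexes have work to do
Insert into LoadTest_Copy (IDVal, CustomerName, CustomerCity)
Select IDVal, CustomerName, CustomerCity
From LoadTest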

You can keep hitting F5 and watch the numbers change. Notice it builds the clustered index first, which of course makes sense, and then works through each of the other 2 indexes, building them out.

index_id    Rows       name               type_desc
1           449900     ixc_IDVal          CLUSTERED
2           0          ix_CustomerName    NONCLUSTERED
3           0          ix_CustomerCity    NONCLUSTERED

index_id    Rows       name               type_desc
1           2000000    ixc_IDVal          CLUSTERED
2           1065245    ix_CustomerName    NONCLUSTERED
3           0          ix_CustomerCity    NONCLUSTERED

index_id    Rows       name               type_desc
1           2000000    ixc_IDVal          CLUSTERED
2           2000000    ix_CustomerName    NONCLUSTERED
3           126137     ix_CustomerCity    NONCLUSTERED

And finally now that it’s done:

index_id    Rows       name               type_desc
1           2000000    ixc_IDVal          CLUSTERED
2           2000000    ix_CustomerName    NONCLUSTERED
3           2000000    ix_CustomerCity    NONCLUSTERED


So next time you’re wondering if your insert or update statement is actually doing anything useful, you can leverage some of the underlying system view data to make sure things are still on track; it’s also a good way to tell if you have time to run out for coffee. That’s the last SQL-related thing I’m doing, time for vacation… cheers!

posted @ Thursday, September 02, 2010 8:15 PM | Feedback (0) |

Monday, April 12, 2010

SQL Saturday Atlanta: Intro To Performance Tuning

I'm looking forward to speaking in Atlanta on the 24th; it will be fun to get back down that way to visit with some friends and present two topics that I really enjoy.

First, an introduction to performance tuning. Performance tuning is a very wide and deep topic, and we're staying close to the surface. I aim this class at new SQL users with less than 2 years of experience; it's all the things I wish someone had told me in my first 2 years about what to look for when the database was slow... or allegedly slow, I should say. We'll cover using Profiler to find slow-performing queries and how to save that data off to a table, along with a tour of its other features; the difference between clustered, nonclustered, and covering indexes; how to read an execution plan (at a high level); and finally the difference between a temp table and a table variable and the implications of using either one in your code. That pretty much takes up a full hour.

Second presentation: Loading Data in Real Time. It's really a presentation about partitioning, but with a twist we used at work recently to load some data quickly and put it into production with minimal downtime. We'll cover partition functions, schemes, $partition, MERGE, and sys.partitions, show some examples of building a set of partitioned tables, and use SWITCH to move data from one table to another. Finally we'll cover the differences in partitioning between 2005 and 2008. Hope to see you there!
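As a taste of the demo, here's a minimal sketch; all of the object names are invented for illustration:

--Minimal partition setup; pf_DayId, ps_DayId and the tables are made-up names.
CREATE PARTITION FUNCTION pf_DayId (int)
    AS RANGE RIGHT FOR VALUES (20100101, 20100102, 20100103)

CREATE PARTITION SCHEME ps_DayId
    AS PARTITION pf_DayId ALL TO ([PRIMARY])

CREATE TABLE SalesByDay
(
    DayId int NOT NULL,
    Amount money NOT NULL
) ON ps_DayId (DayId)

--Staging table on the same filegroup, same structure, with a check
--constraint matching partition 2's range (>= 20100101 and < 20100102).
CREATE TABLE SalesByDay_Staging
(
    DayId int NOT NULL,
    Amount money NOT NULL,
    CONSTRAINT ck_StagingDay CHECK (DayId >= 20100101 AND DayId < 20100102)
) ON [PRIMARY]

--After loading the staging table, switch it in almost instantly:
ALTER TABLE SalesByDay_Staging SWITCH TO SalesByDay PARTITION 2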

And if you read my blog, please introduce yourself!

posted @ Monday, April 12, 2010 9:00 PM | Feedback (0) |

Saturday, March 27, 2010

SQL Saturday Richmond, VA

Very excited to announce that I’ll be presenting 2 sessions at SQL Saturday in Richmond, VA on April 10th. If there are any frequent readers of SQLTeam.com attending, please make sure to say hi! The topics I’m covering are partitioning & loading data in real time, and an introduction to performance tuning. Hope to see you there!

SQL Saturday Richmond Schedule

posted @ Saturday, March 27, 2010 8:19 PM | Feedback (0) |
