SSIS Performance Framework Part 2
You can expect:
This is a medium intensity post (you should not get a migraine, but I make no promises).I Expect (that you have the following experience):
- Excel experience
- Complex SQL queries
Performance Reporting for SSIS Packages.
Part 1 helped us set up the logging for SSIS so that we could retrieve the data we need to make the pretty charts that help us in tuning our packages. This information is VITAL for engaging in REAL performance tuning. You need PROOF and not anecdotes about how X is faster than Y. Many times in my career, I have had people try to convince me that their way is better or faster. I used to try to argue about what was faster/better, but I soon learned the foolishness of this. Hard evidence is the only way to make a determination on performance. That is why this framework is so important. It will bring you up to speed on how to get the best performance out of SSIS faster than going from site to site for tips. It will allow you to do tests on your own in a very efficient manner.Of note about the SQL in Part 1. That SQL is meant to be used WHILE THE PACKAGE IS RUNNING. It will let you see where the package is while it is running. For long running packages, it is a way to peek at the health of the process. For instance, if someone is doing bad things or the server is choking, you can tell by running that code and comparing results to past known good runs to ensure that you are not tracking behind in performance. Part 2 is about looking at completed runs and analyzing them. Just wanted to make that distinction.
The first thing you need to do is go back to the Logging config on your package and add the OnProgress event. This will allow us to look at more than just the DataFlow items' progress. I've crafted a pretty crazy query to produce some statistics that can be used in an Excel Pivot table/chart.
Select
'Run ' + Cast(RunNumber As varchar) As RunName, S.Source,
DataFlowSource, RecordCount, RecordsPerSec,
Min(S.StartTime) StartTime, Max(S.EndTime) EndTime,
DateDiff(ms, Min(S.StartTime), Max(S.EndTime)) Duration
From sysdtslog90 S
Inner Join ( Select Distinct Source, ExecutionID,
Row_Number() Over (Order By StartTime) As RunNumber
From sysdtslog90
Where event = 'PackageStart' ) P On S.ExecutionID = P.ExecutionID
Left Outer Join (
Select SourceID, ExecutionID, DataFlowSource,
Sum(RecordCount) RecordCount,
Min(StartTime) StartTime,
(Cast(Sum(RecordCount) as real) / DateDiff(ms, Min(StartTime), Max(EndTime)))
* 1000 As RecordsPerSec
From
(
Select SourceID, ExecutionID, StartTime, EndTime,
Left(
SubString(message,
CharIndex(':', message,
CharIndex(':', message,
CharIndex(':', message,
CharIndex(':', message, 56)
+ 1)
+ 1)
+ 1)
+ 2, Len(message)),
CharIndex(':',
SubString(message,
CharIndex(':', message,
CharIndex(':', message,
CharIndex(':', message,
CharIndex(':', message, 56)
+ 1)
+ 1)
+ 1)
+ 2, Len(message))
) - 2) As DataFlowSource,
Cast(Right(message, CharIndex(':',
Reverse(message)) - 2) As int) As RecordCount
From sysdtslog90
Where event = 'OnPipelineRowsSent'
) DF_Events
Group By SourceID, ExecutionID, DataFlowSource
) DF On S.SourceID = DF.SourceID And S.ExecutionID = DF.ExecutionID
Where message <> 'Validating' Group By
RunNumber, S.Source, DataFlowSource, RecordCount,
DF.StartTime, RecordsPerSec,
Case When S.Source = P.Source Then 1 Else 0 End
Order By
RunNumber,
Case When S.Source = P.Source Then 1 Else 0 End Desc,
DF.StartTime,
Min(S.StartTime)
Update: The above query had a bug with some DataFlow transforms. It is fixed now, so please use the latest. The bug was how I was retrieving the RecordCount field. Some transforms would have "Input 1 : " and I was only looking for "Input : ". This caused an error in how the Right() was pulling the data. Thanks to Bill Fellows for pointing it out.
This data could also be used in other reporting things, but it was specifically designed for use in an Excel environment. The Pivot functions can be used to chart record counts, record speeds and Control Flow Task speeds. If you've never worked with Excel Pivot charts/tables, this is a great opportunity to start. If you have trouble getting it to work, leave me a comment and I'll help you out. The following chart shows the execution time of different Control Flow Tasks:
As you can see, the Data Transfer for Demo task is MUCH slower in Run 3. This is because I tweaked to to be slower. Below is a detailed view of the RPM (Records Per Minute) of that DataFlow task. Note that I am using the same query results, I am just changing the Pivot options in my pivot table/chart.
We see here that the RPM is much slower. Interestingly enough, the setting that is making this so slow is A DEFAULT!!! The developer who made that a default needs a punch in the nose as most new SSIS developers won't understand this at first. This leaves a bad impression if they aren't thorough enough to look at the dropdown. Here is the setting that made this package come to a crawl:
You want the one that says "- fast load" because it does a REAL bulk load. The one selected does individual insert statements FOR EVERY ROW PASSED IN!! This is only useful in certain situations that I'll talk about at some point. Part 3 of this series will cover what to play around with to increase performance now that you have a way to look at the results and make some sense of what needs to be changed. Sorry this one is short, but I promise Part 3 will have some very good things!
-Madman out-
Legacy Comments
Madman
2008-02-12 |
re: SSIS Performance Framework Part 2 Hey guys, I had to correct a problem with the SQL here. Please review the Update comments and use the corrected SQL. Thanks. |
Bill Fellows
2008-02-12 |
re: SSIS Performance Framework Part 2 I suspect the posting software will eat the prettiness of this but I reset Madman's code to use common table expressions (CTEs). Live them, learn them, love them. I have to scoot to a meeting but I'll link it in the raw format elsewhere if it gets garbled. -- Previous statement must be terminated with a semicolon to use CTE -- Also requires SQL Server 2005 or more recent ; WITH PACKAGE_START AS ( SELECT DISTINCT Source , ExecutionID , Row_Number() Over (Order By StartTime) As RunNumber FROM SYSDTSLOG90 WHERE event = 'PackageStart' ) , EVENTS AS ( SELECT SourceID , ExecutionID , StartTime , EndTime , Left(SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)), CharIndex(':', SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)) ) - 2) As DataFlowSource , Cast(Right(message, CharIndex(':', Reverse(message)) - 2) As int) As RecordCount FROM SYSDTSLOG90 WHERE event = 'OnPipelineRowsSent' ) , FANCY_EVENTS AS ( SELECT SourceID , ExecutionID , DataFlowSource , Sum(RecordCount) RecordCount , Min(StartTime) StartTime , ( Cast(Sum(RecordCount) as real) / Case When DateDiff(ms, Min(StartTime), Max(EndTime)) = 0 Then 1 Else DateDiff(ms, Min(StartTime), Max(EndTime)) End) * 1000 As RecordsPerSec FROM EVENTS DF_Events GROUP BY SourceID , ExecutionID , DataFlowSource ) SELECT 'Run ' + Cast(RunNumber As varchar) As RunName , S.Source , DF.DataFlowSource , DF.RecordCount , DF.RecordsPerSec , Min(S.StartTime) StartTime , Max(S.EndTime) EndTime , DateDiff(ms, Min(S.StartTime) , Max(S.EndTime)) Duration FROM SYSDTSLOG90 S INNER JOIN PACKAGE_START P ON S.ExecutionID = P.ExecutionID LEFT OUTER JOIN FANCY_EVENTS DF ON S.SourceID = DF.SourceID AND S.ExecutionID = DF.ExecutionID WHERE S.message <> 'Validating' GROUP BY RunNumber , S.Source , DataFlowSource , RecordCount , DF.StartTime , RecordsPerSec , Case When S.Source = P.Source Then 1 Else 0 End ORDER BY RunNumber , Case When S.Source = P.Source Then 1 Else 0 End Desc , DF.StartTime , Min(S.StartTime) |
Bill Fellows
2008-02-12 |
re: SSIS Performance Framework Part 2 http://home.kc.rr.com/bfellows/ssis/SSIS_Madman_Framework2.txt |
Bill Fellows
2008-08-29 |
re: SSIS Performance Framework Part 2 *bonk* You can run into a divide by zero condition as well when your starttime and endtime happened less than 3 milliseconds apart due to datetime precision issues with SQL Server2005 and previous editions. Case When DateDiff(ms, Min(StartTime), Max(EndTime)) = 0 Then 1 Else DateDiff(ms, Min(StartTime), Max(EndTime)) End |