
Monday, February 11, 2008

SSIS Performance Framework Part 2

You can expect:

This is a medium intensity post (you should not get a migraine, but I make no promises).

I Expect (that you have the following experience):

  • Excel experience
  • Complex SQL queries

Performance Reporting for SSIS Packages

Part 1 helped us set up the logging for SSIS so that we could retrieve the data we need to make the pretty charts that help us in tuning our packages.  This information is VITAL for engaging in REAL performance tuning.  You need PROOF and not anecdotes about how X is faster than Y.  Many times in my career, I have had people try to convince me that their way is better or faster.  I used to try to argue about what was faster/better, but I soon learned the foolishness of this.  Hard evidence is the only way to make a determination on performance.  That is why this framework is so important.  It will bring you up to speed on how to get the best performance out of SSIS faster than going from site to site for tips.  It will allow you to do tests on your own in a very efficient manner.

A note about the SQL in Part 1: that SQL is meant to be used WHILE THE PACKAGE IS RUNNING.  It will let you see where the package is while it is running.  For long running packages, it is a way to peek at the health of the process.  For instance, if someone is doing bad things or the server is choking, you can tell by running that code and comparing results to past known good runs to ensure that you are not falling behind in performance.  Part 2 is about looking at completed runs and analyzing them.  Just wanted to make that distinction.

The first thing you need to do is go back to the Logging config on your package and add the OnProgress event.  This will allow us to look at more than just the DataFlow items' progress.  I've crafted a pretty crazy query to produce some statistics that can be used in an Excel Pivot table/chart. 
Select
    'Run ' + Cast(RunNumber As varchar) As RunName,
    S.Source,
    DataFlowSource,
    RecordCount,
    RecordsPerSec,
    Min(S.StartTime) StartTime,
    Max(S.EndTime) EndTime,
    DateDiff(ms, Min(S.StartTime), Max(S.EndTime)) Duration
From sysdtslog90 S
-- Number each package execution so the runs chart as Run 1, Run 2, ...
Inner Join (
    Select Distinct
        Source,
        ExecutionID,
        Row_Number() Over (Order By StartTime) As RunNumber
    From sysdtslog90
    Where event = 'PackageStart'
) P On S.ExecutionID = P.ExecutionID
Left Outer Join (
    Select
        SourceID,
        ExecutionID,
        DataFlowSource,
        Sum(RecordCount) RecordCount,
        Min(StartTime) StartTime,
        (Cast(Sum(RecordCount) As real)
            / DateDiff(ms, Min(StartTime), Max(EndTime))) * 1000 As RecordsPerSec
    From (
        -- Parse the component name and row count out of the
        -- OnPipelineRowsSent message text
        Select
            SourceID,
            ExecutionID,
            StartTime,
            EndTime,
            -- The component name sits after the fourth colon past position 56
            Left(
                SubString(message,
                    CharIndex(':', message,
                        CharIndex(':', message,
                            CharIndex(':', message,
                                CharIndex(':', message, 56) + 1) + 1) + 1) + 2,
                    Len(message)),
                CharIndex(':',
                    SubString(message,
                        CharIndex(':', message,
                            CharIndex(':', message,
                                CharIndex(':', message,
                                    CharIndex(':', message, 56) + 1) + 1) + 1) + 2,
                        Len(message))) - 2) As DataFlowSource,
            -- The row count is everything after the LAST colon
            Cast(Right(message, CharIndex(':', Reverse(message)) - 2) As int) As RecordCount
        From sysdtslog90
        Where event = 'OnPipelineRowsSent'
    ) DF_Events
    Group By SourceID, ExecutionID, DataFlowSource
) DF On S.SourceID = DF.SourceID And S.ExecutionID = DF.ExecutionID
Where message <> 'Validating'
Group By
    RunNumber,
    S.Source,
    DataFlowSource,
    RecordCount,
    DF.StartTime,
    RecordsPerSec,
    Case When S.Source = P.Source Then 1 Else 0 End
Order By
    RunNumber,
    Case When S.Source = P.Source Then 1 Else 0 End Desc,
    DF.StartTime,
    Min(S.StartTime)

Update:  The above query had a bug with some DataFlow transforms.  It is fixed now, so please use the latest.  The bug was in how I was retrieving the RecordCount field: some transforms would have "Input 1 : " and I was only looking for "Input : ", which caused an error in how the Right() was pulling the data.  Thanks to Bill Fellows for pointing it out.
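To see what the fix handles, here is a tiny self-contained snippet (the message fragments are made up and much shorter than the real log messages) showing that grabbing everything after the LAST colon works for both shapes:

-- Made-up fragments; real OnPipelineRowsSent messages are much longer.
--   '... : Input : 1000'    is what I originally coded for
--   '... : Input 1 : 1000'  is what some transforms actually log
Select Cast(Right(m, CharIndex(':', Reverse(m)) - 2) As int) As RecordCount
From (Select '... : Input 1 : 1000' As m) x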

This data could also be used in other reporting tools, but it was specifically designed for use in an Excel environment.  The Pivot functions can be used to chart record counts, record speeds, and Control Flow Task speeds.  If you've never worked with Excel Pivot charts/tables, this is a great opportunity to start.  If you have trouble getting it to work, leave me a comment and I'll help you out.  The following chart shows the execution time of different Control Flow Tasks:


As you can see, the Data Transfer for Demo task is MUCH slower in Run 3.  This is because I tweaked it to be slower.  Below is a detailed view of the RPM (Records Per Minute) of that DataFlow task.  Note that I am using the same query results; I am just changing the Pivot options in my pivot table/chart.


We see here that the RPM is much slower.  Interestingly enough, the setting that is making this so slow is A DEFAULT!!!  The developer who made that a default needs a punch in the nose as most new SSIS developers won't understand this at first.  This leaves a bad impression if they aren't thorough enough to look at the dropdown.  Here is the setting that made this package come to a crawl:


You want the one that says "- fast load" because it does a REAL bulk load.  The one selected does individual insert statements FOR EVERY ROW PASSED IN!!  This is only useful in certain situations that I'll talk about at some point.  Part 3 of this series will cover what to play around with to increase performance now that you have a way to look at the results and make some sense of what needs to be changed.  Sorry this one is short, but I promise Part 3 will have some very good things!
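To make the difference concrete, here is a minimal T-SQL sketch (hypothetical table name and made-up row counts, nothing from the demo package) that reproduces the effect outside of SSIS. It is only a rough analogy: fast load actually uses the bulk-load API rather than a literal INSERT...SELECT, but the per-row vs. set-based contrast is the same:

-- Hypothetical scratch table for the comparison
Create Table dbo.FastLoadDemo (Id int, Payload char(200))

-- Roughly what the plain "Table or view" mode does: one INSERT per row
Declare @i int
Set @i = 1
While @i <= 100000
Begin
    Insert Into dbo.FastLoadDemo (Id, Payload) Values (@i, 'x')
    Set @i = @i + 1
End

-- Reset between timings
Truncate Table dbo.FastLoadDemo

-- Roughly what "- fast load" does: one bulk operation for the whole batch
Insert Into dbo.FastLoadDemo (Id, Payload)
Select Top 100000 Row_Number() Over (Order By (Select Null)), 'x'
From sys.all_objects a Cross Join sys.all_objects b

Time both halves and you will see the same gap the chart above shows.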

-Madman out-

posted @ Monday, February 11, 2008 4:24 PM | Feedback (4) | Filed Under [ SSIS Madness ]

SSIS Performance Framework Part 1

You can expect:

This is a medium intensity post (you should not get a migraine, but I make no promises)

I Expect (that you have the following experience):

  • The ability to POINT and CLICK on your OWN (I won't hold your hand like a little girl. Get over it.)
  • At least SOME SSIS experience
  • SSIS Logging
  • Stored Procedure writing

Performance Measuring in SSIS

SSIS Logging lets you see what the heck is going on during run-time of your packages. You can see where failures occur and that is what MOST people are anxious to see. I am not most people and I want to see how fast my packages are running. You can see total execution time from an SQL Agent job, but that isn't enough! I want to know how EVERYTHING is performing. Armed with this information I know exactly what to focus on if things are not what I expect them to be. This post will show you how to set up logging to a SQL server and then USE that logging info for the forces of good.

The first thing to do is turn on logging for the package (right click on the Control Flow surface - ok, that was a freebie):



Use the provider for SQL Server and configure your SQL Server connection manager. SSIS will automatically create a table called sysdtslog90 if it does not exist. Be sure to check both boxes as illustrated here:
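If you are curious what you get, the table looks roughly like this (reconstructed from memory, so treat it as a sketch and run sp_help 'sysdtslog90' on your server for the real definition):

-- Approximate shape of the auto-created log table (SQL Server 2005).
-- Types are from memory; verify against your server.
CREATE TABLE dbo.sysdtslog90 (
    id          INT IDENTITY(1,1) NOT NULL,
    event       SYSNAME NOT NULL,
    computer    NVARCHAR(128) NOT NULL,
    operator    NVARCHAR(128) NOT NULL,
    source      NVARCHAR(1024) NOT NULL,
    sourceid    UNIQUEIDENTIFIER NOT NULL,
    executionid UNIQUEIDENTIFIER NOT NULL,
    starttime   DATETIME NOT NULL,
    endtime     DATETIME NOT NULL,
    datacode    INT NOT NULL,
    databytes   IMAGE NULL,
    message     NVARCHAR(2048) NOT NULL
)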



Next is the Details tab. These settings are package level settings and will propagate to all objects in the package. So, when you see the OnError event checked, this will automatically be checked for any object you go to configure the logging for.



Ok, create a Dataflow task and pipe some data from one place to another. For purposes of this example, I am going to use an OleDB source, a Lookup, and an OleDB Destination. This next sentence is VERY IMPORTANT, so LISTEN UP. You need to name each of the Dataflow (and everything, really) steps something MEANINGFUL. "OleDB Source" is NOT MEANINGFUL. If you clods aren't already doing this in your packages, then you fail SSIS 101. Go to jail, go directly to jail, do NOT pass Go, do NOT collect $200.



Next, go back to the Control Flow surface and go back into logging. You will now see that the Package is not the only container. You will see that the DataFlow container is grayed out. This is because it is inheriting the settings from the parent. If we want (and we do) to configure this container ourselves, you have to check the box (click it twice - there I go again! slap my hand) and enable the SQL Server logger. Then go to the Details tab. All the settings from the parent have to be re-set because we chose to configure it ourselves. Check the same 2 we had last time and also OnPipelineRowsSent. This event lets us know how many rows were sent to the particular component in the DataFlow (this is a DataFlow specific event). There are a lot of events you can look at and it's up to you to determine what you want. For this demo (because it is more about logging for Performance tuning) these are the only ones needed:



Now, pull up your Management Studio and run a Select on the table (Select * From sysdtslog90). You will see a goldmine of information. Times and row counts = much win. Play around with the data a little and then come back to this post. Go, do it. Go on!!
Ok, a couple of things to note. ExecutionID is an ID that covers the complete execution of the package. The message field contains the count of records sent to the component. To get the rowcount out of the field, use:
CAST(RIGHT([MESSAGE], LEN([MESSAGE]) - CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS INT)
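For example, here is a quick query (against the same sysdtslog90 table from above) that uses that expression to total the rows sent to each component, per execution:

-- EXAMPLE: TOTAL ROWS SENT PER COMPONENT, PER PACKAGE EXECUTION
SELECT EXECUTIONID, SOURCE,
    SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
        CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS INT)) AS ROWSSENT
FROM SYSDTSLOG90
WHERE [EVENT] = 'ONPIPELINEROWSSENT'
GROUP BY EXECUTIONID, SOURCE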

This Stored Procedure offers a complete look at the progress of a package execution. It can be used to check performance over many executions to evaluate slow downs etc.
CREATE PROCEDURE [DBO].[CHECKPROGRESS]
(
    @EXECUTIONID AS UNIQUEIDENTIFIER,
    @MESSAGE AS VARCHAR(100),
    @GOAL AS INT,
    @PROCESS VARCHAR(50)
)
AS BEGIN
    SELECT
        PROCESS = @PROCESS,
        TIMEINSECS = DATEDIFF(SS, MIN(STARTTIME), MAX(ENDTIME)),
        RECCOUNT = SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
            CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS INT)),
        RECSPERSEC = SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
            CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
            DATEDIFF(SS, MIN(STARTTIME), MAX(ENDTIME)),
        RECSPERHOUR = (SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
            CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
            DATEDIFF(SS, MIN(STARTTIME), MAX(ENDTIME))) * 60 * 60,
        GOAL = @GOAL,
        PERCENTCOMPLETE = (SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
            CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
            @GOAL) * 100,
        ESTIMATEDTOTALTIME = CAST(ROUND(DATEDIFF(SS, MIN(STARTTIME),
            MAX(ENDTIME)) / (SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
            CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
            @GOAL), 0) AS INT),
        SERVERETA = DATEADD(SS, (CAST(ROUND(DATEDIFF(SS,
            MIN(STARTTIME), MAX(ENDTIME)) / (SUM(CAST(RIGHT([MESSAGE],
            LEN([MESSAGE]) - CHARINDEX('INPUT : ', [MESSAGE]) - 7)
            AS NUMERIC)) / @GOAL), 0) AS INT) - DATEDIFF(SS,
            MIN(STARTTIME), MAX(ENDTIME))), GETDATE()),
        CURRENTSERVERTIME = GETDATE()
    FROM SYSDTSLOG90
    WHERE EXECUTIONID = @EXECUTIONID
        AND [EVENT] = 'ONPIPELINEROWSSENT'
        AND [MESSAGE] LIKE '%' + @MESSAGE + '%'
END

Here is a driver for that stored proc that matches my demo package.
DECLARE @MAX INT
DECLARE @EXECID UNIQUEIDENTIFIER

-- GRAB THE MOST RECENT LOG ROW FROM TODAY...
SELECT @MAX = MAX(ID)
FROM SYSDTSLOG90
WHERE STARTTIME BETWEEN CONVERT(VARCHAR, GETDATE(), 101)
    AND CONVERT(VARCHAR, DATEADD(DD, 1, GETDATE()), 101)

-- ...AND USE IT TO FIND THE LATEST EXECUTIONID
SELECT @EXECID = EXECUTIONID
FROM SYSDTSLOG90
WHERE ID = @MAX

DECLARE @RESULTS TABLE (
    PROCESS VARCHAR(50),
    TIMEINSECS INT,
    RECCOUNT INT,
    RECSPERSEC REAL,
    RECSPERHOUR REAL,
    GOAL INT,
    PERCENTCOMPLETE REAL,
    ESTIMATEDTOTALTIME INT,
    SERVERETA DATETIME,
    CURRENTSERVERTIME DATETIME
)

INSERT INTO @RESULTS
EXEC CHECKPROGRESS @EXECID, 'LOOKUP THE DATA IN THE OTHER DEMO TABLE', 1, 'LOOKUP'

INSERT INTO @RESULTS
EXEC CHECKPROGRESS @EXECID, 'RECORDS FOUND', 1, 'RECORDS FOUND'

INSERT INTO @RESULTS
EXEC CHECKPROGRESS @EXECID, 'RECORDS NOT FOUND', 1, 'RECORDS NOT FOUND'

SELECT * FROM @RESULTS


If you run your package, then create the stored proc, and then run the driver, you should see some valuable results. I've used this exact stuff to measure performance day over day, month over month, etc. Keep in mind that there are other factors that can affect performance (such as some dimwit running a CURSOR on your server...), so sometimes you need to average runtimes. Armed with this data, you can march up to Excel and tell it, "Give me charts, fool!". Charts make management giggly and thus you can defend your well written code when all of a sudden the virtual machine your DB is running on decides to do something funny! (true story)
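If you want a quick way to average those runtimes, here is a minimal sketch. It assumes the PackageStart and PackageEnd events are enabled in your logging config (PackageStart is used throughout this series; PackageEnd you may need to check on):

-- PER-RUN DURATION AND THE AVERAGE ACROSS ALL LOGGED RUNS
SELECT AVG(CAST(DURATIONSECS AS REAL)) AS AVGDURATIONSECS
FROM (
    SELECT EXECUTIONID,
        DATEDIFF(SS, MIN(STARTTIME), MAX(ENDTIME)) AS DURATIONSECS
    FROM SYSDTSLOG90
    WHERE [EVENT] IN ('PACKAGESTART', 'PACKAGEEND')
    GROUP BY EXECUTIONID
) RUNS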

This post is the foundation for this performance series.  We will use the sysdtslog90 table extensively to determine actual performance and not just anecdotal performance.  The OnError stuff will be used in subsequent posts.  It is there for a reason if you were wondering.

-Madman out-

posted @ Monday, February 11, 2008 10:34 AM | Feedback (3) | Filed Under [ SSIS Madness ]

Introduction Part 2

I have always viewed "social" movements on the Internet with somewhat of a mocking look. Think of a cross between Spock's eyebrow raise and the gleeful mocking smile and laugh that you crack when you see someone run a red light right into the waiting arms of the cop on the other side of the intersection. I lump many "fantastic things" into this category such as MySpace, Facebook, Blogs, etc. There is not enough to keep our quickly declining intelligences busy, so let us fill them with more useless things! "OMG you didn't say HI!!1! on my MySpace page! I h4t3 y0u!"

Lately, my titanium exterior has been softened towards Blogs (but only technical ones!). I began to use SSIS in extreme ways the day after RTM. My first project was to de-dupe 41m records of telco (Sprint) customer data. The data was horrendous and SSIS had fuzzy logic! Sounds like a perfect fit! The previous project owner had done it all in T-SQL. Yech! Certainly this wonderful new Microsoft technology can solve all our ills! What I learned over the next few weeks was an important lesson about SSIS that you all should know/learn. SSIS is a very powerful technology, but you have to use it correctly to be effective.

"Well, duh" you may say, but over the last ~2.5 years working with SSIS I've seen people use it in the worst ways EVER. This brings me to the motivation for this blog. I am here to slap some sense into y'all who are making the SSIS gods (and little babies) cry many many rivers of tears. When I started with SSIS there was little RELIABLE technical information for SSIS. Most of what was out there was inaccurate or speculative (even from Microsoft! I know, shocking). There are now several Blogs with some very good information, but there is so much that I have learned that isn't out there. I felt that it would benefit the community to get some of that information out to the unwashed masses.

The tips/lessons I have to post range from Mindless to MindNumbing and I will tell you up front what is difficult and what is simple. I will ATTEMPT to post at least once a week, but I cannot guarantee that regularity. I have a TON of stuff to get out there to start with, so there will most likely be a big burst of postings at the outset and then it will taper down a bit. Also, school will affect how often I post.

Hang on for this wild ride... Big one coming at you to start us off!

posted @ Monday, February 11, 2008 9:56 AM | Feedback (0) |

Introduction Part 1

I started a Blog a few weeks ago to share my knowledge of SSIS with people.  I didn't think it would really get anyone's attention; I just felt like I needed to get information out there because so many people do horrible things with this technology!  In any case, through my blog (all 2 posts :P) and some interaction with a system I had built at a previous employer, I drew the attention of Bill here at SQLTeam.com.  He has asked me to blog here, so I am moving the contents of my old Blog (it was Blogger and the interface kinda sucked a lot) to here.  I have my original post and then the start of a technical series that will be 3 or 4 LONG posts.  That is how I got here.  I've also started posting in the forums under the moniker Qualis.  Feel free to mock me endlessly if you wish, but be sure that I will correct HORRIBLE solutions where I find them!

posted @ Monday, February 11, 2008 9:56 AM | Feedback (3) |
