posts - 10, comments - 49, trackbacks - 0

Wednesday, April 30, 2008

SSIS Performance Framework Part 3

You can expect:

This is a low intensity post (you should be able to do this in your sleep).

I Expect (that you have the following experience):

  • Simple SSIS Dataflow work
  • That you have read and understood/implemented parts 1 and 2

Performance Tuning for SSIS Packages.

Preparation

This is the 3rd and final part to this series.  In this part, I will cover techniques and settings in your SSIS packages to tune them so that they perform at appropriate speeds.  To really test speed, I need lots of data to test against.  The first thing I'm going to do is generate 3m rows of sample data.  I generated a table with an identity (duh!), a few varchar fields with varying length (Replicate('ge3', 1 + @cnt % 80) etc.), a few int fields, a few decimal fields and two date fields.  The reason I want all this is because I want to make this as close to a real world test as I can.  I then exported the data to a flat file (pipe delimited because I like pipes!)  If you want my script to create the test data, ask me in a comment and I'll post it.

Tuning Test 1

For our first tuning test, I am importing the file I just exported.  At 3m rows with variable data, it will be a good test of different methods and settings.  I created a new package and set up the logging options as in parts 1 and 2.  I also truncated my SampleData table and will use that for my inserts.  In the package, I created 4 DataFlows for my inserts and 4 SQL Statements for truncating my SampleData table.
  • For DataFlow 1, I created a Flat File Source and in the flat file connection manager, I specified all the columns as varchar(250).  I then created an OLE DB Destination and set it to use: FastLoad, TableLock and did not map my identity column (I could have told it to keep my identity if I wanted).
  • For DataFlow 2, I followed the pattern of DataFlow 1 except in the connection manager, I used the Suggest Types (1k records) feature on the Advanced tab.
  • For DataFlow 3, I used a SQL Server Destination which requires all mapped columns to be the EXACT type in the table, so in my connection manager, I had to ensure the data types were exact.
  • For DataFlow 4, I also used a SQL Server Destination, but I used the first connection manager and a Data Conversion transform to take the varchar(250) fields and convert them to the correct data type.
  • Notes:
    • OLE DB Destinations allow you to insert 1 record at a time or use FastLoad which does a TRUE Bulk Load (use SQL Profiler if you don't believe me).  Using FastLoad is PREFERRED!!  You will have poor performance if you don't use it.
    • I left the default of 0 for Commit Size on the OLE DB Destination. This allows the Bulk Load operation to choose how much data to commit.  This is typically the DefaultBufferMaxRows setting on the DataFlow.  This is a tuning option that can help performance if you are having memory bottlenecks.
    • I used Table Lock on the OLE DB Destinations.  This will ensure that you have full access to the table and will enable the Bulk Load operation to perform better.
    • SQL Server Destinations require the package to be running on the SAME server as the database and the database MUST be a SQL Server 2005 database.
Here are the results:
Source DataFlowSource RecordCount RecordsPerSec
Simple File Import With OLEDB Sample Data Destination - Simple with OLEDB 3000000 24793.39
Typed File Import With OLEDB Sample Data Destination - Typed with OLEDB 3000000 27272.73
Typed File Import With SQLServer Sample Data Destination - Typed with SQLServer 3000000 30612.25
Simple File Import With SQLServer Convert the data to the Table Types 3000000 28037.38
Simple File Import With SQLServer Sample Data Destination - Typed with SQLServer 3000000 28037.38

As you can see and might have expected, Typed imports are faster.  This is because the SQL Server is not having to do conversions.  The SSIS engine is doing those conversions.  Also expected is that the Data Conversion transformation slows down the process a little.  One thing that is very important to note is that I am running these tests on my desktop and not a robust server.  The differences between the last 3 tests is so close that you should consider them to be identical in terms of performance.

Tuning Test 2

Taking the fastest method from above, I performed a test with different transform items.  I created a basline DataFlow to compare results against.  I then used a few common items to demonstrate the performance of SSIS.  This is just to show you how to proceed in your own testing.
Source DataFlow Source Record Count RecordsPerSecond
Baseline Numbers Sample Data Destination - Typed with SQLServer 3000000 27272.73
Multicast Test Multicast 3000000 27272.73
Multicast Test Sample Data Destination - Typed with SQLServer 3000000 27272.73
Integer Conditional Split Test Conditional Split 3000000 26548.67
Integer Conditional Split Test Union All 3000000 26548.67
Integer Conditional Split Test Sample Data Destination - Typed with SQLServer 3000000 26548.67
FindString Conditional Split Conditional Split 3000000 26086.96
FindString Conditional Split Union All 3000000 26086.96
FindString Conditional Split Sample Data Destination - Typed with SQLServer 3000000 25862.07

As you can see, the Multicast caused no degradation in performance whatsoever.  This is because the output buffers reference the same memory.  The integer test in the conditional split and the subsequent Union All only had a negligible effect on performance.  The FindString split had a little bit more of a performance hit, but the effect was also very minor considering you are doing a string search!

SSIS is a very powerful tool.  You can do a lot of things in a lot of different ways.  It is up to the developer to pick the right tool for the right job.  I can't stress this enough.  You have to use the tool to become familiar with its power.  You need to get in and play around with the features and figure out how to do things in different ways.  The old adage of "practice makes perfect" applies directly to SSIS.

The goal of this post is to get you to think about how/what to test.  The first set of tests was designed to help you think through how to do a certain task and then test the performance of your setup.  The second set of tests was designed to show you that the built in controls will perform better than you might expect. 

Lacking from this testing methodology is the impact on system resources.  The focus of this test was purely for speed of execution.  You can see from the results that doing fairly simple tasks works extremely well even on a desktop box without the robustness in memory and hard drive pipelines that you will get with a server.  In addition to testing the speed of execution, you should familiarize yourself with PerfMon and use it to monitor the system resources.  For instance, the Fuzzy logic components use a lot of memory as does the Sort component (they are asynchronous components and can cache some or all of the data coming into them).

-Madman out-

posted @ Thursday, May 01, 2008 7:54 AM | Feedback (2) | Filed Under [ SSIS Madness ]

Error Handling Template

You can expect:

This is a medium intensity post (you should not get a migraine, but I make no promises).

I Expect (that you have the following experience):

  • Error Logging experience
  • UDF (User Defined Function) creation experience

Creating a template Package that handles errors.
(also how to create a template and make it available)

SSIS can be the wild west.  Each developer doing things their own way.  Errors are often overlooked and then when one happens no one knows why or how to fix it.  Logging is often not sufficiently enabled.  This causes maintenance headaches (especially with Excel, but that is a rabbit trail which dives steeply into the lowest reaches of Hell itself!).  What I am going to present with this post is how to log, compile and disseminate error information and not have your package just fail with no idea why.

 

Logging, GO!

The first thing we must do is create a new package (duh!).  Next we turn error logging on.  I've shown this before and here it is again (right click on the Control Flow canvas and go to Logging...):



And then set the logging details:



Handle That Event!

Next, we need to handle the error.  We could handle the OnError event, but that would fire on EVERY error that was generated.  A Task can generate an infinite (well, you get the idea) number of errors, so we don't want to fire an infinite number of error handlers!  What we really want to know is when a task fails.  We do this by setting up an event handler for the OnTaskFailed event (duh!):


After we create the event handler, we want to create a place to store the error information we are about to gather, so we create a variable.  I named mine ErrorMessageText.  Here is where things get fun.  We want to get all of the error information from the logging we've set up.  First thing we need to do is to set up some SQL to get it.  You will want to create a UDF for this because otherwise we are in the maintenance nightmare again.  Here is a sample that you can use:
Create Function GetErrors(@ExecutionID varchar(50))
Returns varchar(8000) As
Begin
Declare
@Result varchar(8000)

Select @Result = Coalesce(@Result, '') + '>>' + message + Char(10) + Char(13)
From sysdtslog90
Where event = 'OnError' And executionid = @ExecutionID
Order By id

Return @result

End

Now that we have an easy way to collect the error information (notice that I am passing in the execution id and retrieving all the errors for that execution) we need to get the data out of the database and into our message variable.  Grab an Execute SQL Task and configure it like so:






If you configure it just like this, it will work.  Note that the Execute SQL Task is VERY touchy about this kind of configuration and demands it be configured just so.

Next, we are going to drop in a Script Task to grab the rest of the information we want to present.  In the ReadOnlyVariables put:  ErrorMessageText, CreationDate, CreatorComputerName, CreatorName, ExecutionInstanceGUID, InteractiveMode, MachineName, OfflineMode, PackageID, PackageName, StartTime, UserName, VersionBuild, VersionGUID, VersionMajor, VersionMinor, ExecutionValue, SourceDescription, SourceID, SourceName

This allows us to produce as much information as possible for debugging.  Design the script and use something like the following for your Sub Main:

Public Sub Main()
Dim Result As String = "System Variables:" & vbCrLf

For Each Item As Variable In Dts.Variables
If Item.Name <> "ErrorMessageText" Then
If Not
Item.Value Is Nothing Then
Result &= Item.Name & " = " & Item.Value.ToString() & vbCrLf
Else
Result &= Item.Name & " = Nothing" & vbCrLf
End If
End If
Next

Result &= vbCrLf & "Error Mesages:" & vbCrLf & _
Dts.Variables("ErrorMessageText").Value.ToString()

Dts.Variables("ErrorMessageText").Value = Result

Dts.TaskResult = Dts.Results.Success
End Sub
This will collect all the information we need for our error report.  We'll then email this to the appropriate person.  This ought to be configured in your package with the configuration in a central location.  I am not covering configuration in this session, however.

Templates, Templates, Templates!!!

The next part is really hard.  No, really....  Put the template package in your drive:\Program Files\Microsoft Visual Studio 8\Common7\IDE\PrivateAssemblies\ProjectItems\DataTransformationProject\DataTransformationItems folder [1].  Done.  Now you can right click on your project in the Solution Explorer, go to Add, New Item... and pick the template.

Now everyone uses the same error notification system and you have all the useful information you need when someone decides to change the file format of a data feed that you've been receiving for years without telling you!  You can now proudly go to your boss and tell them "It's not my code!  Blame the customer!!".

-Madman Out-

References:
[1] http://support.microsoft.com/kb/908018 Accessed 4/30/2008

posted @ Wednesday, April 30, 2008 2:05 PM | Feedback (7) |

I'm not dead YET!!

I apologize for the long pause in posting.  I have been extremely busy with my Masters program.  I took my finals on Monday for this semester, so I finally have some free time!!  I've prepared a wrap-up to the Performance Framework series.  It describes some sample testing sets that should get you thinking about how to go about tweaking performance and picking the best method to use to accomplish a goal.

-tired Madman Out-

posted @ Wednesday, April 30, 2008 11:17 AM | Feedback (0) |

Powered by:
Powered By Subtext Powered By ASP.NET