SSIS Performance Framework Part 3

Thu May 1, 2008 by James Nix in ssis-madness

You can expect:

This is a low intensity post (you should be able to do this in your sleep).

I Expect (that you have the following experience):

Simple SSIS Dataflow work
That you have read and understood/implemented parts 1 and 2

Performance Tuning for SSIS Packages.

Preparation

This is the 3rd and final part to this series. In this part, I will cover techniques and settings in your SSIS packages to tune them so that they perform at appropriate speeds. To really test speed, I need lots of data to test against. The first thing I'm going to do is generate 3m rows of sample data. I generated a table with an identity (duh!), a few varchar fields with varying length (Replicate('ge3', 1 + @cnt % 80) etc.), a few int fields, a few decimal fields and two date fields. The reason I want all this is because I want to make this as close to a real world test as I can. I then exported the data to a flat file (pipe delimited because I like pipes!) If you want my script to create the test data, ask me in a comment and I'll post it.

Tuning Test 1

For our first tuning test, I am importing the file I just exported. At 3m rows with variable data, it will be a good test of different methods and settings. I created a new package and set up the logging options as in parts 1 and 2. I also truncated my SampleData table and will use that for my inserts. In the package, I created 4 DataFlows for my inserts and 4 SQL Statements for truncating my SampleData table.

For DataFlow 1, I created a Flat File Source and in the flat file connection manager, I specified all the columns as varchar(250). I then created an OLE DB Destination and set it to use: FastLoad, TableLock and did not map my identity column (I could have told it to keep my identity if I wanted).
For DataFlow 2, I followed the pattern of DataFlow 1 except in the connection manager, I used the Suggest Types (1k records) feature on the Advanced tab.
For DataFlow 3, I used a SQL Server Destination which requires all mapped columns to be the EXACT type in the table, so in my connection manager, I had to ensure the data types were exact.
For DataFlow 4, I also used a SQL Server Destination, but I used the first connection manager and a Data Conversion transform to take the varchar(250) fields and convert them to the correct data type.
Notes:
- OLE DB Destinations allow you to insert 1 record at a time or use FastLoad which does a TRUE Bulk Load (use SQL Profiler if you don't believe me). Using FastLoad is PREFERRED!! You will have poor performance if you don't use it.
- I left the default of 0 for Commit Size on the OLE DB Destination. This allows the Bulk Load operation to choose how much data to commit. This is typically the DefaultBufferMaxRows setting on the DataFlow. This is a tuning option that can help performance if you are having memory bottlenecks.
- I used Table Lock on the OLE DB Destinations. This will ensure that you have full access to the table and will enable the Bulk Load operation to perform better.
- SQL Server Destinations require the package to be running on the SAME server as the database and the database MUST be a SQL Server 2005 database.

Here are the results:

Source	DataFlowSource	RecordCount	RecordsPerSec
Simple File Import With OLEDB	Sample Data Destination - Simple with OLEDB	3000000	24793.39
Typed File Import With OLEDB	Sample Data Destination - Typed with OLEDB	3000000	27272.73
Typed File Import With SQLServer	Sample Data Destination - Typed with SQLServer	3000000	30612.25
Simple File Import With SQLServer	Convert the data to the Table Types	3000000	28037.38
Simple File Import With SQLServer	Sample Data Destination - Typed with SQLServer	3000000	28037.38

As you can see and might have expected, Typed imports are faster. This is because the SQL Server is not having to do conversions. The SSIS engine is doing those conversions. Also expected is that the Data Conversion transformation slows down the process a little. One thing that is very important to note is that I am running these tests on my desktop and not a robust server. The differences between the last 3 tests is so close that you should consider them to be identical in terms of performance.

Tuning Test 2

Taking the fastest method from above, I performed a test with different transform items. I created a basline DataFlow to compare results against. I then used a few common items to demonstrate the performance of SSIS. This is just to show you how to proceed in your own testing.

Source	DataFlow Source	Record Count	RecordsPerSecond
Baseline Numbers	Sample Data Destination - Typed with SQLServer	3000000	27272.73
Multicast Test	Multicast	3000000	27272.73
Multicast Test	Sample Data Destination - Typed with SQLServer	3000000	27272.73
Integer Conditional Split Test	Conditional Split	3000000	26548.67
Integer Conditional Split Test	Union All	3000000	26548.67
Integer Conditional Split Test	Sample Data Destination - Typed with SQLServer	3000000	26548.67
FindString Conditional Split	Conditional Split	3000000	26086.96
FindString Conditional Split	Union All	3000000	26086.96
FindString Conditional Split	Sample Data Destination - Typed with SQLServer	3000000	25862.07

As you can see, the Multicast caused no degradation in performance whatsoever. This is because the output buffers reference the same memory. The integer test in the conditional split and the subsequent Union All only had a negligible effect on performance. The FindString split had a little bit more of a performance hit, but the effect was also very minor considering you are doing a string search!

SSIS is a very powerful tool. You can do a lot of things in a lot of different ways. It is up to the developer to pick the right tool for the right job. I can't stress this enough. You have to use the tool to become familiar with its power. You need to get in and play around with the features and figure out how to do things in different ways. The old adage of "practice makes perfect" applies directly to SSIS.

The goal of this post is to get you to think about how/what to test. The first set of tests was designed to help you think through how to do a certain task and then test the performance of your setup. The second set of tests was designed to show you that the built in controls will perform better than you might expect.

Lacking from this testing methodology is the impact on system resources. The focus of this test was purely for speed of execution. You can see from the results that doing fairly simple tasks works extremely well even on a desktop box without the robustness in memory and hard drive pipelines that you will get with a server. In addition to testing the speed of execution, you should familiarize yourself with PerfMon and use it to monitor the system resources. For instance, the Fuzzy logic components use a lot of memory as does the Sort component (they are asynchronous components and can cache some or all of the data coming into them).

-Madman out-

Legacy Comments


joe 2008-05-01	re: SSIS Performance Framework Part 3 It taking me forever to type the test data (I'm only up to 345,091!), could you please post that scripty you mentioned?

Madman 2008-05-01	re: SSIS Performance Framework Part 3 Here ya go: USE [Sandbox] GO /**** Object: Table [dbo].[SampleData] Script Date: 02/21/2008 11:34:24 ****/ SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO SET ANSI_PADDING ON GO Drop Table [dbo].[SampleData] Go CREATE TABLE [dbo].[SampleData]( [id] [int] IDENTITY(1,1) NOT NULL, [varchar1] [varchar](250) NULL, [varchar2] [varchar](250) NULL, [varchar3] [varchar](250) NULL, [int1] [int] NULL, [int2] [int] NULL, [decimal1] [decimal](18, 5) NULL, [decimal2] [decimal](9, 5) NULL, [date1] [datetime] NULL, [date2] [smalldatetime] NULL ) ON [PRIMARY] GO SET ANSI_PADDING OFF Go Declare @varchar1 varchar(250) Declare @varchar2 varchar(250) Declare @varchar3 varchar(250) Declare @int1 int Declare @int2 int Declare @decimal1 decimal(18, 5) Declare @decimal2 decimal(9, 5) Declare @date1 datetime Declare @date2 smalldatetime Declare @cnt int Set @cnt = 0 While @cnt < 3000000 Begin Set @varchar1 = Replicate('ge3', 1 + @cnt % 80) Set @varchar2 = Replicate('de', 1 + @cnt % 124) Set @varchar3 = Replicate('h523', 1 + @cnt % 50) Set @int1 = @cnt % 80 Set @int2 = @cnt % 124 Set @decimal1 = @cnt / 80.0 Set @decimal2 = @cnt / 351.0 Set @date1 = GetDate() + @cnt % 124 Set @date2 = GetDate() + @cnt % 842 Insert Into SampleData Select @varchar1, @varchar2, @varchar3, @int1, @int2, @decimal1, @decimal2, @date1, @date2 Set @cnt = @cnt + 1 End