You can expect:This is a low intensity post (you should be able to do this in your sleep).
I Expect (that you have the following experience):
- Simple SSIS Dataflow work
- That you have read and understood/implemented parts 1 and 2
Performance Tuning for SSIS Packages.
This is the 3rd and final part to this series. In this part, I will cover techniques and settings in your SSIS packages to tune them so that they perform at appropriate speeds. To really test speed, I need lots of data to test against. The first thing I'm going to do is generate 3m rows of sample data. I generated a table with an identity (duh!), a few varchar fields with varying length (Replicate('ge3', 1 + @cnt % 80) etc.), a few int fields, a few decimal fields and two date fields. The reason I want all this is because I want to make this as close to a real world test as I can. I then exported the data to a flat file (pipe delimited because I like pipes!) If you want my script to create the test data, ask me in a comment and I'll post it.
Tuning Test 1For our first tuning test, I am importing the file I just exported. At 3m rows with variable data, it will be a good test of different methods and settings. I created a new package and set up the logging options as in parts 1 and 2. I also truncated my SampleData table and will use that for my inserts. In the package, I created 4 DataFlows for my inserts and 4 SQL Statements for truncating my SampleData table.
- For DataFlow 1, I created a Flat File Source and in the flat file connection manager, I specified all the columns as varchar(250). I then created an OLE DB Destination and set it to use: FastLoad, TableLock and did not map my identity column (I could have told it to keep my identity if I wanted).
- For DataFlow 2, I followed the pattern of DataFlow 1 except in the connection manager, I used the Suggest Types (1k records) feature on the Advanced tab.
- For DataFlow 3, I used a SQL Server Destination which requires all mapped columns to be the EXACT type in the table, so in my connection manager, I had to ensure the data types were exact.
- For DataFlow 4, I also used a SQL Server Destination, but I used the first connection manager and a Data Conversion transform to take the varchar(250) fields and convert them to the correct data type.
- OLE DB Destinations allow you to insert 1 record at a time or use FastLoad which does a TRUE Bulk Load (use SQL Profiler if you don't believe me). Using FastLoad is PREFERRED!! You will have poor performance if you don't use it.
- I left the default of 0 for Commit Size on the OLE DB Destination. This allows the Bulk Load operation to choose how much data to commit. This is typically the DefaultBufferMaxRows setting on the DataFlow. This is a tuning option that can help performance if you are having memory bottlenecks.
- I used Table Lock on the OLE DB Destinations. This will ensure that you have full access to the table and will enable the Bulk Load operation to perform better.
- SQL Server Destinations require the package to be running on the SAME server as the database and the database MUST be a SQL Server 2005 database.
|Simple File Import With OLEDB||Sample Data Destination - Simple with OLEDB||3000000||24793.39|
|Typed File Import With OLEDB||Sample Data Destination - Typed with OLEDB||3000000||27272.73|
|Typed File Import With SQLServer||Sample Data Destination - Typed with SQLServer||3000000||30612.25|
|Simple File Import With SQLServer||Convert the data to the Table Types||3000000||28037.38|
|Simple File Import With SQLServer||Sample Data Destination - Typed with SQLServer||3000000||28037.38|
As you can see and might have expected, Typed imports are faster. This is because the SQL Server is not having to do conversions. The SSIS engine is doing those conversions. Also expected is that the Data Conversion transformation slows down the process a little. One thing that is very important to note is that I am running these tests on my desktop and not a robust server. The differences between the last 3 tests is so close that you should consider them to be identical in terms of performance.
Tuning Test 2Taking the fastest method from above, I performed a test with different transform items. I created a basline DataFlow to compare results against. I then used a few common items to demonstrate the performance of SSIS. This is just to show you how to proceed in your own testing.
|Source||DataFlow Source||Record Count||RecordsPerSecond|
|Baseline Numbers||Sample Data Destination - Typed with SQLServer||3000000||27272.73|
|Multicast Test||Sample Data Destination - Typed with SQLServer||3000000||27272.73|
|Integer Conditional Split Test||Conditional Split||3000000||26548.67|
|Integer Conditional Split Test||Union All||3000000||26548.67|
|Integer Conditional Split Test||Sample Data Destination - Typed with SQLServer||3000000||26548.67|
|FindString Conditional Split||Conditional Split||3000000||26086.96|
|FindString Conditional Split||Union All||3000000||26086.96|
|FindString Conditional Split||Sample Data Destination - Typed with SQLServer||3000000||25862.07|
As you can see, the Multicast caused no degradation in performance whatsoever. This is because the output buffers reference the same memory. The integer test in the conditional split and the subsequent Union All only had a negligible effect on performance. The FindString split had a little bit more of a performance hit, but the effect was also very minor considering you are doing a string search!
SSIS is a very powerful tool. You can do a lot of things in a lot of different ways. It is up to the developer to pick the right tool for the right job. I can't stress this enough. You have to use the tool to become familiar with its power. You need to get in and play around with the features and figure out how to do things in different ways. The old adage of "practice makes perfect" applies directly to SSIS.
The goal of this post is to get you to think about how/what to test. The first set of tests was designed to help you think through how to do a certain task and then test the performance of your setup. The second set of tests was designed to show you that the built in controls will perform better than you might expect.
Lacking from this testing methodology is the impact on system resources. The focus of this test was purely for speed of execution. You can see from the results that doing fairly simple tasks works extremely well even on a desktop box without the robustness in memory and hard drive pipelines that you will get with a server. In addition to testing the speed of execution, you should familiarize yourself with PerfMon and use it to monitor the system resources. For instance, the Fuzzy logic components use a lot of memory as does the Sort component (they are asynchronous components and can cache some or all of the data coming into them).
|re: SSIS Performance Framework Part 3
It taking me forever to type the test data (I'm only up to 345,091!), could you please post that scripty you mentioned?
|re: SSIS Performance Framework Part 3
Here ya go:
/****** Object: Table [dbo].[SampleData] Script Date: 02/21/2008 11:34:24 ******/
SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
SET ANSI_PADDING ON
Drop Table [dbo].[SampleData]
CREATE TABLE [dbo].[SampleData](
[id] [int] IDENTITY(1,1) NOT NULL,
[varchar1] [varchar](250) NULL,
[varchar2] [varchar](250) NULL,
[varchar3] [varchar](250) NULL,
[int1] [int] NULL,
[int2] [int] NULL,
[decimal1] [decimal](18, 5) NULL,
[decimal2] [decimal](9, 5) NULL,
[date1] [datetime] NULL,
[date2] [smalldatetime] NULL
) ON [PRIMARY]
SET ANSI_PADDING OFF
Declare @varchar1 varchar(250)
Declare @varchar2 varchar(250)
Declare @varchar3 varchar(250)
Declare @int1 int
Declare @int2 int
Declare @decimal1 decimal(18, 5)
Declare @decimal2 decimal(9, 5)
Declare @date1 datetime
Declare @date2 smalldatetime
Declare @cnt int
Set @cnt = 0
While @cnt < 3000000 Begin
Set @varchar1 = Replicate('ge3', 1 + @cnt % 80)
Set @varchar2 = Replicate('de', 1 + @cnt % 124)
Set @varchar3 = Replicate('h523', 1 + @cnt % 50)
Set @int1 = @cnt % 80
Set @int2 = @cnt % 124
Set @decimal1 = @cnt / 80.0
Set @decimal2 = @cnt / 351.0
Set @date1 = GetDate() + @cnt % 124
Set @date2 = GetDate() + @cnt % 842
Insert Into SampleData
Select @varchar1, @varchar2, @varchar3, @int1,
@int2, @decimal1, @decimal2, @date1, @date2
Set @cnt = @cnt + 1