
Wednesday, August 20, 2008

Using System Variables as Parameters in an OLE DB Source Query

You can expect:

This is a low intensity post (you should be able to do this in your sleep).

I Expect (that you have the following experience):

  • Simple SSIS Dataflow work

 

OLE DB Source Queries with Parameters

The OLE DB Source allows you to enter a Select statement to retrieve data from a source.  As I have stated several times before, this is the preferred method (over picking a table or view from the dropdown).  If you want to use parameters, it is the ONLY method you can use.  Using parameters is fairly easy: you just use a ? for each parameter you wish to map, then click the Parameters button and map your parameters to variables.
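
For illustration, a parameterized source query might look like the following (a minimal sketch; the table and column names are made up, and each ? gets mapped to a variable in the Parameters dialog):

Select CustomerID, CustomerName, ModifiedDate
From dbo.Customer
Where Region = ?
    And ModifiedDate >= ?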



This seems easy enough, but if you select a System variable, your package will not run.  You will receive an error message similar to this one: [DTS.Pipeline] Error: component "OLE DB Source" (1) failed the pre-execute phase and returned error code 0xC0010001.  The reason is that when SSIS goes to map the parameters, it needs to lock the variables for reading, and the OLE DB Source fails to acquire that lock on a System variable.  I believe this is a bug. 


The Solution

I searched high and low for a workaround and the only solution I found was to stuff the whole select into a variable and then use a script to take the System variable and stuff it in the Where clause.  Ok, here comes a rant...

This is a KLUDGY FIX!  Don't do it!  You bypass the benefit of having a Parameterized query and you make your solution MUCH HARDER TO MAINTAIN.  I can't stress this enough.  You are making a huge mess for yourself and others.  I have people complain to me all the time that SSIS is horrible and then I look at how they set things up and I want to smack them upside the head! /rant

Most people who use SSIS that I've worked with don't know that you can create Variables that are evaluated as expressions.  Variables can just store data or they can execute an expression every time they are accessed.  This feature allows you to do some really nice things and in this case, is the solution to the problem.  I set up a variable named PackageName so that I could retrieve the System::PackageName variable.  In the Package Explorer tab, I navigated to my variable, set its EvaluateAsExpression property to True and set its Expression to @[System::PackageName].  I went back to the OLE DB Source and changed the mapping to my variable.  It now worked how I needed it to.

You can also get to the variable settings by just clicking on the variable in your Variables pane, but I wanted you to look at the Package Explorer!  While you are in the Package Explorer, look around a bit and get familiar with it.  It is a nice way to manage settings across your entire package quickly.

posted @ Wednesday, August 20, 2008 9:31 AM | Feedback (10) | Filed Under [ SSIS Madness ]

Thursday, May 29, 2008

Fast T-SQL Row Number Generator

You can expect:

This is a medium intensity post (you should not get a migraine, but I make no promises).

I Expect (that you have the following experience):

  • T-SQL Stored Proc experience
  • Windowed Function experience
  • Basic Join understanding

Row Number Generation

There are a number of ways to generate row numbers in T-SQL.  One way is to create a temp table with an identity column and do inserts into that table.  Typically you do it in a while loop with a counter.  I've used this method in the past and it works ok.  I was helping out a friend who was trying to help out a co-worker with this very problem.  I started playing around with other ways to generate row numbers / IDs and came up with what I think is a rather novel way of doing it.  There are several pieces to this and I'll cover each of them.

Windowed Functions

SQL Server 2005 introduced windowed functions, which I think of as "in-place aggregates".  They let you get results that used to require a Group By, without the Group By.  They are helpful in situations where you need extra columns that you just can't keep in your Select when using a Group By.  One very useful one is the Row_Number() function.  It has tremendous usefulness and I'm only using it in a very basic manner.
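
As a quick illustration (a minimal sketch against a made-up table), Row_Number() simply hands each row a sequential number based on whatever ordering you specify, with no Group By required:

Select Row_Number() Over (Order By OrderDate) As RowNum,
    OrderID, OrderDate
From dbo.Orders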

Fast Data Generation

I needed a way to quickly generate data.  SQL Server is meant for set based processing, so things like looping and UDFs are not very speedy.  You should all know this and if you don't, you need to read more!  On a side rant, I've seen developers treat T-SQL like it was VB (I say VB because the people who do this are typically green developers).  T-SQL is NOT a programming language!  It is a Database language!

I decided to use SQL's set power for my purposes.  For those that need a refresher, a cross join produces every possible combination of rows from 2 or more tables.  My method was to create an in-memory table with 100 rows and 1 column.  That column contained the numbers 1 through 100.  Cross joining that table to itself results in a Select with 100 * 100 = 10,000 records.  The Select is lightning fast compared to looping 10k times.  Adding 2 more copies of the table to the join raises the possibility to 100^4 = 100m records.  If you wanted to do more, you could add more copies.

Execution Plan Optimization

If I only want 5,000 row numbers, how do I limit my cross joins to only produce that many?  I could try to insert only the correct number of records into my temp table so that the result is 5k.  This, however, would require computing the exact number, plus I wouldn't be able to generate any arbitrary count I wanted.  I would not, for instance, be able to generate 58,391 row numbers.  So I again turned to SQL Server's set-based power.  I wrapped the cross join Select inside another Select with a Where clause to limit my results.  One might think that the SubSelect would have to finish before the outer Select could run, but that is not the case.  The Optimizer looks at the execution plan, sees that I only want 58,391 rows and stops processing the SubSelect once that count has been reached.  Therefore, I don't generate 100m rows and then return only 58,391; I only generate 58,391 rows.

The Query

Create Procedure GenerateRowNumbers(@NumberOfRows int, @StartNumber int)
As Begin

    -- Build a 100-row table variable containing the numbers 1 through 100
    Declare @NumGen Table (Num int)
    Declare @cnt int
    Set @cnt = 1

    While @cnt <= 100 Begin
        Insert Into @NumGen
        Select @cnt

        Set @cnt = @cnt + 1
    End

    -- Cross join the table to itself 4 times (100^4 = 100m possible rows),
    -- number the rows, and stop once @NumberOfRows have been produced
    Select @StartNumber + RowNum
    From
    (
        Select Row_Number() Over (Order By N1.Num) As RowNum
        From @NumGen N1, @NumGen N2, @NumGen N3, @NumGen N4
    ) RowNums
    Where RowNum <= @NumberOfRows

End
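
Calling the procedure is straightforward.  For example, this call produces the numbers 1001 through 59,391 (and if you run it with the Actual Execution Plan turned on, the row counts confirm that the cross join stops early as described above):

Exec GenerateRowNumbers @NumberOfRows = 58391, @StartNumber = 1000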

 

Conclusion

This method consumes much less memory than filling a temp table using a while loop.  It also consumes fewer CPU cycles.  This method is very fast and generates about 100k row numbers per second.  It handles the offset so you can start at any point and generate any number of row numbers.  It does have a 100m limit, but that can be overcome by adding another copy of the table to the cross join.
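
For reference, raising that ceiling is a one-line change to the inner Select in the procedure above; adding a fifth copy of the table takes the maximum from 100^4 = 100m to 100^5 = 10 billion possible row numbers:

From @NumGen N1, @NumGen N2, @NumGen N3, @NumGen N4, @NumGen N5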

-Madman Out-

posted @ Thursday, May 29, 2008 11:16 AM | Feedback (9) | Filed Under [ SQL Server Madness ]

Thursday, May 01, 2008

SSIS Performance Framework Part 3

You can expect:

This is a low intensity post (you should be able to do this in your sleep).

I Expect (that you have the following experience):

  • Simple SSIS Dataflow work
  • That you have read and understood/implemented parts 1 and 2

Performance Tuning for SSIS Packages.

Preparation

This is the 3rd and final part of this series.  In this part, I will cover techniques and settings to tune your SSIS packages so that they perform at appropriate speeds.  To really test speed, I need lots of data to test against.  The first thing I'm going to do is generate 3m rows of sample data.  I generated a table with an identity (duh!), a few varchar fields of varying length (Replicate('ge3', 1 + @cnt % 80) etc.), a few int fields, a few decimal fields and two date fields.  The reason I want all this is that I want to make this as close to a real-world test as I can.  I then exported the data to a flat file (pipe delimited because I like pipes!).  If you want my script to create the test data, ask me in a comment and I'll post it.
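
If you want something to play with right away, here is a bare-bones sketch of that kind of data generator (this is not my exact script; the column names and value formulas are arbitrary, so adjust them to taste).  It is a loop, so it is not fast, but it only has to run once.  Export the result to a pipe-delimited file with bcp or the Import/Export Wizard and you have your test feed:

Create Table SampleData (
    ID int Identity(1,1) Primary Key,
    TextField1 varchar(250),
    TextField2 varchar(250),
    IntField1 int,
    IntField2 int,
    DecField1 decimal(18,4),
    DateField1 datetime,
    DateField2 datetime
)

Declare @cnt int
Set @cnt = 0

While @cnt < 3000000 Begin
    Insert Into SampleData (TextField1, TextField2, IntField1, IntField2,
        DecField1, DateField1, DateField2)
    Select
        Replicate('ge3', 1 + @cnt % 80),        -- varchar fields of varying length
        Replicate('xq7', 1 + @cnt % 40),
        @cnt % 1000,                            -- int fields
        @cnt % 50000,
        Cast(@cnt % 997 As decimal(18,4)) / 7,  -- decimal field
        DateAdd(dd, @cnt % 365, '20070101'),    -- date fields
        DateAdd(mi, @cnt % 1440, '20080101')

    Set @cnt = @cnt + 1
End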

Tuning Test 1

For our first tuning test, I am importing the file I just exported.  At 3m rows with variable data, it will be a good test of different methods and settings.  I created a new package and set up the logging options as in parts 1 and 2.  I also truncated my SampleData table and will use that for my inserts.  In the package, I created 4 DataFlows for my inserts and 4 Execute SQL Tasks for truncating my SampleData table.
  • For DataFlow 1, I created a Flat File Source and in the flat file connection manager, I specified all the columns as varchar(250).  I then created an OLE DB Destination and set it to use: FastLoad, TableLock and did not map my identity column (I could have told it to keep my identity if I wanted).
  • For DataFlow 2, I followed the pattern of DataFlow 1 except in the connection manager, I used the Suggest Types (1k records) feature on the Advanced tab.
  • For DataFlow 3, I used a SQL Server Destination which requires all mapped columns to be the EXACT type in the table, so in my connection manager, I had to ensure the data types were exact.
  • For DataFlow 4, I also used a SQL Server Destination, but I used the first connection manager and a Data Conversion transform to take the varchar(250) fields and convert them to the correct data type.
  • Notes:
    • OLE DB Destinations allow you to insert 1 record at a time or use FastLoad which does a TRUE Bulk Load (use SQL Profiler if you don't believe me).  Using FastLoad is PREFERRED!!  You will have poor performance if you don't use it.
    • I left the default of 0 for Commit Size on the OLE DB Destination. This allows the Bulk Load operation to choose how much data to commit.  This is typically the DefaultBufferMaxRows setting on the DataFlow.  This is a tuning option that can help performance if you are having memory bottlenecks.
    • I used Table Lock on the OLE DB Destinations.  This will ensure that you have full access to the table and will enable the Bulk Load operation to perform better.
    • SQL Server Destinations require the package to be running on the SAME server as the database and the database MUST be a SQL Server 2005 database.
Here are the results:
Source                            | DataFlowSource                                  | RecordCount | RecordsPerSec
Simple File Import With OLEDB     | Sample Data Destination - Simple with OLEDB     | 3000000     | 24793.39
Typed File Import With OLEDB      | Sample Data Destination - Typed with OLEDB      | 3000000     | 27272.73
Typed File Import With SQLServer  | Sample Data Destination - Typed with SQLServer  | 3000000     | 30612.25
Simple File Import With SQLServer | Convert the data to the Table Types             | 3000000     | 28037.38
Simple File Import With SQLServer | Sample Data Destination - Typed with SQLServer  | 3000000     | 28037.38

As you can see and might have expected, typed imports are faster.  This is because SQL Server does not have to do the conversions; the SSIS engine does them.  Also expected is that the Data Conversion transformation slows down the process a little.  One thing that is very important to note is that I am running these tests on my desktop and not a robust server.  The differences between the last 3 tests are so close that you should consider them identical in terms of performance.

Tuning Test 2

Taking the fastest method from above, I performed a test with different transform items.  I created a baseline DataFlow to compare results against.  I then used a few common items to demonstrate the performance of SSIS.  This is just to show you how to proceed in your own testing.
Source DataFlow                | Source                                           | Record Count | RecordsPerSecond
Baseline Numbers               | Sample Data Destination - Typed with SQLServer   | 3000000      | 27272.73
Multicast Test                 | Multicast                                        | 3000000      | 27272.73
Multicast Test                 | Sample Data Destination - Typed with SQLServer   | 3000000      | 27272.73
Integer Conditional Split Test | Conditional Split                                | 3000000      | 26548.67
Integer Conditional Split Test | Union All                                        | 3000000      | 26548.67
Integer Conditional Split Test | Sample Data Destination - Typed with SQLServer   | 3000000      | 26548.67
FindString Conditional Split   | Conditional Split                                | 3000000      | 26086.96
FindString Conditional Split   | Union All                                        | 3000000      | 26086.96
FindString Conditional Split   | Sample Data Destination - Typed with SQLServer   | 3000000      | 25862.07

As you can see, the Multicast caused no degradation in performance whatsoever.  This is because the output buffers reference the same memory.  The integer test in the conditional split and the subsequent Union All only had a negligible effect on performance.  The FindString split had a little bit more of a performance hit, but the effect was also very minor considering you are doing a string search!

SSIS is a very powerful tool.  You can do a lot of things in a lot of different ways.  It is up to the developer to pick the right tool for the right job.  I can't stress this enough.  You have to use the tool to become familiar with its power.  You need to get in and play around with the features and figure out how to do things in different ways.  The old adage of "practice makes perfect" applies directly to SSIS.

The goal of this post is to get you to think about how/what to test.  The first set of tests was designed to help you think through how to do a certain task and then test the performance of your setup.  The second set of tests was designed to show you that the built in controls will perform better than you might expect. 

Lacking from this testing methodology is the impact on system resources.  The focus of this test was purely for speed of execution.  You can see from the results that doing fairly simple tasks works extremely well even on a desktop box without the robustness in memory and hard drive pipelines that you will get with a server.  In addition to testing the speed of execution, you should familiarize yourself with PerfMon and use it to monitor the system resources.  For instance, the Fuzzy logic components use a lot of memory as does the Sort component (they are asynchronous components and can cache some or all of the data coming into them).

-Madman out-

posted @ Thursday, May 01, 2008 7:54 AM | Feedback (2) | Filed Under [ SSIS Madness ]

Wednesday, April 30, 2008

Error Handling Template

You can expect:

This is a medium intensity post (you should not get a migraine, but I make no promises).

I Expect (that you have the following experience):

  • Error Logging experience
  • UDF (User Defined Function) creation experience

Creating a template Package that handles errors.
(also how to create a template and make it available)

SSIS can be the wild west, with each developer doing things their own way.  Errors are often overlooked, and when one happens no one knows why or how to fix it.  Logging is often not sufficiently enabled.  This causes maintenance headaches (especially with Excel, but that is a rabbit trail which dives steeply into the lowest reaches of Hell itself!).  What I am going to present in this post is how to log, compile and disseminate error information so your package doesn't just fail with no explanation of why.

 

Logging, GO!

The first thing we must do is create a new package (duh!).  Next we turn error logging on.  I've shown this before and here it is again (right click on the Control Flow canvas and go to Logging...):



And then set the logging details:



Handle That Event!

Next, we need to handle the error.  We could handle the OnError event, but that would fire on EVERY error that was generated.  A Task can generate an infinite (well, you get the idea) number of errors, so we don't want to fire an infinite number of error handlers!  What we really want to know is when a task fails.  We do this by setting up an event handler for the OnTaskFailed event (duh!):


After we create the event handler, we want to create a place to store the error information we are about to gather, so we create a variable.  I named mine ErrorMessageText.  Here is where things get fun.  We want to get all of the error information from the logging we've set up.  First thing we need to do is to set up some SQL to get it.  You will want to create a UDF for this because otherwise we are in the maintenance nightmare again.  Here is a sample that you can use:
Create Function GetErrors(@ExecutionID varchar(50))
Returns varchar(8000) As
Begin

    Declare @Result varchar(8000)

    Select @Result = Coalesce(@Result, '') + '>>' + message + Char(10) + Char(13)
    From sysdtslog90
    Where event = 'OnError' And executionid = @ExecutionID
    Order By id

    Return @Result

End

Now that we have an easy way to collect the error information (notice that I am passing in the execution id and retrieving all the errors for that execution) we need to get the data out of the database and into our message variable.  Grab an Execute SQL Task and configure it like so:






If you configure it just like this, it will work.  Note that the Execute SQL Task is VERY touchy about this kind of configuration and demands it be configured just so.
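
In case the screenshots are hard to make out, the short version of that configuration (assuming the variable names created above) is: set SQLSourceType to Direct input with a statement along these lines, map parameter 0 to System::ExecutionInstanceGUID on the Parameter Mapping page, set ResultSet to Single row, and map the returned column to User::ErrorMessageText on the Result Set page:

Select dbo.GetErrors(?) As ErrorText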

Next, we are going to drop in a Script Task to grab the rest of the information we want to present.  In the ReadOnlyVariables put:  ErrorMessageText, CreationDate, CreatorComputerName, CreatorName, ExecutionInstanceGUID, InteractiveMode, MachineName, OfflineMode, PackageID, PackageName, StartTime, UserName, VersionBuild, VersionGUID, VersionMajor, VersionMinor, ExecutionValue, SourceDescription, SourceID, SourceName

This allows us to produce as much information as possible for debugging.  Design the script and use something like the following for your Sub Main:

Public Sub Main()

    Dim Result As String = "System Variables:" & vbCrLf

    For Each Item As Variable In Dts.Variables
        If Item.Name <> "ErrorMessageText" Then
            If Not Item.Value Is Nothing Then
                Result &= Item.Name & " = " & Item.Value.ToString() & vbCrLf
            Else
                Result &= Item.Name & " = Nothing" & vbCrLf
            End If
        End If
    Next

    Result &= vbCrLf & "Error Messages:" & vbCrLf & _
        Dts.Variables("ErrorMessageText").Value.ToString()

    Dts.Variables("ErrorMessageText").Value = Result

    Dts.TaskResult = Dts.Results.Success

End Sub
This will collect all the information we need for our error report.  We'll then email this to the appropriate person.  This ought to be configured in your package with the configuration in a central location.  I am not covering configuration in this session, however.

Templates, Templates, Templates!!!

The next part is really hard.  No, really....  Put the template package in your drive:\Program Files\Microsoft Visual Studio 8\Common7\IDE\PrivateAssemblies\ProjectItems\DataTransformationProject\DataTransformationItems folder [1].  Done.  Now you can right click on your project in the Solution Explorer, go to Add, New Item... and pick the template.

Now everyone uses the same error notification system and you have all the useful information you need when someone decides to change the file format of a data feed that you've been receiving for years without telling you!  You can now proudly go to your boss and tell them "It's not my code!  Blame the customer!!".

-Madman Out-

References:
[1] http://support.microsoft.com/kb/908018 Accessed 4/30/2008

posted @ Wednesday, April 30, 2008 2:05 PM | Feedback (7) |

I'm not dead YET!!

I apologize for the long pause in posting.  I have been extremely busy with my Masters program.  I took my finals on Monday for this semester, so I finally have some free time!!  I've prepared a wrap-up to the Performance Framework series.  It describes some sample testing sets that should get you thinking about how to go about tweaking performance and picking the best method to use to accomplish a goal.

-tired Madman Out-

posted @ Wednesday, April 30, 2008 11:17 AM | Feedback (0) |

Wednesday, February 13, 2008

Row Numbers in a DataFlow

You can expect:

This is a low intensity post (you should be able to do this in your sleep).

I Expect (that you have the following experience):

  • Basic SSIS usage
  • Basic VB.Net usage

 Row Numbers in a DataFlow.

I try to post often in the Forums here at SQLTeam.com for a few reasons.  First, I get to help people in an interactive way which is very rewarding to me.  Second, I get to learn from others who post answers.  I appreciate the forums very much and the few tricks I have learned so far.  In this thread, a question about how to get row numbers for an incoming text file was posed.  The simple answer is, use a script transform.  The next question, "How?".

Script Transforms can be a very useful tool if used properly.  You must be careful not to throw a lot of code in there because you can slow your package down considerably.  For instance, DO NOT EVER call a web service from one (If you do, I won't help you anymore!  Well, maybe a little bit)!  If you have to do something like that, roll your own component and cache calls or multithread them (this WILL be a future post by the way).

For this example, we want to take a flat file, add a row number and attempt to save it to a database.  If this fails, we want to export the rows to a flat file and include the row number from the source file.  Here is my design:


For sample data, I exported my sysdtslog90 table.  I created the file source Connection Manager and my Flat File Source.  I then added the Script Transform.  I added an output column with a data type of "four byte signed int" shown here:


My input file does not have column headers, so I can start with 1.  If I had column headers, I would start with 2.  If I skipped more rows I would just adjust my starting number to where the DATA STARTS.  Then I edited my Script and entered the following:
Public Class ScriptMain : Inherits UserComponent

    Private _RowNumber As Integer = 1

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Row.RowNumber = _RowNumber
        _RowNumber += 1
    End Sub

End Class
Of note here: you could also check a column for a value change and restart the numbering if you wanted to do something fancy.  You could reproduce the functionality of Row_Number() Over (Partition By myColumn) using this method.

I then added my OLE DB Destination and added the Connection Manager for it and created the destination table and wired up the columns.  I made sure I used Fast Load into the table.  Because I want the error output, I set the "Rows per batch" and the "Maximum insert commit size" to the same thing.  For this example I set them to 5 since my sample data is only 1260 records.  When doing this in the future, DO NOT BE A KNUCKLEHEAD AND USE 5!!  This is a Bulk Insert, so a good size is 10k or so.  Of note, however, is that if you expect a LARGE number of failed records, I would consider checking the data first and doing something else with it.


Next I add another OLE DB Destination and wire it to the Error Output from the previous one.  It will prompt me what to do on failure and I tell it to "Redirect Row".  When there is a failure, the WHOLE BATCH FAILS.  In this example, when there is a failure with the first Destination, 5 rows will be sent to the error output.  The second Destination inserts 1 record at a time, so when the offending record is reached again, it fails that record alone.  Next I add my Flat File Destination and create the Connection Manager for it.  I wire it up the same way with the error output.  I could, if I wanted, try to capture the error message as well, but I'll leave that for another post because it is not very friendly.  I map my fields and then I am done.

In my example, my destination table has enough room for each field being pulled in from the file.  I want it to fail, so I added a constraint on my table that I know will fail some records.  The result looks like this:


As you can see, only 24 records were bad, but they caused 120 records (24 failed batches of 5) to be inserted 1 at a time.  The results in my output file are:
5,OnProgress,MLS
10,OnProgress,ML
12,OnProgress,ML
17,OnProgress,ML
22,OnProgress,ML
27,OnProgress,ML
414,OnProgress,M
419,OnProgress,M
425,OnProgress,M
430,OnProgress,M
432,OnProgress,M
437,OnProgress,M
442,OnProgress,M
447,OnProgress,M
834,OnProgress,M
839,OnProgress,M
845,OnProgress,M
850,OnProgress,M
852,OnProgress,M
857,OnProgress,M
862,OnProgress,M
867,OnProgress,M
1254,OnProgress,
1259,OnProgress,

As you can see, the row numbers are now present in the error output and can be used to report back to the producer of the file.  This method DOES NOT allow you to report on files that break the file format.  It will not, for example, output a record that has an extra or missing column delimiter.  For those types of things, you need a preprocessing step.

-Madman out-

posted @ Wednesday, February 13, 2008 10:31 AM | Feedback (11) | Filed Under [ SSIS Madness ]

Monday, February 11, 2008

SSIS Performance Framework Part 2

You can expect:

This is a medium intensity post (you should not get a migraine, but I make no promises).

I Expect (that you have the following experience):

  • Excel experience
  • Complex SQL queries

 Performance Reporting for SSIS Packages.

Part 1 helped us set up the logging for SSIS so that we could retrieve the data we need to make the pretty charts that help us in tuning our packages.  This information is VITAL for engaging in REAL performance tuning.  You need PROOF and not anecdotes about how X is faster than Y.  Many times in my career, I have had people try to convince me that their way is better or faster.  I used to try to argue about what was faster/better, but I soon learned the foolishness of this.  Hard evidence is the only way to make a determination on performance.  That is why this framework is so important.  It will bring you up to speed on how to get the best performance out of SSIS faster than going from site to site for tips.  It will allow you to do tests on your own in a very efficient manner.

A note about the SQL in Part 1: that SQL is meant to be used WHILE THE PACKAGE IS RUNNING.  It will let you see where the package is while it is running.  For long running packages, it is a way to peek at the health of the process.  For instance, if someone is doing bad things or the server is choking, you can tell by running that code and comparing the results to past known good runs to make sure you are not falling behind in performance.  Part 2 is about looking at completed runs and analyzing them.  Just wanted to make that distinction.

The first thing you need to do is go back to the Logging config on your package and add the OnProgress event.  This will allow us to look at more than just the DataFlow items' progress.  I've crafted a pretty crazy query to produce some statistics that can be used in an Excel Pivot table/chart. 
Select
    'Run ' + Cast(RunNumber As varchar) As RunName, S.Source,
    DataFlowSource, RecordCount, RecordsPerSec,
    Min(S.StartTime) StartTime, Max(S.EndTime) EndTime,
    DateDiff(ms, Min(S.StartTime), Max(S.EndTime)) Duration
From sysdtslog90 S
Inner Join (
    Select Distinct Source, ExecutionID,
        Row_Number() Over (Order By StartTime) As RunNumber
    From sysdtslog90
    Where event = 'PackageStart'
) P On S.ExecutionID = P.ExecutionID
Left Outer Join (
    Select
        SourceID, ExecutionID, DataFlowSource,
        Sum(RecordCount) RecordCount,
        Min(StartTime) StartTime,
        (Cast(Sum(RecordCount) As real) / DateDiff(ms, Min(StartTime), Max(EndTime)))
            * 1000 As RecordsPerSec
    From
    (
        Select SourceID, ExecutionID, StartTime, EndTime,
            Left(
                SubString(message,
                    CharIndex(':', message,
                        CharIndex(':', message,
                            CharIndex(':', message,
                                CharIndex(':', message, 56) + 1) + 1) + 1) + 2,
                    Len(message)),
                CharIndex(':',
                    SubString(message,
                        CharIndex(':', message,
                            CharIndex(':', message,
                                CharIndex(':', message,
                                    CharIndex(':', message, 56) + 1) + 1) + 1) + 2,
                        Len(message))) - 2) As DataFlowSource,
            Cast(Right(message, CharIndex(':', Reverse(message)) - 2) As int) As RecordCount
        From sysdtslog90
        Where event = 'OnPipelineRowsSent'
    ) DF_Events
    Group By SourceID, ExecutionID, DataFlowSource
) DF On S.SourceID = DF.SourceID And S.ExecutionID = DF.ExecutionID
Where message <> 'Validating'
Group By
    RunNumber, S.Source, DataFlowSource, RecordCount,
    DF.StartTime, RecordsPerSec,
    Case When S.Source = P.Source Then 1 Else 0 End
Order By
    RunNumber,
    Case When S.Source = P.Source Then 1 Else 0 End Desc,
    DF.StartTime,
    Min(S.StartTime)
Update:  The above query had a bug with some DataFlow transforms.  It is fixed now, so please use the latest.  The bug was how I was retrieving the RecordCount field.  Some transforms would have "Input 1 : " and I was only looking for "Input : ".  This caused an error in how the Right() was pulling the data.  Thanks to Bill Fellows for pointing it out.

This data could also be used in other reporting things, but it was specifically designed for use in an Excel environment.  The Pivot functions can be used to chart record counts, record speeds and Control Flow Task speeds.  If you've never worked with Excel Pivot charts/tables, this is a great opportunity to start.  If you have trouble getting it to work, leave me a comment and I'll help you out.  The following chart shows the execution time of different Control Flow Tasks:


As you can see, the Data Transfer for Demo task is MUCH slower in Run 3.  This is because I tweaked it to be slower.  Below is a detailed view of the RPM (Records Per Minute) of that DataFlow task.  Note that I am using the same query results; I am just changing the Pivot options in my pivot table/chart.


We see here that the RPM is much slower.  Interestingly enough, the setting that is making this so slow is A DEFAULT!!!  The developer who made that a default needs a punch in the nose as most new SSIS developers won't understand this at first.  This leaves a bad impression if they aren't thorough enough to look at the dropdown.  Here is the setting that made this package come to a crawl:


You want the one that says "- fast load" because it does a REAL bulk load.  The one selected does individual insert statements FOR EVERY ROW PASSED IN!!  This is only useful in certain situations that I'll talk about at some point.  Part 3 of this series will cover what to play around with to increase performance now that you have a way to look at the results and make some sense of what needs to be changed.  Sorry this one is short, but I promise Part 3 will have some very good things!

-Madman out-

posted @ Monday, February 11, 2008 4:24 PM | Feedback (4) | Filed Under [ SSIS Madness ]

SSIS Performance Framework Part 1

You can expect:

This is a medium intensity post (you should not get a migraine, but I make no promises)

I Expect (that you have the following experience):

  • The ability to POINT and CLICK on your OWN (I won't hold your hand like a little girl. Get over it.)
  • at least SOME SSIS experience
  • SSIS Logging
  • Stored Procedure writing

 Performance Measuring in SSIS

SSIS Logging lets you see what the heck is going on during run-time of your packages. You can see where failures occur and that is what MOST people are anxious to see. I am not most people and I want to see how fast my packages are running. You can see total execution time from an SQL Agent job, but that isn't enough! I want to know how EVERYTHING is performing. Armed with this information I know exactly what to focus on if things are not what I expect them to be. This post will show you how to set up logging to a SQL server and then USE that logging info for the forces of good.

The first thing to do is turn on logging for the package (right click on the Control Flow surface - ok, that was a freebie):



Use the provider for SQL Server and configure your SQL Server connection manager. SSIS will automatically create a table called sysdtslog90 if it does not exist. Be sure to check both boxes as illustrated here:



Next is the Details tab. These settings are package level settings and will propagate to all objects in the package. So, when you see the OnError event checked, this will automatically be checked for any object you go to configure the logging for.



Ok, create a Dataflow task and pipe some data from one place to another. For purposes of this example, I am going to use an OleDB source, a Lookup, and an OleDB Destination. This next sentence is VERY IMPORTANT, so LISTEN UP. You need to name each of the Dataflow (and everything, really) steps something MEANINGFUL. "OleDB Source" is NOT MEANINGFUL. If you clods aren't already doing this in your packages, then you fail SSIS 101. Go to jail, go directly to jail, do NOT pass Go, do NOT collect $200.



Next, go back to the Control Flow surface and go back into logging. You will now see that the Package is not the only container. You will see that the DataFlow container is grayed out. This is because it is inheriting the settings from the parent. If we want (and we do) to configure this container ourselves, we have to check the box (click it twice - there I go again! slap my hand) and enable the SQL Server logger. Then go to the Details tab. All the settings from the parent have to be re-set because we chose to configure it ourselves. Check the same 2 we had last time and also OnPipelineRowsSent. This event lets us know how many rows were sent to the particular component in the DataFlow (this is a DataFlow specific event). There are a lot of events you can look at and it's up to you to determine what you want. For this demo (because it is more about logging for Performance tuning) these are the only ones needed:



Now, pull up your Management Studio and run a Select on the table (Select * From sysdtslog90). You will see a goldmine of information. Times and row counts = much win. Play around with the data a little and then come back to this post. Go, do it. Go on!!
Ok, a couple of things to note. Executionid is an ID that covers the complete execution of the package. The message field contains the count of records sent to the component. To get the rowcount out of the field, use:
CAST(RIGHT([MESSAGE], LEN([MESSAGE]) - CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS INT)

This Stored Procedure offers a complete look at the progress of a package execution. It can be used to check performance over many executions to evaluate slow downs etc.
CREATE PROCEDURE [DBO].[CHECKPROGRESS]
(
@EXECUTIONID AS UNIQUEIDENTIFIER,
@MESSAGE AS VARCHAR(100),
@GOAL AS INT,
@PROCESS VARCHAR(50)
)
AS BEGIN
SELECT
PROCESS = @PROCESS, TIMEINSECS = DATEDIFF(SS, MIN(STARTTIME), MAX(ENDTIME)),
RECCOUNT = SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS INT)),
RECSPERSEC = SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
DATEDIFF(SS, MIN(STARTTIME), MAX(ENDTIME)),
RECSPERHOUR = (SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
DATEDIFF(SS, MIN(STARTTIME), MAX(ENDTIME))) * 60 * 60,
GOAL = @GOAL,
PERCENTCOMPLETE = (SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
@GOAL) * 100 ,
ESTIMATEDTOTALTIME = CAST(ROUND(DATEDIFF(SS, MIN(STARTTIME),
MAX(ENDTIME)) / (SUM(CAST(RIGHT([MESSAGE], LEN([MESSAGE]) -
CHARINDEX('INPUT : ', [MESSAGE]) - 7) AS NUMERIC)) /
@GOAL), 0) AS INT),
SERVERETA = DATEADD(SS, (CAST(ROUND(DATEDIFF(SS,
MIN(STARTTIME), MAX(ENDTIME)) / (SUM(CAST(RIGHT([MESSAGE],
LEN([MESSAGE]) - CHARINDEX('INPUT : ', [MESSAGE]) - 7)
AS NUMERIC)) / @GOAL), 0) AS INT) - DATEDIFF(SS,
MIN(STARTTIME), MAX(ENDTIME))), GETDATE()),
CURRENTSERVERTIME = GETDATE()

FROM
SYSDTSLOG90
WHERE EXECUTIONID = @EXECUTIONID
AND [EVENT] = 'ONPIPELINEROWSSENT' AND [MESSAGE] LIKE '%' + @MESSAGE + '%'
END

Here is a driver for that stored proc that matches my demo package.
DECLARE @MAX INT
DECLARE @EXECID UNIQUEIDENTIFIER

SELECT @MAX = MAX(ID)
FROM SYSDTSLOG90
WHERE STARTTIME BETWEEN CONVERT(VARCHAR, GETDATE(), 101)
    AND CONVERT(VARCHAR, DATEADD(DD, 1, GETDATE()), 101)

SELECT @EXECID = EXECUTIONID
FROM SYSDTSLOG90
WHERE ID = @MAX

DECLARE @RESULTS TABLE (
    PROCESS VARCHAR(50),
    TIMEINSECS INT,
    RECCOUNT INT,
    RECSPERSEC REAL,
    RECSPERHOUR REAL,
    GOAL INT,
    PERCENTCOMPLETE REAL,
    ESTIMATEDTOTALTIME INT,
    SERVERETA DATETIME,
    CURRENTSERVERTIME DATETIME )

INSERT INTO @RESULTS
EXEC CHECKPROGRESS @EXECID,
    'LOOKUP THE DATA IN THE OTHER DEMO TABLE', 1, 'LOOKUP'

INSERT INTO @RESULTS
EXEC CHECKPROGRESS @EXECID,
    'RECORDS FOUND', 1, 'RECORDS FOUND'

INSERT INTO @RESULTS
EXEC CHECKPROGRESS @EXECID,
    'RECORDS NOT FOUND', 1, 'RECORDS NOT FOUND'

SELECT * FROM @RESULTS


If you run your package, then create the stored proc and run the driver, you should see some valuable results. I've used this exact stuff to measure performance day over day, month over month, etc. Keep in mind that there are other factors that can affect performance (such as some dimwit running a CURSOR on your server...), so sometimes you need to average runtimes. Armed with this data, you can march up to Excel and tell it, "Give me charts fool!". Charts make management giggly and thus you can defend your well written code when all of a sudden the virtual machine your DB is running on decides to do something funny! (true story)

This post is the foundation for this performance series.  We will use the sysdtslog90 table extensively to determine actual performance and not just anecdotal performance.  The OnError stuff will be used in subsequent posts.  It is there for a reason if you were wondering.

-Madman out-

posted @ Monday, February 11, 2008 10:34 AM | Feedback (3) | Filed Under [ SSIS Madness ]

Introduction Part 2

I have always viewed "social" movements on the Internet with somewhat of a mocking look. Think of a cross between Spock's eyebrow raise and the gleeful mocking smile and laugh that you crack when you see someone run a red light right into the waiting arms of the cop on the other side of the intersection. I lump many "fantastic things" into this category such as MySpace, Facebook, Blogs, etc. There is not enough to keep our quickly declining intelligences busy, so let us fill them with more useless things! "OMG you didn't say HI!!1! on my MySpace page! I h4t3 y0u!"

Lately, my titanium exterior has been softened towards Blogs (but only technical ones!). I began to use SSIS in extreme ways the day after RTM. My first project was to de-dupe 41m records of telco (Sprint) customer data. The data was horrendous and SSIS had fuzzy logic! Sounds like a perfect fit! The previous project owner had done it all in T-SQL. Yech! Certainly this wonderful new Microsoft technology can solve all our ills! What I learned over the next few weeks was an important lesson about SSIS that you all should know/learn. SSIS is a very powerful technology, but you have to use it correctly to be effective.

"Well, duh" you may say, but over the last ~2.5 years working with SSIS I've seen people use it in the worst ways EVER. This brings me to the motivation for this blog. I am here to slap some sense into y'all who are making the SSIS gods (and little babies) cry many many rivers of tears. When I started with SSIS there was little RELIABLE technical information for SSIS. Most of what was out there was inaccurate or speculative (even from Microsoft! I know, shocking). There are now several Blogs with some very good information, but there is so much that I have learned that isn't out there. I felt that it would benefit the community to get some of that information out to the unwashed masses.

The tips/lessons I have to post range from Mindless to MindNumbing and I will tell you up front what is difficult and what is simple. I will ATTEMPT to post at least once a week, but I cannot guarantee that regularity. I have a TON of stuff to get out there to start with, so there will most likely be a big burst of postings at the outset and then it will taper down a bit. Also, school will affect how often I post.

Hang on for this wild ride... Big one coming at you to start us off!

posted @ Monday, February 11, 2008 9:56 AM | Feedback (0) |

Introduction Part 1

I started a Blog a few weeks ago to share my knowledge of SSIS with people.  I didn't think it would really get anyone's attention, I just felt like I needed to get information out there because so many people do horrible things with this technology!  In any case, through my blog (all 2 posts :P) and some interaction with a system I had built at a previous employer I drew the attention of Bill here at SQLTeam.com.  He has asked me to blog here, so I am moving the contents of my old Blog (it was Blogger and the interface kinda sucked a lot) to here.  I have my original post and then the start of a technical series that will be 3 or 4 LONG posts.  That is how I got here.  I've also started posting in the forums under the moniker Qualis.  Feel free to mock me endlessly if you wish, but be sure that I will correct HORRIBLE solutions where I find them!

posted @ Monday, February 11, 2008 9:56 AM | Feedback (3) |
