Latest Posts
I'll be giving a talk at the
Philly.NET Code Camp on 5/17/2008.
The
Single Responsibility Principle.
Dependency Injection.
Inversion of Control.
If you don't already know and apply these concepts in your day-to-day development this talk is for you.
The talk will present a very easy learning curve into these topics. We will cover the "why" as well as the "how." We will take a simple application written without these techniques and transform it step-by-step.
While we will see that these things make testing your code much easier, we will also see that their use goes far beyond testing scenarios.
In order to better communicate the "how" , we will not be using any pre-existing
IOC framework such as
Castle,
Spring.NET,
StructureMap, etc. Rather, we will build our own simple implementation during the talk.
In
this post James Kovacs is talking about becoming a ReSharper Jedi. Recently I attended a training class given by
JP Boodhoo who is probably Obi Wan to these guys, if not Yoda himself.
I admit to being completely overwhelmed for most of the week, not the least reason being the obsession with this topic -
Keyboard=Good Mouse=Bad.
I was shocked that someone would do training this way - after all, I was there to learn programming concepts, and that was greatly hindered by the fact that I just couldn't keep up.
But the real shock for me was how enthusiastic the class was (other than me that is). They took to it big time. Everyone wanted to learn which key combination did what and worked really hard at it. Well except for me. I gave up on that after the first day and was a lot happier.
Now, I've worked at call centers and collection agencies for most of my career, so I definitely appreciate how the keyboard is better when speed of input is the primary goal. You want your collectors or reps talking and typing and hitting the next call key even while hanging up as quickly and graciously as possible with the current call. I've got that. I've DONE that :)
But I can tell you for certain that in my programming career it is never the speed of clicking
File | New that slows my development efforts. Or the speed of using the mouse to right click on the project and Add a Class. Sure hitting some key combination would be FASTER. But I just don't need that kind of speed. Most of my time is spent thinking. I'm a pretty fast typist when I'm actually putting some code down, but... I just don't see programming as a hand-eye coordination type of endeavor.
It is really a bit humbling to me. I imagine these guys are so much smarter than I am, that in their mind they have already written the next 3 classes and they just need to push them onto the disk as fast as they can so they can move on to the next project. Wow!
After a long while I've finally managed to create a release version. I've also added some cool new features.
It is completely free and has no expiration date like the prior beta versions.
I've also changed the RSS feed to Feedburner to which you can subscribe to at: http://feeds.feedburner.com/SsmsToolsPack
If you're subscribed to the old feed please change it.
In SSMS Tools Pack 1.0 you can find these features:
- Uppercase/Lowercase keywords:
Set all keywords to uppercase or lowercase letters. Custom keywords can be added.
- Run one script on multiple databases:
Run selected or full window text on selected databases on the currently connected server.
- Copy execution plan bitmaps to clipboard:
Copy selected or all execution plans to a bitmap that is saved on the clipboard.
- Search Results in Grid Mode and Execution Plans:
Find all occurrences of your search string in the execution plans or in the results in datagrid mode.
- Generate Insert statements for a single table, the whole database or current resultsets in grids:
Generate insert statement from your data.
- Query Execution History (Soft Source Control):
Save all executed queries to file or database and easily find them.
- Text document Regions and Debug sections:
Add Regions and Debug section in your scripts to ease development experience.
- Running custom scripts from Object explorer's Context menu:
Speedy execution of custom scripts from Object Explorer's context menus.
- CRUD (Create, Read, Update, Delete) stored procedure generation:
Generate Customizable CRUD stored procedures for all tables in your database.
- New query template:
Create a template that is shown when creating a new query window.
Currently supported SQL Server Management Studio versions are:
SQL Server Management Studio 2008 CTP
SQL Server Management Studio 2005
SQL Server Management Studio 2005 Express
Hope you enjoy it!
And if you're feeling extra generous there's always PayPal. :)
The nomenclature used for referencing software releases and updates can be quite confusing. Let's consider an example to illustrate what the various terms mean.
As a new version of a product is being developed, it may be made available to select customers and community members for early testing. This is sometimes called alpha builds of the product.
As development progresses and the product becomes more and more polished, it's provided to a wider audience. This used to be called beta releases; for example beta 1, beta 2, etc. However a few years ago Microsoft changed the terminology for SQL Server pre-releases. They are now referred to as CTPs (Community Technology Previews). You can download the November CTP, for example.
As the product enters it's final stages before release, the feature set is complete and the product is undergoing final testing, it's called an RC (Release Candidate).
After a product has undergone significant testing and it's determined that no more changes will be made to the product before release, it's sometimes said that the product has gone golden. It's also called a GA (General Availability) release.
Once the bits been turned over to a company to mass produce the media (CDs, DVDs, etc), it's RTM'd (Released To Manufacturing).
Usually sometime around the RTM, the product version is "launched". The timing of the launch may or may not have any correlation with the time the product is actually available for purchase. The launch has more to do with marketing and product feature education than availability.
Finally the product is released! It's available for purchase from the normal distribution channels.
Over time, Hot Fixes are created by the dev team to address specific product issues affecting certain customers. Sometimes the issue is so wide spread, a GDR (General Distribution Release) is issued so that all customers will receive the updates.
Since hot fixes and GDRs are designed to quickly address specific problems encountered by specific customers, they can be issued rather often. The rapidity of the hot fixes and GDR's makes it impractical for many IT shops to keep up with the pace of the releases. So, a CU (Cumulative Update) is created that contains all of the applicable hot fixes. This makes it easier for customers who haven't been directly affected by the issues that sparked the hot fixes to remain current.
Once a large enough collection of changes have been gathered, an SP (Service Pack) will be issued. Historically, SPs have also been the release vehicle used to deliver new features that were not ready at the time of GA. For example, Database Mirroring was made available in SP1. SP2 brought us the custom reports as in the Performance Dashboard. Microsoft has since indicated that SPs will not be used as a release vehicle for new features.
And then the whole cycle starts overs. Feel like you're on a treadmill yet?
For more information, check out the following links:
Cheers!
Joe
I recently found out that we are not properly checking for data corruption in our "Integrity Checks" SQL job. I thought that a SQL job would fail if the job step that runs DBCC CHECKDB returned errors, but apparently that's not the case. Check out Paul Randal's blog for more information.
To properly check for data corruption via a SQL job, you should raise an error if @@ERROR does not equal zero.
In the next few days, I'll rewrite my isp_DBCC_CHECKDB stored procedure to include this. I will be testing my code change against a database that contains corruption. See this thread for more information.
As most you already know, Notification Services is not part of Microsoft SQL Server 2008. I think this is a shame since SSNS is really a great product. It may be rather complex at first glance, but it's a great product nonetheless. I'll probably devote a blog to it's abrupt deprecation at some point in the future.
In the meantime, if you're preparing to upgrade to SQL Server 2008, you should verify that you don't have any rogue SSNS instances running on any of your servers. I hope your environment is more controlled than that, but I've seen more than one shop where change control was, well in a word, uncontrolled.
So, how can you tell if a server has a SSNS installed on it? Run the following query in the msdb database. It'll list all SSNS instances installed on that SQL Server instance.
SELECT
InstanceName
,DatabaseName
FROM msdb.NS90.NSInstances
Cheers!
Joe
Here's an obscure piece of SQL you may not be aware of: The "ALL" option when using a GROUP BY.
Consider the following table:
Create table Sales
(
SaleID int identity not null primary key,
CustomerID int,
ProductID int,
SaleDate datetime,
Qty int,
Amount money
)
insert into Sales (CustomerID, ProductID, SaleDate, Qty, Amount)
select 1,1,'2008-01-01',12,400 union all
select 1,2,'2008-02-25',6,2300 union all
select 1,1,'2008-03-02',23,610 union all
select 2,4,'2008-01-04',1,75 union all
select 2,2,'2008-02-18',52,5200 union all
select 3,2,'2008-03-09',99,2300 union all
select 3,1,'2008-04-19',3,4890 union all
select 3,1,'2008-04-21',74,2840
SaleID CustomerID ProductID SaleDate Qty Amount
----------- ----------- ----------- ----------------------- ----------- ---------------------
9 1 1 2008-01-01 00:00:00.000 12 400.00
10 1 2 2008-02-25 00:00:00.000 6 2300.00
11 1 1 2008-03-02 00:00:00.000 23 610.00
12 2 4 2008-01-04 00:00:00.000 1 75.00
13 2 2 2008-02-18 00:00:00.000 52 5200.00
14 3 2 2008-03-09 00:00:00.000 99 2300.00
15 3 1 2008-04-19 00:00:00.000 3 4890.00
16 3 1 2008-04-21 00:00:00.000 74 2840.00
(8 row(s) affected)
Suppose we'd like to see the customers that were sold Product #1 along with the total amount that they spent.
We would basically write a simple SELECT with a GROUP BY like this:
select CustomerID, sum(Amount) as TotalAmount
from Sales
where ProductID = 1
group by CustomerID
And sure enough, we'd get our answer:
CustomerID TotalAmount
----------- ---------------------
1 1010.00
3 7730.00
(2 row(s) affected)
Now, let's say that we'd like to see
all customers that have been sold
any products, but we still just want to see the "TotalAmount" for ProductID #1. For customers that have never ordered ProductID #1, it should output a "TotalAmount" value of $0. One way to do this is with a CASE expression; instead of filtering so that only ProductID #1 is returned, we can conditionally SUM() the Amount only for orders for ProductID #1. Like this:
select CustomerID, sum(case when ProductID=1 then Amount else 0 end) as TotalAmount
from Sales
group by CustomerID
CustomerID TotalAmount
----------- ---------------------
1 1010.00
2 0.00
3 7730.00
(3 row(s) affected)
That gives us the results we want. Because we are not using a WHERE clause to filter the data, we see an entry for CustomerID #2 in the output.
However, in situations where you have written the above SQL, you could actually replace the SUM(CASE...) expression by using
GROUP BY ALL, instead of just a standard GROUP BY, like this:
select CustomerID, sum(Amount) as TotalAmount
from Sales
where ProductID = 1
group by all CustomerID
CustomerID TotalAmount
----------- ---------------------
1 1010.00
2 NULL
3 7730.00
Warning: Null value is eliminated by an aggregate or other SET operation.
(3 row(s) affected)
Notice that now
all Customers are now returned, and a NULL is shown as the TotalAmount for Customer #2, who has no orders for ProductID #1 ... Even though though the WHERE clause seems to indicate that we should
not be seeing customer #2 in the results!
The ALL option basically says "ignore the WHERE clause when doing the GROUPING, but still apply it for any aggregate functions". So, in this case, the WHERE clause is not considered when generating the population of CustomerID values, but it is applied when calculating the SUM. This is very much like our first solution, where we removed the WHERE clause completely, and used a SUM(CASE...) expression to conditionally calculate the aggregate.
Values that are excluded from the aggregation according to the WHERE clause have NULL values returned, as you can see in the result. A simple ISNULL() or COALESCE() will allow us to return 0 instead of NULL:
select CustomerID, isnull(sum(Amount),0) as TotalAmount
from Sales
where ProductID = 1
group by all CustomerID
CustomerID TotalAmount
----------- ---------------------
1 1010.00
2 0.00
3 7730.00
Warning: Null value is eliminated by an aggregate or other SET operation.
(3 row(s) affected)
Notice that the warning about NULL being aggregated still displays, since that is the standard behavior in SQL Server when you calculate an aggregate on a NULL value. You can turn these warnings off if you like for the during of the batch by issuing a
set ANSI_WARNINGS off command before your SELECT.
GROUP BY ALL is kind of obscure and neat to know, but not really useful in most situations since there are usually easier or better ways to get this result. For one thing, this won't work if we want all Customers to be displayed, since a customer
must have at least one order to show up in the result. If we want to see all customers, even those that have never ordered, we would need to do a LEFT OUTER JOIN from the Customers table to our Orders aggregate SELECT:
create table Customers (CustomerID int primary key)
insert into Customers
select 1 union all
select 2 union all
select 3 union all
select 4
-- Notice that we have 4 customers, but our Sales data has sales for only 3.
select c.customerID, isnull(s.TotalAmount,0) as TotalAmount
from Customers c
left outer join
(select customerID, sum(Amount) as TotalAmount
from Sales
where ProductID = 1
group by customerID) s on c.customerID = s.customerID
customerID TotalAmount
----------- ---------------------
1 1010.00
2 0.00
3 7730.00
4 0.00
(4 row(s) affected)
That is typically the standard way to return data for an entire population, regardless of existing transactions. GROUP BY ALL gets us close, but if a new customer has never made an Order, they will never show up in the results. Of course, depending on your needs, that may be what you want.
Another limitation is we can not use GROUP BY ALL if we want to return a grand total for all orders, along with the total just for ProductID #1. For example, using the SUM(CASE...) expression along with a regular SUM(), we can do this:
select CustomerID, sum(case when ProductID=1 then Amount else 0 end) as Product1Amount,
sum(Amount) as TotalAmount
from Sales
group by CustomerID
CustomerID Product1Amount TotalAmount
----------- --------------------- ---------------------
1 1010.00 3310.00
2 0.00 5275.00
3 7730.00 10030.00
(3 row(s) affected)
That lets us calculate two different totals all in one pass through the table. However, we cannot translate that using GROUP BY ALL, because while we will be able to return the Product1Amount, there would be no easy way to also get the TotalAmount for all products without an additional join or sub-query.
. . .
So, that's the story with GROUP BY ALL. It is interesting, and not widely well-known, and may even make for a good interview question if you really want to see how much SQL a candidate knows. But for practical purposes, it is pretty rarely used and there are generally better ways to get the same results more easily or more efficiently.
Anyone have a good situation or an example of where GROUP BY ALL really worked well for you? Be sure to share your experiences in the comments.
Out of the box, BizTalk Server 2006 doesn't provide any tool for auto clearing and cleaning up database used by BizTalk, this cause problems like performance and data storage issues. While browsing BizTalk installation directory (\BizTalk Installation Folder\Schema) I've found out few sql scripts that can used to cleanup the MessageBox = BizTalkMsgBoxDb and Tracking = BizTalkDTADb database.
To clean MessageBox:
1) Stop all BizTalk related services.
2) Open Analyzer and open sql script: msgbox_cleanup_logic.sql, press F5 (this will create the sp) - be sure to use BizTalkMsgBoxDb.
3) run bts_CleanupMsgbox
4) Start services.
For Tracking:
1) Execute dtasp_CleanHMData sp from BizTalkDTADb
There are many sql script located in that Schema Directory and by reading and analyzing this scripts it will give us better understanding on how BizTalk stores the data.
Concurrency is confusing. Terms line “optimistic concurrency control” and “pessimistic concurrency control” can mean quite different things depending on the context. I hope to clarify some points surrounding concurrency as it relates to application and database design with a series of posts. In this first post, I’ll review some basic concurrency concepts and highlight some differences in concurrency models from a database and application perspective.
In computer science, “concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible” (from the Concurrency control wiki). Every major DBMS supports ACID transactional reliability so one might think “correct results” would be a non-issue. In practice however, both the application and database must work in concert to provide the appropriate level of data integrity and performance while minimizing user rework to address conflicts and deadlocks.
Concurrency Control: Optimistic, Pessimistic and Chaos
“Optimistic”, “pessimistic” and “chaos” refer to the safeguards one takes based on the likelihood of concurrent updates and how much rework is acceptable. Optimistic concurrency control is used when it is unlikely that different users will update the same data. In the unlikely event that the same data is updated by different users, the conflict is detected when data are saved and the second user must redo/merge changes in order to prevent overwriting the changes made by the first user. Users of a well designed optimistic concurrency application experience fast response time and are inconvenienced only in the rare case of an update conflict.
Pessimistic concurrency control is used when it is likely that the same data will be updated by different users. To prevent the need to redo or merge changes, an application serializes data access so that only one user can edit data at a time. The obvious downside is that subsequent users must wait until preceding user(s) has completed their changes and this can increase response time or data unavailability. However, overall user productivity can be better than optimistic currency control because rework is avoided.
Chaos concurrency control (also known as Anarchy) is used in situations when concurrent updates are not possible or “last in wins” is acceptable. No safeguards need to be taken with chaos concurrency because there is either no chance of conflicts or overwrites are ok. Chaos concurrency is typically used in single-user applications or in multi-user applications where data are segregated in such a way that concurrent updates are either not possible (e.g. unique web session key) or so unlikely (e.g. CustomerID key) that the risk of lower concurrency level isn’t warranted.
The choice between optimistic, pessimistic and chaos involves striking a balance between data availability, integrity, rework and development effort. It is important to pick the concurrency model appropriate for the task at hand. Using the wrong model can result in unnecessary blocking, long response times and data problems. For example, using a chaos model in a multi-user system (which is often done inadvertently simply because concurrency wasn’t considered during development) can lead to lost updates.
Concurrency Control Semantics
The meanings of optimistic vs. pessimistic concurrency control are different depending on whether the context is the application or database server. Here’s an excerpt from the SQL Server Books Online:
|
· Optimistic concurrency control works to minimize reader/writer blocking. With optimistic concurrency control methods, read operations do not use read locks that block data modification operations.
· Pessimistic concurrency control works to ensure that read operations access current data and that data being read cannot be modified. With pessimistic concurrency control methods, read operations use read locks that block data modification. The locks placed by a read operation are released when the read operation is finished.
|
Source: SQL Server 2005 Books Online, Types of Concurrency Control (http://msdn.microsoft.com/en-us/library/ms189132.aspx)
Now let’s take a look at Types of Concurrency Control in the Visual Studio 2008 .NET documentation:
|
· Optimistic concurrency control: A row is unavailable to other users only while the data is actually being updated. The update examines the row in the database and determines whether any changes have been made. Attempting to update a record that has already been changed results in a concurrency violation.
· Pessimistic concurrency control: A row is unavailable to users from the time the record is fetched until it is updated in the database.
|
Visual Studio 2008 .NET, Types of Concurrency Control (http://msdn.microsoft.com/en-us/library/cs6hb8k4.aspx)
Here the table summarizing the VS and SQL 2005 documentation:
|
|
Optimistic
|
Pessimistic
|
|
|
SQL Server
|
Application
|
SQL Server
|
Application
|
|
Readers block readers
|
No
|
No
|
No
|
Yes
|
|
Readers block writers
|
No
|
No
|
Yes
|
Yes
|
|
Writers block writers
|
Yes
|
Yes
|
Yes
|
Yes
|
Table 2: Application Concurrency Control
You might be surprised to learn that readers block other readers when an application uses pessimistic concurrency control. I should clarify that “readers block readers” is typically done only when the intent of reading data is a subsequent update (e.g. a user clicks edit) and it is likely that other sessions will try to edit the same data. This approach ensures no conflicts can occur when data are later saved and prevents data overwrites but at the cost of concurrent data access.
I think some confusion about concurrency stems from the fact that applications may or may not leverage database features in concurrency control implementation. For example, a pessimistic control application might use database transactions and locking mechanisms to ensure that only one user can edit inventory quantity. The same pessimistic application could instead serialize data access in the middle tier and perform data access using an optimistic or even chaos model. Concurrency control implementation details depend much on application architecture; there is not necessarily a right or wrong approach as long as concurrency objectives are met.
My next post in this concurrency model series will discuss how SQL Server transaction isolation levels and row versioning in relate to concurrency control. I’ll also discuss how applications can leverage these features to maximize concurrency.
I've previously shown how to setup Database Mirroring in SQL Server 2005 with T-SQL, but how do you failover to the mirrored databases using T-SQL? Here's how:
--Run on principal
USE master
GO
ALTER DATABASE dbName SET SAFETY FULL
GO
ALTER DATABASE dbName SET PARTNER FAILOVER
GO
--Run on new principal
USE master
GO
ALTER DATABASE dbName SET SAFETY OFF
GO
If you are using synchronous database mirroring, just ignore the "SET SAFETY" commands.
We use asynchronous database mirroring for performance reasons as our sites are 300 miles apart. Even though we have a fast network in between the two sites, the latency is too high when we tried it in synchronous mode.
Just as the title says :)
You can read the article here.
Thanx to Kevin Kline for publishing it!
You can expect:
This is a low intensity post (you should be able to do this in your sleep).
I Expect (that you have the following experience):
- Simple SSIS Dataflow work
- That you have read and understood/implemented parts 1 and 2
Performance Tuning for SSIS Packages.
Preparation
This is the 3rd and final part to this series. In this part, I will cover techniques and settings in your SSIS packages to tune them so that they perform at appropriate speeds. To really test speed, I need lots of data to test against. The first thing I'm going to do is generate 3m rows of sample data. I generated a table with an identity (duh!), a few varchar fields with varying length (Replicate('ge3', 1 + @cnt % 80) etc.), a few int fields, a few decimal fields and two date fields. The reason I want all this is because I want to make this as close to a real world test as I can. I then exported the data to a flat file (pipe delimited because I like pipes!) If you want my script to create the test data, ask me in a comment and I'll post it.
Tuning Test 1
For our first tuning test, I am importing the file I just exported. At 3m rows with variable data, it will be a good test of different methods and settings. I created a new package and set up the logging options as in parts 1 and 2. I also truncated my SampleData table and will use that for my inserts. In the package, I created 4 DataFlows for my inserts and 4 SQL Statements for truncating my SampleData table.
- For DataFlow 1, I created a Flat File Source and in the flat file connection manager, I specified all the columns as varchar(250). I then created an OLE DB Destination and set it to use: FastLoad, TableLock and did not map my identity column (I could have told it to keep my identity if I wanted).
- For DataFlow 2, I followed the pattern of DataFlow 1 except in the connection manager, I used the Suggest Types (1k records) feature on the Advanced tab.
- For DataFlow 3, I used a SQL Server Destination which requires all mapped columns to be the EXACT type in the table, so in my connection manager, I had to ensure the data types were exact.
- For DataFlow 4, I also used a SQL Server Destination, but I used the first connection manager and a Data Conversion transform to take the varchar(250) fields and convert them to the correct data type.
- Notes:
- OLE DB Destinations allow you to insert 1 record at a time or use FastLoad which does a TRUE Bulk Load (use SQL Profiler if you don't believe me). Using FastLoad is PREFERRED!! You will have poor performance if you don't use it.
- I left the default of 0 for Commit Size on the OLE DB Destination. This allows the Bulk Load operation to choose how much data to commit. This is typically the DefaultBufferMaxRows setting on the DataFlow. This is a tuning option that can help performance if you are having memory bottlenecks.
- I used Table Lock on the OLE DB Destinations. This will ensure that you have full access to the table and will enable the Bulk Load operation to perform better.
- SQL Server Destinations require the package to be running on the SAME server as the database and the database MUST be a SQL Server 2005 database.
Here are the results:
| Source |
DataFlowSource |
RecordCount |
RecordsPerSec |
| Simple File Import With OLEDB |
Sample Data Destination - Simple with OLEDB |
3000000 |
24793.39 |
| Typed File Import With OLEDB |
Sample Data Destination - Typed with OLEDB |
3000000 |
27272.73 |
| Typed File Import With SQLServer |
Sample Data Destination - Typed with SQLServer |
3000000 |
30612.25 |
| Simple File Import With SQLServer |
Convert the data to the Table Types |
3000000 |
28037.38 |
| Simple File Import With SQLServer |
Sample Data Destination - Typed with SQLServer |
3000000 |
28037.38 |
As you can see and might have expected, Typed imports are faster. This is because the SQL Server is not having to do conversions. The SSIS engine is doing those conversions. Also expected is that the Data Conversion transformation slows down the process a little. One thing that is very important to note is that I am running these tests on my desktop and not a robust server. The differences between the last 3 tests is so close that you should consider them to be identical in terms of performance.
Tuning Test 2
Taking the fastest method from above, I performed a test with different transform items. I created a basline DataFlow to compare results against. I then used a few common items to demonstrate the performance of SSIS. This is just to show you how to proceed in your own testing.
| Source |
DataFlow Source |
Record Count |
RecordsPerSecond |
| Baseline Numbers |
Sample Data Destination - Typed with SQLServer |
3000000 |
27272.73 |
| Multicast Test |
Multicast |
3000000 |
27272.73 |
| Multicast Test |
Sample Data Destination - Typed with SQLServer |
3000000 |
27272.73 |
| Integer Conditional Split Test |
Conditional Split |
3000000 |
26548.67 |
| Integer Conditional Split Test |
Union All |
3000000 |
26548.67 |
| Integer Conditional Split Test |
Sample Data Destination - Typed with SQLServer |
3000000 |
26548.67 |
| FindString Conditional Split |
Conditional Split |
3000000 |
26086.96 |
| FindString Conditional Split |
Union All |
3000000 |
26086.96 |
| FindString Conditional Split |
Sample Data Destination - Typed with SQLServer |
3000000 |
25862.07 |
As you can see, the Multicast caused no degradation in performance whatsoever. This is because the output buffers reference the same memory. The integer test in the conditional split and the subsequent Union All only had a negligible effect on performance. The FindString split had a little bit more of a performance hit, but the effect was also very minor considering you are doing a string search!
SSIS is a very powerful tool. You can do a lot of things in a lot of different ways. It is up to the developer to pick the right tool for the right job. I can't stress this enough. You have to use the tool to become familiar with its power. You need to get in and play around with the features and figure out how to do things in different ways. The old adage of "practice makes perfect" applies directly to SSIS.
The goal of this post is to get you to think about how/what to test. The first set of tests was designed to help you think through how to do a certain task and then test the performance of your setup. The second set of tests was designed to show you that the built in controls will perform better than you might expect.
Lacking from this testing methodology is the impact on system resources. The focus of this test was purely for speed of execution. You can see from the results that doing fairly simple tasks works extremely well even on a desktop box without the robustness in memory and hard drive pipelines that you will get with a server. In addition to testing the speed of execution, you should familiarize yourself with PerfMon and use it to monitor the system resources. For instance, the Fuzzy logic components use a lot of memory as does the Sort component (they are asynchronous components and can cache some or all of the data coming into them).
-Madman out-
You can expect:
This is a medium intensity post (you should not get a migraine, but I make no promises).
I Expect (that you have the following experience):
- Error Logging experience
- UDF (User Defined Function) creation experience
Creating a template Package that handles errors.
(also how to create a template and make it available)
SSIS can be the wild west. Each developer doing things their own way. Errors are often overlooked and then when one happens no one knows why or how to fix it. Logging is often not sufficiently enabled. This causes maintenance headaches (especially with Excel, but that is a rabbit trail which dives steeply into the lowest reaches of Hell itself!). What I am going to present with this post is how to log, compile and disseminate error information and not have your package just fail with no idea why.
Logging, GO!
The first thing we must do is create a new package (duh!). Next we turn error logging on. I've shown this before and here it is again (right click on the Control Flow canvas and go to Logging...):
And then set the logging details:
Handle That Event!
Next, we need to handle the error. We could handle the OnError event, but that would fire on EVERY error that was generated. A Task can generate an infinite (well, you get the idea) number of errors, so we don't want to fire an infinite number of error handlers! What we really want to know is when a task fails. We do this by setting up an event handler for the OnTaskFailed event (duh!):
After we create the event handler, we want to create a place to store the error information we are about to gather, so we create a variable. I named mine ErrorMessageText. Here is where things get fun. We want to get all of the error information from the logging we've set up. First thing we need to do is to set up some SQL to get it. You will want to create a UDF for this because otherwise we are in the maintenance nightmare again. Here is a sample that you can use:
Create Function GetErrors(@ExecutionID varchar(50))
Returns varchar(8000) As
Begin
Declare @Result varchar(8000)
Select @Result = Coalesce(@Result, '') + '>>' + message + Char(10) + Char(13)
From sysdtslog90
Where event = 'OnError' And executionid = @ExecutionID
Order By id
Return @result
End
Now that we have an easy way to collect the error information (notice that I am passing in the execution id and retrieving all the errors for that execution) we need to get the data out of the database and into our message variable. Grab an Execute SQL Task and configure it like so:
If you configure it just like this, it will work. Note that the Execute SQL Task is VERY touchy about this kind of configuration and demands it be configured just so.
Next, we are going to drop in a Script Task to grab the rest of the information we want to present. In the ReadOnlyVariables put: ErrorMessageText, CreationDate, CreatorComputerName, CreatorName, ExecutionInstanceGUID, InteractiveMode, MachineName, OfflineMode, PackageID, PackageName, StartTime, UserName, VersionBuild, VersionGUID, VersionMajor, VersionMinor, ExecutionValue, SourceDescription, SourceID, SourceName
This allows us to produce as much information as possible for debugging. Design the script and use something like the following for your Sub Main:
Public Sub Main()
Dim Result As String = "System Variables:" & vbCrLf
For Each Item As Variable In Dts.Variables
If Item.Name <> "ErrorMessageText" Then
If Not Item.Value Is Nothing Then
Result &= Item.Name & " = " & Item.Value.ToString() & vbCrLf
Else
Result &= Item.Name & " = Nothing" & vbCrLf
End If
End If
Next
Result &= vbCrLf & "Error Mesages:" & vbCrLf & _
Dts.Variables("ErrorMessageText").Value.ToString()
Dts.Variables("ErrorMessageText").Value = Result
Dts.TaskResult = Dts.Results.Success
End Sub
This will collect all the information we need for our error report. We'll then email this to the appropriate person. This ought to be configured in your package with the configuration in a central location. I am not covering configuration in this session, however.
Templates, Templates, Templates!!!
The next part is really hard. No, really.... Put the template package in your
drive:\Program Files\Microsoft Visual Studio 8\Common7\IDE\PrivateAssemblies\ProjectItems\DataTransformationProject\DataTransformationItems folder [1]. Done. Now you can right click on your project in the Solution Explorer, go to Add, New Item... and pick the template.
Now everyone uses the same error notification system and you have all the useful information you need when someone decides to change the file format of a data feed that you've been receiving for years without telling you! You can now proudly go to your boss and tell them "It's not my code! Blame the customer!!".
-Madman Out-
References:
[1] http://support.microsoft.com/kb/908018 Accessed 4/30/2008
I apologize for the long pause in posting. I have been extremely busy with my Masters program. I took my finals on Monday for this semester, so I finally have some free time!! I've prepared a wrap-up to the Performance Framework series. It describes some sample testing sets that should get you thinking about how to go about tweaking performance and picking the best method to use to accomplish a goal.
-tired Madman Out-
SQL Server 2005 introduced the new DMV views. They are great and a real improvement to debug and optimize queries.
I find sys.dm_db_index_physical_stats very useful and often write this type of code
page_count
FROM sys.dm_db_index_physical_stats(DB_ID('MyDB'), OBJECT_ID('MyTable'), NULL, NULL, NULL)
to find out if the query optimizer has choosen the "right" index for the query.
This can be done in SQL Server 2000 too!
Use the DBCC SHOWCONTIG command. Maybe most of you have only used this with tables too see table fragmentation?
Well, you can use it for indexes too.
Use
DBCC SHOWCONTIG ('MyDB..MyTable') WITH ALL_INDEXES, TABLERESULTS
SELECT index_id,