Guessing Excel Data Types
Note to Self
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel: TypeGuessRows = 0 means scan everything.
Note to Others
About 10 years ago I stumbled across this bit of information just when I needed it and it saved my project. Then for some reason, a few years later when it would have been nice, but not critical, for some reason I could not find it again anywhere. Well, now I have stumbled across it again, and to preserve my future self from nightmares and sudden baldness due to pulling my hair out, I have decided to blog it in the hopes that I can find it again this way.
Here’s the story… When you query data from an Excel spreadsheet, such as with old-fashioned DTS packages in SQL 2000 (my first reference) or simply with an OLEDB Data Adapter from ASP.NET (recent task) and if you are using the Microsoft Jet 4.0 driver (newer ones may deal with this differently) then you can get funny results where the query reports back that a cell value is null even when you know it contains data.
What happens is that Excel doesn’t really have data types. While you can format information in cells to appear like certain data types (e.g. Date, Time, Decimal, Text, etc.) that is not really defining the cell as being of a certain type like we think of when working with databases. But, presumably, to make things more convenient for the user (programmer) when you issue a query against Excel, the query processor tries to guess what type of data is contained in each column and returns it in an appropriate manner. This is all well and good IF your data is consistent in every row and matches what the processor guessed. And, for efficiency’s sake, when the query processor is trying to figure out each column’s data type, it does so by analyzing only the first 8 rows of data (default setting).
Now here’s the problem, suppose that your spreadsheet contains information about clothing, and one of the columns is Size. Now suppose that in the first 8 rows, all of your sizes look like 32, 34, 18, 10, and so on, using numbers, but then, somewhere after the 8th row, you have some rows with sizes like S, M, L, XL. What happens is that by examining only the first 8 rows, the query processor inferred that the column contained numerical data, and then when it hits the non-numerical data in later rows, it comes back blank. Major bummer, and a real pain to track down if you don’t know that Excel is doing this, because you study the spreadsheet and say, “the data is RIGHT THERE! WHY doesn’t the query see it?!?!” And the hair-pulling begins.
So, what’s a developer to do? One option is to go to the registry setting noted above and change the DWORD value of TypeGuessRows from the default of 8 to 0 (zero). Setting this value to zero will force Jet to scan every row in the spreadsheet before making its determination as to what type of data the column contains. And that means that in the example above, it would have treated the column as a string rather than as numeric, and presto! your query now returns all of the values that you know are in there.
Of course, there is a caveat… if you are querying large spreadsheets, making Jet scan every row can be quite a performance hit. You could enter a different number (more than 8) that you believe is a better sampling of rows to make the guess, but you still have the possibility that every row scanned looks alike, but that later rows are different, and that you might get blanks when there really is data there. That’s the type of gamble, I really don’t like to take with my data.
Anyone with a better approach, or with experience with more recent drivers that have a better way of handling data types, please chime in!