Extract XML structure automatically
Today I am going to write about how to extract the XML structure from a file. The basic idea is to bulk read the file from disk, place the content in an XML variable and traverse elements in the variable and ultimately output a resultset showing the structure of xml file.
I often use this to determine what kind of XML the file is by comparing the returned resultset with a lookup-table.
See comment in code to understand what happens
CREATE PROCEDURE dbo.uspGetFileStructureXML
(
@FileName NVARCHAR(256)
)
AS
-- Prevent unwanted resultsets back to client
SET NOCOUNT ON
-- Initialize command string, return code and file content
DECLARE @cmd NVARCHAR(MAX),
@rc INT,
@Data XML
-- Make sure accents are preserved if encoding is missing by adding encoding information UTF-8
SET @cmd = 'SELECT @Content = CASE
WHEN BulkColumn LIKE ''%xml version="1.0" encoding="UTF%'' THEN BulkColumn
ELSE ''<?xml version="1.0" encoding="UTF-8"?>'' + BulkColumn
END
FROM OPENROWSET(BULK ' + QUOTENAME(@FileName, '''') + ', SINGLE_CLOB) AS f'
-- Read the file
EXEC @rc = sp_executesql @cmd, N'@Content XML OUTPUT', @Content = @Data OUTPUT
-- What? An error?
IF @@ERROR <> 0 OR @rc <> 0
BEGIN
SET @cmd = CHAR(10) + ERROR_MESSAGE()
RAISERROR('The file %s was not read.%s', 18, 1, @FileName, @cmd)
RETURN -100
END
-- Create a staging table holding element names
CREATE TABLE #Nodes
(
NodeLevel INT NOT NULL,
RootName NVARCHAR(MAX) NOT NULL,
ElementName NVARCHAR(MAX) NOT NULL
)
-- Initialize some control variables
DECLARE @NodeLevel INT
-- Begin at root level
SET @NodeLevel = 1
-- Iterate all levels until no more levels found
WHILE @@ROWCOUNT > 0
BEGIN
-- Build a dynamic SQL string for each level
SELECT @cmd = 'SELECT DISTINCT
' + STR(@NodeLevel) + ',
t.n.value(''local-name(..)[1]'', ''VARCHAR(MAX)'') AS RootName,
t.n.value(''local-name(.)[1]'', ''VARCHAR(MAX)'') AS ElementName
FROM @n.nodes(''' + REPLICATE('/*', @NodeLevel) + ''') AS t(n)',
@NodeLevel = @NodeLevel + 1
-- Store the result in the staging table
INSERT #Nodes
(
NodeLevel,
RootName,
ElementName
)
EXEC sp_executesql @cmd, N'@n XML', @n = @Data
END
-- Reveal the XML file structure
SELECT NodeLevel,
RootName,
ElementName
FROM #Nodes
ORDER BY NodeLevel,
RootName
-- Clean up
DROP TABLE #Nodes
GO
If you already have the XML data in a variable, the stored procedure can be simplified as this following code
CREATE PROCEDURE dbo.uspGetVariableStructureXML
(
@Data XML
)
AS
-- Prevent unwanted resultsets back to client
SET NOCOUNT ON
-- Initialize command string, return code and file content
DECLARE @cmd NVARCHAR(MAX),
@rc INT
-- Create a staging table holding element names
CREATE TABLE #Nodes
(
NodeLevel INT NOT NULL,
RootName NVARCHAR(MAX) NOT NULL,
ElementName NVARCHAR(MAX) NOT NULL
)
-- Initialize some control variables
DECLARE @NodeLevel INT
-- Begin at root level
SET @NodeLevel = 1
-- Iterate all levels until no more levels found
WHILE @@ROWCOUNT > 0
BEGIN
-- Build a dynamic SQL string for each level
SELECT @cmd = 'SELECT DISTINCT
' + STR(@NodeLevel) + ',
t.n.value(''local-name(..)[1]'', ''VARCHAR(MAX)'') AS RootName,
t.n.value(''local-name(.)[1]'', ''VARCHAR(MAX)'') AS ElementName
FROM @n.nodes(''' + REPLICATE('/*', @NodeLevel) + ''') AS t(n)',
@NodeLevel = @NodeLevel + 1
-- Store the result in the staging table
INSERT #Nodes
(
NodeLevel,
RootName,
ElementName
)
EXEC sp_executesql @cmd, N'@n XML', @n = @Data
END
-- Reveal the XML file structure
SELECT NodeLevel,
RootName,
ElementName
FROM #Nodes
ORDER BY NodeLevel,
RootName
-- Clean up
DROP TABLE #Nodes
GO
Legacy Comments
MS SQL REMOTE DBA
2009-03-05 |
re: Extract XML structure automatically I would suggest #Nodes table uses a clustered index (surrogate) and using try catch in sps :) BTW new sql server 2008 datatype hierarchy makes such operations much easier. Thanks for the stored procedures. I am always having problems with reading xml from files. And is there a way to directly assign the file contenct to an xml or at least text variable? |
Adam Machanic
2009-03-05 |
re: Extract XML structure automatically Not sure if my previous post went through, but I'll just sum it up: Great job! |
Arnold Fribble
2009-03-05 |
re: Extract XML structure automatically I have an XSLT 2.0 transform that does a similar thing -- not quite the same, it returns an XML document whose tree is all the unique-named element paths rather than a list of unique parent-child element names at each level. However, it does seem to be a good deal faster for large/deep files, so it might make a useful preprocessor, since the rowset returned by dbo.uspGetVariableStructureXML will be the same in both cases. When I say large files, my test file of 103MB returned 2500 rows with a maximum level of 16. dbo.uspGetVariableStructureXML took 8m 56s to run, the XSLT using Saxon b 9.1 took 25s on the same machine. In both cases, about 20s of that seems to have been loading the XML file. <?xml version="1.0" encoding="utf-8"?> <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/> <xsl:template match="/" name="subtree"> <xsl:param name="parents" select="."/> <xsl:for-each-group select="$parents/*" group-by="name()"> <xsl:copy> <xsl:call-template name="subtree"> <xsl:with-param name="parents" select="current-group()"/> </xsl:call-template> </xsl:copy> </xsl:for-each-group> </xsl:template> </xsl:stylesheet> |
Peso
2009-03-05 |
re: Extract XML structure automatically Thanks. In most cases I have found that only first iteration will suffice, ie RootNode. In some cases two iterations will do, ie NodeLevel 1 and NodeLevel 2. |
Greig
2009-03-05 |
re: Extract XML structure automatically I assume it depends on the xml you feed the procedure with. If it is more attribute related than those sps may fail |
Nathan Skerl
2009-03-06 |
re: Extract XML structure automatically If we have data in xml variable already how about CTE route? : declare @Stage table (StageId int identity(1,1) primary key clustered, RootName nvarchar(50), ElementName nvarchar(50)) insert into @Stage (RootName, ElementName) select cast(c.query('fn:local-name(..)') as nvarchar(50)) [RootName], cast(c.query('fn:local-name(.)') as nvarchar(50)) [ElementName] from @Data.nodes(N'//*') x(c) ;with NodeRoot as ( select RootName, ElementName from @Stage ), NodePath as ( select RootName, ElementName, 1 NodeLevel from @Stage where RootName = '' union all select nr.RootName, nr.ElementName, NodeLevel + 1 from NodePath np join NodeRoot nr on np.ElementName = nr.RootName ) select NodeLevel, RootName, ElementName from NodePath order by NodeLevel, RootName |
Nathan Skerl
2009-03-06 |
re: Extract XML structure automatically np.ElementName = nr.RootName ^^ of course requires unique element names ; ) |
Sam
2011-05-16 |
re: Extract XML structure automatically Is it possible do the reverse? I have to upload an XML file to a SQL Server Database. How can I achieve this? Thanks, Sam. |
Judith
2011-08-16 |
re: Extract XML structure automatically i do not know if i have a VMWare. |