Thinking outside the box

Patron Saint of Lost Yaks
posts - 203, comments - 734, trackbacks - 4

My Links

Advertisement

News

Archives

Post Categories

Extract XML structure automatically

Today I am going to write about how to extract the XML structure from a file. The basic idea is to bulk read the file from disk, place the content in an XML variable and traverse elements in the variable and ultimately output a resultset showing the structure of xml file.

I often use this to determine what kind of XML the file is by comparing the returned resultset with a lookup-table.
See comment in code to understand what happens


CREATE PROCEDURE dbo.uspGetFileStructureXML
(
    @FileName NVARCHAR(256)
)
AS

-- Prevent unwanted resultsets back to client
SET NOCOUNT ON

-- Initialize command string, return code and file content
DECLARE @cmd NVARCHAR(MAX),
        @rc INT,
        @Data XML

-- Make sure accents are preserved if encoding is missing by adding encoding information UTF-8
SET     @cmd = 'SELECT  @Content = CASE
                                      
WHEN BulkColumn LIKE ''%xml version="1.0" encoding="UTF%'' THEN BulkColumn
                                       ELSE ''<?xml version="1.0" encoding="UTF-8"?>'' + BulkColumn
                                   END
                FROM    OPENROWSET(BULK ' + QUOTENAME(@FileName, '''') + ', SINGLE_CLOB) AS f'

-- Read the file
EXEC    @rc = sp_executesql @cmd, N'@Content XML OUTPUT', @Content = @Data OUTPUT

-- What? An error?
IF @@ERROR <> 0 OR @rc <> 0
        BEGIN
                SET     @cmd = CHAR(10) + ERROR_MESSAGE()
                RAISERROR('The file %s was not read.%s', 18, 1, @FileName, @cmd)
                RETURN  -100
        END

-- Create a staging table holding element names
CREATE TABLE #Nodes
             (
                 NodeLevel INT NOT NULL,
                 RootName NVARCHAR(MAX) NOT NULL,
                 ElementName NVARCHAR(MAX) NOT NULL
             )

-- Initialize some control variables
DECLARE     @NodeLevel INT

-- Begin at root level
SET     @NodeLevel = 1

-- Iterate all levels until no more levels found
WHILE @@ROWCOUNT > 0
   
BEGIN
  
         -- Build a dynamic SQL string for each level
            SELECT   @cmd = 'SELECT  DISTINCT
                                     ' + STR(@NodeLevel) + ', 
                                     
t.n.value(''local-name(..)[1]'', ''VARCHAR(MAX)'') AS RootName,
                                     t.n.value(''local-name(.)[1]'', ''VARCHAR(MAX)'') AS ElementName
                             FROM    @n.nodes(''' + REPLICATE('/*', @NodeLevel) + ''') AS t(n)',
                     @NodeLevel = @NodeLevel + 1

           
-- Store the result in the staging table
            INSERT  #Nodes
                    (
                       
NodeLevel,
                        RootName,
                        ElementName
                    )
            EXEC    sp_executesql @cmd, N'@n XML', @n = @Data
    END

-- Reveal the XML file structure
SELECT    NodeLevel,
         
RootName,
          ElementName
FROM      #Nodes
ORDER BY  NodeLevel,
          RootName

-- Clean up
DROP TABLE  #Nodes
GO

If you already have the XML data in a variable, the stored procedure can be simplified as this following code


CREATE PROCEDURE dbo.uspGetVariableStructureXML
(
    @Data XML
)
AS

-- Prevent unwanted resultsets back to client
SET NOCOUNT ON

-- Initialize command string, return code and file content
DECLARE @cmd NVARCHAR(MAX),
        @rc INT 

-- Create a staging table holding element names
CREATE TABLE #Nodes
             (
                 NodeLevel INT NOT NULL,
                 RootName NVARCHAR(MAX) NOT NULL,
                 ElementName NVARCHAR(MAX) NOT NULL 
             )

-- Initialize some control variables
DECLARE     @NodeLevel INT

-- Begin at root level
SET     @NodeLevel = 1

-- Iterate all levels until no more levels found
WHILE @@ROWCOUNT > 0
   
BEGIN
  
         -- Build a dynamic SQL string for each level
            SELECT   @cmd = 'SELECT  DISTINCT
                                     ' + STR(@NodeLevel) + ', 
                                    
t.n.value(''local-name(..)[1]'', ''VARCHAR(MAX)'') AS RootName,
                                     t.n.value(''local-name(.)[1]'', ''VARCHAR(MAX)'') AS ElementName
                             FROM    @n.nodes(''' + REPLICATE('/*', @NodeLevel) + ''') AS t(n)',
                     @NodeLevel = @NodeLevel + 1

           
-- Store the result in the staging table
            INSERT  #Nodes
                    (
                       
NodeLevel,
                        RootName,
                        ElementName
                    )
            EXEC    sp_executesql @cmd, N'@n XML', @n = @Data
    END

-- Reveal the XML file structure
SELECT    NodeLevel,
         
RootName,
          ElementName
FROM      #Nodes
ORDER BY  NodeLevel,
          RootName

-- Clean up
DROP TABLE  #Nodes
GO

Print | posted on Thursday, March 05, 2009 3:06 PM | Filed Under [ SQL Server 2008 Algorithms SQL Server 2005 ]

Feedback

Gravatar

# re: Extract XML structure automatically

I would suggest #Nodes table uses a clustered index (surrogate) and using try catch in sps :)
BTW new sql server 2008 datatype hierarchy makes such operations much easier.
Thanks for the stored procedures. I am always having problems with reading xml from files.
And is there a way to directly assign the file contenct to an xml or at least text variable?
3/5/2009 5:22 PM | MS SQL REMOTE DBA
Gravatar

# re: Extract XML structure automatically

Not sure if my previous post went through, but I'll just sum it up: Great job!
3/5/2009 5:39 PM | Adam Machanic
Gravatar

# re: Extract XML structure automatically

I have an XSLT 2.0 transform that does a similar thing -- not quite the same, it returns an XML document whose tree is all the unique-named element paths rather than a list of unique parent-child element names at each level.

However, it does seem to be a good deal faster for large/deep files, so it might make a useful preprocessor, since the rowset returned by dbo.uspGetVariableStructureXML will be the same in both cases.

When I say large files, my test file of 103MB returned 2500 rows with a maximum level of 16. dbo.uspGetVariableStructureXML took 8m 56s to run, the XSLT using Saxon b 9.1 took 25s on the same machine. In both cases, about 20s of that seems to have been loading the XML file.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>

<xsl:template match="/" name="subtree">
<xsl:param name="parents" select="."/>
<xsl:for-each-group select="$parents/*" group-by="name()">
<xsl:copy>
<xsl:call-template name="subtree">
<xsl:with-param name="parents" select="current-group()"/>
</xsl:call-template>
</xsl:copy>
</xsl:for-each-group>
</xsl:template>
</xsl:stylesheet>
3/5/2009 7:13 PM | Arnold Fribble
Gravatar

# re: Extract XML structure automatically

Thanks.

In most cases I have found that only first iteration will suffice, ie RootNode.

In some cases two iterations will do, ie NodeLevel 1 and NodeLevel 2.
3/5/2009 9:08 PM | Peso
Gravatar

# re: Extract XML structure automatically

I assume it depends on the xml you feed the procedure with. If it is more attribute related than those sps may fail
3/5/2009 10:12 PM | Greig
Gravatar

# re: Extract XML structure automatically

If we have data in xml variable already how about CTE route? :

declare @Stage table (StageId int identity(1,1) primary key clustered, RootName nvarchar(50), ElementName nvarchar(50))

insert into @Stage (RootName, ElementName)
select cast(c.query('fn:local-name(..)') as nvarchar(50)) [RootName],
cast(c.query('fn:local-name(.)') as nvarchar(50)) [ElementName]
from @Data.nodes(N'//*') x(c)

;with NodeRoot as
( select RootName,
ElementName
from @Stage
),
NodePath as
( select RootName,
ElementName,
1 NodeLevel
from @Stage
where RootName = ''
union all
select nr.RootName,
nr.ElementName,
NodeLevel + 1
from NodePath np
join NodeRoot nr on
np.ElementName = nr.RootName
)
select NodeLevel, RootName, ElementName
from NodePath
order
by NodeLevel, RootName
3/6/2009 2:16 AM | Nathan Skerl
Gravatar

# re: Extract XML structure automatically


np.ElementName = nr.RootName


^^ of course requires unique element names ; )
3/6/2009 2:31 AM | Nathan Skerl
Gravatar

# re: Extract XML structure automatically

Is it possible do the reverse? I have to upload an XML file to a SQL Server Database. How can I achieve this?

Thanks,
Sam.
5/16/2011 8:14 AM | Sam
Gravatar

# re: Extract XML structure automatically

i do not know if i have a VMWare.
8/16/2011 8:05 PM | Judith
Comments have been closed on this topic.

Powered by:
Powered By Subtext Powered By ASP.NET