Sunday, August 22, 2010

ssis->xml data extract pattern

Today, I would like to share one of the better ways to extract the XML data based on some cool work done over the last couple of weeks. Let me call it "SSIS - XML data extract pattern"

Let me put across a simple XML structure and explain the approach that was used along with the reasons to do so.

<employee empCd="E1" fname = "vinuthan" lname="bhat">
   <hobby name = "trecking"/>
   <hobby name = "music"/>
<employee empCd="E2" fname = "vb" lname="b">
   <hobby name = "dancing"/>
   <hobby name = "music"/>

Obvious choice in SSIS would be to use the data flow task along with the XML source adapter and the oledb destination. One has to select the XML source file and the XSD schema so that the Input and Output buffers form appropriate mapping internally. Here, I would rather prefer to create my own XSD instead of using the Generate XSD. The latter option does not provide the max length value in the XSD generated and by default assumes char length as 255.

Each of the nodes in XML form an output in the pipeline [Right click->Advanced Properties]. The approach is here to extract the data to an intermediate staging tables (that have the same structure as that of our XML) and then load it to main tables from staging tables (breaking of single extract to two different extracts).

Add 4 OLEDB destinations and map each of the outputs to the 4 staging tables that are created. Looking at the attributes under each of the output nodes, one would soon realize that the nodes are related to each other by an internal node id attribute. This makes me think that the nodes get linked like the way we normally do in a master-child database table design. The id value are unique integer value identifier generated by the XML source adapter.  For the 4 outputs [Employees, Employee, Hobbies, Hobby] we have, the staging table design looks as follows :-

Table 1: STG.Employees [EmployeesId]
Table 2: STG.Employee [EmployeesId, EmployeeId, EmpCd, fname, lname]
Table 3: STG.Hobbies [HobbiesId, EmployeeId]
Table 4: STG.Hobby [HobbyId, HobbiesId, name]

Now map the outputs to each of these tables respectively. The EmployeesId, EmployeeId, HobbiesId, HobbyId are the the internal ids that I am talking about (you do not see the sample XML above). 

Once the data is in staging tables, it is a very simple data extract to the main tables using the OLEDB source and destination adapters with some look up tasks for the master table ids.

So why this approach:
1. An alternate way would require heavy use of merge joins, look ups and sort tasks. The sort task is a big performance hit. In the current approach, we do not have to use the merge join, sort tasks.

2. Complex or large XML  in real time require many merge joins, look ups and sort tasks as every output extract requires you to have linking of nodes from the root to enable data extract from the node of interest. In the current approach, the no of intermediate tasks are reduced.

3. As a result of #2, one cannot extract the data in parallel. The output of a node can be used only once from an XML source to destination and extract from each node require us to start linking from the root node. This results us in using multiple data flow task (one of each node) each pointing to same XML physical file. In the current approach, we are maximizing the parallel execution/ data extract capabilities by having all the nodes output from a single XML source to adapter to multiple OLEDB destinations.

Also, when there are many data flow tasks pointing to the same XML file to extract data, one would end up getting an Out of Memory error at one point of time. This would force us to group data flow tasks and make them run sequentially. The out of memory error is being confirmed by Microsoft as limitations.

4. When XSD schema changes, one need to change all the tasks used in the path as a result of the recreation of the lineage ids. In the current approach, as we have reduced the no of tasks, we reduce the effort in the no of tasks that need to be revisited. One cannot avoid this as a result of the way data flow tasks are designed.

5. Simple to use as it is a direct mapping of XML to database tables and then from staging table to main tables.

6. Performance is high and maintenance is easy.

In summary the following points were considered for the approach:
  • Performance
  • Parallel processing
  • Amount of rework
  • Simplicity and Maintainability

Hope the "SSIS - XML data extract pattern" is helpful.


  1. Can you upload a sample package representing your example.

  2. Thanks for your solution. I am have to do the same task. In you example,, EmployeesId, EmployeeId, HobbiesId, HobbyId are the internal Id. Is there any way that we can make that column as an identity column? And can we insert the values of the EmployeeID in the Employee Table(Identity column value) into the Hobby Table(I want EmployeeID column in the Hobby table).
    And I am new in XML. I am not getting it that from that XML file,,how we get 4 tables instead of 2.
    I have to achieve that same. Can you please help.

    1. The approach is to stage the table first in the same format as XML and then generate the data in the main table. To check whether it itself can be used as identity column, you can check whether inner node restarts from 1 in each of the parent nodes. If yes then you cannot use as identity column

  3. i tried to do the execute the above package but i am not getting the HobbyId column in the Hobby table.when i execute the same package again, its giving me the same internal id for EmployeesId, EmployeeId, HobbiesId, HobbyId.
    In my application i am getting the same king of XML file and I have to load it into the SQL server tables. I am getting 10 tables from the XML file.
    Can you help me out in that. How to create the relationship in that.

  4. This comment has been removed by the author.

  5. can you please upload the package again or screen shots

  6. yes could you please upload the package again...or have screen shots, thank you so much!

  7. Hello Dude,

    XML is best suited to storing data, so it's inevitable that at some point someone will ask you to pull information from a database and manipulate it as though it were XML. Thanks for sharing it.....

    Best Web Scraping Software

    1. Yes true....I have in recent times got many instances where data came in XML format. It could be great if microsoft address some pain points in new release when XSD changes.

  8. Hi is it possible to get the pakages thanks ?

  9. Can you provide more details on how you get the staging data into the main tables and transform the SSIS generated IDs to unique IDs


Note: Only a member of this blog may post a comment.


Related Posts with Thumbnails