Sunday, August 22, 2010

ssis->xml data extract pattern

Today, I would like to share one of the better ways to extract XML data, based on some cool work done over the last couple of weeks. Let me call it the "SSIS - XML data extract pattern".

Let me start with a simple XML structure and explain the approach that was used, along with the reasons for it.


<xml>
  <employees>
    <employee empCd="E1" fname="vinuthan" lname="bhat">
      <hobbies>
        <hobby name="trekking"/>
        <hobby name="music"/>
      </hobbies>
    </employee>
    <employee empCd="E2" fname="vb" lname="b">
      <hobbies>
        <hobby name="dancing"/>
        <hobby name="music"/>
      </hobbies>
    </employee>
  </employees>
</xml>

The obvious choice in SSIS would be to use a data flow task with the XML source adapter and an OLE DB destination. One has to select the XML source file and the XSD schema so that the input and output buffers are mapped appropriately internally. Here, I would rather create my own XSD instead of using the Generate XSD option. The generated XSD does not provide maxLength values and by default assumes a string length of 255.

Each node in the XML forms an output in the pipeline [Right click -> Advanced Properties]. The approach here is to extract the data into intermediate staging tables (with the same structure as our XML) and then load the main tables from the staging tables (breaking a single extract into two separate extracts).

Add 4 OLE DB destinations and map each of the outputs to one of the 4 staging tables. Looking at the attributes under each output node, one soon realizes that the nodes are related to each other by an internal node id attribute, so the nodes get linked much like a master-child database table design. The id values are unique integer identifiers generated by the XML source adapter. For the 4 outputs [Employees, Employee, Hobbies, Hobby], the staging table design looks as follows:

Table 1: STG.Employees [EmployeesId]
Table 2: STG.Employee [EmployeesId, EmployeeId, EmpCd, fname, lname]
Table 3: STG.Hobbies [HobbiesId, EmployeeId]
Table 4: STG.Hobby [HobbyId, HobbiesId, name]

Now map the outputs to each of these tables respectively. EmployeesId, EmployeeId, HobbiesId and HobbyId are the internal ids I am talking about (you do not see them in the sample XML above).
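
As a rough sketch, the staging tables could be created as below. The column types and lengths here are my assumptions for illustration; the internal ids generated by the XML source adapter are integers, and the string lengths should match the maxLength values in your hand-written XSD.

-- Staging table sketch (illustrative types/lengths, not generated by SSIS)
CREATE TABLE STG.Employees (
    EmployeesId INT NOT NULL             -- internal id from the XML source adapter
);

CREATE TABLE STG.Employee (
    EmployeesId INT NOT NULL,            -- links back to STG.Employees
    EmployeeId  INT NOT NULL,            -- internal id of this node
    EmpCd       VARCHAR(10) NULL,
    fname       VARCHAR(50) NULL,
    lname       VARCHAR(50) NULL
);

CREATE TABLE STG.Hobbies (
    HobbiesId  INT NOT NULL,             -- internal id of this node
    EmployeeId INT NOT NULL              -- links back to STG.Employee
);

CREATE TABLE STG.Hobby (
    HobbyId   INT NOT NULL,              -- internal id of this node
    HobbiesId INT NOT NULL,              -- links back to STG.Hobbies
    name      VARCHAR(50) NULL
);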

Once the data is in the staging tables, it is a very simple data extract into the main tables using the OLE DB source and destination adapters, with some lookup tasks for the master table ids.
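
For instance, the hobby rows can be tied back to their employee purely through the internal ids. A minimal sketch, assuming a flattened target table (the target table name and columns below are illustrative, not part of the package):

-- Sketch: walk the internal ids from child to parent to flatten the data.
-- dbo.EmployeeHobby and its columns are assumed names for illustration.
INSERT INTO dbo.EmployeeHobby (EmpCd, HobbyName)
SELECT e.EmpCd, hb.name
FROM STG.Employee e
JOIN STG.Hobbies hs ON hs.EmployeeId = e.EmployeeId
JOIN STG.Hobby   hb ON hb.HobbiesId  = hs.HobbiesId;

Note that there is no sort or merge join anywhere; plain relational joins on the internal ids do all the linking.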

So why this approach?
1. An alternate way would require heavy use of merge join, lookup and sort tasks. The sort task is a big performance hit. In the current approach, we do not have to use merge join or sort tasks at all.

2. Complex or large real-world XML requires many merge join, lookup and sort tasks, as every output extract requires you to link the nodes all the way from the root to the node of interest. In the current approach, the number of intermediate tasks is reduced.

3. As a result of #2, one cannot extract the data in parallel. The output of a node can be used only once from an XML source to a destination, and extracting from each node requires us to start linking from the root node. This results in multiple data flow tasks (one for each node), each pointing to the same physical XML file. In the current approach, we maximize the parallel execution/data extract capability by feeding all the node outputs from a single XML source adapter to multiple OLE DB destinations.

Also, when many data flow tasks point to the same XML file to extract data, one eventually ends up with an Out of Memory error. This forces us to group the data flow tasks and run them sequentially. The out of memory error has been confirmed by Microsoft as a limitation.

4. When the XSD schema changes, one needs to change all the tasks in the path because the lineage ids are recreated. In the current approach, since we have reduced the number of tasks, we reduce the number of tasks that need to be revisited. This rework cannot be avoided entirely because of the way data flow tasks are designed.

5. It is simple to use, as it is a direct mapping of the XML to database tables, and then from the staging tables to the main tables.

6. Performance is high and maintenance is easy.

In summary, the following points were considered for the approach:
  • Performance
  • Parallel processing
  • Amount of rework
  • Simplicity and Maintainability

Hope the "SSIS - XML data extract pattern" is helpful.

Thursday, August 12, 2010

ssis->dimension without a business key

Wondering how to handle dimensions with no business keys? Well, here is a cool way out.
Recently I have been extracting data from XML files with nodes whose attributes are all marked as NULLABLE in the XSD schema.

Soon I realised that such a table cannot have a definite business key. So how does one decide whether a row is an update or an insert? No truncate and reload, please :)

How about creating a key based on the combination of all columns (excluding IDs and any non-business columns)? Hmmm. The next thing that came to mind was to create a hash code.

The good guy, SQL Server, made my life easy as it provides a cool function called CHECKSUM(column1, column2, ...). So now generate a hash over the old values and compare it with a hash over the new column values to decide whether the data needs to be updated or inserted.

This strategy can also be used to avoid unnecessary updates when a row arrives from the source but none of the column values have changed.

A common example is an Address dimension [Street address, City, State, Country, Postal Code], where no single attribute can be treated as a business key.
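
A minimal T-SQL sketch of the idea, assuming staging and dimension tables named STG.Address and DIM.Address (illustrative names, not from the package). The CHECKSUM over all business columns acts as the pseudo business key. Note that CHECKSUM can collide, so it trades a small risk for simplicity:

-- Insert only those incoming rows whose hash does not already exist.
-- STG.Address / DIM.Address are assumed names for illustration.
INSERT INTO DIM.Address (StreetAddress, City, State, Country, PostalCode)
SELECT DISTINCT s.StreetAddress, s.City, s.State, s.Country, s.PostalCode
FROM STG.Address s
WHERE CHECKSUM(s.StreetAddress, s.City, s.State, s.Country, s.PostalCode)
      NOT IN (SELECT CHECKSUM(d.StreetAddress, d.City, d.State, d.Country, d.PostalCode)
              FROM DIM.Address d);

When an internal or surrogate id is available to join the staging table to the dimension, the same hash comparison can gate the UPDATE so that unchanged rows are skipped, as mentioned above.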

Here is a snapshot of how it is done in an SSIS package.

