Let me put across a simple XML structure and explain the approach that was used along with the reasons to do so.
<xml> <employees> <employee empCd="E1" fname = "vinuthan" lname="bhat"> <hobbies> <hobby name = "trecking"/> <hobby name = "music"/> </hobbies> </employee> <employee empCd="E2" fname = "vb" lname="b"> <hobbies> <hobby name = "dancing"/> <hobby name = "music"/> </hobbies> </employee> </employees> </xml> |
Obvious choice in SSIS would be to use the data flow task along with the XML source adapter and the oledb destination. One has to select the XML source file and the XSD schema so that the Input and Output buffers form appropriate mapping internally. Here, I would rather prefer to create my own XSD instead of using the Generate XSD. The latter option does not provide the max length value in the XSD generated and by default assumes char length as 255.
Each of the nodes in XML form an output in the pipeline [Right click->Advanced Properties]. The approach is here to extract the data to an intermediate staging tables (that have the same structure as that of our XML) and then load it to main tables from staging tables (breaking of single extract to two different extracts).
Add 4 OLEDB destinations and map each of the outputs to the 4 staging tables that are created. Looking at the attributes under each of the output nodes, one would soon realize that the nodes are related to each other by an internal node id attribute. This makes me think that the nodes get linked like the way we normally do in a master-child database table design. The id value are unique integer value identifier generated by the XML source adapter. For the 4 outputs [Employees, Employee, Hobbies, Hobby] we have, the staging table design looks as follows :-
Table 1: STG.Employees [EmployeesId]
Table 2: STG.Employee [EmployeesId, EmployeeId, EmpCd, fname, lname]
Table 3: STG.Hobbies [HobbiesId, EmployeeId]
Table 4: STG.Hobby [HobbyId, HobbiesId, name]
Now map the outputs to each of these tables respectively. The EmployeesId, EmployeeId, HobbiesId, HobbyId are the the internal ids that I am talking about (you do not see the sample XML above).
Each of the nodes in XML form an output in the pipeline [Right click->Advanced Properties]. The approach is here to extract the data to an intermediate staging tables (that have the same structure as that of our XML) and then load it to main tables from staging tables (breaking of single extract to two different extracts).
Add 4 OLEDB destinations and map each of the outputs to the 4 staging tables that are created. Looking at the attributes under each of the output nodes, one would soon realize that the nodes are related to each other by an internal node id attribute. This makes me think that the nodes get linked like the way we normally do in a master-child database table design. The id value are unique integer value identifier generated by the XML source adapter. For the 4 outputs [Employees, Employee, Hobbies, Hobby] we have, the staging table design looks as follows :-
Table 1: STG.Employees [EmployeesId]
Table 2: STG.Employee [EmployeesId, EmployeeId, EmpCd, fname, lname]
Table 3: STG.Hobbies [HobbiesId, EmployeeId]
Table 4: STG.Hobby [HobbyId, HobbiesId, name]
Now map the outputs to each of these tables respectively. The EmployeesId, EmployeeId, HobbiesId, HobbyId are the the internal ids that I am talking about (you do not see the sample XML above).
Once the data is in staging tables, it is a very simple data extract to the main tables using the OLEDB source and destination adapters with some look up tasks for the master table ids.
So why this approach:
1. An alternate way would require heavy use of merge joins, look ups and sort tasks. The sort task is a big performance hit. In the current approach, we do not have to use the merge join, sort tasks.
2. Complex or large XML in real time require many merge joins, look ups and sort tasks as every output extract requires you to have linking of nodes from the root to enable data extract from the node of interest. In the current approach, the no of intermediate tasks are reduced.
3. As a result of #2, one cannot extract the data in parallel. The output of a node can be used only once from an XML source to destination and extract from each node require us to start linking from the root node. This results us in using multiple data flow task (one of each node) each pointing to same XML physical file. In the current approach, we are maximizing the parallel execution/ data extract capabilities by having all the nodes output from a single XML source to adapter to multiple OLEDB destinations.
Also, when there are many data flow tasks pointing to the same XML file to extract data, one would end up getting an Out of Memory error at one point of time. This would force us to group data flow tasks and make them run sequentially. The out of memory error is being confirmed by Microsoft as limitations.
4. When XSD schema changes, one need to change all the tasks used in the path as a result of the recreation of the lineage ids. In the current approach, as we have reduced the no of tasks, we reduce the effort in the no of tasks that need to be revisited. One cannot avoid this as a result of the way data flow tasks are designed.
5. Simple to use as it is a direct mapping of XML to database tables and then from staging table to main tables.
6. Performance is high and maintenance is easy.
In summary the following points were considered for the approach:
- Performance
- Parallel processing
- Amount of rework
- Simplicity and Maintainability
Hope the "SSIS - XML data extract pattern" is helpful.