Importing XML to a Relational Database Part I
I've come to a project that requires some data in XML format to be imported into a relational database. My understanding of the XML files is that they may be in various standard formats, suggesting that they all are well defined and conforms to some schema. The goal is form a mapping from XML data to the database, so how does one accomplish this? There are XML importing tools for Microsoft Access, but I do not have personal resources to spend on part-time projects and so those are not an option. A brief inspection of those family of tools suggests that they mostly import the XML files by create table structures based on the XML (more specifically, its schema), which seems useless when the goal is to add to an existing database. Therefore, if adding the structure does not work, then just add the data to the existing table structure.
Stating the solution is always simpler than implementing it or working out all the details. Getting the data out and forming it into a new structure is fairly straight forward when the XML is known, but becomes a bit more difficult when constructing a generic solution that could work with any XML file and a wider range of database designs. The non-generic solution involves finding all 1-1 mappings between the XML and database fields, selecting the XML data, forming SQL insert/update queries in the appropriate order deemed by the particular structure of the database (such as inserting rows that may be used as foreign keys first), and performing any additional queries.
A generic solution does not know which order to perform the SQL queries, or what XML nodes to map, or the structure of the database or XML file. The generic solution takes in account situations where the XML has fewer or more fields that what the database requires without having to rewrite code and mundane insert queries. For example, and a set of XML files describe a set of metal parts for some machine. The database tracks whether the part is metal, plastic, or rubber. The XML files have no field describing the fact that they are made from metal, yet to be imported into the database they must have that property. One could simply alter the non-generic solution by changing the insert statement, or one could change some mapping markup. For example (not exactly what I am using):
<map>
<dbfield>part.partmaterial</dbfield>
<value>metal</value>
</map>
In this xml fragment it is clear that the field partmaterial of the table part is mapped to a static value metal. Then this mapping can be used for all metal parts and if a new part comes by that is known to be metal, then the same XML config file can use. This formatting style also makes it completely obvious what is being mapped to what, unlike an SQL insert query.
The solution I am pursuing is using a configuration file (or mapping file) to map particular XML text nodes to fields in a database and focusing on 1-1 mappings first and ignoring runtime constants (such as surrogate keys) until everything else works. There are limitations and does nothing about forming triggers or procedures based on the XML data. I don't plan on supporting them, so the problem is solved.
As of now the generic tool is not completed from end to end (from reading the mapping file, validating the mapping file, checking for mapping for semantic errors, forming and executing the SQL queries, defining runtime constants). In the future this post may be updated or new posted be created with additional information on the process.
Comments and constructive criticism are welcomed in the comments.