Importing XML to a Relational Database Part II
In this second part (part I) focusing on how to import XML into a relational database I will provide details on a solution that I have developed. I am not primarily interested in importing XML "databases", but more interested in importing a single record because once a single record is imported it is simpler to apply the same logic to more than one record. I will also not be concerned about generating tables based from XML schema or DTD because I assume that a database already exists. To do this I will be using PHP, xPath, and MySQL.
You can find a draft of the guide to the xml2db tool at: http://download.whedimer.com/doc/xml2db.pdf .
From the first part of this series, I mentioned that importing XML into a relational database can be done by mapping particular text nodes in the XML to particular fields in database tables using some sort of mapping. This is the approach I will be taking, using PHP to handle the parsing and implementation, xPath to handle the selection of XML nodes and MySQL as the database to import into.
The focus will be on creating a tool does the following in order:
- reads and loads the mapping configuration
- checks the mapping for validity
- performs the mapping and imports the data
The overall tasks are: reading the mapping file, creating a model of mapping, validating the model, and then use the model to execute insertions in the database. Before we can begin on the first step, reading and loading the mapping configuration, we must decide what sort of markup should be used to specify the mapping. In part one I used XML to show a basic mapping between an database field and a static value and will extend upon it further to include more data sources. To fully understand what kinds of mappings are required, we have to understand XML schema, such as the fact that a node's existence can be optional or a node might be empty. I will not be explaining XML or how to write or understand XML schema or DTDs.
The Mapping File
A mapping file is a list of maps from some datasource to a single database field. The most common datasources are static values, and XML node(s) in the file to import. There are some other less common sources, but lets focus on the common ones. So we create a mapping prototype.
Note: we will not use attributes of nodes due to various issues they raise.
<mapset>
<map>
<dbfield>table1.field1</dbfield>
<value>static value</value>
</map>
</mapset>
With this prototype and every revision to it, we will implement everything up to and including validation. We use PHP to load it into DOMDocument object, making sure it load correctly before moving on to the next step, getting values from the DOMDocument into our own model. What I did was to create a MapSet object that took a file and created a list of Map objects that included all the data required for a single mapping. The Map object has a type varaible based on the type of datasource it was given. An alternative would be create a separate object for each map type (or datasource) that extends a base Map object.
For this prototype, we just need the MapSet object to create Map and configure them with the database field and the static value using the Map's accessor methods for those fields. More specifically the MapSet object gets DOMXPath from the DOMDocument and uses the xPath expression "/mapset/map" to get a NodeList, which is iterated upon to generate each Map object. When the MapSet has no more Map objects to create, then we pass MapSet to a checking function.
For now all the checking function does is to ensure the supplied database field exists, which can be done either by executing a select statement and checking for errors, or by first getting the list of tables, the fields for the tables mentioned in the database, store them, and then compared against them. We must check the mapping file because we do not want to come across a table or field that does not exist when executing SQL statements, and then tell the user that a partial import happened, leaveing the database in an unknown state.
In the next part we'll extend the prototype to actually use data from the xml data file we want to import.
Importing XML to a Relational Database Part I
I've come to a project that requires some data in XML format to be imported into a relational database. My understanding of the XML files is that they may be in various standard formats, suggesting that they all are well defined and conforms to some schema. The goal is form a mapping from XML data to the database, so how does one accomplish this? There are XML importing tools for Microsoft Access, but I do not have personal resources to spend on part-time projects and so those are not an option. A brief inspection of those family of tools suggests that they mostly import the XML files by create table structures based on the XML (more specifically, its schema), which seems useless when the goal is to add to an existing database. Therefore, if adding the structure does not work, then just add the data to the existing table structure.
Stating the solution is always simpler than implementing it or working out all the details. Getting the data out and forming it into a new structure is fairly straight forward when the XML is known, but becomes a bit more difficult when constructing a generic solution that could work with any XML file and a wider range of database designs. The non-generic solution involves finding all 1-1 mappings between the XML and database fields, selecting the XML data, forming SQL insert/update queries in the appropriate order deemed by the particular structure of the database (such as inserting rows that may be used as foreign keys first), and performing any additional queries.
A generic solution does not know which order to perform the SQL queries, or what XML nodes to map, or the structure of the database or XML file. The generic solution takes in account situations where the XML has fewer or more fields that what the database requires without having to rewrite code and mundane insert queries. For example, and a set of XML files describe a set of metal parts for some machine. The database tracks whether the part is metal, plastic, or rubber. The XML files have no field describing the fact that they are made from metal, yet to be imported into the database they must have that property. One could simply alter the non-generic solution by changing the insert statement, or one could change some mapping markup. For example (not exactly what I am using):
<map>
<dbfield>part.partmaterial</dbfield>
<value>metal</value>
</map>
In this xml fragment it is clear that the field partmaterial of the table part is mapped to a static value metal. Then this mapping can be used for all metal parts and if a new part comes by that is known to be metal, then the same XML config file can use. This formatting style also makes it completely obvious what is being mapped to what, unlike an SQL insert query.
The solution I am pursuing is using a configuration file (or mapping file) to map particular XML text nodes to fields in a database and focusing on 1-1 mappings first and ignoring runtime constants (such as surrogate keys) until everything else works. There are limitations and does nothing about forming triggers or procedures based on the XML data. I don't plan on supporting them, so the problem is solved.
As of now the generic tool is not completed from end to end (from reading the mapping file, validating the mapping file, checking for mapping for semantic errors, forming and executing the SQL queries, defining runtime constants). In the future this post may be updated or new posted be created with additional information on the process.
Comments and constructive criticism are welcomed in the comments.
PHP Lib Testing
Lately the PHP has gained a few units tests. I know usually the tests are done while the code is being developed and there is less to gain by writing tests after the fact, but the point it to help future development. The whole library is not completely cover by test cases, but it in progress. Currently, I received some consultant work to do and so increases the unit tests will have to wait a while.
PHP lib
I created a PHP library for my Master's PILOT and had been debating on whether to release it to the public in a GPL license. I finally realized that there are many others PHP libraries out there with many more features and proven uses and so the chances of my library being used in any significant way is small. So why release it then? It might teach someone something and would increase the quality of the library with potential public eyes on it. I spent some time creating a script that automatically creates an tarball of the latests version from the repository and can be found at download.whedimer.com/snapshot/wlib_php. Stable releases will eventually come.
Criticism on the library is welcome so long as the language used is civil and specific towards a particular aspect (no vague or general criticisms please). Keep in mind I have little professional experience and so keep your expectations low.
Purpose
Welcome to the Whedimer blog. Expect to find posts recording my efforts in computer programming and documentation. While there are many development blogs out there the goal is not to compete with them. Nearly all of the good development and tech blogs out there are authored by people with a decent amount of professional experience. However with the case of this blog, the author is one with little professional experience. Some facts about the author:
- graduate and undergraduate degree in computer science
- two internships (one software engineering, one research) and one graduate research assistantship
- liberal arts education
- web development experience (graduate school work)
Expect addional posts convering various topics in the future.
PHP Lib Fully Commented
The first round of commenting for the PHP library is now complete. More commenting is needed to bring it up the the standards that I think are nescessary to call the code "well commented". All the classes, instance variables, methods and functions have autodoc comments. For those who might wonder why the commenting is happening after the writing of the code, it was simply a matter of time. The code was written as part of my Master's project and so proper code standards were ignored to focuse on the problem I was working on. The same sort of thing will happen with automated testing, which is currently being worked on.