XML Persistence Part 2
In the first part of this series I gave a brief overview of XML persistence. Although it was a bit broad, it isolated a few scenarios in which persisting your data as native XML makes sense. Rather than repeat that material, the bullet points below recap some instances where you might benefit from native XML storage.
• You have a dynamic hierarchical data model and a schema that might change with time.
• Your current application domain is closely tied to some industry-standard XML schema, so you have to support the data in XML format for interoperability.
• You have an XML-based SOA architecture and the majority of your use cases involve moving the domain state in some XML format.
Now that you have qualified part or all of your domain persistence requirements for native XML persistence, let’s take a look at what an XML persistence store needs in order to persist enterprise application state. The list comes first, and I explain the requirements in more detail below it. I will use these requirements when discussing various products and strategies in Part 3.
XML Persistence Store Requirements
- Storage of XML in its native format, through support of the XML data model
- Flexible XML schema-less or XML schema bound storage
- Extensible on-the-fly storage of nodes
- XML schema versioning to address evolving standards
- Indexing (element, attribute and full-text)
- Support for XML querying and transformation facilities
- Support the ability to query external data sources (XML and non-XML)
- Support for CRUD operations including node level updates
- XQuery Stored Procedures
- ACID/Distributed Transactions Support
- Enterprise Reliability
- High Availability
- Ability to cache XML and non-XML data sources in a consistent XML view
- APIs – Support for various language bindings (Java, .NET, Web Services, etc…)
- Administrative and development tools
(1) Supporting XML in its native format means several things. The storage mechanism should allow flexibility in storing the hierarchical representation of XML. Because XML can be semi-structured and/or structured (XML Schema), the mechanism should support both schema-less and schema-bound storage. For the most part, when persisting your enterprise data, the XML format will be constrained by a schema, thus making it structured and valid. There are use cases where schema validation is not required and you just want to store arbitrary XML data/messages from various sources. That’s where the flexibility of schema-less storage comes in.
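To make the difference between the two storage modes concrete, here is a minimal sketch in Python. It is not how any particular product works: the `XmlStore` class, its `bind` method, and the "required child elements" rule are all hypothetical, and the rule is only a stand-in for real XML Schema validation, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET


class XmlStore:
    """Toy in-memory store: schema-less by default, schema-bound on request."""

    def __init__(self):
        self.docs = []       # parsed documents, kept as element trees
        self.required = {}   # root tag -> set of required child tags

    def bind(self, root_tag, required_children):
        # Hypothetical stand-in for registering an XML Schema for a document type.
        self.required[root_tag] = set(required_children)

    def put(self, xml_text):
        root = ET.fromstring(xml_text)   # raises ParseError if not well-formed
        rule = self.required.get(root.tag)
        if rule is not None:             # schema-bound: enforce the constraint
            missing = rule - {child.tag for child in root}
            if missing:
                raise ValueError(f"invalid <{root.tag}>: missing {sorted(missing)}")
        self.docs.append(root)           # schema-less: well-formedness is enough
        return len(self.docs) - 1        # list position acts as a document id
```

With `bind("order", ["customer", "total"])` in place, an `<order>` without a `<total>` is rejected, while an arbitrary `<note>` message is still accepted because no constraint is registered for it.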
Schema versioning is also a very important requirement. One of the biggest benefits of XML, and of XML Schema, is the flexibility of dynamic hierarchical storage. With that in mind, a schema applies a set of constraints on the data, and those constraints might change as application requirements evolve. One of the restrictions in the relational world is that when your constraints change, you have to go back and update all of the data that was collected before the new constraints were imposed. In some instances that meant modifying the data and/or purging it. This was resolved through various means, but they weren’t very pretty. With schema versioning, each schema change produces a new version, which imposes its constraints on all new data. The old data still references the previous schema version number, thus allowing it to remain unchanged. This is very important, especially in applications whose data and requirements are sensitive to change, whether for regulatory or other reasons.
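The idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `SCHEMAS` registry, the `store_versioned` and `is_valid` helpers, and the use of "required child elements" in place of a real XML Schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema registry: version number -> required child elements.
SCHEMAS = {
    1: {"name"},
    2: {"name", "taxId"},   # a later requirement added a constraint
}
CURRENT_VERSION = 2


def store_versioned(docs, xml_text, version=CURRENT_VERSION):
    """Validate against one schema version and record that version with the document."""
    root = ET.fromstring(xml_text)
    missing = SCHEMAS[version] - {child.tag for child in root}
    if missing:
        raise ValueError(f"schema v{version}: missing {sorted(missing)}")
    docs.append((version, root))


def is_valid(entry):
    """Re-check a stored document against the version it was stored under."""
    version, root = entry
    return SCHEMAS[version] <= {child.tag for child in root}
```

A document stored under version 1 keeps passing `is_valid` after version 2 is introduced, because it is only ever checked against the version it references. New documents must satisfy version 2.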
Indexing is yet another requirement if you want to retrieve the stored data efficiently. The database should support multiple indexing levels (i.e. element, attribute, and full text).
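As a rough illustration of the three indexing levels, the sketch below builds simple inverted indexes over a handful of documents. The `build_indexes` function and its dictionary layout are my own assumptions. A real native XML database would use far more sophisticated, persistent index structures.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_indexes(docs):
    """Build element, attribute and full-text indexes over {doc_id: xml_text}."""
    element_idx = defaultdict(set)     # element name -> doc ids
    attribute_idx = defaultdict(set)   # (attribute name, value) -> doc ids
    text_idx = defaultdict(set)        # lowercased word -> doc ids
    for doc_id, xml_text in docs.items():
        for node in ET.fromstring(xml_text).iter():
            element_idx[node.tag].add(doc_id)
            for name, value in node.attrib.items():
                attribute_idx[(name, value)].add(doc_id)
            # Index text inside the element and text that follows it (mixed content).
            content = (node.text or "") + " " + (node.tail or "")
            for word in re.findall(r"\w+", content.lower()):
                text_idx[word].add(doc_id)
    return element_idx, attribute_idx, text_idx
```

A lookup such as `attribute_idx[("lang", "de")]` then returns the ids of matching documents without scanning every document again.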
(2) Data retrieval is probably the most important aspect of storing data. The flexibility and efficiency of retrieving data can make or break an application. The popularity of relational storage is in part due to the standardization and popularity of SQL, which allows efficient retrieval of data from a relational store. It’s optimized for retrieving data from conceptually rectangular data sets. XML has various query standards. The one that most closely resembles SQL in its promise and abilities is XQuery. Before XQuery, XML data was mostly queried using XPath and XML parsers (e.g. SAX, DOM). The former was limited in its abilities; the latter was extremely tedious and cumbersome, as it was low level.
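The contrast between the two older approaches is easy to see side by side. The sketch below runs the same query twice using only the Python standard library: once as a hand-written DOM walk and once as a path expression. Note that ElementTree implements only a small subset of XPath 1.0, not XQuery, and that the `CATALOG` document and function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

CATALOG = (
    "<catalog>"
    "<book year='2005'><title>XQuery Basics</title></book>"
    "<book year='1999'><title>XML in Brief</title></book>"
    "</catalog>"
)


def titles_dom(xml_text, year):
    """Low-level DOM walk: every step of the navigation is spelled out."""
    result = []
    doc = minidom.parseString(xml_text)
    for book in doc.getElementsByTagName("book"):
        if book.getAttribute("year") == year:
            title = book.getElementsByTagName("title")[0]
            result.append(title.firstChild.data)
    return result


def titles_path(xml_text, year):
    """The same query as a single path expression (ElementTree's XPath subset)."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.findall(f"./book[@year='{year}']/title")]
```

Both functions return the same titles for a given year. The path version states what is wanted, while the DOM version spells out how to walk the tree, which is where the tedium comes from as queries grow.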
XQuery is a flexible query specification for XML. It defines and works on the XQuery 1.0/XPath 2.0 data model (XDM), which is an abstraction over XML data. XDM is based on the XML Infoset and adds a few other features (XML Schema types, document collections, ordered/heterogeneous sequences, and typed atomic values). XPath 2.0 is a subset of XQuery. XQuery stored procedures make more efficient use of XQuery, thus offering benefits similar to those of PL/SQL, T-SQL, IBM DB2 SQL, etc.
The querying capabilities should support CRUD operations with node-level granularity. This is important because, when storing data that conforms to a schema, the data will be stored in a single collection and row, so that collection row becomes the full storage schema for your application. You can logically shred your XML document into multiple collections/rows, but this is tedious, and it becomes less attractive if the database supports granular operations with the same consistency and efficiency as updating a tuple in a relational store.
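A node-level update can be sketched with ElementTree as follows. The `update_node` helper and the sample `order` document are hypothetical, and the example works on an in-memory tree. A native XML database would do the equivalent against its stored representation, ideally without rewriting the whole document.

```python
import xml.etree.ElementTree as ET


def update_node(root, path, new_text):
    """Replace the text of the first node matched by `path`; the rest is untouched."""
    node = root.find(path)
    if node is None:
        raise KeyError(f"no node matches {path!r}")
    node.text = new_text
    return root


order = ET.fromstring(
    "<order id='42'>"
    "<customer><name>Ann</name><city>Oslo</city></customer>"
    "<status>open</status>"
    "</order>"
)
update_node(order, "./status", "shipped")   # only <status> changes
```

The customer subtree and the `id` attribute are left exactly as they were, which is the property you want from a store that holds a whole aggregate in one document.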
XSLT is also an important standard that allows result sets to be transformed into other formats. It lets you perform the transformation with native XML facilities, from within XQuery and/or some native persistence storage API. Ideally the native XML database supports XSLT 2.0, as it works on the same data model as XQuery (XDM).
(3,4,5,6) These requirements are consistent with those of any enterprise information system (EIS) that supports high availability and transactional requirements. I won’t go into any more detail about them, as they are already very well defined within the realm of relational storage and the same principles apply here.
(7) As with any other storage medium, you must access it somehow. APIs for the most common programming environments are very important. Though you might be using one language/framework today, these requirements/standards have a way of changing, so you have to plan for the future. In other words, don’t just ensure that your current language/environment is supported, but rather ensure that you have the flexibility to choose now and in the future. I won’t get into a language war here, but one important standard to watch is XQJ (XQuery API for Java), under development as JSR 225. XQJ, though not very new, is now picking up steam in the Java community, and most vendors are beginning to implement its interfaces. XQJ is very similar in functionality to JDBC, but is specifically designed for connecting to and manipulating XQuery data sources. The day will come when XQJ is packaged as part of the JSE, thus allowing most persistence providers to supply a driver for their implementation. I hope such standards are also embraced and developed in other programming environments with time.
(8) Administrative tools are very important for any database product. Whether it’s a command line interface or a GUI-based tool, they allow more efficient administration and maintenance of the database. Development tools are also a great complement, as they allow for rapid application development. Most commercial and open source relational database vendors do a great job of providing such tools.
In the discussion of particular native XML database products in the next part, I will cover some of the tools and features. In the meantime, there are also tools out there that support XQuery/XPath/XSLT execution through a GUI environment, hook into multiple XQuery engines, and provide intuitive debuggers. I personally enjoy working with Oxygen, which currently supports Saxon, eXist, Berkeley DB XML, X-Hive/DB, MarkLogic, and TigerLogic for XQuery. It also supports Saxon, Xalan, and JAXP for XSLT transformations. Oxygen is cross-platform, with distributions for most major platforms (Windows, OS X, Linux). Though there are some other products that provide the same level of support, Oxygen is comparatively very inexpensive and works great on OS X.
Above I provided the list of requirements that we used to guide our native XML persistence needs. Here is an idea of what’s to come in the next few parts of this series.
Part 3 – Evaluation of various native XML database products (mostly the ones that we have looked at, as well as ones that we have researched).
Part 4 – Efficient XML storage architecture and strategy.
Part 5 – XQOM (XQuery Object Mapping) framework introduction.
I’d also like to hear what you are interested in, as well as your experiences, comments, etc. I really appreciate all of your input. I received numerous responses to Part 1 of this series and I will try to use most of those comments/suggestions in the following parts.