September, 2006


16
Sep 06

XML Persistence Part 2

In the first part of this series I gave a brief overview of XML persistence.  Although it was a bit broad, it isolated a few scenarios in which persisting your data as native XML makes sense.  Rather than repeat that material, the bullet points below recap some cases where you might benefit from native XML storage.

•    You have a dynamic hierarchical data model and a schema that might change with time.
•    Your current application domain is closely tied to some industry-standard XML schema, so you have to support the data in XML format for interoperability.
•    You have an XML-based SOA and the majority of your use cases involve moving the domain state in some XML format.

Now that you have qualified part or all of your domain persistence requirements for native XML persistence, let’s look at what an XML persistence store needs in order to persist enterprise application state.  The list comes first, followed by a more detailed explanation of each requirement.  I will use these requirements when discussing various products and strategies in Part 3.

XML Persistence Store Requirements

  1. Support of storage of XML in its native format, through support of the XML data model.
    • Flexible XML schema-less or XML schema bound storage
    • Extensible on-the-fly storage of nodes
    • XML schema versioning to address evolving standards
    • Indexing (element, attribute and full-text)
  2. Support for XML querying and transformation facilities
    • XQuery/XPath
      • Support the ability to query external data sources (XML and non-XML)
      • Support for CRUD operations including node level updates
    • XSLT
    • XQuery Stored Procedures
  3. ACID/Distributed Transactions Support
  4. Concurrency
    • Isolation
    • Locking
  5. Enterprise Reliability
    • High Availability
    • Replication
    • Backup/Restore
  6. Caching
    • Ability to cache XML and non-XML data sources in a consistent XML view
  7. APIs – Support for various language bindings (Java, .NET, Web Services, etc…)
  8. Administrative and development tools

(1) Supporting XML in its native format means a lot of things.  The storage mechanism should allow flexibility in storing the hierarchical representation of XML.  Because XML can be semi-structured or structured (constrained by an XML Schema), the mechanism should support both schema-less and schema-bound storage.  For the most part, when persisting your enterprise data, the XML format will be constrained by a schema, thus making it structured and valid.  There are also use cases where schema validation is not required and you just want to store arbitrary XML data/messages from various sources.  That’s where the flexibility of schema-less storage comes in.

Schema versioning is also a very important requirement.  One of the biggest benefits of XML and XML Schema is the flexibility of dynamic hierarchical storage.  At the same time, a schema applies a set of constraints to the data, and those constraints might change as application requirements evolve.  One of the restrictions in the relational world is that when your constraints change, you have to go back and update all of the data that was collected before the new constraints were imposed.  In some instances that meant modifying data and/or purging it.  This was resolved through various means, but they weren’t very pretty.  With schema versioning, each change produces a new schema version whose constraints apply to all new data.  The old data still references the previous schema version number, thus allowing it to remain unchanged.  This is very important, especially in applications whose data and requirements are sensitive to change, whether because of regulatory or other requirements.

Indexing is yet another requirement if you want to retrieve the stored data efficiently.  The database should support multiple indexing levels (element, attribute, and full-text).

(2) Data retrieval is probably the most important aspect of storing data.  The flexibility and efficiency of retrieving data can make or break an application.  The popularity of relational storage is in part due to the standardization and popularity of SQL, which allows efficient retrieval of data from a relational store.  It’s optimized for retrieving data from conceptually rectangular data sets.  XML has various query standards.  The one that most closely resembles SQL in its promise and abilities is XQuery.  Before XQuery, XML data was mostly queried using XPath and XML parser APIs (e.g. SAX, DOM).  The former was limited in its abilities; the latter was extremely tedious and cumbersome, as it was low level.
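To make the contrast between low-level parser APIs and a declarative query concrete, here is a minimal sketch in Java that answers the same question twice: once by walking the DOM by hand, and once with a single XPath expression.  The class name, the sample document, and the element names are all invented for illustration, and the JDK’s built-in javax.xml.xpath engine implements XPath 1.0, not XQuery, so treat this as a sketch of the idea rather than of any particular XML database.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class QueryStyles {

    // Hypothetical sample data, invented for this illustration.
    static final String ORDERS =
        "<orders>"
      + "<order id='1'><customer>Ann</customer><total>250</total></order>"
      + "<order id='2'><customer>Bob</customer><total>75</total></order>"
      + "</orders>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Low-level approach: iterate over the DOM and filter in Java code.
    static String customerByDom(Document doc, double minTotal) {
        NodeList orders = doc.getElementsByTagName("order");
        for (int i = 0; i < orders.getLength(); i++) {
            Element order = (Element) orders.item(i);
            String total = order.getElementsByTagName("total").item(0).getTextContent();
            if (Double.parseDouble(total) > minTotal) {
                return order.getElementsByTagName("customer").item(0).getTextContent();
            }
        }
        return "";
    }

    // Declarative approach: one XPath expression expresses the same filter.
    // evaluate(String, Object) returns the string value of the first match.
    static String customerByXPath(Document doc, double minTotal) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/orders/order[total > " + minTotal + "]/customer", doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(ORDERS);
        System.out.println(customerByDom(doc, 100));
        System.out.println(customerByXPath(doc, 100));
    }
}
```

Both methods return the first customer whose order total exceeds the threshold.  The XPath version stays one line as the question gets more complicated, while the DOM version keeps growing.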

XQuery is a flexible query specification for XML.  It defines and works on the XQuery 1.0/XPath 2.0 data model (XDM), which is an abstraction over XML data.  XDM is based on the XML Infoset and adds a few other features (XML Schema types, document collections, ordered/heterogeneous sequences, and typed atomic values).  XPath 2.0 is a subset of XQuery.  XQuery stored procedures make more efficient use of XQuery, thus offering benefits similar to those of PL/SQL, T-SQL, IBM DB2 SQL, etc…

The querying capabilities should support CRUD operations with node-level granularity.  This matters because data that conforms to a schema will typically be stored as a single document in a single collection (the equivalent of one row), so that document effectively becomes the full storage schema for your application.  You can logically shred your XML document into multiple collections/rows, but that is tedious, and it loses its appeal if the database supports granular operations with the same consistency and efficiency as updating a tuple within a relational store.

XSLT is also an important standard that allows for transformation of result sets to other formats.  It lets you use native XML facilities to perform the transformation from within XQuery and/or some native persistence storage API.  Hopefully the native XML database supports XSLT 2.0, as it works on the same data model as XQuery (XDM).
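As a rough illustration of transforming a result set into another format, the sketch below uses the JDK’s standard javax.xml.transform API to turn a small XML result into an HTML list.  The class name, the result document, and the stylesheet are hypothetical, and the JDK’s bundled processor implements XSLT 1.0; an XSLT 2.0 processor such as Saxon would plug into the same API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ResultToHtml {

    // Hypothetical query result, invented for this illustration.
    static final String RESULT =
        "<customers><customer>Ann</customer><customer>Bob</customer></customers>";

    // A small XSLT 1.0 stylesheet that renders the result as an HTML list.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/customers'>"
      + "<ul><xsl:for-each select='customer'>"
      + "<li><xsl:value-of select='.'/></li>"
      + "</xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the output as text.
    static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(RESULT, STYLESHEET));
    }
}
```

Swapping in a different stylesheet (say, one that emits WML) changes the output format without touching the data or the Java code.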

(3,4,5,6) These requirements are consistent with any enterprise information system (EIS) that supports high availability and transactional requirements.  I won’t go into any more detail about them, as they are already very well defined in the relational storage world and the same principles apply here.

(7) As with any other storage medium, you must access it somehow.  APIs for the most common programming environments are very important.  Though you might be using one language/framework today, these requirements/standards have a way of changing, so you have to plan for the future.  Basically, what I’m trying to say is, don’t just ensure that your language/environment is supported, but rather ensure you have the flexibility of choosing now and/or in the future.  I won’t get into a language war here, but one important standard to watch is XQJ (XQuery API for Java), under development as JSR 225.  XQJ, though not very new, is now picking up steam in the Java community and most vendors are beginning to implement its interfaces.  XQJ is very similar in functionality to JDBC, but is specifically designed for connecting to and manipulating XQuery data sources.  The day will come when XQJ is packaged as part of Java SE, thus allowing most persistence providers to supply a driver for their implementation.  I hope such standards are also embraced and developed in other programming environments with time.

(8) Administrative tools are very important for any database product.  Whether it’s a command line interface or a GUI-based tool, they allow more efficient administration and maintenance of the database.  Development tools are also a great complement, as they allow for rapid application development.  Most commercial and open source relational database vendors do a great job of providing such tools.

In the discussion of particular native XML database products in the next part, I will cover some of the tools and features.  In the meantime, there are also tools out there that support XQuery/XPath/XSLT execution through a GUI environment, hook into multiple XQuery engines, and provide intuitive debuggers.  I personally enjoy working with Oxygen, which currently supports Saxon, eXist, Berkeley DB XML, X-Hive/DB, MarkLogic, and TigerLogic for XQuery.  It also supports Saxon, Xalan, and JAXP for XSLT transformations.  Oxygen is cross platform, with distributions for most major platforms (Windows, OS X, Linux).  Though there are some other products that provide the same level of support, Oxygen is comparatively very inexpensive and works great on OS X.

Above I provided the list of requirements that guided us in our native XML persistence needs.  Here is some idea of what’s to come in the next few parts of this series.

Part 3 – Evaluation of various native XML database products (mostly the ones that we have looked at, as well as ones that we have researched).
Part 4 – Efficient XML storage architecture and strategy.
Part 5 – XQOM (XQuery Object Mapping) framework introduction.

I’d also like to hear some ideas of what some of you are interested in, your experiences, comments, etc….  I really appreciate all of your input.  I received numerous responses to Part 1 of this series and I will try to use most of these comments/suggestions in the following parts.


11
Sep 06

Basic Dependency Injection (DI) for XQOM

So last night I was trying to figure out the best way of managing dependencies within the code for XQOM.  My biggest requirement was no XML.  That’s funny, considering that XQOM has an XML configuration component.  I’m not necessarily against XML configuration, though it tends to scatter the code/configuration combination.  I also don’t have a strong preference for XML vs. annotations.  I think each has its own place, and I don’t really see a blurry line between them.  Annotations suit anything that is class or method dependent and doesn’t involve much repetition.  A good example is JPA annotations: though lacking in features compared to Hibernate, I think conceptually they are great (and Hibernate’s extensions are awesome).  It just makes perfect sense to annotate your POJO with relational persistence metadata, since each piece of configuration (annotation) is closely tied to a class/method/attribute.  XML configuration files are great for anything that’s global and/or verbose, like query mappings with iBatis SQL Maps and XQOM, or global properties used by multiple classes.

In the case of XQOM, though I could have settled for XML configuration, the idea of having to edit XML each time an implementation class changes didn’t make a lot of sense to me.  The Drivers are a reification of the Separated Interface pattern, and since providers will implement the drivers in their own package namespace, yet another XML configuration file within the classpath that needs to be discovered and parsed would quickly become a mess.  I needed something more flexible, and because my object dependency and lifecycle requirements are not complex, I didn’t want to get Spring or PicoContainer into the mix.  I almost went with Pico, mostly due to its runtime code configuration capabilities.

After going back and forth about the benefits of using an IoC framework, I still couldn’t justify using one for my requirements.  Rudimentary IoC implementations that don’t require complex object lifecycle management are very simple to implement, and that’s not where Spring shines per se.

The next solution was to implement the Registry pattern.  A global singleton would do the job, but the drawback of having to recompile the code each time the implementation changes was again something I didn’t want to bother with, especially when we talk about Driver implementations.

So after more thought I settled on the JDBC model.  I really like it.  It’s sort of a reification of the Plugin pattern (PEAA) that binds itself dynamically, without any static configuration files.  Basically, each Driver implementation must declare a static initializer block which, upon class loading, registers the driver with the global singleton.  That’s basically what happens in JDBC when you execute…

Class.forName("org.namespace.DriverImpl");

Basically the DriverImpl class would have something like this…

package org.namespace;

public class DriverImpl implements Driver {

  // Runs once, when the class is loaded and initialized (e.g. via Class.forName).
  static {
    DriverManager.registerDriver(new DriverImpl());
  }

}

The DriverManager class would then look like this…

public class DriverManager {
  // The single registered driver; volatile so a new registration is visible to all threads.
  private static volatile Driver driverImpl;

  public static void registerDriver(Driver driver) {
    driverImpl = driver;
  }

}

I use the volatile keyword since different threads might register different drivers, and I want to ensure that the most recently registered driver is visible to all other threads.  I could have synchronized the registerDriver method as well, but in this case I don’t think it matters, because the method performs a single reference assignment.  I try not to use synchronized in cases where the concern is the visibility of data rather than data corruption.
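Putting the pieces together, here is a self-contained sketch of the self-registering driver idea that can be compiled and run as a single file.  The outer class, the name() method, and the getDriver() accessor are assumptions made for the example; the post does not show the actual XQOM Driver interface, and the nested classes only exist so that everything fits in one file.

```java
public class DriverRegistration {

    /** Hypothetical driver contract; the real interface is not shown in the post. */
    interface Driver {
        String name();
    }

    /** Global registry holding the single active driver. */
    static final class DriverManager {
        // volatile: a registration made by one thread is visible to later reads by others.
        private static volatile Driver driverImpl;

        private DriverManager() {}

        static void registerDriver(Driver driver) {
            driverImpl = driver;
        }

        // Accessor added for the example so the registered driver can be used.
        static Driver getDriver() {
            Driver driver = driverImpl;
            if (driver == null) {
                throw new IllegalStateException("No driver has been registered");
            }
            return driver;
        }
    }

    /** A provider implementation that registers itself when its class is initialized. */
    static final class DriverImpl implements Driver {
        static {
            DriverManager.registerDriver(new DriverImpl());
        }

        public String name() {
            return "example-driver";
        }
    }

    public static void main(String[] args) throws Exception {
        // In real use the class name would come from the caller or from configuration;
        // it is computed here only so the example works regardless of package.
        String className = DriverRegistration.class.getName() + "$DriverImpl";

        // Class.forName initializes the class, which runs its static block.
        Class.forName(className);

        System.out.println(DriverManager.getDriver().name());
    }
}
```

The calling code never references DriverImpl at compile time.  Switching providers means passing a different class name, with no recompilation and no configuration file to discover.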

With that implementation and a homegrown constructor-based DI (CDI), also sometimes called “type 3 IoC”, I was able to achieve what I needed without adding any extra dependencies or overhead.  Don’t get me wrong, I think Spring is great, but again, it should be evaluated based on your requirements.  It might be the case that I will actually require it later in the project, but my current DI requirements are just too simple.
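For readers who haven’t seen constructor-based DI done by hand, a minimal sketch follows.  The Session class, the execute method, and the stub driver are hypothetical names chosen for the example and are not taken from XQOM.

```java
public class ConstructorInjection {

    /** Hypothetical dependency; stands in for whatever a component needs. */
    interface Driver {
        String execute(String query);
    }

    /** The component receives its dependency through the constructor and never looks it up. */
    static final class Session {
        private final Driver driver;

        Session(Driver driver) {
            if (driver == null) {
                throw new IllegalArgumentException("driver is required");
            }
            this.driver = driver;
        }

        String run(String query) {
            return driver.execute(query);
        }
    }

    public static void main(String[] args) {
        // Hand wiring: the code that assembles the application decides which
        // implementation to pass in. Here it is a stub; it could just as well
        // be whatever driver was registered with a registry.
        Driver stub = query -> "result of " + query;
        Session session = new Session(stub);
        System.out.println(session.run("//order"));
    }
}
```

Because the dependency is a required constructor argument, a Session can never exist in a half-configured state, and a test can pass in a stub without any container.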


7
Sep 06

To maven or not?

For the last year or so, I’ve been using maven pretty happily for most of my projects, other than the ones that I didn’t initiate and where I’m forced to use ant.

I was never a big fan of ant, mostly because of the up-front time it takes to put together a robust build.  Yes, most developers reuse builds from previous projects, etc…, but most of my projects have been pretty unique in structure, so I always had to do some tweaking.

Maven introduced a standardized build structure.  I love it, a simple maven command to generate the initial project structure and I’m off and running.  Any customizations are easily added, etc…

But honestly, the biggest reason I’m using maven is its transitive dependency resolution.  Well, at least in concept it’s great.  I’ve also been using it for the last year without any major issues, though mostly with popular libraries (dependencies).

The XQOM project has a few dependencies that I’ve been battling with.  Its internal XML/Object mapper is JiBX.  JiBX is an awesome library.  I think it’s leaps and bounds ahead of its competitors like JAXB, Castor, etc…  It is to the XML/Object mapping world what Hibernate is to the ORM world.  No, it’s not an XML persistence solution, that’s what XQOM is, but you get the point.

So although JiBX is a well-architected and well-developed library, it has some maintenance issues.  The developers are not big on keeping the maven repo up to date.  The artifacts that are available in the repo have transitive dependency issues, missing pom files, etc…  The subprojects, like maven-jibx-plugin and the IDEA plugin, all use different versions, etc…  So what’s the problem?  Let’s see…

I’m using JiBX 1.1 for dependencies, at both runtime and compile time.  Because JiBX has some differences in bytecode enhancement code injection between 1.0.1 and 1.1, you can’t use a binding compiler of one version and a runtime of another.  Well, that seems straightforward enough, right?  No…

maven-jibx-plugin has a dependency on 1.0.1, so when it’s run, its post-compile bytecode enhancement goal is executed with the 1.0.1 libs.  The code base has a runtime dependency on 1.1, so when the tests are executed with surefire, they run against the project’s runtime dependency and fail with a complaint about the version mismatch: you can’t use the 1.1 runtime to execute code compiled with the 1.0.1 binding compiler.  One way of resolving this is to modify the local POM for the plugin so that it uses the 1.1 dependencies.

It gets even better.  The IDEA plugin depends on the 1.0 RC1 version, so when I’m building with IDEA, say to execute my TestNG tests from the IDE, it yet again complains, because the project dependency is 1.1.

OK, maybe this is just one project that I ran into that has these issues, and maybe they’ll get it together one day.  I used to hate manually installing Sun libs, since they couldn’t be hosted at ibiblio.  Now Sun has set up a maven repository, so it’s just a matter of adding it to your settings.

Aside from these difficulties and spending more time than I wanted to get it to work, I think Maven is great.  I’m sticking with it for now and the near future.  I just wish that more and more projects would either update ibiblio or provide good maven repositories of their own.  Maven is no longer an alternative build tool; it’s now mainstream, just like ant.


1
Sep 06

Ajax/XSLT views

Most of the popular view technologies for web applications today are for some reason centered around HTML.  That seems like a crazy statement, since one might say that web applications should be tightly coupled to HTML-based views since HTML is what is supported by the browser.  But web applications today are so much more than just being supported and rendered in a browser.  They are about streaming information on the web into a view technology, be it the browser, wireless devices, desktop clients, etc… to render a usable representation of that data for the consumer (note: consumer is not meant here in a marketing sense). 

Though most MVC frameworks today provide some support for rendering non-HTML views, most are still tightly coupled to HTML, and if you want to go outside of that norm, it becomes more difficult, with fewer and fewer reasons for using the framework in the first place.  You also might lose some of the abstraction over various APIs (servlet request/response, etc…) that these frameworks are so good at providing.

XSLT views are a nice, lightweight approach to rendering views for presentation.  They allow for flexible support of various view technologies and internationalization.  With the popularity of Web 2.0/Ajax applications, you’d think it would only be natural to start embracing the client-side browser XSLT facilities.  Most popular browsers (IE, Firefox, Opera, and Safari) provide such functionality.  IE and Firefox have the most robust XSLT facilities, with Opera and Safari catching up.

Here are some benefits and features:

  • XSL transformations are fast.  In some instances faster than downloading large HTML pages. 
  • XSL stylesheets can be cached on the client side.  They can even be precompiled with client-based javascript extensions and/or by default browser functionality.
  • You can control transformations with javascript and/or let the browser handle the transformation by adding an xml-stylesheet processing instruction to the XML file streamed to the client.
  • XML can be transformed to different formats (HTML, WML, etc…) depending on the client requesting it, just by attaching the correct stylesheet.  It can even be transformed to different HTML DOMs for different browsers, so you can better control browser rendering/support.
  • XSLT processing pipelines can be used to transform the data through various intermediary formats.
  • Ajax is a natural way of working with XML on the client and allows asynchronous loading of XML data, which can later be transformed on the fly with various stylesheets.

There are some arguments that have been raised over the last few years, mostly by developers who argue that moving any logic to the client side is a bad idea, that browser incompatibility issues will get in the way, that javascript should be kept to a minimum, etc…

All of these points were valid at some point, but are quickly losing credibility.  First, you don’t have to move any logic to the client side, and even if you do, you can separate view logic from business logic and thus only move a subset of the view logic to the client, if it makes performance sense.  Second, browser incompatibility, though still an issue, is becoming less and less of a roadblock, as the gap in feature support is closing and there are many frameworks available that provide cross-browser functionality (e.g. Google’s AJAXSLT).

Of course you don’t want to use XSLT views blindly in all applications.  As with anything else, this would be a part of the design-time decision and would have to be evaluated.  Here are some valid scenarios where you might want to consider using XSLT views.

  • You’re persisting your data in XML.  It only makes sense to stream it as it persists and transform to viewable formats either on the client and/or on the server.
  • Your HTML pages are huge and take a long time to stream to and render in the browser.  You might consider just streaming the data in XML and having a cached client-side XSLT stylesheet that will do the rest.
  • Your server load needs to be reduced, so you want to utilize some of the client browser’s processing power.  Client-side transformation takes away some of the server-side burden of processing and transforming the data for presentation.
  • You need a more robust way of supporting multiple view technologies (e.g. HTML, WML, SVG, etc…).  XSLT will allow you to have a standard data feed that’s transformed to multiple views on the fly by attaching an appropriate stylesheet.
  • You need to support and render data retrieved from a web service and/or some other application.  The data needs to be displayed in a consistent manner.  You can employ an XSLT pipeline to first transform the incoming data to an application standard format that is understood by the final XSLT stylesheet.  The final transformation can then render the data to the view technology.
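The pipeline scenario in the last bullet can be sketched on the server side with the JDK’s standard javax.xml.transform API.  In this hedged example the first stylesheet maps a made-up partner feed onto a made-up application format, and the second renders that format as HTML.  The class name, both stylesheets, and all element names are assumptions made purely for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltPipeline {

    static final String XSL_NS = "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'";

    // Stage 1: normalize a hypothetical incoming feed into the application's standard format.
    static final String NORMALIZE =
        "<xsl:stylesheet version='1.0' " + XSL_NS + ">"
      + "<xsl:template match='/feed'>"
      + "<items><xsl:for-each select='entry'>"
      + "<item><xsl:value-of select='@title'/></item>"
      + "</xsl:for-each></items>"
      + "</xsl:template></xsl:stylesheet>";

    // Stage 2: render the standard format for one particular view technology (HTML here).
    static final String RENDER =
        "<xsl:stylesheet version='1.0' " + XSL_NS + ">"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/items'>"
      + "<ul><xsl:for-each select='item'>"
      + "<li><xsl:value-of select='.'/></li>"
      + "</xsl:for-each></ul>"
      + "</xsl:template></xsl:stylesheet>";

    // Compiles a stylesheet once; Templates objects are thread-safe and reusable.
    static Templates compile(String xslt) throws Exception {
        return TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    static String apply(Templates stage, String xml) throws Exception {
        StringWriter out = new StringWriter();
        stage.newTransformer()
            .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // Feeds the output of each stage into the next one.
    static String run(String xml, Templates... stages) throws Exception {
        String current = xml;
        for (Templates stage : stages) {
            current = apply(stage, current);
        }
        return current;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<feed><entry title='News'/><entry title='Sports'/></feed>";
        System.out.println(run(feed, compile(NORMALIZE), compile(RENDER)));
    }
}
```

Passing strings between stages keeps the sketch short but re-parses the document at every step.  A production pipeline would more likely chain the stages through SAX events.  Supporting another view technology only requires a different final stylesheet, since the normalizing stage stays the same.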

There are more of these scenarios than I can think of right now, but even if your application’s requirements fit one or more of the above, that does not automatically mean that XSLT views should be used.  You should first evaluate the actual relevance of the above scenarios to your application (e.g. what percentage of the application will be affected by this functionality).

I hope that eventually component-based frameworks like JSF and Shale, and action-based frameworks like Struts and Webwork (and others), will embrace raw XML views with the ability to apply XSLT on the server and client transparently.  I sort of like Spring MVC, which has quite nice built-in support for XSLT views and gives you great flexibility over the amount of functionality you want to use, etc…  Spring MVC allows you to use only the components of the MVC you need, without forcing you to delegate control over to the framework.  It also has multiple implementations of Controllers and Views that you can reuse and/or extend.  A prime example of great OO design.