Saturday, March 29, 2008

Java XML related concepts and implementations

Many concepts, and many corresponding Java libraries and packages, revolve around XML processing. A beginner can easily get confused by them. This post records what I have learnt about XML processing.

(1) The first aspect is how to parse and process XML documents. The following is a list of currently popular XML processing interfaces.

(1.1) DOM (Document Object Model)
This is a model standardized by the W3C (http://www.w3.org/DOM/DOMTR). The specification is a suite organized into three levels. DOM is a platform- and language-neutral interface through which programs can manipulate an XML document. For example, a program can retrieve the root element of an XML document and get its child nodes and attributes.
A DOM implementation generally builds a tree structure in memory, which means the parser can hand over the DOM tree only after it has read the whole document. The obvious drawback is performance and memory use for a large XML document. What's more, sometimes the user wants only a small part of a large XML document. In that case, building an in-memory tree is not a good solution.
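The root/children/attributes access described above can be sketched with the DOM support that ships in the JDK. This is only a minimal illustration: the class name DomSketch, the helper names, and the sample document are made up for this post.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of any library.
class DomSketch {
    // Parses the whole document into an in-memory tree and returns it.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Counts the direct children of an element that are themselves elements
    // (text and comment nodes are skipped).
    static int countChildElements(Element parent) {
        int count = 0;
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document.
        String xml = "<library name=\"city\"><book id=\"b1\"/><book id=\"b2\"/></library>";
        Element root = parse(xml).getDocumentElement();
        System.out.println(root.getTagName());          // library
        System.out.println(root.getAttribute("name"));  // city
        System.out.println(countChildElements(root));   // 2
    }
}
```

Note that parse() returns only after the whole input has been read, which is exactly the cost discussed above.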
(1.2) SAX (Simple API for XML)
    This interface does not come from a formal standards organization. It started as a Java API for XML parsers, built on concepts different from DOM. After it was published, it was gradually accepted by the industry. Today SAX is a set of interfaces that define an event-driven way to parse XML documents. The idea is language independent, and SAX is implemented in many programming languages.
    "SAX is a streaming interface: applications receive information from XML documents in a continuous stream, with no backtracking or navigation allowed. This approach makes SAX extremely efficient, handling XML documents of nearly any size in linear time and near-constant memory, but it also places greater demands on the software developer's skills."
    SAX is the best known example of Event-based APIs. "An event-based API, on the other hand, reports parsing events (such as the start and end of elements) directly to the application through callbacks, and does not usually build an internal tree. The application implements handlers to deal with the different events, much like handling events in a graphical user interface."
    This mechanism requires programmers to think in terms of something like a state machine. The handler has to keep track of its current state itself, and the transitions between states are driven by the events SAX generates.
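    To make the state-machine idea concrete, here is a small sketch using the SAX classes in the JDK. The handler has a single piece of state (are we inside a title element or not) that is flipped by start and end events. The class name SaxTitles and the element name title are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text of every <title> element.
class SaxTitles {
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private boolean inTitle = false;                 // the handler's only state
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equals(qName)) {
                inTitle = true;                          // transition: outside -> inside
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so append.
            if (inTitle) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(text.toString());
                inTitle = false;                         // transition: inside -> outside
            }
        }
    }

    static List<String> titles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book><title>A</title></book><book><title>B</title></book></books>";
        System.out.println(titles(xml));
    }
}
```

    Real documents usually need more states than one boolean, which is where the extra demand on the programmer comes from.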
(1.3) XPP (XML Pull Parser)
    SAX is a push-based interface: it reads the XML document from a stream and invokes the event handlers you have registered. This is still inefficient when the user wants to handle only a small part of the XML document. In that case XPP is better. With XPP the application asks the parser for the next element or attribute when it needs it. As a result, the application can process only the parts of the document it cares about and leave the rest untouched.
(1.4) StAX (Streaming API for XML)
    Excerpt from http://en.wikipedia.org/wiki/StAX:
    "These two access metaphors can be thought of as polar opposites. A tree based API allows unlimited, random, access and manipulation, while an event based API is a 'one shot' pass through the source document.

StAX was designed as a median between these two opposites. In the StAX metaphor, the programmatic entry point is a cursor that represents a point within the document. The application moves the cursor forward - 'pulling' the information from the parser as it needs. This is different from an event based API - such as SAX - which 'pushes' data to the application - requiring the application to maintain state between events as necessary to keep track of location within the document."   
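    The cursor idea in the excerpt can be sketched with the javax.xml.stream package (part of the JDK since Java 6). The application moves the cursor forward itself and stops as soon as it has found what it wants. The class name StaxFirstMatch and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class.
class StaxFirstMatch {
    // Returns the text of the first element with the given name, or null if
    // there is none. Assumes the matching element contains only text.
    static String firstText(String xml, String elementName) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // The application "pulls" the next event from the parser.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    return reader.getElementText();  // the rest of the input is never read
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<r><a>1</a><b>2</b><b>3</b></r>";
        System.out.println(firstText(xml, "b"));
    }
}
```

    Because the loop belongs to the application rather than to the parser, no handler state has to be kept between callbacks.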

    Here (http://www.xml.com/pub/a/2001/11/14/dom-sax.html?page=1) is a good high-level discussion of DOM and SAX that does not go into processing details.

(2) Validation of XML documents
(2.1) DOM
    Validating an XML document against a W3C XML Schema or a DTD is straightforward with DOM. DOM keeps the whole structure of the processed XML document, so all information in the document can be retrieved without much effort.
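    As one possible illustration, a DOM tree can be checked against a W3C XML Schema with the javax.xml.validation package in the JDK. The tiny schema, the element names note and to, and the class name DomValidation are all invented for this sketch.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class.
class DomValidation {
    // Made-up schema: a <note> element containing exactly one <to> string.
    static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"note\"><xs:complexType><xs:sequence>"
          + "<xs:element name=\"to\" type=\"xs:string\"/>"
          + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Returns true if the (well-formed) document is valid against XSD.
    static boolean isValid(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // schema validation expects a namespace-aware tree
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            schema.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<note><to>Bob</to></note>"));
        System.out.println(isValid("<note><from>Bob</from></note>"));
    }
}
```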
(2.2) SAX
    Validation in SAX requires more effort, because SAX does not keep information about the full document. Certain kinds of validation need access to the document as a whole. For example, a DTD IDREF attribute refers to another element in the document: some element in the document must carry the same string as its ID attribute. To work around this, one needs to record every ID attribute and every IDREF attribute encountered. In addition, some kinds of XML processing need access to the whole document (e.g. XPath). In that case it is better to build a document tree, which is exactly what DOM provides.
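    The bookkeeping for ID/IDREF described above might look like the sketch below. To keep it short it does not read a DTD at all. It simply assumes, for illustration only, that an attribute named id plays the ID role and an attribute named ref plays the IDREF role. The class name IdRefCheck is also made up.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class.
class IdRefCheck {
    // Returns every "ref" value that has no matching "id" anywhere in the document.
    static Set<String> danglingRefs(String xml) throws Exception {
        final Set<String> ids = new HashSet<>();
        final Set<String> refs = new LinkedHashSet<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // SAX forgets each element after the event, so we must remember these ourselves.
                String id = atts.getValue("id");    // assumed ID attribute
                if (id != null) {
                    ids.add(id);
                }
                String ref = atts.getValue("ref");  // assumed IDREF attribute
                if (ref != null) {
                    refs.add(ref);
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);

        // A reference may appear before its target, so the check can only
        // be done after the whole document has been seen.
        refs.removeAll(ids);
        return refs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(danglingRefs("<r><a id=\"x\"/><b ref=\"x\"/><b ref=\"y\"/></r>"));
    }
}
```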
(2.3) XPP
    The same thinking applies here as well. To validate an XML document, XPP needs more workarounds than DOM and SAX.
(3) Java-specific XML processing interfaces
    To give Java programmers easier access to the various XML processing interfaces, several Java-specific interfaces have been defined. They can make XML processing code parser independent. In other words, the user can switch between parsers as long as those parsers comply with the same set of interface definitions. This is a good thing because the code is not bound to a specific parser.
(3.1) JAXP
    JAXP (Java API for XML Processing) is defined through the JCP (Java Community Process). It provides a set of interfaces through which Java programmers can process XML documents. It covers DOM, SAX, XSLT, XInclude, XPath and XML validation. Putting JAXP into the JDK increases its chance of being widely accepted: users don't need to install additional packages to use it, which makes it convenient and simplifies configuration of the runtime environment. To process XML documents, the JDK is all you need. Of course, if you are not satisfied with the performance of the parser shipped with the JDK, you can download and install another one. This adds flexibility.
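    As a small example of "the JDK is all you need", the XPath part of JAXP can be used without any extra jar. The class name JaxpXPath and the sample document below are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class.
class JaxpXPath {
    // Evaluates an XPath expression against the document and returns the result as a string.
    static String evaluate(String xml, String expression) throws Exception {
        // The factory hides which concrete XPath engine is used.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<r><item n=\"1\">a</item><item n=\"2\">b</item></r>";
        System.out.println(evaluate(xml, "/r/item[@n='2']"));
    }
}
```

    The code only ever names JAXP types. Which implementation sits behind the factories is decided at run time, for example through a system property such as javax.xml.parsers.DocumentBuilderFactory for the DOM factory, so the parser can be swapped without touching the code.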
(3.2) dom4j
    As far as I can tell, dom4j is a separate open source library with its own document model rather than a part of JAXP. It works together with DOM, SAX and JAXP, and adds extra functionality on top. It seems that dom4j is more convenient for Java programmers to work with.
(3.3) JDOM
    JDOM shares the same goal as the interfaces described above. If I remember correctly, it came out before JAXP. By using many Java-specific language features, it eases the processing of XML documents in Java. However, I don't think it is still as popular, given the standard JAXP and dom4j.
(4) Parsers
    Now it is time to introduce some XML parsers, which do the actual processing of XML documents. Two popular open source parsers I know are Xerces and Crimson. You can search for comparisons between the two. Both of them support parsing through DOM and SAX.
(5) Relationship between these terms
    To eliminate possible confusion, some clarification of the relationship among these terms may be necessary.
    The APIs/interfaces define an implementation-independent specification of XML manipulation. They are just sets of interfaces, not implementations. Implementations of these interfaces are generally based on parsers, which do the actual work. The implementation is a thin layer sitting on top of a parser. It wraps the functionality of the underlying parser (maybe Xerces or Crimson) to provide the unified interface defined by the corresponding specification (JAXP, dom4j, ...). So generally, after you download a library from the dom4j, JDOM or JAXP website, you will find a parser library such as Xerces or Crimson bundled with it. You can see this if you browse the directory layout of the dom4j/JDOM/JAXP distribution.
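    One quick way to see the split between interface and parser is to ask the JAXP factories which concrete class they resolved to. The sketch below only prints class names. The class name WhichParser is made up, and the names printed depend on the JDK and on any parser jars on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical example class.
class WhichParser {
    // Name of the concrete class behind the abstract JAXP DOM factory.
    static String domFactoryImpl() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Name of the concrete class behind the abstract JAXP SAX factory.
    static String saxFactoryImpl() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The code above mentions only javax.xml.parsers types; the printed
        // names show which parser implementation does the actual work.
        System.out.println("DOM factory: " + domFactoryImpl());
        System.out.println("SAX factory: " + saxFactoryImpl());
    }
}
```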
(6) Further work
    Although JAXP/dom4j/JDOM ease the processing of XML documents, they are far from what Java programmers expect. Working at the DOM level is still clumsy. You have to retrieve an element or attribute before you can manipulate its content, so programmers generally need to know the details of the XML documents.
    A more advanced idea is to build a correspondence between Java classes and an XML Schema (or a DTD), and between Java objects and XML documents. Programmers then only need to manipulate those Java objects instead of the elements/attributes in the XML documents.
    There are many libraries which implement binding between XML and Java, e.g. JiBX, XMLBeans, ADB. The JCP produced a standard called JAXB (Java Architecture for XML Binding). GlassFish provides a reference implementation. JaxMe is also an implementation of JAXB. However, JaxMe has not published a new version since 2006, so I don't know whether it is still actively developed.
    However, because the correspondence between the two sides is sometimes not so natural, it may increase the burden on programmers. Sometimes the generated mapping is not what programmers expect, and some human intervention is necessary. Programmers must then understand the rules used during conversion. There may be so many rules that grasping them all is not a trivial task. You don't always need to understand them all, but figuring out which rule to customize is still not easy. As a result, some programmers prefer to manipulate XML using DOM/SAX instead of XML-Java binding.
(7) Related area
    One area that inevitably comes up when illustrating the use of XML-related technologies is web services.
    Some useful projects have been created to ease the development of web services in Java. Here (http://wiki.apache.org/ws/StackComparison) is a list of frameworks, together with a stack comparison. One additional library not mentioned in that article is XINS (http://xins.sourceforge.net/). For web service clients, WSIF is a client framework that makes it easy to put a client together. It was donated by IBM to the Apache Software Foundation. However, it does not seem to be actively developed now. I am not sure.
    Whenever several variations of a Java technology exist, the JCP is willing to standardize it. The web service area is no exception. First it proposed the JAX-RPC specification, which standardized web service development based on WSDL/SOAP. One part of the specification is the binding between Java classes/objects and WSDL, and JAX-RPC contains detailed binding rules. After some time, JAX-RPC evolved into version 2.0 and was renamed JAX-WS 2.0. The reason may be that WS is a buzzword in the industry and better captures the intent of the specification. In JAX-WS 2.0, the WSDL-Java binding is delegated to JAXB. As I remember, a JAX-WS implementation is included in JDK 6. The JAX-RPC/JAX-WS specifications describe interfaces through which programmers can easily build web service clients and servers. Besides, programmers can customize the handling of transmitted messages (possibly SOAP) by plugging in handlers.