Saturday, March 29, 2008

Java XML related concepts and implementations

Many concepts, and just as many Java libraries and packages, revolve around XML processing. A beginner can easily get confused by them. This post records what I have learnt about XML processing.

(1) The first aspect is how to parse and process XML documents. The following is a list of currently popular XML processing interfaces.

(1.1) DOM (Document Object Model)
This is a model standardized by the W3C (http://www.w3.org/DOM/DOMTR). The specification is a suite that is divided into three levels. DOM is a platform- and language-neutral interface through which users can manipulate an XML document. For example, users can retrieve the root element of an XML document and get its child nodes and attributes.
Generally, a DOM implementation builds a tree structure in memory. This means the parser can finish building the DOM tree only after it has read the whole document. The obvious drawback is poor performance and high memory use on large XML documents. What's more, sometimes the user just wants a small part of a large XML document. In that case, building an in-memory tree is not a good solution.
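As a minimal sketch of the DOM style, the following uses the JAXP classes shipped with the JDK to load a whole document into memory and then walk from the root element to its children. The sample document, the class name DomExample and the method name listChildren are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DomExample {

    // Builds the full in-memory tree, then returns "id:text" for every
    // child element of the root.
    static List<String> listChildren(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // parse() returns only after the whole document has been read.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            List<String> result = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes, comments and so on.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    Element element = (Element) child;
                    result.add(element.getAttribute("id") + ":"
                            + element.getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><book id=\"1\">DOM</book><book id=\"2\">SAX</book></books>";
        System.out.println(listChildren(xml)); // prints [1:DOM, 2:SAX]
    }
}
```

Once the tree exists, any node can be visited in any order, which is what makes DOM convenient and also what makes it expensive for large documents.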
(1.2) SAX (Simple API for XML)
    This interface does not come from a formal standards organization. It started as a Java-only API for XML parsers, built on concepts different from DOM. After it was published, it was gradually accepted by the industry and became a set of interfaces that define another way to parse XML documents. The design is language independent, and SAX is now implemented in many programming languages.
    "SAX is a streaming interface — applications receive information from XML documents in a continuous stream, with no backtracking or navigation allowed. This approach makes SAX extremely efficient, handling XML documents of nearly any size in linear time and near-constant memory, but it also places greater demands on the software developer's skills."
    SAX is the best-known example of an event-based API. "An event-based API, on the other hand, reports parsing events (such as the start and end of elements) directly to the application through callbacks, and does not usually build an internal tree. The application implements handlers to deal with the different events, much like handling events in a graphical user interface."
    This mechanism requires programmers to think in terms of a state machine. SAX does not remember where it is in the document, so the handler must keep its own state, and the transitions between states are driven by the events SAX generates.
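    The state-keeping described above can be sketched with a small SAX handler. This is only an illustration using the JAXP SAX classes from the JDK; the element name "title", the class name SaxTitleCollector and the sample document are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the text of every <title> element.
class SaxTitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    // The handler's "state": are we currently inside a <title> element?
    private boolean insideTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            insideTitle = true;   // state transition driven by an event
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text run, so accumulate.
        if (insideTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString().trim());
            insideTitle = false;  // back to the initial state
        }
    }

    static List<String> collectTitles(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxTitleCollector handler = new SaxTitleCollector();
            // The parser pushes events into the handler while it reads.
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (Exception e) {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><book><title>DOM</title></book>"
                + "<book><title>SAX</title></book></books>";
        System.out.println(collectTitles(xml)); // prints [DOM, SAX]
    }
}
```

    No tree is built here: memory use depends on what the handler chooses to keep, not on the size of the document.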
(1.3) XPP (XML Pull Parser)
    SAX is a push-based interface: it reads the XML document from a stream and invokes the event handlers you have registered for every event. This is still inefficient when users just want to handle a small part of an XML document. In that case XPP is better. With XPP, the application asks the parser for the next piece of the document only when it wants it. As a result, the application can process just the parts it needs and stop early, leaving the rest of the document untouched.
(1.4) StAX (Streaming API for XML)
    Excerpt from http://en.wikipedia.org/wiki/StAX:
    "These two access metaphors can be thought of as polar opposites. A tree based API allows unlimited, random, access and manipulation, while an event based API is a 'one shot' pass through the source document.

StAX was designed as a median between these two opposites. In the StAX metaphor, the programmatic entry point is a cursor that represents a point within the document. The application moves the cursor forward - 'pulling' the information from the parser as it needs. This is different from an event based API - such as SAX - which 'pushes' data to the application - requiring the application to maintain state between events as necessary to keep track of location within the document."   
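
    The cursor idea in the excerpt can be sketched with the StAX classes in javax.xml.stream. This is only an illustration: the class name StaxFirstMatch, the method firstText and the sample document are made up. The application pulls events one at a time and simply stops once it has found what it wants.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class StaxFirstMatch {

    // Returns the text of the first element with the given name,
    // or null if there is no such element.
    static String firstText(String xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                // The application, not the parser, drives the loop.
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // Stop here; the rest of the document is never pulled.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<books><book><title>DOM</title></book>"
                + "<book><title>SAX</title></book></books>";
        System.out.println(firstText(xml, "title")); // prints DOM
    }
}
```

    Compared with a SAX handler, no callback object and no explicit state flag are needed, because the position in the loop already says where the cursor is.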

    Here (http://www.xml.com/pub/a/2001/11/14/dom-sax.html?page=1) is a good discussion of DOM and SAX at a high level rather than at the level of processing details.

(2) Validation of XML documents
(2.1) DOM
    Validating an XML document against a W3C XML Schema or a DTD is straightforward with DOM. DOM keeps the whole structure of the processed document in memory, so any information the validator needs can be retrieved without much effort.
(2.2) SAX
    Validation in SAX requires more effort because SAX does not keep information about the full document, and certain kinds of validation need the document as a whole. For example, a DTD IDREF attribute refers to another element in the document: its value must match the ID attribute of some element in the same document. To work around this, one needs to record every ID and IDREF attribute encountered and check them against each other at the end of the document. In addition, some kinds of XML processing (e.g. XPath) need access to the whole document. In that case it is better to build a document tree, which is exactly what DOM does.
(2.3) XPP
    The same reasoning applies here as well. To validate an XML document, XPP needs even more workarounds than DOM and SAX.
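    In Java, one way to validate without writing these workarounds by hand is the javax.xml.validation package in the JDK. The following is a minimal sketch for W3C XML Schema only; the class name XsdValidation, the method isValid and the tiny schema are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class XsdValidation {

    // Returns true if the document conforms to the schema, false otherwise.
    static boolean isValid(String xsd, String xml) {
        Schema schema;
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            // A broken schema is a programming error, not an invalid document.
            throw new IllegalArgumentException("cannot compile schema", e);
        }
        try {
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, the first validation error is thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xs:element name=\"note\"><xs:complexType><xs:sequence>"
                + "<xs:element name=\"to\" type=\"xs:string\"/>"
                + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        System.out.println(isValid(xsd, "<note><to>Bob</to></note>"));     // prints true
        System.out.println(isValid(xsd, "<note><from>Al</from></note>")); // prints false
    }
}
```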
(3) Java-specific XML processing interfaces
    To give Java programmers easier access to the various XML processing interfaces, several Java-specific interfaces have been defined. They make XML processing code parser independent: users can switch between parsers as long as the parsers comply with the same set of interface definitions. This is a good thing because the code is not bound to a specific parser.
(3.1) JAXP
    JAXP (Java API for XML Processing) is defined through the JCP. It provides a set of interfaces with which Java programmers can process XML documents, covering DOM, SAX, XSLT, XInclude, XPath and XML validation. Bundling JAXP with the JDK increases its chance of being widely adopted: users don't need to install additional packages to use it, which makes it convenient and simplifies configuration of the runtime environment. To process XML documents, the JDK is all you need. Of course, if you are not satisfied with the performance of the parser shipped with the JDK, you can download and install another one, which adds flexibility.
(3.2) dom4j
    dom4j is not part of JAXP. It is an independent open source library with its own tree model, and it works together with DOM, SAX and JAXP parsers while providing extra functionality on top of them. It seems that dom4j is more convenient for Java programmers to work with.
(3.3) JDOM
    JDOM shares the same goal as the interfaces described above. If I remember correctly, it came out before JAXP. By using many Java-specific language features, it eases the processing of XML documents in Java. However, I don't think it is still as popular, now that there are the standard JAXP and dom4j.
(4) Parsers
    Now it is time to introduce some XML parsers, which do the actual processing of XML documents. Two popular open source parsers I know of are Xerces and Crimson. You can search the web for comparisons between the two. Both of them support parsing through DOM and SAX.
(5) Relationship between these terms
    To eliminate possible confusion, it may help to clarify how these terms relate to each other.
    The APIs/interfaces define an implementation-independent specification of XML manipulation. They are just sets of interfaces, not implementations. Implementations of these interfaces are generally based on parsers, which do the actual work. An implementation is a thin layer on top of a parser: it wraps the functionality of the underlying parser (e.g. Xerces or Crimson) to provide the unified interface defined by the corresponding specification (JAXP, dom4j, ...). So when you download a library from the dom4j, JDOM or JAXP website, it generally bundles a parser library such as Xerces. You can see this by browsing the directory layout of the dom4j/JDOM/JAXP distribution.
(6) Further work
    Although JAXP/dom4j/JDOM ease the processing of XML documents, they are still far from what Java programmers expect. Working at the DOM level is clumsy: you have to retrieve an element or attribute before you can manipulate its content, and programmers generally need to know the details of the XML documents.
    A more advanced idea is to establish a correspondence between Java classes and an XML Schema (or DTD), and between Java objects and XML documents. Programmers then manipulate Java objects instead of the elements and attributes of XML documents.
    Many libraries implement such a binding between XML and Java, e.g. JIBX, XMLBeans and ADB. The JCP defined a standard called JAXB (Java Architecture for XML Binding). Glassfish provides the reference implementation. JAXME is another implementation of JAXB. However, JAXME has not published a new version since 2006, so I don't know whether it is still actively developed.
    However, because the correspondence between the two sides is sometimes not natural, binding may increase the burden on programmers. When the generated mapping does not match what programmers expect, some human intervention is necessary, and programmers must then understand the rules used during conversion. There may be so many rules that grasping them all is not a trivial task. You do not always need to understand all of them, but figuring out which rule to customize is still not easy. As a result, some programmers prefer to manipulate XML using DOM/SAX instead of XML-Java binding.
(7) Related area
    One area that inevitably comes up when illustrating XML-related technologies is Web Services.
    Several useful projects have been created to ease the development of web services in Java. Here (http://wiki.apache.org/ws/StackComparison) is a list of frameworks together with a stack comparison. One additional library not mentioned in that article is XINS (http://xins.sourceforge.net/). On the client side, WSIF is a framework that makes it easy to write web service clients. It was donated by IBM to the Apache foundation. However, it does not seem to be actively developed now. I am not sure.
    Whenever several variations of a Java technology exist, the JCP tends to standardize it, and the web service area is no exception. First, it proposed the JAX-RPC specification, which standardized web service development based on WSDL/SOAP. One part of the specification is the binding between Java classes/objects and WSDL, for which JAX-RPC contains detailed binding rules. Later, JAX-RPC evolved into version 2.0 and was renamed JAX-WS 2.0. The reason may be that WS is an industry buzzword and the new name better captures the intent of the specification. In JAX-WS 2.0, the WSDL-Java binding is delegated to JAXB. I remember that a JAX-WS implementation is included in JDK 6. The JAX-RPC/JAX-WS specifications describe interfaces with which programmers can easily build web service client and server programs. In addition, programmers can customize the handling of transmitted messages (e.g. SOAP) by plugging in handlers.

Sunday, March 16, 2008

Java Bytecode

When javac compiles a Java program, it generates bytecode (.class files) that can be run on the JVM.
The javap command can be used to display the content of .class files in a readable way.
    javap ClassName        //print out field/method definition.
    javap -c ClassName    //print out all disassembled code.

A simple introduction:
http://www.ibm.com/developerworks/ibm/library/it-haggar_bytecode/

JVM specification:
http://java.sun.com/docs/books/jvms/

Java Reflection and bytecode ...

I found a series of very useful articles about Java. Here is the address: http://www.ibm.com/developerworks/java/library/j-dyn0429/.
In this series, the author introduces class loading, bytecode, reflection, on-the-fly class transformation...

Javassist --- manipulation of bytecode
Introduction from the official site:
"Javassist (Java programming assistant) is a load-time reflective system for Java. It is a class library for editing bytecodes in Java; it enables Java programs to define a new class at runtime and to modify a class file before the JVM loads it. Unlike other similar systems, Javassist provides source-level abstraction; programmers can modify a class file without detailed knowledge of the Java bytecode. They do not have to even write an inserted bytecode sequence; Javassist instead can compile a fragment of source text on line (for example, just a single statement). This ease of use is a unique feature of Javassist against other tools."
http://labs.jboss.com/javassist/
http://www.csg.is.titech.ac.jp/~chiba/javassist/

Configuration Of Tomcat

(Part of this post is quoted from the official Tomcat site)
    I have been using Tomcat as the JSP/Servlet container in my current project. However, I did not fully understand Tomcat configuration. Every time I needed to modify the configuration, I searched the web and learnt about the relevant elements, without ever getting the big picture. To fully understand the configuration, I read through the official documentation at http://tomcat.apache.org/tomcat-6.0-doc/config/. As I expected, it helped me a lot.
The configuration file of Tomcat is $TOMCAT_HOME/conf/server.xml.
(1) Server
    The root element is Server.
(2) Service
    A Service element represents the combination of one or more Connector components that share a single Engine component for processing incoming requests. In effect, it is a container that groups an Engine with its Connectors.
(3) Connector
    A Connector receives requests from clients. There are two kinds of connector: HTTP and AJP.
    The HTTP Connector enables Tomcat to function as a stand-alone web server. Each connector instance listens for connections on a specific port number, and one Service can contain more than one connector. Normally, every connector maintains its own thread pool. This can be changed by setting the executor attribute: the connector then uses the named executor and ignores all of its own thread-related attributes. One advantage of an executor is that it can be shared between components.
    "Each incoming request requires a thread for the duration of that request. If more simultaneous requests are received than can be handled by the currently available request processing threads, additional threads will be created up to the configured maximum (the value of the maxThreads attribute). If still more simultaneous requests are received, they are stacked up inside the server socket created by the Connector, up to the configured maximum (the value of the acceptCount attribute). Any further simultaneous requests will receive "connection refused" errors, until resources are available to process them."
    If SSL over HTTP is needed, the corresponding Connector element should be modified. The redirectPort attribute tells a plain connector which port to redirect to when it receives a request that requires SSL, so that the request is handled by another connector that supports SSL.
(4) Engine
    The Engine element follows all the Connector elements in a Service. It represents the request processing machinery associated with the Service: it receives and processes all requests from the connectors and returns the responses to them. Note: exactly one Engine element must be defined inside a Service element.
(5) Host
    Inside the Engine element, multiple Host elements can be nested. Each Host element represents a virtual host. At least one Host is required, and the name of one of the Hosts MUST match the value of the defaultHost attribute of the Engine element. A nested Alias element can be used to give a host an additional name. The appBase attribute specifies the application base directory of the virtual host. The deployOnStartup and autoDeploy attributes control automatic deployment.
(6) Context
    A Context element represents a web application that runs inside a particular virtual host. "Each web application is based on a Web Application Archive (WAR) file, or a corresponding directory containing the corresponding unpacked contents, as described in the Servlet Specification (version 2.2 or later)." However, in recent versions of Tomcat, Context elements do not need to be written in server.xml. The details are:
    "For Tomcat 6, unlike Tomcat 4.x, it is NOT recommended to place <Context> elements directly in the server.xml file. This is because it makes modifying the Context configuration more invasive since the main conf/server.xml file cannot be reloaded without restarting Tomcat.
Context elements may be explicitly defined:
  • in the $CATALINA_HOME/conf/context.xml file: the Context element information will be loaded by all webapps
  • in the $CATALINA_HOME/conf/[enginename]/[hostname]/context.xml.default file: the Context element information will be loaded by all webapps of that host
  • in individual files (with a ".xml" extension) in the $CATALINA_HOME/conf/[enginename]/[hostname]/ directory. The name of the file (less the .xml extension) will be used as the context path. Multi-level context paths may be defined using #, e.g. context#path.xml. The default web application may be defined by using a file called ROOT.xml.
  • if the previous file was not found for this application, in an individual file at /META-INF/context.xml inside the application files
  • inside a Host element in the main conf/server.xml"

How is an incoming request mapped to a specific web app?
    "The web application used to process each HTTP request is selected by Catalina based on matching the longest possible prefix of the Request URI against the context path of each defined Context. Once selected, that Context will select an appropriate servlet to process the incoming request, according to the servlet mappings defined in the web application deployment descriptor file (which MUST be located at /WEB-INF/web.xml within the web app's directory hierarchy)."

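    The longest-prefix rule quoted above can be sketched in a few lines of Java. This is only an illustration of the rule, not Tomcat's actual code; the class name ContextMatcher, the method selectContext and the sample context paths are made up, and the empty string stands for the default (ROOT) web application.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not taken from Tomcat.
class ContextMatcher {

    // Returns the context path that is the longest prefix of the request URI.
    // Falls back to "" (the default web application) when nothing matches.
    static String selectContext(List<String> contextPaths, String requestUri) {
        String best = "";
        for (String path : contextPaths) {
            // A context matches only on a whole path segment, so that
            // "/shop" does not capture "/shopping".
            boolean matches = requestUri.equals(path)
                    || requestUri.startsWith(path + "/");
            if (matches && path.length() > best.length()) {
                best = path;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> contexts = Arrays.asList("/shop", "/shop/admin", "/blog");
        System.out.println(selectContext(contexts, "/shop/admin/users")); // prints /shop/admin
        System.out.println(selectContext(contexts, "/shop/cart"));        // prints /shop
    }
}
```

    After the context has been chosen, the remaining part of the URI is matched against the servlet mappings in that application's web.xml.
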
Deployment
    If you want to develop and deploy your own JSP/Servlet applications, consult this page: http://tomcat.apache.org/tomcat-6.0-doc/appdev/deployment.html. It covers the standard directory layout, where to put shared libraries, web.xml and deployment. This link also contains some useful information: http://tomcat.apache.org/tomcat-6.0-doc/config/host.html#Automatic%20Application%20Deployment.

Configuration of virtual host:
http://hi.baidu.com/zouziting/blog/item/7e3ad7808f3491d79123d936.html

Saturday, March 15, 2008

Edit binary file in hex mode in vim

It would be nice to edit binary files directly in vim. Vim provides basic support for this, with a few restrictions.
(1) Open a binary file
    vim -b datafile
or
    :set binary
(2) Many characters are unprintable. You can display them in hex format with:
    :set display=uhex
Or you can use the ga command to see the value of the character under the cursor.
(3) To see the current position, use
    g CTRL-G
The output is verbose:
    Col 6 of 38; Line 31 of 31; Word 94 of 96; Byte 747 of 780
(4) Move to a specific byte offset:
    234go
(5) xxd can be used to convert the file into hex dump format:
    :%!xxd
The result should look like this:
    0000000: 6262 630a 6465 660a 6768 696b 0aab de0a  bbc.def.ghik....
After the offset there are two parts: the hex part and the printable character part.
To convert back:
    :%!xxd -r
Note: only changes in the hex part take effect. Changes in the printable text part are ignored.

Of course, the xxd tool can also be used on its own from the command line.
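
To make the layout of a dump line concrete, here is a small Java sketch that formats up to 16 bytes in the same style as the sample line above: offset, hex part grouped two bytes at a time, then the printable part with a dot for every non-printable byte. It is not how xxd itself is implemented, and the class name HexDump and method dumpLine are made up for illustration.

```java
// Hypothetical helper class that mimics the layout of one hex dump line.
class HexDump {

    // Formats at most 16 bytes as "offset: hex part  printable part".
    static String dumpLine(int offset, byte[] data) {
        StringBuilder hex = new StringBuilder();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xff;          // treat the byte as unsigned
            hex.append(String.format("%02x", b));
            if (i % 2 == 1) {
                hex.append(' ');             // a space after every two bytes
            }
            // Printable ASCII is shown as is, everything else as a dot.
            text.append(b >= 0x20 && b < 0x7f ? (char) b : '.');
        }
        // Pad the hex part to 40 columns so the text part always lines up.
        return String.format("%07x: %-40s %s", offset, hex, text);
    }

    public static void main(String[] args) {
        byte[] sample = {0x62, 0x62, 0x63, 0x0a, 0x64, 0x65, 0x66, 0x0a,
                         0x67, 0x68, 0x69, 0x6b, 0x0a, (byte) 0xab, (byte) 0xde, 0x0a};
        System.out.println(dumpLine(0, sample));
        // prints 0000000: 6262 630a 6465 660a 6768 696b 0aab de0a  bbc.def.ghik....
    }
}
```

The printable part is derived entirely from the hex part, so edits made only in the text column carry no information for the reverse conversion.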

Friday, March 14, 2008

Google Apps

I had heard of "Google Apps" for a long time but had never tried it. Recently I have been investigating some Web 2.0 applications, including Google's YouTube, Code Search... Gradually I learnt more about Google Apps. It integrates various Google applications, including Gmail, Docs & Spreadsheets, GTalk... The standard edition of Google Apps is free! The premier edition costs $50 per user per year, which is much cheaper than Microsoft Office. As a result, it may become the first choice of small companies that don't want to spend a lot of money on MS Office. With Google Apps, users can build a web site for their company and make use of Google Docs, GTalk, Gmail and many more applications made by Google.
However, the two products are not direct counterparts. Google Apps aims to give businesses a convenient way to build applications. End users don't need to maintain any software or hardware to run the system; the infrastructure is provided and managed by Google. In addition, all applications in Google Apps are web-based, which makes collaboration and communication lightweight. In other words, users don't need to install any client-side software. This is the biggest advantage of Google Apps.
But Google Apps is not as powerful as MS Office. For example, Google Docs & Spreadsheets provides only very basic editing functionality, which may not meet the requirements of advanced users. If Google someday provides its own desktop tools like MS Office, the future of Google Apps will be bright.
After all, it is a good thing, because Google Apps gives users a new choice, and it may benefit small companies with limited budgets.