Monday, December 29, 2008

HTML table cell text wrap

By default, if you set the width of a table cell, its text wraps automatically at word boundaries; a single word is never split into pieces. English words are usually short, but some "words" can be very long. For example, a URL contains no word delimiters, so the whole URL is treated as a single word.
The CSS property word-break can be used to force a long word to be split across more than one line.
Usage: word-break: break-all
Unfortunately, that does not work in Firefox :-( I searched the web and found that Firefox does not support this property, and there is no equivalent in Firefox.
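A minimal sketch (the cell width and URL are made up for illustration):

```html
<!-- Force wrapping of a long unbroken "word" inside a fixed-width cell -->
<table>
  <tr>
    <td style="width: 120px; word-break: break-all;">
      http://example.com/a/very/long/url/without/any/delimiters
    </td>
  </tr>
</table>
```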

Friday, December 26, 2008

How to use Maven2

Download and install maven: http://maven.apache.org/download.html.

Running Maven
http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html
http://maven.apache.org/guides/getting-started/index.html
By default, the local repository is located at USER_HOME/.m2/repository.

Configuration
http://maven.apache.org/guides/mini/guide-configuring-maven.html
Three levels: project (pom.xml), user ($HOME/.m2/settings.xml), and global ($M2_HOME/conf/settings.xml).

Build your own private/internal repository:
This article introduces how to create a repository using Artifactory: http://www.theserverside.com/tt/articles/article.tss?l=SettingUpMavenRepository. The author also compares some mainstream Maven remote repository managers, including standard Maven proxy, Dead Simple Maven Proxy, Proximity, and Artifactory.
In my case, I also use Artifactory and deploy it to Tomcat. It has a nice web-based interface. Artifactory uses a database (Derby, I think) to store various repository data, so a user cannot inspect the repository content by looking directly at the directory.

Deploy your artifacts to remote repository by using maven-deploy plugin:
http://maven.apache.org/plugins/maven-deploy-plugin/usage.html
(1) If the artifacts are built with Maven, use the deploy:deploy mojo.
In your pom.xml, add a <distributionManagement/> element to tell Maven how to deploy the current package. If your repository is secured, you may also want to configure corresponding <server/> entries in your settings.xml to provide authentication information.
Command: mvn deploy.
(2) If the artifacts are NOT built with Maven, use the deploy:deploy-file mojo.
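For example, the <distributionManagement/> element might look like this (the repository id and URL are placeholders; the id must match a <server/> entry in settings.xml if authentication is required):

```xml
<distributionManagement>
  <repository>
    <id>internal-releases</id>
    <url>http://your.host/path/to/repository</url>
  </repository>
</distributionManagement>
```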
Sample command:
mvn deploy:deploy-file -Dpackaging=jar -Durl=file:/grids/c2/www/htdocs/maven2 
-Dfile=./junit.jar -DgroupId=gridshib -DartifactId=junit -Dversion=GTLAB

FAQ:
(1) What does maven standard directory layout look like?
http://maven.apache.org/guides/introduction/introduction-to-the-standard-directory-layout.html
(2) How to specify the parent artifact in pom.xml?
Read http://maven.apache.org/guides/introduction/introduction-to-the-pom.html.
(3) If a dependent package cannot be downloaded from the central Maven repository, three methods can be used to deal with it:

"
  1. Install the dependency locally using the install plugin. The method is the simplest recommended method. For example:
    mvn install:install-file -Dfile=non-maven-proj.jar -DgroupId=some.group -DartifactId=non-maven-proj -Dversion=1

    Notice that an address is still required, only this time you use the command line and the install plugin will create a POM for you with the given address.

  2. Create your own repository and deploy it there. This is a favorite method for companies with an intranet and need to be able to keep everyone in synch. There is a Maven goal called deploy:deploy-file which is similar to the install:install-file goal (read the plugin's goal page for more information).
  3. Set the dependency scope to system and define a systemPath. This is not recommended, however, but leads us to explaining the following elements:
"
(4) How to add a new repository?
Put the following snippet into pom.xml or settings.xml.
<repository>
  <id>your-new-repository-id</id>
  <name>New Maven Repository </name>
  <layout>default</layout>
  <url>Address of the new repository</url>
  <snapshots>
    <enabled>enable-it?</enabled>
  </snapshots>
  <releases>
    <enabled>enable-it?</enabled>
  </releases>
</repository>
(5) How to disable the default central Maven repository?
Put the following snippet into your pom.xml.
<repository>
  <id>central</id>
  <name>Maven Repository Switchboard</name>
  <layout>default</layout>
  <url>http://repo1.maven.org/maven2</url>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
  <releases>
    <enabled>false</enabled>
  </releases>
</repository>

(6) How can I package source code without running the tests?
Pass the parameter -Dmaven.test.skip=true on the command line.
Note this property is defined by the Surefire plugin.
(7) Why does "mvn clean" delete my source code?
In your pom.xml, if the <directory> element nested inside <build> is set to "./", "mvn clean" will delete everything in the current directory, including the src directory.
Two more elements can be used to specify the locations of compiled classes:
outputDirectory: the directory where compiled application classes are placed.
testOutputDirectory: the directory where compiled test classes are placed.
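For reference, a sketch of a <build> section showing the Maven defaults; note that <directory> points at target/, not "./":

```xml
<build>
  <directory>${project.basedir}/target</directory>
  <outputDirectory>${project.build.directory}/classes</outputDirectory>
  <testOutputDirectory>${project.build.directory}/test-classes</testOutputDirectory>
</build>
```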
(8) How to add resources to the built package?
http://maven.apache.org/guides/getting-started/index.html#How_do_I_add_resources_to_my_JAR.
http://maven.apache.org/guides/getting-started/index.html#How_do_I_filter_resource_files
(9) Sometimes you want to use some libraries at compile time without Maven adding them to your package (jar or war). How to do that?
Use dependency scope "provided" instead of default "compile". Read this post for details:
http://maven.apache.org/general.html#scope-provided. And this post elaborates Maven's dependency mechanism: http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope
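For example, the servlet API is a classic case: it is needed to compile, but the container supplies it at runtime, so it should not be packaged:

```xml
<dependency>
  <groupId>javax.servlet</groupId>
  <artifactId>servlet-api</artifactId>
  <version>2.5</version>
  <scope>provided</scope>  <!-- compiled against, but not packaged -->
</dependency>
```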
(10) How to build a war instead of a jar?
Use Maven WAR Plugin: http://maven.apache.org/plugins/maven-war-plugin/usage.html.
First, set the packaging entry in pom.xml to war.
<packaging>war</packaging>
Then you can use one of the following commands to build your war:
mvn package
mvn compile war:war
mvn compile war:exploded
mvn compile war:inplace

Also you can filter resources of your web app. See this post http://maven.apache.org/plugins/maven-war-plugin/examples/adding-filtering-webresources.html.
(11) How to build a war and a jar at the same time?
By default your source code is compiled into class files and placed in WEB-INF/classes. Sometimes you may want to build a jar and put it into WEB-INF/lib instead.
http://communitygrids.blogspot.com/2007/11/maven-making-war-and-jar-at-same-time.html
http://maven.apache.org/plugins/maven-war-plugin/war-mojo.html
http://maven.apache.org/plugins/maven-war-plugin/faq.html#attached
(12) The help plugin can be used to get information about a project, a plugin, or the system.
http://maven.apache.org/plugins/maven-help-plugin/
mvn help:describe -DgroupId=org.apache.maven.plugins -DartifactId=maven-compiler-plugin -Dfull=true
mvn help:describe -DgroupId=org.apache.maven.plugins -DartifactId=maven-compiler-plugin
mvn help:system
mvn help:all-profiles
mvn help:active-profiles


Properties reference:
http://docs.codehaus.org/display/MAVENUSER/MavenPropertiesGuide

While using Maven2, I ran into several bugs:
(1) targetPath support on the webResources war plugin parameter:
http://jira.codehaus.org/browse/MWAR-54
(2) maven-war-plugin webResources -- relative path:
http://www.mail-archive.com/users@maven.apache.org/msg77274.html
http://jira.codehaus.org/browse/MNG-2382
http://jira.codehaus.org/browse/MWAR-79
http://jira.codehaus.org/browse/MWAR-77

Install and Configure Ubuntu 8.10

Recently I got a used desktop to use as a Linux server, and installed Ubuntu 8.10. The installation was easy: I downloaded the Ubuntu .iso file, burned it to a CD, and installed from the CD.

Vim upgrade
By default, Ubuntu installs only vim-tiny, whose functionality may not satisfy end users' requirements. For example, opening a directory with vim-tiny fails rather than listing the files in that directory. So I installed vim-full, which includes all vim features, using this command:
sudo apt-get install vim-full
Then vim is usually configured to meet the user's specific requirements. In my case, the configuration file $HOME/.vimrc looks like:

set tabstop=4
set shiftwidth=4
set wrapmargin=8
set smartindent	"smart indentation
set expandtab  	"expand tabs to spaces
set ruler 		"display ruler on bottom right corner
set nu			"display line number
set incsearch	"turn on incremental search
set hlsearch		"highlight search result

:colorscheme ron 
:filetype indent on		"enable special indentation rules according to file type.

"highlight the 81-th character in each line
au BufEnter * ":/\%81c"
"in default configuration, textwidth option is set
"so I want to override the default value.
au BufRead * set tw=0

"when you open a file, cursor is moved to previous position when you edited the file last time.
au BufReadPost * if line("'\"") > 0|if line("'\"") <= line("$")|exe("norm '\"")|else|exe "norm $"|endif|endif

:syntax on		"turn on syntax detection and highlight
"stop cursor blinking. Only available when compiled with GUI enabled,
"and for MS-DOS and Win32 consol
set guicursor=a:blinkon0

All available options of vim can be read here: http://www.vim.org/htmldoc/index.html

Browser
Firefox, the default web browser, is installed out of the box. I installed several addons manually:

Tab Mix Plus: tab browsing with an added boost.
All-in-One Gestures: support for mouse gestures.
Firebug: web development evolved.
RestTest: allows users to send HTTP requests with customized headers and data.
Poster: a developer tool for interacting with web services and other web resources that lets you make HTTP requests, set the entity body, and content type. Similar to RestTest, but Poster works on FF3 while RestTest does not.
HttpFox: an HTTP analyzer.
Web Developer: adds a menu and a toolbar with various web developer tools.
FireShot: take a screenshot and edit it.
Google Notebook: Firefox addon for the Google Notebook application.
QuickRestart: adds a "Restart Firefox" item to the "File" menu.
MenuEditor: customize application menus.
FlashGot: enables single and mass downloads using external download managers; this addon itself is not a download program.

Terminal configuration
I like "green on black" color theme.

Keyboard shortcuts
Click System -> Keyboard Shortcuts to display the shortcut setting dialog. Shortcuts I use very often include:
Super + s     #start a terminal
Super + m    #toggle maximization state
Super + n     #minimize window
Ctrl+Alt+L    #lock screen
Alt + F2       #Show run application dialog
Alt + Tab     #switch between different windows
Ctrl + Alt + Left/Right/Up/Down    #switch to different workspaces

The following shortcuts are available only after you install Compiz and enable the corresponding components.
Super + E      #Expo key
Super + Tab  #another window switcher
Super + leftclick    #move the window

Wireless network configuration
A wired network connection was established successfully. However, after I moved the desktop it was no longer close to the router, and I didn't want to run a cable between them.
I have a D-Link System AirPlus G DWL-G122 Wireless USB Adapter. After I plugged it in, I used command lsusb to see whether the device was detected.  It was detected but it did not work. Obviously, a driver is needed to make it work. I read this post: https://help.ubuntu.com/community/WifiDocs/Driver/Ndiswrapper. It states

"D-Link DWL-G122 USB Wireless device: As of December 2008, Ubuntu 8.10 provides full 'out of the box' support for this device, using the rt73usb driver. In this case, there is no need to use ndiswrapper at all and there is no need to make any changes to the default /etc/modprobe.d/blacklist file."

But on my machine the wireless adapter still did not work even after the rt73usb module was loaded. My guess is that the device mentioned in the documentation and mine differ, although they appear to be the same.
Then I followed the instructions on post https://help.ubuntu.com/community/WifiDocs/Driver/Ndiswrapper and it worked well.
Network manager is a convenient tool to configure your network.
http://projects.gnome.org/NetworkManager/
https://help.ubuntu.com/community/NetworkManager

Network Management
Some packages and their useful commands to manage network devices

Supported commands by package "net-tools":
ifconfig: configure the kernel-resident network interfaces.

Supported commands by package "wireless-tools":
iwconfig: it is dedicated to the wireless interfaces. It can be used to set the parameters of NIC which are specific to wireless connection.
iwlist: display additional information from a wireless NIC. "iwlist scan" returns a list of available wireless networks.

Supported commands by package "network-manager":
NetworkManager: network management daemon.
nm-tool: utility to report state of network manager in text mode.
nm-system-settings:

Supported commands by package "network-manager-gnome":
nm-applet: a graphical NetworkManager applet. It displays an icon in the notification area (usually at the top right corner) for managing network devices and connections. It is usually started automatically when the system boots.
nm-connection-editor: displays a graphical connection configuration tool.

Supported commands by package "gnome-nettool":
gnome-nettool: GNOME network tools. This tool can be used to display detailed network information.

gnome-network-preference: set network proxy preferences.

Compiz window manager
A good reference: https://help.ubuntu.com/community/CompositeManager/CompizFusion
Compiz provides some very cool features I like.
Use command ccsm to display CompizConfig Settings manager.
More shortcuts:
Super + E             #Expo key. supported by "Expo" component.
Super + Tab          #another window switcher. supported by "Shift Switcher" component and "Ring Switcher" component.
Alt + left-button    #move the window. supported by "Move Window" component.
Alt + mid-button   #resize the window. supported by "Resize Window" component.

Input Method
To input Chinese, I added Chinese language support. Running the command "gnome-language-selector" displays the "language support" dialog, where you can select any language in the list you want to use; related packages are installed automatically.
SCIM (Smart Common Input Method) is installed by default. Try the command "scim-setup" to configure SCIM. Use the command "im-switch -z en_US -s scim" to switch the input method to SCIM, then restart X; SCIM will be started automatically.
Also you can manually start scim using command "scim -d".
To make the lookup table follow the cursor, in scim setup dialog uncheck
FrontEnd -> Global Setup -> Embed Preedit String into client window
and
Panel -> GTK -> Embedded lookup table

Misc. Useful packages I installed:
(*) sudo apt-get install gnome-device-manager
As its name implies, it is a device manager with GUI. Then it can be accessed by clicking Applications->System Tools -> Device Manager.
(*) Totem is installed by default which is a video player.
(*) sudo apt-get install vlc  #vlc player
Also, I tried to find a video player in the Ubuntu repositories that can play .rm and .rmvb files. None of VLC, KMPlayer, or MPlayer could do the job. The main problem seems to be licensing: shipping the Real codecs is illegal. Finally I downloaded the Linux version of RealPlayer from the official Real web site and installed it successfully.
(*) sudo apt-get install sun-java6-jdk  #jdk 6
Then use update-alternatives --config java to set default java executable.
(*) Package "Transmission" is installed by default which is a BitTorrent client program.
(*) sudo apt-get install deluge-torrent (a bittorrent client)
(*) sudo apt-get install d4x (a download manager)
(*) sudo apt-get install gwget (another download manager)

More tips
How to take screenshots: http://tips.webdesign10.com/how-to-take-a-screenshot-on-ubuntu-linux
How to view btchina in Linux?
http://mozilla.sociz.com/viewthread.php?tid=2367
First install greasemonkey addon and then install the script at http://userscripts.org/scripts/show/33286 (click the button "Install" at top right corner). After successful installation, you can set the options of the script by clicking Tools->GreaseMonkey->User Script Commands -> option-here.
If you use Firefox on Windows, the IETab addon can be used.
Flashget on Linux?
Candidates: gwget and WebDownloader for X (d4x), both installable from the Ubuntu repositories; wxDownload Fast and TrueDownloader may have to be installed from source.
Pick the ones you like from this list http://en.wikipedia.org/wiki/List_of_download_managers.
How to configure multiple versions of a program?
Use the update-alternatives utility.

Saturday, December 20, 2008

Latex resources

[How to insert figures to Latex document]
ftp://ftp.tex.ac.uk/tex-archive/info/epslatex.pdf
http://www.miwie.org/tex-refs/html/index.html
FAQ: http://www.physics.ohio-state.edu/~faqomatic/fom-serve/cache/103.html

[Screen presentation]
http://www.math.uakron.edu/~dpstory/pdf_present.html
http://www.miwie.org/presentations/html/index.html

References
http://www-h.eng.cam.ac.uk/help/tpl/textprocessing/teTeX/latex/latex2e-html/index.html
http://www-h.eng.cam.ac.uk/help/tpl/textprocessing/

Document Structuring Conventions and EPS

Document Structuring Conventions (DSC)

The PS standard does not specify the overall structure of a PS language program. DSC adds information about the document structure and printing requirements in a way that does NOT affect the PS interpreter in any manner.
DSC comments start with two percent characters (%%) as the first characters on a line, with no leading white space. The two percent characters are immediately followed by a unique keyword with no intervening white space. The keyword starts with a capital letter, e.g. BoundingBox. Some keywords end with a colon (part of the keyword), which indicates that the keyword takes arguments; one space character separates the ending colon from its arguments.
Example:
%%BoundingBox: 10 10 200 200

Resources
http://partners.adobe.com/public/developer/en/ps/5001.DSC_Spec.pdf
http://hepunx.rl.ac.uk/~adye/psdocs/DSC.html
http://en.wikipedia.org/wiki/Document_Structuring_Conventions

EPS

"An EPS file is a PS language program describing the appearance of a single page. The purpose of the EPS file is to be included in another PS language page description. The EPS file can contain any combination of text, graphics, and images."
EPS should be DSC-conformant.
EPS file format allows for an optional screen preview image.
Two required DSC header comments are:
%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: llx lly urx ury

The first line is the version comment. The second required DSC header comment gives the bounding box of the EPS content; the four values are expressed in the default PS coordinate system. For an EPS file, the bounding box is the smallest rectangle that encloses all the marks painted on the single page of the EPS file.
Also, header comments %%Creator:, %%Title:, and %%CreationDate: are recommended.

Coordinate System Transformation
Transform PS coordinate system according to the final placement of the EPS file.
Steps:
(a) Translate the origin of PS coordinate system to the user-chosen origin
(b) Rotate, if the user has rotated the EPS file
(c) Scale
After these three steps, the lower-left corner of the EPS file's bounding box coincides with the user-chosen origin.
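The three steps above can be sketched in PostScript; tx, ty, angle, sx, and sy are placeholders for the user-chosen values:

```postscript
% Sketch only: tx, ty, angle, sx and sy stand for real numbers
tx ty translate    % (a) move the origin to the chosen position
angle rotate       % (b) optional rotation
sx sy scale        % (c) optional scaling
% ... the EPS file contents are inserted here ...
```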

Spec: http://partners.adobe.com/public/developer/en/ps/5002.EPSF_Spec.pdf

Bounding box

This piece of information is critical, especially when .ps figures are inserted into a LaTeX document. The bounding box specifies the region occupied by the text and figures in the PS file. LaTeX needs this information to reserve the corresponding space for the content. It is specified in a single comment line:
%%BoundingBox: llx lly urx ury
It appears in one of the first lines of the .ps file.
Parameters:

llx: x coordinate of the lower-left corner
lly: y coordinate of the lower-left corner
urx: x coordinate of the upper-right corner
ury: y coordinate of the upper-right corner
Example:
%%BoundingBox: 50 50 410 302
Note:
(1) Origin of the coordinate system is lower left corner.
(2) Unit of coordinates is Postscript point which is 1/72 of an inch. 
"A PostScript point is slightly larger than a TeX point, which is 1/72.27 of an inch. In TeX and LaTeX, PostScript points are called "big points" and abbreviated bp, while TeX points are called "points" and abbreviated pt." (from http://www.physics.ohio-state.edu/~faqomatic/fom-serve/cache/105.html)
(3) 0 0 612 792 are the coordinates of a US Letter-sized page.
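In LaTeX, a wrong or missing bounding box can also be overridden at inclusion time; a sketch using the graphicx package (the file name is made up):

```latex
\usepackage{graphicx}
% ...
\includegraphics[bb=50 50 410 302]{figure.eps}
```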

Resources
http://www.physionet.org/physiotools/plt/plt/html/node47.html
http://amath.colorado.edu/documentation/LaTeX/reference/bbox.html

Additional Resources

Postscript:
http://www.tailrecursive.org/postscript/postscript.html
http://hepunx.rl.ac.uk/~adye/psdocs/

Friday, October 10, 2008

JAXP + Java Endorsed Packages

In this post I summarize XML-related stuff. After all, JAXP was designed with the hope of unifying the interfaces.
Here is a very good FAQ about JAXP.

Pluggability
Pluggability is necessary so that programmers can choose the desired implementations of JAXP.
From the JAXP specification, I found the lookup procedure. For XSLT, the procedure is:

• Use the javax.xml.transform.TransformerFactory system property
• Use the properties file "lib/jaxp.properties" in the JRE directory. This configuration file is in standard java.util.Properties format and contains the fully qualified name of the implementation class with the key being the system property defined above. The jaxp.properties file is read only once by the JSR-000206 Java™API for XML Processing (“Specification”) implementation and its values are then cached for future use. If the file does not exist when the first attempt is made to read from it, no further attempts are made to check for its existence. It is not possible to change the value of any property in jaxp.properties after it has been read for the first time.
• Use the Services API (as detailed in the JAR specification), if available, to determine the classname.
The Services API will look for the classname in the file META-INF/services/javax.xml.transform.TransformerFactory
in jars available to the runtime.
• Platform default TransformerFactory instance.
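A small sketch of the lookup from the application's point of view: with no system property, no jaxp.properties entry, and no services file on the classpath, step 4 applies and the platform default factory is returned.

```java
import javax.xml.transform.TransformerFactory;

public class FactoryLookupDemo {
    public static void main(String[] args) {
        // With none of steps 1-3 configured, the platform default is used.
        TransformerFactory tf = TransformerFactory.newInstance();
        // Prints the concrete implementation class the lookup selected.
        System.out.println(tf.getClass().getName());
    }
}
```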

It seems that option 3 is used most often. If you download an implementation jar (xalan.jar or xerces-api.jar ...), you should be able to find the corresponding file META-INF/services/javax.xml.xxx.xxx inside.

Problem with Java SE 1.4
However, there is a problem with Java 1.4: it bundles an implementation of JAXP 1.1 from Apache, and unfortunately the developers did not change the package names. Even if you download and use a newer version of the packages from Apache, none of the plug-in options described above works, because the class loader always uses the built-in implementation.

Solution
Then two solutions are provided:
(1) Change the package names to other internal package names (e.g. com.sun.org.apache.*)
This is what Sun does in later Java SE releases.
(2) For older Java SE releases, the Endorsed Standards override mechanism can be used.
Java 5 Endorsed Standard: http://java.sun.com/j2se/1.5.0/docs/guide/standards/index.html.
Users can specify the endorsed package path by setting system property - java.endorsed.dirs.
The default path is: <jre-home>/lib/endorsed
On my machine, the path is /usr/java/jdk1.5.0_09/jre/lib/endorsed.
Information about system properties can be accessed here http://java.sun.com/docs/books/tutorial/essential/environment/sysprop.html.
Also, I found a very simple program on the web that lists all system properties:

public class DisplaySystemProps {
    public static void main(String[] args) {
         System.getProperties().list(System.out);
    }
}
Then you can check the properties you are interested in, such as user home directory, java home, vm version, java class version, path separator, file separator...

Tomcat
For old Tomcat releases that rely on old versions of Java, users can put an XML parser into CATALINA_HOME/common/lib.
But for the latest versions this does not work. Read http://tomcat.apache.org/tomcat-6.0-doc/class-loader-howto.html for more information.

"Classes which are part of the JRE base classes cannot be overriden. For some classes (such as the XML parser components in J2SE 1.4+), the J2SE 1.4 endorsed feature can be used.
In previous versions of Tomcat, you could simply replace the XML parser in the $CATALINA_HOME/common/lib directory to change the parser used by all web applications. However, this technique will not be effective when you are running on JSE 5, because the usual class loader delegation process will always choose the implementation inside the JDK in preference to this one.
Tomcat utilizes this mechanism by including the system property setting -Djava.endorsed.dirs=$JAVA_ENDORSED_DIRS in the command line that starts the container."
In other words, users MUST use the Endorsed override mechanism to make Tomcat use another XML parser/XSLT processor. See this post for more information.

Thursday, August 28, 2008

Ruby notes(2) - OO

Function

Definition
def method([arg1, ..., argn,..., *arg, &arg])
    statements
end
Singleton Method
def obj.method([arg1, ..., argn,..., *arg, &arg])
    statements
    [rescue [exception [, exception ...]] [=> var] [then] code]...
    [else code]
    [ensure code]
end

undef method        #make the method undefined
yield(expr...)      #execute the block passed to the current method
yield
super(expr...)      #execute the same method in the superclass
super               #execute the same method in the superclass, passing along the current method's arguments
alias newmethodname oldmethodname
Method Invocation
(*) General
method ([param1 ...[, *param [, &param]]])
method [param1 ...[, *param [, &param]]]
obj.method([param1 ...[, *param [, &param]]])
obj.method [param1 ...[, *param [, &param]]]
obj::method([param1 ...[, *param [, &param]]])
obj::method [param1 ...[, *param [, &param]]]
(*) With blocks
method { |[var1 [, var2 ...]]|
    code
}
method do |[var1 [, var2 ...]]|
    code goes here
end
A block has its own local scope and code within a block can access local variables of outer scope.
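For example, a sketch of a block reading and updating a local variable from the enclosing scope:

```ruby
# A block shares the local variables of its enclosing scope.
total = 0
[1, 2, 3].each do |n|   # n is local to the block
  total += n            # total comes from the surrounding scope
end
puts total              # => 6
```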

Class

Definition
class classname [ < superclass]
    code
end

classname MUST be a constant, not a global or local variable. A class definition introduces a new scope. In order for different definitions of the same class to be merged, one of two conditions must be met:
(1) the class definition does not name a superclass, or
(2) if it does name a superclass, the superclass MUST match the one in the previous definition.

Singleton class definition (adds methods to a single object):
class << object
    code
end

Creation
Instances of a class are created by using method new.
str = String.new or str = String::new
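A minimal sketch (the Greeter class is made up for illustration):

```ruby
class Greeter
  def initialize(name)
    @name = name              # instance variable
  end
  def greet
    "Hello, #{@name}!"        # string interpolation
  end
end

g = Greeter.new("Ruby")       # Greeter::new("Ruby") is equivalent
puts g.greet                  # => Hello, Ruby!
```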

Modules

Definition
module modulename
    code
end
modulename MUST be a constant, not a global or local variable. A module definition introduces a new scope. Different definitions of the same module are merged.
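A sketch of a module mixed into a class with include (the module and class names are made up):

```ruby
module Shouty
  def shout
    name.upcase + "!"         # relies on the including class providing #name
  end
end

class Person
  include Shouty              # merge the module's methods into Person
  attr_reader :name
  def initialize(name)
    @name = name
  end
end

puts Person.new("ada").shout  # => ADA!
```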

Sunday, August 24, 2008

Locale in linux/unix

(1) Get locale information
Use command locale.

locale       #get the current locale environment for each locale category defined by the LC_* environment variables
locale -a    #output the names of all available locales
locale -m    #write the names of available charmaps
locale -k LC_CTYPE    #write the names and values of selected keywords (in this case, "LC_CTYPE")

command: localedef
"The  localedef  program  reads  the  indicated  charmap  and input files, compiles them to a form usable by the locale(7) functions in the C library, and places the six output files in the outputpath directory."

When a program looks up locale dependent values, it does this according to the following environment variables, in priority order:

  1. LANGUAGE
  2. LC_ALL
  3. LC_xxx, according to selected locale category: LC_CTYPE, LC_NUMERIC, LC_TIME, LC_COLLATE, LC_MONETARY, LC_MESSAGES, ...
  4. LANG

Variables whose value is set but is empty are ignored in this lookup.
Set a locale environment variable (e.g. LC_ALL, LANG, ...) using one of these forms:
ll_cc.encoding  E.g. de_DE.UTF-8, zh_CN.UTF-8
ll_cc@variant  E.g. de_DE@euro, sr_RS@latin
"There is also a special locale, called ‘C’. When it is used, it disables all localization: in this locale, all programs standardized by POSIX use English messages and an unspecified character encoding (often US-ASCII, but sometimes also ISO-8859-1 or UTF-8, depending on the operating system)."
http://www.gnu.org/software/gettext/manual/gettext.html#Setting-the-POSIX-Locale

C lib:
char *setlocale(int category, const char *locale): set the current locale.

Ruby notes-basic(1)

Semicolons and newline characters are interpreted as the end of a statement. +, =, or \ at the end of a line can be used as line-continuation characters.
Comments
Single-line comments: Comments extend from # to the end of a line.
Multi-line comments:

=begin
comments go here
=end
Note: '=begin' and '=end' must appear at the beginning of a line.
Integer literals: 123, 0123 (octal), 0xf34 (hex), 0b1011 (binary), ?d (character code for 'd')
Floating-point literals: 23.45, 3e4, 4E10, 5e+39
String literals:
double quotations: allow substitution and escape character sequences.
single quotations: don't allow substitution and escape character sequences except \\ and \'.
Adjacent strings are concatenated automatically. "abc""def" => "abcdef"
Command execution: `command`. The syntax is similar to its bash counterpart. It allows substitution and escape character sequences. The output generated by executing the command is converted to a string.
here documents:

<<EOF
content goes here
EOF
<<"EOF"
content goes here
EOF
Both forms behave like a double-quoted string.
<<'EOF'
content goes here
EOF
This form behaves like a single-quoted string.
<<`EOF`
command goes here
EOF
This form behaves like a back-quoted command string.
<<-EOF
content goes here
    EOF
With <<-, the terminating delimiter ('EOF' in this example) need not be at the beginning of a line.
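A small sketch contrasting double-quoted and single-quoted here documents:

```ruby
text = <<EOF
Hello #{1 + 1}
EOF
puts text      # interpolation applies: prints "Hello 2"

raw = <<'EOF'
Hello #{1 + 1}
EOF
puts raw       # printed literally: Hello #{1 + 1}
```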

Symbols:
"A symbol is an object corresponding to an identifier or variable"
:name    #symbol for 'name'
:$name   #symbol for the global variable '$name'
Array
[], [1,2,3], [1,[2,3]]
%w(str1 str2 str3) => ["str1", "str2", "str3"]
Dictionary (Map, or Hash):
{key => value}
Regular expression:
/pattern/
/pattern/options

Alternatives:
%!string! and %Q!string! are equivalent to "string".
%q!string! is equivalent to 'string'
%x!command! is equivalent to `command`.
%r!pattern!

Variable

Types:
global: accessible from anywhere in the program. Example: $var. Uninitialized global variables evaluate to nil.
instance: belong to an object. Example: @var.
class: belong to a class. Example: @@var. Class variables must be initialized before they are accessed in methods.
constant: constants as you know them from other languages. The name must begin with an uppercase letter and may not be defined inside a method.
local: local variables. The name must begin with a lowercase letter or _.

Variable substitution: #{varname}. E.g. #{@conf}, #{$stdin}
Variable assignments
var = value
var1, var2, ..., *varn = expr1, expr2, ..., *exprn
Operators:

expr ? expr1 : expr2
    The same as the ternary operator in C/C++.
defined? varname
    #returns a description of the variable/method, or nil if it is undefined.
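For example:

```ruby
x = 1
puts defined?(x)                    # => local-variable
puts defined?(String)               # => constant
puts defined?(@never_set).inspect   # => nil
```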

Control statements

Name Syntax Alternatives
if if cond [then]
    code
[elsif cond [then]
    code]
[else
    code]
end
code if cond
unless unless cond [then]
    code
[else
    code]
end
code unless cond
case case expr
[when expr [,expr ...] [then]
    code ]...
[else
    code]
end
 
while while cond [do]
    code
end
(1)code while cond
(2)begin code end while cond.
Note: code is executed once before cond is evaluated.
until until cond [do]
    code
end
(1) code until cond
(2) begin code end until cond
Note: code is executed once before cond is evaluated.
for for var[, var2 ...] in expr [do]
    code
end
expr do |var [,var...]|
    code
end
break Break from look  
next like continue in C++  
redo re-execute the loop body once without evaluation of conditional expression.  
retry re-execute a call to a method  
begin begin
    code
[rescue [exception_class [, excep_cls...]] [=>var] [then]
    code ]...
[else
    code ]
[ensure
    code ]
end
Code in ensure clause is always executed.
Code in else clause is executed only when no exceptions are raised.
Code in a rescue clause is executed when an exception is caught.
rescue code rescue expr Evaluate the expr only if an exception is caught.
raise raise exception_class, message
raise exception_object
raise message
raise
Raise an exception.
Calling raise with no arguments in a rescue clause re-raises the current exception.
BEGIN BEGIN{
    code
}
Code to be executed before the program is run.
END END{
    code
}
Code to be executed when the interpreter quits.

Monday, August 18, 2008

Linux shared object (.so)

Search path the linker uses to locate shared object files (Following content is from manual of command ld):
"1.  Any directories specified by -rpath-link options.
2.  Any  directories  specified  by -rpath options.  The difference between -rpath and -rpath-link is that directories specified by -rpath  options are included in the executable and used at run-time, whereas the -rpath-link option is only effective at link time. It is for the native linker only.
3.  On  an  ELF system, if the -rpath and "rpath-link" options were not used, search the contents of the environment variable "LD_RUN_PATH". It is for the native linker only.
4.  On  SunOS, if the -rpath option was not used, search any directories specified using -L options.
5.  For a native linker, the contents of the  environment  variable "LD_LIBRARY_PATH".
6.  For  a  native  ELF  linker, the directories in "DT_RUNPATH" or "DT_RPATH"  of  a  shared  library  are  searched  for shared libraries  needed  by it. The "DT_RPATH" entries are ignored if "DT_RUNPATH" entries exist.
7.  The default directories, normally /lib and /usr/lib.
8.  For  a  native  linker  on  an  ELF system, if the file /etc/ld.so.conf  exists,  the list of directories found in that file."

Resources
A good tutorial about static, shared dynamic and loadable linux libraries
A Chinese post introducing using and generation of linux shared object libraries
List of GNU tools which manipulate linux libraries

Saturday, August 09, 2008

API design

Many times I have struggled with API design dilemmas that I didn't know how to solve perfectly. As everyone knows, API design is important. However, it is difficult to get right. Actually, nothing is absolutely right or wrong; what matters is the tradeoff. I googled and found a pretty good article (written by Michi Henning from ZeroC) which gives insight into API design. The article is here.

I agree with most of the author's viewpoints. Here I just want to outline the article; most of the following content is from it. Gradually, I will add my own viewpoints and guidelines on how to design a good API.
The effect of a badly designed API:
"Even minor and quite innocent design flaws have a tendency to get magnified out of all proportion because APIs are provided once, but are called many times. If a design flaw results in awkward or inefficient code, the resulting problems show up at every point the API is called. In addition, separate design flaws that in isolation are minor can interact with each other in surprisingly damaging ways and quickly lead to a huge amount of collateral damage.
The lower in the abstraction hierarchy an API defect occurs, the more serious are the consequences."
How to design good API:
"(1) APIs should be designed from the perspective of the caller.
     APIs should be documented before they are implemented.
(2) An API must provide sufficient functionality for the caller to achieve its task.
(3) An API should be minimal, without imposing undue inconvenience on the caller.
(4) APIs cannot be designed without an understanding of their context.
(5) General-purpose APIs should be "policy-free;" special-purpose APIs should be "policy-rich."
(6) Good APIs don't pass the buck.
(7) Good APIs are ergonomic. "

Friday, July 25, 2008

Time and date (ISO 8601, RFC 3339, W3C)

Illustration by examples.
(1) ISO 8601
Because the specification also supports rarely-used features, I will not give an exhaustive description. I would like to give some common examples.
(*) Date

Use month and day
YYYY-MM-DD   2008-07-08   MM: 01-12; DD: 01-31
YYYY-MM      2008-07      day is omitted (YYYYMM is not allowed)
YYYY         2008
YYYYMMDD     20080708     hyphens are omitted
Use week and day in a week
YYYY-Www-d   2008-W01-1   ww: week number (01-52/53); d: day in the week (1-7)
YYYY-Www     2008-W01     first week of 2008
YYYYWwwd     2008W011
YYYYWww      2008W01
Use day in the year
YYYY-ddd     2008-014     ddd: day in the year, 001-365 (366 in leap years)
YYYYddd      2008123

(*)Time

hh:mm:ss       16:23:42       hh: hour (00-24); mm: minute (00-59); ss: second (00-60)
hhmmss         162342         colons are omitted
hh:mm          16:23          seconds are omitted
hhmm           1623
hh             16             both minutes and seconds are omitted
hh:mm:ss.sss   16:23:42.123   fractional seconds for more precision

(*)Time zone

hh:mm:ssZ    12:23:43Z     'Z' means the time is measured in UTC
time+hh:mm   13:23+01:00   the time is hh hours and mm minutes ahead of UTC
time+hhmm    13:23+0100    colon is omitted
time+hh      13:23+01      the time is hh hours ahead of UTC
time-hh:mm   11:23-01:00   the time is behind UTC; the zone is west of the prime meridian
time-hhmm    11:23-0100
time-hh      11:23-01

(*) Put together

<date>T<time>         2008-02-12T12:23:34
                      20080212T122334
<date>T<time>Z        2008-02-12T12:23:34Z
<date>T<time>+<zone>  2008-02-12T13:23:34+01
<date>T<time>-<zone>  2008-02-12T11:23:34-01

Resources:
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
http://en.wikipedia.org/wiki/ISO_8601

(2) RFC 3339
This is a profile of ISO 8601 which defines date and time formats to be used on the internet.
The following is an excerpt from the specification:

   date-fullyear   = 4DIGIT
   date-month      = 2DIGIT  ; 01-12
   date-mday       = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                             ; month/year
   time-hour       = 2DIGIT  ; 00-23
   time-minute     = 2DIGIT  ; 00-59
   time-second     = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
                             ; rules
   time-secfrac    = "." 1*DIGIT
   time-numoffset  = ("+" / "-") time-hour ":" time-minute
   time-offset     = "Z" / time-numoffset

   partial-time    = time-hour ":" time-minute ":" time-second
                     [time-secfrac]
   full-date       = date-fullyear "-" date-month "-" date-mday
   full-time       = partial-time time-offset

   date-time       = full-date "T" full-time

(3) W3C
W3C also defines a profile of ISO 8601 which can be accessed here.

Tuesday, July 22, 2008

Hex edit on linux

There are several useful tools which can be used to view and edit a binary file in hex mode.
(1) xxd (make a hexdump or do the reverse)
xxd infile        //hex dump. By default, in every line 16 bytes are displayed.
xxd -b infile    //bitdump instead of hexdump
xxd -c 10 infile //in every line 10 bytes are displayed instead of default value 16.
xxd -g 4 infile  //every 4 bytes form a group and groups are separated by whitespace.
xxd -l 100 infile     //just output 100 bytes
xxd -u infile    //use upper case hex letters.
xxd -p infile    //hexdump is displayed in plain format, no line numbers.
xxd -i infile     // output in C include file style.
E.g. output looks like:
unsigned char __1[] = {
  0x74, 0x65, 0x73, 0x74, 0x0a
};
unsigned int __1_len = 5;

xxd -r -p infile //convert from hexdump into binary. This requires a plain hex dump format(without line numbers).
E.g. in the infile, content should look like: 746573740a.
Note: additional whitespace and line breaks are allowed, so 74 65 73 740a is also legal.
xxd -r infile     //similar to last command except line numbers should be specified also.
E.g. in the infile, content should look like: 0000000: 7465 7374 0a.
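A quick round trip shows the plain format and its reverse together:

```shell
printf 'test\n' | xxd -p          # plain hexdump: 746573740a
printf '746573740a' | xxd -r -p   # back to binary: prints "test"
```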

xxd can easily be used in vim which is described here.

(2) od(dump files in octal and other formats)
Switches:
    -A  how offsets are printed
    -t  specify the output format
        (d: decimal; u: unsigned decimal; o: octal; x: hex; a: named characters; c: ASCII characters or escape sequences.
        Adding a z suffix to any type adds a display of printable characters to the end of each line of output)
    -v  output consecutive identical lines instead of suppressing them
    -j  skip the specified number of bytes first
    -N  only output the specified number of bytes

Example:
dd bs=512 count=1 if=/dev/hda1 | od -Ax -tx1z -v
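The same switches work on any input, so you can try them on a pipe instead of /dev/hda1:

```shell
# -Ax: hex offsets; -tx1z: one byte per column plus printable chars (>test.<)
printf 'test\n' | od -Ax -tx1z
# -An: suppress offsets entirely, bytes only
printf 'test\n' | od -An -tx1
```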

(3) hexdump/hd
Currently, I have not used this command. I may update this part once I have some experience with hexdump.

File system and partition information in Linux/Unix

I would like to summarize some useful commands which can be used to obtain information about partition table and file systems.
Some basic information (model, capacity, driver version...) about hard disk can be obtained by accessing files under these directories:
/proc/ide/hda, /proc/ide/hdc ... (for IDE devices)
/proc/scsi/  (for SCSI devices).
Useful posts:
How to add new hard disks, how to check partitions...?
File system related information
Partitions and Volumes

Partition table:
All partitions of a disk are numbered as follows: 1-4 are primary and extended partitions, 5-16 (15) are logical partitions.
(1) fdisk (from package util-linux)
Partition table manipulator for Linux.
fdisk -l device       //list partition table
fdisk -s partition    //get size of a partition. If the parameter is a device, get capacity of the device.

(2) cfdisk
Curses based disk partition table manipulator for Linux. More user-friendly.
cfdisk -Ps    //print partition table in sector format
cfdisk -Pr    //print partition table in raw format(chunk of hex numbers)
cfdisk -Pt    //print partition table in table format.

(3) sfdisk
List partitions:
sfdisk -s device/partition    //get size of a device or partition
sfdisk -l device                  //list partition table of a device
sfdisk -g device                 //show kernel's idea of geometry
sfdisk -G device                //show geometry guessed based on partition table
sfdisk -d device                //dump the partitions of a device in a format useful as input to sfdisk
sfdisk device -O file          //just before writing the new partition table, save the sectors that will be overwritten to file
sfdisk device -I file           //restore the old partition table preserved with the -O option
Check partitions:
sfdisk -V device                 //apply consistency check
It can also modify partition table.

(4) parted
An interactive partition manipulation program. Use print to get partition information.

File system:
use "man fs" to get linux file system type descriptions.
/etc/fstab        file system table
/etc/mtab        table of mounted file systems
(1) df (in coreutils)
Report file system disk space usage.
df -Th    //list fs information including file system type in human readable format
(2) fsck (check and repair a Linux file system)
fsck is simply a front-end for the various file system checkers (fsck.fstype) available under  Linux.
(3) mkfs (used to build a Linux file system on a device, usually a hard disk partition)
It is a front-end to many file-system-specific builders (mkfs.fstype). Various back-end builders:
mkdosfs, mke2fs, mkfs.bfs, mkfs.ext2, mkfs.ext3, mkfs.minix, mkfs.msdos, mkfs.vfat, mkfs.xfs, mkfs.xiafs.
(4) badblocks - search a device for bad blocks
(5) mount
(6) umount

Ext2/Ext3
(1) dumpe2fs (from package e2fsprogs, for ext2/ext3 file systems)
Prints the super block and block group information for the filesystem.
dumpe2fs -h /dev/your_device      //get superblock information
dumpe2fs /dev/your_device | grep -i superblock      //get backups of superblock.
(2) debugfs (debug a file system)
Users can open and close a file system, and link/unlink or create files.
(3) e2fsck (check a Linux ext2/ext3 file system)
This tool is mainly used to repair a file system.

Bash cheat sheet

Part of the content in this post is from other web sites, mainly from the Bash manual. Because all the formal documents and introductory books are long and very detailed, I just list some key points and usage examples so that later I can quickly recall a bash feature by reading this post instead of the whole bash manual.

[Bash switch]
First, get a list of installed shells: chsh -l or cat /etc/shells.
Then use chsh -s /path/to/shell to change your default shell. It modifies the file /etc/passwd to reflect your change of shell preference.

[Output]
[echo] Output the arguments. If -n is given, trailing newline is suppressed. If -e is given, interpretation of escape characters is turned on.
[Output redirection]
> output redirection
"Code can be added to a program to figure out when its output is being redirected. Then, the program can behave differently in those two cases—and that’s what ls is doing." This means what you see on screen may be different from content of the file to which you redirect the output. For command ls, use ls -C>file.
command 1>file1 2>file2
1: stdout; 2: stderr; 1 is the default file descriptor.
command >& file
command &> file
command > file 2>&1

These three commands redirect stdout and stderr to the same file.
The tee command writes the output to the filename specified as its parameter and also write that same output to standard out.
>> append
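A minimal sketch of the redirections above (file names are arbitrary):

```shell
# stdout and stderr to separate files; /nonexistent only produces an error,
# so out.txt ends up empty and err.txt holds the message
ls /nonexistent > out.txt 2> err.txt || true
cat err.txt
# both streams into one file, then append a second line
echo "hello" > both.txt 2>&1
echo "again" >> both.txt
cat both.txt
```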

[Input]
< input
<< EOF here-document.
"all lines of the here-document are subjected to parameter expansion, command substitution, and arithmetic expansion.”
<< \EOF or << 'EOF' or <<E\OF
When we escape some or all of the characters of the EOF, bash knows not to do the expansions.
<<-EOF
The hyphen just after the << is enough to tell bash to ignore the leading tab characters so that you can indent here-document. This is for tab characters only and not arbitrary white space.
use read to get input from user.
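The unquoted and quoted delimiters side by side:

```shell
name=world
cat <<EOF          # unquoted delimiter: ${name} is expanded
hello, ${name}
EOF
cat <<'EOF'        # quoted delimiter: no expansion at all
hello, ${name}
EOF
```

The first cat prints "hello, world", the second prints the literal text "hello, ${name}".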

[Command execution]
get exit status of last executed command using $? (0-255).
(1) command1 & command2 & command3       //execute commands independently, in the background
(2) command1 && command2 && command3  //execute each command only if the previous one succeeded
(3) command1 || command2
(4) command_name&
If a command is terminated by the control operator ‘&’, the shell executes the command asynchronously in a subshell. The shell does not wait for the command to finish, and the return status is 0 (true)
(5) nohup - run a command immune to hangups, with output to a non-tty
(6) command1; command2;
Commands separated by a ‘;’ are executed sequentially; the shell waits for each command to terminate in turn. The return status is the exit status of the last command executed.
(7) $( cmd ) or `cmd`: execute the command and return the output.
(8) $(( arithmetic_op )): do arithmetic operation and return the output.
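A few of these in action:

```shell
false; echo $?              # exit status of the last command: 1
true  && echo "ran"         # && runs the right side only on success
false || echo "fallback"    # || runs the right side only on failure
echo "year: $(date +%Y)"    # command substitution
echo $(( 6 * 7 ))           # arithmetic expansion: 42
```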

[Pipe]
Each command in a pipeline is executed in its own subshell.

[Group commands]
{ comamnd1; command2; } > file            //use current shell environment
( command1; command2; ) >file               //use a subshell to execute grouped commands
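The difference matters for variable assignments: the { } group runs in the current shell, while the ( ) group runs in a subshell whose changes are lost.

```shell
x=1
{ x=2; }    # current shell: the assignment survives
echo "$x"   # 2
( x=3 )     # subshell: the assignment is discarded
echo "$x"   # still 2
```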

[Variables]
Array variable: varname=(value1 value2); elements are separated by whitespace, not commas. Use ${varname[index]} to access an element.
Type of all variables is string. And sometimes bash operators can treat the content as other types.
Assignment: VARNAME=VALUE    //Note: there is no whitespace around the equal sign. Otherwise the bash
                                              //interpreter cannot distinguish the variable name from a command
                                              //name and will treat VARNAME as a command.
Get value: $VARNAME    //the dollar sign is a must-have.
Use full syntax(braces) to separate variable name from surrounding text:
E.g. ${VAR}notvarname
export can be used to export a variable to environment so that it can be used by other scripts.
export VAR=value
Exported variables are passed by value: changing the value in the called script does not affect the original value in the calling script.
Use command set to see all defined variables in the environment
Use command env to see all exported variables in the environment.
${varname:-defaultvalue}: if variable varname is set and not empty, return its value; else return defaultvalue.
${varname:=defaultvalue}: same as above, except that defaultvalue is also assigned to varname if the variable is empty or unset. varname cannot be a positional parameter ($1, $2, ..., $*).
${varname=defaultvalue}: the assignment happens only when varname is not set at all.
${varname:?message}: if varname is unset or empty, print message to stderr and exit.
Parameters passed to a script can be retrieved by accessing special variables: ${1},${2}...
Use variable ${#} to get number of parameters.
Command shift can be used to shift the parameters: The positional parameters from $N+1 ... are renamed to $1 ...  If N is not given, it is assumed to be 1.
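The :- and := forms side by side:

```shell
unset v
echo "${v:-fallback}"   # prints fallback; v is still unset
echo "${v:=assigned}"   # prints assigned AND assigns it to v
echo "$v"               # assigned
v=""
echo "${v:-fallback}"   # fallback: the : variants treat empty as unset
echo "${v-fallback}"    # prints an empty line: without :, an empty v counts as set
```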

[Special variables]
Variable $* can be used to get all parameters. If a certain parameter contains blank, use "$@".
E.g. command "param1" "second param"
value of $* or $@: param1 second param
value of "$@": "param1" "second param"
value of "$*": "param1 second param"

Always use "$@" if possible! And always use "${varname}" (including the double quotes) to access the value of a variable.
If a parameter contains blanks, use double quotes to retrieve the value:
ls "${varname}"   instead of   ls ${varname}.
${#}: Expands to the number of positional parameters in decimal.
${-}: (A hyphen) Expands to the current option flags as specified upon invocation, by the set builtin command, or those set by the shell itself (such as the -i option).
${$}: Expands to the process id of the shell. In a () subshell, it expands to the process id of the invoking shell, not the subshell.
${!}: Expands to the process id of the most recently executed background (asynchronous) command.
${0}: Expands to the name of the shell or shell script. This is set at shell initialization.
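The $* versus "$@" difference shows up as soon as a parameter contains a blank, as in the example above:

```shell
count() { echo $#; }           # prints the number of arguments it received
set -- "param1" "second param"
count $*      # 3: unquoted $* is word-split
count "$@"    # 2: "$@" preserves each parameter intact
count "$*"    # 1: "$*" joins everything into one string
```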

[Misc]
(1) ${#var}: returns the length of the string.
(2) The # character starts a comment.
(3) Long comments: use the "do nothing" command (:) together with a here-document.
E.g.
:<<EOF
doc goes here
EOF
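Both tricks in a few lines:

```shell
s="hello"
echo ${#s}        # 5
: <<'COMMENT'
This whole block is skipped: the : command does nothing
with the here-document it receives.
COMMENT
echo "still running"
```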

[Quotation]
Unquoted text and double-quoted text are subject to shell expansion.
(1) In double-quoted text, '\' can be used to escape special characters. Backslashes preceding characters without a special meaning are left unmodified.
(2) Single-quoted text is not subject to shell expansion, so escape sequences do not work inside it. The text 'use \' in single quoted text' is incorrect: you cannot include a single quote inside single-quoted text, even with a backslash. A single quote can only be escaped outside of surrounding single quotes.
(3) Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. The expanded result is single-quoted, as if the dollar sign had not been present.
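Examples of the three quoting behaviours (the $'...' form needs bash):

```shell
echo '${HOME} stays literal'    # single quotes: no expansion
echo "${HOME} expands"          # double quotes: expansion happens
echo 'it'\''s'                  # a single quote escaped outside the quotes: it's
echo $'a\tb'                    # ANSI C escapes: a real tab between a and b
```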

[Function/Command scope]
If you want to force the use of an external command instead of a built-in command, use enable -n. Or you can use the command builtin, which ignores shell functions.
E.g. enable -n built-in-function
       command command_name

[Control structures and more]
(1) until test-command; do statements; done
(2) while test-command; do statements; done
(3) for var in words; do statements; done
(4) for ((expr1; expr2; expr3 )); do statements; done
(5) if test-commands; then statements;
     elif test_commands2; then statements;
     else statements;
     fi
(6) case word in
     pattern) statements;;
     pattern2|pattern3) statements;;
     *) statements;;
     esac
(7) select var in words; do statements; done
The commands are executed after each selection until a break command is executed, at which point the select command completes.
(8) (( arithmetic_expression))
If the value of the expression is non-zero, return status is 0; otherwise return status is 1. This is equivalent to
let "expression".
(9)[[ ]]
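A for loop and a case statement combined:

```shell
for f in one two three; do
  case "$f" in
    one|two) echo "low: $f" ;;
    *)       echo "other: $f" ;;
  esac
done
# low: one
# low: two
# other: three
```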

[Shell expansions]
(1) Brace expansion (note: ${...} is not expanded)
a{b,c,d}e  => abe ace ade
(2) Tilde expansion
~: value of $HOME
~username: home directory of user username.
~+: $PWD
~-: $OLDPWD
~N: equivalent to 'dirs +N'
~-N: equivalent to 'dirs -N'
(3) Parameter expansion
(4) Command substitution
$(command) or `command`: executes the command and replaces the command substitution with its standard output, with any trailing newlines deleted.
(5) Arithmetic expansion
$(( expression )): allows the evaluation of an arithmetic expression and substitution of the result.
(6) Process substitution
<(command) and >(command) let a command's output or input be used where a file name is expected, e.g. diff <(sort file1) <(sort file2).
(7) Word splitting
"The shell scans the results of parameter expansion, command substitution, and arithmetic expansion that did not occur within double quotes for word splitting."
(8) Filename expansion
"After word splitting, unless the -f option has been set (see The Set Builtin), Bash scans each word for the characters ‘*’, ‘?’, and ‘[’. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern."

[Useful commands]
(*) Customize the shell prompt

Set the environment variable "PS1".
(*) How to find commands?
Try following commands:
[type]
    a bash command. Searches files in the PATH, aliases, keywords, functions, built-in commands...
[which]
    displays the full path of the executables that would have been executed when this argument had been entered at the shell prompt(just searches files in $PATH).
[apropos]
    search the whatis database for strings(same as man -k)
[locate, slocate]
    reads  one or more databases prepared by updatedb and writes file names matching at least one of the PATTERNs to standard output. Location of the actual database varies from system to system. 
[whereis]
    locate the binary, source, and manual page files for a command. An entry is displayed only when the whole searched word is matched(not a substring of a long word).
[find]
    search for files in a directory hierarchy.
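type and command in practice; type -t prints one word from the set alias, keyword, function, builtin, file:

```shell
type -t cd        # builtin
type -t if        # keyword
command -v ls     # full path of the ls executable (similar to which)
```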
(*) Info about a command
[man]:
    display manual of the command;
[help]:
    displays information about bash built-in commands. For a bash built-in command, if you use man to display its information, you will get the large bash manual instead.
[info]:
    display Info doc of an arbitrary command.
(*) Get information about a file
[ls]
    Lists information about the FILEs (the current directory by default). You can either list directory contents or get information about a file (a regular file, not a directory).
E.g.
    ls -a .*  //list all files whose names start with a dot? No! Bash expands the pattern .* first, so
               // ".." is included, which also displays the contents of the parent directory, and so on.
               // Use echo .* to see what the pattern expands to.
    ls -d .*  //the -d switch forces ls to list the directory entries themselves instead of their contents.
[stat]
    Display file or file system status.
[find]
    E.g. find /path/ -name file_name -printf '%m %u %t'
[file]
    determine file type. There are three sets of tests, performed in this order: filesystem tests, magic number tests, and language tests.

Saturday, July 19, 2008

Send email in PHP

On Linux, a tool called sendmail can be used to send emails. You don't need an SMTP server to send emails out. An interesting thing about sendmail is that you cannot set some common email fields easily. For example, you cannot set the email subject directly on the command line with a switch. Instead, the following syntax should be used:

sendmail destination@domain.com
Subject: This is the subject.

Main body goes here
.

In PHP, users can send email easily because PHP relies on sendmail. See this page for instructions. Main configuration options in php.ini:

Mail configuration options
Name Default Changeable Changelog
SMTP "localhost" PHP_INI_ALL  
smtp_port "25" PHP_INI_ALL Available since PHP 4.3.0.
sendmail_from NULL PHP_INI_ALL  
sendmail_path "/usr/sbin/sendmail -t -i" PHP_INI_SYSTEM  

Friday, July 11, 2008

Linux/Unix Terminal

Terminfo database
At first, we need a database which describes the capabilities of terminals. There are two basic databases: termcap and terminfo. The latter is more widely used currently. You can create a new description file and edit it in textual form, then use a tool called tic to compile the source text into the binary format used by other programs. The terminfo database is generally located under /usr/share/terminfo.
Some useful information you may care about can be accessed here and here. From there, you also can download latest terminfo database which contains information about many kinds of terminals.
Single UNIX Specification contains a section defining terminfo source format which can be accessed here.

Text terminal
Control terminals in low-level:
ioctls: gives very low-level access.
termios: "The termios functions describe a general terminal interface that is provided to control asynchronous communications ports."(From manpage).

High-level wrapper libraries:
Low-level APIs(ioctls/termios) don't hide complexity of terminal programming. Your programs must depend on specific terminal type. Some wrapper libraries exist which provide portable interface to many terminal types.
(1) ncurses (http://invisible-island.net/ncurses/ncurses.html)
Allow programmer to write TUI in a terminal-independent manner. It also does optimization work to reduce latency when using remote shells. Written in C.
Moreover, another package called Curses Development Kit(CDK) provides more functionalities. Mainly, it provides a library of curses widgets which can be used and customized in your programs.
Here is a good tutorial about how to use Ncurses in C.
ncurses checks $TERMINFO first, then $HOME/.terminfo, before falling back to the standard location.
(2) shell curses(http://www.mtxia.com/css/Downloads/Scripts/Korn/Functions/shcurses/)
It enables shell programmers to do TUI programming easily. It worked originally in Korn shell. I am not sure whether it has been ported to other shells.
(3) S-Lang

GUI
Currently, GUI exists almost everywhere in the computer world. Pure text terminal is not used very often. X window system is the big player in linux/unix. As a result, programmers can provide a GUI to the customers instead of TUI.
Some interesting articles: State of Linux graphics, various Linux/Unix desktops, X window resource/link collection.

Pseudo terminal/Terminal emulator
With the disappearance of traditional serial-port terminals (a few may still be in use), terminal-control code was rewritten to promote reuse. So currently "terminal" is just a term for devices/programs which send requests to a master and handle responses from the master; no physical serial port is involved. Most of the time we use remote shells over the internet. In the directory /dev/, there are corresponding device files for these pseudo terminals, with common names like pty*, pts*, ptty...

Some useful commands:
(*) tic - the terminfo entry-description compiler.
translates  a terminfo file from source format into compiled format.  The compiled format is necessary for use with the library routines in ncurses.
(*) captoinfo (alias for tic) - convert a termcap description into a terminfo description
    infotocap (alias for tic) - convert a terminfo description into a termcap description
(*) screen: The screen program allows you to run multiple virtual terminals, each with its own interactive shell, on a single physical terminal or terminal emulation window.
/******* Set/Rest terminal ******/
(*) tset : terminal initialization
(*) setterm: set terminal attributes.
    setterm writes to standard output a character string that will invoke the specified terminal capabilities. Where possible, terminfo is consulted to find the string to use. Note: the attribute names used in this command are different from the capability names in the terminfo database!
(*) stty - prints or changes terminal characteristics, such as baud rate.(related to line settings)
/****** Query  ********/
(*) infocmp - compare or print out terminfo descriptions. It can be used to print all capabilities.
(*) tput : query terminfo database. Query a specific capability.
e.g. tput setaf     //get foreground color.
       tput cols      //get number of columns
       tput longname  //get longname description of current terminal.
(*)  toe - table of (terminfo) entries. With  no  options,  toe  lists  all available terminal types by primary name with descriptions.

How to know your current terminal type?
tset -q
infocmp
How to tell Linux the type of your terminal?
Usually, you must explicitly tell Linux your terminal type by setting the environment variable $TERM.

Thursday, July 10, 2008

Unicode

Frequently, I use UTF-8, GB2312, ASCII... I understand the ASCII encoding well because it is simple. For Unicode, I have been confused by the various terms (e.g. UTF-8, UTF-16, UTF-32, UCS-2, UCS-4...). Now I have done some research and summarize what I have learned.

Resources:
Official site: http://www.unicode.org/
Unicode Standard (version 5): http://www.unicode.org/versions/Unicode5.0.0/bookmarks.html
Unicode FAQ: http://unicode.org/faq/
FAQ of various encoding forms: http://unicode.org/faq/utf_bom.html
Online tools:
An online Unicode conversion tool: http://rishida.net/scripts/uniview/conversion.php (Display various encoding representations of what you input.)
Unihan charset database: http://www.unicode.org/charts/unihan.html (You can query by Unicode code point.)
Another good tool: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%26%2320320%3B&mode=char

Windows XP provides a useful tool called "Character Map": Start -> All Programs -> Accessories -> System Tools -> Character Map.
In linux/Unix, iconv is a powerful tool which can be used to convert between different character encodings. (It seems BOM described below is not handled by iconv. So you should not include BOM in the source file.)
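For example, you can convert a single character between encoding forms with iconv and inspect the resulting bytes with od:

```shell
# 'A' (U+0041) encoded as UTF-16BE is the two bytes 00 41
printf 'A' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1
# the Euro sign U+20AC is three bytes in UTF-8: e2 82 ac
printf '\342\202\254' | od -An -tx1
```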

Two entities:
Unicode consortium
ISO/IEC JTC1/SC2/WG2
Good news is that these two entities are well synchronized and a standard from one entity is aligned to corresponding standard from the other entity.
Unicode 5 is synchronized with Amd2 to ISO/IEC 10646:2003 plus Sindhi additions.

Encoding forms
The function of an encoding form is to map every Unicode code point to a unique byte sequence. The code point assignments themselves stay the same; only the encoding forms differ.
ISO/IEC 10646 defines 4 encoding forms of the universal character set: UCS-4, UCS-2, UTF-8 and UTF-16.
Code unit: the encoding of every character consists of an integral number of code units. For example, for UTF-16 the code unit is 16 bits, which means the length of an encoding is 16 or 32 bits.
(1) UCS-4/UTF-32
Currently, they are almost identical. It uses exactly 32bits for each code point. So it is fixed length encoding.
"This single 4-byte-long code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. "
(2) UCS-2/UTF-16
Code unit is 16 bits. Commonly used characters usually can be encoded in 1 code unit (16bits).
From wikipedia: "UTF-16: For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding will result in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use."
From wikipedia: "UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which is a predecessor to UTF-16. The UCS-2 encoding form is nearly identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value."
In a word, UCS-2 is a fixed-length encoding which can only encode characters in the BMP range U+0000 through U+FFFF, while UTF-16 is a variable-length encoding which supports characters in other planes by using surrogate pairs.
Note: the two values U+FFFE and U+FFFF, as well as the 32 values from U+FDD0 to U+FDEF, represent noncharacters. They are invalid in interchange, but may be freely used internal to an implementation. Unpaired surrogates are invalid as well, i.e. any value in the range 0xD800 to 0xDBFF not followed by a value in the range 0xDC00 to 0xDFFF.
(3) UTF-8
Code unit is 8 bits. It encodes one code point in one to four octets.
From wikipedia: "It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages"
Note: UTF-8 is NOT compatible with Latin-1 (ISO-8859-1) beyond the ASCII range; only the first 128 code points share the same byte values.
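Both points can be demonstrated with a small Python sketch (illustrative, not from the original post): the byte length varies from one to four, ASCII bytes pass through unchanged, and Latin-1's upper half does not.

```python
# UTF-8 uses one to four bytes per code point.  The first 128 code
# points (ASCII) encode as themselves, which is the source of the
# backwards compatibility; Latin-1's upper half does not round-trip.
samples = {"A": 1, "\u00e9": 2, "\u4e2d": 3, "\U0001F600": 4}
for ch, nbytes in samples.items():
    assert len(ch.encode("utf-8")) == nbytes

assert "A".encode("utf-8") == b"A"  # byte-identical to ASCII
# é (U+00E9) is one byte in Latin-1 but two bytes (C3 A9) in UTF-8:
assert "\u00e9".encode("utf-8") != "\u00e9".encode("latin-1")
```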

Byte Order
For UTF-16 and UTF-32, a code unit is more than one byte, so a natural question arises: which byte order, big-endian (MSB first) or little-endian (LSB first)? For UTF-8 this problem does not exist because the code unit is one byte. The Byte Order Mark (BOM) solves it; in fact, a BOM indicates not only the byte order but also the encoding form.
BOM table: From http://unicode.org/faq/utf_bom.html:

Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
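The table rows above are easy to reproduce: the BOM is simply the character U+FEFF encoded in the chosen form. A Python sketch (illustrative, not from the original post):

```python
# The BOM is U+FEFF encoded in the chosen encoding form, so encoding
# that one character reproduces each row of the BOM table.
bom = "\ufeff"
assert bom.encode("utf-32-be") == b"\x00\x00\xfe\xff"  # UTF-32, big-endian
assert bom.encode("utf-32-le") == b"\xff\xfe\x00\x00"  # UTF-32, little-endian
assert bom.encode("utf-16-be") == b"\xfe\xff"          # UTF-16, big-endian
assert bom.encode("utf-16-le") == b"\xff\xfe"          # UTF-16, little-endian
assert bom.encode("utf-8") == b"\xef\xbb\xbf"          # UTF-8
```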

Here are some guidelines to follow(From http://unicode.org/faq/utf_bom.html):

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided. For example, a shell script must begin with the literal bytes #!/bin/sh, so a BOM in front of them would break it.
  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
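Python's "utf-8-sig" codec implements the "optional BOM as signature" convention from guideline 2: it prepends the UTF-8 BOM when encoding and strips an optional BOM when decoding. A sketch (illustrative, not from the original post):

```python
# The "utf-8-sig" codec writes a UTF-8 BOM on encode and tolerates an
# optional BOM on decode, matching the signature convention above.
data = "hello".encode("utf-8-sig")
assert data == b"\xef\xbb\xbf" + b"hello"       # BOM prepended
assert data.decode("utf-8-sig") == "hello"      # BOM stripped on decode
assert b"hello".decode("utf-8-sig") == "hello"  # BOM is optional
```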

Summary:

From http://unicode.org/faq/utf_bom.html:

Name                     UTF-8    UTF-16   UTF-16BE    UTF-16LE       UTF-32   UTF-32BE    UTF-32LE
Smallest code point      0000     0000     0000        0000           0000     0000        0000
Largest code point       10FFFF   10FFFF   10FFFF      10FFFF         10FFFF   10FFFF      10FFFF
Code unit size           8 bits   16 bits  16 bits     16 bits        32 bits  32 bits     32 bits
Byte order               N/A      <BOM>    big-endian  little-endian  <BOM>    big-endian  little-endian
Minimal bytes/character  1        2        2           2              4        4           4
Maximal bytes/character  4        4        4           4              4        4           4

Note: all valid code points that are encoded are the same: from U+0000 through U+10FFFF.

More:
None of the UTFs can generate every arbitrary byte sequence. In other words, not every 4-byte-long sequence represents a legal encoding in UCS-4/UTF-32.
From Unicode consortium site: "Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping  must also map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 noncharacters (including FFFE and FFFF), as well as unpaired surrogates."
This means that to guarantee reversibility, not only valid characters but also the invalid code points (noncharacters and unpaired surrogates) must be considered and encoded appropriately.
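The flip side of "not every byte sequence is legal" is that a strict decoder must reject the illegal ones. A Python sketch (illustrative, not from the original post) shows UTF-8 rejecting both an overlong sequence and the bytes that would encode an unpaired surrogate:

```python
# A strict UTF-8 decoder rejects malformed sequences, including the
# three bytes that would encode the unpaired surrogate U+D800.
for bad in (b"\xc0\xaf",        # overlong encoding (illegal start byte C0)
            b"\xed\xa0\x80"):   # would decode to the surrogate U+D800
    try:
        bad.decode("utf-8")
        raised = False
    except UnicodeDecodeError:
        raised = True
    assert raised  # both sequences must be rejected
```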

How to fit a Unicode character into an ASCII stream?
See http://unicode.org/faq/utf_bom.html#31. Several methods are used in practice: (1) UTF-8; (2) '\uXXXX' escapes in C or Java; (3) numeric character references "&#NNNN;" (decimal) or "&#xHHHH;" (hex) in HTML or XML.
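Methods (2) and (3) can be illustrated in Python (a sketch, not from the original post): Python shares the '\uXXXX' escape syntax with C and Java, and its html module resolves numeric character references.

```python
import html

# Method (2): \uXXXX escapes keep source code pure ASCII while still
# denoting arbitrary characters (Python shares this syntax with C/Java).
assert "\u4e2d" == chr(0x4E2D)

# Method (3): HTML/XML numeric character references, decimal or hex.
assert html.unescape("&#20013;") == "\u4e2d"   # 20013 == 0x4E2D
assert html.unescape("&#x4E2D;") == "\u4e2d"
```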