Friday, July 25, 2008

Time and date (ISO 8601, RFC 3339, W3C)

Illustration by examples.
(1) ISO 8601
Because the specification also covers rarely-used features, I will not give an exhaustive description; instead, here are some common examples.
(*) Date

Use year, month and day
    YYYY-MM-DD   2008-07-08   MM: 01-12, DD: 01-31
    YYYY-MM      2008-07      day is omitted (YYYYMM is not allowed)
    YYYY         2008
    YYYYMMDD     20080708     hyphens are omitted
Use week and day of the week
    YYYY-Www-d   2008-W01-1   ww: week number (01-52/53), d: day of the week (1-7)
    YYYY-Www     2008-W01     first week of 2008
    YYYYWwwd     2008W011
    YYYYWww      2008W01
Use day of the year
    YYYY-ddd     2008-014     ddd: day of the year, 001-365 (366 in leap years)
    YYYYddd      2008123

(*) Time

    hh:mm:ss      16:23:42      hh: hour (00-24), mm: minute (00-59), ss: second (00-60)
    hhmmss        162342        colons are omitted
    hh:mm         16:23         seconds are omitted
    hhmm          1623
    hh            16            both minutes and seconds are omitted
    hh:mm:ss.sss  16:23:42.123  fractional seconds for more precision

(*) Time zone

    hh:mm:ssZ    12:23:43Z    'Z' means the time is measured in UTC
    time+hh:mm   13:23+01:00  the time is hh hours and mm minutes ahead of UTC
    time+hhmm    13:23+0100   colon is omitted
    time+hh      13:23+01     the time is hh hours ahead of UTC
    time-hh:mm   11:23-01:00  the time is behind UTC; the zone is west of the zero meridian
    time-hhmm    11:23-0100
    time-hh      11:23-01

(*) Put together

    <date>T<time>          2008-02-12T12:23:34
                           20080212T122334
                           2008-02-12T12:23       (seconds omitted)
    <date>T<time>Z         2008-02-12T12:23:34Z
    <date>T<time>+<zone>   2008-02-12T13:23:34+01
    <date>T<time>-<zone>   2008-02-12T11:23:34-01

Resources:
http://www.cl.cam.ac.uk/~mgk25/iso-time.html
http://en.wikipedia.org/wiki/ISO_8601

(2) RFC 3339
This is a profile of ISO 8601 which defines the date and time formats to be used on the Internet.
The following is an excerpt from the specification:

   date-fullyear   = 4DIGIT
   date-month      = 2DIGIT  ; 01-12
   date-mday       = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                             ; month/year
   time-hour       = 2DIGIT  ; 00-23
   time-minute     = 2DIGIT  ; 00-59
   time-second     = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
                             ; rules
   time-secfrac    = "." 1*DIGIT
   time-numoffset  = ("+" / "-") time-hour ":" time-minute
   time-offset     = "Z" / time-numoffset

   partial-time    = time-hour ":" time-minute ":" time-second
                     [time-secfrac]
   full-date       = date-fullyear "-" date-month "-" date-mday
   full-time       = partial-time time-offset

   date-time       = full-date "T" full-time
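
For producing such timestamps from the shell, GNU date (coreutils) can format the current time; a minimal sketch, assuming a reasonably recent GNU date:

   date -u +"%Y-%m-%dT%H:%M:%SZ"    # e.g. 2008-07-25T14:03:12Z  (UTC, RFC 3339 date-time)
   date +"%Y-%m-%dT%H:%M:%S%:z"     # e.g. 2008-07-25T16:03:12+02:00  (local time with numeric offset)
   date --rfc-3339=seconds          # e.g. 2008-07-25 16:03:12+02:00  (GNU variant: space instead of 'T')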

(3) W3C
W3C also defines a profile of ISO 8601, the "Date and Time Formats" note, which can be accessed at http://www.w3.org/TR/NOTE-datetime.

Tuesday, July 22, 2008

Hex edit on linux

There are several useful tools which can be used to view and edit a binary file in hex mode.
(1) xxd (make a hexdump or do the reverse)
xxd infile        //hex dump. By default, in every line 16 bytes are displayed.
xxd -b infile    //bitdump instead of hexdump
xxd -c 10 infile //in every line 10 bytes are displayed instead of default value 16.
xxd -g 4 infile  //every 4 bytes form a group and groups are separated by whitespace.
xxd -l 100 infile     //just output 100 bytes
xxd -u infile    //use upper case hex letters.
xxd -p infile    //hexdump is displayed in plain format, no line numbers.
xxd -i infile     // output in C include file style.
E.g. output looks like:
unsigned char __1[] = {
  0x74, 0x65, 0x73, 0x74, 0x0a
};
unsigned int __1_len = 5;

xxd -r -p infile //convert from hexdump into binary. This requires a plain hex dump format(without line numbers).
E.g. in the infile, content should look like: 746573740a.
Note: additional whitespace and line breaks are allowed, so 74 65 73 740a is also legal.
xxd -r infile     //similar to last command except line numbers should be specified also.
E.g. in the infile, content should look like: 0000000: 7465 7374 0a.

xxd can easily be used in vim which is described here.
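A minimal sketch of the usual workflow inside vim (plain vim, no plugins assumed):
vim -b somefile    //open the file in binary mode
:%!xxd             //filter the buffer through xxd to get an editable hex dump
:%!xxd -r          //convert the edited dump back to binary before saving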

(2) od(dump files in octal and other formats)
Switches:
    -A  how offsets are printed
    -t  specify output format
        (d: decimal; u: unsigned decimal; o: octal; x: hex; a: named characters; c: ASCII characters or backslash escapes.
        Adding a z suffix to any type adds a display of printable characters to the end of each line of output)
    -v  do not suppress duplicated lines; by default repeated identical lines are replaced by a single '*'.
    -j   skip specified number of bytes first.
    -N  only output specified number of bytes.

Example:
dd bs=512 count=1 if=/dev/hda1 | od -Ax -tx1z -v

(3) hexdump/hd
I have not used this command yet; I may update this part once I get some experience with hexdump.

File system and partition information in Linux/Unix

I would like to summarize some useful commands which can be used to obtain information about partition table and file systems.
Some basic information (model, capacity, driver version...) about a hard disk can be obtained by reading the files under these directories:
/proc/ide/hda, /proc/ide/hdc ... (for IDE equipment)
/proc/scsi/  (for SCSI equipment).
Useful posts:
How to add new hard disks, how to check partitions...?
File system related information
Partitions and Volumes

Partition table:
All partitions of a disk are numbered as follows: 1-4 for primary and extended partitions, 5-16 (15) for logical partitions.
(1) fdisk (from package util-linux)
Partition table manipulator for Linux.
fdisk -l device       //list partition table
fdisk -s partition    //get size of a partition. If the parameter is a device, get capacity of the device.

(2) cfdisk
Curses based disk partition table manipulator for Linux. More user-friendly.
cfdisk -Ps    //print partition table in sector format
cfdisk -Pr    //print partition table in raw format(chunk of hex numbers)
cfdisk -Pt    //print partition table in table format.

(3) sfdisk
List partitions:
sfdisk -s device/partition    //get size of a device or partition
sfdisk -l device                  //list partition table of a device
sfdisk -g device                 //show kernel's idea of geometry
sfdisk -G device                //show geometry guessed based on partition table
sfdisk -d device                //Dump the partitions of a device in a format useful as input to sfdisk.
sfdisk device -O file          //Just  before  writing  the new partition, output the sectors that are going to be overwritten to file
sfdisk device -I file          //restore an old partition table preserved with the -O option.
Check partitions:
sfdisk -V device                 //apply consistency check
It can also modify partition table.
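For example, the -d output can be saved and later fed back to sfdisk to back up and restore a partition table (the device name is just a placeholder; restoring rewrites the table, so double-check the device):

sfdisk -d /dev/sda > sda-partition-table.dump    //back up the partition table
sfdisk /dev/sda < sda-partition-table.dump       //restore it from the dump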

(4) parted
An interactive partition manipulation program. Use the print command to get partition information.

File system:
use "man fs" to get linux file system type descriptions.
/etc/fstab        file system table
/etc/mtab        table of mounted file systems
(1) df (in coreutils)
Report file system disk space usage.
df -Th    //list fs information including file system type in human readable format
(2) fsck (check and repair a Linux file system)
fsck is simply a front-end for the various file system checkers (fsck.fstype) available under  Linux.
(3) mkfs (used to build a Linux file system on a device, usually a hard disk partition.)
It is a front-end to the various file-system-specific builders (mkfs.fstype). Back-end builders include:
mkdosfs,   mke2fs,   mkfs.bfs,   mkfs.ext2,   mkfs.ext3,   mkfs.minix, mkfs.msdos, mkfs.vfat, mkfs.xfs, mkfs.xiafs.
(4) badblocks - search a device for bad blocks
(5) mount
(6) umount

Ext2/Ext3
(1) dumpe2fs (from package e2fsprogs, for ext2/ext3 file systems)
Prints the super block and block group information for the filesystem.
dumpe2fs -h /dev/your_device      //get superblock information
dumpe2fs /dev/your_device | grep -i superblock      //get backups of superblock.
(2) debugfs (debug a file system)
Users can open and close a file system, link/unlink files, create files, and so on.
(3) e2fsck (check a Linux ext2/ext3 file system)
This tool is mainly used to repair a file system.
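If the primary superblock is damaged, e2fsck can be pointed at one of the backups reported by dumpe2fs; a small sketch (the device name and block number are placeholders):

dumpe2fs /dev/your_device | grep -i superblock    //locate backup superblocks, e.g. one at block 32768
e2fsck -b 32768 /dev/your_device                  //check/repair using that backup superblock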

Bash cheat sheet

Part of the content in this post is from other web sites, mainly from the Bash manual. Because the formal documents and introductory books are long and very detailed, I just list some key points and usage examples so that later I can quickly recall a bash feature by reading this post instead of the whole manual.

[Bash switch]
First, get a list of installed shells: chsh -l or cat /etc/shells.
Then use chsh -s to change your default shell. It modifies the file /etc/passwd to reflect your change of shell preference.

[Output]
[echo] Output the arguments. If -n is given, trailing newline is suppressed. If -e is given, interpretation of escape characters is turned on.
[Output redirection]
> output redirection
"Code can be added to a program to figure out when its output is being redirected. Then, the program can behave differently in those two cases—and that’s what ls is doing." This means what you see on screen may be different from content of the file to which you redirect the output. For command ls, use ls -C>file.
command 1> file1 2> file2
1: stdout; 2: stderr; 1 is the default number.
command >& file
command &> file
command > file 2>&1

These three commands redirect stdout and stderr to the same file.
The tee command writes the output to the filename given as its parameter and also writes that same output to standard out.
>> append
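
A few concrete lines tying these together (the file names are arbitrary):

ls -C > listing.txt             # stdout to a file, forcing column output as noted above
make > build.log 2> build.err   # stdout and stderr to separate files
make > build.log 2>&1           # both streams into one file
make 2>&1 | tee build.log       # watch the output and keep a copy at the same time
echo "extra line" >> build.log  # append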

[Input]
< input
<< EOF here-document.
"all lines of the here-document are subjected to parameter expansion, command substitution, and arithmetic expansion.”
<< \EOF or << 'EOF' or <<E\OF
When we escape some or all of the characters of the EOF, bash knows not to do the expansions.
<<-EOF
The hyphen just after the << is enough to tell bash to ignore the leading tab characters so that you can indent here-document. This is for tab characters only and not arbitrary white space.
use read to get input from user.
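
A small sketch combining here-documents and read (the variable name is arbitrary):

cat <<EOF
This here-document is expanded: HOME is $HOME.
EOF

cat <<'EOF'
This one is not expanded: $HOME stays literal.
EOF

read -p "Your name: " name    # prompt and read one line from the user
echo "Hello, ${name}"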

[Command execution]
get exit status of last executed command using $? (0-255).
(1) command1 & command2 & command3       //run the commands concurrently (each in the background)
(2) command1 && command2 && command3  //run each command only if the previous one succeeded
(3) command1 || command2                          //run command2 only if command1 failed
(4) command_name&
If a command is terminated by the control operator ‘&’, the shell executes the command asynchronously in a subshell. The shell does not wait for the command to finish, and the return status is 0 (true)
(5) nohup - run a command immune to hangups, with output to a non-tty
(6) command1; command2;
Commands separated by a ‘;’ are executed sequentially; the shell waits for each command to terminate in turn. The return status is the exit status of the last command executed.
(7) $( cmd ) or `cmd`: execute the command and return the output.
(8) $(( arithmetic_op )): do arithmetic operation and return the output.
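
A short sketch exercising several of these at once (the paths and commands are arbitrary):

mkdir -p /tmp/demo && cd /tmp/demo || echo "could not enter /tmp/demo"
grep -q root /etc/passwd
echo "grep exit status: $?"                      # 0 means the pattern was found
entries=$(ls /etc | wc -l)                       # command substitution
echo "half of $entries is $(( entries / 2 ))"    # arithmetic expansion
sleep 10 &                                       # run asynchronously; the shell does not wait
echo "background job pid: $!"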

[Pipe]
Each command in a pipeline is executed in its own subshell.

[Group commands]
{ command1; command2; } > file            //use current shell environment
( command1; command2; ) >file               //use a subshell to execute grouped commands

[Variables]
Array variable: varname=(value1 value2); use ${varname[index]} to access an element. (Elements are separated by whitespace, not commas.)
The type of every variable is string; bash operators can sometimes treat the content as other types.
Assignment: VARNAME=VALUE    //Note: there is no whitespace around the equal sign. Otherwise the bash
                                              //interpreter cannot tell a variable assignment from a command, and
                                              //will treat VARNAME as a command name.
Get value: $VARNAME    //the dollar sign is a must-have.
Use full syntax(braces) to separate variable name from surrounding text:
E.g. ${VAR}notvarname
export can be used to export a variable to environment so that it can be used by other scripts.
export VAR=value
Exported variables are passed by value: changing the variable's value in the called script does not affect the original value in the calling script.
Use command set to see all defined variables in the environment
Use command env to see all exported variables in the environment.
${varname:-defaultvalue}: if variable varname is set and not empty, return its value; else return defaultvalue.
${varname:=defaultvalue}: similar to the previous one except that defaultvalue is also assigned to varname if the variable is empty or unset. varname cannot be a positional or special parameter (1, 2, ..., *).
${varname=defaultvalue}: the assignment happens only when varname is unset.
${varname:?message}: if varname is unset or empty, print the message to standard error and exit (in a non-interactive shell).
Parameters passed to a script can be retrieved by accessing special variables: ${1},${2}...
Use variable ${#} to get number of parameters.
Command shift can be used to shift the parameters: The positional parameters from $N+1 ... are renamed to $1 ...  If N is not given, it is assumed to be 1.
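
A tiny script pulling these parameter features together (the script name greet.sh and the GREETING variable are made up for illustration):

#!/bin/bash
# usage: ./greet.sh name [more names...]
greeting=${GREETING:-Hello}    # environment variable with a default value
echo "$# name(s) given"
while [ $# -gt 0 ]; do
    echo "${greeting}, ${1}"
    shift                      # drop $1 and renumber the remaining parameters
done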

[Special variables]
Variable $* can be used to get all parameters. If a certain parameter contains blank, use "$@".
E.g. command "param1" "second param"
value of $* or $@: param1 second param
value of "$@": "param1" "second param"
value of "$*": "param1 second param"

Always use "$@" if possible! And always use "${varname}"(include double quotes) to access value of the variable.
If a parameter contains blanks, use double quotes to retrieve the value:
ls "${varname}"   instead of   ls ${varname}.
${#}: Expands to the number of positional parameters in decimal.
${-}: (A hyphen) Expands to the current option flags as specified upon invocation, by the set builtin command, or those set by the shell itself (such as the -i option).
${$}: Expands to the process id of the shell. In a () subshell, it expands to the process id of the invoking shell, not the subshell.
${!}: Expands to the process id of the most recently executed background (asynchronous) command.
${0}: Expands to the name of the shell or shell script. This is set at shell initialization.

[Misc]
(1) ${#var}: return length of the string
# character denotes start of a comment.
(2) Long comment: use "do nothing(:)" and here-document.
E.g.
:<<EOF
doc goes here
EOF

[Quotation]
Unquoted text and double-quoted text are subject to shell expansion.
(1) In double quoted text, '\' can be used to escape special characters.  Backslashes preceding characters without a special meaning are left unmodified.
(2) Single-quoted text is not subject to shell expansion, so escape sequences do not work inside it. The text 'use \' in single quoted text' is therefore incorrect: a single quote cannot appear inside single-quoted text, not even with a backslash. You can only escape a single quote outside of the surrounding single quotes.
(3) Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. The expanded result is single-quoted, as if the dollar sign had not been present.
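
A few lines illustrating the three cases (the strings themselves are arbitrary):

echo "expanded: $HOME, literal dollar: \$HOME"    # double quotes: expansion, backslash escapes work
echo 'no expansion here: $HOME'                   # single quotes: everything is literal
echo 'it'\''s done like this'                     # embed a single quote by closing, escaping, reopening
echo $'tab:\there'                                # $'...' interprets ANSI C escapes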

[Function/Command scope]
If you want to force use of an external command instead of a built-in command, use enable -n (which disables the builtin). Or you can use the command builtin, which bypasses shell functions.
E.g. enable -n built-in-function
       command command_name

[Control structures and more]
(1) until test-command; do statements; done
(2) while test-command; do statements; done
(3) for var in words; do statements; done
(4) for ((expr1; expr2; expr3 )); do statements; done
(5) if test-commands; then statements;
     elif test_commands2; then statements;
     else statements;
     fi
(6) case word in
     pattern) statements;;
     pattern2|pattern3) statements;;
     *) statements;;
     esac
(7) select var in words; do statements; done
The commands are executed after each selection until a break command is executed, at which point the select command completes.
(8) (( arithmetic_expression))
If the value of the expression is non-zero, return status is 0; otherwise return status is 1. This is equivalent to
let "expression".
(9) [[ expression ]]
Conditional expression evaluation; unlike [ ], no word splitting or filename expansion is performed between [[ and ]], and pattern/regex matching (== and =~) is available.
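
A compact sketch touching a few of these forms (the file glob and variable names are arbitrary):

for f in /etc/*.conf; do
    if [[ -r $f ]]; then          # [[ ]]: no word splitting, so $f needs no quotes here
        echo "readable: $f"
    fi
done

i=0
while (( i < 3 )); do             # arithmetic test, equivalent to let "i < 3"
    echo "iteration $i"
    (( i++ ))
done

case "$TERM" in
    xterm*|rxvt*) echo "an X terminal" ;;
    *)            echo "something else" ;;
esac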

[Shell expansions]
(1) Brace expansion (note: ${...} is not expanded)
a{b,c,d}e  => abe ace ade
(2) Tilde expansion
~: value of $HOME
~username: home directory of user username.
~+: $PWD
~-: $OLDPWD
~N: equivalent to 'dirs +N'
~-N: equivalent to 'dirs -N'
(3) Parameter expansion
(4) Command substitution
$(command) or `command`: executes the command and replaces the command substitution with the standard output of the command, with any trailing newlines deleted.
(5) Arithmetic expansion
$(( expression )): allows the evaluation of an arithmetic expression and substitution of the result.
(6) Process substitution
Actually I don't understand it much; <(command) and >(command) let a command's output or input be used where a file name is expected (see the sketch after this list).
(7) Word splitting
"The shell scans the results of parameter expansion, command substitution, and arithmetic expansion that did not occur within double quotes for word splitting."
(8) Filename expansion
"After word splitting, unless the -f option has been set (see The Set Builtin), Bash scans each word for the characters ‘*’, ‘?’, and ‘[’. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of file names matching the pattern."

[Useful commands]
(*) customize the shell prompt

set environment variable "PS1"
(*) How to find commands?
Try following commands:
[type]
    a bash command. Searches files in the PATH, aliases, keywords, functions, built-in commands...
[which]
    displays the full path of the executables that would have been executed when this argument had been entered at the shell prompt(just searches files in $PATH).
[apropos]
    search the whatis database for strings(same as man -k)
[locate, slocate]
    reads  one or more databases prepared by updatedb and writes file names matching at least one of the PATTERNs to standard output. Location of the actual database varies from system to system. 
[whereis]
    locate the binary, source, and manual page files for a command. An entry is displayed only when the whole searched word is matched(not a substring of a long word).
[find]
    search for files in a directory hierarchy.
(*) Info about a command
[man]:
    display manual of the command;
[help]:
    display information about bash built-in commands. For a bash built-in command, if you use the man command to display its information, you will get the large bash manual instead.
[info]:
    display Info doc of an arbitrary command.
(*) Get information about a file
[ls]
    List information about the FILEs (the current directory by default). You can either list directory contents or get information about a file (a regular file, not a directory).
E.g.
    ls -a .*  //list all files whose names start with a dot? No! Bash expands the pattern .* first, so
                // .. is included, which causes the contents of the parent directory to be listed as well.
                // Use echo to see what the pattern is expanded to, e.g. echo .*
    ls -d .*  //the -d switch forces ls to list the directory entries themselves instead of their contents.
[stat]
    Display file or file system status.
[find]
    E.g. find /path/ -name file_name -printf '%m %u %t'
[file]
    determine file type. There are three sets of tests, performed in this order: filesystem tests, magic number tests, and language tests.

Saturday, July 19, 2008

Send email in PHP

On Linux, a tool called sendmail can be used to send email; you don't need an SMTP server to send it out. An interesting quirk of sendmail is that some common email fields cannot be set easily. For example, you cannot set the subject directly on the command line with a switch; instead the following syntax is used:

sendmail destination@domain.com
Subject: This is the subject.
Main body goes here
.
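
The same message can be driven from a script with a here-document; a small sketch assuming the usual /usr/sbin/sendmail path and the -t switch (take the recipients from the headers):

/usr/sbin/sendmail -t <<EOF
To: destination@domain.com
Subject: This is the subject.

Main body goes here
EOF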

In PHP, sending email is easy because PHP's mail() function relies on sendmail. See this page for instructions. Main configuration options in php.ini:

Mail configuration options:

Name            Default                       Changeable       Changelog
SMTP            "localhost"                   PHP_INI_ALL
smtp_port       "25"                          PHP_INI_ALL      Available since PHP 4.3.0.
sendmail_from   NULL                          PHP_INI_ALL
sendmail_path   "/usr/sbin/sendmail -t -i"    PHP_INI_SYSTEM

Friday, July 11, 2008

Linux/Unix Terminal

Terminfo database
First, we need a database that describes the capabilities of terminals. There are two such databases: termcap and terminfo; the latter is more widely used today. You can create a new description file and edit it as plain text, then use a tool called tic to compile the text source into the binary format used by other programs. The terminfo database is generally located under /usr/share/terminfo.
Some useful information can be accessed here and here. From there, you can also download the latest terminfo database, which contains entries for many kinds of terminals.
Single UNIX Specification contains a section defining terminfo source format which can be accessed here.

Text terminal
Low-level terminal control:
ioctls: gives very low-level access.
termios: "The termios functions describe a general terminal interface that is provided to control asynchronous communications ports."(From manpage).

High-level wrapper libraries:
The low-level APIs (ioctls/termios) don't hide the complexity of terminal programming, and programs written with them depend on a specific terminal type. Some wrapper libraries exist which provide a portable interface to many terminal types.
(1) ncurses (http://invisible-island.net/ncurses/ncurses.html)
Allows the programmer to write a TUI in a terminal-independent manner. It also performs optimizations to reduce latency when using remote shells. Written in C.
Moreover, another package called Curses Development Kit(CDK) provides more functionalities. Mainly, it provides a library of curses widgets which can be used and customized in your programs.
Here is a good tutorial about how to use Ncurses in C.
ncurses consults $TERMINFO first, then $HOME/.terminfo, before falling back to the standard location.
(2) shell curses(http://www.mtxia.com/css/Downloads/Scripts/Korn/Functions/shcurses/)
It enables shell programmers to do TUI programming easily. It worked originally in Korn shell. I am not sure whether it has been ported to other shells.
(3) S-Lang

GUI
Currently, GUI exists almost everywhere in the computer world. Pure text terminal is not used very often. X window system is the big player in linux/unix. As a result, programmers can provide a GUI to the customers instead of TUI.
Some interesting articles: State of Linux graphics, various Linux/Unix desktops, X window resource/link collection.

Pseudo terminal/Terminal emulator
With the near-disappearance of traditional serial-port terminals (a few may still be in use), terminal-control code was rewritten to promote reuse. Today "terminal" mostly describes devices/programs that send requests to a master side and handle its responses; no physical serial port is involved, and most of the time we use remote shells over the network. In the /dev/ directory there are corresponding device files for these pseudo terminals, with names such as pty*, pts*, ...

Some useful commands:
(*) tic - the terminfo entry-description compiler.
translates  a terminfo file from source format into compiled format.  The compiled format is necessary for use with the library routines in ncurses.
(*) captoinfo (alias for tic) - convert a termcap description into a terminfo description
    infotocap (alias for tic) - convert a terminfo description into a termcap description
(*) screen: The screen program allows you to run multiple virtual terminals, each with its own interactive shell, on a single physical terminal or terminal emulation window.
/******* Set/Rest terminal ******/
(*) tset : terminal initialization
(*) setterm: set terminal attributes.
    setterm  writes  to  standard  output  a character string that will invoke the specified terminal capabilities. Where possible terminfo is consulted to find the string to use. The attribute names used in this command are different from cap names in terminfo database!!!
(*) stty - prints or changes terminal characteristics, such as baud rate.(related to line settings)
/****** Query  ********/
(*) infocmp - compare or print out terminfo descriptions. It can be used to print all capabilities.
(*) tput : query terminfo database. Query a specific capability.
e.g. tput setaf 1   //set the foreground color (here to color 1)
       tput cols      //get the number of columns
       tput longname  //get the long-name description of the current terminal.
(*)  toe - table of (terminfo) entries. With  no  options,  toe  lists  all available terminal types by primary name with descriptions.

How to know your current terminal type?
tset -q
infocmp
How to tell Linux the type of your terminal?
Usually you must state it explicitly by setting the environment variable $TERM.
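
A small sketch of using these commands from a script (color number 1 is conventionally red, but the exact mapping depends on the terminal):

cols=$(tput cols)                                    # current terminal width
echo "this terminal (\$TERM=$TERM) is ${cols} columns wide"
tput setaf 1; echo "a message in color"; tput sgr0   # set foreground color, print, reset attributes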

Thursday, July 10, 2008

Unicode

Frequently I am using UTF-8, GB2312, ASCII... I understand ASCII encoding well because it is simple. For Unicode, I have been confused by the various terms (e.g. UTF-8, UTF-16, UTF-32, UCS-2, UCS-4...). Now I have done some research and summarize what I have learnt.

Resources:
Official site: http://www.unicode.org/
Unicode Standard (version 5): http://www.unicode.org/versions/Unicode5.0.0/bookmarks.html
Unicode FAQ: http://unicode.org/faq/
FAQ of various encoding forms: http://unicode.org/faq/utf_bom.html
Online tools:
An online Unicode conversion tool: http://rishida.net/scripts/uniview/conversion.php (Display various encoding representations of what you input.)
Unihan charset database: http://www.unicode.org/charts/unihan.html (You can query by Unicode code point.)
Another good tool: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%26%2320320%3B&mode=char

Windows XP provides a useful tool called "Character Map": Start -> All Programs -> Accessories -> System Tools -> Character Map.
In linux/Unix, iconv is a powerful tool which can be used to convert between different character encodings. (It seems BOM described below is not handled by iconv. So you should not include BOM in the source file.)
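Typical iconv usage (the file names and encodings are just examples):

iconv -f GB2312 -t UTF-8 input.txt > output.txt    //convert GB2312 text to UTF-8
iconv -l                                           //list the supported encodings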

Two entities:
Unicode consortium
ISO/IEC JTC1/SC2/WG2
Good news is that these two entities are well synchronized and a standard from one entity is aligned to corresponding standard from the other entity.
Unicode 5 is synchronized with Amd2 to ISO/IEC 10646:2003 plus Sindhi additions.

Encoding forms
The functionality of an encoding form is to map every Unicode code point to a unique byte sequence. The arrangement of Unicode code points stays the same; only the encoding forms differ.
ISO/IEC 10646 defines 4 forms of encoding of universal character set: UCS-4, UCS-2, UTF-8 and UTF-16. 
Code unit: the encoding of every character consists of an integral number of code units. For example, the UTF-16 code unit is 16 bits, which means the length of an encoded character is 16 bits or 32 bits.
(1) UCS-4/UTF-32
Currently they are almost identical. Exactly 32 bits are used for each code point, so it is a fixed-length encoding.
"This single 4-byte-long code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character. "
(2) UCS-2/UTF-16
Code unit is 16 bits. Commonly used characters usually can be encoded in 1 code unit (16bits).
From wikipedia: "UTF-16: For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding will result in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use."
From wikipedia: "UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which is a predecessor to UTF-16. The UCS-2 encoding form is nearly identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value."
In a word, UCS-2 is a fixed-length encoding which can encode characters in BMP range U+0000 through U+FFFF while UTF-16 is variable length encoding which supports characters in other planes by using surrogate pair.
Note: the two values FFFE and FFFF (hex), as well as the 32 values from FDD0 to FDEF (hex), represent noncharacters. They are invalid in interchange, but may be freely used internal to an implementation. Unpaired surrogates are invalid as well, i.e. any value in the range D800 to DBFF (hex) not followed by a value in the range DC00 to DFFF (hex).
(3) UTF-8
Code unit is 8 bits. It encodes one code point in one to four octets.
From wikipedia: "It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages"
Note: the non-ASCII part of UTF-8 is NOT compatible with Latin-1 (ISO-8859-1).

Byte Order
For UTF-16 and UTF-32, code unit is more than 1 byte. A natural problem is the byte order, big endian(MSB first) or little endian(MSB last)? For UTF-8, this problem does not exist because code unit is 1 byte. To solve this problem, Byte Order Mark(BOM) can be used. Actually, BOM does not only indicate the byte order but also define encoding form.
BOM table: From http://unicode.org/faq/utf_bom.html:

Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
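
A quick way to inspect or produce a BOM with the tools mentioned earlier (the file names are arbitrary):

head -c 4 file.txt | xxd           //show the first bytes; ef bb bf means a UTF-8 BOM
printf '\xef\xbb\xbf' > bom.txt    //write a UTF-8 BOM by hand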

Here are some guidelines to follow(From http://unicode.org/faq/utf_bom.html):

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,
    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided. For example, bash file starting with #!/bin/sh.
  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

Summary:

From http://unicode.org/faq/utf_bom.html:

Name                      UTF-8    UTF-16    UTF-16BE     UTF-16LE       UTF-32    UTF-32BE     UTF-32LE
Smallest code point       0000     0000      0000         0000           0000      0000         0000
Largest code point        10FFFF   10FFFF    10FFFF       10FFFF         10FFFF    10FFFF       10FFFF
Code unit size            8 bits   16 bits   16 bits      16 bits        32 bits   32 bits      32 bits
Byte order                N/A      <BOM>     big-endian   little-endian  <BOM>     big-endian   little-endian
Minimal bytes/character   1        2         2            2              4         4            4
Maximal bytes/character   4        4         4            4              4         4            4

Note: all valid code points that are encoded are the same: from U+0000 through U+10FFFF.

More:
None of the UTFs can generate every arbitrary byte sequence. In other words, not every 4-byte-long byte sequence represents  a legal coding in UCS4.
From Unicode consortium site: "Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping  must also map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 noncharacters (including FFFE and FFFF), as well as unpaired surrogates."
This means that: to guarantee reversibility, not only valid characters but also non-valid characters must be considered and encoded appropriately.

How to fit a Unicode character into an ASCII stream?
See http://unicode.org/faq/utf_bom.html#31. Several methods are used in practice: (1) UTF-8; (2) '\uXXXX' escapes in C or Java; (3) "&#XXXX;" numeric character references in HTML or XML.

Sunday, July 06, 2008

SVG tools

Official Resources:
http://www.svgi.org/
An almost comprehensive list of vector drawing programs(actually some 3d software is included as well):
http://www.maa.org/editorial/mathgames/mathgames_08_01_05.html

Free Editors:
Inkscape (the most popular open-source one): besides SVG, it can save files as .ps, .eps, ... It also lets users export to .png (only .png, not .jpeg or .gif).
yED: a good graph editor which can automatically beautify line layout in the graph.

Standalone Viewer:
Batik Squiggle(pure Java): zoom in/out, transform... Can export .svg to .png, .jpeg and .gif.

Converter:
ImageMagick: a powerful command-line tool which provides conversion, editing and composition. .svg can be converted to .jpg, .gif, .eps, ...
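Typical command-line conversions with ImageMagick's convert (the file names are arbitrary):

convert drawing.svg drawing.png             //rasterize an SVG to PNG
convert drawing.svg drawing.eps             //convert an SVG to EPS
convert photo.jpg -resize 50% small.jpg     //resize while converting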

Conversion of .svg to .eps: Inkscape provides a better result than ImageMagick.

Thursday, July 03, 2008

Useful free graphic software

Some useful image processing software:
Of course, Photoshop is good but not free. Some free alternatives exist. Here is a list of some excellent free image processing programs: GIMP, Paint.NET, Photobie, PhotoFiltre. This site contains more software.

With the popularity of Web 2.0 apps, many online image processing apps have emerged. Here are some I know of:
http://fotoflexer.com/
http://www.picnik.com/
https://www.photoshop.com/express/landing.html
http://www.cellsea.com/media/index.htm
http://www.creatingonline.com/Online_Image_Editor/index.html
http://www.phixr.com/photo/userindex
http://www.lunapic.com/editor/
http://www.picresize.com/

Online Image resizer/optimizer:
http://webresizer.com/
http://www.resize2mail.com/
http://www.shrinkpictures.com/
http://www.resizr.com/
http://resizr.lord-lance.com/ (only .gif is accepted)
http://www.resizeimage.4u2ges.com/
http://www.picresize.com/
http://online-image-resize.kategorie.cz/
http://rsizr.com/ (crop and resize)
http://www.freeimageconverter.com/
Standalone software:
http://www.imageresizer.com/
http://bluefive.pair.com/pixresizer.htm
http://www.imageconverterplus.com/ (seems a good one, but not free)
http://adionsoft.net/fastimageresize/
http://www.microsoft.com/windowsxp/downloads/powertoys/xppowertoys.mspx (MS image resizer power toy)

Vector Graphic Tools:
Inkscape, Dia, Synfig(Animation software), yED(good line layout), Sodipodi(not under active development, continue on Inkscape), Xara Xtreme for Linux, Zcubes(online vector drawing tool), Skencil(only on Unix/Linux)
Sketsa(SVG editor, not free)

Free desktop publish tool:
Scribus

Image viewer:
irfanview, xnview

Tuesday, July 01, 2008

"Hello World" Programs in different languages

Every time we learn a new programming language, the first program is always one that outputs "Hello World".
Someone has collected "Hello World" programs written in hundreds of different languages. Check out: http://www.ntecs.de/old-hp/uu9r/lang/html/lang-all.en.html