Wednesday, January 21, 2009

Uniform Resource Identifier (URI) and Uniform Resource Locator (URL) - (RFC 1738 and RFC 3986)

Character Escape
A URI consists of a sequence of characters, each of which falls into one of three classes:
uric = reserved | unreserved | escaped

Reserved characters
Principle: a character is reserved if the semantics of the URI change when the character is replaced with its escaped encoding.
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="


Their usage within a URI component is limited to their reserved purpose. If data would conflict with the reserved purpose, it must be escaped.
Characters in the "reserved" set are NOT reserved in all contexts. The set of characters that is actually reserved within a URI component is defined by that specific component.
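
To illustrate the principle, here is a minimal Python sketch, assuming Python 3 and the standard urllib.parse module. The helper path_segments and the sample paths are made up for this example. It shows that "/" used as a delimiter and "/" used as data (escaped as %2F) lead to different path segments.

```python
from urllib.parse import quote, unquote

def path_segments(path):
    """Split a path on the reserved "/" first, then unescape each segment."""
    return [unquote(segment) for segment in path.split("/")]

# "/" used for its reserved purpose: it separates three segments.
literal = "/dir/a/b"

# "/" used as data inside a single segment: it has to be escaped.
escaped = "/dir/" + quote("a/b", safe="")

print(escaped)                  # /dir/a%2Fb
print(path_segments(literal))   # ['', 'dir', 'a', 'b']
print(path_segments(escaped))   # ['', 'dir', 'a/b']
```

The order matters: splitting on the delimiter has to happen before unescaping, otherwise the escaped "/" would be mistaken for a separator.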

Unreserved characters
a-z A-Z 0-9 "-" "_" "." "~"
These characters can be escaped WITHOUT changing the semantics of the URI.
But this should NOT be done UNLESS the escape is necessary.

Disallowed characters
Some characters are disallowed for various reasons. To appear in a URI, they MUST be escaped.
Disallowed US-ASCII characters (as classified in RFC 2396; RFC 3986 later moved "[" and "]" into gen-delims):
control:      <US-ASCII coded characters 0x00-0x1F and 0x7F>
space:        <US-ASCII coded character 0x20>
delimiters:  < > # % "
unwise:       { } | \ ^ [ ] `

When to escape?
When a character does not have a representation using an unreserved character, it must be escaped. This includes:
(1) data that does not correspond to printable characters (US-ASCII coding)
(2) disallowed characters
Note: here, whether a character is unreserved is context-specific.

Escape sequences:
A "%" followed by the two-digit hexadecimal representation of the character's octet.
escaped = "%" hex hex
E.g. %20 (space), %35 (the digit "5")
Uppercase hexadecimal digits should be used in percent-encoding!
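
The escaping rule can be sketched in a few lines of Python. This is only an illustration under assumptions: percent_encode is a hypothetical helper, not a library function, and it treats its whole input as data, so it escapes reserved characters as well as disallowed ones. In real code, urllib.parse.quote does a similar job.

```python
import string
from urllib.parse import unquote

# Unreserved characters as listed in this post.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")

def percent_encode(data: bytes) -> str:
    """Escape every octet that is not an unreserved character."""
    return "".join(
        chr(octet) if chr(octet) in UNRESERVED else "%{:02X}".format(octet)
        for octet in data
    )

print(percent_encode(b"a b"))     # a%20b  (space is disallowed)
print(percent_encode(b"{x}|y"))   # %7Bx%7D%7Cy  ("unwise" characters)
print(unquote("%35"))             # 5
```

The {:02X} format produces uppercase hexadecimal digits, in line with the recommendation above.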

Syntax
Generic URI syntax:
 
    <scheme>:<scheme-specific-part>
The interpretation of the scheme-specific-part depends on the scheme.
Hierarchical schemes share the following form:
    <scheme>://<authority><path>?<query>
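
As a quick check of the generic syntax, Python's standard urllib.parse.urlsplit breaks a URI into these components. This is a small sketch, and the example URIs are invented for illustration.

```python
from urllib.parse import urlsplit

# Hierarchical form: <scheme>://<authority><path>?<query>
parts = urlsplit("http://www.example.com:8080/docs/index.html?lang=en")
print(parts.scheme)   # http
print(parts.netloc)   # www.example.com:8080  (the authority)
print(parts.path)     # /docs/index.html
print(parts.query)    # lang=en

# Plain <scheme>:<scheme-specific-part> form: no authority is present.
opaque = urlsplit("mailto:someone@example.com")
print(opaque.scheme)  # mailto
print(opaque.path)    # someone@example.com
```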

scheme
    alpha *( alpha | digit | "+" | "-" | "." )
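
The scheme rule translates almost directly into a regular expression. Here is a minimal Python sketch; the names SCHEME_RE and is_valid_scheme are my own, chosen for this example.

```python
import re

# alpha *( alpha | digit | "+" | "-" | "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

def is_valid_scheme(name: str) -> bool:
    """Return True if the whole string matches the scheme rule."""
    return SCHEME_RE.fullmatch(name) is not None

print(is_valid_scheme("http"))      # True
print(is_valid_scheme("svn+ssh"))   # True
print(is_valid_scheme("1http"))     # False: must start with a letter
```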

authority
The authority component can be an Internet-based server or a scheme-specific registry of naming authorities.
authority (server-based) = [ userinfo "@" ] host [ ":" port ]
userinfo = *( unreserved | escaped | ";" | ":" | "&" | "=" | "+" | "$" | "," )
About domain labels (from RFC 2396):

"The rightmost domain label of a fully qualified domain name will never start with a digit, thus syntactically distinguishing domain names from IPv4 addresses, and may be followed by a single "." if it is necessary to distinguish between the complete domain name and any local domain."
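
A short Python sketch of the two points above: splitting a server-based authority with the standard urllib.parse.urlsplit, and applying the "rightmost label" rule to tell a domain name from an IPv4 address. The helpers rightmost_label and looks_like_domain_name are hypothetical names for this example, and the check is deliberately simplified (it does not validate the host otherwise).

```python
from urllib.parse import urlsplit

def rightmost_label(host: str) -> str:
    """Return the last domain label, ignoring one optional trailing "."."""
    if host.endswith("."):
        host = host[:-1]
    return host.split(".")[-1]

def looks_like_domain_name(host: str) -> bool:
    """A rightmost label that does not start with a digit means a domain name."""
    label = rightmost_label(host)
    return bool(label) and not label[0].isdigit()

parts = urlsplit("ftp://alice@ftp.example.com:2121/pub")
print(parts.username)   # alice
print(parts.hostname)   # ftp.example.com
print(parts.port)       # 2121

print(looks_like_domain_name("ftp.example.com."))   # True
print(looks_like_domain_name("192.168.0.1"))        # False
```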

Query
query = *uric
Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved.
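
Since "&" and "=" are reserved here, they have to be escaped whenever they occur inside the data itself. Here is a small Python sketch using the standard urllib.parse functions. It assumes the common key=value&key=value convention, which comes from HTML forms rather than from the generic syntax, and the parameter names are made up.

```python
from urllib.parse import urlencode, parse_qsl

# Both values contain characters that are reserved within a query.
params = [("title", "R&D"), ("expr", "a=b")]

query = urlencode(params)
print(query)              # title=R%26D&expr=a%3Db

# Decoding splits on the unescaped delimiters, then unescapes each value.
print(parse_qsl(query))   # [('title', 'R&D'), ('expr', 'a=b')]
```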

Fragment
In RFC 2396, the fragment is not part of a URI proper, but it is often used in conjunction with one. (RFC 3986 includes the fragment in the generic URI syntax.)
URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]

from RFC 2396:
"The semantics of a fragment identifier is a property of the data resulting from a retrieval action, regardless of the type of URI used in the reference.
   A fragment identifier is only meaningful when a URI reference is intended for retrieval and the result of that retrieval is a document for which the identified fragment is consistently defined."
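
In practice, a client strips the fragment before retrieval and applies it to the retrieved document afterwards. Python's standard urllib.parse.urldefrag performs the split; the URI below is only an example.

```python
from urllib.parse import urldefrag

uri, fragment = urldefrag("http://www.example.com/doc.html#section2")

print(uri)        # http://www.example.com/doc.html  (what gets retrieved)
print(fragment)   # section2  (interpreted against the retrieved document)
```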

Relative URI reference
to be continued in the future.

Specific schemes

Each entry below gives the scheme, its syntax, and then the explanation and notes.

file
    file://<host>/<path>
Access a file on a specific host.
<host> can be "localhost" or empty to indicate local host. E.g. file:///usr/home
Unlike http and ftp, it does not specify an internet protocol to access the files.

ftp
    ftp://<host>:<port>/<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>
<cwd1> through <cwdN> are strings and <typecode> can be "a", "i" or "d". If <typecode> is "d", <name> is used as the argument of NLIST command. Within the <name> or a CWD component, / and ; must be escaped. E.g. ftp://test.com/%2Froot/a.txt

mailto
    mailto:<mail-address>
RFC 2822 specifies the format of internet messages. Usually, "%" must be escaped.

http
    http://<host>:<port>/<path>?<query>
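
The ftp escaping rule ("/" and ";" must be escaped within <name> or a CWD component) can be sketched in Python as follows. ftp_url is a hypothetical helper written for this post, not part of any library. It ignores the port and user information for brevity.

```python
from urllib.parse import quote

def ftp_url(host, cwds, name, typecode=None):
    """Build ftp://<host>/<cwd1>/.../<cwdN>/<name>[;type=<typecode>]."""
    # safe="" makes quote() escape "/" and ";" inside each component.
    components = [quote(cwd, safe="") for cwd in cwds]
    components.append(quote(name, safe=""))
    url = "ftp://" + host + "/" + "/".join(components)
    if typecode is not None:
        url += ";type=" + typecode
    return url

# A CWD component of "/root" keeps its slash as data.
print(ftp_url("test.com", ["/root"], "a.txt"))
# ftp://test.com/%2Froot/a.txt

print(ftp_url("test.com", ["pub"], "a;b.txt", "i"))
# ftp://test.com/pub/a%3Bb.txt;type=i
```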
   

Resources
URI working group: http://labs.apache.org/webarch/uri/