Strings and characters

Strings are the basic communication medium for Unix processes, so a Unix programming environment must have reasonable facilities for manipulating them. Scsh provides a powerful set of procedures for processing strings and characters. Besides the facilities described in this chapter, scsh also provides

5.1  Manipulating file names

These procedures do not access the file-system at all; they merely operate on file-name strings. Much of this structure is patterned after the GNU Emacs design. Perhaps a more sophisticated system would be better, something like the pathname abstractions of Common Lisp or MIT Scheme. However, since scsh is Unix-specific, we can be a little less general.

5.1.1  Terminology

These procedures carefully adhere to the POSIX standard for file-name resolution, which occasionally entails some slightly odd things. This section will describe these rules, and give some basic terminology.

A file-name is either the file-system root (``/''), or a series of slash-terminated directory components, followed by a file component. Root is the only file-name that may end in slash. Some examples:

File name           Dir components      File component
src/des/main.c      ("src" "des")       "main.c"
/src/des/main.c     ("" "src" "des")    "main.c"
main.c              ()                  "main.c"

Note that the relative filename src/des/main.c and the absolute filename /src/des/main.c are distinguished by the presence of the root component "" in the absolute path.

Multiple embedded slashes within a path have the same meaning as a single slash. More than two leading slashes have the same meaning as a single leading slash: they indicate that the file-name is an absolute one, with the path leading from root. However, POSIX permits the OS to give special meaning to exactly two leading slashes. For this reason, the routines in this section do not simplify two leading slashes to a single slash.

A file-name in directory form is either a file-name terminated by a slash, e.g., ``/src/des/'', or the empty string, ``''. The empty string corresponds to the current working directory, whose file-name is dot (``.''). Working backwards from the append-a-slash rule, we extend the syntax of POSIX file-names to define the empty string to be a file-name form of the root directory ``/''. (However, ``/'' is also acceptable as a file-name form for root.) So the empty string has two interpretations: as a file-name form, it is the file-system root; as a directory form, it is the current working directory. Slash is also an ambiguous form: / is both a directory-form and a file-name form.

The directory form of a file-name is very rarely used. Almost all of the procedures in scsh name directories by giving their file-name form (without the trailing slash), not their directory form. So, you say ``/usr/include'', and ``.'', not ``/usr/include/'' and ``''. The sole exceptions are file-name-as-directory and directory-as-file-name, whose jobs are to convert back-and-forth between these forms, and file-name-directory, whose job it is to split out the directory portion of a file-name. However, most procedures that expect a directory argument will coerce a file-name in directory form to file-name form if it does not have a trailing slash. Bear in mind that the ambiguous case, empty string, will be interpreted in file-name form, i.e., as root.

5.1.2  Procedures

(file-name-directory? fname)     --->     boolean         (procedure) 
(file-name-non-directory? fname)     --->     boolean         (procedure) 
These predicates return true if the string is in directory form or in file-name form, respectively (see the discussion of these two forms above). Note that both return true on the ambiguous case of the empty string, which is both a directory (the current working directory) and a file name (the file-system root).

File name     ...-directory?    ...-non-directory?
"src/des"     #f                #t
"src/des/"    #t                #f
"/"           #t                #f
"."           #f                #t
""            #t                #t

(file-name-as-directory fname)     --->     string         (procedure) 
Convert a file-name to directory form. Basically, add a trailing slash if needed:
(file-name-as-directory "src/des")   ==>  "src/des/"
(file-name-as-directory "src/des/")  ==>  "src/des/"
., /, and "" are special:
(file-name-as-directory ".")  ==>  ""
(file-name-as-directory "/")  ==>  "/"
(file-name-as-directory "")   ==>  "/"

(directory-as-file-name fname)     --->     string         (procedure) 
Convert a directory to a simple file-name. Basically, kill a trailing slash if one is present:
(directory-as-file-name "foo/bar/")  ==>  "foo/bar"
/ and "" are special:
(directory-as-file-name "/")  ==>  "/"
(directory-as-file-name "")   ==>  "."    (i.e., the cwd)
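
To see how the two conversions fit together, the following sketch chains them in a scsh session. It uses only the procedures and results documented above; the helper name round-trip is hypothetical, introduced here for illustration.

```scheme
;; Sketch for a scsh session. round-trip is a hypothetical helper:
;; convert to directory form, then back to file-name form.
(define (round-trip fname)
  (directory-as-file-name (file-name-as-directory fname)))

(round-trip "src/des")   ; via "src/des/", back to "src/des"
(round-trip "src/des/")  ; already in directory form; yields "src/des"
(round-trip ".")         ; via "", back to "."
(round-trip "/")         ; via "/", stays "/"
(round-trip "")          ; "" is read as root, so this yields "/", not ""
```

The last case shows why the empty string is called ambiguous: file-name-as-directory reads it in file-name form (root), so the round trip does not return it unchanged.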

(file-name-absolute? fname)     --->     boolean         (procedure) 
Does fname begin with a root or ~ component? (Recognising ~ as a home-directory specification is an extension of POSIX rules.)
(file-name-absolute? "/usr/shivers")  ==>  #t
(file-name-absolute? "src/des")       ==>  #f
(file-name-absolute? "~/src/des")     ==>  #t
Non-obvious case:
(file-name-absolute? "")  ==>  #t    (i.e., root)

(file-name-directory fname)     --->     string or false         (procedure) 
Return the directory component of fname in directory form. If the file-name is already in directory form, return it as-is.
(file-name-directory "/usr/bdc")    ==>  "/usr/"
(file-name-directory "/usr/bdc/")   ==>  "/usr/bdc/"
(file-name-directory "bdc/.login")  ==>  "bdc/"
(file-name-directory "main.c")      ==>  ""
Root has no directory component:
(file-name-directory "/")  ==>  ""
(file-name-directory "")   ==>  ""

(file-name-nondirectory fname)     --->     string         (procedure) 
Return non-directory component of fname.
(file-name-nondirectory "/usr/ian")    ==>  "ian"
(file-name-nondirectory "/usr/ian/")   ==>  ""
(file-name-nondirectory "ian/.login")  ==>  ".login"
(file-name-nondirectory "main.c")      ==>  "main.c"
(file-name-nondirectory "")            ==>  ""
(file-name-nondirectory "/")           ==>  "/"

(split-file-name fname)     --->     string list         (procedure) 
Split a file-name into its components.
(split-file-name "src/des/main.c")   ==>  ("src" "des" "main.c")
(split-file-name "/src/des/main.c")  ==>  ("" "src" "des" "main.c")
(split-file-name "main.c")           ==>  ("main.c")
(split-file-name "/")                ==>  ("")

(path-list->file-name path-list [dir])     --->     string         (procedure) 
Inverse of split-file-name.

(path-list->file-name '("src" "des" "main.c")) 
    ==>  "src/des/main.c"
(path-list->file-name '("" "src" "des" "main.c"))
    ==>  "/src/des/main.c"
Optional dir arg anchors relative path-lists:
(path-list->file-name '("src" "des" "main.c")
                      "/usr/shivers")
    ==>  "/usr/shivers/src/des/main.c"
A useful value for the optional dir argument is (cwd).
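
Because path-list->file-name is the inverse of split-file-name, a file-name can be taken apart, edited as a list, and reassembled. The following is a minimal sketch for a scsh session; the variable name parts is hypothetical.

```scheme
;; Split an absolute file-name into its components.
(define parts (split-file-name "/src/des/main.c"))
;; parts is ("" "src" "des" "main.c"); the leading "" is the root component.

;; Rejoining the unchanged list gives back the original name.
(path-list->file-name parts)              ; "/src/des/main.c"

;; Dropping the root component leaves a relative path-list, which the
;; optional dir argument can anchor at the current working directory.
(path-list->file-name (cdr parts) (cwd))  ; result depends on (cwd)
```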

(file-name-extension fname)     --->     string         (procedure) 
Return the file-name's extension.
(file-name-extension "main.c")        ==>  ".c"
(file-name-extension "main.c.old")    ==>  ".old"
(file-name-extension "/usr/shivers")  ==>  ""
Weird cases:
(file-name-extension "foo.")   ==>  "."
(file-name-extension "foo..")  ==>  "."
Dot files are not extensions:
(file-name-extension "/usr/shivers/.login")  ==>  ""

(file-name-sans-extension fname)     --->     string         (procedure) 
Return everything but the extension.
(file-name-sans-extension "main.c")        ==>  "main"
(file-name-sans-extension "main.c.old")    ==>  "main.c"
(file-name-sans-extension "/usr/shivers")  ==>  "/usr/shivers"
Weird cases:
(file-name-sans-extension "foo.")   ==>  "foo"
(file-name-sans-extension "foo..")  ==>  "foo."
Dot files are not extensions:
(file-name-sans-extension "/usr/shivers/.login")  ==>  "/usr/shivers/.login"

Note that appending the results of file-name-extension and file-name-sans-extension in all cases produces the original file-name.
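
The invariant stated above can be checked directly in a scsh session. The helper name extension-invariant? is hypothetical; the sketch assumes only the two procedures just described.

```scheme
;; Hypothetical helper: does splitting fname into base and extension
;; and appending the two pieces reproduce fname exactly?
(define (extension-invariant? fname)
  (string=? fname
            (string-append (file-name-sans-extension fname)
                           (file-name-extension fname))))

(extension-invariant? "main.c.old")           ; pieces are "main.c" and ".old"
(extension-invariant? "foo..")                ; pieces are "foo." and "."
(extension-invariant? "/usr/shivers/.login")  ; whole name plus ""
```

Given the results listed in the examples above, each call should return true.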

(parse-file-name fname)     --->     [dir name extension]         (procedure) 
Let f be (file-name-nondirectory fname). This function returns the three values:

(file-name-directory fname)
(file-name-sans-extension f)
(file-name-extension f)

The inverse of parse-file-name, in all cases, is string-append. The boundary case of / was chosen to preserve this inverse.
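
As a sketch of how the three values might be consumed, the following uses the standard call-with-values to bind them and then reassembles the name with string-append. It assumes, as the signature suggests, that the values arrive in the order directory, name, extension.

```scheme
;; Bind the three return values and rebuild the original file-name.
(call-with-values
  (lambda () (parse-file-name "/usr/shivers/main.c"))
  (lambda (dir name ext)
    ;; string-append is the documented inverse, so the last element of
    ;; this list should equal the original file-name.
    (list dir name ext (string-append dir name ext))))
```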

(replace-extension fname ext)     --->     string         (procedure) 
This procedure replaces fname's extension with ext. It is exactly equivalent to
(string-append (file-name-sans-extension fname) ext)
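
A few illustrative calls follow; the results in the comments are derived from the equivalence above and the file-name-sans-extension examples, not from a recorded session. Note that ext is appended verbatim, so it should include its own leading dot.

```scheme
(replace-extension "main.c" ".o")          ; "main" + ".o"      = "main.o"
(replace-extension "main.c.old" ".bak")    ; "main.c" + ".bak"  = "main.c.bak"
(replace-extension "/usr/shivers" ".txt")  ; no extension to drop: "/usr/shivers.txt"
```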

(simplify-file-name fname)     --->     string         (procedure) 
Removes leading and internal occurrences of dot. A trailing dot is left alone, as the parent could be a symlink. Removes internal and trailing double-slashes. A leading double-slash is left alone, in accordance with POSIX. However, triple and more leading slashes are reduced to a single slash, in accordance with POSIX. Double-dots (parent directory) are left alone, in case they come after symlinks or appear in a /../machine/... ``super-root'' form (which POSIX permits).

(resolve-file-name fname [dir])     --->     string         (procedure) 

(expand-file-name fname [dir])     --->     string         (procedure) 
Resolve and simplify the file-name.

(absolute-file-name fname [dir])     --->     string         (procedure) 
Convert file-name fname into an absolute file name, relative to directory dir, which defaults to the current working directory. The file name is simplified before being returned.

This procedure does not treat a leading tilde character specially.

(home-dir [user])     --->     string         (procedure) 
home-dir returns user's home directory. User defaults to the current user.

(home-dir)           ==>  "/user1/lecturer/shivers"
(home-dir "ctkwan")  ==>  "/user0/research/ctkwan"

(home-file [user] fname)     --->     string         (procedure) 
Returns file-name fname relative to user's home directory; user defaults to the current user.
(home-file "man")           ==>  "/usr/shivers/man"
(home-file "fcmlau" "man")  ==>  "/usr/fcmlau/man"

The general substitute-env-vars string procedure, defined in section 5.2, is also frequently useful for expanding file-names.

5.2  Other string manipulation facilities

(substitute-env-vars fname)     --->     string         (procedure) 
Replace occurrences of environment variables with their values. An environment variable is denoted either by a dollar sign followed by alphanumeric characters and underscores, or by a dollar sign followed by the variable name surrounded by braces.

(substitute-env-vars "$USER/.login")  ==>  "shivers/.login"
(substitute-env-vars "${USER}_log")   ==>  "shivers_log"
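
The braces matter when the variable name is followed by a character that could itself be part of a name. The sketch below sets a variable first so that the results do not depend on the caller's environment; PROJECT is a hypothetical variable name, and the results in the comments follow from the rule above rather than from a recorded session.

```scheme
;; PROJECT is a hypothetical variable, set here with scsh's setenv.
(setenv "PROJECT" "scsh")

(substitute-env-vars "$PROJECT/doc")      ; "scsh/doc": the slash ends the name
(substitute-env-vars "${PROJECT}_notes")  ; "scsh_notes": braces delimit the name
(substitute-env-vars "$PROJECT_notes")    ; refers to a different variable,
                                          ; PROJECT_notes, since underscores
                                          ; are part of a name
```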

5.3  ASCII encoding

(char->ascii character)     --->     integer         (procedure) 
(ascii->char integer)     --->     character         (procedure) 
These are identical to char->integer and integer->char except that they use the ASCII encoding.
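
For example, since `A' is code 65 and `a' is code 97 in ASCII:

```scheme
(char->ascii #\A)                      ; 65
(ascii->char 97)                       ; #\a
;; Stepping to the next character by way of its ASCII code.
(ascii->char (+ 1 (char->ascii #\a)))  ; #\b
```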

5.4  Character predicates

(char-letter? character)     --->     boolean         (procedure) 
(char-lower-case? character)     --->     boolean         (procedure) 
(char-upper-case? character)     --->     boolean         (procedure) 
(char-title-case? character)     --->     boolean         (procedure) 
(char-digit? character)     --->     boolean         (procedure) 
(char-letter+digit? character)     --->     boolean         (procedure) 
(char-graphic? character)     --->     boolean         (procedure) 
(char-printing? character)     --->     boolean         (procedure) 
(char-whitespace? character)     --->     boolean         (procedure) 
(char-blank? character)     --->     boolean         (procedure) 
(char-iso-control? character)     --->     boolean         (procedure) 
(char-punctuation? character)     --->     boolean         (procedure) 
(char-hex-digit? character)     --->     boolean         (procedure) 
(char-ascii? character)     --->     boolean         (procedure) 
Each of these predicates tests for membership in one of the standard character sets provided by the SRFI-14 character-set library. Additionally, the following redundant bindings are provided for R5RS compatibility:
R5RS name             scsh definition
char-alphabetic?      char-letter+digit?
char-numeric?         char-digit?
char-alphanumeric?    char-letter+digit?
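
Because these predicates are ordinary one-argument procedures, they can be passed to other procedures. The following is a minimal sketch in portable Scheme; count-matching is a hypothetical helper, not part of scsh.

```scheme
;; Hypothetical helper: count the characters of str that satisfy pred.
(define (count-matching pred str)
  (let loop ((chars (string->list str)) (n 0))
    (cond ((null? chars) n)
          ((pred (car chars)) (loop (cdr chars) (+ n 1)))
          (else (loop (cdr chars) n)))))

(count-matching char-digit? "scsh 0.6.7")       ; 3  (0, 6 and 7)
(count-matching char-whitespace? "ls -l /tmp")  ; 2  (the two spaces)
(count-matching char-hex-digit? "0xCAFE")       ; 5  (x is not a hex digit)
```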

5.5  Deprecated character-set procedures

The SRFI-14 character-set library grew out of an earlier library developed for scsh. However, the SRFI standardisation process introduced incompatibilities with the original scsh bindings. The current version of scsh provides the library obsolete-char-set-lib, which contains the old bindings found in previous releases of scsh. The following table lists the members of this library, along with the equivalent SRFI-14 binding. This obsolete library is deprecated and not open by default in the standard scsh environment; new code should use the SRFI-14 bindings.

Old obsolete-char-set-lib    SRFI-14 char-set-lib
chars->char-set              list->char-set
ascii-range->char-set        ucs-range->char-set (not exact)
predicate->char-set          char-set-filter (not exact)
char-set-every?              char-set-every
char-set-any?                char-set-any
char-set-invert              char-set-complement
char-set-invert!             char-set-complement!
char-set:alphabetic          char-set:letter
char-set:numeric             char-set:digit
char-set:alphanumeric        char-set:letter+digit
char-set:control             char-set:iso-control
Note also that the ->char-set procedure no longer handles a predicate argument.
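
As an illustration of migrating old code, the sketch below shows a few of the old calls in comments, each followed by its replacement from the right-hand column of the table. The variable names are hypothetical, and the sketch assumes the newer character-set bindings are open in the current environment.

```scheme
;; Old: (chars->char-set '(#\a #\e #\i #\o #\u))
(define vowels (list->char-set '(#\a #\e #\i #\o #\u)))

;; Old: (char-set-invert vowels)
(define non-vowels (char-set-complement vowels))

;; Old: (predicate->char-set char-upper-case?)
;; There is no exact replacement: char-set-filter needs an explicit
;; set to filter, here the ASCII characters.
(define ascii-upper (char-set-filter char-upper-case? char-set:ascii))

(char-set-contains? vowels #\e)                ; #t
(char-set-contains? non-vowels #\e)            ; #f
;; Old: (char-set-every? char-upper-case? ascii-upper)
(char-set-every char-upper-case? ascii-upper)  ; #t
```

The "not exact" entries are the ones to review by hand, since the replacement takes a different set of arguments from the old procedure.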