file-name-as-directory & filename grammar

scsh.net
comp.lang.scheme.scsh 2001

From: Olin Shivers <shivers@tokyo.cc.gatech.edu>
Newsgroups: comp.lang.scheme.scsh
Subject: file-name-as-directory & filename grammar
Date: 10 Mar 2001 20:36:33 -0500
Organization: College of Computing, Georgia Tech
Message-ID: <k1snkljdcu.fsf@tokyo.cc.gatech.edu>
A while ago, Noah Friedman raised the issue of file-name-as-directory's
behaviour on the empty string. Scsh defines f-n-a-d to return "/" in this
case; Noah felt it should return "./".

This is something that is discussed at some length in the scsh manual, but
Noah was not convinced. This issue has been raised again by Mike Sperber, in
the context of the release we're assembling, so I have written up the
following explication that may help explain the why of scsh's design.

Filenames are strings that belong to some grammar defining all the
syntactically legal filenames. This grammar, if it is a good one, does more
than specify the set of legal strings that are filenames -- its structure
mirrors the structure of the associated filesystem tree that the filenames
index. [Note TrivialGrammar] Operations on the strings that are filenames
should mirror operations on these parse trees should mirror operations on the
associated filesystem tree; effectively, the filename string ops are
implicitly ops on the associated parse trees -- there's an isomorphism.
Furthermore, other things being equal, the simpler grammar is the better
grammar.

Here is a rather simple (five productions) grammar that fills the bill:
    name ::= [^/]+	; A non-empty sequence of chars not containing 
		    	; slash (or nul). This includes "." and "..".

    afile ::= "" | adir name 	; Absolute file name
    adir  ::= afile "/"		; Absolute directory

    rfile ::= rdir name		; Relative file name
    rdir  ::= "" | rfile "/"	; Relative directory
This grammar gives a very simple invariant: to locate a relative file or directory R within an absolute or relative directory D,
  just append them: D,R. R determines whether the result is a dir or file;
  D determines whether the result is absolute or relative.
So here is a general resolve function:
    resolve(cwd, file) =
      if FILE is absolute then return FILE
      else return concat(CWD, FILE)

  Note that FILE is a filename (afile or rfile) and CWD is a directory
  (adir or rdir).
The empty-string afile is the root directory (as a file name)
The empty-string rdir is the current-working directory.
Dot & dot-dot are not special in the grammar. This is reflected in the fact
  that they are not special in the "kernel filesystem" navigation code &
  typedefs presented below in ML (rather, they are handled by the primitive
  "semantic" function FETCH). Filename simplification that eliminates .'s and
  ..'s exploits their semantics; again, at the syntactic level, they
  are just another filename.
The five-line grammar above is not only very simple, it also structurally
  reflects the semantics of what we are doing with filesystems. That is,
  if you build an AST for the grammar, then parsing a filename into an
  AST gives a structure that is useful for navigating the Unix filesystem.

  In fact, the code below shows a simple ML AST for the grammar above,
  and some filesystem navigation functions you might write for moving
  around the directory structure inside a kernel. 
Some alternate productions, that give equivalent grammars
  + afile ::= "" | afile "/" name	; Can eliminate AFILE's need for ADIR.
  + adir  ::= "/" rdir 			; Alternate production for ADIR
Any proposal for manipulating filenames needs to proceed from a
  grammar explaining filenames, both absolute and relative. So,
  I'd like to see the grammar your system is using, when you propose
  alternate definitions for functions that manipulate strings in
  your grammar.
    -Olin

(* ML-style code defining an AST for the filename grammar and a simple
** navigation function to retrieve entries from the "kernel's" filesystem
** data structure. We assume the kernel's directory data structure can be
** indexed with a FETCH operation.
*)
datatype Afile = afroot | afpath of Adir*Name (* afile ::= "" | adir name *)
type Adir = Afile                             (* adir  ::= afile "/"      *)
type Rfile = Rdir * Name		      (* rfile ::= rdir name      *)
datatype Rdir = rdroot | rdpath of Rfile      (* rdir  ::= "" | rfile "/" *)

datatype Filename = afn of Afile | rfn of Rfile
datatype Dir = ad of Adir | rd of Rdir

(* Fetch the node of ROOT indexed by an absolute file name *)
fun afn2entry(root, afroot) = rootdir
  | afn2entry(root, afpath(adir,name)) = fetch(afn2entry(root,adir), name)

(* Fetch the node of CWD indexed by a relative file name *)
fun rf2entry(cwd, (rd,n)) = fetch(rd2entry(cwd,rd), n)

(* Fetch the note of ROOT indexed by a relative directory. *)
fun rd2entry(cwd, rdroot) = cwd
  | rd2entry(cwd, rdpath(rdir,name)) = fetch(rd2entry(cwd,rdir), name)
[Note TrivialGrammar]
In fact, useful structure is the sole benefit of a good grammar for
filenames, since filenames are sufficiently simple that any string is
a syntactically legal filename. So the trivial grammar given by regexp
  .*

is sufficient to define the set of legal strings.

In the grammars we're constructing, we're ignoring other issues such as how
long a filename can be, in order to focus on the directory-structure issues;
it's not relevant to the discussion at hand.