Comments on the proposed SRE notation

scsh.net
comp.lang.scheme.scsh 1998

From: Olin Shivers <shivers@lambda.ai.mit.edu>
Subject: SRE comments
Date: 1998/09/04
Message-ID: <qijbtovwwh7.fsf@lambda.ai.mit.edu>
Organization: Artificial Intelligence Lab, MIT
Newsgroups: comp.lang.scheme.scsh


Thanks for the feedback! Below are my replies to four or five different posts.
The same small set of issues have been raised a couple of times, so they are
clearly worth addressing.
    -Olin

-------------------------------------------------------------------------------
   From: Andy Gaynor <silver@mail.webspan.net>

   I think you should get your usage of <m> and <n> straightened out like I did
   below, even if you're not fond of "m:n" et al.

   (m:n <m> <n> <sre> ...)

       To indicate that <n> is unbound, use oo, which looks a lot like the
       infinity symbol.
       When <m> = <n>, the "m:" in "m:n" and <m> can be omitted.
       When <m> = 0, the "m" in "m:n" can be "0" and <m> omitted.
       When <m> = 1, the "m" in "m:n" can be "1" and <m> omitted.
       When <m> = 0 and <n> = 1, "m:n" can be "0:1" and <m> and <n> omitted.
       When <n> is unbound, the "n" in "m:n" can be "oo" and <n> omitted.

       Derived         Canonical          Original Derived  Original Canonical
		       (m:n <m> <n> ...)                    (** <n> <m> ...)
       (  n  <n> ...)  (m:n <n> <n> ...)  (=  <n>     ...)  (** <n> <n> ...)
       (0:n  <n> ...)  (m:n 0   <n> ...)                    (** 0   <n> ...)
       (1:n  <n> ...)  (m:n 1   <n> ...)                    (** 1   <n> ...)
       (  oo     ...)  (m:n 0   oo  ...)  (*  <n>     ...)  (** 0   #f  ...)
       (0:1      ...)  (m:n 0   1   ...)  (?          ...)  (** 0   1   ...)
       (0:oo     ...)  (m:n 0   oo  ...)                    (** 0   #f  ...)
       (1:oo     ...)  (m:n 1   oo  ...)                    (** 1   #f  ...)
       (m:oo <m> ...)  (m:n <m> oo  ...)  (>= <n>     ...)  (** <n> #f  ...)

This is a very intriguing suggestion. But not without its own problems.
- With this naming convention, we want to use some consistent default for the
  lower bound when the "m" prefix is missing, i.e. (:n ...) or (:oo ...). But
  there are multiple possible default start points: 0, 1, or m. There's no
  clear best choice.

- "m" and "n" are completely meaning-free lexemes, unlike lexemes such as
  "from", "to", "<", "<=". Because of their long-standing use in traditional
  notation, "?", "+", and "*" have meaning that assist the reader and writer
  of SREs.

  
   >     (|  <sre> ...)              Choice (OR is R5RS symbol; | is unspecified)
   >     (or <sre> ...)

   Consider "|".

In my current definition and implementation, *both* identifiers are valid.
In systems not allowing a | symbol, one uses OR. Similarly for : and SEQ.

   >     (:   <sre> ...)             Sequence (SEQ is Common Lisp symbol)
   >     (seq <sre> ...)             

   Consider "&".  ":" has too many potential meanings already.

No good -- & is already paired up with | as the intersection/union pair
of set operators.

   >     ,@<exp>                     Dynamically computed regexp.
   >     ,<exp>                      Same as ,@<exp>, but no submatch info.
   >                                     <EXP> must produce a character, string,
   >                                     char-set, or regexp.

   I question the wisdom of adding more meaning to ",@" and ",".  Consider
   removing the additional meaning and introducing another form to indicate that
   the context is not available.  Perhaps:


   I think things should at least accommodate Latin-1.
   
The design is char-set neutral. Regexps talk about strings composed of
whatever the underlying character set provides.  In a Scheme (or Lisp, etc.)
that provides Latin-1 characters, you get Latin-1 regexps. On a Unicode
Scheme, you get Unicode regexps.

As for my reference implementation, it's trivial to extend it handle
Latin-1. You simply extend the underlying character type to Latin-1 (which is
trivial in Scheme 48, mostly requiring you to slightly tweak the definitions
of the base character sets like char-set:alpha or char-set:graphic in
char-set.el).

   Even when embedded in a nice language, a minilanguage such
   as this still makes me uncomfortable.  I start thinking along these
   lines...  Why don't we simply supply normal code which is processed in an
   environment which binds the salient macros and procedures, and perhaps
   processing the final value?  

Well, whenever I design a little language for embedding in Scheme, I always
take care of this issue by providing an equivalent (and, in the
implementation, underlying) set of *procedures* and data structures.  In scsh,
you have the FORK, EXEC, and other system calls that provide the functionality
of the process-form notation. You can program directly with these procedural
mechanisms; or you can design your *own* specialised notation by writing
macros that expand into Scheme expressions that use these functions. There's a
similar procedure/syntax pairing in AWK. In the SRE case, you have the ADT
constructors as procedural equivalents to SRE notation, and the text-search
procedures accept regexps constructed either way.

So I have given you what you want. If you don't want to use the SRE notation,
you don't have to use it. You can use the the ADT constructor functions
directly; you can write your own functions in terms of these constructors;
you can build your own specialised notation. The existence of implementation
support for SREs never limits you; the machinery is wide open and notationally
neutral.

   Oh, there always seems to be some argument
   favoring this feature, and some justification for that feature, but I'm
   never convinced that inherent overhead is justified.  Btw, I'm familiar
   with Scsh and some of what you've previously written on the topic.  I'm not
   denigrating your work, but questioning the wisdom of this design principle.

The argument for this kind of thing is the argument in favor of "little
languages," suitably adapted to an sexp/macro framework. Specialised notations
can provide big notational leverage -- that is the whole point of regular
expressions, after all, since they are just a specialised, compact notation
for a restricted set of matchers that could be written directly in C or
Scheme. Realising these specialised "little language" notations in an sexp
framework allows you to embed them into general Scheme programs using macros,
so you can have your cake and eat it, too. I have written on this subject --
see the "little languages" paper in my list of papers on my web page
    http://www.ai.mit.edu/~shivers/

-------------------------------------------------------------------------------
    From: Ray Dillinger <bear@sonic.net>

    I have to say that I have one major problem with this design 
    though; it implements a language whose S-expressions are far too 
    easy to mistake for those of the procedure representation in 
    scheme.  

That's going to be true of *any* sexp-based notation.

    In particular, this design adds new and incompatible meanings 
    to (* ...), (+ ...) (< ...), (>...), ,@ ,and other symbols 
    already defined in scheme.  

Quite right. Because this notation *is not* Scheme. It's a separate notation.

    Also, I think that a *necessary* thing in the design of a 
    regexp system is that each and every expression which is 
    part of regexp matching ought to be a scheme function, which 
    takes a stream (or false) as one of its arguments and returns 
    a stream (or false ) as its result.  That way the concatenation 
    of regexps can be represented as a composition of functions, 
    and the function names have unambiguous meanings in scheme, 
    so they can be defined using lambdas, etc.

    So, I would prefer to write: 

    (repetitions+ (match #/a (match-one-of "foo" "baz" instring)))

    for the regexp better known as 

    ((foo|baz)a)+

    and if I get back #f, I know that it did not match -- or 
    if I get back "flarp" I know that it matched and that the 
    next five characters after the match were "flarp".  Or if 
    I get back the empty string I know that it matched and used up 
    all of the characters in the string it was passed.  

No -- you are talking about how one *uses* regexps -- for searching,
or matching, and so forth. SRE notation is *simply* for writing them
down. If you have a *use* in mind, then you can write a function that
implements this use using regexp ADT values, or write a macro that implements
this functionality. SREs are use-independent.

For example, if we *wired in* the idea that regexps were represented by
bactracking Scheme search functions, as you suggest, then we'd be screwed when
we wanted an SRE to stand for a big, hairy LETREC that directly implemented
the regexp's corresponding DFA.

Or perhaps we might want to write a macro or function that used regexps as
*generators*, not acceptors -- i.e., we could "run them backwards" to get them
to produce strings in the regular set, as opposed to accepting strings in the
set. That would not be possible if you represented regexps as Scheme search
functions -- you would have already committed to a particular use.

So you want to keep your notation use-neutral. 

Also, bear in mind that SREs are a notation for a computation. In this
respect, we want to make it possible to use them as a "computational spec" in
the same manner in which we use other programming language notation -- i.e.,
we should be able to use them *at the language level*, not simply as a
description of a data value. (As in my example above, where an SRE could be
used to represent the equivalent DFA.) David Rush made this point in an
earlier post.

    Hmmmm.....  Then I think about it and realize that these 
    will have to return multiple results somehow because of the 
    implicit backtracking involved in regexp matching.  So we're 
    now talking about a stream of possible matches instead of a 
    single result....  and functions that take stream arguments. 

    This one is a tough nut.  

Ahh... as you see, this "necessary" thing you want is actually not at
all necessary or even easy to define.

-------------------------------------------------------------------------------
    From: Andy Gaynor <silver@mail.webspan.net>

	Ray Dillinger <bear@sonic.net> wrote:
	> In particular, this design adds new and incompatible meanings to (*
	> ...), (+ ...) (< ...), (>...), ,@ ,and other symbols already defined
	> in scheme.  I can't help thinking that these will muddy namespace
	> issues.

	tob@world.std.com ("Tom Breton") wrote:
	> I agree.  This is reminiscent of C++ operator overloading.  Why be
	> so afraid of adding a few new words?

    Namespace burden, both real and conceptual.

    In general, I like the concept of operator overloading.  I'll concede that
    it can cause quite a few language design headaches, and overloading things
    in straight Scheme is a cumbersome manual task which often proves
    inefficient in practice.

No, no. I'm not overloading operators. SRE's ARE NOT SCHEME. They are a
separate notation. You may embed them in Scheme, using my (RX <sre> ...)
special form. But you don't have to. The meaning of a string is dependent
upon the language in which you interpret the meaning of that string. It isn't
"overloading" to say that
    (f + 3)
means two different things depending on whether it's C or Scheme. Similarly,
    (** 3 5 "foo")
means two different things depending on whether it's Scheme or SRE.

Harvey Stein made these points in an earlier post.

-------------------------------------------------------------------------------
    From: "schwartz+!usenet%"@((Scott Schwartz))bio.cse.psu.edu 

    Rob Pike wrote a couple of papers on "Structural Regular Expression",
    which are used in the "sam" editor.  They have some nice properties,
    and it would be appealing if your stuff could harmoniously incorporate
    those ideas, especially since the initials are the same. :-) (The
    papers are available somewhere under http://plan9.bell-labs.com; click
    on "distribution", and the picture of the manual.)

    In a nutshell, each operator either generates a sequence of regions or
    does something to a sequence of regions.  You string them together,
    something like a shell pipeline.

Yes, this is a very interesting paper. My problem with it is that you don't
want to limit yourself to the parsing power of regexps to partition up your
editor buffers. You want a system that makes the simple things easy, and the
complex things possible. Pike's system uses regexps, which makes the simple
easy (which is good), but doesn't make the complex possible.

If you want to sit down and use the full power of Scheme to write a little
function that will parse a stream into chunks, where each chunk is, say, a C
function, or a Scheme expression, you should be able to use this function in
your "structured editing" system simply by naming it.

Pike's system is a lovely framework. But it doesn't have a Turing-equivalent
notation for specifying the component parsers, and as such is limited. You
are limited to what you can parse with regular grammars.

Doing a system like Pike's in Scheme would give you the best of both worlds.

One of my rules for designing specialised notations to be embedded inside
Scheme is to provide, wherever possible, for "escapes" to Scheme computation.
The specialised notation makes the simple things easy; the general Scheme
escapes make the complex possible.

-------------------------------------------------------------------------------
    From: Marco Antoniotti <marcoxa@copernico.parades.rm.cnr.it>

    ':' and '|' and other ones, are certain to wreak havoc in any CL based
    implementation (assuming you do not use call/cc or other too schemeish
    constructs in your reference implementation).

Yes, quite right. That's why I defined slightly more verbose equivalent
operator names (SEQ, OR) to be used when the low-level lexical details of a
given language's s-expression form doesn't permit these symbols to be written
in an easy way -- and Common Lisp is one such system.

-------------------------------------------------------------------------------
    From: Andy Gaynor <silver@mail.webspan.net>
    Subject: Re: The SRE regular-expression notation
    
    At least there's "," and ",@", so the SRE interface can be escaped when
    necessary, so there's no lack of genericity.

No, no. If you want to build regexps at run-time with procedural constructors,
use the ADT's constructor procedures. Just say
    (re-seq (re-string "foo")
            (re-repeat 0 #f re-any)
	    (re-string "bar"))
instead of
    (rx "foo" (* any) "bar")
What you want is *directly* provided by the system. I think this will be
significantly more verbose and harder to read than writing the same thing
in SREs -- but opinions vary, and you are free to use either system. I would
prefer to write my *static* regexps using SREs, and to reserve the ADT
functions for writing functions that compute on regexp values.

    Concerning my vague suggestion, I was thinking
    along the following lines...  The body of an `sre' syntactic form (ie
    `(sre . body)') is normal Scheme code processed with some particular
    bindings in effect, which includes variables bound to interesting
    attributes and common regular expressions, some syntactic forms and
    procedures for manipulating regular expressions, etc.  The body must
    return a regular expression.  

This is not what happens. To repeat: an SRE is an s-expression notation for
regular expressions that is *completely distinct* from Scheme (in principle,
anyway -- the few extensions that allow Scheme expressions to be embedded
within an SRE can be removed cleanly). There is a Scheme special form -- the
(RX <sre> ...) form -- that allows you to *embed* an SRE expression within a
Scheme expression. There is no form, in either Scheme or SRE notation, that
uses the keyword SRE (as you mention). So an SRE is not "Scheme code processed
with some particular bindings in effect." It isn't Scheme code at all. It's a
notation for regexps. Some clients of expressions written in this notation
might translate these SRE expressions into Scheme code. Or C code. Or a data
structure. Or something else. It's just a way to write down a regular
expression.

    Assuming the same base conventions, I suppose most simpler regular
    expressions would probably be coded similarly with either interface.
    However, when it comes down to computing REs and juggling them about, what
    I propose wins hands-down since arbitrary Scheme forms can be seamlessly
    incorporated (without using "," and ",@").

Once again, it's not either/or. I provide *both* a notation *and* a 
procedure/data-structure. Don't criticise SRE notation for being something
it isn't. It's a notation, not a data structure. If you want a data structure
or procedural constructor, then use that. If you want a specialised notation,
then you've got SREs.
-------------------------------------------------------------------------------