scheme shell

From: Olin Shivers <>
Subject: SRE discussion: back-references, ADTs
Date: 1998/08/28
Message-ID: <>
Organization: Artificial Intelligence Lab, MIT
Newsgroups: comp.lang.scheme.scsh,comp.lang.scheme

   From: Richard Stallman <>

   I like the idea of a Lisp syntax for regexps.  Emacs now has one;
   I'll include its implementation below.

Howdy, Richard. Your msg raises some issues of general interest, so I am going
to post this reply to comp.lang.scheme.scsh to add to the general discussion.

I know about Bob Glickstein's nice sregex.el package; if you read my post
completely, you will see that I refer to it. I carefully reviewed it before
posting my system, and have corresponded with Bob about it. The SRE system is
a superset of Bob's design -- it has a couple of powerful features that
sregex.el doesn't have, such as a full character-set algebra.

   I do not see the point of this:

       - An abstract data type (ADT) representation for regexp values.

   Given a sexp representation, anything that computes regexps could
   operate on that representation.  Having another one seems like
   excess complexity.

Let me give you an analogy. Would you rather write code that manipulates
list structure -- operations such as REVERSE or ASSOC -- in terms of
strings like
    "(a (b c) ; I'm a comment -- please ignore me\nd (e f) g)"
or would you rather first parse these expressions into cons cells, and
write your REVERSE and ASSOC functions in terms of those data structures?

The ADT gives you *less* complexity, not more. The SRE sexp representation is
a rich, baroque notation designed for expressiveness. It is for *people* to
use. The ADT is a minimal, simple representation. It is for *programs* to
use. You *write* SREs. You *compute* with ADT values.

For example, these are all different ways to write the same regexp, which
matches one consonant:
    (- alpha ("aeiou") ("AEIOU"))
    (uncase (- lower ("aeiou")))
    (w/nocase (- (/ "az") #\a #\e #\i #\o #\u))
    (w/nocase ("bcdfghjklmnpqrstvwxyz"))
    (w/nocase (/ "bd" "fh" "jn" "pt" "vz"))
Every single one of these different SREs is represented by the same
ADT value, a char-set re. Now, if you want to write code that *manipulates*
regexps, you want a *simple* system -- fewer cases to handle.  Instead of
writing code that works with all the baroque surface details, you write
code that only has to work in terms of the small, basic, orthogonal set of
cases represented by the ADT. 

That's why the ADT.
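The payoff shows up the moment you write a traversal. Here is a sketch of one, in Scheme; the predicate and accessor names are illustrative, modeled on scsh's regexp ADT, and are not guaranteed to match its exact API:

```scheme
;; Count the char-set leaves of a regexp ADT value. One clause per
;; primitive constructor is all it takes, no matter which of the five
;; SRE spellings above produced the value. (Predicate and accessor
;; names here are illustrative, not necessarily scsh's exact API.)
(define (count-char-sets re)
  (cond ((re-char-set? re) 1)
        ((re-string? re)   0)
        ((re-seq? re)    (apply + (map count-char-sets (re-seq:elts re))))
        ((re-choice? re) (apply + (map count-char-sets (re-choice:elts re))))
        ((re-repeat? re) (count-char-sets (re-repeat:body re)))
        (else 0)))
```

All five consonant SREs above denote the same single char-set value, so this function answers 1 for every one of them.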

       SRE notation allows you to compute parts of a regular expression
       at run time.

   Why make this a separate feature?  Why not just remind people that
   they can use backquote together with SREs?

Because that would confuse a data structure with a language-level
construct. SREs are a *notation* for expressing a computation -- that's
language level, like variables, conditionals, and function calls.

ADT values are *not* a language-level construct. They are run-time values.

You can use SREs as a notation in language contexts that have nothing to
do with run-time values. For example, you might want to write a fancy
macro that would expand a static SRE out into a big PROG or LETREC *directly*
implementing the DFA's state machine.
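Concretely, the SRE notation carries its own escape into Scheme: a ,<exp> form inside an SRE is evaluated at run time and its value is spliced into the pattern. A sketch, assuming scsh's RX macro and that a character is accepted where a char-set is expected:

```scheme
;; Build a matcher for a field terminated by a run-time-chosen
;; delimiter character. The ,delim forms are evaluated when the
;; RX expression is -- no backquote machinery required.
(define (field-regexp delim)
  (rx (: (* (~ ,delim)) ,delim)))

(field-regexp #\:)   ; a regexp matching everything up through a colon
```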

   Sorry, I don't know the term "deleted submatches".

It's discussed at some length in the SRE document.

       ** No back-references

       Completely non-regular -- there's a reason these were dropped from
       Posix' "extended" (= "modern") regexp notation.

   I don't know of any good reason; if we use this, we would add that feature
   as a matter of course.

My feeling about back references is as follows. Regexps are based on a deep
theory -- regular sets and DFAs -- that has tremendous implications for the
operations you can perform on them and the ways you can implement them.
Back-references completely shatter this framework. They rule out certain
extremely efficient implementations. They rule out certain operations. They
have nothing to do with the idea of a "regular expression." They are not one
with the deep structure of the system.
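To see how thoroughly they shatter it: the Posix pattern \(.*\)\1 matches exactly the strings of the form ww, a textbook pumping-lemma example of a non-regular language. No DFA, however large, can recognize it; a direct matcher has to compare two unbounded substrings, as in this little R5RS sketch:

```scheme
;; Recognize { ww : w any string } -- the language \(.*\)\1 denotes.
;; The substring comparison below is exactly the kind of unbounded
;; memory a finite-state machine cannot have.
(define (square-string? s)
  (let ((n (string-length s)))
    (and (even? n)
         (string=? (substring s 0 (quotient n 2))
                   (substring s (quotient n 2) n)))))

(square-string? "abcabc")  ; => #t
(square-string? "abcabd")  ; => #f
```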

Furthermore, I do not feel that they are all that important a feature. My
claim is this: if you want to push beyond the power of regular-string parsing,
don't break the regexp system by larding on extra mechanisms that screw up the
fundamental assumptions of the whole system; design and use a tool that is
specifically for more powerful grammars. For example, Manuel Serrano has a
really lovely little pattern-matcher/parser notation in his Bigloo Scheme
system that allows you to specify more powerful patterns. This system is built
*on top of* a regexp matcher. He gets the power, but doesn't violate the
structure of his framework. *That's* the kind of system you should use if you
want to write non-regular patterns.

I am not alone in holding these opinions. Henry Spencer is one of the
two or three biggest regexp gurus around today -- his regexp package is
*the* standard free software regexp package in use. If you check his man
page, you will see that he refers to basic Posix regexps, which
allow back-references, as "obsolete," and he refers to extended Posix
regexps, which do *not* allow back-references, as "modern." That tells
the story right there. There's a reason why people, after careful
consideration, left back references out of the extended Posix regexp spec.

If you want a second opinion, you might check with Tom Lord, who is also a
major regexp wizard, and a member of the free software community.

In sum, if you feel you *have* to have them for, say, backwards compatibility,
then you should hack them into any Emacs port you might do of SREs as an
incompatible Emacs extension. But I do not intend to put them into the base
notation for the reasons I've outlined. And the more gratuitous changes you
make to the notation & API, the more you damage the point of having a single,
carefully-thought-out standard that is consistent across implementations.