From: Olin Shivers <shivers@lambda.ai.mit.edu>
Subject: SRE comments
Date: 1998/09/04
Message-ID: <qijbtovwwh7.fsf@lambda.ai.mit.edu>
Organization: Artificial Intelligence Lab, MIT
Newsgroups: comp.lang.scheme.scsh
Thanks for the feedback! Below are my replies to four or five different posts.
The same small set of issues has been raised a couple of times, so they are
clearly worth addressing.
-Olin
-------------------------------------------------------------------------------
From: Andy Gaynor <silver@mail.webspan.net>
I think you should get your usage of <m> and <n> straightened out like I did
below, even if you're not fond of "m:n" et al.
(m:n <m> <n> <sre> ...)
To indicate that <n> is unbound, use oo, which looks a lot like the
infinity symbol.
When <m> = <n>, the "m:" in "m:n" and <m> can be omitted.
When <m> = 0, the "m" in "m:n" can be "0" and <m> omitted.
When <m> = 1, the "m" in "m:n" can be "1" and <m> omitted.
When <m> = 0 and <n> = 1, "m:n" can be "0:1" and <m> and <n> omitted.
When <n> is unbound, the "n" in "m:n" can be "oo" and <n> omitted.
Derived           Canonical           Original derived   Original canonical
                  (m:n <m> <n> ...)                      (** <m> <n> ...)
(n <n> ...)       (m:n <n> <n> ...)   (= <n> ...)        (** <n> <n> ...)
(0:n <n> ...)     (m:n 0 <n> ...)                        (** 0 <n> ...)
(1:n <n> ...)     (m:n 1 <n> ...)                        (** 1 <n> ...)
(oo ...)          (m:n 0 oo ...)      (* ...)            (** 0 #f ...)
(0:1 ...)         (m:n 0 1 ...)       (? ...)            (** 0 1 ...)
(0:oo ...)        (m:n 0 oo ...)                         (** 0 #f ...)
(1:oo ...)        (m:n 1 oo ...)                         (** 1 #f ...)
(m:oo <m> ...)    (m:n <m> oo ...)    (>= <m> ...)       (** <m> #f ...)
This is a very intriguing suggestion. But not without its own problems.
- With this naming convention, we want to use some consistent default for the
lower bound when the "m" prefix is missing, i.e. (:n ...) or (:oo ...). But
there are multiple possible default start points: 0, 1, or m. There's no
clear best choice.
- "m" and "n" are completely meaning-free lexemes, unlike lexemes such as
"from", "to", "<", "<=". Because of their long-standing use in traditional
notation, "?", "+", and "*" have meaning that assist the reader and writer
of SREs.
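For reference, the repetition forms as currently drafted look roughly like
this (a quick sketch, not the full grammar):
    (rx (* "ab"))        ; zero or more    -- (** 0 #f "ab")
    (rx (+ "ab"))        ; one or more     -- (** 1 #f "ab")
    (rx (? "ab"))        ; zero or one     -- (** 0 1 "ab")
    (rx (= 3 "ab"))      ; exactly three
    (rx (>= 3 "ab"))     ; three or more
    (rx (** 2 5 "ab"))   ; two to five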
> (| <sre> ...) Choice (OR is R5RS symbol; | is unspecified)
> (or <sre> ...)
Consider "|".
In my current definition and implementation, *both* identifiers are valid.
In systems not allowing a | symbol, one uses OR. Similarly for : and SEQ.
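So, for instance:
    (rx (| "foo" "bar"))    ; where the reader accepts a bare |
    (rx (or "foo" "bar"))   ; the same regexp, spelled portably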
> (: <sre> ...) Sequence (SEQ is Common Lisp symbol)
> (seq <sre> ...)
Consider "&". ":" has too many potential meanings already.
No good -- & is already paired up with | as the intersection/union pair
of set operators.
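In the char-class sublanguage they do the obvious set things -- roughly:
    (rx (& alpha (~ ("aeiou"))))   ; intersection: alphabetic but not a vowel
    (rx (| alpha digit))           ; union: alphanumeric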
> ,@<exp> Dynamically computed regexp.
> ,<exp> Same as ,@<exp>, but no submatch info.
> <EXP> must produce a character, string,
> char-set, or regexp.
I question the wisdom of adding more meaning to ",@" and ",". Consider
removing the additional meaning and introducing another form to indicate that
the context is not available. Perhaps:
I think things should at least accommodate Latin-1.
The design is char-set neutral. Regexps talk about strings composed of
whatever the underlying character set provides. In a Scheme (or Lisp, etc.)
that provides Latin-1 characters, you get Latin-1 regexps. On a Unicode
Scheme, you get Unicode regexps.
As for my reference implementation, it's trivial to extend it to handle
Latin-1. You simply extend the underlying character type to Latin-1 (which is
trivial in Scheme 48, mostly requiring you to slightly tweak the definitions
of the base character sets like char-set:alpha or char-set:graphic in
char-set.el).
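A sketch of the kind of tweak I mean (the names and the character list here
are just illustrative):
    (define char-set:latin-1-alpha
      (char-set-union char-set:alpha                   ; the ASCII letters
                      (string->char-set "åàéèçñöü")))  ; plus the Latin-1
                                                       ;   letters (abbreviated)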
Even when embedded in a nice language, a minilanguage such
as this still makes me uncomfortable. I start thinking along these
lines... Why don't we simply supply normal code which is processed in an
environment which binds the salient macros and procedures, and perhaps
process the final value?
Well, whenever I design a little language for embedding in Scheme, I always
take care of this issue by providing an equivalent (and, in the
implementation, underlying) set of *procedures* and data structures. In scsh,
you have the FORK, EXEC, and other system calls that provide the functionality
of the process-form notation. You can program directly with these procedural
mechanisms; or you can design your *own* specialised notation by writing
macros that expand into Scheme expressions that use these functions. There's a
similar procedure/syntax pairing in AWK. In the SRE case, you have the ADT
constructors as procedural equivalents to SRE notation, and the text-search
procedures accept regexps constructed either way.
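To take the scsh case above as an example (a rough sketch; the exact argument
conventions are in the scsh manual):
    (run (ls -l))                                     ; process-form notation
    (wait (fork (lambda () (exec-path "ls" "-l"))))   ; the underlying syscalls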
So I have given you what you want. If you don't want to use the SRE notation,
you don't have to use it. You can use the ADT constructor functions
directly; you can write your own functions in terms of these constructors;
you can build your own specialised notation. The existence of implementation
support for SREs never limits you; the machinery is wide open and notationally
neutral.
Oh, there always seems to be some argument
favoring this feature, and some justification for that feature, but I'm
never convinced that inherent overhead is justified. Btw, I'm familiar
with Scsh and some of what you've previously written on the topic. I'm not
denigrating your work, but questioning the wisdom of this design principle.
The argument for this kind of thing is the argument in favor of "little
languages," suitably adapted to an sexp/macro framework. Specialised notations
can provide big notational leverage -- that is the whole point of regular
expressions, after all, since they are just a specialised, compact notation
for a restricted set of matchers that could be written directly in C or
Scheme. Realising these specialised "little language" notations in an sexp
framework allows you to embed them into general Scheme programs using macros,
so you can have your cake and eat it, too. I have written on this subject --
see the "little languages" paper in my list of papers on my web page
http://www.ai.mit.edu/~shivers/
-------------------------------------------------------------------------------
From: Ray Dillinger <bear@sonic.net>
I have to say that I have one major problem with this design
though; it implements a language whose S-expressions are far too
easy to mistake for those of the procedure representation in
scheme.
That's going to be true of *any* sexp-based notation.
In particular, this design adds new and incompatible meanings
to (* ...), (+ ...), (< ...), (> ...), ,@, and other symbols
already defined in Scheme.
Quite right. Because this notation *is not* Scheme. It's a separate notation.
Also, I think that a *necessary* thing in the design of a
regexp system is that each and every expression which is
part of regexp matching ought to be a scheme function, which
takes a stream (or false) as one of its arguments and returns
a stream (or false) as its result. That way the concatenation
of regexps can be represented as a composition of functions,
and the function names have unambiguous meanings in scheme,
so they can be defined using lambdas, etc.
So, I would prefer to write:
(repetitions+ (match #/a (match-one-of "foo" "baz" instring)))
for the regexp better known as
((foo|baz)a)+
and if I get back #f, I know that it did not match -- or
if I get back "flarp" I know that it matched and that the
next five characters after the match were "flarp". Or if
I get back the empty string I know that it matched and used up
all of the characters in the string it was passed.
No -- you are talking about how one *uses* regexps -- for searching,
or matching, and so forth. SRE notation is *simply* for writing them
down. If you have a *use* in mind, then you can write a function that
implements this use using regexp ADT values, or write a macro that implements
this functionality. SREs are use-independent.
For example, if we *wired in* the idea that regexps were represented by
backtracking Scheme search functions, as you suggest, then we'd be screwed when
we wanted an SRE to stand for a big, hairy LETREC that directly implemented
the regexp's corresponding DFA.
Or perhaps we might want to write a macro or function that used regexps as
*generators*, not acceptors -- i.e., we could "run them backwards" to get them
to produce strings in the regular set, as opposed to accepting strings in the
set. That would not be possible if you represented regexps as Scheme search
functions -- you would have already committed to a particular use.
So you want to keep your notation use-neutral.
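To make that concrete: one regexp value, several consumers. Something like
this (REGEXP-SEARCH is the ordinary searcher; the generator is purely
hypothetical, just to show the notation doesn't preclude it):
    (define re (rx (+ (: (| "foo" "baz") "a"))))   ; the ((foo|baz)a)+ regexp

    (regexp-search re "xxbazafooayy")    ; acceptor use: find a match
    ;; (regexp->generator re)            ; generator use: produce strings
                                         ;   in the regular set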
Also, bear in mind that SREs are a notation for a computation. In this
respect, we want to make it possible to use them as a "computational spec" in
the same manner in which we use other programming language notation -- i.e.,
we should be able to use them *at the language level*, not simply as a
description of a data value. (As in my example above, where an SRE could be
used to represent the equivalent DFA.) David Rush made this point in an
earlier post.
Hmmmm..... Then I think about it and realize that these
will have to return multiple results somehow because of the
implicit backtracking involved in regexp matching. So we're
now talking about a stream of possible matches instead of a
single result.... and functions that take stream arguments.
This one is a tough nut.
Ahh... as you see, this "necessary" thing you want is actually not at
all necessary or even easy to define.
-------------------------------------------------------------------------------
From: Andy Gaynor <silver@mail.webspan.net>
Ray Dillinger <bear@sonic.net> wrote:
> In particular, this design adds new and incompatible meanings to
> (* ...), (+ ...), (< ...), (> ...), ,@, and other symbols already defined
> in scheme. I can't help thinking that these will muddy namespace
> issues.
tob@world.std.com ("Tom Breton") wrote:
> I agree. This is reminiscent of C++ operator overloading. Why be
> so afraid of adding a few new words?
Namespace burden, both real and conceptual.
In general, I like the concept of operator overloading. I'll concede that
it can cause quite a few language design headaches, and overloading things
in straight Scheme is a cumbersome manual task which often proves
inefficient in practice.
No, no. I'm not overloading operators. SREs ARE NOT SCHEME. They are a
separate notation. You may embed them in Scheme, using my (RX <sre> ...)
special form. But you don't have to. The meaning of a string depends upon
the language in which you interpret it. It isn't
"overloading" to say that
(f + 3)
means two different things depending on whether it's C or Scheme. Similarly,
(** 3 5 "foo")
means two different things depending on whether it's Scheme or SRE.
Harvey Stein made these points in an earlier post.
-------------------------------------------------------------------------------
From: "schwartz+!usenet%"@((Scott Schwartz))bio.cse.psu.edu
Rob Pike wrote a couple of papers on "Structural Regular Expressions",
which are used in the "sam" editor. They have some nice properties,
and it would be appealing if your stuff could harmoniously incorporate
those ideas, especially since the initials are the same. :-) (The
papers are available somewhere under http://plan9.bell-labs.com; click
on "distribution", and the picture of the manual.)
In a nutshell, each operator either generates a sequence of regions or
does something to a sequence of regions. You string them together,
something like a shell pipeline.
Yes, this is a very interesting paper. My problem with it is that you don't
want to limit yourself to the parsing power of regexps to partition up your
editor buffers. You want a system that makes the simple things easy, and the
complex things possible. Pike's system uses regexps, which makes the simple
easy (which is good), but doesn't make the complex possible.
If you want to sit down and use the full power of Scheme to write a little
function that will parse a stream into chunks, where each chunk is, say, a C
function, or a Scheme expression, you should be able to use this function in
your "structured editing" system simply by naming it.
Pike's system is a lovely framework. But it doesn't have a Turing-equivalent
notation for specifying the component parsers, and as such is limited. You
are limited to what you can parse with regular grammars.
Doing a system like Pike's in Scheme would give you the best of both worlds.
One of my rules for designing specialised notations to be embedded inside
Scheme is to provide, wherever possible, for "escapes" to Scheme computation.
The specialised notation makes the simple things easy; the general Scheme
escapes make the complex possible.
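The "," escape in the SRE notation is itself a small instance of this rule.
For example, a little combinator written in plain Scheme (a sketch):
    (define (csl re)                      ; comma-separated list of RE matches
      (rx (? ,re (* ", " ,re))))

    (define ints-list (csl (rx (+ digit))))   ; matches e.g. "7, 42, 1998"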
-------------------------------------------------------------------------------
From: Marco Antoniotti <marcoxa@copernico.parades.rm.cnr.it>
':' and '|' and other ones, are certain to wreak havoc in any CL based
implementation (assuming you do not use call/cc or other too schemeish
constructs in your reference implementation).
Yes, quite right. That's why I defined slightly more verbose equivalent
operator names (SEQ, OR) to be used when the low-level lexical details of a
given language's s-expression syntax don't permit these symbols to be written
easily -- and Common Lisp is one such system.
-------------------------------------------------------------------------------
From: Andy Gaynor <silver@mail.webspan.net>
Subject: Re: The SRE regular-expression notation
At least there's "," and ",@", so the SRE interface can be escaped when
necessary; there's no lack of genericity.
No, no. If you want to build regexps at run-time with procedural constructors,
use the ADT's constructor procedures. Just say
(re-seq (re-string "foo")
        (re-repeat 0 #f re-any)
        (re-string "bar"))
instead of
(rx "foo" (* any) "bar")
What you want is *directly* provided by the system. I think this will be
significantly more verbose and harder to read than writing the same thing
in SREs -- but opinions vary, and you are free to use either system. I would
prefer to write my *static* regexps using SREs, and to reserve the ADT
functions for writing functions that compute on regexp values.
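For example (using only the constructors shown above; the helper is made up
for illustration):
    (define (bracket re)                       ; wrap any regexp in literal parens
      (re-seq (re-string "(") re (re-string ")")))

    (bracket (re-repeat 0 #f re-any))          ; same value as (rx "(" (* any) ")")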
Concerning my vague suggestion, I was thinking
along the following lines... The body of an `sre' syntactic form (i.e.
`(sre . body)') is normal Scheme code processed with some particular
bindings in effect, which includes variables bound to interesting
attributes and common regular expressions, some syntactic forms and
procedures for manipulating regular expressions, etc. The body must
return a regular expression.
This is not what happens. To repeat: an SRE is an s-expression notation for
regular expressions that is *completely distinct* from Scheme (in principle,
anyway -- the few extensions that allow Scheme expressions to be embedded
within an SRE can be removed cleanly). There is a Scheme special form -- the
(RX <sre> ...) form -- that allows you to *embed* an SRE expression within a
Scheme expression. There is no form, in either Scheme or SRE notation, that
uses the keyword SRE (as you mention). So an SRE is not "Scheme code processed
with some particular bindings in effect." It isn't Scheme code at all. It's a
notation for regexps. Some clients of expressions written in this notation
might translate these SRE expressions into Scheme code. Or C code. Or a data
structure. Or something else. It's just a way to write down a regular
expression.
Assuming the same base conventions, I suppose most simpler regular
expressions would probably be coded similarly with either interface.
However, when it comes down to computing REs and juggling them about, what
I propose wins hands-down since arbitrary Scheme forms can be seamlessly
incorporated (without using "," and ",@").
Once again, it's not either/or. I provide *both* a notation *and* a
procedure/data-structure. Don't criticise SRE notation for being something
it isn't. It's a notation, not a data structure. If you want a data structure
or procedural constructor, then use that. If you want a specialised notation,
then you've got SREs.
-------------------------------------------------------------------------------