SRE vs. Perl's RegExp Notation by Example

scsh.net

comp.lang.scheme.scsh 2003

Date: Tue, 18 Feb 2003 05:25:29 GMT
From: Anton van Straaten <anton@appsolutions.com>
Subject: Re: Usage of Shivers' SRE regular expression notation

zhaoway wrote:
> Could you please provide some usage examples for SRE? I want to
> understand what the advantage of such a notation is compared with
> plain string notation. Olin Shivers has given explanations. Here I want
> to know of some example projects using SRE.

A simple example I used recently was to extract the URIs or filenames
referenced by HREF="..." and FILE="..." statements in server-side HTML
templates (which couldn't be successfully parsed by a pure HTML parser).  I
used the following SRE expression, in SCSH:

(rx (w/nocase (: (or "href" "file") (* whitespace) "=" (* whitespace) "\""
(submatch (* (~ ("?\"")))) "\"")))

That's the concise version (note the final clauses are a bit suspect, but it
worked for my purposes so I left it alone).  An advantage of SREs is that
they're structured, and can be commented, split over multiple lines,
abstracted, and composed.  To demonstrate this, the above can also be
written something like this (untested):

(define QUOTE "\"")                         ; match a double quote
(define OPTWHITE (rx (* whitespace)))       ; zero or more whitespace chars
(define KEYWORDS (rx (or "href" "file")))
(define (wrap str pat) (rx ,pat ,str ,pat)) ; surround str with specified pattern
(rx
  (w/nocase
    (:
      ,KEYWORDS
      ,(wrap "=" OPTWHITE)
      ,(wrap (rx (submatch (* (~ ("?\""))))) QUOTE))))

The structure of SREs is clearer and less ambiguous.  It's analogous to the
difference between the Scheme expression "a+b", i.e. a string, and (+ a b),
i.e. a list containing operator and operands.  SRE can be manipulated and
composed by a program more easily than a string can be, because strings have
no implicit structure.

If you're writing large or complex regexps, the ability to structure them as
above is valuable.  I understand Perl 6 is attempting to achieve a similar
goal.  One of these days, Perl might be almost as powerful as Scheme for
regexp processing...  ;)

Anton


Date: Tue, 18 Feb 2003 06:13:38 GMT
From: Anton van Straaten <anton@appsolutions.com>
Subject: Re: Usage of Shivers' SRE regular expression notation

I wrote:
> (rx (w/nocase (: (or "href" "file") (* whitespace) "=" (* whitespace)
> "\"" (submatch (* (~ ("?\"")))) "\"")))

In case it isn't clear, this simply defines an SRE, it doesn't actually do
anything with it.  Same for the longer version I gave.

The code that did something with the above expression looked more like this:

(define pattern (rx (w/nocase (: (or "href" "file")
  (* whitespace) "=" (* whitespace)
  "\"" (submatch (* (~ ("?\"")))) "\""))))

(regexp-for-each
  pattern
  (lambda (m)
    (let-match m (m s url)
      ; do stuff with matched url
    ))
  str)

Date: Tue, 18 Feb 2003 16:21:25 +0900
From: Alex Shinn <foof@synthcode.com>
Subject: Re: Usage of Shivers' SRE regular expression notation

>>>>> "Anton" == Anton van Straaten <anton@appsolutions.com> writes:

    Anton> A simple example I used recently was to extract the URIs or
    Anton> filenames referenced by HREF="..." and FILE="..." statements
    Anton> in server-side HTML templates (which couldn't be successfully
    Anton> parsed by a pure HTML parser).  I used the following SRE
    Anton> expression, in SCSH:

    Anton> (rx (w/nocase (: (or "href" "file") (* whitespace) "=" (*
    Anton> whitespace) "\"" (submatch (* (~ ("?\"")))) "\"")))

For comparison, the Perl 5 equivalent is

  /(href|file)\s*=\s*"([^"]*)"/i

the equivalent (overly-) commented, structured version is

  my $KEYWORDS = "( href | file )";
  my $OPTWHITE = '\s*';             # zero or more whitespace chars

  /
    $KEYWORDS   # match either an href or file
    $OPTWHITE   # optional whitespace
    =
    $OPTWHITE
    "
    (           # start group
      [^"]      # a non-quote character
      *         # ... repeated zero or more times
    )           # end group
    "
  /xi

Using variable interpolation in Perl regular expressions is quite
common.  In fact, many of the most common recurring patterns in regular
expressions are collected in the Regexp::Common module.  The above could
be written:

  use Regexp::Common;

  / (href|file) \s* = \s* $RE{quoted} /xi

Most regular expression operations I find are concatenations of other
regular expressions, so this works out well.  You can also define higher
order regular expression functions, with only slightly less convenience
than in an sexp syntax, but that's of questionable use to begin with.
If you need that much structure, regular expressions are probably the
wrong tool.  Something that bridges the gap between regular expressions
and full grammars (i.e. what Perl 6 plans) would be great in Scheme, but
that doesn't seem to be what SREs are for.
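
As a rough illustration of the "higher order regular expression functions" mentioned above, here is a minimal Perl sketch. It uses compiled qr// objects rather than plain strings, which is a substitution and not what the post itself shows. The helper name wrap, the variable $DQUOTE, and the sample HTML are hypothetical, chosen only to mirror the SRE version earlier in the thread. Note that each qr// object carries its own flags, so the /i has to sit on the keyword piece itself.

```perl
use strict;
use warnings;

# Hypothetical building blocks, mirroring the names used in this thread.
my $KEYWORDS = qr/(?:href|file)/i;   # /i here: flags do not leak into qr// objects
my $OPTWHITE = qr/\s*/;              # zero or more whitespace chars
my $DQUOTE   = qr/"/;                # a double quote

# Hypothetical helper: surround $inner with $outer on both sides.
sub wrap {
    my ($inner, $outer) = @_;
    return qr/$outer$inner$outer/;
}

my $eq      = wrap(qr/=/, $OPTWHITE);         # "=" with optional whitespace
my $value   = wrap(qr/([^"]*)/, $DQUOTE);     # quoted value, captured
my $pattern = qr/$KEYWORDS$eq$value/;

my $html = '<a HREF = "index.html">home</a> <x file="a.inc">';
my @urls = $html =~ /$pattern/g;              # list of captured values
print join(",", @urls), "\n";                 # prints index.html,a.inc
```

The composition works, but the pieces are still opaque compiled patterns; nothing here can inspect or transform them the way a program can walk an s-expression.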

Generally, I'm a huge advocate of sexp's over other syntax in almost
every case.  But the above example looks about as simple as you can hope
for in Perl.  The SRE version has about the same number of conceptual
elements, but is much more verbose and "looks" nothing like what you're
trying to match.  I'd like to see an example where SRE really shines
over conventional regexp syntax.

-- 
Alex

Date: Tue, 18 Feb 2003 09:49:26 GMT
From: Anton van Straaten <anton@appsolutions.com>
Subject: Re: Usage of Shivers' SRE regular expression notation

Alex Shinn wrote:
> Generally, I'm a huge advocate of sexp's over other syntax in almost
> every case.  But the above example looks about as simple as you can hope
> for in Perl.

I deliberately chose a simple example.  I still think SREs have advantages,
but either way, I'm happy to just avoid Perl, which is what I used to use
for this kind of task, when nothing other than perhaps awk would easily do
the trick.  It's not so much that Perl regexps are bad - it's the rest of
the language...

An exercise which might be interesting would be to work through Larry Wall's
Apocalypse 5 (http://www.perl.com/pub/a/2002/06/04/apo5.html) and see how
many of the regexp-specific problems it mentions are addressed, or at least
are addressable, by SREs.  I get the sense that Perl is rushing headlong in
the "piling feature on top of feature" direction which Scheme has
(theoretically) disavowed.

> The SRE version has about the same number of conceptual
> elements, but is much more verbose and "looks" nothing like what you're
> trying to match.

I'm all in favor of the template argument, that the shape of the program
should match the shape of the data, but I don't see traditional string
regexps as meeting this criterion, most of the time.

I like structured regexps.  Making interpolated strings *look* structured
isn't quite the same thing - for example, the way you lay out the text isn't
necessarily the way the parser will understand it.
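
One way to see the gap between layout and parse is an interpolated alternation without grouping. The following is a small Perl sketch under assumed names ($bare, $grouped, and the sample text are hypothetical): the first pattern reads like "keyword, then =", but the parser sees two alternatives, href on its own or file followed by "=".

```perl
use strict;
use warnings;

my $bare    = 'href|file';       # no grouping: the alternation leaks out
my $grouped = '(?:href|file)';   # grouping keeps the alternation contained

my $text = 'href';               # keyword only, no "=" follows

# Looks like "keyword, then =", but parses as:  href | file\s*=
my $bare_hit    = ($text =~ /$bare\s*=/)    ? 1 : 0;
my $grouped_hit = ($text =~ /$grouped\s*=/) ? 1 : 0;

print "bare: $bare_hit, grouped: $grouped_hit\n";   # prints bare: 1, grouped: 0
```

With an SRE, the (or ...) form is a single subexpression by construction, so this particular mistake cannot be expressed.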

To some extent, it may be a taste thing, but that's only because of the
issue you mention, which is that regexps in general can only take you so
far.  So SREs may not be really necessary, given the relative simplicity of
the application domain, but for me, they're nice to have.

> I'd like to see an example where SRE really shines
> over conventional regexp syntax.

Your criteria are different than mine.  I think SREs do shine, for many of
the same reasons that I like sexp syntax in other areas.

Anton

Date: Wed, 19 Feb 2003 11:11:40 +0100
From: Andreas Bernauer <andreas.bernauer@gmx.de>
Subject: Re: Usage of Shivers' SRE regular expression notation

On Tue, Feb 18, 2003 at 04:21:25PM +0900, Alex Shinn wrote:
>     Anton> (rx (w/nocase (: (or "href" "file") (* whitespace) "=" (*
>     Anton> whitespace) "\"" (submatch (* (~ ("?\"")))) "\"")))
> 
> For comparison, the Perl 5 equivalent is
> 
>   /(href|file)\s*=\s*"([^"]*)"/i
> 
> the equivalent (overly-) commented, structured version is
> 
>   my $KEYWORDS = "( href | file )";
>   my $OPTWHITE = '\s*';             # zero or more whitespace chars
> 
>   /
>     $KEYWORDS   # match either an href or file
>     $OPTWHITE   # optional whitespace
>     =
>     $OPTWHITE
>     "
>     (           # start group
>       [^"]      # a non-quote character
>       *         # ... repeated zero or more times
>     )           # end group
>     "
>   /xi
> 

What about sub-matches [\(...\)] in the interpolated strings and the
"basic" string? Which one can I access?


-- 
Andreas.
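
On the sub-match question: Perl expands interpolated variables into the pattern text before compiling it, and capture groups are then numbered by their opening parenthesis, left to right, across the whole result. The sketch below reuses the variable names from the quoted example; the sample input line and the result variables are hypothetical.

```perl
use strict;
use warnings;

my $KEYWORDS = '( href | file )';   # contributes capture group 1
my $OPTWHITE = '\s*';               # single quotes keep the backslash

my $line = 'FILE = "menu.tpl"';     # hypothetical input

# The literal group in the pattern comes second, so it is group 2.
my ($keyword, $value) =
    $line =~ / $KEYWORDS $OPTWHITE = $OPTWHITE " ( [^"]* ) " /xi;

print "keyword=$keyword value=$value\n";   # prints keyword=FILE value=menu.tpl
```

So both sub-matches are accessible, but their numbers depend on how many groups each interpolated piece happens to contain. Writing (?: ... ) in the interpolated string keeps it from shifting the numbering.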