Pattern-matching strings with regular expressions
Scsh provides a rich facility for matching regular-expression patterns in strings. The system is composed of several pieces:
An s-expression notation for writing down general regular expressions. In most systems, regexp patterns are encoded as string literals, such as
"g(oo|ee)se"
. In scsh, they are written using s-expressions, such as(: "g" (| "oo" "ee") "se")
, and are called sre's. The sre notation has several advantages over the traditional string-based notation. It's more expressive, can be commented, and can be indented to expose the structure of the form.An abstract data type (ADT) representation for regexp values. Traditional regular-expression systems compute regular expressions from run-time values using strings. This can be awkward. Scsh, instead, provides a separate data type for regexps, with a set of basic constructor and accessor functions; regular expressions can be dynamically computed and manipulated using these functions.
Some tools that work on the regexp ADT: case-sensitve to case-insensitive regexp transform, a regexp simplifier, and so forth.
Parsers and unparsers that can convert between external representations and the regexp ADT. The supported external representations are
Posix strings
S-expression notation (that is, sre's)
Being able to convert regexps to Posix strings allows implementations to implement regexp matching using standard Posix C-based engines.
Macro support for the s-expression notation. The rx macro provides a new special form that allows you to embed regexps in the s-expression notation within a Scheme program. Evaluating the macro form produces a regexp ADT value which can be used by Scheme pattern-matching procedures and other regexp consumers.
Pattern-matching and searching procedures. Spencer's Posix regexp engine is linked in to the runtime; the regexp code uses this engine to provide text matching.
The regexp language supported is a complete superset of Posix functionality, providing:
sequencing and choice (|)
repetition (*, +, ?, {m,n})
character classes (e.g., [aeiou]) and wildcard (.)
beginning/end of string anchors (
^
,$
)case-sensitivity control
submatch-marking
6.1 Summary SRE syntax
The following figures give a summary of the SRE syntax; the next section is a friendlier tutorial introduction.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Figure 2: SRE syntax summary (part 1) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||
Figure 3: SRE syntax summary (part 2) | |||||||||||||||||||||
range-spec ::= string | char The chars are taken in pairs to form inclusive ranges.
| |||||||||||||||||||||||||||||||||||||||
Figure 4: SRE character-class names and range specs. | |||||||||||||||||||||||||||||||||||||||
<cset-sre> ::= (~ <cset-sre> ...) Set complement-of-union | (- <cset-sre> ...) Set difference | (& <cset-sre> ...) Intersection | (| <cset-sre> ...) Set union | (/ <range-spec> ...) Range | (<string>) Constant set | <char> Singleton constant set | <string> For 1-char string "c" | <class-name> Constant set | ,<exp> <exp> evals to a char-set, | ,@<exp> char, single-char string, or re-char-set regexp. | (uncase <cset-sre>) Case-folding | (w/case <cset-sre>) | (w/nocase <cset-sre>)
|
Figure 5: applied to SRE's that specify character sets. These are the ``type-checking'' rules for character-set SRE's. |
6.2 Examples
(- alpha ("aeiouAEIOU")) ; Various forms of
(- alpha ("aeiou") ("AEIOU")) ; non-vowel letter
(w/nocase (- alpha ("aeiou")))
(- (/"azAZ") ("aeiouAEIOU"))
(w/nocase (- (/"az") ("aeiou")))
;;; Upper-case letter, lower-case vowel, or digit
(| upper ("aeiou") digit)
(| (/"AZ09") ("aeiou"))
;;; Not an SRE, but Scheme code containing some embedded SREs.
(let* ((ws (rx (+ whitespace))) ; Seq of whitespace
(date (rx (: (| "Jan" "Feb" "Mar" ...) ; A month/day date.
,ws
(| ("123456789") ; 1-9
(: ("12") digit) ; 10-29
"30" "31"))))) ; 30-31
;; Now we can use DATE several times:
(rx ... ,date ... (* ... ,date ...)
... .... ,date))
;;; More Scheme code
(define (csl re) ; A comma-separated list of RE's is
(rx (| "" ; either zero of them (empty string), or
(: ,re ; one RE, followed by
(* ", " ,re))))) ; Zero or more comma-space-RE matches.
(csl (rx (| "John" "Paul" "George" "Ringo")))
6.3 A short tutorial
S-expression regexps are called "SRE"s. Keep in mind that they are not Scheme expressions; they are another, separate notation that is expressed using the underlying framework of s-expression list structure: lists, symbols, etc. SRE's can be embedded inside of Scheme expressions using special forms that extend Scheme's syntax (such as the rx macro); there are places in the SRE grammar where one may place a Scheme expression. In these ways, SRE's and Scheme expressions can be intertwined. But this isn't fundamental; SRE's may be used in a completely Scheme-independent context. By simply restricting the notation to eliminate two special Scheme-embedding forms, they can be a completely independent notation.
Constant strings
The simplest SRE is a string, denoting a constant regexp. For example, the SRE
matches only the string <<capital-S, little-p, little-o, little-t>>. There is no interpretation of the characters in the string at all -- the SRE
"Spot"
matches the string <<period, asterisk, open-bracket>>.
".*["
Simple character sets
To specify a set of characters, write a list whose single element is a string containing the set's elements. So the SRE
only matches a vowel. One way to think of this, notationally, is that the set brackets are (" and ").
("aeiou")
Wild card
Another simple SRE is the symbol any, which matches any single character -- including newline, but excluding ASCII NUL.
Sequences
We can form sequences of SRE's with the SRE (: sre ...). So the SRE
matches any three-character string starting with ``x'' and ending with ``z''. As we'll see shortly, many SRE forms have bodies that are implicit sequences of other SRE's, analogous to the manner in which the body of a Scheme lambda or let expression is an implicit begin sequence. The regexp (seq sre ...) is completely equivalent to (: sre ...); it's included in order to have a syntax that doesn't require : to be a legal symbol 13
(: "x" any "z")
Choices
The SRE (| sre ...) is a regexp that matches anything any of the sre regexps match. So the regular expression
matches either the string ``sasha'' or the string ``Pete''. The regexp
(| "sasha" "Pete")
is the same as
(| ("aeiou") ("0123456789"))
The regexp (or sre ...) is completely equivalent to (| sre ...); it's included in order to have a syntax that doesn't require | to be a legal symbol.
("aeiou0123456789")
Repetition
There are several SRE forms that match multiple occurences of a regular expression. For example, the SRE (* sre ...) matches zero or more occurences of the sequence (: sre ...). Here is the complete list of SRE repetition forms:
SRE means at least no more than (* sre ...) zero-or-more 0 infinity (+ sre ...) one-or-more 1 infinity (? sre ...) zero-or-one 0 1 (= from sre ...) exactly-n from from (>= from sre ...) n-or-more from infinity (** from to sre ...) n-to-m from to
A from field is a Scheme expression that produces an integer. A to field is a Scheme expression that produces either an integer, or false, meaning infinity.
While it is illegal for the from or to fields to be negative, it is allowed for from to be greater than to in a ** form -- this simply produces a regexp that will never match anything.
As an example, we can describe the names of car/cdr access functions ("car", "cdr", "cadr", "cdar", "caar" , "cddr", "caaadr", etc.) with either of the SREs
We can limit the a/d chains to 4 characters or less with the SRE
(: "c" (+ (| "a" "d")) "r")
(: "c" (+ ("ad")) "r")
(: "c" (** 1 4 ("ad")) "r")
Some boundary cases:
(** 5 2 "foo") ; Will never match
(** 0 0 "foo") ; Matches the empty string
Character classes
There is a special set of SRE's that form ``character classes'' -- basically, a regexp that matches one character from some specified set of characters. There are operators to take the intersection, union, complement, and difference of character classes to produce a new character class. (Except for union, these capabilities are not provided for general regexps as they are computationally intractable in the general case.)
A single character is the simplest character class: #\x
is a character
class that matches only the character ``x''. A string that has only one
letter is also a character class: "x" is the same SRE as #\x
.
The character-set notation (string) we've seen is a primitive character class, as is the wildcard any. When arguments to the choice operator, |, are all character classes, then the choice form is itself a character-class. So these SREs are all character-classes:
However, these SRE's are not character-classes:
("aeiou")
(| #\a #\e #\i #\o #\u)
(| ("aeiou") ("1234567890"))
"aeiou"
(| "foo" #\x)
The (~ cset-sre ...) char class matches one character not in the specified classes:
matches any character that is not a digit.
( ("0248") ("1359"))
More compactly, we can use the / operator to specify character sets by giving the endpoints of contiguous ranges, where the endpoints are specified by a sequence of strings and characters. For example, any of these char classes
matches a letter or a digit. The range endpoints are taken in pairs to form inclusive ranges of characters. Note that the exact set of characters included in a range is dependent on the underlying implementation's character type, so ranges may not be portable across different implementations.(/ #\A#\Z#\a#\z#\0 #\9) (/ "AZ" #\a#\z"09") (/ "AZ" #\a"z09") (/"AZaz09")
There is a wide selection of predefined, named character classes that may be used. One such SRE is the wildcard any. nonl is a character class matching anything but newline; it is equivalent to
and is useful as a wildcard in line-oriented matching.(~ #\newline)
There are also predefined named char classes for the standard Posix and Gnu character classes:
See the scsh character-set documentation or the Posix isalpha(3) man page for the exact definitions of these sets.
scsh name Posix/ctype Alternate name Comment lower-case lower upper-case upper alphabetic alpha numeric digit num alphanumeric alnum alphanum punctuation punct graphic graph blank (Gnu extension) whitespace space white ``space'' is deprecated. printing control cntrl hex-digit xdigit hex ascii (Gnu extension)
You can use either the long scsh name or the shorter Posix and alternate names
to refer to these char classes.
The standard Posix name ``space'' is provided,
but deprecated, since it is ambiguous. It means ``whitespace,'' the set of
whitespace characters, not the singleton set of the #\space
character.
If you want a short name for the set of whitespace characters, use the
char-class name ``white'' instead.
Char classes may be intersected with the operator (& cset-sre ...), and set-difference can be performed with (- cset-sre ...). These operators are particularly useful when you want to specify a set by negation with respect to a limited universe. For example, the set of all non-vowel letters is
whereas writing a simple complement
(- alpha ("aeiou") ("AEIOU"))
gives a char class that will match any non-vowel -- including punctuation, digits, white space, control characters, and ASCII nul.
( ("aeiouAEIOU"))
We can compute a char class by writing the SRE
where cset-exp is a Scheme expression producing a value that can be coerced to a character set: a character set, character, one-character string, or char-class regexp value. This regexp matches one character from the set.
,cset-exp
The char-class SRE ,@cset-exp is entirely equivalent to ,cset-exp when cset-exp produces a character set (but see below for the more general non-char-class context, where there is a distinction between ,exp and ,@exp.
As an example of character-class SREs, an SRE that matches a lower-case vowel, upper-case letter, or digit is
or, equivalently
(| ("aeiou") (/"AZ09"))
Boundary cases: the empty-complement char class
(| ("aeiou") upper-case numeric)
matches any character; it is equivalent to any. The empty-union char class
( )
never matches at all. This is rarely useful for human-written regexps, but may be of occasional utility in machine-generated regexps, perhaps produced by macros.
(|)
The rules for determining if an SRE is a simple, char-class SRE or a more complex SRE form a little ``type system'' for SRE's. See the summary section preceding this one for a complete listing of these rules.
{Note There is no way to include the ASCII NUL character in a character set or search for it in any other way using regular expression. This is because the POSIX regexp facility is based on the C language which uses ASCII NUL to terminate strings.}
Case sensitivity
There are three forms that control case sensitivity:
(uncase sre ...)
(w/case sre ...)
(w/nocase sre ...)
uncase is a regexp operator producing a regexp that matches any case permutation of any string that matches (: sre ...). For example, the regexp
matches the strings ``foo'', ``foO'', ``fOo'', ``fOO'', ``Foo'', ...
(uncase "foo")
Expressions in SRE notation are interpreted in a lexical case-sensitivy context. The forms w/case and w/nocase are the scoping operators for this context, which controls how constant strings and char-class forms are interpreted in their bodies. So, for example, the regexp
defines a case-insensitive match for all of its elements except for the sub-element "Bar", which must match exactly capital-B, little-a, little-r. The default, the outermost, top-level context is case sensitive.
(w/nocase "abc"
(* "FOO" (w/case "Bar"))
("aeiou"))
The lexical case-sensitivity context affects the interpretation of
constant strings, such as "foo",
chars, such as
#\x
,char sets, such as ("abc"), and
ranges, such as (/"az") that appear within that context. It does not affect dynamically computed regexps -- ones that are introduced by ,exp and ,@exp forms. It does not affect named char-classes -- presumably, if you wrote lower, you didn't mean alpha.
uncase is not the same as w/nocase. To point up one distinction, consider the two regexps
(uncase ( "a"))
(w/nocase ( "a"))
The regexp (~ "a") matches any character except ``a,'' which means it does match ``A.'' Now, (uncase re) matches any case-permutation of a string that re matches. (~ "a") matches ``A,'' so (uncase (~ "a")) matches ``A'' and ``a'' -- and, for that matter, every other character. So (uncase (~ "a")) is equivalent to any.
In contrast, (w/nocase (~ "a")) establishes a case-insensitive lexical context in which the "a" is interpreted, making the SRE equivalent to (~ ("aA")).
Dynamic regexps
SRE notation allows you to compute parts of a regular expressions at run time. The SRE
is a regexp whose body exp is a Scheme expression producing a string, character, char-set, or regexp as its value. Strings and characters are converted into constant regexps; char-sets are converted into char-class regexps; and regexp values are substituted in place. So we can write regexps like this
,exp
This is how you can drop computed strings, such as someone's name, or the decimal numeral for a computed number, into a complex regexp.
(: "feeding the "
,(if (> n 1) "geese" "goose"))
If we have a large, complex regular expression that is used multiple times in some other, containing regular expression, we can name it, using the binding forms of the embedding language (e.g., Scheme), and refer to it by name in the containing expression. For example, consider the Scheme expression
where the (rx sre ...) macro is the Scheme special form that produces a Scheme regexp value given a body in SRE notation.
(let* ((ws (rx (+ whitespace))) ; Seq of whitespace
;; Something like "Mar 14"
(date (rx (: (| "Jan" "Feb" "Mar" ...)
,ws
(| ("123456789") ; 1-9
(: ("12") digit) ; 10-29
"30" ; 30
"31"))))) ; 31
;; Now we can use DATE several times:
(rx ... ,date ... (* ... ,date ...)
... ,date ...))
As we saw in the char-class section, if a dynamic regexp is used
in a char-class context (e.g., as an argument to a ~
operation),
the expression must be coercable not merely to a general regexp,
but to a character sre -- so it must be either a singleton string,
a character, a scsh char set, or a char-class regexp.
We can also define and use functions on regexps in the host language. For example, consider the following Scheme expressions, containing embedded SRE's (inside the rx macro expressions) which in term contain embedded Scheme expressions computing dynamic regexps:
We leave the extension of csl to allow for an optional ``and'' between the last two matches as an exercise for the interested reader (e.g., to match ``John, Paul, George and Ringo'').
(define (csl re)
;; A comma-separated list of RE's is either
(rx (| "" ; zero of them (empty string),
(: ,re ; or RE followed by
(* ", " ,re))))); zero or more comma-space-RE matches.
(rx ... ,date ...
,(csl (rx (| "John" "Paul" "George" "Ringo")))
...
,(csl date)
...)
Note, in passing, one of the nice features of SRE notation: they can be commented, and indented in a fashion to show the lexical extent of the subexpressions.
When we embed a computed regexp inside another regular expression with the ,exp form, we must specify how to account for the submatches that may be in the computed part. For example, suppose we have the regexp
It's clear that the submatch for the (* "foo") part of the regexp is submatch #1, and the (? "bar") part is submatch #2. But what number submatch is the "baz" submatch? It's not clear. Suppose the Scheme expression (f x) produces a regular expression that itself has 3 subforms. Are these counted (making the "baz" submatch #6), or not counted (making the "bar" submatch #3)?
(rx (submatch (* "foo"))
(submatch (? "bar"))
,(f x)
(submatch "baz"))
SRE notation provides for both possibilities. The SRE
does not contribute its submatches to its containing regexp; it has zero submatches. So one can reliably assign submatch indices to forms appearing after a ,exp form in a regexp.
,exp
On the other hand, the SRE
``splices'' its resulting regexp into place, exposing its submatches to the containing regexp. This is useful if the computed regexp is defined to produce a certain number of submatches -- if that is part of exp's ``contract.''
,@exp
String and line units
The regexps bos and eos match the empty string at the beginning and end of the string, respectively.
The regexps bol and eol match the empty string at the beginning and end of a line, respectively. A line begins at the beginning of the string, and just after every newline character. A line ends at the end of the string, and just before every newline character. The char class nonl matches any character except newline, and is useful in conjunction with line-based pattern matching.
{Note bol and eol are not supported by scsh's current regexp search engine, which is Spencer's Posix matcher. This is the only element of the notation that is not supported by the current scsh reference implementation.}
Posix string notation
The SRE (posix-string string),
where string is a string literal
(not a general Scheme expression), allows one to use Posix string
notation for a regexp. It's intended as backwards compatibility and
is deprecated.
For example, (posix-string "[aeiou]+|x*|y{3,5}")
matches
a string of vowels, a possibly empty string of x's, or three to five
y's.
Note that parentheses are used ambiguously in Posix notation -- both for grouping and submatch marking. The (posix-string string) form makes the conservative assumption: all parentheses introduce submatches.
Deleted submatches
Deleted submatches, or ``DSM's,'' are a subtle feature that are never required in expressions written by humans. They can be introduced by the simplifier when reducing regular expressions to simpler equivalents, and are included in the syntax to give it expressibility spanning the full regexp ADT. They may appear when unparsing simplified regular expressions that have been run through the simplifier; otherwise you are not likely to see them. Feel free to skip this section.
The regexp simplifier can sometimes eliminate entire sub-expressions from a regexp. For example, the regexp
can be simplified to
(: "foo" (** 0 0 "apple") "bar")
since (** 0 0 "apple") will always match the empty string. The regexp
"foobar"
can be simplified to
(| "foo"
(: "Richard" (|) "Nixon")
"bar")
The empty choice (|) can't match anything, so the whole
(| "foo" "bar")
sequence can't match, and we can remove it from the choice.
(: "Richard" (|) "Nixon")
However, if deleting part of a regular expression removes a submatch form, any following submatch forms will have their numbering changed, which would be an error. For example, if we simplify
to
(: (** 0 0 (submatch "apple"))
(submatch "bar"))
then the "bar" submatch changes from submatch #2 to submatch #1 -- so this is not a legal simplification.
(submatch "bar")
When the simplifier deletes a sub-regexp that contains submatches, it introduces a special regexp form to account for the missing, deleted submatches, thus keeping the submatch accounting correct.
is a regexp that matches the sequence (: sre ...). pre and post are integer constants. The DSM form introduces pre deleted submatches before the body, and post deleted submatches after the body. If the body (: sre ...) itself has body-sm submatches, then the total number of submatches for the DSM form is
(dsm pre post sre ...)
pre + body-sm + post. |
These extra, deleted submatches are never assigned string indices in any match values produced when matching the regexp against a string.
As examples,
can be simplified to
(| (: (submatch "Richard") (|) "Nixon")
(submatch "bar"))
The regexp
(dsm 1 0 (submatch "bar"))
can be simplified to
(: (** 0 0 (submatch "apple"))
(submatch "bar"))
(dsm 1 0 (submatch "bar"))
6.3.1 Embedding regexps within Scheme programs
SRE's can be placed in a Scheme program using the (rx sre ...) Scheme form, which evaluates to a Scheme regexp value.
6.3.1.1 Static and dynamic regexps
We separate SRE expressions into two classes: static and dynamic expressions. A static expression is one that has no run-time dependencies; it is a complete, self-contained description of a regular set. A dynamic expression is one that requires run-time computation to determine the particular regular set being described. There are two places where one can embed run-time computations in an SRE:
The from or to repetition counts of **, =, and >= forms;
,exp and ,@exp forms.
A static SRE is one that does not contain any ,exp or ,@exp forms, and whose **, =, and >= forms all contain constant repetition counts.
Scsh's rx macro is able, at macro-expansion time, to completely parse, simplify and translate any static SRE into the equivalent Posix string which is used to drive the underlying C-based matching engine; there is no run-time overhead. Dynamic SRE's are partially simplified and then expanded into Scheme code that constructs the regexp at run-time.
6.4 Regexp functions
6.4.1 Obsolete, deprecated procedures
These two procedures are survivors from the previous, now-obsolete scsh regexp interface. Old code must open the re-old-funs package to access them. They should not be used in new code.
These are old functions included for backwards compatibility with previous releases. They are deprecated and will go away at some point in the future.Note that the new release has no ``regexp compiling'' procedure at all -- regexp values are compiled for the matching engine on-demand, and the necessary data structures are cached inside the ADT values.
6.4.2 Standard procedures and syntax
This allows you to describe a regexp value with SRE notation.
Returns true if the value is a regular expression.
Search string starting at position start, looking for a match for regexp re. If a match is found, return a match structure describing the match, otherwise #f. Start defaults to 0.Flags is the bitwise-or of regexp/bos-not-bol and regexp/eos-not-eol. regexp/bos-not-bol means the beginning of the string isn't a line-begin. regexp/eos-not-eol is analogous. {Note They're currently ignored because begining/end-of-line anchors aren't supported by the current implementation.}
Use regexp-search? when you don't need submatch information, as it has the potential to be significantly faster on submatch-containing regexps.
There is no longer a separate regexp ``compilation'' function; regexp values are compiled for the C engine on demand, and the resulting C structures are cached in the regexp structure after the first use.
match:start returns the start position of the submatch denoted by match-number. The whole regexp is 0; positive integers index submatches in the regexp, counting left-to-right. Match-number defaults to 0.If the regular expression matches as a whole, but a particular sub-expression does not match, then match:start returns #f.
match:end is analogous to match:start, returning the end position of the indexed submatch.
match:substring returns the substring matched regexp's submatch. If there was no match for the indexed submatch, it returns false.
This procedure can be used to perform string substitutions based on regular-expression matches. The results of the substitution can be either output to a port or returned as a string.The match argument is a regular-expression match structure that controls the substitution. If port is an output port, the items are written out to the port:
If an item is a string, it is copied directly to the port.
If an item is an integer, the corresponding submatch from match is written to the port.
If an item is 'pre, the prefix of the matched string (the text preceding the match) is written to the port.
If an item is 'post, the suffix of the matched string is written.
If port is #f, nothing is written, and a string is constructed and returned instead.
This procedure is similar to regexp-substitute, but can be used to perform repeated match/substitute operations over a string. It has the following differences with regexp-substitute:
It takes a regular expression and string to be matched as parameters, instead of a completed match structure.
If the regular expression doesn't match the string, this procedure is the identity transform -- it returns or outputs the string.
If an item is 'post, the procedure recurses on the suffix string (the text from string following the match). Including a 'post in the list of items is how one gets multiple match/substitution operations.
If an item is a procedure, it is applied to the match structure for a given match. The procedure returns a string to be used in the result.
The regexp parameter can be either a compiled regular expression or a string specifying a regular expression.
Some examples:
;;; Replace occurrences of "Cotton" with "Jin".
(regexp-substitute/global #f (rx "Cotton") s
'pre "Jin" 'post)
;;; mm/dd/yy -> dd/mm/yy date conversion.
(regexp-substitute/global #f (rx (submatch (+ digit)) "/" ; 1 = M
(submatch (+ digit)) "/" ; 2 = D
(submatch (+ digit))) ; 3 = Y
s ; Source string
'pre 2 "/" 1 "/" 3 'post)
;;; "9/29/61" -> "Sep 29, 1961" date conversion.
(regexp-substitute/global #f (rx (submatch (+ digit)) "/" ; 1 = M
(submatch (+ digit)) "/" ; 2 = D
(submatch (+ digit))) ; 3 = Y
s ; Source string
'pre
;; Sleazy converter -- ignores "year 2000" issue,
;; and blows up if month is out of range.
(lambda (m)
(let ((mon (vector-ref '#("Jan" "Feb" "Mar" "Apr" "May" "Jun"
"Jul" "Aug" "Sep" "Oct" "Nov" "Dec")
(- (string->number (match:substring m 1)) 1)))
(day (match:substring m 2))
(year (match:substring m 3)))
(string-append mon " " day ", 19" year)))
'post)
;;; Remove potentially offensive substrings from string S.
(define (kill-matches re s)
(regexp-substitute/global #f re s 'pre 'post))
(kill-matches (rx (| "Windows" "tcl" "Intel")) s) ; Protect the children.
The following definition is a bit unwieldy, but the intuition is simple: this procedure uses the regexp re to divide up string s into non-matching/matching chunks, and then ``folds'' the procedure kons across this sequence of chunks. It is useful when you wish to operate on a string in sub-units defined by some regular expression, as are the related regexp-fold-right and regexp-for-each procedures.Search from start (defaulting to 0) for a match to re; call this match m. Let i be the index of the end of the match (that is, (match:end m 0)). Loop as follows:
If there is no match, return instead
(regexp-fold re kons (kons start m knil) s finish i)Finish defaults to (lambda (i knil) knil).
(finish start knil)In other words, we divide up s into a sequence of non-matching/matching chunks:
NM1 M1 NM1 M2 ... NMk-1 Mk-1 NMk where NM1 is the initial part of s that isn't matched by the regexp re, M1 is the first match, NM2 is the following part of s that isn't matched, M2 is the second match, and so forth -- NMk is the final non-matching chunk of s. We apply kons from left to right to build up a result, passing it one non-matching/matching chunk each time: on an application (kons i m knil), the non-matching chunk goes from i to (match:begin m 0), and the following matching chunk goes from (match:begin m 0) to (match:end m 0). The last non-matching chunk NMk is processed by k. So the computation we perform is
where Ji is the index of the start of NMi, Mi is a match value describing Mi, and Q is the index of the beginning of NMk.
(final Q (kons jk Mk ... (kons J1 M1 knil) ...))Hint: The let-match macro is frequently useful for operating on the match value M passed to the kons function.
The right-to-left variant of regexp-fold.This procedure repeatedly matches regexp re across string s. This divides s up into a sequence of matching/non-matching chunks:
NM1 M1 NM1 M2 ... NMk-1 Mk-1 NMk where NM1 is the initial part of s that isn't matched by the regexp re, M1 is the first match, NM2 is the following part of s that isn't matched, M2 is the second match, and so forth -- NMk is the final non-matching chunk of s. We apply kons from right to left to build up a result, passing it one non-matching/matching chunk each time:
where MTCHi is a match value describing Mi, Ji is the index of the end of NMi (or, equivalently, the beginning of Mi+1), and Q is the index of the beginning of M1. In other words, KONS is passed a match, an index describing the following non-matching text, and the value produced by folding the following text. The FINAL function "polishes off" the fold operation by handling the initial chunk of non-matching text (NM0, above). FINISH defaults to (lambda (i knil) knil)
(final Q (kons M1 j1 ... (kons Mk Jk knil) ...))Example: To pick out all the matches to re in s, say
Hint: The let-match macro is frequently useful for operating on the match value m passed to the kons function.
(regexp-fold-right re
(lambda (m i lis)
(cons (match:substring m 0) lis))
'() s)
Repeatedly match regexp re against string s. Apply proc to each match that is produced. Matches do not overlap.Hint: The let-match macro is frequently useful for operating on the match value m passed to varproc.
Mvars is a list of vars that is bound to the match and submatches of the string;#F
is allowed as a don't-care element. For example,matches the regexp against string s, then evaluates the body of the let-match in a scope where whole-date is bound to the matched string, and month, day and year are bound to the first, second and third submatches.
(let-match (regexp-search date s) (whole-date month day year)
... body ...)if-match is similar, but if the match expression is false, then the no-match expression is evaluated; this would be an error in let-match.
This macro allows one to conditionally attempt a sequence of pattern matches, interspersed with other, general conditional tests. There are four kinds of match-cond clause, one introducing a pattern match, and the other three simply being regular cond-style clauses, marked by the test and else keywords:
(match-cond (match-exp match-vars body ...) ; As in if-match
(test exp body ...) ; As in cond
(test exp => proc) ; As in cond
(else body ...)) ; As in cond
These functions map regexps and char sets to other regexps. flush-submatches returns a regexp which matches exactly what its argument matches, but contains no submatches.uncase returns a regexp that matches any case-permutation of its argument regexp.
simplify-regexp applies the simplifier to its argument. This is done automatically when compiling regular expressions, so this is only useful for programmers that are directly examining the ADT value with lower-level accessors.
uncase-char-set maps a char set to a regular expression that matches any character from that set, regardless of case. Similarly, uncase-string returns a regexp that matches any case-permutation of the string. For example, (uncase-string "Knight") returns the same value that (rx ("kK") ("nN") ("iI") ("gG") ("hH") ("tT")) or (rx (w/nocase "Knight")).
These are the SRE parser and unparser. That is, sre->regexp maps an SRE to a regexp value, and regexp->sre does the inverse. The latter function can be useful for printing out regexps in a readable format.
(sre->regexp '(: "Olin " (? "G. ") "Shivers")) ==> regexp
(define re (re-seq (re-string "Pete ")
(re-repeat 1 #f (re-string "Sz"))
(re-string "ilagyi")))
(regexp->sre (re-repeat 0 1 re))
==> '(? "Pete" (+ "Sz") "ilagyi")
These two functions are the Posix notation parser and unparser. That is, posix-string->regexp maps a Posix-notation regular expression, such as "g(ee|oo)se", to a regexp value, and regexp->posix-string does the inverse.You can use these tools to map between scsh regexps and Posix regexp strings, which can be useful if you want to do conversion between SRE's and Posix form. For example, you can write a particularly complex regexp in SRE form, or compute it using the ADT constructors, then convert to Posix form, print it out, cut and paste it into a C or emacs lisp program. Or you can import an old regexp from some other program, parse it into an ADT value, render it to an SRE, print it out, then cut and paste it into a scsh program.
Note:
The string parser doesn't handle the exotica of character class names such as
[[:alnum:]]
; the current implementation was written in in three hours.
6.5 The regexp ADT
The following functions may be used to construct and examine scsh's regexp abstract data type. They are in the following Scheme 48 packages: re-adt-lib re-lib scsh
Each basic class of regexp has a predicate, a basic constructor, a ``smart'' consructor that performs limited ``peephole'' optimisation on its arguments, and a set of accessors. The ...:tsm accessor returns the total number of submatches contained in the regular expression.
These variables are bound to the primitive anchor regexps.
These predicates recognise the associated primitive anchor regexp.
The variable re-trivial is bound to a regular expression that matches the empty string (corresponding to the SRE "" or (:)); it is recognised by the associated predicate. Note that the predicate is only guaranteed to recognise this particular trivial regexp; other trivial regexps built using other constructors may or may not produce a true value.
The variable re-empty is bound to a regular expression that never matches (corresponding to the SRE (|)); it is recognised by the associated predicate. Note that the predicate is only guaranteed to recognise this particular empty regexp; other empty regexps built using other constructors may or may not produce a true value.
The variable re-any is bound to a regular expression that matches any character (corresponding to the SRE any); it is recognised by the associated predicate. Note that the predicate is only guaranteed to recognise this particular any-character regexp value; other any-character regexps built using other constructors may or may not produce a true value.
The variable re-nonl is bound to a regular expression
that matches any non-newline character
(corresponding to the SRE (~ #\newline)
).
Is the object a regexp?
Return the total number of submatches contained in the regexp.
The current scsh implementation should call this function periodically to release C-heap storage associated with compiled regexps. Hopefully, this procedure will be removed at a later date.
6.6 Syntax-hacking tools
The Scheme 48 package sre-syntax-tools exports several tools for macro writers that want to use SREs in their macros. In the functions defined below, compare and rename parameters are as passed to Clinger-Rees explicit-renaming low-level macros.
If form is a legal SRE, this is equivalent to the expression conseq-form, otherwise it expands to alt-form.This is useful for high-level macro authors who want to write a macro where one field in the macro can be an SRE or possibly something else. E.g., we might have a conditional form wherein if the test part of one arm is an SRE, it expands to a regexp match on some implied value, otherwise the form is evaluated as a boolean Scheme expression. For example, a conditional macro might expand into code containing the following form, which in turn would have one of two possible expansions:
(if-sre-form test-exp ; If TEST-EXP is SRE,
(regexp-search? (rx test-exp) line) ; match it w/the line,
test-exp) ; otw it's a text exp.
This procedure is for low-level macros doing things equivalent to if-sre-form. It returns true if the form is a legal SRE.Note that neither sre-form nor if-sre-form does a deep recursion over the form in the case where the form is a list. They simply check the car of the form for one of the legal SRE keywords.
Parse sre-form into an ADT. Note that if the SRE is dynamic -- contains ,exp or ,@exp forms, or has repeat operators whose from/to counts are not constants -- then the returned ADT will have Scheme expressions in the corresponding slots of the regexp records instead of the corresponding integer, char-set, or regexp. In other words, we use the ADT as its own AST. It's called a ``hack.''parse-sres parses a list of SRE forms that comprise an implicit sequence.
Returns a Scheme expression that will construct the regexp re using ADT constructors such as make-re-sequence, make-re-repeat, and so forth.If the regexp is static, it will be simplified and pre-translated to a Posix string as well, which will be part of the constructed regexp value.
Is the regexp a static one?
13 That is, for use within s-expression syntax frameworks that, unlike R5RS, don't allow for : as a legal symbol. A Common Lisp embedding of SREs, for example, would need to use seq instead of :.