Awk, record I/O, and field parsing
Unix programs frequently process streams of records,
where each record is delimited by a newline,
and records are broken into fields with other delimiters
(for example, the colon character in /etc/passwd).
Scsh has procedures that allow the programmer to easily
do this kind of processing.
Scsh's field parsers can also be used to parse other kinds
of delimited strings, such as colon-separated $PATH
lists.
These routines can be used with scsh's awk loop construct
to conveniently perform pattern-directed computation over streams
of records.
8.1 Record I/O and field parsing
The procedures in this section are used to read records from I/O streams and parse them into fields. A record is defined as text terminated by some delimiter (usually a newline). A record can be split into fields by using regular expressions in one of several ways: to match fields, to separate fields, or to terminate fields. The field parsers can be applied to arbitrary strings (one common use is splitting environment variables such as $PATH at colons into their component elements).
The general delimited-input procedures described in chapter 7 are also useful for reading simple records, such as single lines, paragraphs of text, or strings terminated by specific characters.
8.1.1 Reading records
(record-reader [delims elide-delims? handle-delim]) ---> reader
Returns a procedure that reads records from a port. The procedure is invoked as follows:
(reader [port]) ---> string or eof
A record is a sequence of characters terminated by one of the characters in delims or eof. If elide-delims? is true, then a contiguous sequence of delimiter chars is taken as a single record delimiter. If elide-delims? is false, then a delimiter char coming immediately after a delimiter char produces an empty-string record. The reader consumes the delimiting char(s) before returning from a read.
The delims set defaults to the set {newline}. It may be a charset, string, character, or character predicate, and is coerced to a charset. The elide-delims? flag defaults to #f.
The handle-delim argument controls what is done with the record's terminating delimiter.
- 'trim: Delimiters are trimmed. (The default)
- 'split: Reader returns the delimiter string as a second value. If the record is terminated by EOF, then the eof object is returned as this second value.
- 'concat: The record and its delimiter are returned as a single string.
The reader procedure returned takes one optional argument, the port from which to read, which defaults to the current input port. It returns a string or eof.
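The options above can be sketched as follows. This is only an illustration: it assumes the constructor is named record-reader with its optional arguments in the order delims, elide-delims?, handle-delim, and it assumes scsh's make-string-input-port is available to build a test port. The results in the comments are what the description above implies, not captured output.

```scheme
;; Sketch only; record-reader's name and argument order are assumptions,
;; as is the use of make-string-input-port to get a port to read from.
(define read-colon-rec   (record-reader ":"))      ; delims given as a string
(define read-colon-run   (record-reader ":" #t))   ; elide runs of delimiters

(define p1 (make-string-input-port "a::b"))
(define r1 (read-colon-rec p1))   ; "a"
(define r2 (read-colon-rec p1))   ; ""  (a delimiter right after a delimiter)
(define r3 (read-colon-rec p1))   ; "b" (terminated by eof)
(define r4 (read-colon-rec p1))   ; the eof object

(define p2 (make-string-input-port "a::b"))
(define e1 (read-colon-run p2))   ; "a"
(define e2 (read-colon-run p2))   ; "b" (the "::" run counts as one delimiter)
```

A reader built with 'split as its third argument returns two values, so it would be called through call-with-values or receive.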
8.1.2 Parsing fields
(field-splitter [field num-fields]) ---> procedure
(infix-splitter [delim num-fields handle-delim]) ---> procedure
(suffix-splitter [delim num-fields handle-delim]) ---> procedure
(sloppy-suffix-splitter [delim num-fields handle-delim]) ---> procedure
These functions return a parser function that can be used as follows:
(parser string [start]) ---> string-list
The returned parsers split strings into fields defined by regular expressions. You can parse by specifying a pattern that separates fields, a pattern that terminates fields, or a pattern that matches fields:
Procedure                 Pattern
field-splitter            matches fields
infix-splitter            separates fields
suffix-splitter           terminates fields
sloppy-suffix-splitter    terminates fields
These parser generators are controlled by a range of options, so that you can precisely specify what kind of parsing you want. However, these options default to reasonable values for general use.
Defaults:
Parameter       Default                    Meaning
delim           (rx (| (+ white) eos))     suffix delimiter: white space or eos
                (rx (+ white))             infix delimiter: white space
field           (rx (+ (~ white)))         non-white-space
num-fields      #f                         as many fields as possible
handle-delim    'trim                      discard delimiter chars
...which means: break the string at white space, discarding the white space, and parse as many fields as possible.
The delim parameter is a regular expression matching the text that occurs between fields. See chapter 6 for information on regular expressions, and the rx form used to specify them. In the separator case, it defaults to a pattern matching white space; in the terminator case, it defaults to white space or end-of-string.
The field parameter is a regular expression used to match fields. It defaults to non-white-space.
The delim patterns may also be given as a string, character, or char-set, which are coerced to regular expressions. So the following expressions are all equivalent, each producing a function that splits strings apart at colons:
(infix-splitter (rx ":"))
(infix-splitter ":")
(infix-splitter #\:)
(infix-splitter (char-set #\:))
The handle-delim argument is a symbol that determines what to do with delimiters.
- 'trim: Delimiters are thrown away after parsing. (default)
- 'concat: Delimiters are appended to the field preceding them.
- 'split: Delimiters are returned as separate elements in the field list.
The num-fields argument used to create the parser specifies how many fields to parse. If #f (the default), the procedure parses them all. If a positive integer n, exactly that many fields are parsed; it is an error if there are more or fewer than n fields in the record. If num-fields is a negative integer or zero, then |n| fields are parsed, and the remainder of the string is returned in the last element of the field list; it is an error if fewer than |n| fields can be parsed.
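A short sketch of how these options combine, assuming the optional arguments of infix-splitter come in the order delim, num-fields, handle-delim (consistent with calls such as (infix-splitter ":" 7) elsewhere in this chapter). The results in the comments follow from the descriptions above.

```scheme
;; Sketch only; the argument order delim, num-fields, handle-delim is assumed.
(define split-colons  (infix-splitter ":"))             ; all fields, 'trim
(define keep-delims   (infix-splitter ":" #f 'concat))  ; delimiter stays on field
(define delims-apart  (infix-splitter ":" #f 'split))   ; delimiter as own element
(define two-then-rest (infix-splitter ":" -2))          ; 2 fields, then remainder

(split-colons "a:b:c")      ; ("a" "b" "c")
(keep-delims  "a:b:c")      ; ("a:" "b:" "c")
(delims-apart "a:b:c")      ; ("a" ":" "b" ":" "c")
(two-then-rest "a:b:c:d")   ; two parsed fields, with the unparsed rest of the
                            ; string as the final element of the list
```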
The field parser produced is a procedure that can be employed as follows:
(parse string [start]) ===> string-list
The optional start argument (default 0) specifies where in the string to begin the parse. It is an error if start > (string-length string).
The parsers returned by the four parser generators implement different kinds of field parsing:
- field-splitter
- The regular expression specifies the actual field.
- suffix-splitter
- Delimiters are interpreted as element terminators. If vertical-bar is the delimiter, then the string "" is the empty record (), "foo|" produces a one-field record ("foo"), and "foo" is an error.
The syntax of suffix-delimited records is:
<.record.> ::= ""                                (Empty record)
             | <.element.> <.delim.> <.record.>
It is an error if a non-empty record does not end with a delimiter. To make the last delimiter optional, make sure the delimiter regexp matches the end-of-string (sre eos).
- infix-splitter
- Delimiters are interpreted as element separators. If comma is the delimiter, then the string "foo," produces a two-field record ("foo" "").
The syntax of infix-delimited records is:
<.record.> ::= ""                        (Forced to be empty record)
             | <.real-infix-record.>
<.real-infix-record.> ::= <.element.> <.delim.> <.real-infix-record.>
                        | <.element.>
Note that separator semantics doesn't really allow for empty records -- the straightforward grammar (i.e., <.real-infix-record.>) parses an empty string as a singleton list whose one field is the empty string, (""), not as the empty record (). This is unfortunate, since it means that infix string parsing doesn't make string-append and append isomorphic. For example,
((infix-splitter ":") (string-append x ":" y))
doesn't always equal
(append ((infix-splitter ":") x)
        ((infix-splitter ":") y))
It fails when x or y are the empty string. Terminator semantics does preserve a similar isomorphism.
However, separator semantics is frequently what other Unix software uses, so to parse their strings, we need to use it. For example, Unix $PATH lists have separator semantics. The path list "/bin:" is broken up into ("/bin" ""), not ("/bin"). Comma-separated lists should also be parsed this way.
- sloppy-suffix-splitter
- The same as the suffix case, except that the parser will skip an initial delimiter string if the string begins with one instead of parsing an initial empty field. This can be used, for example, to field-split a sequence of English text at white-space boundaries, where the string may begin or end with white space, by using regex
(rx (| (+ white) eos))
(But you would be better off using field-splitter in this case.)
Figure 6 shows how the different parser grammars split apart the same strings.
Figure 6: Using different grammars to split records into fields.
Having to choose between the different grammars requires you to decide what you want, but at least you can be precise about what you are parsing. Take fifteen seconds and think it out. Say what you mean; mean what you say.
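As a small illustration of that choice, the sketch below applies the suffix and infix grammars to the boundary cases discussed in the text. The results in the comments are the ones stated in the descriptions above; the variable names are only for illustration.

```scheme
;; Terminator (suffix) semantics, with vertical-bar as the delimiter.
(define bar-terminated (suffix-splitter "|"))
(bar-terminated "")        ; ()       the empty record
(bar-terminated "foo|")    ; ("foo")  one field
;; (bar-terminated "foo")  would be an error: no terminating delimiter

;; Separator (infix) semantics.
(define comma-separated (infix-splitter ","))
(define colon-separated (infix-splitter ":"))
(comma-separated "foo,")   ; ("foo" "")   a trailing separator adds an empty field
(colon-separated "/bin:")  ; ("/bin" "")  $PATH-style list
(colon-separated "")       ; ()           forced to be the empty record
```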
(join-strings string-list [delimiter grammar]) ---> string
This procedure is a simple unparser -- it pastes strings together using the delimiter string.
The grammar argument is one of the symbols infix (the default) or suffix; it determines whether the delimiter string is used as a separator or as a terminator.
The delimiter is the string used to delimit elements; it defaults to a single space " ".
Example:
(join-strings '("foo" "bar" "baz") ":")
==> "foo:bar:baz"
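For comparison, here is a sketch of the suffix grammar and of a round trip through a matching parser. It assumes the optional arguments are ordered delimiter, then grammar, and that the grammar is passed as a quoted symbol.

```scheme
;; Sketch only; the argument order (delimiter, grammar) is an assumption.
(define fields '("foo" "bar" "baz"))

(join-strings fields ":" 'suffix)    ; "foo:bar:baz:"  delimiter terminates each element

;; Unparsing and reparsing with the same grammar gives back this list.
((infix-splitter ":")  (join-strings fields ":"))           ; ("foo" "bar" "baz")
((suffix-splitter ":") (join-strings fields ":" 'suffix))   ; ("foo" "bar" "baz")
```

The round trip is only shown for this particular list; as discussed under infix-splitter, a list containing a single empty string does not survive the infix round trip.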
8.1.3 Field readers
(field-reader [field-parser rec-reader]) ---> procedure
This utility returns a procedure that reads records with field structure from a port. The reader's interface is designed to make it useful in the awk loop macro (section 8.2). The reader is used as follows:
(reader [port]) ===> [raw-record parsed-record] or [eof ()]
When the reader is applied to an input port (default: the current input port), it reads a record using rec-reader. If this record isn't the eof object, it is parsed with field-parser. These two values -- the record, and its parsed representation -- are returned as multiple values from the reader.
When called at eof, the reader returns [eof-object ()].
Although the record reader typically returns a string, and the field-parser typically takes a string argument, this is not required. The record reader can produce, and the field-parser consume, values of any type. However, the empty list returned as the parsed value on eof is hardwired into the field reader.
For example, if port p is open on /etc/passwd, then
((field-reader (infix-splitter ":" 7)) p)
returns two values:
"dalbertz:mx3Uaqq0:107:22:David Albertz:/users/dalbertz:/bin/csh"
("dalbertz" "mx3Uaqq0" "107" "22" "David Albertz" "/users/dalbertz"
 "/bin/csh")
The field-parser defaults to the value of (field-splitter), a parser that picks out sequences of non-white-space strings.
The rec-reader defaults to read-line.
Figure 7 shows field-reader being used to read different kinds of Unix records.
;;; /etc/passwd reader
(field-reader (infix-splitter ":" 7))
; wandy:3xuncWdpKhR.:73:22:Wandy Saetan:/usr/wandy:/bin/csh
;;; Two ls -l output readers
(field-reader (infix-splitter (rx (+ white)) 8))
(field-reader (infix-splitter (rx (+ white)) -7))
; -rw-r--r-- 1 shivers 22880 Sep 24 12:45 scsh.scm
;;; Internet hostname reader
(field-reader (field-splitter (rx (+ (~ ".")))))
; stat.sinica.edu.tw
;;; Internet IP address reader
(field-reader (field-splitter (rx (+ (~ "."))) 4))
; 18.24.0.241
;;; Line of integers
(let ((parser (field-splitter (rx (? ("+-")) (+ digit)))))
  (field-reader (lambda (s) (map string->number (parser s)))))
; 18 24 0 241
;;; Same as above.
(let ((reader (field-reader (field-splitter (rx (? ("+-"))
                                                (+ digit))))))
  (lambda maybe-port (map string->number (apply reader maybe-port))))
; Yale beat harvard 26 to 7.
Figure 7: Some examples of field-reader
8.1.4 Forward-progress guarantees and empty-string matches
A loop that pulls text off a string by repeatedly matching a regexp against that string can conceivably get stuck in an infinite loop if the regexp matches the empty string. For example, the SREs bos, eos, (* any), and (| "foo" (* (~ "f"))) can all match the empty string.
The routines in this package that iterate through strings with regular expressions are careful to handle this empty-string case. If a regexp matches the empty string, the next search starts, not from the end of the match (which in the empty string case is also the beginning -- that's the problem), but from the next character over. This is the correct behaviour. Regexps match the longest possible string at a given location, so if the regexp matched the empty string at location i, then it is guaranteed it could not have matched a longer pattern starting with character i. So we can safely begin our search for the next match at char i + 1.
With this provision, every iteration through the loop makes some forward progress, and the loop is guaranteed to terminate.
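The rule can be sketched as a loop over successive matches. This is not the package's actual implementation; it is an illustration that assumes scsh's regexp-search, match:start, match:end, and match:substring, and the name all-matches is hypothetical.

```scheme
;; Collect the text of successive matches of RE in S, left to right.
;; all-matches is a hypothetical helper, not part of scsh.
(define (all-matches re s)
  (let lp ((i 0) (acc '()))
    (if (> i (string-length s))
        (reverse acc)                        ; ran off the end: done
        (let ((m (regexp-search re s i)))
          (if (not m)
              (reverse acc)                  ; no further match: done
              (let ((start (match:start m))
                    (end   (match:end m)))
                ;; An empty match ends where it started, so resuming at END
                ;; would find it again forever. Step one char past it instead.
                (lp (if (= start end) (+ end 1) end)
                    (cons (match:substring m) acc))))))))
```

Every iteration either stops or resumes at a strictly larger index, so the loop terminates even for a pattern such as (rx (* any)) that can match the empty string.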
This has the effect you want with field parsing. For example, if you split a string with the empty pattern, you will explode the string into its individual characters:
((suffix-splitter (rx)) "foo") ===> ("" "f" "o" "o")
However, even though this boundary case is handled correctly, we don't recommend using it. Say what you mean -- just use a field splitter:
((field-splitter (rx any)) "foo") ===> ("f" "o" "o")
Or, more efficiently,
((lambda (s) (map string (string->list s))) "foo")
8.1.5 Reader limitations
Since all of the readers in this package require the ability to peek ahead one char in the input stream, they cannot be applied to raw integer file descriptors, only Scheme input ports. This is because Unix doesn't support peeking ahead into input streams.
8.2 Awk
Scsh provides a loop macro and a set of field parsers that can be used to perform text processing very similar to the Awk programming language. The basic functionality of Awk is factored in scsh into its component parts. The control structure is provided by the awk loop macro; the text I/O and parsers are provided by the field-reader subroutine library (section 8.1). This factoring allows the programmer to compose the basic loop structure with any parser or input mechanism at all. If the parsers provided by the field-reader package are insufficient, the programmer can write a custom parser in Scheme and use it with equal ease in the awk framework.
Awk-in-scheme is given by a loop macro called awk. It looks like this:
(awk <.next-record.> <.record&field-vars.>
     [<.counter.>] <.state-var-decls.>
  <.clause1.> ...)
The body of the loop is a series of clauses, each one representing a kind of condition/action pair. The loop repeatedly reads a record, and then executes each clause whose condition is satisfied by the record.
Here's an example that reads lines from port p and prints the line number and line of every line containing the string ``Church-Rosser'':
(awk (read-line p) (ln) lineno ()
  ("Church-Rosser" (format #t "~d: ~s~%" lineno ln)))
This example has just one clause in the loop body, the one that tests for matches against the regular expression ``Church-Rosser''.
The <.next-record.> form is an expression that is evaluated each time through the loop to produce a record to process. This expression can return multiple values; these values are bound to the variables given in the <.record&field-vars.> list of variables. The first value returned is assumed to be the record; when it is the end-of-file object, the loop terminates.
For example, let's suppose we want to read items from /etc/passwd, and we use the field-reader procedure to define a record parser for /etc/passwd entries:
(define read-passwd (field-reader (infix-splitter ":" 7)))
binds read-passwd to a procedure that reads in a line of text when it is called, and splits the text at colons. It returns two values: the entire line read, and a seven-element list of the split-out fields. (See section 8.1 for more on field-reader and infix-splitter.)
So if the <.next-record.> form in an awk expression is (read-passwd), then <.record&field-vars.> must be a list of two variables, e.g.,
(record field-vec)
since read-passwd returns two values.
Note that awk allows us to use any record reader we want in the loop, returning whatever number of values we like. These values don't have to be strings or string lists. The only requirement is that the record reader return the eof object as its first value when the loop should terminate.
The awk loop allows the programmer to have loop variables. These are declared and initialised by the <.state-var-decls.> form, a
((var init-exp) (var init-exp) ...)
list rather like the let form. Whenever a clause in the loop body executes, it evaluates to as many values as there are state variables, updating them.
The optional <.counter.> variable is an iteration counter. It is bound to 0 when the loop starts. The counter is incremented each time a non-eof record is read.
There are several kinds of loop clause. When evaluating the body of the loop, awk evaluates all the clauses sequentially. Unlike cond, it does not stop after the first clause is satisfied; it checks them all.
(test body1 body2 ...)
If test is true, execute the body forms. The last body form is the value of the clause. The test and body forms are evaluated in the scope of the record and state variables.
The test form can be one of:
- integer: The test is true for that iteration of the loop. The first iteration is #1.
- sre: A regular expression, in SRE notation (see chapter 6) can be used as a test. The test is successful if the pattern matches the record. In particular, note that any string is an SRE.
- (when expr): The body of a when test is evaluated as a Scheme boolean expression in the inner scope of the awk form.
- expr: If the form is none of the above, it is treated as a Scheme expression -- in practice, the when keyword is only needed in cases where SRE/Scheme expression ambiguity might occur.
(range start-test stop-test body1 ...)
(:range start-test stop-test body1 ...)
(range: start-test stop-test body1 ...)
(:range: start-test stop-test body1 ...)
These clauses become activated when start-test is true; they stay active on all further iterations until stop-test is true.
So, to print out the first ten lines of a file, we use the clause:
(:range: 1 10 (display record))
The colons control whether or not the start and stop lines are processed by the clause. For example:
(range 1 5 ...)      Lines 2 3 4
(:range 1 5 ...)     Lines 1 2 3 4
(range: 1 5 ...)     Lines 2 3 4 5
(:range: 1 5 ...)    Lines 1 2 3 4 5
A line can trigger both tests, either simultaneously starting and stopping an active region, or simultaneously stopping one and starting a new one, so ranges can abut seamlessly.
(else body1 body2 ...)
If no other clause has executed since the top of the loop, or since the last else clause, this clause executes.
(test => exp)
If evaluating test produces a true value, apply exp to that value. If test is a regular expression, then exp is applied to the match data structure returned by the regexp match routine.
(after body1 ...)
This clause executes when the loop encounters EOF. The body forms execute in the scope of the state vars and the record-count var, if there are any. The value of the last body form is the value of the entire awk form.
If there is no after clause, awk returns the loop's state variables as multiple values.
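The pieces described above (counter, state variables, an SRE test, else, and after) can be combined as in the following sketch. The procedure name count-lines is hypothetical; it is only an illustration of how the clauses fit together.

```scheme
;; Count blank and non-blank lines read from PORT.
;; Returns a list: (total-lines blank-lines text-lines).
;; count-lines is a hypothetical name used for illustration.
(define (count-lines port)
  (awk (read-line port) (line) lineno ((blank 0) (text 0))
    ;; SRE test: the whole line is white space. Each clause yields one
    ;; value per state variable, here the new BLANK and TEXT.
    ((: bos (* white) eos) (values (+ blank 1) text))
    ;; Runs only when no earlier clause fired on this record.
    (else (values blank (+ text 1)))
    ;; Runs once at eof; its value is the value of the whole awk form.
    (after (list lineno blank text))))
```

Without the after clause, the same loop would return blank and text as two values instead of a list.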
8.2.1 Examples
Here are some examples of awk being used to process various types of input stream.
(define $ list-ref) ; Saves typing.
;;; Print out the name and home-directory of everyone in /etc/passwd:
(let ((read-passwd (field-reader (infix-splitter ":" 7))))
  (call-with-input-file "/etc/passwd"
    (lambda (port)
      (awk (read-passwd port) (record fields) ()
        (#t (format #t "~a's home directory is ~a~%"
                    ($ fields 0)
                    ($ fields 5)))))))
;;; Print out the user-name and home-directory of everyone whose
;;; name begins with "S"
(let ((read-passwd (field-reader (infix-splitter ":" 7))))
  (call-with-input-file "/etc/passwd"
    (lambda (port)
      (awk (read-passwd port) (record fields) ()
        ((: bos "S")
         (format #t "~a's home directory is ~a~%"
                 ($ fields 0)
                 ($ fields 5)))))))
;;; Read a series of integers from stdin. This expression evaluates
;;; to the number of positive numbers that were read. Note our
;;; "record-reader" is the standard Scheme READ procedure.
(awk (read) (i) ((npos 0))
  ((> i 0) (+ npos 1)))
;;; Filter -- pass only lines containing my name.
(awk (read-line) (line) ()
  ("Olin" (display line) (newline)))
;;; Count the number of non-comment lines of code in my Scheme source.
(awk (read-line) (line) ((nlines 0))
  ((: bos (* white) ";") nlines)    ; A comment line.
  (else (+ nlines 1)))              ; Not a comment line.
;;; Read numbers, counting the evens and odds.
(awk (read) (val) ((evens 0) (odds 0))
  ((> val 0) (display "pos ") (values evens odds)) ; Tell me about
  ((< val 0) (display "neg ") (values evens odds)) ; sign, too.
  (else (display "zero ") (values evens odds))
  ((even? val) (values (+ evens 1) odds))
  (else (values evens (+ odds 1))))
;;; Determine the max length of all the lines in the file.
(awk (read-line) (line) ((max-len 0))
  (#t (max max-len (string-length line))))
;;; (This could also be done with PORT-FOLD:)
(port-fold (current-input-port) read-line
           (lambda (line maxlen) (max (string-length line) maxlen))
           0)
;;; Print every line longer than 80 chars.
;;; Prefix each line with its line #.
(awk (read-line) (line) lineno ()
  ((> (string-length line) 80)
   (format #t "~d: ~s~%" lineno line)))
;;; Strip blank lines from input.
(awk (read-line) (line) ()
  ((~ white) (display line) (newline)))
;;; Sort the entries in /etc/passwd by login name.
(for-each (lambda (entry) (display (cdr entry)) (newline))           ; Out
          (sort (lambda (x y) (string<? (car x) (car y)))            ; Sort
                (let ((read (field-reader (infix-splitter ":" 7))))  ; In
                  (awk (read) (line fields) ((ans '()))
                    (#t (cons (cons ($ fields 0) line) ans))))))
;;; Prefix line numbers to the input stream.
(awk (read-line) (line) lineno ()
  (#t (format #t "~d:\t~a~%" lineno line)))
8.3 Backwards compatibility
Previous scsh releases provided an awk form with a different syntax, designed around regular expressions written in Posix notation as strings, rather than SREs.
This form is still available in a separate module for old code. It'll be documented in the next release of this manual. Dig around in the sources for it.