Filter Input Extracting Quoted Urls
I used this to extract referrer URLs from an HTTP server log, piping the output through a deduplication filter, namely FilterInputDeletingDuplicateLines, and then into sort. (Alternatively, sort first and then deduplicate with uniq.)
The variant extract-referrals, appended at the end, emits both the referred URL and its referrer.
#!/usr/local/bin/scsh \
-e main -s
USAGE: extract-qurls.scm < INPUT > OUTPUT
Copy quoted urls from std INPUT to std OUTPUT,
sans quotes, line by line.
This INPUT line
i577B5019.versanet.de - - [18/Nov/2007:01:36:44 +0100] "GET /pix/broad.jpg HTTP/1.1" 200 11371 "http://phat.xxx/fog.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9"
would contribute the url
http://phat.xxx/fog.html
to the OUTPUT.
!#
;; Extract (heuristically) quoted urls from (http log on) stdin
;; copy them (sans quotes) to stdout, line by line
(define (extract-urls)
  (awk (read-line) (record) ()
    ((: "\""
        (submatch (: (+ alphabetic) "://" (* any)))
        "\" ")
     => (lambda (match)
          (display (match:substring match 1))
          (newline)))))

(define (main args)
  (if (= (length args) 1)
      (extract-urls)
      (format #t "Usage: ~a < INPUT > OUTPUT~%" (first args))))
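For readers without scsh at hand, the matching rule can be approximated in Python. This is only a sketch, not part of the original script: the names QUOTED_URL, extract_urls and unique_sorted_urls are made up for illustration, and [A-Za-z] is an assumed stand-in for the SRE class alphabetic. It also folds in the deduplicate-and-sort step that the shell pipeline performs.

```python
import re

# Rough translation of the SRE: an opening quote, a scheme such as "http",
# "://", then a greedy run up to the last quote-plus-space on the line.
# [A-Za-z] approximates scsh's `alphabetic` (an assumption).
QUOTED_URL = re.compile(r'"([A-Za-z]+://.*)" ')


def extract_urls(lines):
    """Yield one quoted URL (sans quotes) per line that contains a match."""
    for line in lines:
        match = QUOTED_URL.search(line.rstrip("\n"))
        if match:
            yield match.group(1)


def unique_sorted_urls(lines):
    """Deduplicate and sort the extracted URLs."""
    return sorted(set(extract_urls(lines)))
```

Calling unique_sorted_urls(sys.stdin) would mimic the whole pipeline. Because the trailing (* any) is greedy, the capture runs to the last quote-plus-space on the line; the sample log line works because its user-agent field ends in a quote with no space after it.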
A variant
;; Extract the referred local url and its quoted referrer.
;; Copy them to stdout (sans quotes, separated by a tab),
;; line by line
(define (extract-referrals)
  (awk (read-line) (record) ()
    ((: "GET "
        (submatch (: (+ (~ whitespace))))
        (* any)
        "\""
        (submatch (: (+ alphabetic) "://" (* any)))
        "\" ")
     => (lambda (match)
          (format #t "~a\t~a~%"
                  (match:substring match 1)
                  (match:substring match 2))))))
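The two-submatch pattern of the variant can be checked the same way. Again a hedged Python sketch rather than the author's code: REFERRAL and this extract_referrals are illustrative names, \S stands in for (~ whitespace), and [A-Za-z] for alphabetic.

```python
import re

# "GET ", the requested path (a run of non-whitespace), anything, then a
# quoted URL followed by a space. Group 1 is the path, group 2 the referrer.
REFERRAL = re.compile(r'GET (\S+).*"([A-Za-z]+://.*)" ')


def extract_referrals(lines):
    """Yield 'path<TAB>referrer' for each line that matches."""
    for line in lines:
        match = REFERRAL.search(line.rstrip("\n"))
        if match:
            yield f"{match.group(1)}\t{match.group(2)}"
```

Lines whose referrer field is "-" contain no quoted URL and are skipped, just as in the scsh version.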