scsh.net

Filter Input Extracting Quoted Urls

I used this to extract referrer URLs from an HTTP server log, piping the output through a deduplication filter, i.e. FilterInputDeletingDuplicateLines, and then into sort. (Alternatively, sort first and deduplicate with uniq.)
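The pipeline described above might look like the following sketch, which uses the sort-then-uniq variant. The script name extract-qurls.scm is taken from the usage text on this page; access.log and referrers.txt are hypothetical file names, and the helper unique_sorted is introduced here only for illustration.

```shell
#!/bin/sh
# Sort stdin and drop duplicate lines. uniq only collapses *adjacent*
# duplicates, which is why sort has to run first.
unique_sorted() {
    sort | uniq
}

# Assumed invocation (access.log and referrers.txt are hypothetical names):
#   ./extract-qurls.scm < access.log | unique_sorted > referrers.txt

# Stand-in input, so the dedup step can be tried without a log file:
printf '%s\n' 'http://b.example/' 'http://a.example/' 'http://b.example/' \
    | unique_sorted
```

With most sort implementations, sort -u should behave the same as sort | uniq here.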

The appended variant extract-referrals emits both the referred url and its referrer.


 #!/usr/local/bin/scsh \
 -e main -s

 USAGE: extract-qurls.scm < INPUT > OUTPUT
 Copy quoted urls from std INPUT to std OUTPUT, sans quotes, line by line.

 This INPUT line
 i577B5019.versanet.de - - [18/Nov/2007:01:36:44 +0100] "GET /pix/broad.jpg HTTP/1.1" 200 11371 "http://phat.xxx/fog.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9"
 would contribute the url
 http://phat.xxx/fog.html
 to the OUTPUT.
 !#

 ;; Extract (heuristically) quoted urls from (http log on) stdin
 ;; copy them (sans quotes) to stdout, line by line
 (define (extract-urls)
   (awk (read-line) (record) ()
        ((: "\""
            (submatch (: (+ alphabetic) "://" (* any)))
            "\" ")
         => (lambda (match)
              (display (match:substring match 1))
              (newline)))))

 (define (main args)
   (if (= (length args) 1)
       (extract-urls)
       (format #t "Usage: ~a < INPUT > OUTPUT~%" (first args))))


A variant

 ;; Extract the referred local url and its quoted referrer.
 ;; Copy them to stdout (sans quotes, separated by a tab),
 ;; line by line
 (define (extract-referrals)
   (awk (read-line) (record) ()
        ((: "GET "
            (submatch (: (+ (~ whitespace))))
            (* any)
            "\""
            (submatch (: (+ alphabetic) "://" (* any)))
            "\" ")
         => (lambda (match)
              (format #t "~a\t~a~%"
                      (match:substring match 1)
                      (match:substring match 2))))))

