axiom-developer
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Axiom-developer] No progress on notangle


From: Waldek Hebisch
Subject: Re: [Axiom-developer] No progress on notangle
Date: Sun, 18 Feb 2007 05:43:22 +0100 (CET)

> I have tried to use read-sequence to load a file quickly, which works
> fine, but I'm afraid I haven't done too well with the problem of
> working with the resulting string.  The code so far is actually slower
> than the earlier version, and from what I can tell the primary issue
> seems to be too many subseq operations on the big one-pamphlet string. 
> I tried leaving the string intact and just identifying positions rather
> than chopping it off as I scanned, but the searches got progressively
> more expensive even though I was supplying new start positions. 
> There's probably a better way to do it or an obvious mistake in the way
> I have done it, but I'm afraid I've been staring at it a bit too long. 

Fast way to scan strings is to use finite automaton.  The code below
shows how this can be done in Lisp. 

Notes:
- This is a proof of concept (probably buggy) code, in particular I
  only extract the  main chunk (do not handle rescannig for other
  chunks).  Rescannig can be done by similar code, which simulatneously
  could handle @-escapes ("@<<" etc).  If escapes are not supported one
  could do with a single scan, extending automaton to remember positions
  of embedded tags.  Alternative way to have single scan is to
  copy content to new array during scan handling escapes during
  copy.
- AFAICS this code is faster then Debian (pretty fast) notangle.
  I belive that after adding needed features it can still be
  fast.
- I use (unsigned-byte  8) as a type because Lisp implementation
  may choose to have Unicode character and perfrom complicated
  recoding during input and output.  Also, Unicode characters
  are likely to require more storage (so more time for memory
  access and for garbage collection).
- The main function 'scan-for-chunks' is a rat nest of gotos.
  But that is correct because finite automaton _is_ a rat nest
  of transition and one of the simplest ways to implement
  transitions is to use gotos.  One can make it shorter by
  using apropritate macros.  Ideally the automaton code should
  be mechanically generated -- writing in C I would use flex
  (the task is trivial for flex, but no so trivial by hand).
- For ultimate speed pamplets should be read directly by Spad
  scanner.  This would avoid duplicate work.  Namely, Spad have
  to scan the file anyway.  Recoginzing noweb markers adds
  only little complexity to scanner and (if implemented via
  finite automaton) almost no execution time.  Spad performs
  macro expansion and the same mechanizm can substitute chunks
  if they respect pile rules.  If one really wants unstructured
  chunks one can handle them redirection parser input (like
  include files).
- Re-doing Spad scanning/parsing is on my todo list.  But ATM
  parsing works no worse than the rest of compiler and while
  quite slow it is still one of the fastest parts in the compiler.
  So this task has very low priority.

Code follows:

;;; Usage:
;;;         (untangle "input.file.name" "output.file.name")
;;;

(defconstant start-tag-code1 (char-code #\<))
(defconstant start-tag-code2 (char-code #\<))
(defconstant end-tag-code-1 (char-code #\>))
(defconstant end-tag-code-2 (char-code #\>))
(defconstant end-tag-code-3 (char-code #\=))
(defconstant chunk-end-code (char-code #\@))
(defconstant newline-code 10)
(defconstant space-code (char-code #\ ))

(defun read-file (f)
    (let* ((b-len (file-length f))
           (buff (make-array (list (+ b-len 1))
                             :element-type '(unsigned-byte  8))))
        (read-sequence buff f)
        (setf (aref buff b-len) newline-code)
        buff))

(defun read-named-file (name)
    (with-open-file (f name :element-type '(unsigned-byte  8))
         (read-file f)))

(defclass chunk ()
  ((chunk-name
    :initarg :chunk-name
    :initform (error "Must supply a chunk name.")
    :reader chunk-name
    :documentation "Name of chunk.")
   (chunk-contents
    :initarg :chunk-contents
    :initform '()
    :accessor chunk-contents
    :documentation "Text of chunk - may include references to other chunks")))

(defparameter *chunk-hash-table* (make-hash-table :test 'equalp))

(defun add-to-chunk-contents (name content)
    (let ((chunk (gethash name *chunk-hash-table*)))
         (if (not chunk)
             (setf chunk (setf (gethash name *chunk-hash-table*)
                               (make-instance 'chunk :chunk-name name))))
             (push (chunk-contents chunk) content)))

(defun view-all-chunks ()
  (maphash #'(lambda (k v) (format t "~a => ~S~&" k v)) *chunk-hash-table*))

(defun print-chunk-name (name line-number)
    (format t "Chunk start in line ~A: <<" line-number)
    (dotimes (i (array-dimension name 0))
        (format t "~A" (code-char (aref name i))))
    (format t ">>~&")) 

(defun scan-for-chunks (buff)
    (prog ((pos -1)
           (end-buff (- (array-dimension buff 0) 1))
           (start-pos)
           (code)
           (chunk-name)
           (line-number 0))
       normal-start
         (incf pos)
         (incf line-number)
         (if (>= pos end-buff)
             (return-from scan-for-chunks))
         (setf code (aref buff pos))
         (if (equal code start-tag-code1)
             (go chunk-start-tag-1))
         (if (equal code newline-code)
             (go normal-start))
         (go normal)
       normal
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code newline-code)
             (go normal-start))
         (go normal)
       chunk-start-tag-1
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code start-tag-code2)
             (progn
                  (setf start-pos (+ pos 1))
                  (format t "start-pos: ~A, char: ~A~&" start-pos
                                                      (aref buff start-pos))
                  (go in-chunk-start-tag)))
         (if (equal code newline-code)
             (go normal-start))
         (go normal)
       in-chunk-start-tag
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code end-tag-code-1)
             (go in-chunk-start-tag-3))
         (if (equal code newline-code)
             (go normal-start))
         (go in-chunk-start-tag)
       in-chunk-start-tag-3
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code end-tag-code-2)
             (go in-chunk-start-tag-4))
         (if (equal code newline-code)
             (go normal-start))
         (go normal)
       in-chunk-start-tag-4
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code end-tag-code-3)
             (progn 
                  (setf chunk-name (subseq buff start-pos (- pos 2)))
                  (go in-chunk-start-tag-trailing)))
         (if (equal code newline-code)
             (go normal-start))
         (go normal)
       in-chunk-start-tag-trailing
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code newline-code)
             (progn
                 (print-chunk-name chunk-name line-number)
                 (setf chunk-start-pos (+ pos 1))
                 (go in-chunk)))
         (if (equal code space-code)
             (go in-chunk-start-tag-trailing))
         (go normal)
       chunk-start
         (setf chunk-end-pos pos)
         (incf pos)
         (if (>= pos end-buff)
             (break "unexpected end of file"))
         (setf code (aref buff pos))
         (incf line-number)
         (if (equal code newline-code)
             (go chunk-start))
         (if (equal code chunk-end-code)
             (progn 
                 (add-to-chunk-contents 
                     chunk-name 
                     (list chunk-start-pos chunk-end-pos))
                 (go chunk-end-trailing)))
         (go in-chunk)
       chunk-end-trailing
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code newline-code)
            (go normal-start))
         (if (equal code space-code)
            (go chunk-end-trailing))
         (format t "pos: ~A, code: ~A~&" pos code)
         (break "garbage after end of chunk marker")
       in-chunk
         (incf pos)
         (setf code (aref buff pos))
         (if (equal code newline-code)
             (go chunk-start))
         (go in-chunk)))

(defun untangle (in-file out-file)
    (let ((buff (read-named-file in-file)) (chunk))
          (scan-for-chunks buff)
          (setf chunk (gethash (make-array (list 1)
                                         :element-type '(unsigned-byte  8)
                                         :initial-element (char-code #\*))
                                *chunk-hash-table*))
          (if (not chunk)
              (break "No main chunk"))
          (with-open-file (of out-file :direction :output
                                       :element-type '(unsigned-byte  8))
              (dolist (seg (reverse (chunk-contents chunk)))
                    (write-sequence buff of :start (nth 0 seg) 
                                            :end (nth 1 seg))))))
         

-- 
                              Waldek Hebisch
address@hidden 




reply via email to

[Prev in Thread] Current Thread [Next in Thread]