Re: Unicode I/O
From: Ludovic Courtès
Subject: Re: Unicode I/O
Date: Sun, 23 Jan 2011 00:42:23 +0100
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/23.2 (gnu/linux)
Hello!
address@hidden (Ludovic Courtès) writes:
> I’ve just pushed a ‘wip-iconv’ branch, which currently changes ports to
> use ‘iconv’ for input. Remaining tasks include doing it for output, and
> finding a solution for ‘scm_{to,from}_stringn’ so that it behaves in the
> same way wrt. escapes and error handling.
I just merged ‘wip-iconv’ into ‘master’. It uses ‘iconv’ for
display/write and peek-char/read-char, but not yet for
‘scm_{to,from}_stringn’ and ‘read-line’. Caveat: it has only been tested on
GNU/Linux.
Also, we should take advantage of this to improve error reporting, e.g.,
to include the location of a conversion failure.
Overall, it improves performance, except on Latin-1 ports, since I chose
not to special-case them (i.e., I/O on Latin-1 ports goes through
iconv). The trick is that iconv conversion descriptors are opened once
and for all, and no heap allocation happens (‘u32_conv_from_encoding’ and
friends typically call ‘malloc’).
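To illustrate the idea, here is a minimal sketch in C. It is not the code that was merged; ‘struct decoder’, ‘decoder_open’, ‘decoder_next’ and ‘decoder_close’ are hypothetical names made up for this example, and only ‘iconv_open’, ‘iconv’ and ‘iconv_close’ are real API. It assumes an iconv implementation that knows the "UTF-32BE" encoding name and reports an incomplete multibyte sequence with EINVAL, as GNU libc does.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port state: one conversion descriptor, opened once
   and reused for every character read from the port.  */
struct decoder
{
  iconv_t cd;
};

/* Open the descriptor once.  Return 0 on success, -1 on failure.
   UTF-32BE yields exactly 4 bytes per character and no byte-order mark.  */
int
decoder_open (struct decoder *d, const char *encoding)
{
  assert (d != NULL && encoding != NULL);
  d->cd = iconv_open ("UTF-32BE", encoding);
  return d->cd == (iconv_t) -1 ? -1 : 0;
}

void
decoder_close (struct decoder *d)
{
  iconv_close (d->cd);
}

/* Decode the first character of BUF (LEN bytes) into *CP.
   Return the number of input bytes consumed, 0 if BUF only holds an
   incomplete sequence, or -1 on invalid input.  The output buffer
   lives on the stack, so nothing is heap-allocated per character.  */
int
decoder_next (struct decoder *d, const char *buf, size_t len, uint32_t *cp)
{
  unsigned char out[4];
  size_t n;

  assert (d != NULL && buf != NULL && cp != NULL);

  /* Offer one more input byte at each step until a character comes out.  */
  for (n = 1; n <= len; n++)
    {
      char *in = (char *) buf;
      char *outp = (char *) out;
      size_t in_left = n, out_left = sizeof out;

      if (iconv (d->cd, &in, &in_left, &outp, &out_left) == (size_t) -1)
        {
          if (errno == EINVAL)
            continue;                /* incomplete sequence: need more bytes */
          iconv (d->cd, NULL, NULL, NULL, NULL);   /* reset the descriptor */
          return -1;                 /* EILSEQ or another failure */
        }
      if (out_left != 0)
        return -1;                   /* unexpected: no full character produced */

      *cp = ((uint32_t) out[0] << 24) | ((uint32_t) out[1] << 16)
            | ((uint32_t) out[2] << 8) | (uint32_t) out[3];
      return (int) n;
    }
  return 0;
}
```

A caller would invoke ‘decoder_open’ when the port's encoding is set, ‘decoder_next’ for each ‘peek-char’ or ‘read-char’, and ‘decoder_close’ when the port is closed. Because every port takes the same path, a Latin-1 port pays for the ‘iconv’ call too, which matches the trade-off described above.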
Benchmark results:
--8<---------------cut here---------------start------------->8---
;; with iconv:
("ports.bm: peek-char: latin-1 port" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 0.38)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 0.68)
("ports.bm: read-char: latin-1 port" 10000000 total 3.34)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.33)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.31)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.02 user 3.01)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.0)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.01)
;; with libunistring:
("ports.bm: peek-char: latin-1 port" 700000 total 0.25)
("ports.bm: peek-char: utf-8 port, ascii character" 700000 total 2.65)
("ports.bm: peek-char: utf-8 port, Korean character" 700000 total 7.58)
("ports.bm: read-char: latin-1 port" 10000000 total 3.38)
("ports.bm: read-char: utf-8 port, ascii character" 10000000 total 3.31)
("ports.bm: read-char: utf-8 port, Korean character" 10000000 total 3.29)
("ports.bm: char-ready?: latin-1 port" 10000000 total 3.08 user 3.08)
("ports.bm: char-ready?: utf-8 port, ascii character" 10000000 total 3.08)
("ports.bm: char-ready?: utf-8 port, Korean character" 10000000 total 3.05)
--8<---------------cut here---------------end--------------->8---
So ‘peek-char’ is faster on UTF-8 ports, whereas ‘read-char’ gives the
same results (to my surprise, I must say).
The ‘peek-char’ improvement is beneficial to SSAX. When loading a 4 MiB
XML file in UTF-8, it’s ~4 times faster than the old method:
--8<---------------cut here---------------start------------->8---
$ time guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "") (xml->sxml
(open-input-file "chbouib.xml"))'
real 0m20.509s
user 0m20.437s
sys 0m0.064s
$ time ./meta/guile -c '(use-modules (sxml simple)) (setlocale LC_ALL "")
(xml->sxml (open-input-file "chbouib.xml"))'
real 0m5.676s
user 0m5.599s
sys 0m0.076s
--8<---------------cut here---------------end--------------->8---
For ‘write.bm’:
--8<---------------cut here---------------start------------->8---
;; with iconv:
("write.bm: write: string with escapes" 50 total 0.71)
("write.bm: write: string without escapes" 50 total 0.65)
("write.bm: display: string with escapes" 1000 total 3.39)
("write.bm: display: string without escapes" 1000 total 0.97)
;; with libunistring:
("write.bm: write: string with escapes" 50 total 7.06)
("write.bm: write: string without escapes" 50 total 7.51)
("write.bm: display: string with escapes" 1000 total 1.96)
("write.bm: display: string without escapes" 1000 total 1.46)
--8<---------------cut here---------------end--------------->8---
In the nominal case (strings without escapes), ‘display’ is ~30% faster
here, and ‘sxml->xml’ is ~60% faster on this 4 MiB XML file:
--8<---------------cut here---------------start------------->8---
$ ./meta/guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL
"") (define s (xml->sxml (open-input-file "chbouib.xml"))) (time
(with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 2.48  2.44  0.02   0.00   0.00   0.00
$ guile -c '(use-modules (sxml simple) (ice-9 time)) (setlocale LC_ALL "")
(define s (xml->sxml (open-input-file "chbouib.xml"))) (time
(with-output-to-file "/tmp/foo.xml" (lambda () (sxml->xml s))))'
clock utime stime cutime cstime gctime
 6.43  6.39  0.04   0.00   0.00   0.00
--8<---------------cut here---------------end--------------->8---
Thanks,
Ludo’.