word boundaries in Asian languages
From: Eric Abrahamsen
Subject: word boundaries in Asian languages
Date: Mon, 19 Aug 2013 18:26:20 +0800
User-agent: Gnus/5.130008 (Ma Gnus v0.8) Emacs/24.3 (gnu/linux)

I use Emacs for prose more than for programming, and I've been idly
fiddling with making it a better environment for editing other
languages: Asian languages in general, and Chinese prose in particular.
One of the really awkward things about editing Chinese prose in Emacs is
that word boundaries are bound to spaces -- in a language that doesn't
use spaces to delineate words, movement and editing commands are thus
restricted either to per-character, or per-punctuated-phrase. It's
unwieldy.
Accurately identifying word boundaries in Chinese is a subject of
academic research, but a couple of C libraries have emerged (I've pasted
a couple of likely links at the bottom).
Given that this level of programming is _way_ above my pay grade, I
raise the following totally hypothetical scenario. How likely is this:
1. I call "forward-word" (or some equivalent word-based command)
2. Emacs checks a variable like use-multilingual-words, or something
like that, which makes all the following optional.
3. It's true, so we check the script of the following character, and try
a lookup in a variable that pairs scripts with C libraries that
provide word-level commands for those scripts.
4. A library is present! Instead of the usual "forward-word", we now
call a function from that library to identify the next word boundary.
Point goes either to that spot, or to the end of a contiguous run of
characters of the same script that we started in.
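
To make the four steps concrete, here is a rough sketch in Python rather than Emacs Lisp or C. Everything in it is an assumption for illustration: the names (script_of, SEGMENTERS, han_next_boundary, TOY_LEXICON), the crude Han range check, and the toy greedy longest-match lookup that stands in for a real segmentation library. It is not how Emacs or any of the linked libraries actually work.

```python
import re

# Hypothetical toy lexicon; a real segmenter would ship a large dictionary.
TOY_LEXICON = {"中文", "编辑", "文本", "编辑器"}
MAX_WORD_LEN = max(len(w) for w in TOY_LEXICON)


def script_of(ch):
    """Very rough script detection: Han (basic CJK block) versus everything else."""
    if "\u4e00" <= ch <= "\u9fff":
        return "han"
    return "other"


def han_next_boundary(text, pos, run_end):
    """Stand-in for a library call: greedy longest match against TOY_LEXICON,
    falling back to a single character when nothing matches."""
    for length in range(min(MAX_WORD_LEN, run_end - pos), 1, -1):
        if text[pos:pos + length] in TOY_LEXICON:
            return pos + length
    return pos + 1


# Step 3: a table pairing scripts with word-boundary functions.
SEGMENTERS = {"han": han_next_boundary}


def default_forward_word(text, pos):
    """Stand-in for the usual behaviour: skip non-word characters,
    then move over one unbroken run of word characters."""
    m = re.compile(r"\W*\w+").match(text, pos)
    return m.end() if m else len(text)


def forward_word(text, pos, use_multilingual_words=True):
    """Return the position point would move to (steps 1 to 4)."""
    # Step 2: the whole mechanism is optional.
    if pos >= len(text) or not use_multilingual_words:
        return default_forward_word(text, pos)
    # Step 3: look up a segmenter for the script of the next character.
    script = script_of(text[pos])
    segmenter = SEGMENTERS.get(script)
    if segmenter is None:
        return default_forward_word(text, pos)
    # Step 4: never move past the end of the contiguous same-script run.
    run_end = pos
    while run_end < len(text) and script_of(text[run_end]) == script:
        run_end += 1
    return min(segmenter(text, pos, run_end), run_end)
```

With the flag off, forward_word("我用编辑器编辑中文", 0, use_multilingual_words=False) jumps over the whole sentence in one go, which is the awkwardness described above. With it on, repeated calls stop after 我, 用, 编辑器, 编辑 and 中文.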
So external C libraries would have to be augmented with functions that
locate word boundaries in a way that makes sense to Emacs, but
presumably the hard work would have already been done. Given my general
ignorance, how unlikely is all of this?
Thanks!
Eric
http://technology.chtsai.org/mmseg/
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.8593