[Freecats-Dev] Re: Free CATS - possible help
Wed, 15 Jan 2003 01:12:08 +0100
I'm sending a copy of this e-mail (a reply to your 2 last ones) over our
mailing list, as it will interest all team members. Please feel free to
subscribe even though I know you don't have time to code with us.
Good news: Savannah accepted my request to host Free CATS.
No problem. I wish I could do more, but you are setting out to
create a large, complex, and full featured application, and to be
fair, I just don't have time. Do feel free, on the other hand, to
contact me, or post to comp.lang.tcl for Tcl/Tk help!
Sure, thanks. I found out about this NG yesterday, when exploring tcl.tk
You do help me by providing answers to my questions, at least the
ones concerning design issues. Here is one for you.
Considering we need to store translation units in a translation
memory (a CAT database), what do you think about implementing it on
top of an existing native XML DB server?
> Sounds logical. I think most of them are in Java, though, which is
> an unpleasant thought. Java tends to not play that well with the rest
> of the world, I've found.
What you say about Java confirms a feeling I had, and I'll take
your word for it.
I even thought: could we implement an alpha version of our TM server
with flat, .INI type files and simple string parsing functions, just to
see something running ;-)
> I would start playing around with a few different things, and see
> what works best.
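
To make the flat-file idea concrete, here is a minimal sketch in Python (not Tcl, and not project code). It assumes one .INI section per translation unit with "source" and "target" keys; the section names (tu1, tu2, ...) and the function names are made up for illustration, and the lookup is a plain exact string comparison, nothing fuzzy yet.

```python
import configparser
import io


def save_units(units):
    """Serialize (source, target) pairs to INI text, one section per unit."""
    # interpolation=None so that a literal "%" in a segment is stored as-is.
    parser = configparser.ConfigParser(interpolation=None)
    for number, (source, target) in enumerate(units, start=1):
        parser[f"tu{number}"] = {"source": source, "target": target}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def load_units(ini_text):
    """Read the INI text back into a list of (source, target) pairs."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return [(parser[name]["source"], parser[name]["target"])
            for name in parser.sections()]


def exact_lookup(units, sentence):
    """Return the stored translation of a sentence, or None if unknown."""
    wanted = sentence.strip().lower()
    for source, target in units:
        if source.strip().lower() == wanted:
            return target
    return None
```

Such a file only suits single-line segments and small memories, but it would be enough to "see something running" before committing to a real database server.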
Well, as we are not that many yet, (speaking for the database server
component), I guess the first step could consist in:
- reading the documentation
- if it seems suitable, contacting the project team to ask for help
implementing our custom indexing features.
I know it seems naive, but it may work. After everything I heard about
Savannah and how selective they were about accepting projects, I see
their quick green light for accepting Free CATS as a good sign.
Since (and following your advice) we chose Tcl/Tk as the main
development tool, I found out there are several readily available
components which might interest us. As you are a member of Apache Tcl
project, do you have something to tell us about:
By the way, I just noticed lots of new materials at:
Am I right in assuming Apache Xindice is a Java project?
As I see it with a newbie's eye, I believe it would mainly require
implementing custom indexing features, so as to be able to perform
fuzzy matching. We are working to define exactly what we need to
index within each translation unit's source segment and possible
Basically, for a given sentence, we need to index:
- each word in it, as well as tags (not a specific tag, but "a"
(generic) tag, as the real tag will come from another TU's source
segment) and punctuation marks
- the sequence of these items in the target segment.
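
As an illustration of the list above, here is a rough Python sketch. It splits a segment into words, punctuation marks and tags, replaces every concrete tag with one generic placeholder, and records each item's position so that the sequence is kept. The placeholder "<TAG>", the function names and the use of difflib's SequenceMatcher as a stand-in for a fuzzy matching rate are all assumptions made for the example, not decisions taken by the team.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# A token is a tag, a word, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"(?P<tag><[^>]+>)|(?P<word>\w+)|(?P<punct>[^\w\s])")
GENERIC_TAG = "<TAG>"  # hypothetical placeholder standing for "a" tag


def tokenize(segment):
    """Return the ordered list of index items found in a segment."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(segment):
        if match.lastgroup == "tag":
            tokens.append(GENERIC_TAG)          # any tag becomes the generic one
        elif match.lastgroup == "word":
            tokens.append(match.group(0).lower())
        else:
            tokens.append(match.group(0))       # punctuation mark, kept as-is
    return tokens


def build_index(segments):
    """Map each item to the (unit id, position) pairs where it occurs."""
    index = defaultdict(list)
    for unit_id, segment in segments.items():
        for position, token in enumerate(tokenize(segment)):
            index[token].append((unit_id, position))
    return index


def fuzzy_rate(segment_a, segment_b):
    """Similarity of two segments between 0.0 and 1.0 (stand-in measure)."""
    return SequenceMatcher(None, tokenize(segment_a), tokenize(segment_b)).ratio()
```

With this, "Click <b>OK</b>." and "Click <i>Cancel</i>." share four of their five items in the same order, so they would come out as close matches even though the actual tags differ.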
There is another, major design issue for which I would be glad to hear
from you (and which we discussed at our first project team meeting last
We know our document working format is going to be tagged. XML, and
therefore Oasis' XLIFF, seems an obvious choice, but at the same time,
in our little newbies' heads, we couldn't help raising a few issues:
- The XML specification is very theoretical (writing a full-fledged XML
parser is a hard task).
- We can't help thinking about all the existing HTML documents published
so far, whose structure is invalid from the point of view of XML syntax.
- We don't need/want to understand/alter the XML structure of translated
documents - in fact, we want to be sure we preserve (and ignore) it.
- Could a "dumb" approach be better than a "full-fledged" one?
I mean, we don't need/want to translate documents the way an author
would edit an XML document.
We first thought we would select and adapt an existing (free) XML editor
so as to integrate it as Free CATS's document editing component.
This implies we would have to deal with many complex, already integrated
features without which, in fact, we might be better off.
I tend to think that, for XML documents, we need to parse them in the
MOST simple way, so as to identify:
- actual text contents (to be translated)
- "internal" formatting tags (to be played with, to some extent, but at
the very least, we'll be able to accurately specify what we need)
- "external" (XML structure) tags (to be left untouched).
So, in fact, we're looking for a type of parser which would mark as
"Don't touch" these external tags, and create a sequence of "source
materials" (internal tags & text contents) which would be automatically
cut into translation units (TU):
(sorry for my very limited
a few simple data here (fuzzy matching rate)
source segment (to be translated)
<Middle of TU>
target segment (translated, to be inserted during translation)
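
To show what such a "dumb" parser might look like, here is a small Python sketch. It does not validate anything: it only cuts the document at angle brackets, labels each piece as text, "internal" tag or "external" tag, and then groups text and internal tags into runs of source material. The set of internal tag names is an assumption for the example (a real list would have to be defined per document type), and comments, CDATA sections or a ">" inside an attribute value are deliberately not handled.

```python
import re

# Assumption: which tags count as "internal" formatting tags.
INTERNAL_TAGS = {"b", "i", "u", "em", "strong"}

PIECE_PATTERN = re.compile(r"(<[^>]+>)")
TAG_NAME_PATTERN = re.compile(r"</?\s*([A-Za-z_][\w.-]*)")


def classify(document):
    """Cut a document into ("text" | "internal" | "external", piece) pairs."""
    pieces = []
    for piece in PIECE_PATTERN.split(document):
        if not piece:
            continue
        if piece.startswith("<") and piece.endswith(">"):
            name = TAG_NAME_PATTERN.match(piece)
            if name and name.group(1).lower() in INTERNAL_TAGS:
                pieces.append(("internal", piece))
            else:
                pieces.append(("external", piece))   # "Don't touch"
        else:
            pieces.append(("text", piece))
    return pieces


def source_materials(pieces):
    """Group text and internal tags between external tags into runs."""
    runs, current = [], []
    for kind, piece in pieces:
        if kind == "external":
            if "".join(current).strip():
                runs.append("".join(current))
            current = []
        else:
            current.append(piece)
    if "".join(current).strip():
        runs.append("".join(current))
    return runs
```

Because every piece is kept verbatim, joining the pieces back together gives the original document byte for byte, which is the "preserve and ignore" property we are after. Cutting each run into sentences would be a separate step.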
From the bits I understood from XML's official definition, we would
only have to make sure that a given XML source document does not contain
the very string which is going to represent our own, custom tags.
As "xml" is a reserved string, a quick-and-dirty hack might consist in
including this very sequence as part of our own tags. That way, we may
not risk meeting it as part of the source document's original contents.
After that, things should be much simpler...
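
A small Python sketch of that check follows. One caveat: as far as the XML specification goes, what is reserved is names beginning with "xml", not the string itself, so "xml" can still appear in ordinary text or attribute values. The sketch therefore does not rely on the hack alone and explicitly tests for collisions. The three marker strings and the "rate|" prefix are invented for the example; only the idea of a "middle of TU" marker comes from the layout above.

```python
# Hypothetical marker strings; the real ones are still to be defined.
MARKERS = ("<xml-tu-start>", "<xml-tu-middle>", "<xml-tu-end>")


def colliding_markers(document, markers=MARKERS):
    """Return the markers that already occur in the document (any case)."""
    lowered = document.lower()
    return [marker for marker in markers if marker.lower() in lowered]


def wrap_unit(source, target="", rate=0, markers=MARKERS):
    """Build one translation unit, refusing segments that contain a marker."""
    clashes = colliding_markers(source + target, markers)
    if clashes:
        raise ValueError("segment already contains: " + ", ".join(clashes))
    start, middle, end = markers
    # Layout: start, matching rate, source segment, middle, target segment, end.
    return f"{start}{rate}|{source}{middle}{target}{end}"
```

Running colliding_markers() once over a whole source document before cutting it into units would tell us early whether different marker strings are needed for that document.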
Solving this in a simple and elegant way will be a major step. In fact,
we only have two "real" problems (read: "big" and tricky issues): the
one I tried to describe above, and the DBMS choice issue. Lots of things
must be taken care of, like secure access to the DBMS, but they can wait.
(from your second message)
> This looks like it might be useful to you:
Oh, well... yes. Thanks a bunch, David.
This might be THE answer.
(may I insert a comment for one of the project team members:
BERTRAND, GO HAVE A LOOK, THIS ONE IS FOR YOU!!!!!!!!!!!!!)
- [Freecats-Dev] Re: Free CATS - possible help,
Henri Chorand <=