[Freecats-Dev] Bilingual File Format (again)
From: Henri Chorand
Subject: [Freecats-Dev] Bilingual File Format (again)
Date: Tue, 01 Apr 2003 21:43:31 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Hi Marc,
> Firstly, thank you to all members of Free CATS for your confidence in
> the future of the OmegaT project.
Three is a crowd, as Buster Keaton (I think) once said ;-)
> Keith will no doubt begin making contributions of his own to the list
> in due course, but at the moment both he and I are under time pressure
> owing to other activities. Please be patient!
No problem - we are all very busy people.
Now, concerning the Famous Bilingual file format:
When I was thinking about starting from scratch, I wanted to come up
with a very simple, yet sufficiently extensible, design:
1) Being ignorant about XML, apart from its most basic principles, I
still wanted it to be XML-based, that is, tag-based - XML is the future,
or so I heard everybody knowledgeable say.
2) I knew it had to include formatting info. I first looked for
formatting tags in the XML specs and asked a few techies around, only to
find out that the XML specs themselves do not define any. I then
supposed we could use HTML's set of formatting tags, at least for a
start (I assumed that for the more complicated formatting tags found in
some DTP and word-processing formats, we could simply find a way to keep
them unchanged and end up with a pseudo-WYSIWYG approach that would be
good enough for us translators).
3) I also knew that we had to parse any XML source file sequentially, in
a "dumb" way (only caring about its text & formatting contents and
leaving its structure unchanged, even and especially if it was
supposedly weird or malformed, and all the more since most existing HTML
files found today are NOT well-formed from XML's point of view).
4) I spoke to Thierry about it, and it emerged that we could envision a
bilingual file format made up of our own custom tags (beginning of TU
(including ancillary TU info), middle of TU, end of TU). We would keep
the internal tags (Trados' tw4winInternal style) within the TUs'
source & target segments, and proudly leave unchanged all "structure"
tags (Trados' tw4winExternal style).
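
A rough sketch of points 3 and 4 in Python, for illustration only. The
marker strings (<fc:tu>, <fc:mid/>, </fc:tu>), the list of structure
tags and the function names are all invented here; none of them is an
agreed format. The input is scanned sequentially with a regular
expression rather than an XML parser, so malformed markup passes through
untouched. Structure tags are copied as-is, and each run of text
(together with its inline formatting tags) is wrapped in TU markers with
an empty target.

```python
import re

# Hypothetical TU markers (beginning, middle, end); not an agreed format.
TU_OPEN, TU_MID, TU_CLOSE = "<fc:tu>", "<fc:mid/>", "</fc:tu>"

# Assumed list of "structure" tags; anything else (b, i, a, ...) is
# treated as an internal tag and stays inside the segment.
STRUCTURE_TAG = re.compile(
    r"(</?(?:p|div|h[1-6]|li|ul|ol|td|tr|table|body|html|head|title)\b[^>]*>)",
    re.IGNORECASE,
)

TU = re.compile(
    re.escape(TU_OPEN) + r"(.*?)" + re.escape(TU_MID) + r"(.*?)" + re.escape(TU_CLOSE),
    re.DOTALL,
)


def to_bilingual(markup):
    """Wrap each text run in TU markers; copy structure tags unchanged."""
    out = []
    # split() with a capturing group also returns the structure tags.
    for part in STRUCTURE_TAG.split(markup):
        if STRUCTURE_TAG.fullmatch(part) or not part.strip():
            out.append(part)  # structure tag or whitespace: leave alone
        else:
            out.append(TU_OPEN + part + TU_MID + TU_CLOSE)  # empty target
    return "".join(out)


def clean(bilingual):
    """Replace each TU by its target, or by its source if untranslated."""
    return TU.sub(lambda m: m.group(2) or m.group(1), bilingual)
```

Since no well-formedness check is made, an unclosed <p> is simply
carried along, and clean() on an untranslated file gives back the
original. Attribute values containing ">" would confuse this simple
pattern.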
> An industry-standard tagged bilingual file format would be a major
> breakthrough. I am currently in the position of arguing vehemently
> that TMX, and not Trados' native translation memory format, should
> be regarded as the industry-standard translation memory format.
We all agree about TMX. The question that remains is about the bilingual
file format.
> Trados though, with its "uncleaned file" format, has a format for
> which there is no industry-standard equivalent, and so the Trados
> format can effectively claim this status by default. :-(
Trados' format is either:
- nicely based on character styles for MS Word / RTF files (but we don't
want to work within an MS Word framework, do we?)
- tagged (proprietary) for HTML / XML files - with a deliberate and
blatant incompatibility between Trados 3 (.BIF) and Trados 5 and later.
Wordfast cleverly clones Trados (Word version) on this, but has no
tagged format for HTML/XML files, as it preps them so that they can
also be translated with MS Word. Yves will correct me if I'm wrong.
> However, I find it difficult to conceive of an industry-standard
> tagged bilingual file format in the absence of an industry-standard
> tagged (monolingual) word processing file format.
Of course we must end up designing our own.
For me, the question is how to design something neatly and efficiently,
starting from what is readily available.
> If, for the sake of argument, OpenOffice.org's file format (which is
> at least open, documented, extensible, and has been submitted to the
> W3C for formal recognition as a standard) is accepted as the standard
> for a *monolingual* word processing file format, the step to a tagged
> bilingual file format is trivial.
Exactly. I believe this is what we need, for the following obvious reasons:
- It's an open, XML-based, tagged document format - certainly the best
one available today.
- Keith's OmegaT already understands it pretty well.
> It may well be possible to add such functionality with no alteration
> to the OOo code, purely by modification of the XML mechanisms (DTD
> etc.).
I request a vote from the project team, as I believe we could all agree
on this.
(...)
> Why wait for the appearance of a bilingual file format? There are
> lots of conversion filters which would be advantageous in their own
> right. .po to TMX, for example, and vice-versa, would be beneficial
> to OmegaT - I think that benefit is independent of a bilingual file
> format. Even TMX2 to TMX1 would be an advantage. It may well be
> that such filters already exist.
I'm not sure I fully understood you.
The way I understand it, any conversion filter should work between a
given native file format and our own bilingual format, and the CAT
software should only care about properly translating bilingual format
files. If we don't go that way, how would we proceed?
Of course, the above comes from a restricted mind (mine), in that
OpenOffice.org already provides a nice bunch of conversion filters.
Therefore, the obvious goal seems to be to introduce our own
TU-level and segment-level (within TU) tags within the existing OOo
Writer file format in a way that will, as much as possible, avoid
disturbing OOo.
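
The filter arrangement described above (each filter converts between
one native format and the bilingual format, and the CAT tool only ever
sees bilingual files) could be sketched in Python roughly as follows.
This is only an illustration: the Filter interface, the
{{source||target}} markers and the pick_filter() helper are made up for
this sketch, and a trivial plain-text filter stands in for a real one
such as an OOo Writer filter.

```python
import os
import re
from abc import ABC, abstractmethod

# Hypothetical single-line segment markers: {{source||target}}
SEGMENT = re.compile(r"\{\{(.*?)\|\|(.*?)\}\}")


class Filter(ABC):
    """One filter per native format; the CAT tool sees only bilingual text."""

    extensions = ()

    @abstractmethod
    def to_bilingual(self, native):
        """Convert a native document into the bilingual format."""

    @abstractmethod
    def from_bilingual(self, bilingual):
        """Convert a (translated) bilingual document back to native form."""


class PlainTextFilter(Filter):
    """Toy filter: every non-blank line becomes one segment."""

    extensions = (".txt",)

    def to_bilingual(self, native):
        return "\n".join(
            "{{" + line + "||}}" if line.strip() else line
            for line in native.split("\n")
        )

    def from_bilingual(self, bilingual):
        # Keep the target when present, otherwise fall back to the source.
        return SEGMENT.sub(lambda m: m.group(2) or m.group(1), bilingual)


# Registry mapping file extensions to filter instances.
FILTERS = {ext: f for f in (PlainTextFilter(),) for ext in f.extensions}


def pick_filter(filename):
    """Look up the filter for a file by its extension (KeyError if none)."""
    return FILTERS[os.path.splitext(filename)[1].lower()]
```

Supporting a new native format then means adding one more Filter
subclass to the registry, with no change to the translation tool. Text
that already contains the marker characters would need escaping, which
this sketch does not attempt.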
The way we implement these filters is another problem and may somewhat
depend on which tool we want to end up with. Performance is not much of
an issue, but portability is, as is ease of use - integration within
our translation tool, whether it's OmegaT in its present form or
directly within OOo Writer.
FYI, I'm presently trying to contact people at Sun in order to raise
interest in our efforts and to obtain some help.
> On the subject of conversion filters, some initial work was done
> within the Semerkent project, see:
> http://sourceforge.net/projects/semerkent/
It looks like they have now merged with http://www.gtranslator.org/.
So - thanks to you, Marc - I just discovered Yet Another Free CAT
software project. Am I wrong in assuming some of their efforts might
partially overlap with ours? Have you contacted their team yet?
> - though I agree with Simos that scripts are a much more practical
> solution as I believe learning Perl or tcl/tk in order to manipulate
> plain text formats such as XML is within the realms of many
> translators' abilities.
> Learning C for this purpose is a different proposition.
Well, in this case, you probably mean C++, as C does not even have a
string type, if I remember correctly.
Another solution for quickly building up filters would be Python.
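
To give an idea of how short such a filter can be in Python, here is a
minimal sketch of the .po to TMX conversion mentioned earlier in this
thread. It is an illustration under stated assumptions, not a finished
tool: it only handles single-line msgid/msgstr pairs, does not unescape
PO escape sequences, and the function name po_to_tmx and the header
values (creationtool, o-tmf and so on) are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree writes this attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplification: only single-line msgid/msgstr pairs are recognised.
ENTRY = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)


def po_to_tmx(po_text, src_lang, tgt_lang):
    """Return a TMX document (as a string) built from translated PO entries."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "po2tmx-sketch",   # made-up tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "po",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in ENTRY.findall(po_text):
        if not source or not target:
            continue  # skip the PO header entry and untranslated strings
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

A call such as po_to_tmx(open("fr.po", encoding="utf-8").read(), "en",
"fr") would return the TMX text; the file name is only an example.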
Cheers,
Henri