[Freecats-Dev] Bilingual File Format (again)

From:	Henri Chorand
Subject:	[Freecats-Dev] Bilingual File Format (again)
Date:	Tue, 01 Apr 2003 21:43:31 +0200
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Hi Marc,

> Firstly, thank you to all members of Free CATS for your confidence in
> the future of the OmegaT project.

Three is a crowd, as Buster Keaton (I think) once said ;-)

> Keith will no doubt begin making contributions of his own to the list
> in due course, but at the moment both he and I are under time pressure
> owing to other activities. Please be patient!

No problem - we are all very busy people.

Now, concerning the Famous Bilingual file format:

When I was thinking about starting from scratch, I wanted to come up with a very simple, yet sufficiently extensible, design:

1) Being ignorant about XML, apart from its most basic principles, I still wanted it to be XML-based, that is, tag-based - XML is the future, or so I heard everybody knowledgeable say.

2) I knew it had to include formatting info. I first looked for tagging info in the XML specs and asked a few techies around, only to find out that, in themselves, the XML specs do not say anything about it. I then supposed we could use HTML's set of formatting tags, at least for a start (I assumed that for the more complicated formatting tags found in some DTP & word-processing software, we could simply find a way to keep them unchanged and end up with a pseudo-WYSIWYG approach that would be good enough for us translators).

3) I also knew that we had to parse any XML source file sequentially, in a "dumb" way (only caring about its text & formatting contents and leaving its structure unchanged, even and especially if it was supposedly weird or malformed, and all the more since most existing HTML files found today ARE badly formed from XML's point of view).

4) I spoke to Thierry about it, and it emerged that we could envision a bilingual file format made up of our own custom tags (beginning of TU (including ancillary TU info), middle of TU, end of TU). We would have kept the internal tags (Trados' tw4winInternal style) within the TUs' source & target segments, and proudly left unchanged all "structure" tags (Trados' tw4winExternal style).
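
To make point 4 a bit more concrete, here is a minimal Python sketch of such a marker scheme. The marker names (fc:tu, fc:mid) and the three helper functions are purely hypothetical: nothing in this thread has settled on an actual tag set. It only illustrates the idea that TU markers are placed around a segment, inline tags inside the segment are kept as they are, and everything outside the markers is left untouched.

```python
import re

# Hypothetical marker names, chosen for illustration only.
TU_OPEN = '<fc:tu id="{id}">'   # beginning of TU (ancillary info could go here)
TU_MID = '<fc:mid/>'            # middle of TU: separates source from target
TU_CLOSE = '</fc:tu>'           # end of TU

TU_PATTERN = re.compile(
    r'<fc:tu id="(\d+)">(.*?)<fc:mid/>(.*?)</fc:tu>', re.DOTALL
)


def wrap_tu(tu_id, source, target=""):
    """Wrap one segment; inline tags in source/target are not modified."""
    return TU_OPEN.format(id=tu_id) + source + TU_MID + target + TU_CLOSE


def read_tus(text):
    """Return (id, source, target) tuples in document order."""
    return [(int(m.group(1)), m.group(2), m.group(3))
            for m in TU_PATTERN.finditer(text)]


def clean(text):
    """Replace each TU by its target (or its source if still untranslated).

    Everything outside the TU markers passes through unchanged.
    """
    return TU_PATTERN.sub(lambda m: m.group(3) or m.group(2), text)


# Usage: the <p> element is a "structure" tag, <b> is an internal tag.
doc = '<p>' + wrap_tu(1, 'Click <b>OK</b>.', 'Cliquez sur <b>OK</b>.') + '</p>'
print(read_tus(doc))  # [(1, 'Click <b>OK</b>.', 'Cliquez sur <b>OK</b>.')]
print(clean(doc))     # <p>Cliquez sur <b>OK</b>.</p>
```

A regular expression is used here only because it matches the "dumb", sequential approach of point 3; a real implementation might well prefer a proper tokenizer.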

> An industry-standard tagged bilingual file format would be a major
> breakthrough. I am currently in the position of arguing vehemently
> that TMX, and not Trados' native translation memory format, should
> be regarded as the industry-standard translation memory format.

We all agree about TMX. The question that remains is about the bilingual file format.


> Trados though, with its "uncleaned file" format, has a format for
> which there is no industry-standard equivalent, and so the Trados
> format can effectively claim this status by default.  :-(

Trados' format is either:

- nicely based on character styles for MS Word / RTF files (but we don't want to work within an MS Word framework, do we?)
- tagged (proprietary) for HTML / XML files - with a deliberately blatant incompatibility between Trados 3 (.BIF) & Trados 5 and later.

Wordfast cleverly clones Trados (Word version) on this, but has no tagged format for HTML/XML files as it preps them in order to allow them to be also translated with MS Word. Yves will correct me if I'm wrong.

> However, I find it difficult to conceive of an industry-standard
> tagged bilingual file format in the absence of an industry-standard
> tagged (monolingual) word processing file format.

Of course we must end up designing our own.

For me, the question is how to nicely & efficiently design something, starting from what is readily available.

> If, for the sake of argument, OpenOffice.org's file format (which is
> at least open, documented, extensible, and has been submitted to the
> W3C for formal recognition as a standard) is accepted as the standard
> for a *monolingual* word processing file format, the step to a tagged
> bilingual file format is trivial.

Exactly. I believe this is what we need, for the following obvious reasons:

- It's an open, XML-based, tagged document format - certainly the best one available today.


- Keith's OmegaT already understands it pretty well.

> It may well be possible to add such functionality with no alteration
> to the OOo code, purely by modification of the XML mechanisms (DTD
> etc.).

I request a vote from the project team, as I believe we could all agree on this.

(...)
> Why wait for the appearance of a bilingual file format? There are
> lots of conversion filters which would be advantageous in their own
> right. .po to TMX, for example, and vice-versa, would be beneficial
> to OmegaT - I think that benefit is independent of a bilingual file
> format. Even TMX2 to TMX1 would be an advantage. It may well be
> that such filters already exist.

I'm not sure I fully got you.

The way I understand it, any conversion filter should work between a given native file format and our own bilingual format, and the CAT software should only care about properly translating bilingual format files. If we don't go that way, how do we proceed?
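
A rough Python sketch of that division of labour follows. All names in it (ConversionFilter, PlainTextFilter, translate) are made up for illustration, and the "bilingual format" is reduced to a list of source/target pairs; it is an assumption about how the pieces could fit together, not a description of any existing tool.

```python
from abc import ABC, abstractmethod


class ConversionFilter(ABC):
    """One filter per native format; the CAT tool never sees native files."""

    @abstractmethod
    def to_bilingual(self, native):
        """Convert native content into a list of translation units."""

    @abstractmethod
    def from_bilingual(self, units):
        """Rebuild native content from (possibly translated) units."""


class PlainTextFilter(ConversionFilter):
    """Toy filter: one line of plain text equals one segment."""

    def to_bilingual(self, native):
        return [{"source": line, "target": ""} for line in native.split("\n")]

    def from_bilingual(self, units):
        # Fall back to the source text where no translation exists yet.
        return "\n".join(u["target"] or u["source"] for u in units)


def translate(units, memory):
    """Stand-in for the CAT tool: it only ever handles bilingual units."""
    for unit in units:
        unit["target"] = memory.get(unit["source"], "")
    return units


# Usage: native -> bilingual -> (translation) -> native
flt = PlainTextFilter()
units = translate(flt.to_bilingual("Hello\nWorld"), {"Hello": "Bonjour"})
print(flt.from_bilingual(units))  # prints "Bonjour" then "World"
```

With this layout, supporting a new file format only means writing one more filter class; the translation side stays the same.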

Of course, the above comes from a restricted mind (mine), in that OpenOffice already provides a nice bunch of conversion filters.

Therefore, the obvious goal seems to be able to introduce our own TU-level and segment-level (within TU) tags within the existing OO Writer file format in a way that will, as much as possible, avoid disturbing OO.

The way we implement these filters is another problem and may depend somewhat on which tool we want to end up with. Performance is not much of an issue; portability is, as well as ease of use - integration within our translation tool, whether it's OmegaT in its present form or directly within OO Writer.

FYI, I'm presently trying to contact people at Sun in order to raise interest in our efforts and to obtain some help.

> On the subject of conversion filters, some initial work was done
> within the Semerkent project, see:


http://sourceforge.net/projects/semerkent/


It looks like they have now merged with http://www.gtranslator.org/.

So - thanks to you, Marc - I just discovered Yet Another Free CAT software project. Am I wrong in assuming some of their efforts might partially overlap ours? Have you contacted their team yet?

> - though I agree with Simos that scripts are a much more practical
> solution as I believe learning Perl or tcl/tk in order to manipulate
> plain text formats such as XML is within the realms of many
> translators' abilities.

> Learning C for this purpose is a different proposition.

Well, in this case, you probably mean C++, as C does not even have a string type, if I remember correctly.

Another solution for quickly building up filters would be Python.
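
To give an idea of how little code such a filter could need, here is a rough Python sketch of the .po to TMX conversion mentioned above. It makes several simplifying assumptions: only single-line msgid/msgstr entries are handled, escape sequences and plural forms are ignored, the language codes are passed in by hand, and the function names and header values (tool name, o-tmf) are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name out as "xml:lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Matches only the simplest case: msgid and msgstr each on a single line.
ENTRY = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)


def po_pairs(po_text):
    """Yield (source, target) for translated single-line entries.

    The PO header entry (empty msgid) and untranslated entries are skipped.
    """
    for msgid, msgstr in ENTRY.findall(po_text):
        if msgid and msgstr:
            yield msgid, msgstr


def po_to_tmx(po_text, src_lang="en", tgt_lang="fr"):
    """Return a TMX document (as a string) built from PO text."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "po2tmx-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "po",                     # illustrative value
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in po_pairs(po_text):
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Usage
sample = 'msgid ""\nmsgstr ""\n\nmsgid "File"\nmsgstr "Fichier"\n'
print(po_to_tmx(sample))
```

A real filter would need a proper PO parser (multi-line strings, comments, plurals), but the overall shape would stay much the same.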


Cheers,

Henri
