Re: [Aramorph-users] XML tables

On 8/11/05, Pierrick Brihaye <address@hidden> wrote:

Hi,

... because I'm just back from holidays :-)

Welcome back.

Prefixes :

<entry>
  <unvocalized>f</unvocalized>
  <vocalized>fa</vocalized>
  <morphological-category>Pref-Wa</morphological-category>
  <glosses>
   <gloss>and;so</gloss>
  </glosses>
  <grammatical-categories>
   <grammatical-category>fa/CONJ</grammatical-category>
  </grammatical-categories>
</entry>

Here, you have *two* glosses :
<gloss>and</gloss>
<gloss>so</gloss>

But I understood from the code that glosses are separated by "+", not ";". See this line of code:

array = gloss.split("\\+");
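
To make the difference concrete, here is a minimal sketch of what each separator yields for the gloss string in the entry above. The class and method names are mine, purely for illustration, and are not taken from the AraMorph sources; only the split("\\+") call comes from the code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of AraMorph: splits a gloss string on a
// given separator regex so the two conventions can be compared.
class GlossSplitSketch {
    static List<String> split(String gloss, String separatorRegex) {
        return Arrays.asList(gloss.split(separatorRegex));
    }

    public static void main(String[] args) {
        // "+" is a regex metacharacter, so it has to be escaped as "\\+".
        System.out.println(split("and;so", "\\+")); // [and;so]
        System.out.println(split("and;so", ";"));   // [and, so]
    }
}
```

So with the "+" separator the string "and;so" stays a single gloss, and only a ";" split gives the two glosses you describe.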

Regarding :
  <grammatical-categories>
   <grammatical-category>>bi/PREP</grammatical-category>
  </grammatical-categories>

... it is not well-formed XML (because of the ">" character that must be
escaped).

This error is there because I wrote this XML snippet by hand. Of course, when I use an XML Document to write the XML file, all these special characters are escaped automatically by the XML serializer.
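
As a rough sketch of what I mean, assuming the standard JAXP DOM and Transformer classes (the real writer may use a different API, and the class name below is made up for the example): the text is handed to the document unescaped, the serializer takes care of the markup characters, and parsing the output gives the original text back.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class; element names follow the snippet in this thread.
class EscapingSketch {
    // Builds <grammatical-categories><grammatical-category>text</...></...>
    // and lets the serializer escape whatever needs escaping.
    static String serialize(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element categories = doc.createElement("grammatical-categories");
            Element category = doc.createElement("grammatical-category");
            category.setTextContent(text); // raw, unescaped text goes in
            categories.appendChild(category);
            doc.appendChild(categories);

            Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the serialized XML again and returns the text content.
    static String readBack(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = serialize(">bi/PREP");
        System.out.println(xml);
        System.out.println(readBack(xml).equals(">bi/PREP"));
    }
}
```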

Stems :

<root materila="Ab">
is inconsistent with :
<!ATTLIST root material CDATA #REQUIRED>

This is another error that comes from writing the XML snippet by hand.
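
A validating parse would have caught it. Below is a small sketch, assuming the JAXP DOM parser with DTD validation switched on. The inline DTD is cut down for the example (declaring root as EMPTY is my simplification, not the real stems document type), and the class name is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical example class: validates a single <root .../> element
// against a tiny internal DTD and collects the validity errors.
class DtdCheckSketch {
    // Simplified DTD; only the ATTLIST line is taken from this thread.
    static final String DTD =
          "<!DOCTYPE root [\n"
        + "<!ELEMENT root EMPTY>\n"
        + "<!ATTLIST root material CDATA #REQUIRED>\n"
        + "]>\n";

    static List<String> validationErrors(String rootElement) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + rootElement)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // The misspelled attribute is reported; the correct one is not.
        System.out.println(validationErrors("<root materila=\"Ab\"/>"));
        System.out.println(validationErrors("<root material=\"Ab\"/>"));
    }
}
```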

Furthermore, I suggest you be consistent with what is in the prefix
dictionary (I mean, share the document types as much as possible) :

So, I'd prefer

<entry root="Ab">

(mark the "root" attributes as required).

Do you mean that all entries are marked with the root attribute? So what about the hierarchy of the stems dictionary? Please give me more information for this point.

<!ATTLIST lemma lemma-id CDATA #REQUIRED>

An id attribute should be enough.

That's an easy one.

Well, that's all for my *quick* answer, but remember that the dictionary
format may be complex (see :
http://www.nongnu.org/aramorph/english/dictionaries.html).

> Till now I have transformed all dictionaries and tables to XML and
> also translated them to Arabic.

Is your XML valid ? Given the code above, it is doubtful...

I hope it is valid. Anyway, I tried it with a small text file and it worked fine. But it needs more testing, of course.

> I have also made an
> XMLDictionaryHandler which parses XML tables, using digester from
> Jakarta commons, and loads them into memory.

What does the digester add to a standard XML parser ?

Digester is event-based (it is built on SAX). That is faster and needs less memory than building a full document tree when the XML file is parsed only once. The dictStems.xml file is about 32 MB! Loading all of it into memory would almost certainly cause an OutOfMemoryError. In fact, I hit that error while writing the XML file, and I had to increase the memory available to the JVM.
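
To illustrate the event-based style, here is a minimal sketch using plain SAX from the JDK rather than Digester itself, so that it runs without extra jars. The class name and the "entry" element name are assumptions for the example, not the actual XMLDictionaryHandler.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements while streaming through the
// input, without ever holding the whole document in memory.
class StreamingCountSketch {
    static int countElements(InputStream in, final String elementName) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called once per opening tag as the parser reads the stream.
                if (qName.equals(elementName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    // Convenience overload for small in-memory samples.
    static int countElements(String xml, String elementName) {
        return countElements(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            elementName);
    }

    public static void main(String[] args) {
        String sample = "<stems><entry/><entry/><entry/></stems>";
        System.out.println(countElements(sample, "entry")); // 3
    }
}
```

For the real file one would pass a FileInputStream instead of the sample string; the handler only ever sees one event at a time, whatever the file size.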

> There's a small problem: I have changed some code in the
> Solution and other classes which romanize words. Now we don't have to
> romanize words because dictionaries are Arabic.

Yes :-) That was in the TODO list.

The problem is that
> InMemoryDictionaryHandler will not work unless it romanizes the input
> text before searching dictionaries.
> If you need the new changes please let me know, and I will send them ASAP.

Your patch will be welcome.

I will send it to you after fixing the points you have mentioned.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:address@hidden
+33 (0)2 99 29 67 78

From:	Ahmed El-dawy
Subject:	Re: [Aramorph-users] XML tables
Date:	Thu, 11 Aug 2005 23:00:54 +0300