

Re: [Aramorph-users] XML tables


From: Pierrick Brihaye
Subject: Re: [Aramorph-users] XML tables
Date: Thu, 11 Aug 2005 22:26:32 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; fr-FR; rv:1.7) Gecko/20040608

Hi,

Ahmed El-dawy wrote:

    ... because I'm just back from holidays :-)

Welcome back.

Thanks !

    Here, you have *two* glosses :
    <gloss>and</gloss>
    <gloss>so</gloss>

But I understood from the code that glosses are separated by (+), not (;). See this line of code:
array = gloss.split("\\+");

Oooops ! Sorry, you're right : one gloss with 2 words here.
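To make the point about the separator concrete, here is a minimal sketch of how the quoted split() call behaves. The class name and the sample gloss strings are made up for illustration; only the split("\\+") call itself comes from the code quoted above.

```java
import java.util.Arrays;

class GlossSplitDemo {
    // Splits a gloss string on the literal '+' character.
    // '+' is a regex quantifier, so it has to be escaped as "\\+".
    static String[] splitGloss(String gloss) {
        return gloss.split("\\+");
    }

    public static void main(String[] args) {
        // Hypothetical gloss values, not taken from the real dictionaries.
        System.out.println(Arrays.toString(splitGloss("and+so"))); // [and, so]
        System.out.println(Arrays.toString(splitGloss("and;so"))); // [and;so]
    }
}
```

A string containing only ";" comes back as a single element, which is why it counts as one gloss with two words.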

    ... it is not well-formed XML (because of the ">" character that must be
    escaped).
This error crept in because I wrote this XML snippet by hand. Of course, when I use an XML Document to write the XML file, all these special characters are escaped automatically by the XML serializer.

Fine !
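For the record, a small sketch of what "escaped automatically by the XML serializer" can look like with the standard JAXP classes. This is an assumption about the approach, not Ahmed's actual code; the class name, the method name and the sample text are hypothetical, and only the gloss element name comes from this thread.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class EscapeDemo {
    // Builds a one-element DOM document and serializes it to a string.
    // The serializer, not the caller, takes care of escaping markup characters.
    static String serializeGloss(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element gloss = doc.createElement("gloss");
        gloss.setTextContent(text);
        doc.appendChild(gloss);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical text containing characters that are special in XML.
        System.out.println(serializeGloss("a < b & c > d"));
    }
}
```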

    Stems :

    <root materila="Ab">
    is inconsistent with :
    <!ATTLIST root material CDATA #REQUIRED>
This is another error caused by writing the XML snippet by hand.

Fine. No problem...

    So, I'd prefer

    <entry root="Ab">

    (mark the "root" attributes as required).
Do you mean that all entries are marked with the root attribute?

Only in the stems dictionary (even though some prefixes may be linguistically derived from roots).

So what about the hierarchy of the stems dictionary? Please give me more information for this point.

See : http://www.nongnu.org/aramorph/english/dictionaries.html#Stems. The root "ktb" (";--- ktb" in the file) has *many* lemmas. However, keeping a trace of it may help in writing a root analyzer (useful for linguists ;-).

    <!ATTLIST lemma lemma-id CDATA #REQUIRED>

    an id attribute should be enough.

That's an easy one.

... and a good practice ;-)

    Is your XML valid ? Given the code above, it is doubtful...

I hope it is valid.

An XML parser would complain if not.
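As a rough illustration of how a validating parser complains, here is a sketch that checks a document against an internal DTD subset. The ATTLIST declaration and the misspelled "materila" attribute are the ones quoted earlier in this message; the EMPTY content model, the class name and the method name are assumptions made only to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationDemo {
    // Assumed DTD: only the ATTLIST line comes from the thread.
    static final String DTD =
            "<!DOCTYPE root [\n"
            + "<!ELEMENT root EMPTY>\n"
            + "<!ATTLIST root material CDATA #REQUIRED>\n"
            + "]>\n";

    // Parses the document with DTD validation switched on and returns
    // the validity problems reported by the parser (empty if valid).
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(e.getMessage()); }
            public void error(SAXParseException e) { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // The misspelled attribute violates the DTD; the correct one does not.
        System.out.println(validate(DTD + "<root materila=\"Ab\"/>"));
        System.out.println(validate(DTD + "<root material=\"Ab\"/>"));
    }
}
```

Note that validity errors are recoverable as far as the parser is concerned, so without an ErrorHandler they are easy to miss.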

     > I have also made an
     > XMLDictionaryHandler which parses XML tables, using digester from
     > Jakarta commons, and loads them into memory.

    What does the digester add to a standard XML parser?

Digester is event-based. This is faster and requires less memory when the XML file is read only once. The dictStems.xml file is about 32 MB! Loading it all into memory would almost certainly cause an out-of-memory error.

Eeeer... the Java *standard* SAX parser does it, doesn't it ? A SAX parser is really the thing we need here : big file, poor structure.
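Something along these lines is what I have in mind: a minimal sketch using the standard SAX API, which streams the file and keeps only what the handler chooses to keep. The entry element and its root attribute follow the suggestion made earlier in this message; the stems wrapper element, the class name and the idea of counting entries per root are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StemsSaxDemo {
    // Streams the document once and counts <entry> elements per root.
    // Nothing but the small counter map is held in memory.
    static Map<String, Integer> countEntriesPerRoot(InputStream in) throws Exception {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("entry".equals(qName)) {
                    String root = attrs.getValue("root");
                    if (root != null) {
                        counts.merge(root, 1, Integer::sum);
                    }
                }
            }
        });
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Tiny hypothetical sample; the real file would be opened as a stream.
        String xml = "<stems><entry root=\"ktb\"/><entry root=\"ktb\"/>"
                + "<entry root=\"Ab\"/></stems>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countEntriesPerRoot(in));
    }
}
```

For the real dictionary, the handler would fill the in-memory tables instead of a counter, but the parsing loop would stay the same.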

I will send it to you after fixing the points you have mentioned.

Great !

BTW, still as a quick answer: I think that the 3 compatibility tables could be merged into a single file.

Best regards,

p.b.



