From: Greg Chicares
Subject: Re: [lmi] an xml schema for (single|multiple)_cell_document file XML format
Date: Mon, 27 Feb 2012 12:44:46 +0000
User-agent: Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20111105 Thunderbird/8.0

On 2010-08-10 14:34Z, Vadim Zeitlin wrote:
> On Tue, 10 Aug 2010 10:41:36 +0000 Greg Chicares <address@hidden> wrote:
> 
> GC> | Issue A (major):
> GC> | ===============
> GC> | The current format contains only 'cell' nodes which represent cases,
> GC> | class and cells. To specify the number of nodes of each type helper nodes
> GC> | 'NumberOfCases' and 'NumberOfClasses' are used. Each of 'NumberOfXXX'
> GC> | is a positive integer number N, which is followed by exactly
> GC> | N 'cell' nodes.
> GC> 
> GC> The 'NumberOf' elements aren't really appropriate in xml.
> GC> 
> GC> | A simple workaround would be to rename the 'cell' nodes into
> GC> | the corresponding cell type: 'case', 'class', 'cell'. This allows
> GC> | to fix the document node structure and to get rid of the redundant
> GC> | nodes 'NumberOfXXX'.
> GC> 
> GC> These three categories must be distinguished somehow. I'm inclined to add
> GC> an attribute or a subelement. Changing the main element tag seems drastic.
> ...
> GC> Alternatively, use enclosing elements instead of delimiters,
> GC> transforming the present '.cns' format:
> 
>  Having an attribute would work but IMHO using the enclosing elements would
> be better.

Done 20120220T0158Z, revision 5402:
  http://svn.savannah.nongnu.org/viewvc?view=rev&root=lmi&revision=5402

That exercise was unexpectedly interesting. I started with a simple
"use enclosing elements" change, essentially as described here:
  http://lists.nongnu.org/archive/html/lmi/2010-08/msg00015.html
That made loading a file too slow: I could feel it plainly even before
I measured it. The counter displayed on the statusbar paused noticeably
after loading about 32 cells, then about 64, then about 128--whereas it
incremented smoothly for the old file format. It turns out that knowing
the size in advance lets us call std::vector::reserve() so that the
initial capacity is sufficient and expensive reallocations are avoided.

The final code in the repository is as fast and smooth as the original
because it writes the enclosing elements with a size attribute, e.g.:
  <particular_cells size_hint="180">
and reserves the hinted number of elements before reading them. That
attribute is optional; omitting it affects speed, but not correctness.
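
To illustrate the effect, here is a minimal sketch, not the code in the
repository. The names 'Cell', 'LoadResult', and 'load_cells' are
hypothetical, and a hint of zero stands in for an absent size_hint
attribute. It counts how often the vector's capacity changes while cells
are appended:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one input cell; not lmi's actual class.
struct Cell
{
    std::string name;
};

struct LoadResult
{
    std::vector<Cell> cells;
    std::size_t reallocations; // times capacity changed during push_back
};

// Append 'count' cells. A nonzero 'size_hint' plays the role of the
// optional size_hint attribute; zero means the attribute was omitted.
LoadResult load_cells(std::size_t count, std::size_t size_hint = 0)
{
    LoadResult r{{}, 0};
    if(0 != size_hint)
    {
        // Allocate once, up front, instead of growing repeatedly.
        r.cells.reserve(size_hint);
    }
    for(std::size_t i = 0; i < count; ++i)
    {
        std::size_t const old_capacity = r.cells.capacity();
        r.cells.push_back(Cell{"cell " + std::to_string(i)});
        if(r.cells.capacity() != old_capacity)
        {
            ++r.reallocations;
        }
    }
    return r;
}
```

With an accurate hint, load_cells(180, 180) performs no reallocation in
the loop. With no hint, or one that is too small, the vector still ends
up holding all 180 cells; it just grows several times along the way.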

> GC> | Issue B (minor):
> GC> | ===============
[...input sequences aren't very XML-ish, but...]
> GC> If that's difficult to validate
> GC> with XSD, so be it.
> 
>  Yes, I think we'll just have to leave this as is. We could probably split
> the sequence into parts and validate it at least partially but I don't
> think it's worth it.

I continue to concur (unless Vaclav doesn't).

> GC> | Issue D (minor):
> GC> | ===============
> GC> | Enum element values could contain '_' instead of spaces (' ').
> GC> 
> GC> In the past, they could. Now we generally avoid that; for instance, solve
> GC> types include:
> GC>     "Endowment"
> GC>     "Target CSV"
> GC>     "CSV = tax basis"
> GC>     "Avoid MEC"
> GC> which make sense to end users, who would find "Avoid_MEC" weird.
> 
>  I don't understand why should the end users look at XML files though. And
> while using spaces in XML is possible, I'd indeed prefer to avoid it as
> meaningful whitespace in any text format is just a recipe for trouble
> (let me add "unless it's limited to the start of line" to preventively
> defend myself from anti-Pythonic accusations). So why do we have to use the
> same strings in XML and in the user-visible places?

Vaclav correctly inferred what I'd answer, here:

http://lists.nongnu.org/archive/html/lmi/2010-08/msg00017.html
| "because you only
| have to maintain one value-to-string mapping that way"
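
As a rough sketch of that idea, and not lmi's actual implementation,
suppose a single table of strings serves both the GUI and the xml file.
The enumeration and function names below are hypothetical; the strings
are the solve types quoted above:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical enumeration; the enumerators follow the table's order.
enum class SolveType
{
    endowment,
    target_csv,
    csv_equals_tax_basis,
    avoid_mec
};

// The one and only value-to-string mapping.
static char const* const solve_type_names[] =
    {"Endowment"
    ,"Target CSV"
    ,"CSV = tax basis"
    ,"Avoid MEC"
    };

static std::size_t const solve_type_count =
    sizeof solve_type_names / sizeof solve_type_names[0];

// Used both to label GUI controls and to write element content.
std::string solve_type_to_string(SolveType t)
{
    return solve_type_names[static_cast<std::size_t>(t)];
}

// Used both to interpret GUI input and to read element content.
SolveType solve_type_from_string(std::string const& s)
{
    for(std::size_t i = 0; i < solve_type_count; ++i)
    {
        if(s == solve_type_names[i])
        {
            return static_cast<SolveType>(i);
        }
    }
    throw std::invalid_argument("unknown solve type: " + s);
}
```

A second, underscore-separated spelling for the xml file would need its
own table and its own pair of conversions, to be kept in sync with this
one.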
