[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Spreadsheet support in core Octave instead of io package?
From: |
Philip Nienhuis |
Subject: |
Re: Spreadsheet support in core Octave instead of io package? |
Date: |
Sun, 19 Oct 2014 22:25:03 +0200 |
User-agent: |
Mozilla/5.0 (X11; Linux i686; rv:25.0) Gecko/20100101 Firefox/25.0 Iceape/2.22 |
John W. Eaton wrote:
xlsread, xlswrite, and xlsfinfo are all core functions in Matlab. I'd
like to make them core functions in Octave as well, as I suspect many
users would like to have these functions available by default without
having to install a separate package. Is there any objection to moving
them from the io package to core Octave?
I'd also consider moving the odsread, odswrite, and odsfinfo functions
to core Octave at the same time, though I also wonder whether it might
make sense to just put the XLS and ODS capabilities in a single function
instead of splitting them up in separate functions.
Comments?
No of course I have no objections, au contraire. But I think it does
require a bit more preparation and planning than might appear at first
sight.
Some questions, sorry :-) plus some comments and clarifications.
Apologies for an extremely long post.
Timing / planning
-----------------
1. Do you want to integrate these functions in the default branch
(future 4.2+)? I suppose 4.0's (gui-release) target is only, well, the GUI.
- That would imply continued maintenance of the io package as it stands
until 4.2 is official, maybe a little later for those users who cannot
upgrade so fast. (I expect end of 2015 at the earliest for 4.2). Anyway
I see no problems there as maintainer.
- That would also give time (for me) to polish the spreadsheet stuff
before inclusion into core Octave. That polishing is a long-standing
wish of mine anyway, lots of duplicate code sneaked in that I want to
eliminate.
I suppose parts of polished spreadsheet I/O could already be transferred
(copied) early next year.
- I'd rather fiddle in an OF package than in core Octave.
All in all a delay would be welcome.
Choice of interfaces
--------------------
"Interface" refers to the external dependencies for some particular file
format, or set of file formats.
The various interfaces and file formats constitute a fairly complicated
patchwork of partly overlapping sets. Below is the latest situation as
copied from the io-2.2.2 NEWS.
Additional complications are that each interface also has its pros and
cons, rate of matureness, and license. So there's the main reason that
the io package has so "many" interfaces.
Interface
File extension COM POI POI/OOXML JXL OXS UNO OTK JOD OCT
--------------------------------------------------------------
.xls (Excel95) R R R
.xls (Excel97-2003) + + + + + +
.xlsx (Excel2007+) ~ + (+) R +
.xlsb, .xlsm ~ ? R R?
.wk1 + R
.wks + R
.dbf + +
.ods ~ + + + +
.sxc + +
.fods +
.uos +
.dif +
.csv + R
.gnumeric +
--------------------------------------------------------------
(~ = dependent on Excel/LO/OOo version); + = read/write; R = only
reading )
So we have:
* No external dependencies (but limited flexibility):
"OCT", with support for only a limited number of XML-based file formats:
.xlsx (OOXML, or Excel 2007+), ,ods (LibreOffice) and .gnumeric.
.xls (Excel'97) is still lacking, a little unfortunate as there's so
much of that format out there, and still so much legacy SW that can only
produce .xls.
Markus and I are looking into Mac OSX' Numbers file format, I did peek
into oleo and applixware a while ago (looked promising. It is all XML).
* Java-based, using ready-baked Java class libs (.jar files). Offers
much more ready-baked flexibility (like deleting data, inserting graphs,
cell formatting) for those users that are prepared to dive in the
relevant javadocs.
- "POI" (Apache POI), only support for .xls (BIFF8 / Excel'97) and .xlsx
- "JXL" (JExcelAPI), supports .xls BIFF8 plus reading .xls (BIFF5 /
Excel'95)
- "OXS" (OpenXLS), supports .xls (BIFF8) plus buggy .xlsx
- "OTK" (ODF Toolkit), supports .ods (LibreOffice)
- "JOD" (JOpenDocument), supports .ods and (reading) .sxc (old
StarOffice/OpenOffice.org)
* Java-based, invoking LibreOffice/OpenOffice.org:
- "UNO" (Java UNO bridge). Supports in principle any file format that LO
or OOo can process, but not all has been implemented in the io package.
Also here a lot of (yet unimplemented) flexibility is within reach.
* On Windows only (with the OF windows package):
- "COM", (ActiveX) using the COM server (system library) to invoke Excel
itself (a sort of "system (...)" but more direct). Like UNO, any file
format is supported that Excel can handle. This is what Matlab does too.
As regards licenses I think this is OK for an OF package but close to
the edge; the io package's spreadsheet I/O doesn't depend on it.
I wouldn't want core Octave to be associated so tightly with proprietary
SW, so this is no candidate interface for core Octave.
Another limitation is that ActiveX / COM only works with 32-bit MS-Office.
Just to clarify, COM isn't limited to Excel only:
- At my work we use it to send Matlab graphics directly into MS Word or
Powerpoint documents;
- The COM interface can also be used to invoke OSS like LibreOffice. I
tried a few times but couldn't find a way to create arrays with COM &
Starbasic only and gave up there, but I did get LibreOffice to execute
basic commands from the Octave terminal.
The point to be aware of is that external support SW (Java-based) should
be available for users, either somehow included in self-contained
binaries (for Windows) or through distro package managers. AFAIK Redhat
does have Apache POI and maybe even JExcelAPI in its repos (the io
package's user support function can find it).
FYI, the 11 Java class libs for all interfaces except UNO occupy around
22 MB on my systems.
Support functions
-----------------
Apart from the actual spreadsheet I/O routines there's quite a bit of
other required stuff. For the XML-based formats two basic XML functions
are needed (isolating a node from a char array and getting node
attributes), there are support functions for parsing mixed cell arrays
into numeric and text components, and some functions relating to
spreadsheet cell addresses.
There's also a user support function that PKG_ADD and PKG_DEL use to
modify the javaclasspath.
All of this has to be transferred too.
Splitting up or one comprehensive function
------------------------------------------
There are 2 dimensions here, I suppose you mean "B":
A. Making xlsread just one comprehensive function.
B. Combining ODS and XLS into one xlsread or so.
A.
Matlab's xlsread and xlswrite are just one function (from a user
perspective) and in at least my experience that has been a bad decision.
That way, reading data from or writing data to file can only be done one
worksheet at a time as follows:
1. The entire spreadsheet file has to be read in memory
2. The data have to be read from or written to one worksheet only
3. (for writing) the entire spreadsheet file has to be written to disk
4. The file handle or background process (Excel in case of Matlab) has
to be closed.
All in all a very inefficient procedure once more worksheets in one file
need to be processed.
Octave's xlsread/xlswrite/odsread/odswrite and even xlsfinfo/odsfinfo
are mere wrappers around functions that
- open and read a spreadsheet file into RAM and return a handle;
- functions to read from or write to a worksheet using the handle;
- functions for parsing a raw cell array into numeric and char
components (in case of reading from worksheet);
- functions for closing the handle and -if needed- write the file to
disk, optionally with another file name and -location. I've even thought
of implementing writing to another file format. For the UNO and COM
interfaces that's already implicitly in place and there are workarounds
for the other Java-based interfaces.
B.
To me the name "xlsread" has the connotation of linking to some
proprietary SW; even though Matlab's xlsread can read ODS (provided
Excel2007 is the backend in use).
I'd rather have "spshread" or "spshwrite". In fact, that's one of the
things I wanted to do - replace the xlsopen/xls2oct/oct2xls/xlsclose and
odsopen/ods2oct/oct2ods/odsclose functions by wrappers calling
spshopen/spsh2oct/oct2spsh/spshclose functions, same for xlsread etc.
This way, xlsread could be just a thin wrapper or duplicate for
spshread, same for odsread and similar to the other ones (xls/odsfinfo &
xls/odswrite). Maybe even gnmread and gnmwrite for gnumeric?
I know, it does fill up the namespace. But apart from a lot of names,
there are no foreseen name clashes.
Other remarks
-------------
The io package's spreadsheet stuff is self-contained and independent of
any other OF package. There are only suggested dependencies.
I took care that adding or deleting interfaces should be easy (of
course, writing the actual interface-specific parts themselves is
another story).
Philip