Re: Standard example datasets
Thu, 2 May 2019 18:36:37 -0400
On 5/2/19 10:38 AM, Carnë Draug wrote:
> On Thu, 2 May 2019 at 01:22, Andrew Janke <address@hidden> wrote:
>> On 4/28/19 8:27 AM, Carnë Draug wrote:
>>> On Sat, 27 Apr 2019 at 21:02, Andrew Janke <address@hidden> wrote:
>>> Anyway, there is already an item on the tracker that lists the
>>> ones in Matlab. The issue is finding out who the copyright holder of
>>> such data is and contacting them.
>>>  https://savannah.gnu.org/patch/?9544
>> Do we have any lawyers or software licensing experts on the list?
>> My understanding is that simple databases are not subject to copyright,
>> under the "you can't copyright facts" principle. They're just subject to
>> whatever licensing terms you signed a contract to get access to the data
> None of us are lawyers. Some people will argue that datasets are
> copyrightable. There's a bunch of scientists struggling with the
> whole thing about sharing data, and licenses for data are a real
> thing. Also, some of those datasets are images and photographs,
> including photographs of paintings.
> I think discussing this is outside the scope of Octave.
Guess I need to do some research on this.
>> I'm looking through the R source code. R's example datasets are mostly
>> little datasets written out in source code like this:
>> Could we just take the numbers from the R code, either under the "no
>> copyright for dbs" rule, or under the same license that R itself is
>> distributed under, rewrite it as M-code, and include those?
> The whole point I tried to make before was that it would be more
> useful to have the same datasets as Matlab because it makes it easier to
> copy and paste examples. If you copy the datasets of R, then you can no
> longer copy and paste such example code into Octave, at which point you
> might as well make up your own datasets and sidestep the whole
> copyright question.
I get and agree with your other point: having full compatibility with
the Matlab example datasets, to the point where you can
copy-and-paste the Matlab code using their example data sets, would be
I don't think I can contribute to that, though: the Matlab example data
sets don't have public documentation; the only way you can see how
they're structured is by examining them in Matlab. By my reading, that's
a violation of the Matlab license's Non-Compete clause, which prohibits
using Matlab to develop any competing product, copyright aside. So I'm
not going to touch that; y'all can make your own decisions.
Another thing here is that the way Matlab organizes its example datasets
is lousy IMHO: they're just a bunch of mat-files dumped in the path. No
interface to list all the examples, get metadata about them, load them
through a uniform interface besides the plain "load(filename)", keep
them out of the global identifier namespace, etc. I like R's
organization of their example datasets into a "datasets" package with a
I think it could also be useful to have Octave equivalents of the R
example datasets. Matlab isn't the only substitute for Octave; R and
numeric Python are, too. We might have users coming from R to Octave, or
vice versa. Having the same example datasets in both programs would be
useful for pedagogical purposes, so you can show how to perform the same
analysis on the same data in both languages, providing a sort of Rosetta
stone for them.
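
As a rough sketch of the kind of uniform interface discussed above (list the examples, get metadata about them, load them without touching the caller's namespace), something like the following could work. It is written in Python purely for illustration; every name in it (Dataset, DatasetRegistry, names, describe, load) is hypothetical, the "toy" entry holds invented values rather than a real dataset, and it is not a description of how R's "datasets" package or any Octave code is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Dataset:
    """Metadata plus a loader for one example dataset (hypothetical type)."""
    name: str
    description: str
    source: str
    loader: Callable[[], dict]


class DatasetRegistry:
    """Keeps example datasets behind one interface instead of loose files on the path."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dataset] = {}

    def register(self, dataset: Dataset) -> None:
        # Refuse duplicates so one dataset cannot silently shadow another.
        if dataset.name in self._datasets:
            raise ValueError(f"dataset already registered: {dataset.name}")
        self._datasets[dataset.name] = dataset

    def names(self) -> List[str]:
        """Names of all registered datasets, sorted."""
        return sorted(self._datasets)

    def describe(self, name: str) -> Dataset:
        """Metadata for one dataset, without loading its data."""
        return self._datasets[name]

    def load(self, name: str) -> dict:
        """Load a dataset and return it; nothing is injected into the caller's namespace."""
        return self._datasets[name].loader()


# Toy entry with invented values, only to exercise the interface.
registry = DatasetRegistry()
registry.register(Dataset(
    name="toy",
    description="Made-up toy values for illustration",
    source="none (invented for this sketch)",
    loader=lambda: {"x": [1, 2, 3], "y": [2, 4, 6]},
))
```

The point of the design is that the data only enters the workspace as the return value of load(), so dataset names never collide with user variables or with each other.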
I've started working through the R example data sets and translating a
few of them into Octave, and it's proven to be a useful exercise,
exposing some bugs in my Chrono and Tablicious libraries. I'm going to
continue working through them, and if I end up with something useful,
I'll let y'all know. If you want to follow along, it's on this branch on
my GitHub repo:
Alois Schloegl wrote:
> ...please consider also load_fisheriris (part of NaN-toolbox) ...
> ...BTW, the site  contains a number of other data sets...
Alois Schloegl: I've included the Fisher Iris dataset. Of course. It was
the first one I added. :) My approach includes storing the mat-file
version of the dataset in the source tree, so a network connection is
only needed at development time, not run time. And I'm looking through
that ics.uci.edu site for other potential example data sets.
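
The "network at development time only" approach described above can be sketched roughly as two separate steps. This is an illustration in Python under stated assumptions, not the actual code on the branch: the function names vendor_dataset and load_bundled are hypothetical, the fetch callable stands in for whatever download step a developer would run, and CSV is used instead of mat-files only so the sketch needs nothing beyond the standard library.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List


def vendor_dataset(name: str, fetch: Callable[[], str], data_dir: Path) -> Path:
    """Development-time step: obtain the data once and write it into the source tree.

    `fetch` is supplied by the developer (for example, a download); it is the
    only place where a network connection would be needed.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"{name}.csv"
    path.write_text(fetch())
    return path


def load_bundled(name: str, data_dir: Path) -> List[Dict[str, str]]:
    """Run-time step: read the bundled copy from disk; no network access."""
    path = data_dir / f"{name}.csv"
    if not path.is_file():
        raise FileNotFoundError(f"no bundled dataset named {name!r} in {data_dir}")
    with path.open(newline="") as f:
        return list(csv.DictReader(f))
```

Since the vendored file is committed alongside the code, users only ever call the run-time loader, and the fetch step can be rerun by a developer when the upstream data changes.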