help-octave

RFC: method for storing data


From: CdeMills
Subject: RFC: method for storing data
Date: Fri, 25 Jun 2010 07:00:48 -0700 (PDT)

Hello,

There has been some discussion recently about a much-requested feature: the
ability to manipulate complex data sets easily. People have mentioned the
data.frame feature from R, so let's define the problem. A data set generally
consists of tables, where rows are observations and columns are measured
properties. So we should expect each row to contain all the data about one
"sample" and each column to contain homogeneous base objects: int (age in
years), float (height in meters), or factors. A factor takes its values from
a fixed base set, for instance sex ('M' for male and 'F' for female). Each
element should also be able to take the value "missing", written 'NA' (not
available).
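
As a rough illustration of the factor idea in Octave terms, a column of labels could be stored as a set of levels plus integer codes pointing into that set, with NaN standing in for 'NA'. This is only a sketch: the field names "levels" and "codes" and the variable "is_na" are illustrative assumptions, not an existing Octave feature.

```octave
% Sketch only: a factor as a set of levels plus integer codes.
% The field names "levels" and "codes" are hypothetical.
vals = {'M', 'F', 'F', 'NA', 'M'};     % raw observations; 'NA' marks a missing value
is_na = strcmp(vals, 'NA');            % logical mask of the missing entries
[levels, ~, codes] = unique(vals(~is_na));  % sorted distinct labels and their indices
f.levels = levels;                     % here: 'F' and 'M'
f.codes = NaN(size(vals));             % NaN plays the role of NA
f.codes(~is_na) = codes;               % here: 2 1 1 NaN 2
```

Every element then refers back to the shared level set through its code, so renaming a level only touches f.levels.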

Now, let's run the example from R data.frame:
> L3 <- LETTERS[1:3]
> (d <- data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE)))
   x  y fac
1  1  1   A
2  1  2   A
3  1  3   B
4  1  4   A
5  1  5   A
6  1  6   B
7  1  7   C
8  1  8   A
9  1  9   B
10 1 10   B

The first column is the observation number, followed by two numeric
variables and a factor variable with levels 'A', 'B', 'C'.

Let's see how we can access the data:
1) column-wise, by column name
> d$y
 [1]  1  2  3  4  5  6  7  8  9 10
> class(d$y)
[1] "numeric"
> class(d$fac)
[1] "factor"

2) column-wise, by column position
> d[,1]
 [1] 1 1 1 1 1 1 1 1 1 1
> class(d[, 1])
[1] "numeric"

A few words of explanation: in R, x() is a function call, whereas x[] indexes
a variable. All rows are selected at once by omitting the row index: x[, 3]
in R corresponds to x(:, 3) in Octave.
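
For comparison, the Octave side of this correspondence can be sketched on a small numeric matrix (the matrix m and its values are made up for illustration):

```octave
% Octave counterparts of the R indexing forms above.
m = [1 1 10; 1 2 20; 1 3 30];   % small numeric table: 3 rows, 3 columns
col3 = m(:, 3);                 % all rows of column 3, like m[, 3] in R
row2 = m(2, :);                 % all columns of row 2, like m[2, ] in R
```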

3) row-wise
> d[3,]
  x y fac
3 1 3   B
> class(d[3,])
[1] "data.frame"
> class(d[3,3])
[1] "factor"
> class(d[3,1])
[1] "numeric"

So the features of interest, which are difficult to mimic in Octave for now, are:
1) column-wise access by name
2) the polymorphism when extracting sub-parts. Notice, for instance, that
d[3,3] is of class 'factor' and retains the initial level values, while simple
values, either scalar or vector, are returned in the simplest possible class,
going from data.frame down to numeric.

Let's try with Octave:
> c = {}; for indi = 1:10,
>   % ceil(3*rand()) is 1, 2 or 3 with equal probability, giving 'A', 'B' or 'C'
>   c(indi, :) = {1, indi, char(64 + ceil(3*rand()))};
> endfor

c is expressed in terms of cells; to obtain named columns, we go from cells
to struct:
> d = cell2struct(c, {"x", "y", "fac"}, 2)
d =
{
  10x1 struct array containing the fields:

    x
    y
    fac
}

We already have:
- column access by field name; the result is a comma-separated list, whereas
it should be a vector
- access by row; the result is a struct
- access by element, e.g. d(3).y; the result is numeric in this case
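
The three access forms above can be sketched as follows. The struct array is rebuilt here with fixed 'fac' values so that the result is reproducible; the brackets and braces are one common way to turn the comma-separated list into a vector or a cell array.

```octave
% Rebuild a small struct array like d above (fac values fixed for reproducibility).
c = {1, 1, 'A'; 1, 2, 'A'; 1, 3, 'B'};
d = cell2struct(c, {'x', 'y', 'fac'}, 2);   % 3x1 struct array

y = [d.y];        % concatenate the comma-separated list into a numeric row vector
fac = {d.fac};    % collect the strings into a cell array
r = d(3);         % row access yields a 1x1 struct
e = d(3).y;       % element access yields a plain number
```

The explicit [d.y] is exactly the step that a data.frame-like object would be expected to perform on its own.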

What could be changed:
- the size of d is (10, 1), which looks counter-intuitive as each 'row'
contains 3 values. The result has to be understood as "each line contains a
single structure with 3 fields".
- row access by field number; the problem here is the same: the 'struct'
level hides the number of fields
- we should have a "factor" class, defined as "element belonging to a set",
in such a way that there exists a dynamic link between all the elements and
the associated set.
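
One possible layout that exposes both dimensions is a single scalar struct whose fields hold whole columns. This is only a sketch under that assumption, with made-up values; it is not how 'struct' arrays behave today.

```octave
% Sketch only: one scalar struct whose fields are whole columns.
s.x = ones(4, 1);
s.y = (1:4)';
s.fac = ['A'; 'A'; 'B'; 'C'];            % one character per row

names = fieldnames(s);
ncols = numel(names);                    % 3: the number of columns is now visible
nrows = size(s.(names{1}), 1);           % 4: the number of rows
col2 = s.(names{2});                     % column access by position returns a vector
row3 = structfun(@(col) col(3, :), s, 'UniformOutput', false);  % row 3 as a struct
```

With this layout, column access by name (s.y) already returns a vector, at the price of row access needing a helper such as the structfun call.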

To be discussed: can those changes be applied to the 'struct' object itself,
or should we go for a specific dataframe object? In the latter case, I have
a few ideas for implementing it as a new class written in .m files.
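
To make the discussion concrete, the polymorphic extraction that such an object would need can be sketched as a plain function. The name df_index and the struct-of-columns layout are hypothetical; a real class would put similar logic behind its indexing method.

```octave
1;  % Octave idiom: marks this file as a script, not a function file

% Hypothetical helper, not an existing Octave function: index a table stored
% as a scalar struct of equally long columns.
function out = df_index(s, rows, cols)
  names = fieldnames(s);
  if numel(cols) == 1
    col = s.(names{cols});
    out = col(rows, :);                 % one column: return the plain values
  else
    out = struct();
    for k = cols(:)'
      out.(names{k}) = s.(names{k})(rows, :);   % several columns: stay a table
    endfor
  endif
endfunction

s.x = ones(4, 1);
s.y = (1:4)';
s.fac = ['A'; 'A'; 'B'; 'C'];
v = df_index(s, 1:4, 2);     % numeric column vector
e = df_index(s, 3, 3);       % single character 'B'
t = df_index(s, 3, 1:3);     % struct with fields x, y, fac
```

The result drops to the simplest class when a single column is selected and remains a table otherwise, which mirrors the R behaviour shown earlier.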

Let's open the discussion ...

Regards

Pascal
-- 
View this message in context: 
http://octave.1599824.n4.nabble.com/RFC-method-for-storing-data-tp2268493p2268493.html
Sent from the Octave - General mailing list archive at Nabble.com.

