[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Fwd: Problem with pdist
From: |
Esteban Cervetto |
Subject: |
Fwd: Problem with pdist |
Date: |
Thu, 15 Jul 2010 14:16:06 -0300 |
---------- Forwarded message ----------
From:
Jaroslav Hajek <address@hidden>Date: 2010/7/15
Subject: Re: Problem with pdist
To: Esteban Cervetto <
address@hidden>
Cc:
address@hidden
On Wed, Jul 14, 2010 at 8:09 PM, Esteban Cervetto
<
address@hidden> wrote:
> Hello:
>
> I am having problems with the pdist function. It seems not to do enough
> strong to calculate the next problem:
>
>
> octave:22> pdist(X,"mahalanobis")
>
> error: memory exhausted or requested size too large for range of Octave's
> index
>
> type -- trying to return to prompt
>
> octave:22> size(X)
>
> ans =
>
> 2 8993
>
> octave:23>
>
>
>
>
>
> However, Mahalanobis distance is usually used in datamining studies, and a
> sample of 8993 registers and two variables is small.
>
>
>
> It exist a form to improve dramatically the efficiency of octave
> with functions related with data mining or pattern classification?
>
>
>
> Links to similar topics are well received.
>
>
>
It seems the bottleneck here is Octave's nchoosek, which apparently
sucks for nchoosek(1:N, 2) where N is several thousand.
Hmmm. Perhaps something can be done there...
--
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url:
www.highegg.matfyz.cz
I am trying to improve nchoosek when n = 2. I used teh ind2sub function instead
n=4000;
x = [1:n]';
tic
nchoosek(1:n,2);
toc
tic
[r,c] = ind2sub ([n,n], 1:n.^2);
order = [r(r>c);c(r>c)];
toc
Results with n = 40
Elapsed time is 0.0157 seconds.
Elapsed time is -6.98e-009 seconds.
Results with n = 4000
Elapsed time is 290 seconds.
Elapsed time is 10.4 seconds.
Another thing is that I am using octave-3.2.3, but i am not using the pdist() of the package. The one is:
## Copyright (C) 2008 Francesco Potortì
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this software; see the file COPYING. If not, see
## <
http://www.gnu.org/licenses/>.
## -*- texinfo -*-
## @deftypefn {Function File} @var{y} = pdist (@var{x})
## @deftypefnx {Function File} @var{y} = pdist (@var{x}, @var{distfun})
## @deftypefnx {Function File} @var{y} = pdist (@var{x},
## @var{distfun}, @var{distfunarg}, @dots{})
##
## Return the distance between any two rows in @var{x}.
##
## @var{x} is the matrix representing @var{q} row vectors of
## size @var{d}.
##
## The output is a dissimilarity matrix arranged into a row vector
## @var{y},(n - 1) * (n / 2) long, where the distances are in the order
## [(1, 2) (1, 3) @dots{} (2, 3) @dots{} (n-1, n)]. You can use the
## @seealso{squareform} function to display the distances between the
## vectors arranged into an matrix.
##
## @var{distfun} is an optional argument specifying how the distance is
## computed. It can be any of the following ones, defaulting to
## "euclidean", or a user defined function that takes two arguments
## distfun @var{x} and @var{y} plus any number of optional arguments,
## where @var{x} is a row vector and and @var{y} is a matrix having the
## same number of columns as @var{x}. @var{distfun} returns a column
## vector where row @var{i} is the distance between @var{x} and row
## @var{i} of @var{y}. Any additional arguments after the @var{distfun}
## are passed as distfun (@var{x}, @var{y}, @var{distfunarg1},
## @var{distfunarg2} @dots{}).
##
## Predefined distance functions are:
##
## @table @samp
## @item "euclidean"
## Euclidean distance (default).
##
## @item "seuclidean"
## Standardized Euclidean distance. Each coordinate in the sum of
## squares is inverse weighted by the sample variance of that
## coordinate.
##
## @item "mahalanobis"
## Mahalanobis distance: @seealso{mahalanobis}.
##
## @item "cityblock"
## City Block metric, aka Manhattan distance.
##
## @item "minkowski"
## Minkowski metric. Accepts a numeric parameter @var{p}: for @var{p}=1
## this is the same as the cityblock metric, with @var{p}=2 (default) it
## is equal to the euclidean metric.
##
## @item "cosine"
## One minus the cosine of the included angle between rows, seen as
## vectors.
##
## @item "correlation"
## One minus the sample correlation between points (treated as
## sequences of values).
##
## @item "spearman"
## One minus the sample Spearman's rank correlation between
## observations, treated as sequences of values.
##
## @item "hamming"
## Hamming distance: the quote of the number of coordinates that differ.
##
## @item "jaccard"
## One minus the Jaccard coefficient, the quote of nonzero
## coordinates that differ.
##
## @item "chebychev"
## Chebychev distance: the maximum coordinate difference.
## @end table
## @seealso{linkage,squareform}
## @end deftypefn
## Author: Francesco Potortì
function y = pdist (x, distfun, varargin)
if (nargin < 1)
print_usage ();
elseif ((nargin > 1)
&& ! ischar (distfun)
&& ! isa (distfun, "function_handle"))
error (["pdist: the distance function must be either a string or a "
"function handle."]);
endif
if (nargin < 2)
distfun = "euclidean";
endif
if (! ismatrix (x) || isempty (x))
error ("pdist: x must be a nonempty matrix");
elseif (length (size (x)) > 2)
error ("pdist: x must be 1 or 2 dimensional");
endif
y = [];
if (ischar (distfun))
order = nchoosek(1:rows(x),2);
Xi = order(:,1);
Yi = order(:,2);
X = x';
switch (distfun)
case "euclidean"
diff = X(:,Xi) - X(:,Yi);
if (str2num(version()(1:3)) > 3.1)
y = norm (diff, "cols");
else
y = sqrt (sumsq (diff));
endif
case "seuclidean"
diff = X(:,Xi) - X(:,Yi);
weights = inv (diag (var (x)));
y = sqrt (sum ((weights * diff) .* diff));
case "mahalanobis"
diff = X(:,Xi) - X(:,Yi);
weights = inv (cov (x));
y = sqrt (sum ((weights * diff) .* diff));
case "cityblock"
diff = X(:,Xi) - X(:,Yi);
if (str2num(version()(1:3)) > 3.1)
y = norm (diff, 1, "cols");
else
y = sum (abs (diff));
endif
case "minkowski"
diff = X(:,Xi) - X(:,Yi);
p = 2; # default
if (nargin > 2)
p = varargin{1}; # explicitly assigned
endif;
if (str2num(version()(1:3)) > 3.1)
y = norm (diff, p, "cols");
else
y = (sum ((abs (diff)).^p)).^(1/p);
endif
case "cosine"
prod = X(:,Xi) .* X(:,Yi);
weights = sumsq (X(:,Xi)) .* sumsq (X(:,Yi));
y = 1 - sum (prod) ./ sqrt (weights);
case "correlation"
corr = cor (X);
y = 1 - corr (sub2ind (size (corr), Xi, Yi))';
case "spearman"
corr = spearman (X);
y = 1 - corr (sub2ind (size (corr), Xi, Yi))';
case "hamming"
diff = logical (X(:,Xi) - X(:,Yi));
y = sum (diff) / rows (X);
case "jaccard"
diff = logical (X(:,Xi) - X(:,Yi));
weights = X(:,Xi) | X(:,Yi);
y = sum (diff & weights) ./ sum (weights);
case "chebychev"
diff = X(:,Xi) - X(:,Yi);
if (str2num(version()(1:3)) > 3.1)
y = norm (diff, Inf, "cols");
else
y = max (abs (diff));
endif
endswitch
endif
if (isempty (y))
## Distfun is a function handle or the name of an external function
l = rows (x);
y = zeros (1, nchoosek (l, 2));
idx = 1;
for ii = 1:l-1
for jj = ii+1:l
y(idx++) = feval (distfun, x(ii,:), x, varargin{:})(jj);
endfor
endfor
endif
endfunction