help-octave

Re: Kolmogorov-Smirnov test 2


From: Kai Torben Ohlhus
Subject: Re: Kolmogorov-Smirnov test 2
Date: Tue, 2 Jul 2019 19:47:26 +0900
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2

On 6/28/19 6:35 AM, Tommy McCann wrote:
> 
> On Thu, Jun 27, 2019 at 1:09 AM Kai Torben Ohlhus wrote:
> 
>     On 6/22/19 8:15 AM, tmac017 wrote:
>     > I was trying to use kolmogorov_smirnov_test_2 and I got this warning:
>     >
>     > warning: kolmogorov_smirnov_test_2: cannot compute correct p-values with ties
>     > warning: called from
>     >     kolmogorov_smirnov_test_2 at line 79 column 5
>     >
>     > I saw there was another thread about this, but it didn't answer the
>     > question and that thread is closed.  Since I spent some time looking
>     > at the code, I'm re-posting.
>     >
>     > The warning means that some values in each set are exactly the same.
>     > The reason this is a problem is that the code sorts the values from
>     > both sets, and the sorted values can't occupy the same place in an
>     > ordered series.  In order to avoid an error caused by the sorting, the
>     > function deletes the D value at that point.  I don't think this should
>     > cause any problems, but it still prints a warning.
>     >
>     > The reason I got this warning is that I was using the function
>     > empirical_cdf to generate a CDF for each data set along the same range,
>     > because the HELP info said the function required CDF inputs.  Based on
>     > the code, it seems the function takes in two data sets, not CDFs.
>     > Because CDFs alter the size of the set, they mess with the results.
>     >
>     > Note: in the other thread Hamish was having a hard time using the
>     > KS-test for
>     > a = randn(2000,1);
>     > b = randn(2000,1);
>     > p = kolmogorov_smirnov_test_2(a,b)
>     >
>     > She got the same warning and the results weren't consistent.  This is,
>     > ironically, BECAUSE of the large set size.  The test statistic is
>     > sqrt (n_x * n_y / (n_x + n_y)) * d.  Since the curves were randomly
>     > generated, some deviation was expected; the large sample size made the
>     > test more sensitive to that deviation, and increasing the sample size
>     > just made the test even more sensitive.
>     >
>     Could you please tell me which version of Octave and which version of
>     the statistics package you are using?  In version 4.4.0 many statistics
>     functions moved to the statistics package of Octave Forge [1].
> 
>     Additionally, it would be nice to have a reproducible test for this
>     warning message.  The example of Hamish from 2005 [2]
> 
>        N = 1e6; while 1, a = randn(N,1); b = randn(N,1); p = kolmogorov_smirnov_test_2(a,b), endwhile
> 
>     did not throw the warning you described for N=2000 or N=1e6 within
>     5 minutes.
> 
>     Best,
>     Kai
> 
> 
>     [1] https://octave.sourceforge.io/statistics/NEWS.html
>     [2] https://lists.gnu.org/archive/html/help-octave/2005-11/msg00232.html
> 
>
> The version of Octave I'm using is 4.2.0, and it looks like the
> statistics package is 1.3.0.
>
> True, in the original thread the error only occurred during the first
> run and her primary question was why the p value wasn't consistent. I
> got the error because I was comparing two cdfs instead of two data sets.
> Here is a piece of code similar to the one I used to troubleshoot.
>
> %%begin script to test KS error...
>
> %create random vectors
> d1 = rand(100,1);
> d2 = rand(100,1);
>
> x = 0:0.05:1;
>
> %create cdfs to visualize kolmogorov_smirnov_test results
> d1_cdf = empirical_cdf(x,d1);
> d2_cdf = empirical_cdf(x,d2);
>
> %plot cdfs
> figure(1);
> plot(x,d1_cdf,'-',x,d2_cdf,'-');
>
> % run KS-test
> disp('first test, without duplicate does not produce warning: ');
> P = kolmogorov_smirnov_test_2(d1, d2)
>
> %add duplicate value to vectors then re-run test
> disp('second test, containing duplicate, produces warning: ');
> next = length(d1)+1;
> d1(next) = 0.5;
> d2(next) = 0.5;
>
> P = kolmogorov_smirnov_test_2(d1, d2)
>
> %attempt to use cdfs instead of raw data as specified in help info
> disp('using cdf produces warning and incorrect results:');
> P = kolmogorov_smirnov_test_2(d1_cdf, d2_cdf)
>

Thank you for your example.  I created bug #56572 [3] so that your
example is not forgotten.

I do not fully understand why the tie values are treated this way.  If
you can help with this bug, please post your ideas for improving the
function at [3].  However, the statistics package does not seem to have
a maintainer right now, so your changes might not be merged immediately.
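
As a starting point for that discussion, here is a minimal sketch (not
the statistics package's implementation) of one possible way to handle
ties: compare the two empirical CDFs only at the distinct pooled values,
so that tied observations are counted in both samples before the
difference is taken.  The helper name ecdf_at and the small data vectors
are made up for illustration; the scaling of d is the one quoted above.

```octave
% Hypothetical helper: empirical CDF of sample s evaluated at points z.
% Relies on broadcasting: s(:)' is 1-by-n, z(:) is m-by-1.
ecdf_at = @(s, z) sum (s(:)' <= z(:), 2) / numel (s);

% Illustrative data; the value 0.5 is tied within d2 and across d1 and d2.
d1 = [0.1; 0.4; 0.5; 0.9];
d2 = [0.2; 0.5; 0.5; 0.7];

% Distinct pooled values, sorted ascending.
z = unique ([d1; d2]);

% Largest absolute difference between the two empirical CDFs.
d = max (abs (ecdf_at (d1, z) - ecdf_at (d2, z)));

% Scaled statistic, as quoted earlier in this thread.
n_x = numel (d1);
n_y = numel (d2);
ks = sqrt (n_x * n_y / (n_x + n_y)) * d;

printf ("d = %g, ks = %g\n", d, ks);
```

Evaluating at the distinct values means the result does not depend on
the order in which tied observations happen to be sorted.  Whether the
p-value computed from ks is still valid in the presence of ties is a
separate question that this sketch does not address.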

Best regards,
Kai

[3] https://savannah.gnu.org/bugs/?56572


