[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[gnuastrocommits] master 0c874fa 2/2: Merge: imported flat CFP outlier
From: 
Mohammad Akhlaghi 
Subject: 
[gnuastrocommits] master 0c874fa 2/2: Merge: imported flat CFP outlier detection, minor problem fixed 
Date: 
Tue, 7 May 2019 13:05:58 0400 (EDT) 
branch: master
commit 0c874fae4306abe03a7ed846ed281b59e15242f6
Merge: 8085adb b2f1ecb
Author: Mohammad Akhlaghi <address@hidden>
Commit: Mohammad Akhlaghi <address@hidden>
Merge: imported flat CFP outlier detection, minor problem fixed
There weren't any conflicts in this merge. But after doing the full test, I
noticed that there was a problem in building the Info manual. This has been
fixed.

NEWS  12 +++++
doc/gnuastro.texi  118 ++++++++++++++++++++++++
lib/gnuastro/statistics.h  7 +
lib/statistics.c  105 ++++++++++++++++++++++++++++
4 files changed, 150 insertions(+), 92 deletions()
diff git a/NEWS b/NEWS
index b726fe3..8f16b41 100644
 a/NEWS
+++ b/NEWS
@@ 19,6 +19,18 @@ See the end of the file for license conditions.
** Changed features
+ Arithmetic:
+  The output of coadding operators is no longer the same type as the
+ input in general. The output of the `min' and `max' operators are
+ still the same type as the input. However the `number' and
+ `sigclipnumber' operators will output an unsigned 32bit integer type
+ and the rest (`sum', `mean', `std', `median', `sigclipmedian',
+ `sigclipmean' and `sigclipstd') return 32bit floating point
+ datasets
+
+ Library:
+  gal_statistics_outlier_flat_cfp: Improved implementation with new API.
+
** Bugs fixed
bug #56195: astscriptsortbynight crashing because of AWK.
bug #56246: Singlevalued measurement must abort with no value in Statistics.
diff git a/doc/gnuastro.texi b/doc/gnuastro.texi
index db91223..155afb9 100644
 a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ 15985,7 +15985,7 @@ the histogram binwidths. Generally, the mode of a
distribution also shifts
as signal is added. Therefore, even if it is accurately measured, the mode
is a biased measure for the Sky value.
address@hidden sigmaclipping
address@hidden Sigmaclipping
@item
Another approach was to iteratively clip the brightest pixels in the image
(which is known as @mymath{\sigma}clipping). See @ref{Sigma clipping} for
@@ 16444,11 +16444,12 @@ clipping}). @mymath{\sigma}clipping configuration is
done with the
@cindex Outlier
Here is one scenario where this can be useful: assume you have a table and
you would like to remove the rows that are outliers (not within the
sigmaclipping range). Let's assume your table is called @file{table.fits}
and you only want to keep the rows that have a value in @code{COLUMN}
within the @mymath{\sigma}clipped range (to @mymath{3\sigma}, with a
tolerance of 0.1). This command will return the sigmaclipped median and
standard deviation (used to define the range later).
address@hidden range). Let's assume your table is called
address@hidden and you only want to keep the rows that have a value in
address@hidden within the @mymath{\sigma}clipped range (to
address@hidden, with a tolerance of 0.1). This command will return the
address@hidden median and standard deviation (used to define the
+range later).
@example
$ aststatistics table.fits cCOLUMN sclipparams=3,0.1 \
@@ 16796,8 +16797,8 @@ it is interpreted as tolerance and if it is larger than
one it is a
specific number. Hence, in the latter case the value must be an integer.
@item outliersclip=FLT,FLT
Sigmaclipping parameters for the outlier rejection of the Sky value
(similar to @option{sclipparams}).
address@hidden parameters for the outlier rejection of the Sky
+value (similar to @option{sclipparams}).
Outlier rejection is useful when the dataset contains a large and diffuse
(almost flat within each tile) signal. The flatness of the profile will
@@ 17425,8 +17426,8 @@ quantile is between 0.49 and 0.51 (recall that the
median's quantile is
0.5) will be used.
@item outliersclip=FLT,FLT
Sigmaclipping parameters for the outlier rejection of the quantile
threshold. The format of the given values is similar to
address@hidden parameters for the outlier rejection of the
+quantile threshold. The format of the given values is similar to
@option{sigmaclip} below. In NoiseChisel, outlier rejection on tiles is
used when identifying the quantile thresholds (@option{qthresh},
@option{noerodequant}, and @option{detgrowquant}).
@@ 20058,11 +20059,11 @@ than the minimum, a value of @code{inf} is reported.
@item upperlimitskew
@cindex Skewness
This column contains the nonparametric skew of the sigmaclipped random
distribution that was used to estimate the upperlimit magnitude. Taking
address@hidden as the mean, @mymath{\nu} as the median and @mymath{\sigma} as
the standard deviation, the traditional definition of skewness is defined
as: @mymath{(\mu\nu)/\sigma}.
+This column contains the nonparametric skew of the @mymath{\sigma}clipped
+random distribution that was used to estimate the upperlimit
+magnitude. Taking @mymath{\mu} as the mean, @mymath{\nu} as the median and
address@hidden as the standard deviation, the traditional definition of
+skewness is defined as: @mymath{(\mu\nu)/\sigma}.
This can be a good measure to see how much you can trust the random
measurements, or in other words, how accurately the regions with signal
@@ 27757,7 +27758,7 @@ output will be @code{GAL_TYPE_UINT32} and for the rest,
it will be
Similar to the operands above (including @code{GAL_ARITHMETIC_MIN}), except
that when @code{gal_arithmetic} is called with these operators, it requires
two arguments. The first is the list of datasets like before, and the
second is the 2element list of sigmaclipping parameters. The first
+second is the 2element list of @mymath{\sigma}clipping parameters. The first
element in the parameters list is the multiple of sigma and the second is
the termination criteria (see @ref{Sigma clipping}). The output type of
@code{GAL_ARITHMETIC_OP_SIGCLIP_NUMBER} will be @code{GAL_TYPE_UINT32} and
@@ 28897,7 +28898,7 @@ container}.
@deffn Macro GAL_STATISTICS_SIG_CLIP_MAX_CONVERGE
The maximum number of clips, when @mymath{\sigma}clipping should be done
by convergence. If the clipping does not converge before making this many
clips, all sigmaclipping outputs will be NaN.
+clips, all @mymath{\sigma}clipping outputs will be NaN.
@end deffn
@deffn Macro GAL_STATISTICS_MODE_GOOD_SYM
@@ 29260,50 +29261,55 @@ moves the window as a separate line with several
columns. The first column
is the value, the second (in square brackets) is the sorted index, the
third is the distance of this element from the previous one. The Fourth and
fifth (in parenthesis) are the median and standard deviation of the
sigmaclipped distribution within the window and the last column is the
difference between the third and fourth, divided by the fifth.
address@hidden distribution within the window and the last column
+is the difference between the third and fourth, divided by the fifth.
@end deftypefun
address@hidden {gal_data_t *} gal_statistics_outlier_flat_cfp (gal_data_t
@code{*input}, size_t @code{dist}, float @code{thresh}, float @code{width}, int
@code{inplace}, int @code{quiet}, size_t @code{*index})
address@hidden {gal_data_t *} gal_statistics_outlier_flat_cfp (gal_data_t
@code{*input}, size_t @code{numprev}, float @code{sigclip_multip}, float
@code{sigclip_param}, float @code{thresh}, size_t @code{numcontig}, int
@code{inplace}, int @code{quiet}, size_t @code{*index})
Return the first element in the given dataset where the cumulative
frequency plot first becomes flat (below a given threshold on the slope)
for a sufficient width (percentage of whole dataset's range). The returned
dataset only has one element (with the same type as the input). If
address@hidden, the index (counting from zero, after sorting dataset
and removing any blanks) is written in the space that @code{index} points
to. If no sufficiently flat portion is found, the returned pointer will be
address@hidden

The operation of this function can be best visualized using the cumulative
frequency plot (CFP, see @ref{Histogram and Cumulative Frequency
Plot}). Imagine setting the horizontal axis of the input's CFP to a range
of 0 to 100, and finding the first part where its slope is
constantly/contiguously flat for a certain fraction/width of the whole
dataset's range. Below we'll describe this in more detail.

This function will first remove all the blank elements and sort the
remaining elements. If @code{inplace} is nonzero this step will be done in
place: reorganizing/permuting the input dataset. Using @code{inplace} can
result in faster processing and less RAM consumption, but only when the
input dataset is no longer needed in its original permutation.

The input dataset will then be internally scaled from 0 to 100 (so
address@hidden and @code{width} can be in units of percent). For each
element, this function will then estimate the slope of the cumulative
distribution function by using the @code{dist} elements before and after
it. In other words, if @mymath{d} is the value of @code{dist}, then the
slope over the @mymath{i}'th element (@mymath{a_i}) is estimated using this
formula: @mymath{1/(a_{i+d}a_{id})}. Therefore, when this slope is larger
than 1, the distance between the two checked points is smaller than
address@hidden of the dataset's range.

All points that have a slope less than @code{thresh} are then marked. The
first (parsing from the smallest to the largest values) contiguous patch of
marked elements that is larger than @code{width} wide (in units of
percentage of the dataset's range) will be considered as the boundary
region of outliers and the first element in that patch will be returned.
+frequency plot first becomes significantly flat for a sufficient number of
+elements. The returned dataset only has one element (with the same type as
+the input). If @code{index!=NULL}, the index (counting from zero, after
+sorting the dataset and removing any blanks) is written in the space that
address@hidden points to. If no sufficiently flat portion is found, the
+returned pointer will be @code{NULL}.
+
address@hidden Sigmaclipping
+The flatness on the cumulative frequency plot is defined like this (see
address@hidden and Cumulative Frequency Plot}): on the sorted dataset, for
+every point (@mymath{a_i}), we calculate @mymath{d_i=a_{i+2}a_{i2}}. This
+done on the first @mymath{N} elements (value of @code{numprev}). After
+element @mymath{a_{N+2}}, we start estimating the flatness as follows: for
+every element we use the @mymath{N}, @mymath{d_i} measurements before it as
+the reference. Let's call this set @mymath{D_i} for element @mymath{i}. The
address@hidden median (@mymath{m}) and standard deviation
+(@mymath{s}) of @mymath{D_i} are then calculated. The
address@hidden can be configured with the two
address@hidden and @code{sigclip_multip} arguments.
+
+Taking @mymath{t} as the significance threshold (value to @code{thresh}), a
+point is considered flat when @mymath{a_i>m+t\sigma}. But a single point
+satisfying this condition will probably just be due to noise. To make a
+more robust estimate, this significance/condition has to hold for
address@hidden contiguous elements after @mymath{a_i}. When this is
+satisfied, @mymath{a_i} is retured as the point where the distribution's
+cumulative frequency plot becomes flat.
+
+To get a good estimate of @mymath{m} and @mymath{s}, it is thus recommended
+to set @code{numprev} as large as possible. However, be careful not to set
+it too high: the checks in the paragraph above are not done on the first
address@hidden elements and this function assumes the flatness occures
+after them. Also, be sure that the value to @code{numcontig} is much less
+than @code{numprev}, otherwise @mymath{\sigma}clipping may not be able to
+remove the immediate outliers in @mymath{D_i} near the boundary of the flat
+region.
+
+When @code{quiet==0}, the basic measurements done on each element are
+printed on the commandline (good for finding the best parameters). When
address@hidden, the sorting and removal of blank elements is done on the
+input dataset, so the input may be altered after this function.
@end deftypefun
diff git a/lib/gnuastro/statistics.h b/lib/gnuastro/statistics.h
index 7eeff6a..9dcf35a 100644
 a/lib/gnuastro/statistics.h
+++ b/lib/gnuastro/statistics.h
@@ 179,9 +179,10 @@ gal_statistics_outlier_positive(gal_data_t *input, size_t
window_size,
float sigclip_param, int inplace, int quiet);
gal_data_t *
gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t dist, float thresh,
 float width, int inplace, int quiet,
 size_t *index);
+gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t numprev,
+ float sigclip_multip, float sigclip_param,
+ float thresh, size_t numcontig, int inplace,
+ int quiet, size_t *index);
diff git a/lib/statistics.c b/lib/statistics.c
index 5cab9b6..f65bf4e 100644
 a/lib/statistics.c
+++ b/lib/statistics.c
@@ 2213,56 +2213,93 @@ gal_statistics_outlier_positive(gal_data_t *input,
size_t window_size,
/* Find the outliers using the first flat portion of the cumulative
 frequency plot. A good value for window and thresh are 5 and 1.0. */
+/* Find the outliers using the average distance of the neighboring
+ points. */
#define OUTLIER_FLAT_CFP_BYTYPE(IT) { \
 IT m, n, *a=nbs>array, *p=a+dist, *pp=a+nbs>sizedist; \
 min=a[0]; max=a[nbs>size1]; \
+ IT diff, *pr=prev>array; \
+ IT *a=nbs>array, *p=a+d, *pp=a+nbs>sized; \
+ \
do \
{ \
 m=( *(pdist)min ) / (maxmin) * 100.0f; \
 n=( *(p+dist)min ) / (maxmin) * 100.0f; \
 check=1/(nm); \
 if(!quiet) printf("%6zu%15g%15g\n", pa, (float)(*p), check); \
 if( check<thresh && start==GAL_BLANK_SIZE_T ) \
 { start=pa; if(!quiet) printf("\tstart\n"); } \
 if( check>thresh && start!=GAL_BLANK_SIZE_T ) \
+ diff=*(p+d)*(pd); \
+ if(pad<numprev) \
{ \
 widthcheck=100.0 * (double)(*pa[start])/(maxmin); \
 if(!quiet) printf("\tend: %f\n", widthcheck); \
 if( widthcheck > width ) { flatind=start; break; } \
 start=GAL_BLANK_SIZE_T; \
+ pr[pad]=diff; \
+ if(!quiet) printf("%6zu%15g%15g\n", pa, (float)(*p), \
+ (float)diff); \
+ } \
+ else \
+ { \
+ /* Sigmaclipped median and std for a check. */ \
+ prev>flag=0; \
+ prev>size=prev>dsize[0]=numprev; \
+ sclip=gal_statistics_sigma_clip(prev, sigclip_multip, \
+ sigclip_param, 1, 1); \
+ \
+ sarr=sclip>array; \
+ check = (diff  sarr[1]) / sarr[3]; \
+ \
+ /* If requested, print the values. */ \
+ if(!quiet) printf("%6zu%15g%15g%15g (%g,%g)\n", pa, \
+ (float)(*p), (float)diff, check, sarr[1], \
+ sarr[3]); \
+ \
+ /* When values are equal, std will be roughly zero */ \
+ if(sarr[3]>1e6 && check>thresh) \
+ { \
+ if(flatind==GAL_BLANK_SIZE_T) \
+ { \
+ ++counter; \
+ flatind=pa; \
+ } \
+ else \
+ { \
+ if(flatind==pacounter) \
+ { /* First element above thresh is 0, so for */ \
+ /* counting, when counting the number of */ \
+ /* contiguous elements, we have to add 1. */ \
+ if(counter+1==numcontig) \
+ {gal_data_free(sclip); break;} \
+ else ++counter; \
+ } \
+ else { flatind=GAL_BLANK_SIZE_T; counter=0; } \
+ } \
+ } \
+ else { flatind=GAL_BLANK_SIZE_T; counter=0; } \
+ pr[(pad)%numprev]=diff; \
+ gal_data_free(sclip); \
} \
} \
while(++p<pp); \
 if( pp==p && start!=GAL_BLANK_SIZE_T ) \
 { \
 widthcheck=100.0 * (double)(*(p1)a[start])/(maxmin); \
 if(!quiet) printf("\tend: %f\n", widthcheck); \
 if( widthcheck > width ) flatind=start; \
 } \
+ if(counter+1!=numcontig) flatind=GAL_BLANK_SIZE_T; \
}
gal_data_t *
gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t dist,
 float thresh, float width, int inplace,
+gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t numprev,
+ float sigclip_multip, float sigclip_param,
+ float thresh, size_t numcontig, int inplace,
int quiet, size_t *index)
{
 gal_data_t *nbs, *out=NULL;
 double min, max, check, widthcheck;
 size_t one=1, flatind=GAL_BLANK_SIZE_T, start=GAL_BLANK_SIZE_T;
+ float *sarr;
+ double check;
+ gal_data_t *nbs, *prev, *out=NULL, *sclip;
+ size_t d=2, counter=0, one=1, flatind=GAL_BLANK_SIZE_T;
/* Sanity checks. */
 if(thresh<=0  width<=0)
 error(EXIT_FAILURE, 0, "%s: the values to `thresh' (%g) and `width' (%g) "
 "must be positive", __func__, thresh, width);
 if(width==0)
 error(EXIT_FAILURE, 0, "%s: `dist' (%zu) cannot be zero", __func__,
 dist);
+ if(thresh<=0)
+ error(EXIT_FAILURE, 0, "%s: the value of `thresh' (%g) must be "
+ "positive", __func__, thresh);
+ if(numprev==0)
+ error(EXIT_FAILURE, 0, "%s: `numprev' (%zu) cannot be zero", __func__,
+ numprev);
/* Remove all blanks and sort the dataset. */
nbs=gal_statistics_no_blank_sorted(input, inplace);
+ /* Keep previous slopes. */
+ prev=gal_data_alloc(NULL, GAL_TYPE_FLOAT64, 1, &numprev, NULL, 0, 1,
+ NULL, NULL, NULL);
+
/* Find the index where the distribution becomes sufficiently flat. */
switch(nbs>type)
{
@@ 2281,7 +2318,8 @@ gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t
dist,
__func__, nbs>type);
}
 /* Write the output dataset. */
+ /* Write the output dataset: if no point flat part was found, return
+ NULL. */
if(flatind!=GAL_BLANK_SIZE_T)
{
out=gal_data_alloc(NULL, input>type, 1, &one, NULL, 0, 1,
@@ 2294,5 +2332,6 @@ gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t
dist,
/* Clean up and return. */
if(nbs!=input) gal_data_free(nbs);
if(index) *index=flatind;
+ gal_data_free(prev);
return out;
}
[Prev in Thread] 
Current Thread 
[Next in Thread] 
 [gnuastrocommits] master 0c874fa 2/2: Merge: imported flat CFP outlier detection, minor problem fixed,
Mohammad Akhlaghi <=