gnuastro-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[gnuastro-commits] master 0c874fa 2/2: Merge: imported flat CFP outlier


From: Mohammad Akhlaghi
Subject: [gnuastro-commits] master 0c874fa 2/2: Merge: imported flat CFP outlier detection, minor problem fixed
Date: Tue, 7 May 2019 13:05:58 -0400 (EDT)

branch: master
commit 0c874fae4306abe03a7ed846ed281b59e15242f6
Merge: 8085adb b2f1ecb
Author: Mohammad Akhlaghi <address@hidden>
Commit: Mohammad Akhlaghi <address@hidden>

    Merge: imported flat CFP outlier detection, minor problem fixed
    
    There weren't any conflicts in this merge. But after doing the full test, I
    noticed that there was a problem in building the Info manual. This has been
    fixed.
---
 NEWS                      |  12 +++++
 doc/gnuastro.texi         | 118 ++++++++++++++++++++++++----------------------
 lib/gnuastro/statistics.h |   7 +--
 lib/statistics.c          | 105 ++++++++++++++++++++++++++++-------------
 4 files changed, 150 insertions(+), 92 deletions(-)

diff --git a/NEWS b/NEWS
index b726fe3..8f16b41 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,18 @@ See the end of the file for license conditions.
 
 ** Changed features
 
+  Arithmetic:
+   - The output of coadding operators is no longer the same type as the
+     input in general. The output of the `min' and `max' operators are
+     still the same type as the input. However the `number' and
+     `sigclip-number' operators will output an unsigned 32-bit integer type
+     and the rest (`sum', `mean', `std', `median', `sigclip-median',
+     `sigclip-mean' and `sigclip-std') return 32-bit floating point
+     datasets
+
+  Library:
+   - gal_statistics_outlier_flat_cfp: Improved implementation with new API.
+
 ** Bugs fixed
   bug #56195: astscript-sort-by-night crashing because of AWK.
   bug #56246: Single-valued measurement must abort with no value in Statistics.
diff --git a/doc/gnuastro.texi b/doc/gnuastro.texi
index db91223..155afb9 100644
--- a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ -15985,7 +15985,7 @@ the histogram bin-widths. Generally, the mode of a 
distribution also shifts
 as signal is added. Therefore, even if it is accurately measured, the mode
 is a biased measure for the Sky value.
 
address@hidden sigma-clipping
address@hidden Sigma-clipping
 @item
 Another approach was to iteratively clip the brightest pixels in the image
 (which is known as @mymath{\sigma}-clipping). See @ref{Sigma clipping} for
@@ -16444,11 +16444,12 @@ clipping}). @mymath{\sigma}-clipping configuration is 
done with the
 @cindex Outlier
 Here is one scenario where this can be useful: assume you have a table and
 you would like to remove the rows that are outliers (not within the
-sigma-clipping range). Let's assume your table is called @file{table.fits}
-and you only want to keep the rows that have a value in @code{COLUMN}
-within the @mymath{\sigma}-clipped range (to @mymath{3\sigma}, with a
-tolerance of 0.1). This command will return the sigma-clipped median and
-standard deviation (used to define the range later).
address@hidden range). Let's assume your table is called
address@hidden and you only want to keep the rows that have a value in
address@hidden within the @mymath{\sigma}-clipped range (to
address@hidden, with a tolerance of 0.1). This command will return the
address@hidden median and standard deviation (used to define the
+range later).
 
 @example
 $ aststatistics table.fits -cCOLUMN --sclipparams=3,0.1 \
@@ -16796,8 +16797,8 @@ it is interpreted as tolerance and if it is larger than 
one it is a
 specific number. Hence, in the latter case the value must be an integer.
 
 @item --outliersclip=FLT,FLT
-Sigma-clipping parameters for the outlier rejection of the Sky value
-(similar to @option{--sclipparams}).
address@hidden parameters for the outlier rejection of the Sky
+value (similar to @option{--sclipparams}).
 
 Outlier rejection is useful when the dataset contains a large and diffuse
 (almost flat within each tile) signal. The flatness of the profile will
@@ -17425,8 +17426,8 @@ quantile is between 0.49 and 0.51 (recall that the 
median's quantile is
 0.5) will be used.
 
 @item --outliersclip=FLT,FLT
-Sigma-clipping parameters for the outlier rejection of the quantile
-threshold. The format of the given values is similar to
address@hidden parameters for the outlier rejection of the
+quantile threshold. The format of the given values is similar to
 @option{--sigmaclip} below. In NoiseChisel, outlier rejection on tiles is
 used when identifying the quantile thresholds (@option{--qthresh},
 @option{--noerodequant}, and @option{detgrowquant}).
@@ -20058,11 +20059,11 @@ than the minimum, a value of @code{-inf} is reported.
 
 @item --upperlimitskew
 @cindex Skewness
-This column contains the non-parametric skew of the sigma-clipped random
-distribution that was used to estimate the upper-limit magnitude. Taking
address@hidden as the mean, @mymath{\nu} as the median and @mymath{\sigma} as
-the standard deviation, the traditional definition of skewness is defined
-as: @mymath{(\mu-\nu)/\sigma}.
+This column contains the non-parametric skew of the @mymath{\sigma}-clipped
+random distribution that was used to estimate the upper-limit
+magnitude. Taking @mymath{\mu} as the mean, @mymath{\nu} as the median and
address@hidden as the standard deviation, the traditional definition of
+skewness is defined as: @mymath{(\mu-\nu)/\sigma}.
 
 This can be a good measure to see how much you can trust the random
 measurements, or in other words, how accurately the regions with signal
@@ -27757,7 +27758,7 @@ output will be @code{GAL_TYPE_UINT32} and for the rest, 
it will be
 Similar to the operands above (including @code{GAL_ARITHMETIC_MIN}), except
 that when @code{gal_arithmetic} is called with these operators, it requires
 two arguments. The first is the list of datasets like before, and the
-second is the 2-element list of sigma-clipping parameters. The first
+second is the 2-element list of @mymath{\sigma}-clipping parameters. The first
 element in the parameters list is the multiple of sigma and the second is
 the termination criteria (see @ref{Sigma clipping}). The output type of
 @code{GAL_ARITHMETIC_OP_SIGCLIP_NUMBER} will be @code{GAL_TYPE_UINT32} and
@@ -28897,7 +28898,7 @@ container}.
 @deffn  Macro GAL_STATISTICS_SIG_CLIP_MAX_CONVERGE
 The maximum number of clips, when @mymath{\sigma}-clipping should be done
 by convergence. If the clipping does not converge before making this many
-clips, all sigma-clipping outputs will be NaN.
+clips, all @mymath{\sigma}-clipping outputs will be NaN.
 @end deffn
 
 @deffn  Macro GAL_STATISTICS_MODE_GOOD_SYM
@@ -29260,50 +29261,55 @@ moves the window as a separate line with several 
columns. The first column
 is the value, the second (in square brackets) is the sorted index, the
 third is the distance of this element from the previous one. The Fourth and
 fifth (in parenthesis) are the median and standard deviation of the
-sigma-clipped distribution within the window and the last column is the
-difference between the third and fourth, divided by the fifth.
address@hidden distribution within the window and the last column
+is the difference between the third and fourth, divided by the fifth.
 
 @end deftypefun
 
address@hidden {gal_data_t *} gal_statistics_outlier_flat_cfp (gal_data_t 
@code{*input}, size_t @code{dist}, float @code{thresh}, float @code{width}, int 
@code{inplace}, int @code{quiet}, size_t @code{*index})
address@hidden {gal_data_t *} gal_statistics_outlier_flat_cfp (gal_data_t 
@code{*input}, size_t @code{numprev}, float @code{sigclip_multip}, float 
@code{sigclip_param}, float @code{thresh}, size_t @code{numcontig}, int 
@code{inplace}, int @code{quiet}, size_t @code{*index})
 
 Return the first element in the given dataset where the cumulative
-frequency plot first becomes flat (below a given threshold on the slope)
-for a sufficient width (percentage of whole dataset's range). The returned
-dataset only has one element (with the same type as the input). If
address@hidden, the index (counting from zero, after sorting dataset
-and removing any blanks) is written in the space that @code{index} points
-to. If no sufficiently flat portion is found, the returned pointer will be
address@hidden
-
-The operation of this function can be best visualized using the cumulative
-frequency plot (CFP, see @ref{Histogram and Cumulative Frequency
-Plot}). Imagine setting the horizontal axis of the input's CFP to a range
-of 0 to 100, and finding the first part where its slope is
-constantly/contiguously flat for a certain fraction/width of the whole
-dataset's range. Below we'll describe this in more detail.
-
-This function will first remove all the blank elements and sort the
-remaining elements. If @code{inplace} is non-zero this step will be done in
-place: re-organizing/permuting the input dataset. Using @code{inplace} can
-result in faster processing and less RAM consumption, but only when the
-input dataset is no longer needed in its original permutation.
-
-The input dataset will then be internally scaled from 0 to 100 (so
address@hidden and @code{width} can be in units of percent). For each
-element, this function will then estimate the slope of the cumulative
-distribution function by using the @code{dist} elements before and after
-it. In other words, if @mymath{d} is the value of @code{dist}, then the
-slope over the @mymath{i}'th element (@mymath{a_i}) is estimated using this
-formula: @mymath{1/(a_{i+d}-a_{i-d})}. Therefore, when this slope is larger
-than 1, the distance between the two checked points is smaller than
address@hidden of the dataset's range.
-
-All points that have a slope less than @code{thresh} are then marked. The
-first (parsing from the smallest to the largest values) contiguous patch of
-marked elements that is larger than @code{width} wide (in units of
-percentage of the dataset's range) will be considered as the boundary
-region of outliers and the first element in that patch will be returned.
+frequency plot first becomes significantly flat for a sufficient number of
+elements. The returned dataset only has one element (with the same type as
+the input). If @code{index!=NULL}, the index (counting from zero, after
+sorting the dataset and removing any blanks) is written in the space that
address@hidden points to. If no sufficiently flat portion is found, the
+returned pointer will be @code{NULL}.
+
address@hidden Sigma-clipping
+The flatness on the cumulative frequency plot is defined like this (see
address@hidden and Cumulative Frequency Plot}): on the sorted dataset, for
+every point (@mymath{a_i}), we calculate @mymath{d_i=a_{i+2}-a_{i-2}}. This
+done on the first @mymath{N} elements (value of @code{numprev}). After
+element @mymath{a_{N+2}}, we start estimating the flatness as follows: for
+every element we use the @mymath{N}, @mymath{d_i} measurements before it as
+the reference. Let's call this set @mymath{D_i} for element @mymath{i}. The
address@hidden median (@mymath{m}) and standard deviation
+(@mymath{s}) of @mymath{D_i} are then calculated. The
address@hidden can be configured with the two
address@hidden and @code{sigclip_multip} arguments.
+
+Taking @mymath{t} as the significance threshold (value to @code{thresh}), a
+point is considered flat when @mymath{a_i>m+t\sigma}. But a single point
+satisfying this condition will probably just be due to noise. To make a
+more robust estimate, this significance/condition has to hold for
address@hidden contiguous elements after @mymath{a_i}. When this is
+satisfied, @mymath{a_i} is retured as the point where the distribution's
+cumulative frequency plot becomes flat.
+
+To get a good estimate of @mymath{m} and @mymath{s}, it is thus recommended
+to set @code{numprev} as large as possible. However, be careful not to set
+it too high: the checks in the paragraph above are not done on the first
address@hidden elements and this function assumes the flatness occures
+after them. Also, be sure that the value to @code{numcontig} is much less
+than @code{numprev}, otherwise @mymath{\sigma}-clipping may not be able to
+remove the immediate outliers in @mymath{D_i} near the boundary of the flat
+region.
+
+When @code{quiet==0}, the basic measurements done on each element are
+printed on the command-line (good for finding the best parameters). When
address@hidden, the sorting and removal of blank elements is done on the
+input dataset, so the input may be altered after this function.
 @end deftypefun
 
 
diff --git a/lib/gnuastro/statistics.h b/lib/gnuastro/statistics.h
index 7eeff6a..9dcf35a 100644
--- a/lib/gnuastro/statistics.h
+++ b/lib/gnuastro/statistics.h
@@ -179,9 +179,10 @@ gal_statistics_outlier_positive(gal_data_t *input, size_t 
window_size,
                                 float sigclip_param, int inplace, int quiet);
 
 gal_data_t *
-gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t dist, float thresh,
-                                float width, int inplace, int quiet,
-                                size_t *index);
+gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t numprev,
+                                float sigclip_multip, float sigclip_param,
+                                float thresh, size_t numcontig, int inplace,
+                                int quiet, size_t *index);
 
 
 
diff --git a/lib/statistics.c b/lib/statistics.c
index 5cab9b6..f65bf4e 100644
--- a/lib/statistics.c
+++ b/lib/statistics.c
@@ -2213,56 +2213,93 @@ gal_statistics_outlier_positive(gal_data_t *input, 
size_t window_size,
 
 
 
-/* Find the outliers using the first flat portion of the cumulative
-   frequency plot. A good value for window and thresh are 5 and 1.0. */
+/* Find the outliers using the average distance of the neighboring
+   points. */
 #define OUTLIER_FLAT_CFP_BYTYPE(IT) {                                   \
-    IT m, n, *a=nbs->array, *p=a+dist, *pp=a+nbs->size-dist;            \
-    min=a[0]; max=a[nbs->size-1];                                       \
+    IT diff, *pr=prev->array;                                           \
+    IT *a=nbs->array, *p=a+d, *pp=a+nbs->size-d;                        \
+                                                                        \
     do                                                                  \
       {                                                                 \
-        m=( *(p-dist)-min ) / (max-min) * 100.0f;                       \
-        n=( *(p+dist)-min ) / (max-min) * 100.0f;                       \
-        check=1/(n-m);                                                  \
-        if(!quiet) printf("%-6zu%-15g%-15g\n", p-a, (float)(*p), check); \
-        if( check<thresh && start==GAL_BLANK_SIZE_T )                   \
-          { start=p-a; if(!quiet) printf("\t---start\n"); }             \
-        if( check>thresh && start!=GAL_BLANK_SIZE_T )                   \
+        diff=*(p+d)-*(p-d);                                             \
+        if(p-a-d<numprev)                                               \
           {                                                             \
-             widthcheck=100.0 * (double)(*p-a[start])/(max-min);        \
-             if(!quiet) printf("\t----end: %f\n", widthcheck);          \
-             if( widthcheck > width ) { flatind=start; break; }         \
-             start=GAL_BLANK_SIZE_T;                                    \
+            pr[p-a-d]=diff;                                             \
+            if(!quiet) printf("%-6zu%-15g%-15g\n", p-a, (float)(*p),    \
+                              (float)diff);                             \
+          }                                                             \
+        else                                                            \
+          {                                                             \
+            /* Sigma-clipped median and std for a check. */             \
+            prev->flag=0;                                               \
+            prev->size=prev->dsize[0]=numprev;                          \
+            sclip=gal_statistics_sigma_clip(prev, sigclip_multip,       \
+                                            sigclip_param, 1, 1);       \
+                                                                        \
+            sarr=sclip->array;                                          \
+            check = (diff - sarr[1]) / sarr[3];                         \
+                                                                        \
+            /* If requested, print the values. */                       \
+            if(!quiet) printf("%-6zu%-15g%-15g%-15g (%g,%g)\n", p-a,    \
+                              (float)(*p), (float)diff, check, sarr[1], \
+                              sarr[3]);                                 \
+                                                                        \
+            /* When values are equal, std will be roughly zero */       \
+            if(sarr[3]>1e-6 && check>thresh)                            \
+              {                                                         \
+                if(flatind==GAL_BLANK_SIZE_T)                           \
+                  {                                                     \
+                    ++counter;                                          \
+                    flatind=p-a;                                        \
+                  }                                                     \
+                else                                                    \
+                  {                                                     \
+                    if(flatind==p-a-counter)                            \
+                      { /* First element above thresh is 0, so for */   \
+                        /* counting, when counting the number of */     \
+                        /* contiguous elements, we have to add 1. */    \
+                        if(counter+1==numcontig)                        \
+                          {gal_data_free(sclip); break;}                \
+                        else ++counter;                                 \
+                      }                                                 \
+                    else { flatind=GAL_BLANK_SIZE_T; counter=0; }       \
+                  }                                                     \
+              }                                                         \
+            else { flatind=GAL_BLANK_SIZE_T; counter=0; }               \
+            pr[(p-a-d)%numprev]=diff;                                   \
+            gal_data_free(sclip);                                       \
           }                                                             \
       }                                                                 \
     while(++p<pp);                                                      \
-    if( pp==p && start!=GAL_BLANK_SIZE_T )                              \
-      {                                                                 \
-        widthcheck=100.0 * (double)(*(p-1)-a[start])/(max-min);         \
-        if(!quiet) printf("\t----end: %f\n", widthcheck);               \
-        if( widthcheck > width ) flatind=start;                         \
-      }                                                                 \
+    if(counter+1!=numcontig) flatind=GAL_BLANK_SIZE_T;                    \
   }
 
 gal_data_t *
-gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t dist,
-                                float thresh, float width, int inplace,
+gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t numprev,
+                                float sigclip_multip, float sigclip_param,
+                                float thresh, size_t numcontig, int inplace,
                                 int quiet, size_t *index)
 {
-  gal_data_t  *nbs, *out=NULL;
-  double min, max, check, widthcheck;
-  size_t one=1, flatind=GAL_BLANK_SIZE_T, start=GAL_BLANK_SIZE_T;
+  float *sarr;
+  double check;
+  gal_data_t  *nbs, *prev, *out=NULL, *sclip;
+  size_t d=2, counter=0, one=1, flatind=GAL_BLANK_SIZE_T;
 
   /* Sanity checks. */
-  if(thresh<=0 || width<=0)
-    error(EXIT_FAILURE, 0, "%s: the values to `thresh' (%g) and `width' (%g) "
-          "must be positive", __func__, thresh, width);
-  if(width==0)
-    error(EXIT_FAILURE, 0, "%s: `dist' (%zu) cannot be zero", __func__,
-          dist);
+  if(thresh<=0)
+    error(EXIT_FAILURE, 0, "%s: the value of `thresh' (%g) must be "
+          "positive", __func__, thresh);
+  if(numprev==0)
+    error(EXIT_FAILURE, 0, "%s: `numprev' (%zu) cannot be zero", __func__,
+          numprev);
 
   /* Remove all blanks and sort the dataset. */
   nbs=gal_statistics_no_blank_sorted(input, inplace);
 
+  /* Keep previous slopes. */
+  prev=gal_data_alloc(NULL, GAL_TYPE_FLOAT64, 1, &numprev, NULL, 0, -1,
+                      NULL, NULL, NULL);
+
   /* Find the index where the distribution becomes sufficiently flat. */
   switch(nbs->type)
     {
@@ -2281,7 +2318,8 @@ gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t 
dist,
             __func__, nbs->type);
     }
 
-  /* Write the output dataset. */
+  /* Write the output dataset: if no point flat part was found, return
+     NULL. */
   if(flatind!=GAL_BLANK_SIZE_T)
     {
       out=gal_data_alloc(NULL, input->type, 1, &one, NULL, 0, -1,
@@ -2294,5 +2332,6 @@ gal_statistics_outlier_flat_cfp(gal_data_t *input, size_t 
dist,
   /* Clean up and return. */
   if(nbs!=input) gal_data_free(nbs);
   if(index) *index=flatind;
+  gal_data_free(prev);
   return out;
 }



reply via email to

[Prev in Thread] Current Thread [Next in Thread]