[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[gnuastrocommits] master 27df4f2: Book: edited section Quantifying sign
From: 
Mohammad Akhlaghi 
Subject: 
[gnuastrocommits] master 27df4f2: Book: edited section Quantifying signal in a tile 
Date: 
Fri, 21 May 2021 22:48:52 0400 (EDT) 
branch: master
commit 27df4f2d4c584d37521c3088d11bf943936550cb
Author: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Commit: Mohammad Akhlaghi <mohammad@akhlaghi.org>
Book: edited section Quantifying signal in a tile
Until now, this section made no mention of NoiseChisel's 'qthresh'
option, causing confusion for readers who were guided here from the
description of this option in NoiseChisel. More generally, this section had
been expanded over the years, making some paragraphs not fully logically
contiguous.
With this commit, I read through the whole section and reordered some of
the paragraphs, while highlighting that this isn't just for the Sky or Sky
STD, but also for things like NoiseChisel's quantile thresholds.
This issue was reported by Juan Miro.

doc/gnuastro.texi  88 +++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+), 32 deletions()
diff git a/doc/gnuastro.texi b/doc/gnuastro.texi
index 3f5304f..5cd7e8c 100644
 a/doc/gnuastro.texi
+++ b/doc/gnuastro.texi
@@ 15188,37 +15188,49 @@ Therefore all such approaches that try to approximate
the sky value prior to det
@node Quantifying signal in a tile, , Sky value misconceptions, Sky value
@subsubsection Quantifying signal in a tile
+In order to define detection thresholds on the image, or calibrate it for
measurements (subtract the signal of the background sky and define errors), we
need some basic measurements.
+For example the quantile threshold in NoiseChisel (@option{qthresh} option),
or the mean of the undetected regions (Sky) and the Sky standard deviation (Sky
STD) which are the output of NoiseChisel and Statistics.
+But astronomical images will contain a lot of stars and galaxies that will
bias those measurements if not properly accounted for.
+Quantifying where signal is present is thus a very important step in the usage
of a dataset: for example if the Sky level is overestimated, your target
object's brightness will be underestimated.
+
@cindex Data
@cindex Noise
@cindex Signal
@cindex Gaussian distribution
Put simply, noise can be characterized with a certain spread about the
measured value.
In the Gaussian distribution (most commonly used to model noise) the spread is
defined by the standard deviation about the characteristic mean.

Let's start by clarifying some definitions first: @emph{Data} is defined as
the combination of signal and noise (so a noisy image is one @emph{data}set).
@emph{Signal} is defined as the mean of the noise on each element.
We'll also assume that the @emph{background} (see @ref{Sky value definition})
is subtracted and is zero.

When a dataset doesn't have any signal (only noise), the mean, median and mode
of the distribution are equal within statistical errors and approximately equal
to the background value.
+Let's start by clarifying some definitions:
+@emph{Signal} is defined as the nonrandom source of flux in each pixel (you
can think of this as the mean in a Gaussian or Poisson distribution).
+In astronomical images, signal is mostly photons coming of a star or galaxy,
and counted in each pixel.
+@emph{Noise} is defined as the random source of flux in each pixel (or the
standard deviation of a Gaussian or Poisson distribution).
+Noise is mainly due to counting errors in the detector electronics upon data
collection.
+@emph{Data} is defined as the combination of signal and noise (so a noisy
image of a galaxy is one @emph{data}set).
+
+When a dataset doesn't have any signal (for example you take an image with a
closed shutter, producing an image that only contains noise), the mean, median
and mode of the distribution are equal within statistical errors.
Signal from emitting objects, like astronomical targets, always has a positive
value and will never become negative, see Figure 1 in
@url{https://arxiv.org/abs/1505.01664, Akhlaghi and Ichikawa (2015)}.
Therefore, as more signal is added, the mean, median and mode of the dataset
shift to the positive.
The mean's shift is the largest.
The median shifts less, since it is defined based on an ordered distribution
and so is not affected by a small number of outliers.
The distribution's mode shifts the least to the positive.
+Therefore, when signal is added to the data (you take an image with an open
shutter pointing to a galaxy for example), the mean, median and mode of the
dataset shift to the positive, creating a positively skewed distribution.
+The shift of the mean is the largest.
+The median shifts less, since it is defined after ordering all the
elements/pixels (the median is the value at a quantile of 0.5), thus it is not
affected by outliers.
+Finally, the mode's shift to the positive is the least.
@cindex Mean
@cindex Median
@cindex Quantile
Inverting the argument above gives us a robust method to quantify the
significance of signal in a dataset.
Namely, when the mean and median of a distribution are approximately equal, or
the mean's quantile is around 0.5, we can argue that there is no significant
signal.

To allow for gradients (which are commonly present in raw exposures, or when
there are large foreground galaxies in the field), we consider the image to be
made of a grid of tiles@footnote{The options to customize the tessellation are
discussed in @ref{Processing options}.} (see @ref{Tessellation}).
Hence, from the difference of the mean and median on each tile, we can
estimate the significance of signal in it.
The median of a distribution is defined to be the value of the distribution's
middle point after sorting (or 0.5 quantile).
Thus, to estimate the presence of signal, we just have to estimate the
quantile of the mean (@mymath{q_{mean}}).
If a tile's @mymath{q_{mean}0.5} is larger than the value given to the
@option{meanmedqdiff} option, that tile will be considered to have
significant signal and ignored for the next steps.
+Inverting the argument above gives us a robust method to quantify the
significance of signal in a dataset: when the mean and median of a distribution
are approximately equal we can argue that there is no significant signal.
+In other words: when the quantile of the mean (@mymath{q_{mean}}) is around
0.5.
+
+@cindex Signaltonoise ratio
+However, in an astronomical image, some of the pixels will contain more signal
than the rest, so we can't simply check @mymath{q_{mean}} on the whole dataset.
+For example if we only look at the patch of pixels that are placed under the
central parts of the brightest stars in the field of view, @mymath{q_{mean}}
will be very high.
+The signal in other parts of the image will be weaker, and in some parts it
will be much smaller than the noise (for example 1/100th of the noise level).
+When the signaltonoise ratio is very small, we can generally assume no
signal (because its effectively impossible to measure it) and @mymath{q_{mean}}
will be approximately 0.5.
+
+To address this problem, we break the image into a grid of tiles@footnote{The
options to customize the tessellation are discussed in @ref{Processing
options}.} (see @ref{Tessellation}).
+For example a tile can be a square box of size @mymath{30\times30} pixels.
+By measuring @mymath{q_{mean}} on each tile, we can find which tiles that
contain significant signal and ignore them.
+Technically, if a tile's @mymath{q_{mean}0.5} is larger than the value
given to the @option{meanmedqdiff} option, that tile will be ignored for the
next steps.
You can read this option as ``meanmedianquantiledifference''.
+@cindex Skewness
+@cindex Convolution
The raw dataset's pixel distribution (in each tile) is noisy, to decrease the
noise/error in estimating @mymath{q_{mean}}, we convolve the image before
tessellation (see @ref{Convolution process}.
Convolution decreases the range of the dataset and enhances its skewness, See
Section 3.1.1 and Figure 4 in @url{https://arxiv.org/abs/1505.01664, Akhlaghi
and Ichikawa (2015)}.
This enhanced skewness can be interpreted as an increase in the Signal to
noise ratio of the objects buried in the noise.
@@ 15231,16 +15243,18 @@ Also, since they don't occupy too many pixels, they
don't affect the mode and me
But their very high values can greatly bias the calculation of the mean
(recall how the mean shifts the fastest in the presence of outliers), for
example see Figure 15 in @url{https://arxiv.org/abs/1505.01664, Akhlaghi and
Ichikawa (2015)}.
The effect of outliers like cosmic rays on the mean and standard deviation can
be removed through @mymath{\sigma}clipping, see @ref{Sigma clipping} for a
complete explanation.
Therefore, after asserting that the mean and median are approximately equal in
a tile (see @ref{Tessellation}), the measurements on the tile are determined
after @mymath{\sigma}clipping with the @option{sigmaclip} option.
In the end, some of the tiles will pass the mean and median quantile
difference test and will be given a value.
+Therefore, after asserting that the mean and median are approximately equal in
a tile (see @ref{Tessellation}), the Sky and its STD are measured on each tile
after @mymath{\sigma}clipping with the @option{sigmaclip} option (see
@ref{Sigma clipping}).
+In the end, some of the tiles will pass the test and will be given a value.
Others (that had signal in them) will just be assigned a NaN (notanumber)
value.
So we need to use interpolation and assign a value to all the tiles.
+But we need a measurement over each tile (and thus pixel).
+We will therefore use interpolation to assign a value to the NaN tiles.
However, prior to interpolating over the failed tiles, another point should be
considered: large and extended galaxies, or bright stars, have wings which sink
into the noise very gradually.
In some cases, the gradient over these wings can be on scales that is larger
than the tiles and meanmedian distance test will be successful.
This will cause a footprint of such objects after the interpolation see
@ref{Detecting large extended targets}.
+In some cases, the gradient over these wings can be on scales that is larger
than the tiles (for example the pixel value changes by @mymath{0.1\sigma} after
100 pixels, but the tile has a width of 30 pixels).
These tiles (on the wings of large foreground galaxies for example) actually
have signal in them and the meanmedian difference isn't enough (due to the
``small'' size of the tiles).
+In such cases, the @mymath{q_{mean}} test will be successful, even though
there is signal.
+Recall that @mymath{q_{mean}} is a measure of skewness.
+If we don't identify (and thus set to NaN) such outlier tiles before the
interpolation, the photons of the outskirts of the objects will leak into the
detection thresholds or Sky and Sky STD measurements and bias our result, see
@ref{Detecting large extended targets}.
Therefore, the final step of ``quantifying signal in a tile'' is to look at
this distribution of successful tiles and remove the outliers.
@mymath{\sigma}clipping is a good solution for removing a few outliers, but
the problem with outliers of this kind is that there may be many such tiles
(depending on the large/bright stars/galaxies in the image).
We therefore apply the following local outlier rejection strategy.
@@ 15249,7 +15263,7 @@ For each tile, we find the nearest @mymath{N_{ngb}}
tiles that had a usable valu
We then sort them and find the difference between the largest and
secondtosmallest elements (The minimum isn't used because the scatter can be
large).
Let's call this the tile's @emph{slope} (measured from its neighbors).
All the tiles that are on a region of flat noise will have similar slope
values, but if a few tiles fall on the wings of a bright star or large galaxy,
their slope will be significantly larger than the tiles with no signal.
We just have to find the smallest tile slope value that is an outlier compared
to the rest and reject all tiles with a slope larger than that.
+We just have to find the smallest tile slope value that is an outlier compared
to the rest, and reject all tiles with a slope larger than that.
@cindex Outliers
@cindex Identifying outliers
@@ 15259,7 +15273,7 @@ We sort them in increasing order based on their
@emph{slope} (defined above).
So the array of tile slopes can be written as @mymath{@{s[0], s[1], ...,
s[N]@}}, where @mymath{s[0]<s[1]} and so on.
We then start from element @mymath{M} and calculate the distances between all
two adjacent values before it: @mymath{@{s[1]s[0], s[2]s[1], s[M1]s[M2]@}}.
The @mymath{\sigma}clipped median and standard deviation of this distribution
is then found (@mymath{\sigma}clipping is configured with
@option{outliersclip} which takes two values, see @ref{Sigma clipping}).
+The @mymath{\sigma}clipped median and standard deviation of this distribution
are then found (this @mymath{\sigma}clipping is configured with
@option{outliersclip} which takes two values, see @ref{Sigma clipping}).
If the distance between the element and its previous element is more than
@option{outliersigma} multiples of the @mymath{\sigma}clipped standard
deviation added with the @mymath{\sigma}clipped median, that element is
considered an outlier and all tiles larger than that value are ignored.
Formally, if we assume there are @mymath{N} elements.
@@ 15270,14 +15284,24 @@ If the value given to @option{outliersigma} is
displayed with @mymath{s}, the
@dispmath{{(v_iv_{i1})m\over \sigma}>s}
+@noindent
Since @mymath{i} begins from the @mymath{N/3}th element in the sorted array
(a quantile of @mymath{1/3=0.33}), the outlier has to be larger than the
@mymath{0.33} quantile value of the dataset.
Once the outlying tiles have been successfully removed we use nearest neighbor
interpolation to give a value to all tiles in the image.
During interpolation, the median of the @mymath{N_{ngb}} nearest neighbors of
each tile is found and given to the tile (@mymath{N_{ngb}} is the value given
to @option{interpnumngb}).
Once all the tiles are given a value, a smoothing step is implemented to
remove the sharp value contrast that can happen between some tiles.
+@cindex Bicubic interpolation
+@cindex Interpolation, bicubic
+@cindex Nearestneighbor interpolation
+@cindex Interpolation, nearestneighbor
+Once the outlying tiles have been successfully identified and set to NaN, we
use nearestneighbor interpolation to give a value to all tiles in the image.
+We don't use parametric interpolation methods (like bicubic), because they
will effectively extrapolate on the edges, creating strong artifacts.
+Nearestneighbor interpolation is very simple: for each tile, we find the
@mymath{N_{ngb}} nearest tiles that had a good value, the tile's value is found
by estimating the median.
+You can set @mymath{N_{ngb}} through the @option{interpnumngb} option.
+Once all the tiles are given a value, a smoothing step is implemented to
remove the sharp value contrast that can happen on the edges of tiles.
The size of the smoothing box is set with the @option{smoothwidth} option.
You can use the check images to inspect the steps and see which tiles have
been discarded as outliers prior to interpolation (@option{checksky} in the
Statistics program or @option{checkqthresh} in NoiseChisel).
+As mentioned above, the process above is used for any of the basic
measurements (for example identifying the quantilebased thresholds in
NoiseChisel, or the Sky value in Statistics).
+You can use the checkimage feature of NoiseChisel or Statistisc to inspect
the steps and visually see each step (all the options that start with
@option{check}).
+For example, as mentioned in the @ref{NoiseChisel optimization} tutorial, when
given a dataset from a new instrument (with differing noise properties), we
highly recommend to use @option{checkqthresh} in your first call and visually
inspect how the parameters above affect the final quantile threshold (e.g.,
have the wings of bright sources leaked into the threshold?).
+The same goes for the @option{checksky} option of Statistics or NoiseChisel.
[Prev in Thread] 
Current Thread 
[Next in Thread] 
 [gnuastrocommits] master 27df4f2: Book: edited section Quantifying signal in a tile,
Mohammad Akhlaghi <=