
## [Bug-gsl] Covariance estimate in weighted regression

From: Alex Tartakovsky
Subject: [Bug-gsl] Covariance estimate in weighted regression
Date: Thu, 13 Oct 2005 00:49:55 -0700 (PDT)

From the GSL manual (pp. 361-362), standard texts, and just common sense, one
expects that the output produced by a weighted regression with all the weights
set to 1 should be the same as from an unweighted regression. This is not the
case for the covariance estimates produced by the "fit" and "multifit"
least-squares GSL functions. The reason is that the cov estimates in the
straight versions of the functions include s2 (an estimate of the error
variance):

`cov = s2 * (Q S^-1) (Q S^-1)^T`,

while the weighted versions omit s2.  The straight variant is correct.

If you look at the test files (test_filip.c, test_longley.c, test_pontius.c),
they do provide different expected cov numbers for the weighted and straight
regressions with the same data (all weights are 1). It looks like this
difference is by design. If it is, it's so counterintuitive (and against the
manual) as to be just wrong.

Also, the multifit cov estimate scales if you scale all the weights, while the
similar output from fit does not, because the weighted fit functions normalize
the weights but the weighted multifit ones don't. So a 2x2 cov estimate
produced by multifit differs from what fit outputs with the same data.

In general, I don't see a reason to have separate implementations for weighted
least squares; they just repeat, line by line, their straight counterparts.
Since, in theory, weighted regressions always reduce to straight ones by
scaling the variables, why not simply do the scaling and call the straight
regression? Well, X, y, and w are const, so I guess an internal common
function and a bigger workspace would be needed to make a single
implementation.

One more thing: X is balanced but not centered. If a variable is much bigger
than the rest but has a small deviation, it would make sense to center before
balancing, as is done for ridge regression.

In the meantime, one just needs to include s2 in the cov to fix
gsl_multifit_wlinear_svd; see the attached code (the weights are not
normalized).

Regards,

Alex Tartakovsky

```c
/* this belongs to multifit/multilinear.c */
int
gsl_multifit_wlinear_svd (const gsl_matrix * X,
                          const gsl_vector * w,
                          const gsl_vector * y,
                          double tol,
                          size_t * rank,
                          gsl_vector * c,
                          gsl_matrix * cov,
                          double *chisq, gsl_multifit_linear_workspace * work)
{
  if (X->size1 != y->size)
    {
      GSL_ERROR
        ("number of observations in y does not match rows of matrix X",
         GSL_EBADLEN);
    }
  else if (X->size2 != c->size)
    {
      GSL_ERROR ("number of parameters c does not match columns of matrix X",
                 GSL_EBADLEN);
    }
  else if (w->size != y->size)
    {
      GSL_ERROR ("number of weights does not match number of observations",
                 GSL_EBADLEN);
    }
  else if (cov->size1 != cov->size2)
    {
      GSL_ERROR ("covariance matrix is not square", GSL_ENOTSQR);
    }
  else if (c->size != cov->size1)
    {
      GSL_ERROR
        ("number of parameters does not match size of covariance matrix",
         GSL_EBADLEN);
    }
  else if (X->size1 != work->n || X->size2 != work->p)
    {
      GSL_ERROR
        ("size of workspace does not match size of observation matrix",
         GSL_EBADLEN);
    }
  else
    {
      const size_t n = X->size1;
      const size_t p = X->size2;

      size_t i, j, p_eff;

      gsl_matrix *A = work->A;
      gsl_matrix *Q = work->Q;
      gsl_matrix *QSI = work->QSI;
      gsl_vector *S = work->S;
      gsl_vector *t = work->t;
      gsl_vector *xt = work->xt;
      gsl_vector *D = work->D;

      /* Scale X and y,  A = sqrt(w) X, t = sqrt(w) y */

      gsl_matrix_memcpy (A, X);

      for (i = 0; i < n; i++)
        {
          double wi = gsl_vector_get (w, i);

          if (wi < 0)
            wi = 0;
          else
            wi = sqrt (wi);

          {
            gsl_vector_view row = gsl_matrix_row (A, i);
            double yi = gsl_vector_get (y, i);
            gsl_vector_scale (&row.vector, wi);
            gsl_vector_set (t, i, wi * yi);
          }
        }

      /* Balance the columns of the matrix A */

      gsl_linalg_balance_columns (A, D);

      /* Decompose A into U S Q^T. U is returned in A */

      gsl_linalg_SV_decomp_mod (A, QSI, Q, S, xt);

      /* xt = U^T t */

      gsl_blas_dgemv (CblasTrans, 1.0, A, t, 0.0, xt);

      /* Scale the matrix Q,  Q' = Q S^-1 */

      gsl_matrix_memcpy (QSI, Q);

      {
        double alpha0 = gsl_vector_get (S, 0);
        p_eff = 0;

        for (j = 0; j < p; j++)
          {
            gsl_vector_view column = gsl_matrix_column (QSI, j);
            double alpha = gsl_vector_get (S, j);

            if (alpha <= tol * alpha0)
              {
                alpha = 0.0;
              }
            else
              {
                alpha = 1.0 / alpha;
                p_eff++;
              }

            gsl_vector_scale (&column.vector, alpha);
          }

        *rank = p_eff;
      }

      gsl_vector_set_zero (c);

      /* Solution */

      gsl_blas_dgemv (CblasNoTrans, 1.0, QSI, xt, 0.0, c);

      /* Unscale the balancing factors */

      gsl_vector_div (c, D);

      /* Compute chisq, from residual r = y - X c. Too bad A was destroyed */

      {
        double s2 = 0, r2 = 0;

        for (i = 0; i < n; i++)
          {
            double yi = gsl_vector_get (y, i);
            double wi = gsl_vector_get (w, i);
            gsl_vector_const_view row = gsl_matrix_const_row (X, i);
            double y_est, ri;
            gsl_blas_ddot (&row.vector, c, &y_est);
            ri = yi - y_est;
            r2 += wi * ri * ri;
          }

        s2 = r2 / (n - p_eff);  /* p_eff == rank */

        *chisq = r2;

        /* Form covariance matrix cov = s2 (Q S^-1) (Q S^-1)^T */

        for (i = 0; i < p; i++)
          {
            gsl_vector_view row_i = gsl_matrix_row (QSI, i);
            double d_i = gsl_vector_get (D, i);

            for (j = i; j < p; j++)
              {
                gsl_vector_view row_j = gsl_matrix_row (QSI, j);
                double d_j = gsl_vector_get (D, j);
                double s;

                gsl_blas_ddot (&row_i.vector, &row_j.vector, &s);

                s *= s2 / (d_i * d_j);
                gsl_matrix_set (cov, i, j, s);
                gsl_matrix_set (cov, j, i, s);
              }
          }
      }

      return GSL_SUCCESS;
    }
}
```