pspp-dev

Re: [bug #48040] GLM produces wrong output


From: Alan Mead
Subject: Re: [bug #48040] GLM produces wrong output
Date: Sun, 29 May 2016 15:43:36 -0500
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0

On 5/29/2016 1:22 AM, John Darrington wrote:
> Thanks Alan,
>
> You are right - this is entirely due to missing values.  I'm somewhat relieved
> that it is not something more fundamental.
>
> But the problem I see now is that SPSS does not document how it treats missings.

I agree.  BTW, I used SPSS 23.  I may have said I used SPSS 24 but I just checked and it was 23.

> Perhaps you could do some experiments.  For example, do missing values in the factor variables
> get treated as a separate factor value or does the case get simply dropped?

What further experiments do you propose?

I removed the missing values (in PSPP) and saved the file using PSPP (the attached personality2.sav), closed PSPP, reopened PSPP, opened the new file, and ran the PSPP commands below.

recode agree_score (0=SYSMIS) (else=copy) into A_withmissing.
recode extra_score (0=SYSMIS) (else=copy) into E_withmissing.
recode caution_score (0=SYSMIS) (else=copy) into C_withmissing.
recode caution_score (lo thru 35=1) (36 thru hi=2) (else=SYSMIS) into FactorC.
recode extra_score (lo thru 31=1) (32 thru hi=2) (else=SYSMIS) into FactorE.
execute.

FREQ / agree_score extra_score caution_score FactorC FactorE A_withmissing C_withmissing E_withmissing .

* GLM output from below seems correct.
GLM agree_score BY  FactorC FactorE.

recode caution_score (0=SYSMIS) (1 thru 35=1) (36 thru hi=2) (else=SYSMIS) into FactorC_withmissing.
recode extra_score (0=SYSMIS) (1 thru 31=1) (32 thru hi=2) (else=SYSMIS) into FactorE_withmissing.
freq / FactorC_withmissing FactorE_withmissing .

* GLM output from below seems WRONG.
GLM agree_score BY  FactorC_withmissing FactorE_withmissing.

* GLM output from below seems WRONG, but less blatantly; df is wrong for the factor with missing data.
GLM agree_score BY  FactorC FactorE_withmissing.

* GLM output from below seems correct.
GLM A_withmissing BY  FactorC FactorE.

* GLM output from below seems WRONG.
GLM A_withmissing BY  FactorC_withmissing FactorE_withmissing.

* GLM output from below seems WRONG.
GLM A_withmissing BY  FactorC_withmissing FactorE.

When we run the SPSS 23 SAV file through PSPP GLM with missing values in the dependent variable (only), we get weird results like negative sums of squares (SS).  That apparently doesn't happen when PSPP generates the missing data (for the dependent variable), suggesting that there are differences, as you suggest, between the way SPSS 23 writes a SAV file and the way PSPP does.  Reverse-engineering the SPSS files seems like the kind of thing that Ben has looked into in the past?

But there are still missing data issues that seem to have nothing to do with how the SAV file was created.  PSPP's GLM may handle missing values correctly in the dependent variable, but it appears not to do so for the independent variables, and when both independent variables have missing data it seems to produce spectacularly bad output.
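
To make the two possible treatments of a missing factor value concrete, here is a small Python sketch. It is only an illustration of the question, not a description of what glm.c or SPSS actually does. The function names and the toy cases are made up for this example, and None stands in for SYSMIS.

```python
# Illustrative only: hypothetical helpers, toy data, None stands in for SYSMIS.

def factor_df(values, missing_as_level=False):
    """Main-effect degrees of freedom: number of factor levels minus one."""
    levels = {v for v in values if v is not None}
    if missing_as_level and any(v is None for v in values):
        levels.add("<missing>")  # count missing as one extra level
    return len(levels) - 1


def listwise(rows):
    """Keep only the cases that have no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r)]


# (dependent, FactorC, FactorE) toy cases, not the real personality data.
rows = [
    (30, 1, 1), (34, 1, 2), (28, 2, 1), (36, 2, 2),
    (31, 1, None), (29, None, 2),
]

factor_e = [r[2] for r in rows]
print(factor_df(factor_e))                         # missing dropped: df = 1
print(factor_df(factor_e, missing_as_level=True))  # missing as a level: df = 2
print(len(listwise(rows)))                         # 4 cases survive listwise deletion
```

For a two-level factor, a df of 1 is what you would expect if cases with a missing factor value are dropped, and a df of 2 would be consistent with missing being counted as a level, so the df column is a quick way to tell the two apart in any given output.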

I didn't generate different kinds of missing data; these missing values fall on almost exactly the same cases for each variable.  Zero isn't a possible value for any of the Likert variables, so it represents missing data (probably someone who completed only a small fragment of the full survey), and I recoded zero into missing.  There were 22 zeros for Agreeableness but only 21 for Extraversion and Caution (conscientiousness).  So I think that for 21 cases all three variables were missing, and one case was missing only agreeableness.  I'm sure there are many datasets where missing status is relatively uncorrelated.  I didn't try to re-create such a file, but you could easily do so by randomly/manually censoring the file.
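
A minimal sketch of what random censoring could look like, in Python rather than PSPP syntax, and on made-up scores rather than the attached file. The variable names, the 10% rate, the score range, and the seed are all assumptions chosen for illustration.

```python
import random


def censor(values, rate, rng):
    """Copy `values`, independently replacing each one with None at probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


rng = random.Random(48040)  # fixed seed so the run is repeatable
n = 200

# Made-up scores standing in for two of the Likert scale totals.
agree = [rng.randint(10, 50) for _ in range(n)]
extra = [rng.randint(10, 50) for _ in range(n)]

# Censor each variable separately so that missingness is uncorrelated.
agree_m = censor(agree, 0.10, rng)
extra_m = censor(extra, 0.10, rng)

both = sum(a is None and e is None for a, e in zip(agree_m, extra_m))
either = sum(a is None or e is None for a, e in zip(agree_m, extra_m))
print("missing on both:", both, "missing on at least one:", either)
```

Because each variable is censored independently, only a small share of the cases with any missing value should be missing on both, which is the opposite of the pattern in my file.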

> And what about the dependent variables? If there are say 2 dependent variables and one
> is missing  what happens then?  Is the case dropped for both analyses or just the one that is missing?

Are you asking about the behavior of SPSS?  I believe SPSS offers listwise and pairwise deletion and that pairwise is the default.  So, if there were two dependent variables

Or if you were asking about PSPP, I was just looking at glm.c and I got the impression that it cannot handle two dependent variables yet?
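
To pin down what the two answers to that question would mean in terms of case counts, here is a small Python sketch. It only illustrates the general idea of dropping a case from every analysis versus only from the analysis where it is incomplete; it makes no claim about which one SPSS or PSPP actually uses, and the data and names are invented for the example.

```python
# Columns: (dep1, dep2, factor); None stands in for a missing value.
rows = [
    (30, 12, 1),
    (34, None, 2),   # dep2 missing
    (None, 15, 1),   # dep1 missing
    (36, 18, 2),
    (29, 11, None),  # factor missing: unusable for either analysis
]


def complete(row, columns):
    """True if the case has a value in every one of the given columns."""
    return all(row[c] is not None for c in columns)


# Listwise: a case must be complete on every variable named in the command.
listwise_n = sum(complete(r, (0, 1, 2)) for r in rows)

# Analysis by analysis: each dependent variable keeps every case that is
# complete on that variable and on the factor.
dep1_n = sum(complete(r, (0, 2)) for r in rows)
dep2_n = sum(complete(r, (1, 2)) for r in rows)

print(listwise_n, dep1_n, dep2_n)  # 2 3 3
```

Comparing the N reported for each dependent variable against these two counts would show which rule is being applied.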

-Alan

--
Alan D. Mead, Ph.D.
President, Talent Algorithms Inc.

science + technology = better workers

+815.588.3846 (Office)
+267.334.4143 (Mobile)

http://www.alanmead.org

I've... seen things you people wouldn't believe...
functions on fire in a copy of Orion.
I watched C-Sharp glitter in the dark near a programmable gate.
All those moments will be lost in time, like Ruby... on... Rails... Time for Pi.

          --"The Register" user Alister, applying the famous 
            "Blade Runner" speech to software development

Attachment: personality2.sav
Description: application/spss-sav

