Thanks Alan,
The coding is as you describe. Three variables are currently coded in this
way. One of them, for example, is employment, which can be FT/PT/none. In the
dataset, FT = 1, PT = 2, None = 3.
Therefore,
- FT becomes a new variable = 1 if employment = 1
- PT becomes a new variable = 1 if employment = 2
- employment = 3 is not included. 'None' is the reference level.
In the example regression output table I tried to include in my message, these
are RA1SG17A_1 (for FT) and RA1SG17A_2 (for PT); RA1SG17A_1 is one of the
variables producing NaN. (What's the best way to include a regression output
table in a pspp-users@gnu.org message?)
Tim Goodspeed
+44 (0)7714 136 176 | @TimGoodspeed
-----Original Message-----
From: pspp-users-bounces+tim.goodspeed=btinternet.com@gnu.org
<pspp-users-bounces+tim.goodspeed=btinternet.com@gnu.org> On Behalf Of
pspp-users-request@gnu.org
Sent: Wednesday, December 20, 2023 10:17 AM
To: pspp-users@gnu.org
Subject: Pspp-users Digest, Vol 209, Issue 11
Message: 1
Date: Wed, 20 Dec 2023 04:16:44 -0600
From: Alan Mead <amead2@alanmead.org>
To: pspp-users@gnu.org
Subject: Re: dummy coding of categorical variables results in zero
coefficients and standard errors
Message-ID: <35bbe032-0a39-413a-b632-88cdaa727245@alanmead.org>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
Tim,
NaN looks like a numerical error. I'm curious, how many levels does the
variable have and how many dummy variables are you using?
If the original variable has K levels, you should have K-1 dummy variables. For
example, if your variable were location (1=rural, 2=suburban, 3=urban) then you
would pick one level to be the reference and create two dummy variables,
perhaps:
recode location (1=1) (else=0) into dum1.
recode location (2=1) (else=0) into dum2.
Then the coefficients of dum1 and dum2 tell you how living in a rural
(dum1) or suburban (dum2) area compares to living in an urban area.
The model won't be defined if you use K variables for K levels.
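To illustrate why K dummies for K levels leave the model undefined, here is a
minimal Python sketch (not PSPP syntax). The location data, the helper name
`dummy`, and the extra variable `dum3` are made up for illustration. With all
three dummies in the model alongside the constant, the dummies sum to the
constant in every row, so one column carries no new information.

```python
# Hypothetical data: location coded 1=rural, 2=suburban, 3=urban.
location = [1, 2, 3, 3, 2, 1, 3]


def dummy(values, level):
    """Return a 0/1 indicator list for one level of a categorical variable."""
    return [1 if v == level else 0 for v in values]


dum1 = dummy(location, 1)  # rural
dum2 = dummy(location, 2)  # suburban
dum3 = dummy(location, 3)  # urban (the level that should be left out)
constant = [1] * len(location)  # the intercept column of the regression

# With all K = 3 dummies, every row's dummies add up to the constant,
# so the predictors are perfectly collinear with the intercept.
all_k_redundant = [a + b + c for a, b, c in zip(dum1, dum2, dum3)] == constant

# With K - 1 = 2 dummies, that exact dependency disappears, because the
# urban rows have dum1 + dum2 = 0 rather than 1.
k_minus_1_redundant = [a + b for a, b in zip(dum1, dum2)] == constant

print(all_k_redundant, k_minus_1_redundant)
```

This only checks the one dependency discussed above; other kinds of
collinearity between predictors would need a separate check.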
I notice that both of the zeros are for xxx_1 variables, which suggests the
categorical variable may not be coded correctly. But I don't know if that's
what you are seeing. You could also get zeros if there were no instances of
that dummy code, but you shouldn't see NaN values. It could also be another
problem, or a bug. In fact, I think it's probably a bug to see NaNs...
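One way to rule out the "no instances" case is to count the cases at each
level before building the dummies. The following is a rough Python sketch (not
PSPP syntax); the data and the helper name `level_counts` are hypothetical and
reuse the location example above, with no rural cases on purpose.

```python
from collections import Counter


def level_counts(values, levels):
    """Count how many cases fall in each expected level (0 if none)."""
    counts = Counter(values)
    return {level: counts.get(level, 0) for level in levels}


# Hypothetical data: 1=rural, 2=suburban, 3=urban, with no rural cases.
location = [2, 3, 3, 2, 3]

counts = level_counts(location, [1, 2, 3])

# A level with zero cases yields a dummy variable that is 0 in every row.
empty_levels = [level for level, n in counts.items() if n == 0]

print(counts)
print(empty_levels)
```

Any level listed in `empty_levels` would give an all-zero dummy, which cannot
contribute anything to the regression.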
-Alan
On 12/20/23 3:46 AM, tim.goodspeed@btinternet.com wrote:
A basic stats question and a specific PSPP query, please. Any help
gratefully received. I can’t see this in the archives anywhere
(searching for ‘categorical’ and ‘dummy’).
For a linear regression, some of the variables are categorical and so are
included using dummy coding (see "Coding Systems for Categorical Variables
in Regression Analysis", ucla.edu,
<https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2/#:~:text=Categorical%20variables%20require%20special%20attention,entered%20into%20the%20regression%20model.>).
*Basic stats question:* This results in a zero coefficient and zero
standard error for some variables, as shown in the example below. Is
this correct? Does it mean there is little or no linear relationship to
be found?
*Specific PSPP query:* If there is little relationship and the coefficient
is very small, is there a way to tell PSPP to show the very small
value instead of zero?
Thanks in advance
Table: Model Summary (adjRA1SR1)
R        R Square  Adjusted R Square  Std. Error of the Estimate
0.55723  0.310505  0.302797           0.8359
Table: ANOVA (adjRA1SR1)
            Sum of Squares  df    Mean Square  F          Sig.
Regression  619.25791       22    28.148087    40.284698  0
Residual    1375.0987       1968  0.698729
Total       1994.3566       1990
Table: Coefficients (adjRA1SR1)
(B and Std. Error: Unstandardized Coefficients; Beta: Standardized
Coefficients; Lower Bound and Upper Bound: 95% Confidence Interval for B)
Variable    B         Std. Error Beta      t          Sig.  Lower Bound Upper Bound
(Constant)  8.163407  0.310014   0         26.332394  0     7.555417    8.771397
lnSTINC     -0.036745 0.011677   -0.088107 -3.146888  0.002 -0.059645   -0.013845
RA1PKHSIZ   -0.011834 0.016218   -0.020561 -0.729708  0.466 -0.043639   0.019971
RA1PRAGE    -0.039326 0.011175   -0.550388 -3.519082  0     -0.061242   -0.01741
sqPRAGE     0.000464  0.000109   0.666977  4.258349   0     0.00025     0.000678
RA1PRSEX    0.13709   0.03935    0.068446  3.483888   0.001 0.059918    0.214261
RA1PB19_1   0         0          0         NaN        NaN   0           0
RA1PB19_2   -0.485628 0.170694   -0.054029 -2.845015  0.004 -0.820389   -0.150867
RA1PB19_3   -0.324574 0.058981   -0.109094 -5.503011  0     -0.440246   -0.208902
RA1PB19_4   -0.333625 0.089807   -0.074169 -3.714896  0     -0.509752   -0.157497
RA1PB1      -0.002888 0.008407   -0.007002 -0.343559  0.731 -0.019376   0.0136
RA1SG17A_1  0         0          0         NaN        NaN   0           0
RA1SG17A_2  -0.061221 0.053837   -0.021822 -1.137147  0.256 -0.166804   0.044363
RA1PA1      -0.15082  0.022182   -0.160102 -6.7991    0     -0.194324   -0.107317
RA1PA2      -0.248882 0.024367   -0.243609 -10.214077 0     -0.29667    -0.201095
RA1SC1      -0.328042 0.073134   -0.08782  -4.485512  0     -0.471469   -0.184614
RA1PF3bin   0.003064  0.041159   0.001422  0.074435   0.941 -0.077655   0.083783
RA1PF7A_2   0.009538  0.086914   0.002111  0.109735   0.913 -0.160917   0.179992
RA1PF7A_3   0.14177   0.166844   0.016081  0.849712   0.396 -0.18544    0.468979
RA1PF7A_4   -0.104009 0.155971   -0.01266  -0.666848  0.505 -0.409894   0.201877
RA1PF7A_5   0.173309  0.59246    0.005486  0.292525   0.77  -0.988606   1.335224
RA1PF7A_6   0.064264  0.080864   0.01504   0.794712   0.427 -0.094325   0.222853
RA1PG2      -0.350528 0.030049   -0.233421 -11.66509  0     -0.40946    -0.291597