Best-subset analysis with size-correction
SPSS/PASW Statistics script (Windows only)
Pavel Klimov
Description. This VBA script generates all combinations, or a subset of the
combinations, of the independent variables and performs either canonical variates
analysis (discriminant function analysis) or logistic regression on each. The output
is saved in the working directory and can be analyzed further in any
spreadsheet application. The Darroch and Mosimann (1985) "size correction" can be applied to morphometric data.
You can modify this script for your own needs.
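As an illustration of the size correction mentioned above, here is a minimal Python sketch of log-shape ratios as they are usually attributed to Darroch and Mosimann (1985): each log-transformed measurement minus the mean of the log-transformed measurements of the same specimen (the log of the geometric mean, taken as "size"). The function name `size_correct` is hypothetical, and the script itself may compute the correction differently (for example, over a different set of variables).

```python
import math

def size_correct(measurements):
    """Return log-shape ratios for one specimen: ln(x_i) - mean(ln(x)).

    Assumption: "size" is the geometric mean of all measurements passed in.
    All measurements must be positive.
    """
    logs = [math.log(x) for x in measurements]
    log_size = sum(logs) / len(logs)  # ln of the geometric mean
    return [value - log_size for value in logs]

# Two specimens with the same proportions but different overall size
# give the same shape variables.
small = size_correct([2.0, 8.0])
large = size_correct([20.0, 80.0])
print(small)
print(large)
```

Because the shape variables of a specimen sum to zero, they are linearly dependent; this is one reason to treat the sketch as an illustration rather than as a drop-in replacement for the script's own option.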
Installation.
1. Open SPSS ver. 12-15 or 18, or PASW Statistics 18 (the script does not
work in ver. 16; ver. 17 has not been tested).
2. Go to menu File-New-Script.
3. Paste the content of this file.
4. Save the script.
You can test the script using this data file (Right-click: download link
to disk)
Use of the script
1) Name your independent variables
as "var00001 ... var00010 ... " (default for SPSS)
2) Name your dependent variable as
"depend" (no quotation marks)
3) Define the following variables:
Nvar - Number of independent variables. If this value is large
(e.g. >22), the analysis can be prohibitively long because of the large
number of combinations, i.e., 2^Nvar - 1. The total number of
combinations can be reduced by changing SubSetMin and SubSetMax (see below).
LNTr - Do logarithmic (base e) transformation? (True/False)
DoSizeCorrection - Enter True if you want Darroch
and Mosimann (1985) size correction (for morphometric data only!), otherwise enter False
ExternValid - Perform external validation?
(True/False). Variable "val" must be created in the data file. Code cases to be
included in the analysis as 0 and cases to be used for external validation
(holdout sample) as 1.
SubSetMin - if you want to obtain subsets of a particular size, enter
its lower limit; otherwise enter 1
SubSetMax - if you want to obtain subsets of a
particular size, enter its upper limit; otherwise enter the number of your
independent variables (should be equal to Nvar)
4) If you have a large number of
independent variables (>12), SPSS may run out of memory and
terminate your analysis. To avoid this, a large analysis can be divided into
several smaller analyses, each performing a limited number of iterations (on my computer, about 8000). Define the following variables:
Prt - True activates this option, False turns it off
StartRange=1 - the iteration (from 1 to n) at which the run starts*
EndRange=8000 - the iteration after which the run stops
and writes the results to disk*
* These settings will perform 8000
analyses. To conduct the next 8000 analyses, set the variables again:
StartRange=8001 and EndRange=16000
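To plan the runs described in steps 3 and 4, the arithmetic can be sketched in Python. The function names `count_subsets` and `batch_ranges` are hypothetical helpers, not part of the script; the batch size of 8000 is the figure quoted above and may need adjusting for your machine.

```python
from math import comb

def count_subsets(nvar, subset_min=1, subset_max=None):
    """Number of variable subsets with sizes from subset_min to subset_max.

    With the defaults (1 to nvar) this equals 2**nvar - 1.
    """
    if subset_max is None:
        subset_max = nvar
    return sum(comb(nvar, k) for k in range(subset_min, subset_max + 1))

def batch_ranges(total, batch_size=8000):
    """Split iterations 1..total into (StartRange, EndRange) pairs."""
    return [(start, min(start + batch_size - 1, total))
            for start in range(1, total + 1, batch_size)]

total = count_subsets(14)          # 2**14 - 1 = 16383 subsets
print(total)
print(count_subsets(14, 2, 3))     # only subsets of 2 or 3 variables: 455
for start, end in batch_ranges(total):
    print(f"StartRange={start}  EndRange={end}")
```

For 14 variables this yields three runs: 1 to 8000, 8001 to 16000, and 16001 to 16383.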
CVA
Define the following variables:
LR=False (tells the script to run CVA
instead of logistic regression)
If ExternValid=True, you have to
define another variable, SelectSet.
By default, SelectSet=vbCrLf &
"/SELECT=val(0)", where "0" is
the code for your analysis subset and "val" is the name of the
variable that separates the internal and external datasets (it must be created in your
data matrix).
Logistic regression
Define the following variables:
LR=True (tells the script to run
logistic regression instead of CVA)
If ExternValid=True, you have to
define another variable, SelectSet.
By default, SelectSet=vbCrLf & "/SELECT = val EQ 0", where
"0" is the code for your analysis subset and "val" is the
name of the variable that separates the internal and external datasets (it must
be created in your data matrix).
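The two default SelectSet values differ only in the form of the SELECT subcommand that each procedure expects. The following Python sketch builds both strings; `select_subcommand` is a hypothetical helper written for illustration, and `"\r\n"` stands in for the vbCrLf constant used in the script.

```python
def select_subcommand(logistic, var="val", code=0):
    """Return the SelectSet string for logistic regression or CVA.

    var  - name of the variable that marks the analysis and holdout cases
    code - value of var that marks the cases used in the analysis
    """
    if logistic:
        # Form quoted above for logistic regression (LR=True)
        return f"\r\n/SELECT = {var} EQ {code}"
    # Form quoted above for CVA (LR=False)
    return f"\r\n/SELECT={var}({code})"

print(repr(select_subcommand(logistic=False)))
print(repr(select_subcommand(logistic=True)))
```

If your selection variable or analysis code differs from the defaults, only the `var` and `code` parts of the string change.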
Output processing
The following VBA script will
process your output, leaving only the variable names and hit ratio values.
Installation:
1. Open MS Word
2. In the menu, select Tools-Macro-Macros
(or Developer-Macros), press "Create", and give the macro a name.
3. Paste the content of this file.
Use:
4. Open the SPSS output file as text (the
file should be in the working directory; its name should start with
"subsets..", or "LR_subsets" for logistic regression).
5. Run the script: in the menu, select
Tools-Macro-Macros and select the macro you named in step 2 (Installation).
6. Deduce the meaning of the values from the raw file. For example, the row
"v01 v02 62.5 60.0 61.5" means:
Two independent variables, v01 and v02, were used. 62.5 (first) is the percentage of
selected original grouped cases correctly classified; 60.0 is the percentage of
unselected original grouped cases correctly classified; and 61.5 (last) is the
percentage of selected cross-validated grouped cases correctly classified.
These data can be sorted in MS Excel to get the extreme values.
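As an alternative to a spreadsheet, the processed rows can be ranked with a short Python sketch. It assumes every row ends with exactly three hit ratios in the order described above (selected, unselected, cross-validated); rows from runs without external validation may have a different layout. The names `parse_row` and `rank_rows` are hypothetical, and the second sample row is made up for illustration.

```python
def parse_row(line):
    """Split a processed row into variable names and three hit ratios."""
    tokens = line.split()
    names = tokens[:-3]                        # everything before the ratios
    ratios = [float(t) for t in tokens[-3:]]   # assumed: always three values
    return {
        "variables": names,
        "selected": ratios[0],
        "unselected": ratios[1],
        "cross_validated": ratios[2],
    }

def rank_rows(lines, key="cross_validated"):
    """Return parsed rows sorted from the highest to the lowest hit ratio."""
    rows = [parse_row(line) for line in lines if line.strip()]
    return sorted(rows, key=lambda row: row[key], reverse=True)

sample = [
    "v01 v02 62.5 60.0 61.5",   # row from the example above
    "v01 70.0 65.0 68.0",       # made-up row for illustration
]
for row in rank_rows(sample):
    print(" ".join(row["variables"]), row["cross_validated"])
```

Sorting on the cross-validated or unselected (holdout) percentage rather than the first value avoids favoring subsets that only fit the analysis sample well.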
An outdated page that generates command syntax for datasets
with a small number of independent variables can be found here.