Best-subset analysis with size-correction

SPSS/PASW Statistics script (Windows only)

Pavel Klimov

Description. This VBA script generates all combinations, or a subset of the combinations, of the independent variables and runs either canonical variates analysis (discriminant function analysis) or logistic regression on each combination. The output is saved in the working directory and can be analyzed further in any spreadsheet application. For morphometric data, the Darroch and Mosimann (1985) “size correction” can be applied. You can modify this script for your own needs.
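For readers unfamiliar with the size correction, here is a minimal Python sketch of the log-shape-ratio idea usually attributed to Darroch and Mosimann (1985): each measurement of a specimen is divided by the geometric mean of all its measurements, which on the log scale means subtracting the mean of the logs. This is only an illustration under that assumption; the function name size_correct is hypothetical and the code is not taken from the SPSS script.

```python
import math

def size_correct(measurements):
    """Return log-shape ratios for one specimen (one case).

    Assumption: 'size' is the geometric mean of all measurements,
    so log(x_i / size) = log(x_i) - mean(log(x)).
    """
    logs = [math.log(x) for x in measurements]
    log_size = sum(logs) / len(logs)  # log of the geometric mean
    return [v - log_size for v in logs]

# Two specimens with the same shape but different overall size
small = size_correct([2.0, 4.0, 8.0])
large = size_correct([4.0, 8.0, 16.0])
print(small)
print(large)
```

The two specimens give the same corrected values up to rounding, because multiplying every measurement by the same factor changes only the size term.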

Installation.

1. Open SPSS ver. 12-15, 18 or PASW Statistics 18 (the script will not work in ver. 16; ver. 17 has not been tried).

2. Go to menu File-New-Script.

3. Paste the content of this file.

4. Save the script.

You can test the script using this data file (Right-click: download link to disk)

Use of the script

1) Name your independent variables as "var00001 ... var00010 ... " (default for SPSS)

2) Name your dependent variable as "depend" (no quotation marks)

3) Define the following variables:

Nvar - Number of independent variables. If this value is large (e.g. >22), the analysis can be prohibitively long because of the large number of combinations, i.e. 2^Nvar-1. The total number of combinations can be reduced by modifying SubSetMin and SubSetMax (see below).

LNTr - Do logarithmic (base e) transformation? (True/False)

DoSizeCorrection - Enter True if you want Darroch and Mosimann (1985) size correction (for morphometric data only!), otherwise enter False

ExternValid - Perform external validation? (True/False). Variable “Val” must be created in the datafile. Code cases to be included in the analysis as 0 and cases to be used for external validation (holdout sample) as 1.

SubSetMin - if you want to obtain subsets of a particular size, enter the lower limit; otherwise enter 1

SubSetMax - if you want to obtain subsets of a particular size, enter the upper limit; otherwise enter the number of your independent variables (should be equal to Nvar)
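To judge how long a run may take before starting it, the number of subsets can be computed in advance. The Python sketch below is an illustration only, not part of the SPSS script; count_subsets and iter_subsets are hypothetical helper names, and the arguments mirror Nvar, SubSetMin and SubSetMax.

```python
from itertools import combinations
from math import comb

def count_subsets(nvar, subset_min=1, subset_max=None):
    """Number of variable subsets whose size lies in [subset_min, subset_max]."""
    if subset_max is None:
        subset_max = nvar
    return sum(comb(nvar, k) for k in range(subset_min, subset_max + 1))

def iter_subsets(names, subset_min=1, subset_max=None):
    """Yield every subset of names with size in [subset_min, subset_max]."""
    if subset_max is None:
        subset_max = len(names)
    for k in range(subset_min, subset_max + 1):
        yield from combinations(names, k)

print(count_subsets(10))        # 1023, i.e. 2^10 - 1
print(count_subsets(22))        # 4194303, i.e. 2^22 - 1
print(count_subsets(22, 2, 4))  # 9086 subsets of size 2 to 4
print(list(iter_subsets(["var00001", "var00002", "var00003"], 2, 2)))
```

With 22 variables, restricting the subset size to 2-4 reduces the run from about 4.2 million analyses to 9086.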

4) If you have a large number of independent variables (>12), SPSS may run into a memory problem and terminate your analysis. To avoid this, a large analysis can be divided into several analyses, each performing a smaller number of iterations (on my computer, about 8000). Define the following variables:

Prt=True activates this option; Prt=False turns it off

StartRange=1 (from 1 to n) - starts at the specified iteration*

EndRange=8000 - stops after the specified iteration and writes the results to disk*

* These settings will perform 8000 analyses. To conduct another 8000 analyses, set the variables again: StartRange=8001 and EndRange=16000
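When the total number of analyses is large, it can help to list all StartRange/EndRange pairs up front. The following Python sketch is illustrative only; chunk_ranges is a hypothetical helper, and the chunk size of 8000 is simply the figure quoted above, which may differ on your machine.

```python
def chunk_ranges(total, chunk_size=8000):
    """Yield (StartRange, EndRange) pairs, 1-based and inclusive, covering 1..total."""
    for start in range(1, total + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total)

# Example: 20000 analyses split into runs of at most 8000
for start, end in chunk_ranges(20000):
    print(f"StartRange={start}  EndRange={end}")
# StartRange=1  EndRange=8000
# StartRange=8001  EndRange=16000
# StartRange=16001  EndRange=20000
```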

CVA

Define the following variables:

LR=False (tells the script to run CVA instead of Logistic regression)

If ExternValid=True, you have to define another variable: SelectSet

By default SelectSet=vbCrLf & "/SELECT=val(0)", where "0" is the code for your analysis subset and "val" is the name of the variable defining the internal and external datasets (must be created in your data matrix)

Logistic regression

Define the following variables:

LR=True (tells the script to run Logistic regression instead of CVA)

If ExternValid=True, you have to define another variable: SelectSet

By default SelectSet=vbCrLf & "/SELECT = val EQ 0", where "0" is the code for your analysis subset and "val" is the name of the variable defining the internal and external datasets (must be created in your data matrix)

Output processing

The following VBA script processes your output, leaving only the variable names and hit ratio values.

Installation:

1. Open MS Word

2. In the menu, select Tools-Macro-Macros (or Developer-Macros), give the macro a name and press "Create"

3. Paste the content of this file

Use:

4. Open SPSS output file as text (the file should be in the working directory, the name of the file should start with “subsets..” , or “LR_subsets” for logistic regression)

5. Run the script: in menu select Tools-Macro-Macros and select the name of the macro from step 3 (Installation)

6. Deduce the meaning of the values from the raw file. For example,

The row “v01 v02 62.5 60.0 61.5” means:

Two independent variables, v01, v02 were used. 62.5 (first) is the percentage of selected original grouped cases correctly classified; 60.0 is the percentage of unselected original grouped cases correctly classified; and 61.5 (last) is the percentage of selected cross-validated grouped cases correctly classified. These data can be sorted in MS Excel to get the extreme values.
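As an alternative to sorting in a spreadsheet, the processed rows can be ranked with a few lines of Python. This is a sketch under the assumption that every row ends with exactly three numeric values in the order described above; parse_row is a hypothetical helper, and the second row in the example is made up purely for illustration.

```python
def parse_row(row):
    """Split a processed output row into variable names and three hit ratios.

    Assumption: the last three tokens are the hit ratios for selected,
    unselected and cross-validated cases, in that order.
    """
    tokens = row.split()
    names = tokens[:-3]
    selected, unselected, crossval = (float(t) for t in tokens[-3:])
    return {
        "variables": names,
        "selected": selected,
        "unselected": unselected,
        "crossval": crossval,
    }

rows = [
    "v01 v02 62.5 60.0 61.5",  # the example row from the text
    "v01 v03 70.0 55.0 66.0",  # made-up row, for illustration only
]
parsed = [parse_row(r) for r in rows]
# Rank subsets by cross-validated hit ratio, best first
for r in sorted(parsed, key=lambda r: r["crossval"], reverse=True):
    print(r["variables"], r["crossval"])
```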

An outdated page that generates command syntax for datasets with a small number of independent variables can be found here.