Tuesday, June 13, 2017

A really minimal reproducible workflow using point-and-click in SPSS

Shortly after I finished my Master's research project, I decided - for some reason now lost in the sands of time - that I wanted to double-check some results I had produced in my thesis. (My project was a psychometric validation study using confirmatory factor analysis in AMOS*). I found my final data file and my model specification, re-ran the analysis with the same options selected, right down to the random seed for the bootstrapping of results, and.... all the factor loadings were very slightly different to those that I had published.

Shit. Frantic investigations ensued.

...

The reason for the discrepancy in results, it ultimately turned out, was this: The bootstrapping procedure selects a random sample of rows for each bootstrapped sample, and even with the same random seed, which rows get selected depends on the order of the rows in the dataset. At some point I had run my analyses with my dataset sorted in a different way than I found it when trying to reproduce the analyses - hence the very slightly different numbers coming out. A tiny and seemingly inconsequential aspect of my workflow had resulted in my findings becoming unreproducible.

Clearly, there was a flaw in my workflow, and the consequences could have been even worse: When you rely on point-and-click programs and manual data manipulation in programs like Excel, SPSS and AMOS, it's damn easy to end up with a study folder that's jam-packed with data files with names like "Data134", "Finaldata34", "FinalfinaldataEM", and "SUPERDATABASE.sav"**, screeds of output files, and no idea whatsoever how you ended up with that table of results on page 34 of your thesis.

Now, in an ideal world, we'd all be writing our analyses in R, using R Markdown or knitr to produce output for publication, with neatly commented scripts, our original data saved for posterity on the OSF, and our scripts under version control with git. (Or something along those lines.)

The thing is, though, that there are many people analysing data out there who have minimal programming skills, and I don't think that reproducible workflows should only be the province of people with those skills. For example, I teach a postgraduate multivariate data analysis course at a university where most of my students haven't had any experience with programming, and very little experience with data analysis full stop; I don't think it's reasonable to expect them all to use tools like R and GitHub.

So I'm interested in how people using SPSS via point-and-click can produce reproducible workflows without having to battle with programming code. The constraint of doing this via point-and-click is obviously going to result in an imperfect solution, but anything is better than a workflow consisting of 20 different versions of your data file, 75 output files all based on different versions of the data, and published results tables based on some haphazard assortment of the above combined with a few typos. Without further ado, here is a draft workflow:

  1. Set up a folder for your project, and set up a backup solution. This could be as simple as having the project folder within your My Dropbox folder, if you use Dropbox.
  2. If you have to type raw data into SPSS yourself (e.g., from paper surveys), do this carefully and then double-check every single data point. Ideally, have someone else check all the data points again. Be absolutely sure you're happy with the raw data before doing any data processing or analysis.
  3. Now save a "raw data.sav" copy of your dataset. Do not ever change it again - this is the key departure point for everything else.
  4. From this point forward, do all your data processing and analysis using menu commands, but always click "Paste" instead of "Ok" once you've specified an analysis or command. This will paste the syntax for the analysis or command you've specified into a syntax file. To actually run the analysis or command, select (highlight) it in the syntax file and click the green "run selection" triangle. Crucially, you should NOT:
    1. manually edit data in the data view (don't even sort it within the data view; use Data > Sort cases for that).
    2. manually edit variable properties in the variable view. If you want to change variable properties (label, value labels, measure type, width etc.) use the Data > Define Variable Properties command.
  5. You can now do whatever data processing you need using SPSS (e.g., reverse-coding items, imputing missing data, combining items into scale scores via Transform > Compute Variable, labelling variables and levels, etc.). Again, make sure you do all of this via menu commands, clicking Paste each time.
  6. Keep your syntax file neat - if you run lots of analyses and click paste every time, you will end up with lots and lots of code chunks in your syntax. Delete the code if you've run analyses that you don't need (though remember not to run lots of analyses looking for "interesting" results unless your project is explicitly exploratory - stick to whatever analyses you set out in your pre-registration!)
  7. Carefully comment your syntax file, starting with a comment at the top stating the dataset it applies to, the date, who wrote it, and any other crucial contextual information. Begin a comment in SPSS syntax by typing an asterisk, and end the comment by typing a full stop. Include comments explaining what the various data processing commands and analyses achieve.
  8. As you work, you may end up with several versions of your syntax file(s) for your data analysis and processing. This is ok, but include a comment at the beginning of the syntax file explaining what version it is and what the key differences are from previous versions. Make absolutely sure it's clear what dataset is used with the syntax file. Give the files sensible sequential filenames (e.g., "analysis script version 1.sps", "analysis script version 2.sps").
  9. When you get to the point of copying SPSS output into your write-up, add a comment next to the analysis in the syntax file (e.g., "This analysis goes in Table 3 in the manuscript"). Avoid manually typing numbers into your manuscript whenever possible; copy and paste instead (if necessary, paste into Excel first for rounding).
  10. Before submitting a paper or thesis, make sure that you can take your original raw data and a syntax file (or more than one syntax file) and easily reproduce every single figure reported in the paper. Ensure you keep and back up copies of the files necessary to do this, and if at all possible post copies somewhere where others can access them (e.g., osf.io).
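To make the ideas in steps 4 to 9 concrete, here is a rough sketch of what a pasted-and-commented syntax file might end up looking like. The file names, variable names (id, q1 to q3), and table reference are all invented for illustration; the commands your own Paste clicks produce will differ depending on the menus and options you use.

```spss
* Analysis syntax, version 1 - J. Bloggs, June 2017.
* Applies to "raw data.sav". Produces the scale scores reported in Table 3.

GET FILE='raw data.sav'.

* Sort by participant ID (pasted from Data > Sort Cases).
SORT CASES BY id(A).

* Reverse-code item q3 (scored 1-5), pasted from Transform > Compute Variable.
COMPUTE q3r = 6 - q3.
EXECUTE.

* Combine items into a scale score - this goes in Table 3 of the manuscript.
COMPUTE scale_total = MEAN(q1, q2, q3r).
EXECUTE.

* Save the processed data under a NEW name; never overwrite the raw data file.
SAVE OUTFILE='processed data.sav'.
```

Note that everything here except the comments can be generated by clicking Paste rather than OK; the only hand-typed parts are the comment lines, each beginning with an asterisk and ending with a full stop.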
So that's my basic idea for an unintimidating reproducible workflow for beginner data analysts. It's imperfect, but it at least results in a syntax file and raw data file that should be sufficient to reproduce any published results.

Do any readers of my blog have suggestions for how this can be improved? I know my own analysis practices aren't as neat and organised as they could be, so I'm open to feedback. Ideally this could eventually be something I can pass on to my students for their own data analyses.


*I know: Ew.
**Actual example from one of my old projects.