With Spring finally in the air, our speaker for April’s Open Lunch was Dr Viktoria Spaiser, Associate Professor in Sustainability Research and Computational Social Sciences at the School of Politics and International Studies.
Thanks in no small part to open science, something like hope has also arrived with Spring. The complete genome of the Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1 was sequenced and made publicly available on 10 January 2020 (GenBank accession number MN908947), paving the way for the rapid development and deployment of several different vaccines only 12 months later.
Viktoria is a social scientist, however, and gave a comprehensive overview of how open and reproducible research is currently practised in the social sciences, how it varies across quantitative, computational, and qualitative social research, and how these practices are currently changing. She also discussed the specific barriers to open and reproducible research in social science, and how at least some of them might be addressed in the future.
Viktoria began her talk with a quick definition of open and reproducible research in the social sciences, emphasising the importance of transparency at every stage of the research process, from the way data is collected and processed to how it is analysed and how conclusions are drawn. Data should be made available along with the analysis methodology. Communication of results should also be transparent. Reproducible science is about ensuring that you as the researcher, as well as others, are able to replicate your results at a later stage.
Replicability in social science: a crisis?
Open and reproducible science is relatively well established in some disciplines, especially Medicine, but is much less so in social science. There has been high-profile coverage of Psychology in particular, where a large-scale replication study in 2015 found that more than half of a sample of 98 psychology studies from 3 different journals could not be independently replicated. While this has led to a recognition amongst social scientists of the importance of rigorous research methods to ensure that research results are sound, a 2020 study* found minimal adoption of transparency and reproducibility-related research practices, with practices like sharing research protocols and preregistration apparently not having penetrated the broader social sciences at all.
*looking at research from 2014-2017
Viktoria highlighted 10 core principles from Transparent and Reproducible Social Science Research: How to Do Open Science (2019, first ed., University of California Press), a book that focusses on quantitative methods but will also be of use to qualitative researchers (links to respective points in Viktoria’s talk):
- Meta analysis / Systematic evidence synthesis
- Pre-registration
- Pre-analysis plan (research protocol)
- Sensitivity analysis
- Reporting standards
- Replication
- Data sharing
- Script & material sharing
- Reproducible workflows
- Open access publication
Viktoria went through each of these in detail, and it’s well worth watching the full recording of her talk on YouTube, or following the link to each point in the list above.
Meta analysis / Systematic evidence synthesis
While many journals encourage meta analyses and systematic evidence syntheses, with specific paper formats designed to ensure that a researcher properly reviews the body of research in a given field, there is still too little meta analysis carried out. Systematic evidence synthesis goes beyond a simple literature review, which can be very selective, to consider everything published on a given topic, or at least a random sample of published studies. This helps to mitigate researcher bias and the cherry-picking of results. It’s also important to consider both quantitative and qualitative research.
Preregistration refers to archiving a read-only, time-stamped study protocol in a public repository before commencing a study. Preregistration is particularly important for quantitative researchers to differentiate between ‘confirmatory’ research, i.e. testing hypotheses with new observations, and ‘exploratory’ research, i.e. generating hypotheses from existing observations. Preregistration also provides an opportunity to report null results.
While originally focussed on quantitative research, preregistration can also be useful in qualitative research, where it is important to be transparent about your theoretical framework, for example, or about the basis on which qualitative data was collected and analysed. Viktoria referred to a 2019 paper, Preregistering qualitative research (Haven, T.L., Van Grootel, L.), which was discussed at a Leeds ReproducibiliTea journal club last year (recording on YouTube).
A pre-analysis plan is what you actually publish during preregistration and should include:
- Study Design (e.g. RCT, panel data with FE, discussion groups with …)
- Study Sample (incl. strategy for missing data, attrition, etc.)
- Primary & Secondary Outcome Measures or Conceptualisation (e.g. variables specifications, causal chains, interview questions, discussion and justification of the concepts used in one’s research)
- Families of Mean Effects (if numerous outcome measures tested, group outcomes into families)
- Multiple Hypothesis Testing Adjustment (accounting for false positives e.g. by adjusting p-values)
- Subgroups (if outcomes interacted with baseline characteristics to test for heterogeneous treatment effects, subgroups to be interviewed)
- Direction of Effect (to validate one-directional statistical tests)
- Exact Statistical/Methodological Specification (e.g. linear, generalized linear or some other regression type?, control variables?, robust estimation of standard errors?, equations, distributional assumptions, etc.); for qualitative researchers other methodological specifications (e.g. what discourse analysis method etc.)
- Structural Model (if you are testing a formal mathematical model, e.g. specify the utility function etc.)
- Time-Stamp It!
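To make the "Multiple Hypothesis Testing Adjustment" item a little more concrete, here is a minimal sketch of one common way to adjust p-values. The Holm-Bonferroni procedure is our own choice for illustration, not something specified in the talk, and the p-values are made up.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must never decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for four outcome measures in the same family.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(p_values)

for raw, adj in zip(p_values, adjusted):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.3f}")
```

Stating the adjustment method in the pre-analysis plan, before seeing the data, is what stops it being chosen after the fact to suit the results.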
Sensitivity analysis is about transparency of results and helps ensure you don’t just report those results that fit your narrative while disregarding alternative explanations. Essentially this means combining or averaging across many different models and analytical decisions.
The core message is to test all possible models and consider all possible explanations with all the data and do so transparently.
Once again this applies to qualitative as well as quantitative research, for example by using multiple coders to avoid biased coding in your analysis.
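As a rough illustration of running an analysis across many analytical decisions, the sketch below estimates the same effect under every combination of two choices. The data, the outlier cutoff and the choice of summary statistics are all hypothetical and only there to show the pattern.

```python
from itertools import product
from statistics import mean, median

# Hypothetical outcome scores for a control and a treatment group.
control = [4.1, 3.8, 5.0, 4.4, 4.7, 9.5]
treated = [5.2, 4.9, 5.8, 5.5, 6.1, 5.0]

# Analytical decisions that a researcher could defensibly make either way.
outlier_cutoffs = [None, 8.0]  # keep every score, or drop scores above 8
summaries = {"mean": mean, "median": median}


def estimate_effect(cutoff, summary):
    """Difference in the chosen summary statistic, treated minus control."""
    def keep(values):
        return [v for v in values if cutoff is None or v <= cutoff]
    return summary(keep(treated)) - summary(keep(control))


# Run every combination of decisions and keep all of the results.
results = {
    (cutoff, name): estimate_effect(cutoff, fn)
    for cutoff, (name, fn) in product(outlier_cutoffs, summaries.items())
}

for spec, effect in results.items():
    print(spec, round(effect, 2))
print("smallest:", round(min(results.values()), 2),
      "largest:", round(max(results.values()), 2))
```

Reporting the whole range, rather than the single specification that looks best, is the point of the exercise.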
Whereas preregistration is about transparency before research is undertaken, reporting standards relate to transparent reporting and communication of results and what is reported in the final publications:
- Study context: funding sources, conflicts of interest
- Study design: sampling, what measures were collected (what questions were asked), how sample size was determined
- Details about data collection process: when was data collected? How much noncompliance? Attrition?
- Data description: number of observations, means and standard deviations
- Analysis method: statistical method? Qualitative analysis method? How was missing data handled?
- Results reporting: confidence intervals, results of sensitivity analysis, subgroup analysis? What measures/treatments/questions did not work?
- Study limitations: weaknesses? Biases? Generalizability?
- Material availability: data and code available? Data collection protocol available? Where can those be found?
To help you actually report on each of these, it is useful to have a project protocol for you to log each step of your research for your own reference.
Viktoria emphasised that replication in social science is especially important, not least because research results may be used to inform social policy. Replication studies are rare and really need to be more commonplace.
There are different types of replication study:
- using the same data as the original study and using the same methods is a verification study
- using the same data but modifying the method to explore alternative models is a re-analysis study
- using different data, i.e. collecting your own data, but with the same method to test the same hypothesis is a direct replication study
- using different data and different methods but still examining a similar hypothesis is an extension study
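The four types above boil down to two yes/no questions: is the data the same, and is the method the same? A small sketch of that two-by-two scheme follows; the function name is ours, purely for illustration.

```python
def replication_type(same_data: bool, same_method: bool) -> str:
    """Name the kind of replication study from the two yes/no questions."""
    if same_data and same_method:
        return "verification"
    if same_data:
        # Same data, modified method.
        return "re-analysis"
    if same_method:
        # New data, same method, same hypothesis.
        return "direct replication"
    # New data and new methods, examining a similar hypothesis.
    return "extension"


for same_data in (True, False):
    for same_method in (True, False):
        print(same_data, same_method, "->", replication_type(same_data, same_method))
```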
The issue of replication is more nuanced in qualitative research, which by its very nature is less objective. However, original data can still be shared, allowing other researchers to engage with the original data, or data they have collected themselves, to explore a similar research question, perhaps with modified methods.
It is important to share data, not only for replication but also for other researchers doing different analyses, potentially interdisciplinary. It is equally important for qualitative and quantitative data.
One suggestion is to create a ‘replication package’ to publish alongside your paper with data, code and documentation describing them. This might be as supplementary information or via a repository such as the UK Data Service or the University of Leeds’ own data repository run by the Library.
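One practical touch when assembling a replication package is to include a manifest listing each file with a checksum, so that others can confirm they have exactly the files you published. The sketch below is an assumption on our part rather than anything prescribed in the talk or by either repository, and the folder layout and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(package_dir):
    """Map each file's path (relative to the package) to its SHA-256 digest."""
    root = Path(package_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


# Build a tiny, made-up package in a temporary folder to show the idea.
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp)
    (package / "data").mkdir()
    (package / "code").mkdir()
    (package / "data" / "survey.csv").write_text("id,score\n1,4\n2,5\n")
    (package / "code" / "analysis.py").write_text("print('analysis goes here')\n")
    (package / "README.txt").write_text("Describes the data and code.\n")
    manifest = build_manifest(package)

for name, digest in manifest.items():
    print(digest[:12], name)
```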
Viktoria also emphasised the FAIR data principles, whereby data should be Findable, Accessible, Interoperable and Reusable.
N.B. FAIR does not necessarily mean ‘open’ and both the UK Data Service and the Leeds repository offer controlled access to e.g. sensitive or special category data.
An example of a dataset shared by Viktoria and colleagues on the UK Data Service repository: Environmental behaviour data collected through smartphones in a field-experimental setup
There are limitations to data sharing, for example with commercial services like Twitter, where terms and conditions do not permit data collected via its APIs to be directly shared. They do permit sharing of tweet IDs and associated user IDs, which can then be ‘rehydrated’ by collecting the data again based on those IDs (though data may have been deleted).
Confidential data can also be problematic to share, depending on agreements with data subjects for example, or in the case of location data from GPS tracking. There are ways to mitigate these issues while still facilitating replication, however, such as replacing confidential data or including distances derived from GPS data rather than the locations themselves.
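To illustrate sharing distances derived from GPS data rather than the locations themselves, here is a minimal sketch using the haversine formula for the great-circle distance between two points. The coordinates are made up, the Earth radius is the usual mean value of about 6371 km, and whether distances alone are sufficiently non-identifying for a given study is a judgement the sketch does not make.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def trip_distances(track):
    """Distances between consecutive points of a GPS track."""
    return [haversine_km(*a, *b) for a, b in zip(track, track[1:])]


# Hypothetical track: three (latitude, longitude) readings for one participant.
track = [(53.8067, -1.5550), (53.8008, -1.5491), (53.7950, -1.5476)]
distances = trip_distances(track)

# Only the derived distances would be shared, not the coordinates.
print([round(d, 3) for d in distances])
```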
Script & material sharing
To enable others to replicate your studies it is also crucial to share code and scripts, in Python or R for instance, which can be part of a replication package or published alongside a paper. Platforms to facilitate code sharing include GitHub and CRAN (The Comprehensive R Archive Network).
Example of R code shared via CRAN from Viktoria’s own research: https://cran.r-project.org/package=bdynsys
Reproducible workflows mean doing everything with scripts, never manually modifying data or calculations. Version control of scripts, to track changes over time using a platform like GitHub, is also important in both quantitative and qualitative research.
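As a small illustration of the "everything with scripts" idea, the sketch below leaves the raw data untouched and applies every cleaning decision in code, so each step can be rerun and reviewed. The column names and cleaning rules are hypothetical.

```python
import csv
import io

# Raw data as collected; in a real project this would be a read-only file.
RAW_CSV = """respondent,age,score
1,34,4
2,,5
3,29,99
"""


def clean(rows):
    """Apply every cleaning decision in code rather than by hand."""
    cleaned = []
    for row in rows:
        if row["age"] == "":
            continue  # decision 1: drop respondents with missing age
        score = int(row["score"])
        if not 1 <= score <= 5:
            continue  # decision 2: drop scores outside the 1 to 5 scale
        cleaned.append(
            {"respondent": int(row["respondent"]), "age": int(row["age"]), "score": score}
        )
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = clean(rows)
print(f"kept {len(cleaned)} of {len(rows)} rows")
```

Because the rules live in the script, putting that script under version control also records when and why a cleaning decision changed.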
Open Access publication
The final aspect highlighted by Viktoria was the importance of open access publishing, which can be expensive, though there is often the option to use a repository such as White Rose Research Online. See the Library website for more information.
To err, of course, is human and if you discover a mistake, or someone else points one out, it’s important to publish a correction. Viktoria shared a couple of instances where this has happened to her and she published a correction. Fortunately in these cases the mistake was minor, but in the worst case an error might invalidate your results and the paper might need to be retracted. It’s not nice, but it is the responsibility of an ethical researcher and not something to be ashamed of.
We need only consider Coronavirus again: the social policies associated with the pandemic have dramatically impacted us all, which nicely illustrates Viktoria’s point that there is an ethical imperative to ensure that social science can be replicated, because if it is used to inform policy then it can have a real effect on the lives and wellbeing of real people.