Think about the “traditional” social science research project, where every participant is recruited individually and required to read and sign a detailed information sheet confirming that they understand how their data will be used by the study and how they can withdraw consent at any time. The project has been through ethical review and the panel has signed off on how, and under what licence, the anonymised data can be archived and shared from a repository.

What is the typical sample size for such a project? Fifty? A hundred? A thousand?

Twitter claim over 300 million monthly active users, a figure in turn dwarfed by Facebook with well over 2 BILLION active users, even as the dust settles after the Cambridge Analytica scandal.

These are but two of the platforms, albeit the most significant, that have heralded what Kandy Woodfield and Ron Iphofen refer to in their introduction to The Ethics of Online Research as “an explosion of Internet-based social science research”.


Volume 2 in Emerald’s Advances in Research Ethics & Integrity series pulls together 10 chapters from diverse authors across UK Higher Education and beyond to investigate to what extent “‘conventional’ research methods and ethical codes/guidelines apply to online research” and whether “new methods and new codes need to be sought”.

These questions are very much a live issue for us in the Research Data team as we receive an increasing number of enquiries about data sourced from social media.

In the age of Google, anonymisation is hardly an option for this type of material where inputting quoted text into a search engine will lead straight back to the original post, and it is impractical, to say the least, to secure informed consent to collect, let alone share, aggregated datasets from Twitter or Facebook.

The legal and ethical implications also need to be considered afresh in the wake of Cambridge Analytica and GDPR legislation (both of which post-date this book).

The Ethical Disruptions of Social Media Data: Tales from the field

In her opening chapter, Susan Halford, Professor of Sociology and Director of the Web Science Institute at the University of Southampton, identifies the “bureaucratization of the UK research ethics” that has coincided with changes in the nature of the data available to social researchers, mediated first by the Internet and the Web, and latterly by Social Media.

This “perfect storm” has resulted in a situation where institutionalized ethics are ill-equipped to provide adequate or pragmatic guidance, a fact which Halford illustrates with damning quotes directly from her PhD students:

"institutional processes for establishing formal not allow for the incorporation of flexibility that is necessary when studying group activities online." 

"A paternalistic 'no colouring outside the lines' approach lets the legal department sleep at night but may not aid research" 

"I promise to do things I know I am technically unable to do"

Halford outlines five specific ethical “disruptions” that prefigure discussion throughout the rest of the volume:

  • The Data are Already Created

Ethics currently assumes the generation of new data which can be managed ethically, whereas social media data are already produced. Material can be of a personal or sensitive nature and users’ understanding of how their publicly accessible posts can be collected as ‘data’ is “uneven”.

  • These Data are Beyond Our Control

Personal data that has been sourced by ‘traditional’ methods can be anonymised before being shared for secondary analysis, or not shared at all if deemed too sensitive. Social media data, by contrast, is already ‘out there’ and cannot necessarily be kept secure behind a university firewall. Other users have access to the same data and can compare and cross reference incomplete or related datasets at “scale and speed”.

  • These Data are Not Finite

Social media data continue to circulate online and can change in context. They present an unbounded dataset for which existing ethical considerations, such as the ‘right to withdraw’ or tracking the provenance of data over time, are inadequate.

  • There are Some Implicit Assumptions about Scale and Granularity

Ethics panels are used to relatively small sample sizes: fewer than 100 for qualitative research, perhaps the low thousands for quantitative research. At this scale a researcher can have a relationship with each participant, however attenuated, but the huge numbers of ‘participants’ enabled by social media research reduce the possibility of a direct relationship with the researcher. Even the ethical protection of the ‘individual’ may no longer be sufficient when it becomes possible, at scale, to “explore and delineate social groups”.

  • These Data are attracting Interest in Social Research from across the Entire Research Field

Social scientists are joined in the analyses of these types of data by researchers from the mathematical and computational sciences, inevitably encountering new ethical regimes along the way that might not regard this ‘published data’ as requiring the same degree of ethical regulation.

Situational Ethics

Also referred to as ‘agile ethics’ or ‘open source’ ethics, situational ethics does not take an absolutist perspective but takes into account the specific context, emphasising the “dialogical and relational process of ethical responsibility”. Exactly what this means in practice is a moot point and one that Halford concludes requires urgent attention; we need to assemble examples of specific challenges and of the practices being developed to deal with them, while also promoting interdisciplinary dialogue. This is crucial, she suggests, if the research ethics review process is to remain credible, consistent, fair and appropriate for interdisciplinary research.

Much of the book considers ethics from a largely theoretical perspective without necessarily providing practical advice, which isn’t a criticism and merely reflects the status quo.

In Chapter 2: Users’ Views of Ethics in Social Media Research: Informed Consent, Anonymity and Harm, for example, Matthew L. Williams et al. consider users’ views of ethics, arguing that rather than relying on the legal permissions that are often encoded in social media platforms’ Terms and Conditions, researchers must adopt a “nuanced and reflexive ethical approach that puts user expectations, safety, and privacy rights centre stage.” In Chapter 3, meanwhile, Sarah Quinton and Nina Reynolds highlight the ethical challenges associated with the changing roles of researchers and participants in social media research.

Chapter 5 finds Janet Salmons also exploring the issue of informed consent. Based on a literature review and a study of current practices she offers advice for researchers including the importance of building trust and (scholarly) credibility and best practice for online user agreements. She also provides several ‘exemplars’ which serve as a useful reference point for a social scientist looking to design their own study in an online space.

Tinder is the subject of Chapter 6, specifically the challenges associated with “location-aware social discovery apps”. Jenna Condie and colleagues take us through the early stages of their research project, focussing once again on issues of informed consent, privacy and copyright. Like Halford, they also discuss how their research “sits uneasily within standardized ethical protocols”.


Image adapted from Flickr [James Cridland]
Leanne Townsend and Claire Wallace go beyond the theoretical in Chapter 8 to present a new ethics framework for researchers working with social media data developed as part of the ESRC funded project Social Media – Developing Understanding, Infrastructure & Engagement (Social Media Enhancement) and reproduced here with permission (the referenced sections are in the document Social Media Research: A Guide to Ethics):

Image taken from: Townsend, L., Wallace, C. (2016) Social Media Research: A Guide to Ethics, University of Aberdeen

Using Twitter as a Data Source: An Overview of Ethical, Legal, And Methodological Challenges

Hopping back through the volume, it is worth considering Chapter 4 in a bit more detail for some practical insight to specific challenges and developing practice. This chapter is also freely available from the White Rose Research Repository.

Wasim Ahmed and colleagues from the Information School at the University of Sheffield have a special interest in how people use Twitter to communicate in extreme circumstances. In this chapter they consider the specific legal, ethical and privacy-related issues potentially associated with their own and others’ research.

In addition to their own research around the use of Twitter during infectious disease outbreaks, they cite examples analysing tweets relating to crisis events including riots and natural disasters. In one example, a tool utilising text-mining technology has demonstrated how relevant Twitter messages can be identified and used to inform situation awareness of an emergency incident as it unfolds.

In contrast to other social media platforms, Twitter data is relatively open and can contain metadata, including geospatial information, which, while invaluable to emergency services, can also potentially place (individual, identifiable, locatable) users at increased risk. Not all Twitter users necessarily consider the public nature of their posts, especially during a crisis event, or the extent to which they can be collected and analysed, either to assist emergency services or for research. This echoes the “uneven” understanding of users identified by Susan Halford in her opening chapter.

Turning to legal concerns, the authors examine Twitter’s Terms of Service, which unambiguously state that users agree that Twitter may make their content “available to other companies, organizations or individuals for…syndication, broadcast, distribution, promotion or publication”, and which are often used by researchers as a justification for using data from the platform without informed consent.

As an additional ethical caveat, the authors emphasise the argument made by some researchers that if a Tweet contains a hashtag (e.g. #EbolaOutbreakAlert) the user has intended the message to be visible to a wider audience.

The chapter outlines several case studies and software for harvesting and analysing tweets (e.g. NodeXL), but it is of most interest to us for its account of the process of obtaining research ethics approval for a PhD project using Twitter as a primary data source around infectious disease outbreaks, and for its review of some of the issues that have arisen throughout the project.

Tellingly, and bearing out Susan Halford’s reflections on university ethics processes, one of the first questions raised was whether ethical approval was required at all, given that Twitter data is already publicly available. It is nevertheless generated by human ‘participants’ who may be identifiable, which is one of the overriding conditions for ethical review at Leeds, for example. While users involved in an online “conversation” may forget specific tweets, let alone expect them to be swept up during academic data collection, the internet, of course, never forgets, and analyses may well highlight trends in the data that will identify groups or individuals.

When a dataset may comprise more than 100,000 tweets, the issue of gaining informed consent is somewhat moot, and the project advocated a light-touch, retrospective consent process for quoting specific tweets: sending a tweet to the user with details of the study and requesting permission. In the event, the project team decided not to seek informed consent, opting instead to omit user handles from reporting and to avoid quoting verbatim, taking care to reword the content of each tweet.

To share or not to share?

The question of actually sharing the full dataset does not appear to have been considered by Ahmed et al. Given the lack of informed consent it was presumably regarded as ethically intractable.

Twitter’s Terms & Conditions state that user data may be redistributed or used for other purposes. They do not advocate sharing tweets as full corpora however, rather as Tweet or User ‘IDs’ which the end user of the content can “rehydrate” (i.e. request the full Tweet, user, or Direct Message content) using the Twitter APIs (Twitter, n.d.).
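To make the idea of sharing IDs rather than full corpora concrete, here is a minimal Python sketch of "dehydrating" a collected dataset before deposit. It is an illustration only, not a method described in the book: the `dehydrate` helper, the `id` field name and the sample records are all hypothetical, and the rehydration step (which would go through the Twitter APIs) is deliberately left out.

```python
def dehydrate(tweets, id_field="id"):
    """Reduce collected tweets to a de-duplicated list of tweet IDs.

    Assumes each tweet is a dict with an ID under `id_field`
    (a hypothetical layout). Text, user handles and all other
    metadata are dropped; only the IDs are kept for sharing.
    """
    ids = (str(tweet[id_field]) for tweet in tweets)
    # dict.fromkeys removes duplicates while preserving collection order
    return list(dict.fromkeys(ids))


# Made-up placeholder records, not real tweets.
collected = [
    {"id": 1001, "user": "example_user_a", "text": "placeholder text"},
    {"id": 1002, "user": "example_user_b", "text": "more placeholder text"},
    {"id": 1001, "user": "example_user_a", "text": "placeholder text"},
]

shareable_ids = dehydrate(collected)
print("\n".join(shareable_ids))
```

The deposited file then contains nothing but identifiers, so anyone reusing it has to request the current content from the platform themselves.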

Legally and technically supported it may be; whether it is ethical is debatable.

In Chapter 7, Ethical Challenges of Publishing and Sharing Social Media Research Data, Libby Bishop and Daniel Gray emphasise the value of open data and how, in the context of social media data, the “legitimate objective” of open science comes into conflict with the issue of privacy:

"Both protecting data subjects' privacy and opening data are defensible values."

The ethics rabbit hole goes even deeper if you consider that data aggregation increasingly represents ‘power’ and that conditions of data access have implications for who wields that power. Consider again the Cambridge Analytica scandal, where Facebook data was allegedly manipulated to influence the Brexit vote.

In one of several case studies, the authors emphasise how Twitter’s Terms & Conditions require third parties to publish tweet content in full, with IDs and unchanged in any way, which obviously precludes anonymisation or pseudonymisation. They go on to describe a case presented by Daniel Gray, who successfully used a protocol developed by the Collaborative Online Social Media Observatory (COSMOS) for his Master’s dissertation on misogynist speech on Twitter.

COSMOS Risk Assessment:

Low risk – Tweet is from official/institutional account: Publish without seeking consent in most cases.
Medium risk – Tweets are from individual users and contain mundane information of a non-sensitive nature: Must contact the user (direct message/@mention/email) informing them of the intent to publish; unless the user opts out consider as permission to publish.
High risk – Tweets are from individual users and contain sensitive information (overly personal, abusive etc.): Must contact the user (direct message/@mention/email) and ask their permission to publish. Only publish if consent is received.
High risk – Tweet has been deleted, precluding publication under Twitter Developer Agreement/Policy.
High risk – Tweet is from a deleted account, meaning it has been deleted, precluding publication under Twitter Developer Agreement/Policy.
Williams, Matthew L, Burnap, Peter and Sloan, Luke 2017. Towards an ethical framework for publishing Twitter data in social research: taking into account users’ views, online context and algorithmic estimation. Sociology 51 (6), pp. 1149-1168. 10.1177/0038038517708140 (CC-BY)
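
The risk levels above amount to a small decision procedure, which can be sketched in Python as follows. This is an illustration under stated assumptions rather than an official COSMOS implementation: the `TweetRecord` fields, the `Action` names and the order in which the checks are applied (deletion first, then account type, then sensitivity) are all hypothetical, and whether a tweet counts as "sensitive" remains a human judgement that the code simply takes as an input.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PUBLISH = "publish without seeking consent in most cases"
    OPT_OUT = "inform the user; publish unless they opt out"
    OPT_IN = "ask permission; publish only if consent is received"
    DO_NOT_PUBLISH = "do not publish"


@dataclass
class TweetRecord:
    # All fields are hypothetical flags set by the researcher.
    institutional_account: bool
    sensitive: bool
    tweet_deleted: bool = False
    account_deleted: bool = False


def assess(tweet: TweetRecord) -> Action:
    """Map a tweet onto the publication action for its risk level."""
    # Deleted tweets and tweets from deleted accounts cannot be published.
    if tweet.tweet_deleted or tweet.account_deleted:
        return Action.DO_NOT_PUBLISH
    # Low risk: official or institutional account.
    if tweet.institutional_account:
        return Action.PUBLISH
    # High risk: individual user, sensitive content (opt-in consent).
    if tweet.sensitive:
        return Action.OPT_IN
    # Medium risk: individual user, mundane content (opt-out consent).
    return Action.OPT_OUT


example = TweetRecord(institutional_account=False, sensitive=True)
print(assess(example).value)
```

Even in this simplified form the sketch makes one point clear: for individual users' tweets, every path to publication runs through contact with the user.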

Bishop and Gray conclude that, while the ethical challenges of gathering and, in particular, sharing social media data are considerable, there is in fact much to learn from existing ethical frameworks, and there are already good resources available (see above). There is still a need for training around big data, research ethics and integrity, which should be “practical, case-based and interactive”. The focus should not, however, be solely on the individual researcher and their undoubted responsibilities; institutions must be good and proper stewards of the data under their control, a point emphasised by GDPR legislation, which will serve to enforce greater consistency across Europe.