A causal model to explain data reuse in science: a study in health disciplines
Investments in data infrastructures, data management, data repositories, and Open Data sharing policies and recommendations are viewed as increasingly important for scientific knowledge production. One of the underlying assumptions justifying these investments is that the more available Open Data becomes, the greater the possibilities for creating new knowledge that can advance both science and human wellbeing. Yet efforts and investments in Open Data and other forms of data sharing only have value if data are actually reused. Recent scholarly efforts have identified some of the challenges and facilitators related to the reuse of data, in order to inform current and future policies and investments. However, we still do not know why and how some researchers succeed in reusing data despite the challenges they face, while others abandon the reuse process when facing such challenges. This dissertation aims to fill this gap by focusing on a causal explanation of the data reuse process, which it understands as being nested in broader patterns of researchers’ motivations, scientific goals and decision-making strategies.
The dissertation comprises three main elements. First, it proposes a heuristic model of the scientific actor, the bounded individual horizon (BIH) model. On the one hand, the model holds that researchers’ work and careers are structured by their motivation to produce scientific contributions and by reward systems that prioritize certain types of contributions. On the other hand, researchers’ struggles to achieve their objective of creating new findings that accrue recognition and rewards occur within a frame of limited information and resources, conditioned by multiple institutional, social, and other factors. Second, the study proposes a mechanistic causal theoretical explanation of the data reuse process and its effects (outcomes). The data-reuse mechanism, as it is called, enables us to understand how the satisficing behavior that characterizes scientific decision-making applies to the specific conditions and processes of data reuse. Third, ten empirical case studies of data reuse in health research were conducted and are reported in the dissertation. These cases are analyzed and interpreted using the complementary theoretical lenses of the bounded individual horizon and the data-reuse mechanism approaches.
The main findings indicate an apparent association between the extent and types of efforts required to reuse data, researchers’ contextualized motivations, and broader goal-setting and decision-making frames. Access to data is a necessary condition for the reuse of data, yet it is not sufficient for reuse to happen. Characteristics of available data, including the context of their production, the extent of their preparation and stewarding, and their potential value in relation to researchers’ motivations to make new scientific claims or generate background knowledge, are found to be essential elements for understanding why some data reuse processes persist and succeed, while others do not. The thesis concludes that efforts and investments designed to reap the benefits of data reuse should be expanded to include training researchers in data reuse, including how to recognize opportunities efficiently, navigate the challenges of the reuse process, and be aware of and acknowledge the limitations of using secondary data. Without such investments, the promises and expectations linked to emerging data infrastructures, data repositories, data management guidelines and open science practices are argued to be far less likely to reach their full potential.