Open data in the life sciences: the ‘Selfish Scientist Paradox’

Full and open access is promoted as the international norm for the exchange of scientific data by numerous scientific and political bodies. In the contemporary digital era, since scientists are both consumers and producers of data, they inevitably play a crucial role in defining the level of data accessibility. Yet it is usually individual researchers who resist the release of their data. Through a global online questionnaire survey, the perceptions of 858 life scientists with respect to open data were investigated. Differences in scientists’ perceptions were tested per major country, rank position and academic performance in order to identify partial and global preferences. The ‘Selfish Scientist Paradox’ was identified: although the majority of respondents were in favour of open access to life sciences data, and most acknowledged that data gathered by others is vital to their work, the same group of people were quite reluctant to share their own data; only a third of them were willing to make their data available unconditionally. Scientists with >10 yr professional experience were twice as likely to oppose open access, while almost half of junior researchers would rather not share their data prior to publishing. Senior scientists argued that although project funding in general was a significant incentive towards making their data available, confidentiality agreements in some projects became a main barrier to data sharing. Country of professional location largely affected most responses, revealing that southern Europeans had a ‘conservative’ attitude towards open access, being more unwilling to share their data. Analyses based on academic performance (publications and citations) indicated that established individuals were more dependent on data collected by others and more opposed to open access.


INTRODUCTION
Scientists are driven by curiosity, constantly asking new questions and testing hypotheses. In doing so, they either have to produce their own data from carefully planned laboratory and/or field experiments or to extract data from 'libraries'. Their curiosity is generally supported by funding; e.g. US$1.14 trillion was spent on research and development (R&D) in the countries of the Organisation for Economic Co-operation and Development (OECD) in 2015.
Conventionally, knowledge exchange in science is conducted through journal articles and conference proceedings. However, in the contemporary digital world, access to original data has become essential and is as important as publishing for both individual scientists and their hosting institutes (Borgman 2007).
The current global data supply is huge, reaching 2.8 trillion GB in 2012 (projected to reach 40 trillion GB by 2020), of which only 0.5% is used for analysis (Gantz & Reinsel 2012). There is a strong movement by the stakeholders related to scientific data (i.e. governments, research-funding agencies, universities and non-profit research institutes, professional societies, international scientific organizations, industry research institutions, and the general public; Uhlir & Schroder 2007) promoting open-data policy (e.g. Editorial 2007a,b, Piwowar et al. 2008, Vision 2010, Molloy 2011, Warren 2016; http://blogs.nature.com/scientificdata/). For instance, after reviewing its existing policy on sharing research data, the US National Science Foundation (NSF) now requires that grant awardees make primary data available to others (NSF 2010). Following the same path, one of the pillars of research funding in Europe, the Horizon 2020 (H2020) financial instrument, has pledged to improve and maximize access to and reuse of research data generated by projects. To this end, all H2020-funded projects are currently required to deposit all data, including associated metadata, needed to validate the results presented in scientific publications as soon as possible. Moreover, project partners will have to take measures so that third parties are able to access, mine, exploit, reproduce and disseminate project data, free of charge for any user (https://www.openaire.eu/h2020-oa-datapilot). Also, 2900 journals from >500 publishers have signed up to the Transparency and Openness Promotion (TOP) guidelines for journals, along with 57 organizations (Gewin 2016). TOP advocates that transparency, open sharing, and reproducibility are core values of science, although not always part of daily practice, and has introduced a set of standards to encourage disclosure of data and increase the credibility of inferential research (Nosek et al. 2015).
There is a plethora of benefits arising from the openness of scientific data collected through research-funding instruments: e.g. the ability to address new questions and test new hypotheses, provide multiple perspectives, conduct large spatio-temporal scale analyses, develop new studies and software, maximize data utility and avoid duplication of effort, increase visibility/transparency of scientific output, identify error, discourage scientific fraud, and increase opportunities for education and training (e.g. Piwowar et al. 2007, 2008, Boulton et al. 2011, Molloy 2011, Van den Eynden et al. 2011, Piwowar & Vision 2013, Warren 2016). Unquestionably, all these benefits accelerate scientific progress and eventually lead to better science for the common good. Although scientists are considered the weak link in the open-data debate (Gewin 2016), the available studies (Table 1) that have surveyed the opinions of scientists on open data suggest that in principle (as opposed to in practice), they are in favour. Herein, we present the results of a global online questionnaire survey, which took place between January and August 2017, concerning the perception of life scientists with respect to open data. We also tested for differences in scientists' perceptions per major country, rank position and citation index.

Participants
This study involved an online questionnaire survey (http://artemis2.ath.hcmr.gr/HcmrPolls/polls/questions/), with the anonymity of respondents protected; no identifying questions were asked, and findings herein are reported in aggregate. The survey was open for responses from January 15 to August 21, 2017.
At the first stage, researchers were approached through a 'snowball' sampling method (Goodman 1961). An email cover letter was sent to a small group of selected members of the scientific community, notably colleagues with whom we have collaborated. This group was requested to circulate the link to colleagues in their department (faculty, lecturers, postdoctoral associates). At the second stage, the survey encompassed a wider spectrum of researchers, and the electronic survey targeted the mailing lists of faculty members of specific departments (biogeography, biophysics, botany, conservation biology, ecology, evolutionary biology, fisheries, physiology, systematics, zoology) from universities, research centres and institutions.

Research implementation
The survey implementation proceeded by partitioning the questionnaire into 3 sections: demographics, academic performance and relationship with data. The full version of the questionnaire is provided in the online Supplement at www.int-res.com/articles/suppl/e018p027_supp.pdf.
Scientists were asked to provide their country of employment, level of professional experience and rank in their organization. In order to evaluate differences in perception among established and emerging researchers, respondents were asked to provide a measure of academic performance, notably numbers of published papers and citations.
Table 1. Source; respondents; n; main finding
[...]; [...]; [...]; Clear commitment to share biodiversity data, but also a reluctance to actually do so, due to a mixture of social and technical impediments, such as loss of control over data and lack of professional reward for sharing
European Commission (2012); National governments, regional and local governments, research funding organizations, university/research institutes, libraries, publishers, international organizations, individual researchers, citizens, nongovernmental organizations (NGOs), industries, charities, learned societies, scientific and professional associations; 1140; Strong support (90% of responses) for research data that is publicly available and results from public funding to be, as a matter of principle, available for re-use and free of charge on the internet
Kim (2013); Engineering, physical sciences, earth, atmo[...]; 1317; Significant between-discipline variances as well as within-discipline variances in scientists' data[...]

To evaluate respondents' attitudes to data openness, we asked them to what degree they agreed or disagreed with a series of statements. Each question or statement explored a different aspect: collection and use of research data; storage and maintenance of research data; views on data sharing and fair exchange of data; and responsibility for their data.
Questions included both Likert scales (Likert 1932), based on a point system to assign levels of agreement or disagreement to a respondent's answer, as well as closed-ended questions expecting a 'yes' or 'no'. Likert-scaled questions provide quantitative information, allowing for data to be analysed with relative ease (see 'Likert scale' in the Supplement). In some cases, whenever the choices provided did not accurately represent the respondent's condition, the respondent was asked to describe the condition in a separate category ('other').
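The point system behind a Likert-scaled question can be illustrated with a minimal Python sketch. The 5-point labels, the function names and the example answers below are assumptions made for illustration only; the actual scale wording is given in the Supplement and may differ.

```python
from statistics import mean

# Hypothetical 5-point agreement scale; the questionnaire's real labels
# are listed in the Supplement and may not match these.
LIKERT_SCORES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def score_responses(responses):
    """Map Likert labels to integer scores, skipping unrecognised answers."""
    return [
        LIKERT_SCORES[r.lower()]
        for r in responses
        if r.lower() in LIKERT_SCORES
    ]


def summarise(responses):
    """Return the count, mean score and share of agreeing answers (score >= 4)."""
    scores = score_responses(responses)
    if not scores:
        raise ValueError("no valid Likert responses")
    agree_share = sum(s >= 4 for s in scores) / len(scores)
    return {"n": len(scores), "mean": mean(scores), "agree_share": agree_share}


# Made-up answers to a single statement, for illustration only.
example = ["Agree", "Strongly agree", "Neutral", "Disagree", "Agree"]
summary = summarise(example)
print(summary)
```

Free-text answers in the 'other' category would fall outside such a mapping and would need to be coded separately.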
The present study focused on a subset of the questionnaire, namely questions Q1 to Q6 and Q11 to Q15.

Demographics of respondents
Between January 15 and August 21, 2017, we sent about 7500 emails, out of which a total of 858 researchers responded to the questionnaire (11%). Responses were almost equally distributed across academic ranks (Q1; Fig. 1a), with a slight preponderance of senior scientists (39%). Almost two-thirds of the replies came from scientists with >15 yr of professional experience (Q2; Fig. 1b).
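The response rate quoted above follows directly from the two counts. A small Python sketch of the arithmetic is given here; the function name is ours, and because the number of emails is reported only approximately ('about 7500'), the result should be read as approximate as well.

```python
def response_rate(responses: int, invitations: int) -> float:
    """Percentage of invitations that produced a completed questionnaire."""
    if invitations <= 0:
        raise ValueError("invitations must be positive")
    return 100.0 * responses / invitations


# Counts reported in the text: 858 responses out of about 7500 emails.
rate = response_rate(858, 7500)
print(f"{rate:.1f}%")  # 11.4%
```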

Global perceptions on data openness
The majority of the respondents (~80%) were in favour of open access to life sciences data (Q6; Fig. 4a). Most researchers (64%) also acknowledged that data gathered by others is vital to their work (Q12; Fig. 4b), with >80% of them asserting that during the past 5 yr, they had utilized data collected by other researchers (Q13; Fig. 4c). The source of these external data varied from local to international institutional databases (Q14; Fig. 5a). For those declaring other sources (35 responses), the majority (91%) identified publicly accessible repositories as their source of data (online databases: 37%, published manuscripts: 34%, official statistics: 20%).
Although the desire for accessing other scientists' data was evident, the same group of people were quite reluctant to share their own data (Q15). Less than one-third of respondents were willing to make their data available unconditionally, and the remaining two-thirds were willing to provide their data only under certain conditions (Q15; Fig. 5b). A subset of respondents (n = 106) provided comments on those conditions: e.g. availability only after publishing, gaining collaboration, co-authorship in a high-ranking journal, mutual exchange of data, and expertise swapping (Table 2).
The vast majority of respondents considered the rebuilding of data in case of catastrophic loss extremely difficult, signifying the issue of data storage, handling and access (Q11; Fig. 6).

Specific perceptions on data openness
The level of academic ranking (Q1) did not bias perceptions on data openness, working with data of others or reproducibility of research data (Fig. S1 in the Supplement). However, scientists with professional experience of >10 yr (Q2) were twice as likely to oppose open access, compared to their younger counterparts (with <10 yr of experience) (Fig. S2). Perceptions of one's own data sharing (Q15) were more or less comparable among all academic ranks (Table S1). Almost half of the junior researchers (46%) would rather not share their data prior to exploiting them to the advantage of their academic career (e.g. a publication or a conference presentation). If they did share, they would prefer to do so with prominent researchers or to publish in high-ranking journals. Senior scientists argued that gaining project funding is also a significant incentive towards making their data available.
[...] Table S2 in the Supplement). African scientists (Cameroon and South Africa) were predominantly (75%) of no opinion (Fig. S3). Scientists from the UK, Australia, the Netherlands and the USA considered it less essential to their work to use external data, whereas German, Danish, Norwegian and Canadian scientists were more likely to use other researchers' data (Fig. S3). During the past 5 yr (Q14-15), scientists from northern Europe (Denmark, Norway, Sweden) and Canada were more dependent on data collected by others, with Italians, Greeks and Germans being less prone to use external data sources (Fig. S3). Canadians, Italians and the British were the respondents most concerned about their data reproducibility, whereas scientists from the Netherlands, Sweden, Norway, Germany and the USA were the least concerned (Q11; Fig. S3).
Investigating the country effect further, we identified an obvious association between country-specific percentage of gross domestic product (GDP) spending on research (OECD 2017) and data openness. For Greece, Italy and Spain, R&D expenditure fluctuated around 1% of GDP, while this figure was >3% in Finland, Sweden and Denmark. The perception towards data openness was inversely related to R&D expenditure (Fig. 7).
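A statement such as 'twice as likely to oppose open access' can be read as a ratio of proportions between two groups. The Python sketch below shows that reading under stated assumptions: the counts are invented purely for illustration and are not survey results, the function name is ours, and the text does not specify whether the comparison was expressed as a ratio of proportions or as an odds ratio.

```python
def relative_risk(events_a: int, total_a: int, events_b: int, total_b: int) -> float:
    """Ratio of the proportion with the outcome in group A to that in group B."""
    if min(total_a, total_b) <= 0:
        raise ValueError("group sizes must be positive")
    if events_b == 0:
        raise ValueError("reference group has no events; ratio is undefined")
    return (events_a / total_a) / (events_b / total_b)


# Hypothetical counts, NOT taken from the survey:
# 40 of 400 scientists with >10 yr experience oppose open access (10%),
# 10 of 200 scientists with <10 yr experience oppose it (5%).
rr = relative_risk(40, 400, 10, 200)
print(f"relative risk = {rr:.1f}")
```

With these made-up counts the ratio is 2, i.e. the first group is twice as likely as the second to hold the opposing view.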
Analyses based on academic performance (publications and citations, Q4 and Q5) indicated that established individuals have made more frequent use of data collected by others (although this difference is rather marginal); this group of researchers was also more opposed to open access (Fig. S4 in the Supplement). Once again, junior-intermediate scientists, with a publishing record of <40 papers, were not willing to share their data before making the most out of them (publications), while researchers with 50+ published manuscripts were of the opinion that project funding and collaboration under confidentiality agreements are desirable conditions prior to sharing their data (Table S3).

DISCUSSION
The debate about open access to research data is by no means new; it is an argument that is being increasingly raised. Open access is promoted as the international norm for the exchange of scientific data by numerous scientific and political bodies: the OECD (OECD 2007), European Commission (EC 2011), United Nations Educational, Scientific and Cultural Organization (UNESCO 2012), G8 science ministers (G8 Science Ministers 2013), US federal government (OMB 2013), and the International Council for Science (ICSU-WDS 2015). It has intensified in recent years due to the ever-increasing amount of data and the emergent possibilities offered

Condition                                       Percentage of respondents (n)
After publishing                                39% (42)
Only with other experts in my field             6% (7)
In case of offered collaboration                5% (5)
Only with other researchers                     5% (5)
Only if to publish in a high-ranking journal    3% (3)
With engaged stakeholders                       2% (2)
If proper credits are given                     2% (2)
In case of mutual exchange of data              2% (2)
If co-authorship is offered                     2% (2)
A mix of all the above                          36% (38)

In the contemporary scientific arena, data sharing through the emergence of e-science (Hey & Hey 2006) has revolutionized the way science is conducted. Technological developments, data repositories and collaborative efforts have brought to fruition a long-standing aspiration of researchers: access to data. However, everything comes at a cost, and in the academic realm, data has become an 'information currency' (Davis & Vickery 2007), shaping various institutional or individual data-sharing behaviours.
The present study strengthens the general belief that researchers are (in principle) in favour of open data, with most of them acknowledging that data gathered by others is vital to their work. This concept of knowledge sharing among scientists ('scholarly altruism') has been documented by numerous other studies (Constant et al. 1996, Davenport & Prusak 1998, Kankanhalli et al. 2005, Lin 2008, He & Wei 2009, Hung et al. 2011). However, career level (academic ranking and professional experience) somewhat distorted this general pattern, with established researchers being more likely to oppose open access. Piwowar & Chapman (2010), investigating the biomedical sciences, arrived at the opposite conclusion by using bibliographic indices as a proxy for 'experience'. Perhaps the different culture of the biomedical discipline, compared to other life sciences, has shaped this contradictory perception. Kim (2013) also cites distinct data-sharing behaviours in diverse scientific disciplines; e.g. that in the field of life sciences, geneticists were more likely to deny others' requests than non-geneticists. Nevertheless, it must be acknowledged that the problem is far from straightforward. Each scientific discipline generates different types of data, and hence each discipline is expected to have different data-sharing requirements and expectations, which in turn shape distinctive scientific 'cultures'.
In this study we also identified differences in scientists' perceptions related to country/region of professional location. A notable difference was apparent between southern and northern Europe, probably related to the divergence in research funding and opportunities. Research is obviously not a priority in southern Europe, and it seems that the difficulties associated with fundraising for conducting research in this region probably manifest in a more 'conservative' view towards data openness. Piwowar & Chapman (2010) also showed that researchers with a corresponding address in the USA were twice as likely to embrace open data.
The most noticeable outcome of the present study was that although in principle, scientists were well disposed to open data, acknowledged that other scientists' data are essential and had been utilizing them, they were quite hesitant to share their own data. The withholding rate reported in this study is high compared with other studies so far (Kim 2013 and references therein), where rates ranged from 3% to 75%, the majority being below 35%. Focusing specifically on the life sciences, these numbers varied from 8% to 32% (Kim 2013 and references therein). This is in fact the 'Selfish Scientist Paradox': scientists want to use other scientists' data but protect their own data.
Although many reasons were brought up to justify data withholding, the main reason was exploiting data to publish. Kim (2013) lists numerous reasons for data withholding: funding agency's policy, contract with industry sponsors, data sensitivity, privacy, losing publication opportunities, facing potential criticism and lack of data repositories. It seems that the current institutional setting and the academic reward system force scientists to seek opportunities for more publications (and citations) as a way to build a reputation (Gewin 2016). This desire to make the most out of their data and the absolute necessity to publish has affected the viewpoints of even junior researchers in the present study. One would expect that at least emerging scientists would function under the norm of communalism (Merton 1968, Braxton 1986), rather than embracing particularism (Mitroff 1974). Yet, it seems that in the modern scientific arena, the need to capitalize on the data for future career benefits prevails over scholarly altruism.
In spite of the undeniable progress and change of mindset realized throughout human history, as this study also suggests, scholarly altruism is still not the norm, and numerous barriers are blocking the free exchange of scientific information: disciplinary traditions, institutional barriers, lack of technological infrastructure, intellectual property concerns, and individual perceptions (Kim & Stanton 2012). Although the researchers surveyed in the present study indicated confidentiality agreements or the desire to exploit data (by publishing) as their main reasons for not sharing data, other studies have concluded that scientific discipline, institutional environment and individual motivations are crucial factors influencing data-sharing behaviours (e.g. policies developed by funding agencies and journals; university tenure and promotion systems; Borgman 2010).