CHORUS Forum: Are We Ready for Data Citation Metrics Summary Now Available

Over 250 publishers, researchers, librarians, and research funders registered for the CHORUS Forum: Are We Ready for Data Citation Metrics held on 28 October 2021. This forum was co-hosted with the American Geophysical Union.

The full recording of the forum session and the individual presentations are available on the event page.

Shelley Stall, Senior Director for Data Leadership at AGU, kicked off a lively discussion by outlining the challenges around sharing and citing data and software, an area that many in the industry, including AGU, have been working on for over a decade. The forum’s topic is a testament to the significant work done thus far in establishing the value of data and software as important contributions to the scientific record that can be reused to further our understanding of the world and provide transparency to our methods.

The scholarly ecosystem concerning data and software research products is actively working towards alignment, but we are up against a deep-seated reward structure focused on peer-reviewed publications, explained Stall. Funds are limited for curators to help researchers make their data as interoperable as possible. With these challenging gaps, questions remain:

How do we look at a dataset or software that has been shared to determine its contribution to the scholarly enterprise? What would that look like as part of promotion and tenure processes?
How do we evaluate a well-prepared dataset that is easily understood and interoperable, using a measure of quality rather than the measure of quantity that a citation count provides?

Speakers from varying backgrounds helped us understand where we are now and possible next steps to consider.

Session 1: How do we value shared research data?

Dr. Sarah Nusser, Professor Emerita of Statistics at Iowa State University and Research Professor at the Biocomplexity Institute at the University of Virginia, led the first session, in which four speakers gave their perspectives on the current data sharing ecosystem, its open issues, and potential fixes for ensuring that citation and credit are given to researchers who prepare and share publicly accessible research data.

Rick Gilmore, Professor of Psychology, Pennsylvania State University, provided the researcher perspective. Rick is a co-founder of Databrary and an avid promoter of open science tools and practices, transparency, and reproducibility. Rick emphasized that data sharing is valued less than most other products of scholarship and that it is much more work than it should be. If we want to expand data sharing, planning must begin early and we need to give it real value by celebrating exemplary practices and behaviors. In some ways, we already value research data, and we should: sharing accelerates discovery and makes scientific advances possible. To understand how we can increase the value of shared research data, we should consider who values it and note where the outcomes are mixed. Funders value data, but their requirements are weak; some journals, but not all, have introduced policies that require data sharing; institutions, being the formal legal recipients of grant funding, should value research data but have many new roles to play in supporting researchers; and researchers value shared data in some but not all fields. What can be done to give shared research data real value? Funders, institutions, and journals need to provide money, supporting resources, and recognition to make an impact. Funds should be directed particularly to researchers and research data services involved in data curation. Societies should provide guidance about where, how, and why to share data, because researchers need help in making decisions in the context of their fields. We need to make data sharing something that is exciting.

Dan Reed, Senior VP of Academic Affairs, University of Utah, gave his thoughts from the university administration perspective as a provost and former vice president for research. Dan’s academic background is in computer science and engineering, and he currently works in the area of cloud and edge computing. He has been active in discussions within the Association of American Universities (AAU) and the Association of Public and Land-grant Universities (APLU) to advance open science and data sharing across the academic system. Dan echoed the concern about the lack of rewards and recognition for data. He noted that the culture of shared data is shifting the power structure away from those who have unique data or unique infrastructure and toward those who can ask better questions, and that no one should underestimate the power of sociology in affecting how we think about reward metrics. This shift is difficult for some, as it challenges their world view. Connecting back to the earlier remarks about domain-specific and cross-domain recognition, another sociological issue we need to face is what happens when the value of data ceases to accrue: preserving that data and continuing access to it will be critical. There are differences across units and cultures in determining metrics; a metric is more than a simple access count and should take into consideration roles and responsibilities (e.g., individual investigator vs. team discovery).

Amy Nurnberger, Program Head, Data Management Services, MIT Libraries, offered input as someone who provides data services to researchers. Amy has been at the forefront of visioning for the 21st-century academic library as well as data sharing. Amy asked, “Are Data Citation Metrics ready for us?” Librarians have been willing and wanting to cite data for a while. An NRC reference dating back to 1985 states, “Recommendations 12. Journals should require full credit and appropriate citations to original data collections…”, and there are likely earlier references in this area. Where are we now? Publishers are asking for data availability statements, which are necessary but insufficient, and for appropriate citation of publicly available datasets, which gets closer to the issue but doesn’t quite get us what we need. Fortunately, what we need is what FORCE11 has developed: the Joint Declaration of Data Citation Principles. Cite all the data, regardless of its availability status. However, we need to be aware of, address, and account for the issues of misuse, misrepresentation, misinterpretation, bias, and inequity in the systems and infrastructures we develop. We must do better, and we can do so by clearly laying out shared understandings of the role and meaning of data citation metrics and by providing incentives, rewards, and recognition that do not replicate those that already exist.

Madison Langseth, Science Data Manager, US Geological Survey (USGS), is interested in the ROI for data that has been funded by the USGS. She develops tools and workflows to ease the burden for researchers and data managers, and has been working on ways to identify and evaluate the reuse of USGS data. Madison is assessing the impact of these data on research by reporting data access counts within the USGS repository ScienceBase, and by tracking data citations identified in publications and storing them as related identifiers in the USGS Science Data Catalog. These access and citation counts show USGS researchers and leadership which data are being reused, as well as how often and where. The next steps for USGS are to align data access counts with the COUNTER Code of Practice for Research Data, explore Scholix to track formal citations, and continue to educate USGS researchers and publishing staff on data citation principles.

Session 2: A path forward: Scaling-up infrastructure to support increased community adoption of data citation assessment

During the second session of the forum, Matthew Buys, Executive Director of DataCite, led a discussion about current underlying infrastructure efforts and the need for increased adoption of practices and policies that support both the quality and quantity aspects of data and software citation assessment.

Howard Ratner, Executive Director, CHORUS, stated that CHORUS is dedicated to making open research work and has been doing so since 2013. The goal is to make sure the main stakeholders are all scaling their OA compliance. We are working to develop metrics about open data, and we help stakeholders improve the quality of their metadata related to open research. Most recently, CHORUS has been hosting forums and webinars to raise awareness. We connect datasets to published content and link articles to agency portals and, where available, grants. CHORUS provides dashboards, reports, and APIs for our stakeholders. We make connections with these stakeholders in the hope that they can learn together and build trust. We help create and participate in open science pilots. CHORUS was named in a two-year 2020 NSF grant awarded to the American Geophysical Union to implement FAIR data practices across the Earth, Space, and Environmental Sciences. The scope of the work is data citations in AGU content funded by NSF grants. These will be captured in an upcoming NSF PAR (2.0) update. CHORUS helps by providing customized dataset reports over the two-year project. This project helps AGU improve its dataset metadata deposits into the NSF PAR and will also help to develop and further open data citation best practices. More recently, CHORUS became a partner in the Coleridge Rich Context Project. This project aims to apply machine-learning and natural language processing techniques to publications provided by CHORUS members in order to find which datasets appear in publications, show how they have been used, find other experts who have used the data, identify other related datasets, and show the impact of funded datasets back to US agencies, institutions, and publishers as an ongoing service. This was done using a Kaggle contest, held earlier in the year, which received entries from over 1000 expert teams competing to produce the best algorithms and workflows to solve these problems. The summit was held on 20 October 2021 and numerous pilots have already started. We are dedicated to finding gaps by examining issues and problems and then filling those gaps by working towards solutions with the community, pursuing common ground, and identifying opportunities.

Lucia Melloni, Professor, Max Planck Institute for Empirical Aesthetics, is a researcher working on a dataset project funded by the Templeton World Charity Foundation. The goal of the project is to be an exemplary model for open science, from data generation to publishing the data and paper. The study aims to accelerate research on consciousness: the team is creating large, reproducible, open datasets and encouraging collaboration across the scientific community at large. The approach started by comparing the two most prominent theories of consciousness, Global Workspace Theory and Integrated Information Theory, which could not be more different in how they explain consciousness. In an adversarial collaboration, the team has joined with proponents of the two theories to put forward testable but opposing predictions. Everything has been pre-registered from the beginning so that the work is open to everyone. The project studies the brain using both invasive and non-invasive technologies, is obtaining data from over 500 individuals, and follows steps that foster reproducibility. Lucia said, “all has been great except for the fact that it has been extremely challenging when trying to document the whole research cycle especially when producing very reliable results…. Spending an enormous amount of time trying to standardize across different laboratories and trying to place everything into the credit order statement. Also noteworthy is that different researchers contribute differently.” All efforts need to be considered when developing standards for metrics.

Stefanie Haustein, Associate Professor, University of Ottawa, started by answering the forum’s question directly: we are not ready for data metrics. But this doesn’t mean we shouldn’t work towards them, because they could provide incentives for data to be counted as a first-class scholarly output. Coming from a bibliometrics background, Stefanie outlined what she sees as good metrics vs. bad metrics, emphasizing the need to avoid repeating mistakes from the past, where simplistic and flawed indicators like the h-index and the impact factor have created adverse effects. In addition, existing metrics are typically proprietary and irreproducible. Good metrics need to be more than just a number: they should always be used to complement qualitative evaluation, should be open, transparent, and reproducible, and shouldn’t be owned by a company. We need the metrics we create to provide context, to use disciplinary information to normalize for known differences, and to incentivize positive behavior so that data sharing increases and receives credit. Because we lack the necessary metadata, we can’t create valuable metrics today. Much of what is captured today has no discipline information, and many citations are missing altogether, which means we can’t create any meaningful metrics. What we need is to first engage all stakeholders to make data citations the norm; get repositories to collect more and better metadata; and conduct mixed-methods research that creates evidence on data sharing and citation patterns across disciplines so that benchmarks can be created. The Meaningful Data Counts research project is a two-year project funded by the Sloan Foundation and is part of the larger Make Data Count initiative.

Anna Hatch, Program Director, DORA, explained that DORA is an international non-profit initiative to improve the ways that researchers are assessed in academia. The initiative is supported by 16 organizations that include funders, publishers, scholarly societies, and academic libraries. DORA’s work focuses on supporting systems change at academic institutions. Taking a step back from recognizing and rewarding data, something else that we need to discuss is what institutional conditions and infrastructure are needed to support these initiatives. Research assessment reform is difficult because it involves multiple stakeholders; more challenging still, individual interventions will not solve systemic problems, so a collaborative approach is required. DORA developed a new tool (SPACE) designed to support the development of new policies and practices. This community project breaks research assessment reform down into five categories: standards, process, accountability, culture, and evaluative and iterative feedback. It helps institutions analyze the outcomes of interventions to improve academic career assessment and supports the development of new practices that bring about change. To offer some concrete ideas about institutional infrastructure and conditions, Anna presented DORA case studies in which a university was able to show shared goals and general standards for research assessment. She also noted that DORA is developing a dashboard that aims to track criteria and standards, such as open science and data reuse, that academic institutions can use for decisions that impact research careers.

Daniella Lowenberg, Principal Investigator, Make Data Count, explained that the Make Data Count initiative is focused on the development of open data metrics. Breaking apart the question “Are we ready for data citation metrics?”, Daniella believes we have been ready for data citations for years, although we are not as far along as we should be, and that we are ready for data metrics because we need them to drive the community. But are we ready to decide which data metrics are appropriate and right? We are not there yet. The one takeaway that Daniella wants everyone in attendance to walk away with is that there are steps everyone can take to build towards open data metrics. Focusing specifically on data citation, as opposed to usage and other measures, there are three principles: 1) Make data citation a priority: in addition to adding citations to references, funders need to emphasize and reward these practices, and we need to make it clear to researchers how to properly cite data. 2) Make data citation and data metrics open, transparent, and auditable so we can build trust around metrics. We have existing open infrastructure, so we should be supporting and implementing these frameworks. 3) Make data metrics evidence-based: to reiterate what Dr. Haustein pointed out earlier, citations need to be contextualized. We can all do this now by supplying better metadata. We will be ready for data metrics once we all support data citation in the form of action, prioritizing it in terms of implementation and resourcing.

Wrapping up the second session, Matt Buys put into context where we are in this process after hearing perspectives from the different stakeholders, and emphasized the collective action that is needed. We know there are different perspectives and activities that can be undertaken: researchers can cite data in reference lists and as outputs, and repositories can collect quality metadata, particularly on subject areas, and deposit the linkages from datasets to articles and preprints. Publishers need to make sure to collect data citations, include them in the metadata, and allow data citations in the references. Funders and policy makers should be advocating and requiring data citation practices. We heard about moving away from data availability statements and towards acknowledging and rewarding each citation as reuse, and about keeping infrastructure open; we need to do so in an open, transparent way so metrics can be developed. A general takeaway is that we need to prioritize open data citations so that bibliometrics experts can unveil which indicators may best fit diverse data, which should allow us to develop open data citation metrics as a community.

An audience question to the panelists was: “If you can make one change today to move things forwards, what would you change?” Panelist responses were: focus on metadata around datasets and software; create, through community effort, a non-proprietary open discipline classification; align the objectives of the measures with goals and values; create DOIs for datasets as a standard; and change the notion that we have to figure everything out before we can get started.

Final question for our panelists: “Are there metrics that can capture the diversity and not just a number?” Dr. Stefanie Haustein believes that while a lot of focus has been on qualitative measures, we do need to provide complementary measures. Metrics can never replace qualitative measures, but we are moving towards valuing data as an output, and metrics such as reuse metrics can help as a first step in that direction.

Calls to action:

Stakeholders

Researchers: cite data in reference lists of articles, cite data as outputs in your grants, CVs, and anywhere you would make mention of your article (or of data you’ve reused)
Data repositories: collect quality metadata (particularly on subject area) and deposit linkages from datasets to articles and preprints in the DOI metadata
Publishers: make sure that you are collecting the citation and including this in the DOI metadata; allow data citations in the reference list
Funders and policy: advocate and require data citation practices, acknowledge and reward data citation as reuse
Infrastructure: keep all data citation infrastructure open, let the community build together so metrics development is transparent and auditable
Overall takeaway: prioritize open data citations so bibliometrics experts can begin to unveil which indicators may best fit diverse data. This will allow us to openly develop data citation metrics as a community.

Forum attendees

Reach out to DataCite (support@datacite.org) to discuss how your repository or publication can contribute to data citation.
Join the Make Data Count newsletter (https://makedatacount.org/engage/).
Share on social media how you are helping make data citation #possible @MakeDataCount @DataCite.

Closing

Howard closed the forum by thanking our sponsors: American Chemical Society, AIP Publishing, Association for Computing Machinery, Society of Photo-Optical Instrumentation Engineers, and American Meteorological Society; as well as AGU, with whom we partnered on this forum; and our moderators and speakers.

