Takeaways from CHORUS Forum on Improving Scholarly Publishing Metadata

With over a hundred participants, the September 29th CHORUS Forum generated a robust discussion of solutions for improving scholarly publishing metadata.

A big thank you to our sponsors for supporting this forum: Association for Computing Machinery, Clarivate, American Chemical Society, AIP Publishing, GeoScienceWorld, and STM.

Howard Ratner, Executive Director, CHORUS, moderated the first session, which examined weak spots in scholarly publishing metadata and why the scholarly community must work together to improve it.

Andrea Medina-Smith, Data Librarian, National Institute of Standards and Technology (NIST), focused on the progression of metadata fields – acknowledging how far we’ve come, but also demonstrating how far we still need to go to close the gaps. NIST is unique in that it serves its researchers as a publisher, a funder, and a library, and therefore needs to track research from a variety of perspectives. NIST wants to show its researchers what is happening with their research output, and collecting accurate metadata is necessary to accomplish this goal. Participating in the CHORUS Forum is a useful opportunity for NIST to highlight its needs, and the gaps in the metadata, with a diverse group of stakeholders.

Using Read & Publish agreements as an example, Alexander ‘Sasha’ Schwarzman, Content Technology Architect, Optica Publishing Group, identified a weak spot in scholarly publishing metadata: the author, whether an individual or a consortium. The markup of the author(s) matters because one has to be able to identify who is eligible under the agreement. Is it only the corresponding author, or is more than one author eligible, and how is that represented in the metadata? Currently there are no best practices or recommendations. If eligibility is based on contributor type, the CRediT taxonomy could be used; however, if eligibility is based on contribution level, no taxonomy or recommendation currently exists. Affiliation, licensing, and funding are other metadata elements important for tracking Read & Publish agreements, and they face the same challenge: a lack of consensus or best practices across the scholarly ecosystem.
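As a rough illustration of the contributor-type case Sasha described, eligibility under an agreement could be checked against CRediT roles attached to each author. This is a hypothetical sketch: the role names come from the CRediT taxonomy, but the eligibility rule and the data shapes are invented for illustration.

```python
# Hypothetical sketch: deciding Read & Publish eligibility from CRediT
# contributor roles. The role names are from the CRediT taxonomy; the
# eligibility rule itself is invented for illustration.
ELIGIBLE_ROLES = {"Conceptualization", "Writing - original draft"}

def eligible_authors(authors):
    """Return names of authors holding at least one eligible CRediT role."""
    return [a["name"] for a in authors
            if ELIGIBLE_ROLES & set(a.get("credit_roles", []))]

authors = [
    {"name": "A. Author", "credit_roles": ["Conceptualization", "Software"]},
    {"name": "B. Author", "credit_roles": ["Visualization"]},
]
```

As Sasha noted, no such rule is standardized today; a sketch like this only works once the community agrees on which roles (or contribution levels) confer eligibility.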

Metadata retention and preservation are two other areas audited by libraries, funders, and publishers that would benefit from clear rules and best practices. Sasha also reiterated that FAIR practices (findable, accessible, interoperable, reusable) and community-wide metadata validation tools are needed.

Our third speaker for this session was Megan O’Donnell, Head of Research Data Services at the Iowa State University Library. Megan spoke of gaps in actionable publishing metadata, not just metadata that tells us things. With the variety of metadata in the ecosystem, libraries bear the burden of making it all work together in their systems so that content remains findable.

Typically, works in institutional repositories are original, but without assigned DOIs they are excluded from the rich scholarly data environment and from tracking. ORCID iDs are very useful, but they are opt-in and not well integrated with library systems: they rely on authors to create, maintain, and link the records in those systems. Libraries are working to manage this opt-in process better, but more needs to be done.

Questions such as “who did we help?” and “what research did we enable?” help track and justify funding for resources and expertise. But to answer them, PIDs, unified standards, and transparency are needed.

Megan also noted that there is little coordination between publishers and repositories at the moment, and with the upcoming changes from the August 2022 OSTP memo and NSM-33, coordination is going to be even more important. She summarized the gaps that she encounters:

  • More standards and standardization
  • Metadata verification (who and how)
  • Post-publication updates and enrichment
  • Communication between data and text publishers
  • Goals and ethics for scholarly publications metadata

Make sure to watch the Q&A session in the forum video at 00:39:18.

During the break, attendees were asked what one piece of metadata is most often missing or incorrect. While there were some outliers, the responses painted a clear picture: our community must do a better job of collecting accurate author data (for all authors), affiliations, and funding metadata.

The audience was also asked how they fix their metadata; the results showed that the overwhelming majority rely on manual fixes, ask their vendors to fix it for them, or leave the errors in place.

Session 2 was moderated by Scott Dineen, Senior Director of Publishing Production and Technology, Optica Publishing Group, and focused on practical, promising solutions to close the gaps and connect the dots with metadata.

Marjorie Hlava, Founder, Chairman, and President, Access Innovations, addressed the framework of metadata for content. She noted that solutions to improve the accuracy of scholarly publishing metadata need to be automated, accurate, and consistent: we need tools that improve scholarly metadata, not just NISO standards to follow. A number of consortia for clean data are working together to share, check, and enhance metadata, and they have made giant strides compared to 10–20 years ago. However, improving the content along the way also needs to be addressed. Semantic enrichment supports metadata and search, saves researchers time, and allows for disambiguated information. Much could be done to weed out problems before publication; one example she shared is the American College of Physicians (ACP), which has found that its article submission process enables better indexing.

Constantly changing vernacular also causes problems when trying to find information. The following are some databases Marjorie recommended to help reduce errors:

  • TaxoGene (a specialized application of the Data Harmony Suite), which helps automatically find synonyms.
  • The Medicinal Plant Names Service (MPNS), a global nomenclatural indexing and reference service for medicinal plants.
  • The ICLAC Register of Misidentified Cell Lines, a registry of cell lines known to be misidentified through cross-contamination or other mechanisms (e.g., mislabeling).
  • SCIgen detectors, which identify automatically generated and otherwise problematic papers.

Indexing of audio and video is another area on the horizon; without proper tagging, these assets disappear from search.

Chris Shillum, Executive Director, ORCID, offered a different take on metadata: “We don’t need copies of metadata everywhere if we have PIDs and links. And to clean it up is insurmountable and unrealistic — it’s just too big and we’ve been trying to do this for decades.” ORCID is not just a registry of people; it is a registry of relationships between people and other research objects and entities, such as works, publications, and other research outputs. PID providers store metadata about linked objects along with their primary object.

Chris noted that if we want to improve metadata quality, we need to ensure PIDs are assigned to research objects and metadata is deposited with the primary PID provider. PID infrastructure providers are starting to work together to make it easier to access consistent metadata across all services.
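As a concrete illustration of resolving a PID to its registered metadata rather than keeping copies everywhere, the public Crossref REST API returns a work’s deposited metadata for a given DOI. This is a minimal sketch in Python; the DOI format shown is a placeholder, and error handling is omitted.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works/"

def metadata_url(doi):
    """Build the Crossref REST API URL that resolves a DOI to its metadata."""
    # quote() keeps "/" intact by default, which DOIs contain.
    return CROSSREF_API + urllib.parse.quote(doi)

def fetch_metadata(doi):
    """Fetch the registered metadata record from the PID provider.

    Network call: returns the "message" object of the Crossref response,
    which carries title, authors, funders, license, and more.
    """
    with urllib.request.urlopen(metadata_url(doi)) as resp:
        return json.load(resp)["message"]
```

The point Chris made holds here: as long as the DOI is deposited with accurate metadata at the source, any downstream system can fetch the current record instead of maintaining (and cleaning) its own copy.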

Yvonne Campfens, Executive Director, OA Switchboard, reported on the OA Switchboard, a mission-driven community initiative with practical tools designed to help all stakeholders in the scholarly communication enterprise. The OA Switchboard was designed to share information about Open Access publications throughout the publication journey, and it serves as a technical hub where metadata are exchanged and PIDs are leveraged. The organization has learned a few things about improving scholarly publishing metadata, namely:

  1. Metadata and PIDs should be determined and captured at the source.
  2. Event-based, time-specific values should be captured, as they can change over time; for example, the corresponding author at submission may not be the corresponding author at publication.
  3. Organizational stakeholders, such as research funders, publishers, and institutions, should be in control of their own metadata.

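The second lesson, capturing event-based and time-specific values, could be modeled by recording each metadata assertion together with the event it belongs to, rather than overwriting a single field. This is a hypothetical sketch; all names and data are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical model: each metadata value is an assertion tied to a
# publication event, so earlier values survive later changes.
@dataclass(frozen=True)
class MetadataAssertion:
    field: str        # e.g. "corresponding_author"
    value: str
    event: str        # e.g. "submission", "acceptance", "publication"
    asserted_on: date

history = [
    MetadataAssertion("corresponding_author", "A. Researcher",
                      "submission", date(2022, 3, 1)),
    MetadataAssertion("corresponding_author", "B. Colleague",
                      "publication", date(2022, 9, 15)),
]

def value_at(history, field, event):
    """Look up the value a field held at a given event, if recorded."""
    for assertion in history:
        if assertion.field == field and assertion.event == event:
            return assertion.value
    return None
```

With this shape, a funder auditing an agreement can ask who the corresponding author was at submission and at publication separately, instead of seeing only the latest value.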
Yvonne explained that the OA Switchboard does not clean up metadata or take responsibility for the metadata coming out of it; that responsibility rests with whoever provides it. Nor does it build custom solutions for participants to connect to the OA Switchboard. It does, however, build a community of like-minded spirits, offer better management information, re-use smart matching tools, and conduct market research.

Make sure to watch the Q&A session in the forum video at 01:37:15.

In closing the Forum, Howard and Scott noted a shared goal: to identify the gaps and the right stakeholders to carry discussions forward, and to find solutions that do not reinvent the wheel.

Key takeaways:

  • Author attribution needs to be more robust throughout all systems
  • Utilize PIDs wherever possible
  • Use existing tools, do not reinvent the wheel
  • Incentivize metadata collection
  • More coordination between stakeholders is needed
  • Better processes for updating metadata are needed

Thank you to our speakers for their insights, our sponsors for their contributions, and the many attendees who participated and helped make this CHORUS Forum a success.

The full recording of the event, along with the individual presentations, is available on the event page.
