(1) Introduction

This paper is an output of the Open Bibliographical Data Workflows project funded through the DARIAH Theme 2022 programme from the DARIAH ERIC. DARIAH’s Bibliographical Data Working Group took this opportunity to prepare and disseminate a set of workflows in key areas related to bibliographical data (or ‘bibliodata’) curation and research. Our paper begins by discussing the challenges and opportunities of publishing workflows via the Social Sciences and Humanities Open Marketplace (SSH Open Marketplace), for which a set of five case study workflows has been prepared. The set of workflows presented here deals with both bibliodata curation and research and showcases various aspects of bibliodata-related agendas and their practical implementation, ranging from converting library data to research data (see 2.1) and introducing a novel data analysis toolkit (2.2), to sharing open citations (2.3), bibliodata LODification (2.4), and providing a national service for bibliometrics (2.5). The paper then turns to the ‘multilinguality challenge’ that the bibliodata community is facing and to how this challenge could be addressed by workflow creation and dissemination. Multilinguality in the context of bibliodata is understood here as the creation of a bibliographic dataset that contains data in multiple languages in a way that facilitates seamless data processing and analysis while respecting linguistic diversity. With this in mind, we propose differentiating between the needs of bibliodata curation and research, arguing that more attention is needed to raise awareness and share best practices for processing multilingual datasets in bibliodata research workflows, given that some solutions for multilingual challenges have already been adopted by the curatorial community. A preliminary version of a workflow for the ‘multilingual harmonization’ of bibliodata is proposed that could be further discussed and published on the SSH Open Marketplace to supplement existing bibliodata-related workflows.

(1.1) What are the SSH Open Marketplace and SSH Open Marketplace Workflows?

Workflows are important from a methodological point of view as they enumerate and describe the variety of steps to be followed while working on specific bibliodata-related tasks. They can play a key role in documenting specific use cases, setting down concrete instructions on how the same dataset might be adapted to meet different research needs, or how a service can be used. Making these workflows publicly available undoubtedly meets the needs of the larger DH community, supports the implementation of open science policies, and facilitates interinstitutional and interdisciplinary cooperation in the field (). Furthermore, the publication of workflows contributes to the FAIRification of scholarly communication by providing a structured, formalized, findable, and openly accessible resource for the academic community. This goal is best achieved by publishing workflows via the SSH Open Marketplace, which also serves to embed them within a wider and denser network of DH resources.

The SSH Open Marketplace is a discovery portal which pools and contextualizes resources for SSH research communities. These resources, broken into five categories (tools and services, training materials, datasets, publications, and workflows), allow the SSH Open Marketplace to showcase solutions and research practices for every step of the research data life cycle. Particularly important are the built-in links contextualizing resources with one another, allowing users to see, for instance, not just the tool itself but related training materials for learning how to use it, and publications that illustrate how other scholars have used the tool. Another aspect of contextualization is the reliance on domain-specific vocabularies, such as TaDiRAH, that allow users to describe their resources with rich metadata (). This contextualization, alongside strong curation and community outreach, allows the SSH Open Marketplace to facilitate discoverability and findability of research services and products that are essential for the sharing and re-use of research workflows and methodologies.

The SSH Open Marketplace, developed during the Horizon 2020 project SSHOC, acts as a thematic entry point into the European Open Science Cloud (EOSC) and is maintained by DARIAH, CLARIN, and CESSDA. The development work behind the SSH Open Marketplace was inspired by previous ventures such as the DiRT directory (), TAPoR, and the Standardization Survival Kit (), and indeed the initial data aggregation came in part from these sources (). Beyond this initial data aggregation, the SSH Open Marketplace also developed a sustainability plan and governance scheme that integrated the lessons learned from these previous experiences – notably the ‘directory paradox’ () – in order to incorporate community feedback and provide solid infrastructure and human resource support from the ERICs in the interest of maintaining platform viability (). As part of this effort, an editorial board was established to actively maintain the platform and curate its content.

A key asset of the SSH Open Marketplace is its Workflows, which build upon the built-in contextualization of the Marketplace to show a step-by-step approach to a research scenario. A workflow is an ideal way to share one’s research resources and to harness the power of the SSH Open Marketplace to contextualize tools and services with publications, datasets, and training resources, thus presenting a research activity from A to Z in a way that is reproducible and easy to follow.

(2) Bibliographical Data Workflows

The Open Bibliographical Data Workflows project aimed to prepare and publish a set of bibliodata-related workflows in response to the need to share best practices, popularize existing services and datasets, and facilitate the future reproducibility of research. These workflows represent an array of examples, from the more general (Workflow 1) to more specific ‘use cases’ (Workflows 2–5) with regard to tools and services. At the same time, each workflow discusses a different type of bibliodata to ensure that the needs of the largest possible user group within the bibliodata community are addressed.

(2.1) Workflow 1: From Library Data to Research Data ()

Library data is without question the most common type of bibliographical data, and is naturally a key source for data-driven research.

Library catalogues have been identified as a crucial resource for studying different aspects of print culture, ranging from literature and intellectual history to informatics (; ; ; ). At the same time, using them requires addressing challenges of data quality, representativeness, coherence, completeness, and interpretation. An important aspect of the bibliographic data science workflow is that it is conceived as a multilingual and transnational way of approaching extensive data for SSH research over long periods of time. Data from national libraries, for example, span hundreds of years and many languages. When looking at the possibility of combining different library datasets from multiple countries, researchers often face the challenge of dealing with language dependencies, deduplication, and structural harmonization.

Available bibliographic metadata is seldom readily amenable to quantitative analysis. Biases, inaccuracies, and gaps hinder the productive use of bibliographic metadata collections for research (). Varying standards, conventions, and languages pose challenges for data integration. The purpose of harmonization is to turn catalogue information into a dataset that can be used in quantitative humanities research. The core of this type of work consists in iterating between harmonizing data, conducting analyses built on different use cases, and validating data against other sources. The central position of these three activities within the whole iterative process is schematically presented in Figure 1.

Figure 1 

Bibliographic Data Science: From Catalogue to Research Data Workflow.

For harmonization, it is common to use external data sources on authors, publishers, and places to enrich and verify bibliographic information. Examples of harmonization include the removal of spelling errors and inconsistent transliteration or transcription, the disambiguation and standardization of terms, the imputation of missing values, and the development of custom algorithms that can, for instance, convert raw MARC notation into numerical page count estimates. For library catalogue data we can use largely identical algorithms across most metadata collections. However, we often have to deal with various challenges posed by different languages, especially if we want to work across catalogues from different countries. Automation, scalability, and quality control are critical, as the data collections may contain information on millions of documents.
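To make the page count example concrete, the following minimal sketch (in Python, and not the project’s actual tooling) shows how a MARC 300 $a extent statement such as ‘xii, 345 p.’ might be converted into a numeric page count estimate; real harmonization pipelines require many more rules and validation steps.

import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(numeral):
    """Convert a lowercase roman numeral (e.g. 'xii') to an integer."""
    total, prev = 0, 0
    for char in reversed(numeral):
        value = ROMAN.get(char, 0)
        total += value if value >= prev else -value
        prev = max(prev, value)
    return total

def estimate_page_count(extent):
    """Estimate a numeric page count from a MARC 300 $a extent statement.

    Handles only simple cases such as 'xii, 345 p.'; real pipelines need
    many more rules, plus validation against other sources.
    """
    text = extent.lower()
    arabic = [int(n) for n in re.findall(r"\d+", text)]
    roman = [roman_to_int(r) for r in re.findall(r"\b[ivxlcdm]+\b", text)]
    return sum(arabic) + sum(roman)

print(estimate_page_count("xii, 345 p."))  # -> 357
print(estimate_page_count("345 p."))       # -> 345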

It is important to understand that this is an iterative process combining the harmonization, analysis, and validation of data. On a deeper level, data quality is assessed by way of such generic quality dimensions as completeness of bibliographical information, conformity, appropriateness, and consistency (; ). The slogan ‘fitness for use’, frequently mentioned in the quality assessment literature, reflects the fact that we always measure how well the data satisfies a functional requirement. Since these requirements may differ between a particular library and researcher, what may count as high-quality data in one context may be low quality in another.
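As an illustration of one such generic dimension, the following sketch (with purely hypothetical field names and records) computes field-level completeness, i.e. the share of records in which each metadata field carries a value; conformity and consistency checks can be profiled in a similar way.

from collections import Counter

# Hypothetical harmonized records; the field names are illustrative only.
records = [
    {"title": "Kalevala", "author": "Lönnrot, Elias", "year": 1835, "pages": None},
    {"title": "Seitsemän veljestä", "author": None, "year": 1870, "pages": 292},
    {"title": "Anthology of poems", "author": None, "year": None, "pages": 120},
]

def completeness(records, fields):
    """Return the share of records with a non-empty value for each field."""
    filled = Counter()
    for rec in records:
        for field in fields:
            if rec.get(field) not in (None, ""):
                filled[field] += 1
    return {field: filled[field] / len(records) for field in fields}

print(completeness(records, ["title", "author", "year", "pages"]))
# e.g. {'title': 1.0, 'author': 0.33..., 'year': 0.66..., 'pages': 0.66...}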

We might imagine this as an open science ecosystem made up of different metadata collections, where work on one of them eases the use of another. It is also helpful to think in terms of ecosystems because the harmonization step often depends on other linked sources such as authority files or other bibliographical sources. Some have described this approach, through which library metadata catalogues become research data, as ‘bibliographic data science’ (). In the case of a workflow that describes the conversion of raw data from library catalogues into research data, it is important to account for, document, and control the process. This can be described as an open science initiative because it takes questions of reproducibility and data quality seriously.

(2.2) Workflow 2: AVOBMAT Software: How to Analyze and Visualize Bibliographical Data ()

The aim of creating the AVOBMAT (Analysis and Visualization of Bibliographic Metadata and Texts) multilingual research toolkit was to combine bibliographical data and textual analyses in several languages by using current NLP techniques and methods in a transparent and reproducible workflow, focusing especially on multi-language comparative analysis with time-based components (; ; ). This methodological approach makes it possible to ask more complex research questions than would otherwise be possible when dealing separately with either textual or bibliographic data analyses. The majority of NLP-based text analysis applications make little use of bibliographic data. As for the multilingual aspects of the workflow, the outcome of the preprocessing and analytical modules depends on the language(s) of the uploaded texts. There are two ways to assign a language to a document: researchers can manually choose a language for the full dataset (out of 52 available languages) or select the automatic language detection option, in which case the system detects a language independently for each document. In the preprocessing phase, based on the selected language, users can choose a number of other features, including stopword filtering, punctuation filtering, and lemmatization. For stopword and punctuation filtering, the spaCy library is used, to which additional stopword and punctuation lists can be added. The research toolkit implements spaCy language models (small, large, and transformer) for lemmatization; in the case of languages without spaCy models (e.g. Czech, Bulgarian, Romanian), LemmaGen () is implemented.
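The following is a minimal, illustrative sketch of this kind of language-dependent preprocessing (automatic language detection, stopword and punctuation filtering, lemmatization) using spaCy and the langdetect package; it is not AVOBMAT’s own code, and the model mapping shown covers only English.

# Assumes: pip install spacy langdetect, and: python -m spacy download en_core_web_sm
import spacy
from langdetect import detect  # stand-in for an automatic language detection step

# Map detected language codes to spaCy models; extend this per language as needed.
MODELS = {"en": "en_core_web_sm"}

def preprocess(text):
    """Detect the language, then filter stopwords/punctuation and lemmatize."""
    lang = detect(text)
    model = MODELS.get(lang)
    if model is None:
        raise ValueError(f"No model configured for detected language '{lang}'")
    doc = spacy.load(model)(text)
    return [tok.lemma_.lower() for tok in doc if not (tok.is_stop or tok.is_punct)]

print(preprocess("The catalogues were digitised and harmonised over several decades."))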

Users can investigate the bibliographic data of texts preprocessed in AVOBMAT in various ways (e.g. diachronic charts, network analysis, gender analysis of authors) by using distant and close reading methods. They can also explore semantically enriched metadata, drawing, for example, on named entity recognition of the documents. Metadata enrichment includes the identification of author gender in 55 languages (male, female, unknown gender, or without author). Users can upload a list of male and female forenames to supplement or replace those found in the dictionaries of the programme. AVOBMAT currently identifies part-of-speech tags in 16 languages and produces different interactive visualizations and statistical tables of results. It also disambiguates recognized named entities and links them to Wikidata, VIAF, and ISNI. The advantage of the latter, in the case of multi-language collections such as the European Literary Text Collection corpus or the European Literary Bibliography, is that recognized named entities such as names, locations, and organizations – which appear in distinct forms in different languages – are disambiguated and interconnected with one another, using Linked Data principles, via the relations stored in the growing multilingual Wikidata knowledge base.
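As an illustration of the general idea of linking recognized entities to Wikidata (not AVOBMAT’s implementation), the sketch below runs spaCy named entity recognition over a sentence and queries the public Wikidata search API for candidate identifiers for each entity; the model name and example sentence are assumptions.

# Illustrative named entity recognition and Wikidata lookup; not AVOBMAT's own code.
import requests
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def wikidata_candidates(label, lang="en", limit=3):
    """Query the public Wikidata search API for candidate entities for a label."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": label, "language": lang,
                "format": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return [(hit["id"], hit.get("description", "")) for hit in resp.json().get("search", [])]

doc = nlp("Elias Lönnrot compiled the Kalevala in Helsinki.")
for ent in doc.ents:
    print(ent.text, ent.label_, wikidata_candidates(ent.text))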

(2.3) Workflow 3: Metadata Crosswalk for Citation Data Production in OpenCitations ()

OpenCitations (), managed by the Research Centre for Open Scholarly Metadata at the University of Bologna, is a public service infrastructure organization that advocates for open science principles in the publication of bibliographic and citation data. The data published by OpenCitations is first collected from several data sources, curated, remodelled to be expressed in compliance with the OpenCitations Data Model (OCDM) (), and exposed using the Semantic Publishing and Referencing (SPAR) Ontologies and Semantic Web technologies (). To guarantee unrestricted access to data, the collections are produced in easily reusable and interoperable formats, namely RDF, SCHOLIX (), and CSV, and provided under a Creative Commons CC0 1.0 public domain dedication. In sum, all data released by OpenCitations complies with both the FORCE11 FAIR principles () and the I4OC recommendations () that citation data should be structured, separable, and open.

Citation data is crucial not only for calculating the impact of scientific and scholarly research, but also for verifying the soundness of theories and making studies replicable. In the SSH disciplines today, there is a noticeable delay in awareness regarding open science issues (). For this reason, SSH researchers should be provided with appropriate tools for facilitating the exchange of data concerning references among publications.

The proposed workflow offers a procedure for extracting citations and bibliographic data from a dataset and reshaping it in conformity with a new data model (). As concerns metadata, however, adopting a language-agnostic approach and a broad character encoding scheme – barring any restrictions imposed by data models – enhances the preservation of bibliodiversity over global uniformity in a dominant language.

This workflow describes a procedure for ingesting data into the OpenCitations infrastructure (), with the logical consequence that data will be represented in the OpenCitations Data Model (OCDM) (). Nevertheless, it also exposes general good practices for metadata crosswalks, fostering free and open-source tools, and disseminating an easily replicable procedure for metadata conversion between different data models.

The workflow consists of five steps:

  1. Data source selection: identification of a data source, exposing structured data where bibliographic entities are identified by persistent identifiers (PIDs).
  2. Development of a software plug-in for data conversion, extending the OpenCitations converter software to manage the bibliographic data of a new source; this is done by validating identifiers, abstracting the data content, and re-structuring it according to the target data model (OCDM).
  3. Production of metadata and citation data collections, structured according to OCDM.
  4. Ingestion of the metadata collection into OpenCitations META (), using the META software to integrate new data into the infrastructure and map each persistent identifier to an OMID, the OpenCitations identifier used to uniquely identify bibliographic records for deduplication purposes.
  5. Production of OpenCitations citation data: starting from the collection of citations expressed as links between the PIDs of the entities involved, the OpenCitations META database is used to map each PID to its corresponding OMID and produce a dataset of OMID-to-OMID citation links; the collection is then published in RDF, CSV, and SCHOLIX formats on open platforms under a CC0 waiver.

Looking forward, this workflow provides an easily replicable procedure for metadata crosswalks between different models, while also serving as a starting point for dealing with the complexities of multilingual bibliographic data acquisition. Researchers and institutions working with non-English-language data sources may use this methodology as a model, as it can be adapted and customized to meet their individual needs.
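To make the crosswalk idea tangible, the toy sketch below converts a pair of hypothetical source records (identified by DOIs) into two tabular files, one for metadata and one for citation links; the column names only approximate the tabular inputs used around OpenCitations META, and the official documentation should be consulted for the authoritative formats.

# A toy crosswalk: map hypothetical source records (with DOIs) to two tabular files,
# one for metadata and one for citation links. Column names are illustrative only.
import csv

source_records = [
    {"doi": "10.1234/abc", "title": "Example article", "year": "2020",
     "journal": "Example Journal", "references": ["10.1234/xyz"]},
    {"doi": "10.1234/xyz", "title": "Cited article", "year": "2018",
     "journal": "Example Journal", "references": []},
]

with open("metadata.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id", "title", "pub_date", "venue"])
    writer.writeheader()
    for rec in source_records:
        writer.writerow({"id": f"doi:{rec['doi']}", "title": rec["title"],
                         "pub_date": rec["year"], "venue": rec["journal"]})

with open("citations.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["citing_id", "cited_id"])
    for rec in source_records:
        for cited in rec["references"]:
            writer.writerow([f"doi:{rec['doi']}", f"doi:{cited}"])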

(2.4) Workflow 4: LODification of Bibliographical Data: Zotero to Wikibase Migration ()

Research in the SSH may require a flexible software solution for bibliodata aggregation, curation, storage, transformation, and dissemination, while also ensuring interoperability with external tools and resources and conforming to FAIR principles. Achieving these goals often requires major efforts to harmonize data, regarding both conceptual data modelling and data format migration. To date, there are very few universal standards among bibliodata providers for sharing metadata, the most notable being MARC21. Longstanding efforts to standardize inter-library data sharing have focused on addressing the needs of libraries rather than on providing bibliodata to the wider research community ().

To a certain extent, bibliodata aggregation across multiple data sources, management, storage, and dissemination can be achieved using freely available reference management systems like Zotero. However, options for data curation, enrichment, and alignment to other resources remain very limited, especially when scaling up to larger datasets spanning multiple descriptive languages (see also 3.2). To address this need, LODification of bibliographical data: Zotero to Wikibase migration is a workflow for converting bibliodata collections from Zotero to a custom Wikibase instance. Wikibase is a free and open-source software solution for storing, modelling, editing, and querying Linked Data, by which literal, field-based bibliodata can be turned into a network of identifiable entities and standardized properties. The best-known Wikibase instance is Wikidata, currently the largest free and open knowledge graph (). It contains millions of entities describing, for example, scholarly publications and their authors. Custom Wikibase instances for managing bibliodata may or may not follow the manner in which the Wikidata community models entities, especially regarding the semantics of bibliodata-describing properties. Use-case-specific modelling decisions thus remain possible, although it is generally preferable to follow Wikidata’s modelling choices to ease upload to or federation with Wikidata, and to benefit from Wikidata tools such as Scholia () or the Author Disambiguator.

Building on previous work (; ; ), this workflow describes the preliminary stages of ZotWb, a new Python app that enables the migration of bibliographic records from a Zotero group library to a custom Wikibase instance, cleaning and harmonizing the data in the process, and finally aligning and enriching it with Wikidata content. Wikibase items and Zotero records remain linked to each other.

ZotWb allows a fine-grained, user-defined mapping of Zotero item types, data fields, and creator types to Wikibase properties and classes. In several cases, it suggests using Wikidata-aligned entities for this purpose. The OpenRefine tool is integrated into the workflow for data cleaning, as well as for alignment with Wikidata (reconciliation) and subsequent enrichment. OpenRefine offers highly specialized functions for duplicate detection based on string similarity, allowing users to harmonize variant literals describing persons, places, or institutions, and to disambiguate entities with identical labels. ZotWb includes a reconciliation service for its own custom Wikibase, in addition to the default Wikidata reconciliation service built into OpenRefine, so that literals can be matched against entity labels on the ZotWb Wikibase and/or Wikidata. Users may choose to import aligned Wikidata entities, or to exploit the alignment using federated SPARQL queries.
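The following is a minimal sketch of the mapping idea rather than the ZotWb app itself: it reads items from a public Zotero group library via the Zotero web API and prints QuickStatements-style rows for a custom Wikibase; the group ID, property IDs, and class QIDs are hypothetical placeholders.

# Not the ZotWb app itself: a minimal sketch of the item-type/field mapping idea.
import requests

GROUP_ID = "123456"                                              # hypothetical public group
ITEMTYPE_TO_CLASS = {"journalArticle": "Q101", "book": "Q102"}   # assumed Wikibase classes
FIELD_TO_PROPERTY = {"title": "P10", "date": "P11"}              # assumed Wikibase properties

resp = requests.get(f"https://api.zotero.org/groups/{GROUP_ID}/items",
                    params={"format": "json", "limit": 25}, timeout=30)
resp.raise_for_status()

for item in resp.json():
    data = item["data"]
    wb_class = ITEMTYPE_TO_CLASS.get(data.get("itemType"))
    if wb_class is None:
        continue  # unmapped item types require an explicit mapping decision
    print("CREATE")
    print(f"LAST\tP1\t{wb_class}")  # P1 assumed to be 'instance of' on the target Wikibase
    for field, prop in FIELD_TO_PROPERTY.items():
        value = data.get(field)
        if value:
            print(f'LAST\t{prop}\t"{value}"')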

The ZotWb tool and workflow around it rely entirely on free software, which makes it possible for anybody to share bibliographies as Linked Data.

(2.5) Workflow 5: Studies on Science and Higher Education System in Poland Using the RAD-on Platform ()

A workflow concerning studies on the science and higher education system in Poland illustrates how to gather and analyze data using the RAD-on platform (). The RAD-on platform is a realization of the idea of open government data, providing reports, analyses, and raw data on science and higher education in Poland, and featuring data collected at the national level and at all universities and research institutions in Poland since 2013. The platform has been developed for scientists and scholars, students, entrepreneurs, policymakers, and journalists looking for reliable and up-to-date information on scientific and scholarly institutions and the research they conduct. In addition to its openness, RAD-on implements the FAIR principles, which are associated primarily with scientific datasets used in research.

The RAD-on platform is part of the largest national research information system on science and higher education in Europe, and its users gain access to regularly updated data on Polish science, interactive maps and charts, and REST APIs, which can be used to create custom data summaries. Freely available datasets contain information on scientific and scholarly institutions, activities in the fields of science and the arts, academic staff, academic promotion procedures, and bibliographical data about scientific and scholarly publications by Polish researchers.

The workflow includes five steps in which a user can perform a custom analysis of data using RAD-on:

  1. The first step involves coming up with research questions which may then be answered through the querying of RAD-on’s data.
  2. The second step underlines the importance of identifying the relevant dataset for a particular scientific inquiry.
  3. The third step presents two possible approaches for retrieving data: the first involves a basic export of data, which might be preferred by users with no background in IT; the second, more advanced approach involves the use of an application programming interface (API) and might be more suitable for users with basic programming skills, such as data scientists (see the sketch after this list).
  4. The fourth step identifies various procedures for checking the obtained data.
  5. In the last step, some useful and popular tools for data analysis are listed and examples of completed data analysis using RAD-on are provided.
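As a minimal illustration of the API-based path in step 3, the sketch below retrieves all records from a paginated REST endpoint returning JSON; the endpoint URL and query parameter names are placeholders to be replaced with the ones documented for the chosen RAD-on dataset.

# The endpoint URL and query parameters below are placeholders: substitute the ones
# documented on the RAD-on portal for the dataset identified in step 2.
import requests

BASE_URL = "https://example.org/api/institutions"  # placeholder, not a real RAD-on endpoint

def fetch_all(url, page_size=100):
    """Retrieve all records from a paginated REST endpoint returning a JSON list."""
    records, page = [], 0
    while True:
        resp = requests.get(url, params={"page": page, "size": page_size}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

print(f"Retrieved {len(fetch_all(BASE_URL))} records")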

The workflow for RAD-on was designed to benefit all its users, regardless of their digital literacy or familiarity with the higher education and research landscape in Poland. Its authors were confronted with the challenge of striking a balance between providing accurate descriptions of data analysis and ensuring that these can be understood by researchers working outside data science and related fields.

(3) Future Challenges for the Creation of New Bibliodata Workflows: The Multilinguality Challenge

The bibliodata-related workflows presented in this paper describe practices pertaining to bibliodata curation and research. One issue they all share is the need for data harmonization. What these workflows illustrate is that, because of their differing functions, curatorial-infrastructural and research environments place different emphases on data harmonization. In the following sections we investigate how the harmonization of multilingual data is being tackled in both cases. We argue that bibliodata-related research needs stronger support to face this challenge, given the more varied and less formalized nature of its research activities. To that end, a general multilingual harmonization workflow is proposed for further specification and case-by-case adaptation.

(3.1) Multilinguality in Contemporary Bibliographical Data Curation

In this section we present the typical scenario of a ‘multilingual challenge’ in the curatorial environment. Multilinguality issues are discussed in the librarian and bibliographical community because MARC21 – by far the predominant data format for the curation of bibliographical data in the library sphere – and the related cataloguing rules (today mostly RDA, previously AACR2) do not sufficiently reflect multilingual aspects (; ). MARC21 does not define specific rules on how to manage multilingualism at the general level, presupposing that only one language is used for processing each individual bibliographical or authority record, since the field for the cataloguing language is defined as non-repeatable.

Multilinguality issues might be partly solved at the level of specific types of fields. Fields for subject description (the 6XX fields) allow the parallel use of different descriptive systems. These systems are often language-specific (the code of the system used is stored in subfield 2 or in the second indicator of the 6XX field). In cataloguing praxis, this option for the inclusion of multilingual variants is mostly used for topical headings (field 650).

Hence, these descriptive systems make it possible to create a bibliographical record in multiple languages and to interconnect each given value with the language used for its cataloguing. For example, in the cataloguing praxis of the Czech National Library, the authority file for topical headings includes a translation of each heading into English (the language version is then specified by subfield 2). This makes it possible to include topical headings in both Czech and English in each record (the czenas vs. eczenas subject authority files):

65007 $a argentinská próza $7 ph162925 $y 20. století $2 czenas

65009 $a Argentine prose literature $y 20th century $2 eczenas
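A minimal sketch of how such parallel headings can be read programmatically is given below, using the pymarc library to collect 650 topical headings grouped by the subject system declared in subfield 2; the input file path is a placeholder.

# A minimal sketch using the pymarc library; the input file path is a placeholder.
from collections import defaultdict
from pymarc import MARCReader

headings_by_system = defaultdict(list)

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        for field in record.get_fields("650"):
            systems = field.get_subfields("2")   # e.g. 'czenas' or 'eczenas'
            terms = field.get_subfields("a")
            for system in systems or ["(unspecified)"]:
                headings_by_system[system].extend(terms)

for system, terms in headings_by_system.items():
    print(system, terms)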

Apart from authority files, the MARC21 standard as such offers at least a partial solution for dealing with multilinguality in fields not related to authority systems. First of all, the official documentation defines field 880, ‘Alternate Graphic Representation’, which enables the values of essentially any field to be represented in a different script (Arabic, Latin, Chinese/Japanese/Korean, Cyrillic, Greek, and Hebrew each have a specific code defined for this field). Integrated library systems (ILS) allow the data to be displayed simultaneously in all of the recorded scripts. MARC21 can also partly store information about the original title of a translated book, but this information is processed inconsistently (in fields 240 and 765) depending on the cataloguing rules used. In each of these approaches, it is clear that application falls far short of its full potential, so that sometimes the name of a translator (in 245 $c, together with other contributors) is the only evidence that a book is a translation, with neither the source language nor the original title recorded ().

The subject description of a given document is mostly provided in verbal form as well as in language-agnostic classification systems such as the Dewey Decimal Classification or the Universal Decimal Classification. Each value in these systems is represented by a specific numeric code, which makes it possible to group together the terms for related entities regardless of their verbal representation (e.g. each value related to language or literature starts with 8). These classifications are regularly mapped and translated into vernacular languages, which allows librarians and bibliographers to understand the content of a record without needing to be familiar with its language.

For the purposes of internationally understandable subject description, various discipline-specific descriptive systems are continuously developed within internationally coordinated consortia. These consortia often curate the central version of a given descriptive system, which is mostly available in English, while national nodes are responsible for its translation into specific vernacular languages. With regard to their level of globalization, internationally used thesauri are more frequent in the natural and medical sciences, where research outputs are published mostly in English (cf. widely adopted international thesauri such as MeSH or AGROVOC).

In addition to the aforementioned options defined in the general MARC21 standard, multilinguality issues are quite often solved via local interpretations of the format, in particular by defining specific subfields that contain information about the language of the value in a given field. Such a solution was quite recently implemented at KBR (the Royal Library of Belgium), which uses a special subfield @ for specifying the language (in ISO notation):

110 $a Province du Brabant wallon $g Brabant wallon $c Wavre $@ fr-BE

In contrast to the Czech and Belgian solutions, the National Library of Israel uses subfield 9 to specify the script:

710 $9 heb $a קהילת עדי יהוה בישראל

710 $9 lat $a Watchtower Bible and Tract Society of New York

Unfortunately, these rules are defined only as local extensions of MARC21: they are frequently undocumented, differ among databases, or often do not exist at all.

Multilinguality issues are also taken into account in the processing of records in other types of bibliographical databases. National bibliometric databases usually register data in vernacular languages, as they are used primarily for evaluation purposes at the level of a given country. Nevertheless, at least the most important information on a particular document (title, abstract, keywords) is typically also available in another language (predominantly English) for the international audience.

For the most part, references and bibliographical citations are currently language-agnostic. Although there is now a plethora of citation norms, most of them either avoid language-specific abbreviations altogether – each element (volume, issue, etc.) being identified by its position or by specific punctuation – or use internationally recognized abbreviations in English or Latin.

(3.2) Multilingual Harmonization for Bibliodata-based Research

Bibliodata-driven research performed on multilingual datasets poses a significant challenge, and researchers commonly face the task of disambiguating language-sensitive values in metadata fields. They may wish, for example, to harmonize the names of publishing houses, authors, or subject descriptions that are expressed in different languages in different databases. These processes typically aim to yield reliable statistical results, for which the harmonization and deduplication of values are crucial. More importantly, researchers may wish to combine different single-language datasets, and their research might entail different standards for disambiguation or deduplication than those that exist in curatorial settings ().

Multilinguality poses an even bigger challenge to the harmonization of metadata contents when creating international datasets. Having investigated the parameters of bibliodata with regard to the availability of multilingual elements, we propose a dedicated workflow for the preparation of multilingual bibliographical datasets. Such a workflow would not only contribute to further data reuse among bibliodata processors from different countries, but could also significantly accelerate comparative research on bibliographical data by providing a path for interconnecting and harmonizing datasets processed in different languages.

The main goal of such a workflow is to ensure that all language-sensitive data elements are machine-readable while respecting multilinguality as such. To do so, it is critical to identify language-sensitive data elements, apply multilingual PID systems, and enrich data in a way that respects both the multilinguality and the machine-readability of the data.

From our perspective, the enrichment of bibliographical datasets with multilingual elements should generally be based on the following steps (a minimal reconciliation sketch follows the list):

  1. identification of metadata fields with multilingual content
  2. identification of external sources useful for multilingual description (e.g. VIAF, Wikidata)
  3. application of harmonization solutions
  4. enrichment of data with PIDs or thesauri while respecting multilinguality of contents
  5. documentation and publication of data
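As an illustration of steps 2 and 4, the sketch below matches a language-sensitive literal against Wikidata and retrieves its labels in several languages, so that a dataset can carry a PID alongside multilingual forms of the value; the example literal and language list are assumptions.

# Match a language-sensitive literal against Wikidata and fetch multilingual labels.
# The example literal and the language list are illustrative assumptions.
import requests

API = "https://www.wikidata.org/w/api.php"

def reconcile(label, lang):
    """Return the first Wikidata QID whose label matches the given literal, if any."""
    resp = requests.get(API, params={"action": "wbsearchentities", "search": label,
                                     "language": lang, "format": "json", "limit": 1},
                        timeout=30)
    resp.raise_for_status()
    hits = resp.json().get("search", [])
    return hits[0]["id"] if hits else None

def labels(qid, languages):
    """Fetch the labels of a Wikidata entity in the requested languages."""
    resp = requests.get(API, params={"action": "wbgetentities", "ids": qid,
                                     "props": "labels", "languages": "|".join(languages),
                                     "format": "json"},
                        timeout=30)
    resp.raise_for_status()
    entity = resp.json()["entities"][qid]
    return {code: val["value"] for code, val in entity.get("labels", {}).items()}

qid = reconcile("Národní knihovna České republiky", "cs")
if qid:
    print(qid, labels(qid, ["cs", "en", "de", "pl"]))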

Key elements for implementing such workflows in the case of both research and curatorial praxis are:

  1. the use of data models and standards that allow for direct representation of multilinguality (e.g. by a specific subfield/data element),
  2. further development of multilingual bibliodata infrastructure – descriptive thesauri and other knowledge bases – with an emphasis on incorporating values in less common languages; the development of such datasets requires intensive cooperation between GLAM experts and bibliodata researchers,
  3. preparation of dedicated software tools for multilingual enrichment of the given bibliographical datasets.

The creation of a workflow for the multilingual enrichment of bibliographical data represents only the first step towards this goal. We suggest that the SSH Open Marketplace, with its strategies for engaging relevant stakeholders, might become an important vehicle for creating such workflows, which demand international and cross-sectoral collaboration. What would be of particular value is a detailed presentation in the SSH Open Marketplace of tools (e.g. platforms) emphasizing the research reuse of multilingual datasets – namely, the inclusion of data harmonization workflows for such infrastructures, which could serve as tested, scalable workflows. Examples of such platforms include the European Literary Bibliography (literarybibliography.eu), a service aggregating bibliographic metadata, and projects such as Share-VDE. At the same time, of course, ensuring the creation of multilingual datasets for research purposes will not by itself solve other existing issues around data quality or data coherence for every possible research or curatorial need.

(4) Conclusion

Describing and disseminating workflows is an important tool for any academic community of practice, including the bibliographical data community. The role of workflows is to promote harmonization and interoperability in a dispersed field and to provide an opportunity to better allocate resources and produce reproducible scientific results. To that end, initiatives such as the SSH Open Marketplace are helpful as vehicles for standardization and dissemination.

Bibliodata-related workflows are needed at different levels of data creation and research, both for specific software features or data sources and for consolidating methodological aspects of bibliographical data curation at a very general level. Such workflows have to be applied in particular to the cleaning, conversion, enrichment, and harmonization of bibliographical data.

Multilinguality is an important part of data harmonization both in curation and in research, yet there are no overarching, consistent workflows to deal with these challenges. This issue is especially prominent in those research activities which vary from case to case, and which are less formalized than curatorial processes. Multilinguality needs to be consistently respected on all possible levels of scholarly data processing. The ‘Helsinki Initiative on Multilingualism in Scholarly Communication’ advocates for supporting the infrastructure of scholarly communication in national languages, as this fosters opportunities for publishing locally relevant research and interacting with communities beyond academia, and in particular helps to interconnect locally relevant data with broader international cooperative networks. Our proposed workflow for the preparation of multilingual bibliographical datasets aims to satisfy the need for supporting and developing multilingual frameworks for data sharing.