6  Standardisation of Data & Terminology

Data can only be understood in relation to the broader concepts of information and knowledge, because data in itself is raw and unprocessed, and cannot be interpreted without context. The EMO Feasibility Study defines data gaps intuitively, without an apparent reference to a data model or a conceptual model. Because standardisation is one of the key services of the envisioned European music observatory, we gave a lot of consideration to the standards to be applied and to the process by which the observatory’s stakeholders negotiate terminology.

In information science, a conceptualisation is an abstract, simplified view of some selected part of the world, containing the objects, concepts, and other entities that are presumed of interest for some particular purpose and the relationships between them. Usually, when we record information about a musical work, we do not make a copy of the entire work but record some identifying properties of the work, for example, the name of its author, its title, its unique ISWC identifier, and the date of registration. Composers as human beings are represented by their names, IP Names or ISNI identifiers, and dates of birth and death.

A data gap can only be formally defined and filled with some reference to conceptual models of the world. A typical data problem plaguing the music sector is the amount of computer and human work needed to connect musical works and their recorded fixation, and eventually, the composers, producers, and performers linked to these objects for royalty payment. How can we define a data gap in such circumstances, and how can we fill it?

6.1 Business processes

Since the Open Music Observatory is primarily a data dissemination hub, the definition of our services (Chapter 3) applies elements of the Generic Statistical Business Process Model (GSBPM), an international standard that describes and defines the set of business processes needed to produce official statistics. The GSBPM is accompanied by the Generic Statistical Information Model, which builds on the Data Documentation Initiative (DDI) and the Statistical Data and Metadata eXchange (SDMX) (Pellegrino and Grofils 2013).

The DDI and SDMX are the foundations of working with social sciences archives, statistical microdata, and processed statistical data. Their key elements are described with the Resource Description Framework (RDF) of the World Wide Web Consortium and can be used in Linked Data.

Some elements of DDI are described with RDF: the DDI-RDF Discovery Vocabulary is a draft specification of the DDI Alliance (Hartmann et al. 2024). Whenever possible, we rely on this vocabulary in our observatory; where that is not yet possible, we follow the DDI Lifecycle (3.3) Documentation (Data Documentation Initiative 2020).

6.2 Conceptual and information models

Numerous knowledge institutions store information about musical works, as well as about the natural persons (humans) who composed or performed these works and contributed to their recorded fixation. If we want to inquire about composers, we must know that a composer is always a human (animals or software agents with AI algorithms cannot be entitled to composer copyrights). We must also know that a musical work is an abstract creation that manifests as a notation (physical or digital sheets, MIDI files) or as a recording (an analogue or digital physical object, or a file). If we want to validate the composer information connected to a recording of a particular musical work, we must access databases that link information about humans to some identifying properties of works or recordings.
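The conceptual distinctions above can be made concrete with a minimal sketch. The class and field names below are our own illustrative assumptions and do not reproduce any particular standard; the point is only that a data gap becomes something we can list once persons, works and manifestations are modelled separately.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    """A natural person, represented only by identifying properties."""
    name: str
    isni: Optional[str] = None
    date_of_birth: Optional[date] = None
    date_of_death: Optional[date] = None


@dataclass
class MusicalWork:
    """An abstract creation: it is described, never stored itself."""
    title: str
    composers: list = field(default_factory=list)  # list of Person
    iswc: Optional[str] = None


@dataclass
class Manifestation:
    """A notation or a recording that embodies a musical work."""
    work: MusicalWork
    kind: str                     # "notation" or "recording"
    isrc: Optional[str] = None    # only meaningful for recordings


def missing_identifiers(m: Manifestation) -> list:
    """List the gaps that block linking recording -> work -> composer."""
    gaps = []
    if m.kind == "recording" and m.isrc is None:
        gaps.append("ISRC of the recording")
    if m.work.iswc is None:
        gaps.append("ISWC of the work")
    gaps += [f"ISNI of {p.name}" for p in m.work.composers if p.isni is None]
    return gaps


# Illustrative records only; the names are invented.
composer = Person(name="Example Composer")
work = MusicalWork(title="Example Song", composers=[composer],
                   iswc="T-034.524.680-1")
recording = Manifestation(work=work, kind="recording")
print(missing_identifiers(recording))
```

In this sketch a data gap is simply a missing link in the chain from a recording to the persons who must be credited for it.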

We imagine a future European Music Observatory that is not a specialised knowledge institution and is not a library, archive, museum, or statistical agency. Instead, it should be able to consolidate knowledge from all such institutions and find ways to bring together data from private enterprises and data collection programs to fill the information gaps of the European music sector stakeholders.

Our services use the Wikidata Data Model as a data coordination and reconciliation model (Wikimedia Foundation n.d.). In this regard, we follow many successful EU and member-state projects (Alexiev et al. 2020; Diefenbach, Wilde, and Alipio 2021; Rossenova, Duchesne, and Blümel 2022; Faraj and Micsik 2023) and music projects (Siler 2022). We particularly want to mention the excellent work of the University of Helsinki in creating WB CIDOC, a simple business process and data mapping between the Wikidata Data Model and the more complex CIDOC CRM used by extensive collection management systems (Kesäniemi, Koho, and Hyvönen 2022).

StatDCAT-AP and the more general DCAT-AP definition of the EU Open Data Portal provide a bridge between library metadata systems, such as DCMI Metadata Terms (Dublin Core), the W3C DCAT standard for publishing datasets, and some core terms of the Statistical Data and Metadata eXchange.

Our most important reference is the DCAT-AP 3.0 specification, and its extension to statistical data by the EU Open Data Portal.
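As an illustration of what a DCAT-style description looks like in practice, the following minimal sketch serialises a dataset record as JSON-LD. The dataset, its URIs and its title are hypothetical; only the `dcat:` and `dct:` namespace URIs and the class and property names are taken from the W3C DCAT and DCMI vocabularies. A real DCAT-AP record has further mandatory and recommended properties that are not shown here.

```python
import json

# Hypothetical dataset description; all example.org URIs are placeholders.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/live-music-events",
    "@type": "dcat:Dataset",
    "dct:title": "Live music events (illustrative)",
    "dct:description": "An invented dataset used only to show the structure.",
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:accessURL": {
                "@id": "https://example.org/files/live-music-events.csv"
            },
        }
    ],
}

print(json.dumps(dataset, indent=2))
```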

The Europeana Data Model (EDM) similarly provides a more straightforward connection tool among various library, museological or musical collections; it mainly builds on Dublin Core and offers equivalent classes for the more complex CIDOC CRM (“Definition of the Europeana Data Model V5.2.8” 2017). We see no obstacle to connecting the EDM to Records in Contexts (RiC).

The CIDOC Conceptual Reference Model (CRM) provides an extensible ontology for concepts and information in cultural heritage and museum documentation (Bekiari et al. 2024).

Last, we mention some novel standards and standard candidates for documenting records, microdata, and metadata, such as music survey questionnaires. The Records in Contexts (RiC) 1.0 conceptual model and ontology were adopted in November 2023 to replace four international archival standards with backward compatibility. The DDI-RDF Discovery Vocabulary is an evolving standard that aims to describe important DDI terms with the Resource Description Framework. To keep our systems future-proof, we adopt elements of RiC and the DDI-RDF Discovery Vocabulary to document our question bank and codebooks (Archives Expert Group on Archival Description 2023; Hartmann et al. 2024).

Note

A future European Music Observatory could help with coordinating European research activities in the music sector. An EMO could also develop tools to establish cooperation between various data collection bodies. The Observatory should, therefore, also be involved in setting standards and developing common EU wide definitions that are crucial for consistency. (European Commission et al. 2020, p80)

Since the adoption of the European Interoperability Framework and of similar FAIR measures in open science, such terminological standardisation has taken the form of formal ontologies, i.e., knowledge bases that software applications can use, too.

The music observatory should have competent knowledge engineers and ontologists and should be involved in the discussions of sector-agnostic ontology, for example, on the possible improvements of CIDOC or EDM, for a better representation of music.

There is also a need to develop more usable and more widely accepted music-sector ontologies. In T5.1, we reviewed the Polifonia Ontology Network and the Music Ontology, but we believe both have shortcomings that prevent their full adoption.

6.3 Identification & Entity Linking

Entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalisation (NEN), is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in a digital resource, such as a file.

Tip

The MusicBrainz free music database contains records of 20 artists named Paris and 15 locations with the same name, all of which may enter a data-driven service either as artists who must be credited for attribution or royalties, or as the place of an event, release, or publication. Connecting the word Paris to the correct person, group or location is the task of entity linking.
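A minimal sketch of the disambiguation step follows. The candidate records, their identifiers and their context words are invented for illustration and are not MusicBrainz data; the scoring rule (count of shared context words) is a deliberately naive assumption, standing in for whatever model a production service would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    identifier: str        # hypothetical local identifier
    label: str
    kind: str              # "artist" or "place"
    context: frozenset     # words that tend to appear near this entity


CANDIDATES = [
    Candidate("artist-1", "Paris", "artist", frozenset({"rapper", "album"})),
    Candidate("artist-2", "Paris", "artist", frozenset({"band", "tour"})),
    Candidate("place-1", "Paris", "place", frozenset({"france", "venue"})),
]


def link(mention: str, expected_kind: str, context_words: set) -> Optional[Candidate]:
    """Pick the candidate of the expected kind sharing most context words."""
    pool = [c for c in CANDIDATES
            if c.label.lower() == mention.lower() and c.kind == expected_kind]
    if not pool:
        return None
    best = max(pool, key=lambda c: len(c.context & context_words))
    # Refuse to guess when no context word matches at all.
    return best if best.context & context_words else None


print(link("Paris", "artist", {"band", "tour", "2019"}))
print(link("Paris", "place", {"venue"}))
```

Returning no match when the context gives no evidence is a design choice: for attribution and royalties, an unresolved mention is usually cheaper than a wrong link.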

Since the inception of the World Wide Web, data has flowed across organisations and countries, and local identifiers are no longer a good solution. International organisations in music, heritage management and science, as well as national organisations, are increasingly shifting to persistent identifiers (also called permanent identifiers or handles).

Tip

A persistent identifier (or permanent identifier, or handle) is one that never changes, so that your bookmarks and links don’t break when a website, a database or an API service gets updated.

As of 2024, there is no European or international standard procedure for using PIDs, but several EU member states (Austria, Czechia, Germany, Netherlands) and other countries have already adopted national PID strategies. Because Reprex is the current technical registrar of the Open Music Observatory, we loosely follow the Dutch national strategy (Cruz and Tatum 2021) and the ID allocation practice of the Nationaal Archief; this implies no bias towards data partners in the Netherlands. The Dutch PID strategy does not mandate practices; it only recommends them, and it offers a well-thought-out, consistent policy for using global identifiers that are not country-specific.

The structure and management of global identifiers correlate strongly with the degree of achievable automation and with the potential for innovation within and across different sectors of the media industries.

Because of the prevailing problems of named-entity linking, we are planning value-added services for named-entity recognition and disambiguation (NERD). For this purpose, we are planning to use AI (see Section 7.3).

6.3.1 Registers & Authority Files

Registers record every data subject belonging to a category or class: every music publisher operating in a jurisdiction, every music composer with copyright claims, or every statistical dataset published. Registers are essential in identifying persons and objects (“things” in information science).

Authority files play a similar role in collections management: they provide identification information about persons or objects and tools for disambiguation. Authority files, for example, give the preferred name or title for persons and musical works that are known under several name or title forms, and they provide a language-independent, machine-readable identifier pointing to the preferred form. For two or more authors or performers with the same name, these identifiers help reference the proper person (or object).
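The role of an authority file can be sketched in a few lines. The two records below use invented identifiers (`person-0001`, `person-0002`), and the normalisation rule (strip diacritics and punctuation, ignore case and word order) is a simplifying assumption of ours, not a cataloguing standard.

```python
import unicodedata

# Hypothetical authority records: identifier -> preferred form and variants.
AUTHORITY = {
    "person-0001": {"preferred": "Bartók, Béla",
                    "variants": ["Bela Bartok", "Béla Bartók"]},
    "person-0002": {"preferred": "Dvořák, Antonín",
                    "variants": ["Antonin Dvorak", "A. Dvořák"]},
}


def normalise(name: str) -> str:
    """Strip diacritics, punctuation and case, and sort the name parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in base.lower())
    return " ".join(sorted(cleaned.split()))


# Index every known form of a name under its normalised spelling.
INDEX = {
    normalise(form): identifier
    for identifier, record in AUTHORITY.items()
    for form in [record["preferred"], *record["variants"]]
}


def lookup(name: str):
    """Return (identifier, preferred form) for a name, or None if unknown."""
    identifier = INDEX.get(normalise(name))
    if identifier is None:
        return None
    return identifier, AUTHORITY[identifier]["preferred"]


print(lookup("BÉLA BARTÓK"))
print(lookup("antonin dvorak"))
```

Note that name normalisation alone cannot separate two different persons with the same name; that is exactly where the machine-readable identifier is needed.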

Registers are valuable and indispensable for many digital workflows. They serve as the foundation of various processes, such as statistical sampling (determining who should receive a questionnaire) or copyright management (deciding who should receive the royalty payment). Their absence or inefficiency can significantly hamper these operations.

Unfortunately, the music industry has long lacked access to reliable, open registers. The reasons for this are beyond the scope of this report, but we highlight that closed, non-interoperable registers are deeply rooted in conflicts of interest among different sub-sectors of music, which are unlikely to be resolved in a short time. Therefore, music enterprises, researchers, professionals, and curators will need identification services and identity brokerage services for a long time.

Creating and maintaining high-quality registers requires significant professional and financial commitments, and such registers can form a vital service of a future European Music Observatory. Currently, we are experimenting with three service levels in the Open Music Observatory.

  • We create our own transparent and interoperable identifiers within the OMO for persons and their groups (ensembles, bands, orchestras, associations, …), legal persons (music businesses, collective rights management agencies, …), events (recording, composing and performing events, festivals, conferences, …), and musical works and their manifestations (books, works, recordings, sheets).

  • We create identity brokerage services and middle-term identification via Wikibase and Wikidata. Our identifiers are connected to middle-term Wikidata and Wikibase QIDs, which also serve as graph nodes connecting to registry, library, collections, and industry-specific identifiers.

  • We are piloting data improvement services that can find erroneous identifiers or add correct identifiers to various datasets.
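The brokerage idea in the second service level can be sketched as a crosswalk in which one hub identifier per entity carries the external identifiers. The hub keys and all identifier values below are placeholders; the property codes are the Wikidata properties for ISNI (P213), VIAF ID (P214) and MusicBrainz artist ID (P434). How the OMO stores this internally is not specified here, so the dictionary layout is only an assumption.

```python
# Hypothetical crosswalk: hub identifier -> external identifiers by property.
# All values are placeholders, not real identifiers.
CROSSWALK = {
    "Q_EXAMPLE_1": {"P213": "0000000000000001", "P214": "1001"},
    "Q_EXAMPLE_2": {"P213": "0000000000000002", "P434": "mbid-placeholder-2"},
}

# Reverse index: (property, value) -> hub identifier.
REVERSE = {
    (prop, value): hub
    for hub, identifiers in CROSSWALK.items()
    for prop, value in identifiers.items()
}


def translate(prop_from: str, value: str, prop_to: str):
    """Translate one external identifier into another via the hub node."""
    hub = REVERSE.get((prop_from, value))
    if hub is None:
        return None          # the source identifier is unknown to us
    return CROSSWALK[hub].get(prop_to)   # None if the target is missing


print(translate("P214", "1001", "P213"))
print(translate("P213", "0000000000000002", "P214"))
```

A `None` result for a known entity marks a candidate for the data improvement service: the entity exists, but the requested identifier has not yet been added.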

6.3.2 Open and persistent identifiers

In line with the practice of the Netherlands, we prefer the use of the following identifiers:

ISNI: preferred persistent identifier for names of people and groups. The use of ISNI is also preferred by Apple Music and Spotify; Teosto, the Finnish national collective management society, introduced it as a pilot, and its adoption in all CISAC societies is being considered in many use cases. ISNI is the ISO-certified global standard number for identifying the millions of contributors to creative works and those active in their distribution (Camp, Lieber, and IFLA 2022).

For legal persons, we are discussing the terms to use the OpenCorporates ID, because many organisations at this point do not have an ISNI.

ORCID: preferred persistent identifier for music researchers and scholars. This is in line with the Horizon Europe and the European Open Science Cloud recommendations. ORCID iDs follow the ISNI format and are drawn from a block of numbers that ISNI has reserved for ORCID, so the two identifier systems do not collide.

VIAF: the Virtual International Authority File is the shared authority file of national libraries. It offers more services than ISNI and includes an ISNI for the author.

DOI: we use the Digital Object Identifier for publicly released documents.

ISBN: We issue ISBN identifiers for long-form publications of our partners. (ISO 2017)
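Several of these identifiers can be checked for typing errors before any register is consulted. The sketch below validates the final check character of a 16-character ISNI or ORCID iD with the ISO 7064 MOD 11-2 procedure, as documented publicly for ORCID. The function names are ours, and passing the check only means the string is well formed, not that the identifier has been assigned to anyone.

```python
import re


def mod11_2_check_character(base15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed(identifier: str) -> bool:
    """Check the length, alphabet and check character of an ISNI or ORCID iD."""
    compact = identifier.replace(" ", "").replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return mod11_2_check_character(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_well_formed("0000-0002-1825-0097"))
print(is_well_formed("0000-0002-1825-0098"))
```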

6.3.3 Not open, music-industry specific identifiers

Book and music sheet publishing use the ISBN and ISMN, professional magazines and scholarly music journals use the ISSN, and music rights management uses the ISRC and ISWC. These standards usually resolve an identifier to some network location where metadata or the object itself can be found. This model has many advantages and disadvantages.

For example, the ISWC identification of musical works is the backbone of copyright management, and it is a closed and consistent system developed over many decades by the member organisations of CISAC. The downside of this closed system is that the metadata about the works identified by ISWC is available only to CISAC member societies. While CISAC offers an API for looking up the ISWC of a single musical work, it currently does not allow bulk access to the registered data.

We have already started a discussion with some music industry registers about connecting the Open Music Observatory to their systems. We are planning to present our proposals at the CISAC Good Governance seminar to be held in December 2024.

Musical works

  • ISWC: The International Standard Musical Work Code is a unique identifier for musical works. It is adopted as international standard ISO 15707 (ISO 2022).

  • OpenCollections ID: Our ID for musical works (only if we publish data about them).

Sound recordings

  • ISRC: The International Standard Recording Code (ISRC) is the international identification system for sound recordings and music video recordings (ISO 2019; Authority 2021).

  • OpenCollections ID: Our ID for sound recordings (only if we publish data about them).

Music sheets

  • ISMN: The International Standard Music Number currently identifies published music sheets (ISO 2022).

  • ISBN-13: Before the introduction of the ISMN, published sheets were identified by the ISBN book identifier.

  • ISBN-10: The older format of the ISBN book identifier, which predates both the ISMN and the 13-digit ISBN used to identify music sheets.

  • ISCC: The International Standard Content Code (ISCC) is an identifier for numerous types of digital assets. This is our preferred identifier for unpublished sheets (ISO 2017).

For unpublished works, our preference is the use of the brand-new ISO-standard ISCC because it was designed precisely for the use case we were looking for. It is free to generate, generated from digital content (or its digital copy), and can connect various local or lesser-used identifiers.
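The registry-issued codes above have fixed syntaxes, so malformed values can be caught before any lookup. The sketch below is a minimal syntax check under stated assumptions: the ISWC rule (prefix T, nine digits, one weighted modulo-10 check digit) and the ISMN rule (13 digits starting with 979-0, with the same check digit as other 13-digit GS1 numbers) follow the commonly documented layouts, and the ISRC pattern (two letters, three alphanumeric characters, seven digits, no check digit) is the familiar 12-character form. All three should be verified against the respective ISO texts before production use.

```python
import re


def is_valid_iswc(code: str) -> bool:
    """T + 9 digits + check digit, e.g. T-034.524.680-1."""
    compact = re.sub(r"[-.\s]", "", code.upper())
    if not re.fullmatch(r"T[0-9]{10}", compact):
        return False
    digits = [int(c) for c in compact[1:]]
    # Weighted sum: 1 for the prefix, then position * digit for 9 digits.
    total = 1 + sum(w * d for w, d in enumerate(digits[:9], start=1))
    return (10 - total % 10) % 10 == digits[9]


def is_valid_isrc(code: str) -> bool:
    """12 characters, e.g. US-S1Z-99-00001; syntax only, no check digit."""
    compact = code.upper().replace("-", "")
    return re.fullmatch(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}", compact) is not None


def is_valid_ismn(code: str) -> bool:
    """13 digits starting with 9790, e.g. 979-0-2600-0043-8."""
    compact = re.sub(r"[-\s]", "", code)
    if not re.fullmatch(r"9790[0-9]{9}", compact):
        return False
    digits = [int(c) for c in compact]
    # Alternating weights 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


print(is_valid_iswc("T-034.524.680-1"))
print(is_valid_isrc("US-S1Z-99-00001"))
print(is_valid_ismn("979-0-2600-0043-8"))
```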

Datasets

  • DOI: DOIs are assigned to each distribution of a dataset. As datasets are often updated continuously, they will have periodic versions with versioned DOIs (from Zenodo).

  • OpenCollections ID: Our ID for unversioned (continuous) datasets, pointing to the latest available version of the data.

Codebooks

  • URI: Whenever possible, we use standard codebooks of SDMX or Eurostat, provide a URI to the codebook, and provide dereferencing to the codebook definition.

  • OpenCollections ID: Our ID for our codebooks, regardless of whether they are the same as the SDMX/Eurostat standards or we create a non-standard coding for a novel dataset.

Questionbank

  • URI: Whenever possible, we use standard questionnaires, provide a URI to the codebook, and provide dereferencing to the DDI questionnaire item definitions.

  • OpenCollections ID: Our ID for questionbank items.
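The relation between a continuous dataset identifier and its versioned DOIs can be sketched as a simple lookup. The registry below, the dataset identifier and the DOIs under the `10.0000` prefix are invented placeholders; how the observatory actually stores this mapping is not described here.

```python
from datetime import date

# Hypothetical registry: continuous dataset ID -> (release date, versioned DOI).
VERSIONS = {
    "oc-dataset-1": [
        (date(2023, 6, 1), "10.0000/example.1"),
        (date(2024, 1, 15), "10.0000/example.3"),
        (date(2023, 11, 2), "10.0000/example.2"),
    ],
}


def latest_doi(dataset_id: str) -> str:
    """Resolve the unversioned ID to the DOI of the most recent release."""
    return max(VERSIONS[dataset_id])[1]   # tuples compare by date first


def doi_as_of(dataset_id: str, day: date):
    """Return the DOI of the release that was current on a given day."""
    past = [release for release in VERSIONS[dataset_id] if release[0] <= day]
    return max(past)[1] if past else None


print(latest_doi("oc-dataset-1"))
print(doi_as_of("oc-dataset-1", date(2023, 12, 31)))
```

Citing the versioned DOI keeps an analysis reproducible, while the unversioned ID suits dashboards that should always show current data.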

6.3.4 Lyrics

In many genres, lyrics are a very important part of a musical work, and there is a growing demand to provide or analyse the lyrics of a work. For example, in our X, we want to create location-aware music services and encourage the public performance of music made in Bratislava, or music somehow specific to Bratislava, in the public places and on the radio stations of Bratislava. One possible semantic connection to this environment is that a song is about Bratislava (or Berlin, Paris, or Germany).

Access to the lyrics part of the music is not straightforward, mainly because the lyrics may be arranged from a literary work. We see lyrics identification and semantic analysis as the next immediate step to our location-aware application, for which we are looking for good industry solutions. In many cases, we will likely need to rely on the ISCC code as a temporary identifier for lyrics databases that were not available in a licensed format earlier.

6.3.5 ISCC

The Open Music Observatory will start to implement the newest ISO-standard open identifier, the ISCC-CODE. The ISCC inverts the principle of a centralised register: it generates the code from the digital content object itself, so no third-party lookup is needed to find the identifier of the object.
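The principle of a content-derived identifier can be illustrated with a short sketch. Note that this is not the ISCC algorithm: we use a plain SHA-256 digest from the Python standard library as a stand-in, and the `sketch:` prefix is invented. The actual ISCC is composed differently and also supports similarity matching, which a cryptographic hash does not.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Derive an identifier from the content itself (stand-in, not ISCC)."""
    return "sketch:" + hashlib.sha256(data).hexdigest()[:32]


# Two parties holding the same bytes derive the same identifier
# independently, without consulting any register.
file_at_publisher = b"%PDF-1.7 illustrative sheet music bytes"
file_at_library = b"%PDF-1.7 illustrative sheet music bytes"
edited_file = b"%PDF-1.7 illustrative sheet music bytes, revised"

print(content_fingerprint(file_at_publisher))
print(content_fingerprint(file_at_publisher) == content_fingerprint(file_at_library))
print(content_fingerprint(file_at_publisher) == content_fingerprint(edited_file))
```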

ISCC registration becomes necessary when an ISCC code needs to be globally unique, publicly discoverable, resolvable, owned or authenticated. While these features inevitably require some kind of registry, not all of them require a centralised institutional registry. The ISCC specifies the necessary protocols to implement the aforementioned features in a decentralised, federated environment and across multiple public blockchains. Given a registered ISCC code, an application can unambiguously determine on what blockchain (if any), by which account, and at what time an ISCC has been registered.

Registered ISCC codes refer to an authoritative public blockchain network. This indicator is part of the ISCC Code itself, such that codes registered on different networks cannot collide. This guarantees uniqueness of ISCC codes across multiple blockchains. Ownership of ISCC codes (not the identified content) is granted to the signatory of the first transaction for a given ISCC code on the corresponding blockchain.

As such, the ISCC fulfils a distinct role and is not a replacement for established identifiers. Rather, it is designed as an umbrella standard to augment established identifiers with enhanced algorithmic features. It can be used in the metadata of existing standards or to support discoverability (reverse lookup). We will use it for precisely this application: whenever we receive content that is not identified by a DOI, ISNI, ISWC, ISRC, or other standard identifier, we will assign an OMO identifier and enhance it with the ISCC features. This will help later linking to the preferred global, persistent identifiers.

6.3.6 OMO Identifiers

We create our own identifiers for persons and things. We follow the practice of the Dutch national archives in the creation of PIDs, and we make them URIs following the W3C recommendation.

music.dataobservatory.eu/{type}/{concept}/{reference}

For {type} we utilise the following definitions:

  • {id}: an identifier {type} for dereferenced identifiers.

  • {doc}: a documentation {type} for the documentation of persons and objects.

  • {def}: a definition {type} for ontologies.

For {concept} we utilise three categories:

  • music.dataobservatory.eu/{id}/{person}/{reference} for persons, in order to synchronise with national and international name spaces.

  • music.dataobservatory.eu/{id}/{place}/{reference} for places, in order to synchronise with national and international name spaces.

  • music.dataobservatory.eu/{id}/{oc}/{reference} for other objects, such as a musical work, a sound recording, or a group of persons.
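The URI pattern above can be sketched as a small builder and parser. We assume here that the type and concept segments are the literal strings `id`, `doc`, `def` and `person`, `place`, `oc` (written without braces), that the `https` scheme is used, and that references are percent-encoded; none of these details is fixed by the text above, and the reference values are invented.

```python
from urllib.parse import quote, unquote, urlsplit

HOST = "music.dataobservatory.eu"
TYPES = {"id", "doc", "def"}            # identifier, documentation, definition
CONCEPTS = {"person", "place", "oc"}    # persons, places, other objects


def build_uri(type_: str, concept: str, reference: str) -> str:
    """Build a URI following {type}/{concept}/{reference}."""
    if type_ not in TYPES or concept not in CONCEPTS or not reference:
        raise ValueError("unknown type or concept, or empty reference")
    # Encode the reference so that it always stays a single path segment.
    return f"https://{HOST}/{type_}/{concept}/{quote(reference, safe='')}"


def parse_uri(uri: str) -> tuple:
    """Split a URI back into (type, concept, reference)."""
    parts = urlsplit(uri)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != HOST or len(segments) != 3:
        raise ValueError("not an identifier of this namespace")
    type_, concept, reference = segments
    if type_ not in TYPES or concept not in CONCEPTS:
        raise ValueError("unknown type or concept")
    return type_, concept, unquote(reference)


uri = build_uri("id", "person", "abc123")   # "abc123" is a made-up reference
print(uri)
print(parse_uri(uri))
```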