Annex 2 - Data Curators Manuals

Our primary services involve data collection, processing, and dissemination. These services will not produce a high-quality data resource without competent data curators.

Data curators, as professionals, are responsible for managing, maintaining, and enhancing the quality of an organisation’s data. Their work is instrumental in making data easily accessible, accurate, and relevant to the organisation’s needs. In large organisations, they collaborate closely with data engineers, analysts, scientists, and other stakeholders to establish a robust data ecosystem.

Note

To work with the systems of the Open Music Observatory, we created two manuals.

  • In March 2023 we created (and subsequently updated with user feedback) a Contributors Guideline for the internal stakeholders of the Open Music Europe (OpenMusE) project. This manual is available at contributors.dataobservatory.eu.

  • In 2024 we created a partly overlapping manual for the Observatory Stakeholder Network, i.e., for organisations that provide and exchange data with the Open Music Observatory. This manual is available at manual.opencollections.net.

Because the music sector is dominated by micro- and small-sized enterprises and institutions, it employs very few competent data curators or specialised data and knowledge engineers. Our approach to solving this problem is the following:

We provide small-group training and online manuals for the data curators responsible for maintaining the quality of data ingested by the Open Music Observatory. For large data providers, for example, collective management organisations or music information centres, we train one in-house data curator because, due to data confidentiality issues, often only an in-house person can review the totality of the data (and sort out the part that can be shared within the dataspace with other observatory stakeholders).

Inspiration

Very few music organisations employ data curators or employees with a library or information sciences background. We want to encourage music professionals working in various for-profit or social enterprises and research institutions to discover their “inner data curator.” We believe that a passion for music and the sector, together with deep knowledge of and experience in music, matters more than the technical skills needed to curate the data.

We are encouraging the members of the Observatory Stakeholder Network (See: 8.1.1 Observatory Stakeholder Network) to find professionals, researchers, or artists in their organisation who have deep subject-domain knowledge about the data we want to improve: they know a lot about organs in churches, about labels of a particular genre, about sync licensing to films, or about any other domain on which we collect data. Our ideal curators share a passion for data-driven evidence or visualisations, can learn the tools that Wikipedia editors use, and have a robust view of the data, however subjective, that informs them in their work.

Basic Data Organisation Concepts

We provide training and self-training material for two crucial but relatively simple concepts: tidy data, and text annotation and markup.

Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells. (From R for Data Science, Chapter 12: Tidy Data.)
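As a minimal sketch of the tidy data idea (the table, its column names, and its values are invented for illustration), an untidy “wide” table that hides the variable “year” in its column headers can be reshaped so that each variable is a column, each observation a row, and each value a cell:

```python
# An untidy "wide" table: yearly values are spread across columns,
# so the variable "year" is hidden in the column headers.
wide = [
    {"country": "SK", "2021": 120, "2022": 135},
    {"country": "HU", "2021": 98,  "2022": 104},
]

# Reshape into tidy "long" form: the variables (country, year, value)
# are columns, each observation is one row, and each value is one cell.
tidy = [
    {"country": row["country"], "year": int(year), "value": value}
    for row in wide
    for year, value in row.items()
    if year != "country"
]

for record in tidy:
    print(record)
```

The same reshaping is what spreadsheet users know as converting a cross-table into a long list of records.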

Our documentation system works with MediaWiki, the markup system developed for Wikipedia.

Dillinger is one of the best editors, and it is particularly suitable for first-time markup users, as you immediately get visual feedback on how you mark up your text.

Understanding the Data Model

Wikidata is a collaboratively edited, multilingual knowledge graph hosted by the Wikimedia Foundation. It is a common source of open data that Wikimedia projects, such as Wikipedia, and anyone else can use under the CC0 public domain licence. As of early 2023, Wikidata contained 1.54 billion item statements: small, verifiable statements about our world. It runs on Wikibase, the tool that we use for the data consolidation of the Open Music Observatory.

Wikidata is a document-oriented database, focusing on items, which represent any kind of topic, concept, or object. The item document for the late English writer and humorist Douglas Adams, for example, connects a great deal of knowledge about him.

Data curators are expected to understand the basics of the Wikibase Data Model and the idea of working with a document-oriented database. We can learn from many EU and member-state projects in this regard, because Wikibase is very often used for similar tasks. Originally designed with citizen scientists in mind, Wikibase allows music domain experts, such as musicologists, music economists, music librarians, and other non-technical stewards of data, to work efficiently with a data coordination system built on it.
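To give a feel for the document-oriented model, the following is a deliberately simplified sketch of an item document, loosely following the shape of Wikibase's JSON model (the real model has further levels: snaks, ranks, qualifiers, and references), using the well-known Douglas Adams item as an example:

```python
# A simplified sketch of a Wikibase item document. The real JSON data
# model is deeper (snaks, ranks, qualifiers, references), but the core
# idea is the same: one document collects labels, descriptions, and
# property-value claims about a single item.
douglas_adams = {
    "id": "Q42",  # the Wikidata item for Douglas Adams
    "labels": {"en": "Douglas Adams"},
    "descriptions": {"en": "English writer and humorist"},
    "claims": {
        "P31": ["Q5"],       # instance of: human
        "P106": ["Q36180"],  # occupation: writer
    },
}

print(douglas_adams["labels"]["en"])
```

A curator edits such a document claim by claim on the graphical interface; the document structure above is only what the software stores behind the scenes.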

Working with the GUI

Entering items works identically to Wikidata: you must fill out at least the item's main label and a description. We use English (en) as the master language for international cooperation.

Sandbox environment

Our manual is accompanied by a “sandbox” learning environment, where new data curators can try out various data manipulation actions (editing, deleting, uploading) without endangering their data or the shared database.

Mass importing

Mass importing data requires solid technical skills because, in almost all cases, the data arrives from a very differently structured source: spreadsheets or a relational database management system. These tasks are performed by Reprexbase, the software components developed by Reprex to connect music industry sources to the Wikibase Data Model. The guidelines provide information on how to prepare the data, or what information should be given to us about the original schemas, to start the mass import.
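The core of such an import is a mapping from source columns to Wikibase properties. The sketch below is purely illustrative (the column names, the mapping, and the flat statement records are invented; the actual mapping is agreed with each data provider from their original schema):

```python
import csv
import io

# Hypothetical column-to-property mapping, agreed with the data
# provider from their original schema before the mass import starts.
MAPPING = {"occupation": "P106", "citizenship": "P27"}

# A stand-in for a spreadsheet export (Ján Cikker is a Slovak composer;
# Q36834 = composer, Q214 = Slovakia).
csv_data = io.StringIO(
    "label,occupation,citizenship\n"
    "Ján Cikker,Q36834,Q214\n"
)

# Turn each spreadsheet row into flat statement records that a loader
# could translate into Wikibase claims.
statements = []
for row in csv.DictReader(csv_data):
    for column, pid in MAPPING.items():
        statements.append(
            {"subject_label": row["label"], "property": pid, "value": row[column]}
        )

print(statements)
```

In practice the loader must also match the subject label against existing items to avoid duplicates, which is one reason mass importing stays a specialist task.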

Data enrichment

Data enrichment is carried out with software components created by Reprex. It took about ten months to obtain enough data clearances in our data-sharing space to accumulate sufficient data for training enrichment algorithms. Results will be reported later in other tasks.

Quality Testing with SPARQL

SPARQL, pronounced ‘sparkle’, is the standard query language and protocol for Linked Open Data on the web and for RDF triplestores. Designed to query a great variety of data formats and sources, it can efficiently extract information hidden in non-uniform data. The SPARQL standard is designed and endorsed by the World Wide Web Consortium and helps users and developers focus on what they would like to know instead of how a database is organised.

Our data curators must be able to run SPARQL queries and make elementary modifications to them. Because we often import very large datasets, it would be very difficult to manually check every record on the graphical user interface. Instead, we use pre-written SPARQL queries, which the data curator runs via a simple URL link (perhaps modifying a class's QID or a property's PID), and which serve as so-called unit tests. These queries, programmed by Reprex, allow simple tests like this one:

# Composers: citizens of Slovakia

SELECT ?item ?itemLabel ?givenNameLabel ?familyNameLabel ?birthdate ?deathdate ?nationalityLabel ?itemDescription WHERE {
    ?item wdt:P31 wd:Q5 .                 # instance of human
    ?item wdt:P106/wdt:P279* wd:Q36834 .  # occupation is composer, or a subclass of it
    ?item wdt:P27 wd:Q214 .               # country of citizenship is Slovakia
    OPTIONAL { ?item wdt:P735 ?givenName . }   # given name
    OPTIONAL { ?item wdt:P734 ?familyName . }  # family name
    OPTIONAL { ?item wdt:P569 ?birthdate . }   # date of birth
    OPTIONAL { ?item wdt:P570 ?deathdate . }   # date of death
    OPTIONAL { ?item wdt:P27 ?nationality . }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,sk,de,hu" . }
}
ORDER BY ?itemLabel

Try it out↗
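The parameterisation described above, where a curator only swaps a QID or PID in an otherwise pre-written query, can be sketched as follows (the template, its placeholders, and the fragment-style URL are illustrative; query.wikidata.org is the public Wikidata Query Service, used here as a stand-in endpoint):

```python
from string import Template
from urllib.parse import quote

# Hypothetical pre-written query template: the curator only swaps the
# occupation QID or the country QID; the query body stays fixed.
QUERY_TEMPLATE = Template("""\
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 .
  ?item wdt:P106/wdt:P279* wd:$occupation .
  ?item wdt:P27 wd:$country .
}""")

# Q36834 = composer, Q214 = Slovakia.
query = QUERY_TEMPLATE.substitute(occupation="Q36834", country="Q214")

# A shareable URL link that opens the query in the query service UI,
# so a curator only needs to click it to run the "unit test".
url = "https://query.wikidata.org/#" + quote(query)
print(url)
```

Changing `occupation="Q36834"` to another class QID produces the same unit test for a different profession, which is exactly the level of modification we expect from curators.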