Open Metadata Handbook/Data Integration

This chapter is about fetching data, integrating it into a project, then presenting it.

= Retrieving data =

Data stores
Data and metadata can be found in a variety of sources. A "data store" is a generic term for an online resource storing and providing access to data. In a broad sense, this could be just about any server online, for example a web server that provides data represented as web pages. In the context of this handbook, we'll focus on data stores whose purpose is to allow free and open access, in reusable forms, to bibliographic metadata.

The Data Hub
The Data Hub acts as a hub for all kinds of data. Datasets can be filtered, e.g. to list only open data.

Major Bibliographic Catalogues

 * Europeana Catalogue: Linked Open Data released under CC0.
 * CERN: MARCXML released under the Open Data Commons Public Domain Dedication and License (PDDL).
 * British Library: Linked Open Data + MARC21 released under CC0.
 * German National Library: Linked Open Data released under CC0.
 * Bibliothèque Nationale de France: Linked Open Data released under the Licence Ouverte.

Crowd contributed data

 * Wikimedia: one major source of crowd-contributed knowledge is the Wikimedia galaxy of sites: Wikipedia, Wiktionary, Wikimedia Commons. This data is usually presented in a fairly unstructured form (we'll explain more about that later). Efforts have been made to turn this information into structured data and provide data stores:
   * DBpedia http://dbpedia.org/
   * Freebase http://freebase.org/
   * Yago http://www.mpi-inf.mpg.de/yago-naga/yago/
 * Content sharing websites: some websites not only host user-generated content such as pictures, videos or music, but also provide metadata and APIs to access this metadata and search for content:
   * Flickr http://www.flickr.com/ offers one of the richest and most accessible APIs on the social web.

Accessing data
A large part of the available open datasets are provided as downloadable files. This is the easiest way to retrieve data, as it only involves finding the right dataset and clicking to download it. But such downloads usually don't integrate well into automated processes, and often the only way to make sure the data is up to date is to check for updates manually.

Accessing data through APIs
"API" stands for "Application Programming Interface". As the name suggests, APIs allow for more complex interactions than downloads.

In most open knowledge APIs, the interface for accessing data is based on the HTTP protocol, the same protocol browsers use to access web pages, which guarantees easy access from almost any internet connection.

Just like when you open a web page, to request data from a web-based API you'll need to call a URL (Uniform Resource Locator): the address of that web page or, in the API case, of that endpoint (hence the use of the neutral term "resource" to designate both).

Most APIs follow the REST (REpresentational State Transfer) architecture, in which parameters (e.g. the name of a dataset, a specific range within a dataset) are passed within the URL. This makes APIs easy to test, as you can try them in your browser and see the results.

The world of APIs ranges from little more than parameterized downloads to full replicas of the functions of an online service (from user authentication to content creation), allowing custom clients to be built on top of these services.

An example
The endpoint for the Wikipedia API is http://en.wikipedia.org/w/api.php, which means that any URL starting this way will be handled by the API.

If you open Wikipedia's endpoint URL without any further parameters, you'll see a web page containing detailed information about the API syntax, i.e. how to build URLs to access data inside Wikipedia. Most APIs don't provide documentation through their endpoint, but will offer developer resources, such as the MediaWiki API page.

Adding parameters gives access to specific actions. For example, http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Mexico_City&prop=revisions&rvprop=content will return the contents of the latest revision of the Mexico City Wikipedia article, encapsulated in an XML document.

You can easily fiddle around and change the parameters: replace the titles=Mexico_City part with titles=London|Paris and you'll get the articles for both London and Paris; replace format=xml with format=json and you'll get a different encapsulation.
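The request above is easy to script. Here is a minimal sketch, using only the Python standard library, that assembles the same query URL parameter by parameter (the parameter values are exactly those from the example):

```python
from urllib.parse import urlencode

# Build the same request as in the example above, one parameter at a time.
# Any value can be changed: e.g. titles="London|Paris" fetches two articles.
def wikipedia_query_url(titles, fmt="json"):
    endpoint = "http://en.wikipedia.org/w/api.php"
    params = {
        "format": fmt,        # "xml" or "json" encapsulation
        "action": "query",    # the API action to perform
        "titles": titles,     # one title, or several separated by "|"
        "prop": "revisions",  # we want revision data...
        "rvprop": "content",  # ...specifically the revision content
    }
    return endpoint + "?" + urlencode(params)

url = wikipedia_query_url("Mexico_City", fmt="xml")
```

The resulting URL is equivalent to the one shown above; it can be pasted into a browser, or fetched with urllib.request.urlopen(url) and the response parsed with an XML or JSON library.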

API Tools

 * API consoles such as Apigee provide tools to build and test API requests.
 * JSON being one of the most popular formats for API data, we recommend the Firefox extension JSONView, which helps explore the nested structure of such data.

Automating data retrieval
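Any download or API call can be scripted so that a dataset stays current without manual checking. One common approach is an HTTP conditional GET: keep the Last-Modified value from the previous download and ask the server to resend the file only if it has changed since then. A minimal sketch, standard library only (the dataset URL is whatever downloadable file you are tracking):

```python
import urllib.request
import urllib.error

# Fetch a dataset only if it changed since our last download. Pass in the
# Last-Modified value returned by the previous call; the server answers
# 304 Not Modified when our copy is still current.
def fetch_if_modified(url, last_modified=None):
    request = urllib.request.Request(url)
    if last_modified:
        # Ask the server to skip the transfer if nothing changed.
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # our copy is still up to date
            return None, last_modified
        raise
```

Run on a schedule (e.g. from cron), this keeps a local copy fresh while transferring data only when the source actually changes. Servers that do not send Last-Modified can be handled the same way with ETag / If-None-Match headers.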
= Metadata Interoperability =

Different metadata schemes are used for different purposes. The same resource can thus be described by complementary or supplementary schemes according to the context. With so many metadata standards, it is sometimes difficult to understand how to achieve interoperability between one format and another. Metadata interoperability refers to the ability to exchange metadata with no or little loss of information. It allows for metadata to be transferred from and across different systems regardless of the distinctive characteristics of these systems. Given that interoperable metadata can be stored and processed on multiple systems, interoperability enables metadata concerning a particular resource to be accessed and understood by different systems, to be aggregated or integrated with metadata derived from different systems or concerning different resources.

The main benefits of interoperable metadata are the following:
 * easier importation of metadata from different systems through standardized tools.
 * transfer of different datasets amongst multiple systems.
 * the ability to search for metadata through multiple catalogs.

The best way to ensure maximum interoperability and high levels of consistency is for everyone to agree on the same schema, such as the MARC (Machine-Readable Cataloging) format or the Dublin Core (DC). However, a uniform standard approach is not always feasible or practical, particularly in heterogeneous environments where different resources are described by a variety of specialized schemas. Different institutions with different needs are likely to develop ad-hoc metadata formats, which are often not interoperable with each other because they do not comply with the same standards.

Specific tools must therefore be developed to allow interoperability between different formats. Various mechanisms can be used to assemble different datasets, even if that means drawing on components specified within different metadata standards. Data providers should however ensure that the result can be interpreted by independently designed applications. Interoperability can be achieved either before the metadata records are created (ex-ante interoperability) or after (ex-post interoperability).

This section is intended to illustrate how to merge, integrate or aggregate different metadata standards and ontologies.

Conversion
Often, a particular metadata schema was adopted and metadata records were created before the issue of interoperability was carefully considered. It is sometimes desirable to use domain-specific metadata standards in combination with each other, and converting metadata records then becomes one of the few options for integrating established metadata databases. Conversion is the traditional top-down approach to library data (i.e. producing MARC records as stand-alone descriptions for library material). Ad-hoc formats give data providers a simple exchange format to dump out their records, which can easily be extracted and aggregated together.

Pros:
 * Record-centric approach: the focus is on the records, which is what we want to get hold of and make openly accessible.
 * Low costs and easy implementation.

Cons:
 * Ex-post conversion according to the smallest common denominator, with a risk of lossy conversion: data gets lost when converting from a rich structure to a simpler structure (as opposed to being enriched).
 * The major challenge is how to minimize loss or distortion of data. Other complicated situations include converting value strings associated with elements that require the use of controlled vocabularies.

Examples:
 * The Library of Congress provides tools to convert between the MARC record and the MODS record, and between the DC record and the MODS record.
 * The Picture Australia project serves as a good example of data conversion. It is a digital library project encompassing a variety of institutions, including libraries, the National Archives, and the Australian War Memorial, many of which came with legacy metadata records prepared under different standards. Records from participants are collected in a central location (the National Library of Australia) and then translated into a "common record format," with fields based on the Dublin Core.
 * National Science Digital Library (NSDL) Metadata Repository where metadata records from various collections were harvested. For instance, ADL (Alexandria Digital Library) metadata records had to be converted into a Dublin Core record when these records were harvested by the NSDL Metadata Repository. When converting an ADL record into a DC-based record for display, value strings in the ADL elements are displayed in equivalent DC-elements.
 * BibJSON
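To make the conversion approach concrete, here is a toy sketch of a lossy conversion: a rich, MARC-like source record is flattened into a simpler Dublin Core style record. The source-side field tags are invented for the example; only the DC terms are standard:

```python
# A toy mapping table from MARC-like tags to Dublin Core elements.
# The tags are illustrative, not an authoritative MARC-to-DC crosswalk.
MARC_TO_DC = {
    "245a": "title",    # title statement -> dc:title
    "100a": "creator",  # main entry, personal name -> dc:creator
    "260c": "date",     # date of publication -> dc:date
}

def to_dublin_core(marc_like_record):
    dc = {}
    for tag, value in marc_like_record.items():
        if tag in MARC_TO_DC:
            dc[MARC_TO_DC[tag]] = value
        # Tags with no DC equivalent are silently dropped: this is where
        # the "lossy" in lossy conversion comes from.
    return dc

source = {"245a": "A Sample Title", "100a": "Doe, Jane",
          "260c": "1998", "300a": "210 p. ; 24 cm"}  # 300a: physical description
converted = to_dublin_core(source)
```

Note that the physical description (300a) has no place in the simple target record and disappears, illustrating the smallest-common-denominator problem described above.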

Mapping & Cross-walking
Mapping the elements, semantics, and syntax from one metadata scheme to another is usually done through a table that represents the semantic mapping of data elements in one standard (the source) to those in another (the target), based on the similarity of function or meaning of the elements. Mapping enables heterogeneous collections to be searched simultaneously with a single query, as if they were a single database (semantic interoperability).

Mapping from an element in one scheme to an analogous element in another scheme requires that the meaning and structure of the data be shareable between the two schemes, in order to ensure the usability of the converted metadata. Ad-hoc formats can very easily be mapped to and from most existing formats (though the conversion is often lossy). Examples:
 * Almost all schemas have created crosswalks to popular schemas such as DC, MARC, LOM, etc.
 * VRA Core 3.0, which lists mapped elements in target schemas VRA 2.0 (an earlier version), CDWA, and DC.
 * BibJSON offers a lightweight RDF/LD compatible format. Full mapping of BibJSON to RDF/LD can be done by other interested parties, not necessarily the initial data provider.
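In code, a crosswalk boils down to a lookup table from source elements to target elements, annotated with the degree of equivalence. A sketch with invented rows (not taken from any published crosswalk):

```python
# Each row: (source element, target element, degree of equivalence).
# A target of None records a one-to-none mapping: no equivalent exists.
CROSSWALK = [
    ("title",      "dc:title",   "exact"),         # one-to-one
    ("author",     "dc:creator", "exact"),
    ("keywords",   "dc:subject", "more generic"),  # meaning is broadened
    ("shelf_mark", None,         "one-to-none"),   # no DC equivalent
]

def crosswalk(record):
    """Apply the table to a source record, collecting unmapped fields."""
    target, unmapped = {}, []
    for source_el, target_el, _equiv in CROSSWALK:
        if source_el in record:
            if target_el is None:
                unmapped.append(source_el)
            else:
                target[target_el] = record[source_el]
    return target, unmapped

mapped, lost = crosswalk({"title": "Maps", "shelf_mark": "Q17.3"})
```

Keeping the equivalence degree in the table makes the quality problems discussed below explicit: a "more generic" mapping is a warning sign for lossy conversion, and the unmapped list shows exactly which data a round trip would drop.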

It is also possible to use a specific metadata schema (new or existing) as a "switching" schema to channel crosswalking among multiple schemas. Instead of mapping between every pair in the group, each of the individual metadata schemas is mapped to the switching schema only. Examples:
 * Getty's crosswalk in which seven schemas all crosswalk to CDWA

Problems:
 * While crosswalks have paved the way to relatively effective exchange and sharing of schemas and data, there is a further need for effective crosswalks to solve the everyday problem of ensuring consistency in large databases that are built of records from multiple sources.
 * One of the main problems of crosswalking is the different degrees of equivalency: one-to-one, one-to-many, many-to-one, and one-to-none. This means that when mapping individual elements, often there are no exact equivalents. Meanwhile, many elements are found to overlap in meaning and scope. For this reason, data conversion based on crosswalks could create quality problems.
 * Conversion between ad-hoc formats requires defining extensibility mechanisms and methods for vocabulary alignment (e.g. stating that your "title" is the same as dc:title or some other schema's title).

Examples:

 Ontology for Media Resources 

The Ontology for Media Resources 1.0 (http://www.w3.org/TR/mediaont-10/) is presently a "W3C Candidate Recommendation" (W3C = World Wide Web Consortium). It will evolve into a full "W3C Recommendation" as soon as the work on the corresponding API (see API for Media Resources 1.0, http://www.w3.org/TR/mediaont-api-1.0/), which provides uniform access to all of its elements, is completed. This Media Ontology is both i) a core vocabulary, i.e. a set of properties describing media resources, selected taking into account the metadata formats currently in use, and ii) a mapping between its set of properties and the elements of metadata formats presently published on the Web, e.g. Dublin Core, EXIF 2.2, IPTC, Media RSS, MPEG-7, QuickTime, XMP, YouTube, etc.

The purpose of the mapping is to provide an interoperable set of metadata, to be shared and reused among different applications. Ideally, the mapping should preserve the semantics of a metadata item across metadata formats. In reality, this cannot be done in general because of differences in the definition of the associated values: see, e.g., the property "dc:creator" from Dublin Core and the property "exif:Artist" defined in the Exchangeable Image File Format (EXIF), both mapped to the property "creator" in the Media Ontology. "Types" of mapping are therefore defined in the Ontology: "exact", "more specific", "more generic" and "related". Mechanisms for correcting the possible loss of semantics when mapping back and forth between properties from different schemata using only the Media Ontology are beyond the scope of the Media Ontology work. A Semantic Web compatible implementation of the Ontology in terms of the Semantic Web languages RDF and OWL is also available, presented in Section 7 of the http://www.w3.org/TR/mediaont-10/ document.

 BibJSON 

BibJSON is a convention for representing bibliographic metadata in JSON; it makes it easy to share and use bibliographic metadata online. JSON is a simple, useful and common way of representing data on the web that can be used to shift information around and between apps. BibJSON is designed to be simple and useful above all else: it has virtually no requirements, and you can use your own namespaces to extend it. Data in BibJSON (or converted to it) can very easily be displayed, searched, embedded, merged and shared on the internet. One can parse from X to BibJSON to Y; soon it will even be possible to perform translations via BibJSON. The parsers are accessible via an API call at http://bibsoup.net/parse
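Here is a sketch of what a minimal BibJSON-style record can look like in practice. BibJSON imposes virtually no required fields; the field names below follow common BibJSON usage, the DOI is invented, and x_shelf illustrates a provider-specific extension:

```python
import json

# A minimal BibJSON-style record as a plain Python dictionary.
record = {
    "title": "An Example Paper",
    "author": [{"name": "Doe, Jane"}],  # authors are objects, not bare strings
    "year": "2011",
    "identifier": [{"type": "doi", "id": "10.1234/example"}],  # hypothetical DOI
    "x_shelf": "open-access",  # provider-specific extension field
}

# Being plain JSON, the record round-trips trivially, which is what makes
# it easy to display, search, embed, merge and share.
serialized = json.dumps(record, indent=2)
restored = json.loads(serialized)
```

Because extension fields like x_shelf are just extra keys, consumers that do not understand them can simply ignore them, which is how BibJSON stays both minimal and extensible.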

Linked Data
Linked Data provides a mechanism to identify common concepts in different databases without being limited to the smallest common denominator. It does not involve any type of conversion, but rather creates a modular metadata environment where different types of metadata elements (descriptive, administrative, technical, use, and preservation) from different schemas, vocabularies, and applications can be combined in an interoperable way. The components of a metadata record can be regarded as pieces of a puzzle: they can be put together by combining pieces of metadata coming from different processes, and they can also be used and reused piece by piece when new records need to be generated. The Lego metaphor is useful to describe the process, whereby anyone is able to "snap together" selected "building blocks" drawn from the "kits" provided by different metadata standards, even if they were created independently.

As Linked Open Data (LOD) is gaining traction in the information world, Europeana has launched an animation to explain it and its benefits for users and data providers.

Examples:

METS: The Metadata Encoding and Transmission Standard (METS) provides a framework for incorporating various components from various sources under one structure and also makes it possible to "glue" the pieces together in a record. METS is a standard for packaging descriptive, administrative, and structural metadata into one XML document for interactions with digital repositories. The descriptive metadata section in a METS record may point to descriptive metadata external to the METS document such as a MARC record in an Online Public Access Catalog (OPAC) or an Encoded Archival Description (EAD) finding aid maintained on a WWW server. Or, it may contain internally embedded descriptive metadata. It can therefore provide a useful standard for the exchange of digital library objects between collections or repositories.

RDF: The Resource Description Framework (RDF) is another model that provides a mechanism for integrating multiple metadata schemes. Multiple namespaces may be defined to allow elements from different schemas to be combined into a single resource description. Different namespaces are defined by an URL that defines the metadata scheme used in order to describe a particular resources. A single RDF record can thus incorporate multiple resource descriptions - which may have been created at different times and for different purpose. RDF thus provides a framework within which independent communities can develop vocabularies that suit their specific needs and share vocabularies with other communities. Together with useful principles that come out of the semantic web community, it can contribute to improved interoperability and expansion of metadata - although proper vocabulary alignment require an accurate mapping through RDF ontologies.
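A small sketch of the namespace mechanism: the snippet below serializes one hypothetical book resource described with two independently designed vocabularies, Dublin Core (dc:) and FOAF (foaf:), side by side in Turtle syntax. The subject URI is invented; the two namespace URIs are the standard ones:

```python
# Namespace prefixes: each URI identifies the vocabulary an element comes from,
# so elements from different schemas can coexist in one description.
PREFIXES = {
    "dc":   "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def to_turtle(subject, statements):
    """Serialize (property, object) pairs about one subject as Turtle."""
    lines = [f"@prefix {p}: <{uri}> ." for p, uri in PREFIXES.items()]
    lines.append(f"<{subject}>")
    lines += [f"    {prop} {obj} ;" for prop, obj in statements[:-1]]
    last_prop, last_obj = statements[-1]
    lines.append(f"    {last_prop} {last_obj} .")
    return "\n".join(lines)

doc = to_turtle("http://example.org/book/1", [
    ("dc:title", '"A Sample Title"'),  # a descriptive statement
    ("dc:date", '"1998"'),
    ("foaf:maker", "<http://example.org/person/jane-doe>"),  # a link, not a string
])
```

In real projects a library such as rdflib would handle serialization, but the point stands: the two vocabularies never had to be merged or converted, only combined.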

Advantages:

Linked Data can be combined with specific protocols for linking data together, allowing for better:
 * Standardisation: Linked Data methods support the retrieval and re-mixing of data in a way that is consistent across all metadata providers.
 * Interoperability: Linked Data favors interdisciplinarity by enriching knowledge through linking among multiple domain-specific knowledge bases: i.e. the totality of datasets using RDF and URIs presents itself as a global information graph that users and applications can seamlessly browse.
 * Decentralization: With Linked Data, different kinds of data about the same asset can be produced in a decentralized way by different actors, then aggregated into a single graph. Resources can be described in collaboration with other GLAM institutions and linked to data contributed by other communities or even individuals.
 * Efficiency: GLAM institutions can create an open, global pool of shared data that can be used and re-used to describe resources. Linked Open Data enables institutions to concentrate their effort on their domain of local expertise, rather than having to re-create existing descriptions that have been already elaborated by others.
 * Resiliency: Linked Data is more durable and robust than metadata formats that depend on a particular data structure, because it describes the meaning of data ("semantics") separately from specific data structures ("syntax" or "formats").

Metadata Registries and Repositories
When multiple sources are searched through a single search engine, one of the major problems is that the retrieved results are rarely presented in a consistent, systematic, or reliable format. A metadata repository provides a viable solution to such interoperability problems by maintaining a consistent and reliable means of accessing data.

One question a repository faces is whether to allow each original metadata source to keep its own format. If not, how would it convert / integrate all metadata records into a standardized format? How would it support cross-collection search? Three common approaches are:

Common Format
The idea is to create a repository that stores metadata records in a simple and interoperable format, encouraging institutions to release their metadata in that format so as to reduce the need for conversion and de-duplication.

 Bibsoup / Bibserver 

The BibSoup approach encourages the contribution of Open bibliography without the overhead of de-duplication at contribution time. We expect that, as it grows, services will develop that help users and maintainers to manage the information. De-duplication into a central repository may be one solution (with the presumed platonic identity of STM bibliographic entries), but we also expect that software based on RDF will allow tools to manage alternative representations of bibliographic data, leaving the choice to the user as to what strategy they take. In short, current STM bibliography is a distributed mess. BibSoup takes this as a starting point and, where the political will and financial support is available, offers methods for tidying this up. BibSoup consists of a number of collections of bibliography (initially in STM areas) united by a common syntax. It is left to humans and machines to develop annotations and equalities between the components of these collections. Thus, for example, various records for “the same paper” may be found in arXiv, DBLP and possibly even Medline. The question of determining whether two records relate to “the same object” is difficult and controversial and BibSoup deliberately avoids this. It is just a collection of bibliographic records represented in BibJSON and made available to other people. It may be on one instance of BibServer, in a file, or all of these combined; it is just a matter of scope. For more details, see http://bibserver.org/about/bibsoup/ and http://bibserver.org

Cross-system search
One common way to increase interoperability between different metadata formats is to provide cross-system search (metasearch). Although metadata remains in a local repository, the local search system accepts queries originating from remote search systems.

 Z39.50 

The international standard Z39.50 is the best-known protocol for cross-system search. The protocol does not require sharing or duplicating metadata, but instead provides specific search capabilities mapped to a common set of search attributes that are understood through the Z39.50 protocol.

Examples:
 * Library of Congress, SRU: Search/Retrieve via URL website http://www.loc.gov/standards/sru/. A standard protocol for passing Z39.50-like search queries in a URL, utilizing a Common Query Language.
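An SRU request is just a URL with standardized parameters, so building one takes a few lines. A sketch using parameter names from the SRU standard; the endpoint below is hypothetical, since each SRU server publishes its own base URL:

```python
from urllib.parse import urlencode

# Build an SRU searchRetrieve URL. The query is expressed in CQL,
# e.g. dc.title = "open data".
def sru_url(base, cql_query, max_records=10):
    params = {
        "version": "1.1",                  # SRU protocol version
        "operation": "searchRetrieve",     # the standard search operation
        "query": cql_query,                # the CQL query itself
        "maximumRecords": str(max_records),
    }
    return base + "?" + urlencode(params)

url = sru_url("http://example.org/sru", 'dc.title = "open data"')
```

Because the whole request lives in the URL, an SRU search can be tested directly in a browser, just like the REST APIs discussed earlier.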

Metadata Harvesting Protocols
Another way to increase interoperability between different metadata formats is through the implementation of specific harvesting protocols, such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Systems compatible with these protocols can make metadata available to the public, for it to be used by external search services and/or included in federated databases.

 Open Archives Initiative (OAI) Protocol 

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol whose goal is to supply and promote an application-independent interoperability framework that can be used by a variety of communities engaged in publishing content on the Web. The Open Archives Initiative requires each metadata provider to translate their metadata into a common set of key elements that are exposed for harvesting. This metadata is then gathered into a central index with a consistent metadata format, in order to allow for cross-repository searching regardless of the native metadata formats used by the providers in their own repositories. More information at http://www.openarchives.org/. See also the Best Practices for OAI Data Provider Implementations and Shareable Metadata at http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/Main_Page (a joint initiative of the Digital Library Federation and the National Science Digital Library).

Examples:
 * The NSDL Metadata Repository employs an automated "ingestion" system based on OAI-PMH, whereby metadata flows into the repository with a minimum of ongoing human intervention. The NSDL, from this perspective, functions essentially as a metadata aggregator. The notion behind this process is that each metadata record contains a series of statements about a particular resource, and therefore metadata from different sources can be aggregated to build a more complete profile of that resource. As a result, several providers might contribute to an augmented metadata record. These enhancements are exposed via OAI-PMH, and the Metadata Repository can then harvest them.
 * The University of Michigan’s OAIster search service contains millions of records for digitized cultural heritage materials harvested from hundreds of collections via the OAI-PMH. See OAIster website at http://www.oaister.org/.
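The harvesting side of OAI-PMH can be sketched in a few lines: build a ListRecords request (verb and metadataPrefix are standard protocol parameters; the endpoint is hypothetical), then pull Dublin Core fields out of the XML response. The sample response below is trimmed down to the relevant structure:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Build a ListRecords request: ask the repository for all records,
# expressed in the common oai_dc (Dublin Core) metadata format.
def list_records_url(base, metadata_prefix="oai_dc"):
    return base + "?" + urlencode({"verb": "ListRecords",
                                   "metadataPrefix": metadata_prefix})

# A heavily trimmed example of what one harvested record can contain.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>A Harvested Title</dc:title>
    </oai_dc:dc>
  </metadata></record></ListRecords>
</OAI-PMH>"""

# Extract every dc:title from the response, namespace-qualified.
DC = "{http://purl.org/dc/elements/1.1/}"
titles = [el.text for el in ET.fromstring(SAMPLE).iter(DC + "title")]
```

An aggregator like the NSDL repository described above essentially repeats this loop over many providers, storing the extracted elements in one central index.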

 Multilingual Access to Subjects (MACS) 

The Multilingual Access to Subjects (MACS) project illustrates another value-based mapping approach to achieving interoperability among existing metadata databases. MACS is a European project designed to allow users to search across library cataloging databases of partner libraries in different languages, which currently include English, French, and German. Specifically, the project aims to provide multilingual subject access to library catalogs by establishing equivalence links among three lists of subject headings: SWD/RSWK (Schlagwortnormdatei / Regeln für den Schlagwortkatalog) for German, Rameau (Répertoire d'autorité-matière encyclopédique et alphabétique unifié) for French, and LCSH (Library of Congress Subject Headings) for English.