WORK PACKAGES UPDATE (01/06/2014)

Work Package 1

The consensus process has resulted in a stricter separation between the absolute core minimum requirements for the publication of FAIR data and the actual publishing, finding, interoperability, reuse and citation of those data.

The entire data stewardship cycle is of interest to FAIRport partners, but most of the 'application' elements are already covered by a growing list of FAIRport Enablers, such as data publishing infrastructures, semantically smart search engines, analytics environments for data, interoperability formats and citation principles.

In the update of WP2 we describe the developments in this 'Enabler' area, while WP1 will focus more and more on the absolute core of data interoperability. Again, even in this case we will work preferentially with globally emerging community efforts as Enablers.

In this realm we have made enormous and unforeseen progress by connecting to the central leadership of the Research Data Alliance (RDA). In the core working groups of this rapidly growing Alliance, much work has already been done to define the minimal needs for data interoperability. A Concept Note (version 7, annotated by Barend Mons; comments are welcome) has already been produced, featuring the exact same hourglass metaphor that emerged from the FAIRport Lorentz workshop.

In fact, in that document the minimal requirements have been stripped down to two essentials: a Persistent Identifier (PID) for each concept used in any published data set, and 'rich enough' provenance in the form of classical metadata or any other, richer, computer-readable format. A key group of thought leaders has already established an organization named the Digital Object Naming Authority (DONA), founded under Swiss law in Geneva in April 2014. In addition, FAIRport has now been formally accepted by the Heads of Nodes meeting of ELIXIR.

So, from the perspective of the FAIRport core groups, we will still work mainly on Life Science use cases to demonstrate the added value of FAIR data publishing, but the WP1 leaders Bengt Persson and Barend Mons propose to work very closely with RDA and DONA to prepare a proposal to the Global Research Council in 2015 to formally and sustainably support a ring of 'DONAs', ensuring sustainability through replication of a reliable service that assigns PIDs to each concept used in data publishing.

This system would allow maximum degrees of freedom, even at the level of the choice of symbols (terms and URIs/URLs) used to refer to concepts in, for instance, RDF, as long as each symbol is unambiguous and/or properly (and publicly) mapped to the PID. We propose that the attachment of sufficiently rich metadata (each element again using PIDs for concepts) and provenance will be a recommendation of such DONAs, but they will have no 'authority' over how richly data publishers comply with these guidelines.
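The idea of many symbols publicly mapped to one PID per concept can be sketched as follows. This is an illustrative toy, not a DONA service: the symbols, PID values and function names are all assumptions.

```python
# Hypothetical sketch: a public mapping from local symbols (terms, URIs/URLs)
# to persistent identifiers (PIDs). All symbols and PID values are invented
# for illustration only.

# Any symbol is acceptable as long as it is unambiguous and publicly mapped
# to exactly one PID for the underlying concept.
SYMBOL_TO_PID = {
    "http://example.org/terms/insulin": "PID:0001",
    "INS_HUMAN": "PID:0001",            # two symbols, one concept -> same PID
    "http://example.org/terms/glucose": "PID:0002",
}

def resolve(symbol: str) -> str:
    """Return the PID for a symbol, or raise if the symbol is unmapped."""
    try:
        return SYMBOL_TO_PID[symbol]
    except KeyError:
        raise ValueError(f"Symbol {symbol!r} is not publicly mapped to a PID")

def same_concept(a: str, b: str) -> bool:
    """Two symbols refer to the same concept iff they resolve to the same PID."""
    return resolve(a) == resolve(b)
```

The design point is that freedom of symbol choice costs nothing as long as the mapping to the PID is public and unambiguous.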

In this way, data publishers are also given maximum freedom to publish their data in various incremental steps. Data publishing could now range from the simple provision of a 'black box container' (data not computer readable) with a DOI (PID) for the entire set and minimal (computer-findable and interoperable) metadata ('1 star') to fully open and semantically analyzable data sets where each and every concept and relationship in the data set is machine readable ('5 stars'). Enablers in FAIRport can provide PIDs and many symbols for certain 'semantic types', not only for 'proteins' (e.g. SwissProt) or 'chemicals' (e.g. RSC) but also for organizations (e.g. VIVO) and people (e.g. ORCID), mainly used in typical metadata elements.
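The incremental steps from '1 star' to '5 stars' could be read as cumulative criteria, roughly as in the sketch below. The exact criteria and field names are a hypothetical interpretation of the spectrum described above, not a formal FAIRport specification.

```python
def star_rating(dataset: dict) -> int:
    """Assign an illustrative 1-5 star rating to a published dataset.

    The criteria below are a hypothetical reading of the incremental
    publishing steps in the text; the field names are invented.
    """
    stars = 0
    if dataset.get("pid"):                              # DOI/PID for the whole set
        stars = 1
    if stars >= 1 and dataset.get("metadata_machine_findable"):
        stars = 2                                       # minimal interoperable metadata
    if stars >= 2 and dataset.get("concepts_have_pids"):
        stars = 3                                       # concepts resolved to PIDs
    if stars >= 3 and dataset.get("open_access"):
        stars = 4                                       # fully open
    if stars >= 4 and dataset.get("relationships_machine_readable"):
        stars = 5                                       # every relationship machine readable
    return stars
```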

In the coming months we will meet regularly with the relevant RDA groups and the DONA founders to explore the feasibility of this approach and to potentially write a common consensus white paper on the minimal requirements for FAIR data publishing, to be jointly presented in the first keynote of the 4th RDA plenary in September, in preparation for a formal presentation to the GRC in Japan in mid-2015.

Work Package 2

A major step forward has been the strong collaboration with the various working groups in the FORCE11 context (notably the dissemination and implementation groups of the Joint Data Citation Principles (JDCP)), where Tim Clark, co-leader of WP2, has played a crucial role. As of June 1st, 66 organizations and 152 individuals have endorsed the JDCP. As decided before, this endorsement is seen as a first step for any 'Enabler' engaging with the FAIRport initiative.

In the list of organizations signed up so far, we have categorized the 'Enablers', for the purposes of FAIRport interest areas, into data producers/providers; formal data and information publishers; information aggregators/identifier providers; terminology system providers; funding agencies and other governmental organizations; and international research infrastructures such as ELIXIR.

A milestone is that the Heads of Nodes of ELIXIR have endorsed the proposal by Niklas Blomberg, the director of ELIXIR, to formally sign the JDCP, which brings in the entire network of Nodes.

From a FAIRport perspective we will now concentrate on the continued engagement of more Enablers, such as ORCID and VIVO, but also funders of major data-producing research consortia and infrastructures, such as IMI, and finally major research funders such as H2020, NIH and NSF. The ultimate aim is to gain enough momentum and critical mass to convince the GRC in 2015 that this community is the most solid embedding for a global policy and sustainable funding for the 'DONA ring' as the 'heart of FAIRports'.

Work Package 3

Through a series of bi-weekly teleconferences, and an inaugural Hackathon, WP3 participants have begun to build a prototype core technology layer upon which the richer FAIRport layers and behaviors can be constructed. In reaching our initial conclusions about what was needed at the "narrow waist" of the hourglass, we examined a variety of other similar projects (e.g. Dataverse, RightField, ISAtab) as well as publications and standards that could be used to unify these existing and successful projects.

Our plan now is to, as quickly as possible, "build something that works" - a strawman that we can test against a variety of use cases to determine whether we have correctly interpreted the requirements. Having a "working thing" will also help to clarify (not only for third parties, but also for the WP3 group members) exactly what it is we are trying to achieve, and will help focus our efforts for the next Hackathon, which we hope will take place in mid-June.

The specific technology approach we are currently pursuing involves creating a "Profile" of a metadata schema. The Profile (a concept introduced by the W3C's DCAT working group, albeit without any guidance as to what a Profile should look like or how it should behave) is a meta-metadata template: for any given organization/repository, it specifies which fields should appear in the metadata-capture interface and what the constraints on those fields are (their potential values and their required/optional status, for example). These templates can then be used to automatically generate a metadata-capture interface reflecting the requirements of that organization. For example, a Profile could be interpreted to generate the XML files within an ISATab Configuration, to generate the rules behind a RightField template, or to generate a novel Web interface. The objective is reuse of metadata Profiles in a generic, rather than project-specific, manner, to improve harmonization between different projects/repositories.
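A Profile driving interface generation can be sketched as a small data structure plus one interpreter. This is a minimal illustration only: the field names, example values and the plain-text "form" renderer are assumptions, standing in for real targets such as an ISATab Configuration or a Web interface.

```python
# Hypothetical sketch of a Profile as a meta-metadata template. A facet has a
# name, a required/optional flag, and (optionally) a closed list of values.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Facet:
    name: str
    required: bool = True
    allowed_values: Optional[list] = None   # None means free text

@dataclass
class Profile:
    name: str
    facets: list = field(default_factory=list)

def render_form(profile: Profile) -> str:
    """Interpret a Profile into a plain-text capture form; a real generator
    could instead emit ISATab Configuration XML or RightField rules."""
    lines = [f"== {profile.name} =="]
    for f in profile.facets:
        flag = "required" if f.required else "optional"
        values = f" (one of: {', '.join(f.allowed_values)})" if f.allowed_values else ""
        lines.append(f"{f.name} [{flag}]{values}: ____")
    return "\n".join(lines)

# An invented repository Profile, for illustration
repo_profile = Profile("Example Repository", [
    Facet("organism", allowed_values=["Homo sapiens", "Mus musculus"]),
    Facet("contact_orcid", required=False),
])
```

The key design choice is that the Profile itself is inert data; each target system supplies its own interpreter, which is what makes the Profile reusable across projects.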

So far we have constructed a minimal schema, in RDFS, that can be used to encode Profiles. For the constraints on values within a metadata facet, we propose to use the ontology-to-SKOS mapping described by Jupp et al. in their recent "a view on bio-ontologies" paper, since this seems to be a lightweight approach to harmonizing over various ontology formats and various ontological "rigors" (from vocabularies to full OWL-DL), by removing the logic and leaving a standardized "view".
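The shape of such an encoding can be illustrated with a handful of triples: a facet is an RDF property whose range is a SKOS concept scheme, and the allowed values for that facet are the concepts in the scheme. The `ex:` namespace and all term names below are assumptions, written as plain prefixed-name tuples rather than our actual RDFS schema.

```python
# Hypothetical sketch: a minimal RDFS encoding of one Profile facet, written
# as plain subject/predicate/object triples (dependency-free). The ex: names
# are invented for illustration.
triples = {
    # The Profile class and one metadata facet
    ("ex:Profile",        "rdf:type",       "rdfs:Class"),
    ("ex:organismFacet",  "rdf:type",       "rdf:Property"),
    ("ex:organismFacet",  "rdfs:label",     '"organism"'),
    # Allowed values come from a SKOS concept scheme (the ontology->SKOS view)
    ("ex:OrganismScheme", "rdf:type",       "skos:ConceptScheme"),
    ("ex:human",          "rdf:type",       "skos:Concept"),
    ("ex:human",          "skos:prefLabel", '"Homo sapiens"'),
    ("ex:human",          "skos:inScheme",  "ex:OrganismScheme"),
    ("ex:organismFacet",  "rdfs:range",     "ex:OrganismScheme"),
}

def objects(subject: str, predicate: str) -> set:
    """Query helper: all objects for a given subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def allowed_values(facet: str) -> set:
    """Concepts permitted for a facet: members of its skos:ConceptScheme range."""
    schemes = objects(facet, "rdfs:range")
    return {s for s, p, o in triples if p == "skos:inScheme" and o in schemes}
```

Because the constraint is just a SKOS scheme, the same facet works whether the values originated in a flat vocabulary or were "viewed" down from a full OWL-DL ontology.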

It is our intention that these Profiles should be modular and reusable. One should be able to combine a series of existing Profiles - possibly created by others - into a novel Profile specific to one's own resource, while sharing the various metadata facets used by others.
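Combining Profiles can be sketched as a merge that preserves shared facet definitions and rejects conflicts. Modelling a Profile here as a plain dict of facet name to constraint is an assumption for illustration, as are all the names.

```python
# Hypothetical sketch of Profile composition. A Profile is modelled as
# {"name": ..., "facets": {facet_name: constraint}}; all names are invented.
def combine(name: str, *profiles: dict) -> dict:
    """Merge existing Profiles into a new one, refusing conflicting facets."""
    merged = {}
    for profile in profiles:
        for facet, constraint in profile["facets"].items():
            if facet in merged and merged[facet] != constraint:
                raise ValueError(f"Conflicting definitions for facet {facet!r}")
            merged[facet] = constraint   # shared facets keep one definition
    return {"name": name, "facets": merged}

# Two Profiles possibly created by others...
core = {"name": "core", "facets": {"title": "free text", "creator": "ORCID iD"}}
bio  = {"name": "bio",  "facets": {"organism": "NCBI taxon", "creator": "ORCID iD"}}

# ...combined into a Profile specific to one's own resource
my_profile = combine("my-repository", core, bio)
```

Because `creator` is defined identically in both inputs, it appears once in the result; two incompatible definitions of the same facet would be rejected rather than silently overwritten.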

Indexing over these Profiles should allow search tools to predict/understand the metadata they encounter around a given digital object, and know how/if it can be integrated with metadata from other such objects.
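Such an index could work along the following lines: map each digital object to the facets its Profile declares, then treat shared facets as the points where two objects' metadata can be joined. The identifiers and facet sets below are invented for illustration.

```python
# Hypothetical sketch: an index from digital objects to the metadata facets
# their Profiles declare. All identifiers and facets are invented.
INDEX = {
    "example-dataset-A": {"title", "creator", "organism"},
    "example-dataset-B": {"title", "creator", "assay"},
}

def expected_fields(obj: str) -> set:
    """Predict the metadata a search tool will encounter around an object."""
    return INDEX.get(obj, set())

def integrable_on(a: str, b: str) -> set:
    """Facets on which two objects' metadata can be integrated, if any."""
    return expected_fields(a) & expected_fields(b)
```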