The Linked Data Service of the Federal Spatial Data Infrastructure
In this article we provide some insights and lessons learned acquired with the realization and the maintenance of the Linked Data Service of the Federal Spatial Data Infrastructure (FSDI) geo.admin.ch. We also try to identify pitfalls and areas where engagement and investments are needed in order to facilitate unlocking the potential of the Geographic Information and bringing it on the Web.
Geographic Information and the Web
The Geographic Information (GI) and the Web appear to be good friends: GI is pervasive on the Web and Web technologies are used by the GI community to provide access to geodata. In August 1991 Tim Berners-Lee announced the WWW Project on the newsgroup alt-hypertext; the Xeroc Parc Map Viewer was one of the first Web applications and went online in 1993. Since then the GI community has evolved around the concept of spatial data infrastructures (SDIs) with the main idea to use the Web in order to connect isolated GI Systems and exchange GI. The implementation of SDIs has been supported by an international, top down and highly centralized standardization effort driven by organizations like the Open Geospatial Consortium (OGC) and ISO Technical Committee 211.
The Web has instead moved with a complete different approach, more agile, decentralized and bottom up. Just consider, for example, that the GeoJSON (a geospatial data interchange format based on JSON) and the JSON-Schema (the vocabulary to annotate JSON documents) specifications sum up to some fifty pages, while the OGC standard GML (an xml-based modeling language and interchange format) is about four hundred pages.
At the end of the story, the Web is today mostly agnostic about SDIs. The discovery of information resources in SDIs is delegated to catalog services providing access to metadata and users generally cannot simply follow links to access data. OGC Web services do not address indexing of the resources by general purposes search engines.
To address this and other related issues, OCG and the W3C have teamed up to advise on best practices for the publication of spatial data on the Web based on Linked Data.
The Linked Data Service of geo.admin.ch
With these considerations in mind, we at swisstopo started to consider the possibility to publish geodata as Linked Data in 2016. We launched a project with the objective to publish a selection of geodata: the result is the Linked Data Service of geo.admin.ch, the Federal Geoportal. The service is operational since March 2017 and we publish so far two main datasets: swissBOUNDARIES3D (administrative units versioned since 2016) of swisstopo and the Public Transport Stops of the Federal Office of Transport.
We use a standard process for the publication of data according to the Linked Data principles, implying the serialization of the data in RDF (here we use the GeoSPARQL standard), the storage of the triples in a triple store and the setup of a Linked Data Frontend.
We use the Virtuoso Open Source Edition as triple store and the open source product Trifid as Linked Data frontend for URI dereferencing and content negotiation.
Both software components are dockerized: we use Rancher as composer.
Providing geodata as Linked Data results in a mutual exchange of benefits between the GI community and the Linked Data / Web community. The GI community brings new ways of querying data: the GeoSPARQL standard is in fact not just a vocabulary for encoding geodata in RDF, since it provides above all extensions to the SPARQL language, enabling to query resources based on their spatial relations. Queries like “Find all resources of type A (with a point geometry) within the resource of type B (with a polygon geometry)” or “Find all resources of type A within a distance of X kilometers from point C” are simply not possible with standard SPARQL.
On the other hand the Linked Data / RDF approach to data modeling can ease the burden of communicating data semantics. Data modeling within the GI domain is based on the “model driven approach”, where data semantics resides in (complex) conceptual and database schemas and has been traditionally very hard to share.
Linked Data users do not need to understand / know all this complexity to adequately use the data, since data semantics is in the data itself. Here not only objects (resources) are on the Web but also objects properties (what we call attributes in the GI domain and predicates in RDF) are first class objects and are on the Web: objects types and predicates are described via web-accessible open agile vocabulary definitions and this definitions are reusable.
The bad news is that Virtuoso Open Source does so far not support GeoSPARQL (there are plans to support it according to this tweet). It actually supports spatial queries but these are based on built-in spatial functions instead of using those defined by the GeoSPARQL standard. Within the open source community we don’t see so far a valuable alternative to Virtuoso, on the other hand the interest about GeoSPARQL is growing more and more and commercial solutions exist, that start to support it. GraphDB for example has a good support and so should the new Stardog version (to be tested).
We argue that the standard approach to Linked Data publication built around a triple store is not very adequate when:
- A data publication pipeline is already in place, as is the case for the FSDI. Here one has to deal with data redundancy and synchronicity;
- Data to be published is somehow massive and dynamic, meaning there is a high update frequency (daily, hourly).
To address these issues we are investigating alternative solutions based on virtual graphs. The idea here is to have a proxy on top of the main data storage and let the proxy provide RDF serialization at runtime.
Again the problem is that open source solutions are very few: D2RQ, a reference system for accessing relational databases as virtual RDF graphs, is a very old technology and does not seem adequate for a production environment; ontop seems a promising technology but does not support so far GeoSPARQL (there are plans to support it according to this post).
One more issue about usage: we are monitoring the service and the statistics do not show so far big numbers. We will have to work in this sense maybe by supporting showcases and make developers aware of the power of the Linked Data approach: one simple data model (RDF) with enough semantics enabling easy understanding of the data and a unique interface (SPARQL).
When we started our project, one of the objectives was to try to improve the indexing of our data by the main search engines: Linked Data is an easy and valuable way to bring geodata to the Web and to improve geodata visibility and cross-domain interoperability.
We were aware that we had to work with the schema.org vocabulary and we spent some time in trying to debug our RDF serialization with the Google structure data tools. We eventually had to realize that Google does not really follow the “open world assumption” to Linked Data publication: the structured data tool have an issue with mixed vocabularies, simply said they only recognize schema.org definitions, while definitions from other vocabularies produce errors.
Now schema.org is simply not sufficient to describe geodata. Let us make a simple example: to tag a resource of type “administrative units” one can use the schema.org type “AdministrativeArea” but there is no possibility to specify the level of the administrative unit (is it a Municipality or a Canton?). Schema.org fails on one of the main Linked Data principles: reuse. It just does not care about existing standards and vocabularies.
Amazon has also recently launched its “Registry of Open Data on AWS” for discovery and sharing of datasets available as AWS resources.
The Frictionless Data initiative from Open Knowledge International is yet a further approach to data publication and sharing.
In the quest for unlocking the potential of (geo)data, data providers have to navigate on sight in a sea full of alternative and competing «same-but-different» solutions.
We argue that the Linked Data approach is the one with the higher, yet unexplored potential.