The next steps for Open Data Portals? Data recipes!

By pietercolpaert

March 21, 2019

My dataset is on the data portal, so why isn’t it added to every route planner now?

A city official.

We have been building Open Data portals and Open Data standards (see DCAT) for a while now. Yet, judging from the state of the art, still only humans can understand what is in an Open Data portal. We need better metadata so that machines can make sense of the big pile of data gathered on an Open Data portal. I believe the next challenges for Open Data portals are two-fold: (i) making sure industry players adopt “data recipes” (discovery algorithms) for finding datasets for a specific feature; and (ii) adding better metadata to existing datasets. I believe the latter can be achieved by innovating the user interfaces for adding metadata to a dataset.

Stijn works for the city of Antwerp as a mobility specialist. The problem he experiences is a textbook Open Data challenge:

How do I get a dataset about a new local policy adopted in third party end-user interfaces?

It is not an act of philanthropy that leads him to publish this data: his data must be reused for his city to function properly. This stresses the importance of, on the one hand, intelligent bots that can integrate a dataset automatically, and, on the other hand, metadata of a high enough quality to assist machines looking for data.

From data catalogs to data recipes

Looking for a dataset is still a manual process. I personally had to ask Stijn, who knew where the dataset could be retrieved, whether an open dataset about this already existed. It turned out to be published in their geospatial catalog at portaal-stadantwerpen.opendata.arcgis.com (the metadata is, so far, not integrated into the main Open Data portal). This only left me to wonder: if I cannot find this dataset manually, how would a script from Google, TomTom or HERE be able to discover it?

Instead of having human-oriented full-text search forms in Open Data portals, we need to think about data recipes. These are flowcharts or algorithms that a robot can execute in order to automatically discover certain datasets. Such a recipe could look like this (a code sketch follows the list):

  1. Request opendata.antwerpen.be (the catalog is probably described in DCAT)
  2. Study all possible next steps offered by the Open Data portal. For example, these “flow chart” blocks could be offered:
    • zoom in on a specific geographic region,
    • follow links to an overview of the most recently added datasets,
    • read about the latest local council decisions (this is the first digital source that may trigger a change in route planning advice),
    • or follow links to datasets in certain themes such as “traffic rules”.
  3. Follow the right links until a dataset of interest is found.
  4. In the case of the Low Emission Zone (LEZ), it is a boundary shape. In the future, we should make sure a robot can detect that “if you are inside this shape, some extra rules apply”. This way, any next set of extra rules, not only the LEZ, can be adopted automatically.
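
To make this concrete, here is a minimal Python sketch of a client executing such a recipe. It assumes the portal serves its catalog as DCAT in JSON-LD; the catalog URL, the payload structure and the theme keyword are hypothetical, chosen only to illustrate the steps above.

```python
# A minimal sketch of a client executing the recipe above, assuming the
# portal serves its catalog as DCAT in JSON-LD. The catalog URL, the
# payload structure and the theme keyword are hypothetical.
import requests
from shapely.geometry import Point, shape

CATALOG_URL = "https://opendata.antwerpen.be/catalog.jsonld"  # hypothetical endpoint


def find_datasets_by_theme(catalog: dict, keyword: str) -> list:
    """Steps 2-3: follow the 'theme' building block to datasets of interest."""
    matches = []
    for dataset in catalog.get("dcat:dataset", []):
        themes = dataset.get("dcat:theme", [])
        if any(keyword in str(theme).lower() for theme in themes):
            matches.append(dataset)
    return matches


def extra_rules_apply(boundary_geojson: dict, lon: float, lat: float) -> bool:
    """Step 4: 'if you are inside this shape, some extra rules apply'."""
    return shape(boundary_geojson).contains(Point(lon, lat))


catalog = requests.get(CATALOG_URL, headers={"Accept": "application/ld+json"}).json()
for dataset in find_datasets_by_theme(catalog, "traffic"):
    print(dataset.get("dct:title"))
```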

Every step in this recipe is close to how a human would discover a dataset, yet it can be optimized for machines. This requires new alignments with data reusers. On the one hand, companies such as TomTom, Google and HERE need to document what steps they take to understand data. In some way, Google already does this with the structured data testing tool, or in the paper by Dan Brickley, Matthew Burgess and Natasha Noy from Google AI on creating a public dataset search engine. This way, when you want to publish a new dataset, you can try to make it work with how machines already interpret your data.
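
Google’s dataset search, for instance, crawls schema.org/Dataset markup embedded in dataset landing pages. As a minimal sketch, the Python snippet below builds such a description for the LEZ dataset; all the concrete values (name, license, download URL) are placeholders, not the real metadata.

```python
import json

# Illustrative schema.org/Dataset description, the kind of markup a
# dataset search engine crawls. All values below are placeholders.
lez_dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Low Emission Zone boundary, Antwerp",
    "description": "Polygon in which low-emission traffic rules apply.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "application/geo+json",
        "contentUrl": "https://example.org/lez-boundary.geojson",  # placeholder
    },
}

# Embed this inside a <script type="application/ld+json"> tag on the dataset page.
print(json.dumps(lez_dataset, indent=2))
```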

On the other hand, you will have intended your data to be visited in certain ways. Document the building blocks that you expose on your website. At Informatie Vlaanderen, we took the first steps towards this by creating a working group on a Generic Hypermedia API across Flanders.

Authoring environments for metadata

Problematic today is the assumption that an Open Data portal has to be delivered by one company only. A machine, however, does not care about back-end systems: links are followed seamlessly, regardless of the services behind them. The main page of opendata.antwerpen.be could be generated by a Drupal system, the links to the geospatial search could point to an ArcGIS system, while other links could be served by CKAN instances, The DataTank, OpenDataSoft, an IoT data broker, and so forth. The important task for the people in charge of the city’s Open Data portal is, however, to document the building blocks that are needed on every level, and to expose these in machine-readable hypermedia controls. It is up to the company to make sure that these building blocks, which a client can use in a data recipe, are fully functional.
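
What such a machine-readable hypermedia control could look like is sketched below, using the Hydra core vocabulary’s IRI templates to advertise the “zoom in on a specific geographic region” building block from the recipe. This is an illustration, not what the portal exposes today: the template URL and the `bbox` variable are made up.

```python
from uritemplate import expand  # pip install uritemplate

# Sketch of a hypermedia control a portal entry point could embed (e.g. as
# JSON-LD), advertising a geospatial search building block with a Hydra IRI
# template. The template URL and the variable name are made up.
geo_search_control = {
    "@context": {"hydra": "http://www.w3.org/ns/hydra/core#"},
    "@id": "https://opendata.antwerpen.be/",
    "hydra:search": {
        "@type": "hydra:IriTemplate",
        "hydra:template": "https://opendata.antwerpen.be/geo{?bbox}",
        "hydra:mapping": [{
            "@type": "hydra:IriTemplateMapping",
            "hydra:variable": "bbox",
            "hydra:required": True,
        }],
    },
}

# Whichever back-end answers, a client can fill in the template and follow it:
url = expand(geo_search_control["hydra:search"]["hydra:template"],
             bbox="4.35,51.16,4.45,51.25")
print(url)
```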

Yet, these building blocks are invisible today to the people who maintain the Open Data portal. How can we make them more visible? Can we come up with an authoring environment that automatically puts data correctly into this flowchart? Could this authoring environment for metadata also automatically suggest other building blocks to add to your dataset? I think solving these questions will also automate a lot of steps for civil servants trying to publish data for maximum reuse. In any case, Open Data is a domain that still needs to mature a lot.