I do not want your Open Data API, I’d rather scrape your website

2016, April 01

Why don’t you like APIs for Open Data?
A question I get way too often

What is not to like about APIs? Today, if you want to be cool, you have an API for third parties to integrate with basically anything. While there is a lot of uncertainty about what an API exactly is, we can agree it is something which third party developers can use in their programming code to do something. APIs exist to talk to your IoT devices, APIs exist to other people’s software routines and libraries, and APIs exist to exchange data between different parties. And truly, I love this idea!

But when publishing Open Data, what people mean by “I want an API” can only be two things:

The developer wants small fragments of the data which can be retrieved in JSON or XML (I know, you probably want JSON, but hey, also government and enterprise people are reading this blog!) instead of having to download a data dump or scraping HTML pages.
The developer is lazy (in most cases this is a good property of a developer) and (s)he wants the data publisher to pay for a free service for their app.

For the first point: great you are advocating for open web standards such as JSON! Yet, what you are advocating for now, is a whole new channel, which requires new funding only to set it up: new servers, new consultants programming HTTP interfaces, new ETLs, etc. Because of observation that these APIs often then become a second inferior channel, some thought leaders started preaching “API first!”. This way, they preach their website itself should use this API to show the data. The core idea is good: document all your resources and work on a decent http URI-strategy. However, why do you still need a separate API? Your HTML pages are also resources part of this http URI-strategy, so you could as well just return JSON on a page using the http Accept header, or annotate your HTML pages with machine readable snippets.

For the second point, I will need a new chapter:

Services vs. data publishing

When you are creating a production ready app, you do not want the government to host e.g., your full text search service. Can you imagine Google relying on the full text search of your government to give you search results? Of course not:

you cannot rely on the government to provide you a reasonable uptime for this service that will be harder and harder to keep online if your user base grows
there are many different parameters to a full text search query, and if you want to innovate, you need to be in control of the query execution algorithm
you cannot ask combined full text search queries over different governmental websites: you can only ask the servers separately, but integrating both result sets will be troublesome.

Other example services than full text search services are things like geo-queries, or exposing a query language such as graphql, sparql or sql, or route planning, or geocoding, or …

So, if you want to use data from the government in your production apps or your next start-up, you will want data access which allows you to replicate the entire dataset on your own machines. That is why we need the government to publish their data licensed under e.g., a CC0 license.

But what does it mean to publish your data? What is the distinction between publishing your data and a data service? At iMinds, we like to visualize this as an axis as follows:

A data dump is certainly data publishing, yet there are many drawbacks to publishing a data dump: when publishing data that changes often, having to update this data dump on your server every, let’s say second, is a bit much. That’s why I think a more Web-ish approach would not hurt: small documents (JSON, HTML, whatever) that link together a big Web of knowledge. An interesting way of working towards that is having a resource oriented approach, where you first identify all the resources you have in your dataset using a global identifier (such as a URI). Then, you create documents of data (identified by e.g., a URL) which contain something about these resources. The documents can be structured just like you would structure your website, and links direct you from one document to the other. This way, programmers can write source code that follows links (the idea of hypermedia!) to answer more difficult questions that those answered in only one document. And for the documents themself, they are small enough to contain rapidly changing data.

Examples

Setting the bad example

The Europeana API: well documented by Ruben Verborgh in his blogpost The Lie of the API.

Setting the good example

Check out schema.org! It is a way to annotate your HTML-pages with rich snippets. This way, the scraping of the website to generate a data dump goes easier, and the entire Web becomes more structured as websites are annotated with similar properties.

Check out Linked Data Fragments. It is a way to publish your data in fragments which makes clients, when asking multiple small questions to download parts of the dataset just in time, still able to ask very complicated questions. This is the true power of the Web: combining resources to solve difficult questions.

Check out the website of the city of Ghent, which provide rich snippets with Linked Data

Check out Linked Connections: it is Linked Data Fragments applied to route planning.

Huh? So what are you doing with api.iRail.be? Is that not a good example?

Indeed: it is not a good example of how to publish Open Data. We also never said it is the goal of api.iRail.be to publish the data: after all, we are not the data owner. We have created a service, available for free for everyone, which enables everyone to do calculate routes on the Belgian railway network and display this in various apps, such as Railer and BeTrains. We do this as a non-profit project that wants to make our transport experience in Belgium better. For the data dumps itself, head to gtfs.irail.be. In the same logic, I applaud any initiative of open data reusers offering free services to hobby developers, but you will hear me complain when a data publisher is spending its time on this instead of raising their own data quality.