What we did in 2019 and will be doing in 2020

By pietercolpaert

2019, December 31

It’s not research if you’re not learning in hindsight

I’m sure someone must have said it at some point

2019 was the second year I was a postdoctoral researcher at IDLab. In this blog post I want to reflect on the research goals I set one year ago, but also what we are going to do in 2020. It is curious to see that after all these years, I still underestimate certain steps, while I overestimate others. Will we be able to the better at the end of 2020? The blog post from one year ago is available here: our research in 2019”.

Our research focus was and will remain designing Public Web APIs. Last year, I put forward our main research approach for read-only Web APIs as: “How do you fragments datasets bigger than 50kb?”. Taking the fragmentation approach on a dataset helps to re-think and re-shape APIs for Open Datasets, yet putting forward an ideal size is certainly an oversimplification that we should not overuse. The ideal size of a page depends on so many factors: update frequency of the data in the page, whatfor the data itself is used, how compact the data can be represented, how the data is requested by query engines, the compression rate, type of compression, cacheability, etc. Nevertheless, for most use cases, 50kb after compression held as a good initial guess.

Thinking about dataset publishing as merely a fragmentation problem, helps a lot nevertheless. I’ve started coining the idea of “the Web as a hard disk” to explain that no database expert in their right mind would suggest removing the page cache from an operating system. It is this cache that is the enabler of the scalability of hard disk drives, powered by the locality of reference principle. If we could use existing caches that are already in everyone’s pocket, HTTP browser caches, then we could make the web of data much more efficient as well. The kind of fragmentation will lower the amount of fragments that need to be downloaded for a specific use case, but might not for the other. We recommend to always work with real query logs from an existing API in order to prove a point.

Designing public Web APIs is not limited to just fragmenting. Quickly you notice also other aspects come into play, that again make hosting more expensive: supporting different serializations, allowing to request a version of the page from the archive, materializing data dumps for manual inspection, doing metadata well for both dataset discovery (dcat) as interface discovery (hydra) and provenance. In non fragment-based interfaces we have however not even started to think about these problems.

Insights from 2019

In the Smart Flanders programme, we outlined technical principles that data publishers should adhere to. The technical principles include adding a license to your dataset, enabling Cross Origin Resource Sharing, using JSON-LD over plain JSON, using the Flemish OSLO domain models, etc. We have been working three full years on getting these principles accepted at local governments, working on how this translates into paragraphs to be put in tendering documents. For the next years, it will be a challenge to translate these principles into architecture diagrams.

Different use cases were studied. In last year’s blog post, we outlined 3 focus topics: time series, text search and geospatial search and specific ideas on how to tackle them. The ideas we had on summarizing time series were too simplistic. There is no silver bullet when it comes to summarizing time series, although a novel technique called Matrix Profile comes quite close. We are now studying that approach for compatibility with Linked Data and hope to publish this in 2020.

For geospatial search, we are still in the process of developing different approaches. R-tree and tiling have been studied and described using hypermedia. In 2020 I hope we will be able to describe techniques like hexagonal tiling and geohashes too. There might be an interesting overlap with text search there, as something that is geospatially contained within another region will have an id that has the id of the larger area as its prefix. We abandoned the idea of hilbert indexes in hypermedia APIs however. They are an interesting idea for the back-end, but not for the hypermedia API itself.

We are working on publishing the results of benchmarks we ran for time series, geospatial search and autocompletion services. Keep an eye on our publications!

Goals in 2020

What would I love to look back on at the end of 2020? We are a team of computer scientists, so we should do two things well: write inspiring papers and deliver useful code.

Want to add your data project to our goals? Our growing team is open to your challenges!