Predicting train occupancies based on query logs and external data sources

This paper was presented at the LocWeb workshop at the Web Conference 2017.

Gilles Vandewiele
Pieter Colpaert
Joachim Van Herwegen
Olivier Janssens
Ruben Verborgh
Erik Mannens
Femke Ongenae
Filip De Turck

On dense railway networks, such as in Belgium, train travelers are frequently confronted with overcrowded trains, especially during peak hours. Crowding on trains degrades the quality of service and has a negative impact on the well-being of passengers. To encourage travelers to consider less crowded trains, the iRail project wants to show an occupancy indicator in its route planning applications by means of predictive modelling. As no official occupancy data is available, training data is gathered through crowdsourcing using the Web app and the Railer application for iPhone. Users can indicate their departure and arrival stations and the time at which they took a train, and classify the occupancy of that train as low, medium or high. While preliminary results on a limited dataset show that the models do not yet perform sufficiently well, we are convinced that with further research and a larger amount of data, our predictive model will be able to achieve higher predictive performance. All datasets used in the current research are, for that purpose, made publicly available under an open license on the iRail website [16] and in the form of a Kaggle competition [17]. Moreover, an infrastructure has been set up that automatically processes new logs submitted by users so that our model can learn continuously. Occupancy predictions for future trains are made available through an API [18].

Read the full paper in PDF (CC BY)

Conclusion and future work

In this paper, the first steps towards a system that can predict the occupancy level of a train in the near future based on query logs are presented. Such a system can have a significant positive impact on the quality of service while decreasing operational costs. We discussed the different phases of constructing such a system: (i) adding a feature to a widely used application in Belgium in order to collect data through crowdsourcing; (ii) extracting numerical features from the resulting raw JSON logs; and (iii) creating a predictive model on the extracted data. Moreover, an API was created to expose the predictions of our model, and a Kaggle competition was set up to enable collaborative benchmarking.
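Phase (ii) can be illustrated with a minimal sketch. It assumes a hypothetical log record with the fields `from`, `to`, `departure_time` and `occupancy`; these names, the peak-hour windows and the chosen features are illustrative assumptions, not the actual iRail log schema or the feature set used in the paper.

```python
import json
from datetime import datetime

# Numeric encoding of the three occupancy classes named in the abstract.
OCCUPANCY_CLASSES = {"low": 0, "medium": 1, "high": 2}


def extract_features(raw_log: str) -> dict:
    """Turn one crowdsourced JSON log (hypothetical schema) into numeric features."""
    record = json.loads(raw_log)
    departure = datetime.fromisoformat(record["departure_time"])
    minutes = departure.hour * 60 + departure.minute
    is_weekday = departure.weekday() < 5  # Monday is 0, Sunday is 6

    # Assumed peak windows (07:00-09:00 and 16:00-18:00); not taken from the paper.
    in_morning_peak = 7 * 60 <= minutes < 9 * 60
    in_evening_peak = 16 * 60 <= minutes < 18 * 60

    return {
        "minutes_since_midnight": minutes,
        "day_of_week": departure.weekday(),
        "is_weekend": int(not is_weekday),
        "is_peak_hour": int(is_weekday and (in_morning_peak or in_evening_peak)),
        "label": OCCUPANCY_CLASSES[record["occupancy"]],
    }


# Illustrative record, not real data.
example_log = json.dumps({
    "from": "Gent-Sint-Pieters",
    "to": "Brussel-Zuid",
    "departure_time": "2016-10-03T07:45:00",
    "occupancy": "high",
})
features = extract_features(example_log)
```

The station names are left out of the feature dictionary here; in practice they would need an encoding of their own (for example one-hot vectors or station-level statistics), which is where external data sources could come in.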

We conclude that in this early phase our predictive model, which is trained on a limited amount of data, is good at predicting trains with a low occupancy. This comes as no surprise, as the low occupancy of trains outside peak hours is easy to predict and low is the most populated class (currently, around 41% of all samples have the low occupancy label). When more samples are collected, we are convinced that the system's predictive performance will increase. The strength of the approach in this paper is that the data used can be gathered for any public transport system. At this moment, data has only been collected over a limited timespan. The current dataset therefore contains only a limited number of samples, but it is growing rapidly, with on average 1000 query logs per month.
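To put the class imbalance in perspective, the sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class, together with per-class recall. The 41% share of the low class comes from the text; the split between medium and high in the toy labels is an assumption made purely for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


def per_class_recall(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


# Toy labels: 41% low as stated in the text; the medium/high split is assumed.
toy_labels = ["low"] * 41 + ["medium"] * 30 + ["high"] * 29

label, accuracy = majority_baseline(toy_labels)
recall = per_class_recall(toy_labels, [label] * len(toy_labels))
```

On these toy labels the baseline reaches an accuracy of 0.41 while its recall for the medium and high classes is zero, which is why a model for this task is better judged per class than on overall accuracy alone.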


Gilles Vandewiele is funded by a PhD SB fellow scholarship of FWO.

Thank you to iRail, TreinTramBus and Metro Time, and to the crowdfunding supporters, for their time and financial effort in the Spitsgids campaign. Thank you to SNCB for its support in gathering the first data. Thank you to Serkan Yildiz, Stan Callewaert and Arne Nys for their enthusiasm in implementing the features in the apps during open Summer of Code.

The Spitsgids (literal translation: “Peak Hour Guide”) crowdfunding campaign was introduced with this video (in Dutch).

Thank you to Kris Peeters, Nathan Bijnens and the other Twitter users who helped discuss the data publicly.