Conclusion and future work
In this paper, we presented the first steps towards a system that predicts the occupancy level of a train in the near future based on query logs. Such a system can have a significant positive impact on the quality of service while decreasing operational costs. We discussed the different phases of constructing such a system: (i) adding a feature to a widely used application in Belgium in order to collect data through crowdsourcing; (ii) extracting numerical features from the raw JSON logs; and (iii) creating a predictive model on the extracted data. Moreover, an API was created to expose the predictions of our model, and a Kaggle competition was set up to enable collaborative benchmarking.
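Phase (ii) can be illustrated with a minimal sketch. The log layout, the field names (`querytime`, `vehicle`, `occupancy`), the `medium` and `high` labels, and the peak-hour windows below are assumptions made for illustration only; they are not the actual schema or feature set used in this work.

```python
import json
from datetime import datetime

# Hypothetical query log entry: the field names are illustrative assumptions,
# not the schema of the real logs.
RAW_LOG = '{"querytime": "2016-10-03T07:45:00", "vehicle": "IC1518", "occupancy": "high"}'

# Assumed ordinal encoding of the crowdsourced occupancy labels.
OCCUPANCY_LABELS = {"low": 0, "medium": 1, "high": 2}


def extract_features(raw_line):
    """Turn one raw JSON log line into numerical features and a class label."""
    record = json.loads(raw_line)
    when = datetime.fromisoformat(record["querytime"])
    hour = when.hour
    features = {
        "hour": hour,
        "weekday": when.weekday(),  # Monday is 0, Sunday is 6
        "is_weekend": int(when.weekday() >= 5),
        # Peak windows are assumed values, chosen only for this example.
        "is_morning_peak": int(6 <= hour < 9),
        "is_evening_peak": int(16 <= hour < 19),
    }
    label = OCCUPANCY_LABELS[record["occupancy"]]
    return features, label


features, label = extract_features(RAW_LOG)
```

Time-based features of this kind would let a model separate peak from off-peak queries, which is where the low occupancy class is expected to dominate.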
We conclude that in this early phase our predictive model, which is trained on a limited amount of data, is good at predicting trains with a low occupancy. This comes as no surprise, as the low occupancy of trains outside peak hours is easy to predict and it is the most populated class (currently, around 41% of all samples have the low occupancy label). As more samples are collected, we expect the system's predictive performance to increase. The strength of the approach in this paper is that the data used can be gathered for any public transport system. At this moment, data has only been collected over a limited timespan, so the current dataset contains only a limited number of samples, but it is growing rapidly, with on average 1000 query logs per month.
Acknowledgements
Gilles Vandewiele is funded by a PhD SB fellowship of FWO.
We thank iRail, TreinTramBus, Metro Time, and the crowd-funding supporters for their time and financial effort in the Spitsgids campaign. We thank SNCB for its support in gathering the first data. We thank Serkan Yildiz, Stan Callewaert and Arne Nys for their enthusiasm in implementing the features in the apps during open Summer of Code.
We thank Kris Peeters, Nathan Bijnens and the other Twitter users who helped discuss the data publicly.