There is a wealth of public transport data available. What is the best way to take advantage of it? Alexandre Masselot is giving an introductory talk at Voxxed Days Zurich, exploring some of the on-trend frameworks. We asked him what inspires him, and about the importance of near real-time Big Data.
What inspires your talk on Big Data?
I’ve been doing this kind of thing since my PhD in 2000 at the University of Geneva – it was not yet called Big Data, but we were playing with large amounts of data and high performance computing. I then worked in bioinformatics and pharma research in Switzerland and San Francisco. There we had a large variety of problems which had all of the big data themes – tons of data, rich data, streams and so on. But we were mainly using artisanal means: hand-crafted technology and ad hoc solutions. Then one and a half years ago I started consulting with OCTO Technology in Lausanne and, thanks to my colleagues, dove into the more state-of-the-art big data stack.
For example, with clients we are using Apache Kafka, Hadoop, Spark, Elasticsearch etc. We also participate in a few very exciting projects with rich and reactive visualizations. We are using some of those techniques (and yet more of them) in the project I will present at Voxxed Days Zurich. In terms of choosing what technology to use, it is decided on a project-by-project basis, as there isn’t “one size fits all”.
When you are looking at a new Big Data project for a client, what is the first thing you look at?
When we have a client starting with big data, our first step is to identify use cases that will bring value to them, not just to play with a big data stack because it’s fashionable. We look at what the problem is that a classic database model won’t solve. You have rich data, a lot of events, but what is out of reach of your current stack? We design an MVP: what is the simplest question that will bring you value, that you can demonstrate to your company and your customers?
Very often we encounter situations where people just want to use Hadoop for the sake of it, because they read the name and they want to look into it – but then spend 100 days doing something that doesn’t bring any actual value. Following that route is a sure way to abandon a promising stack.
So it is looking at what the need is – that’s the first step. How can we build a proof of concept, a minimum viable product, that will bring value to your business?
What is the most important first step as a developer?
With transport, we don’t have access to actual real-time vehicle data. The CFF (Swiss Federal Railways) will open the data in March and we are eager to see that happening. In the project we present, data is fed every 20 seconds or so for all vehicle positions and station board information (next departures, delays). For a first step in transport, that proved to be enough. Some of our customers in the industry have more stringent constraints: one has thousands of events per second, another 150 million events per day. Following transport is easy in comparison. Even if we looked at all transport in Switzerland every second, handling the flow would not be that big a deal.
“Real” real time – this is the system in your car. This is a totally different world, where you need full control of the memory, garbage collection, and so on. High frequency trading is another case where you need “real” real time. What we are looking at, and what is the reality of many businesses, is in fact “near” real time, because we can afford a few seconds of latency for garbage collection or network delays. When going real time, a few seconds can really change the game. And that is another story…
For more information, see the talk at Voxxed Days Zurich.