We're kicking off a new project examining data ingestion at scale. Our scenario centers on a fictitious company building connected cars. In this scenario, cars frequently send small packets of data (also known as events) to a backend service. We want to explore how to build a system that can handle tens of thousands of events per second (or more).

The project isn't just about connected cars, though; the practices and patterns we're employing apply to many other scenarios involving high-volume, high-velocity streams of data.

Initially, we're focusing on two common problems related to event stream processing:

  • How to examine incoming events and pass each one along to an appropriate handler function, with an emphasis on speed and overall throughput at scale (see the dispatch sketch after this list).
  • How to move these events into cold storage for later analytics. We've found that it can be tricky to translate a chatty stream of small events into chunky blobs (see the buffering sketch after this list).
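To make the first problem concrete, here's a minimal dispatch sketch in Python. It isn't the project's actual code: the event shape (a dict with a `type` field), the handler name, and the registry are illustrative assumptions. The idea is simply to look up a handler by message type and hand the event off as quickly as possible.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], None]

# Registry mapping a message type to the function that handles it.
HANDLERS: Dict[str, Handler] = {}

def register(message_type: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for a given message type."""
    def wrap(fn: Handler) -> Handler:
        HANDLERS[message_type] = fn
        return fn
    return wrap

@register("engine-telemetry")
def handle_engine_telemetry(event: Dict[str, Any]) -> None:
    # Hypothetical handler: a real one might update state or raise alerts.
    print("engine telemetry:", event["payload"])

def dispatch(event: Dict[str, Any]) -> None:
    """Route an incoming event to its handler; unknown types are skipped."""
    handler = HANDLERS.get(event.get("type", ""))
    if handler is not None:
        handler(event)

# Example: dispatch a single simulated car event.
dispatch({"type": "engine-telemetry", "payload": {"car_id": "car-42", "rpm": 2100}})
```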
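For the second problem, here's a rough buffering sketch showing one way to turn a chatty stream into chunky blobs: accumulate serialized events in memory and flush once the buffer hits a size or age threshold. The thresholds, the JSON-lines format, and the `write_blob` stand-in are assumptions for illustration; a real implementation would upload to blob storage and handle failures and restarts.

```python
import io
import json
import time
from typing import Any, Dict

MAX_BLOCK_BYTES = 4 * 1024 * 1024   # flush once ~4 MB of events have accumulated
MAX_BLOCK_AGE_SECONDS = 60          # ...or once the oldest buffered event is a minute old

def write_blob(name: str, data: bytes) -> None:
    # Placeholder: a real version would upload the block to blob storage.
    print(f"uploading {name}: {len(data)} bytes")

class BlockBuffer:
    """Accumulates small events and writes them out as larger blocks."""

    def __init__(self) -> None:
        self._buf = io.BytesIO()
        self._started = time.monotonic()

    def append(self, event: Dict[str, Any]) -> None:
        self._buf.write(json.dumps(event).encode("utf-8") + b"\n")
        too_big = self._buf.tell() >= MAX_BLOCK_BYTES
        too_old = time.monotonic() - self._started >= MAX_BLOCK_AGE_SECONDS
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buf.tell() == 0:
            return
        write_blob(f"events-{int(time.time())}.jsonl", self._buf.getvalue())
        self._buf = io.BytesIO()
        self._started = time.monotonic()

# Example: append a few simulated events, then force a flush at shutdown.
buffer = BlockBuffer()
for i in range(3):
    buffer.append({"car_id": "car-42", "seq": i, "speed_kph": 80 + i})
buffer.flush()
```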

Our solution centers on Microsoft Azure Event Hubs, a cloud-scale telemetry ingestion service. If you're not familiar with Event Hubs, you can pick up the general concepts from the docs.

Even though there's a lot of code in the initial commit, this is really just the beginning. We know there are many improvements that can be made. We'd love to have your contributions and criticisms. Pull requests and comments are welcome.

Now, go fork the repo.

(Oh, in case you missed it: we're hiring!)