From Uber inc
Mingmin Chen is the tech lead and senior software engineer with streaming data team at Uber, primarily focusing on building Apache Kafka pipeline and scaling Uber's real-time infrastructure. Prior to that he was a software engineer with Twitter and Oracle, working on big data infrastructure, storage server and database technologies. He got his PhD in computer science from UC Davis.
Most storage systems replicate data across data center asynchronously, which introduces eventual consistency in the system. This is not acceptable for high-value data which requires strongly consistency across data center. At Uber, we build a messaging system that can synchronously replicate data to another data centers such that data will be available even after single data center failure. In this talk, we will discuss how we leverage Apache Zookeeper, Apache Kafka and cloud to build this system.
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.
Given the scope and pace at which Uber operates, our systems must be fault-tolerant and uncompromising when it comes to failing intelligently. In particular, in streaming processing and event driven architecture, supporting reliable redeliveries with dead letter queues is a popular ask from many real-time applications and services at Uber. To accomplish this, we leverage Apache Kafka, a popular open source distributed pub/sub messaging platform, which has been industry-tested for delivering high performance at scale. We build competing consumption semantics with dead letter queues on top of existing Kafka APIs and provide interfaces to ack or nack out of order messages with retries and in-process fanout features.