Electrical Engineering and Computer Science


Software Seminar

Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services


Mike Chow, Software Engineer, Facebook

Dan Peek, Software Engineer, Facebook

 
Tuesday, December 13, 2016
12:30pm - 1:30pm
3725 BBB

About the Event

Modern web services such as Facebook are made up of hundreds of systems running in geographically distributed data centers. Each system needs to be allocated capacity, configured, and tuned to use data center resources efficiently. Keeping a model of capacity allocation current is challenging because user behavior and software components evolve constantly.

This talk focuses on Kraken, a new system that runs load tests by continually shifting live user traffic to one or more data centers. Kraken enables empirical testing by monitoring user experience (e.g., latency) and system health (e.g., error rate) in a feedback loop between traffic shifts. We analyze the behavior of individual systems and groups of systems to identify resource utilization bottlenecks, such as insufficient capacity, load imbalance, software regressions, and poor performance tuning, which can be iteratively fixed and verified in subsequent load tests. Kraken, which manages the traffic generated by 1.7 billion users, has been in production at Facebook for three years and has allowed us to improve our hardware utilization by over 20%.

Three insights motivate this work: (1) the live user traffic accessing a web service provides the most current target workload possible, (2) we can empirically test the system to identify its scalability limits, and (3) the user impact and operational overhead of empirical testing can be largely eliminated by building automation that adjusts live traffic based on feedback.
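
As a rough illustration of the feedback loop described above, the sketch below raises the share of traffic sent to a data center in steps and stops as soon as a health check fails. It is not Kraken's implementation: the function names, the step size, the latency and error-rate thresholds, and the simulated health model are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Health:
    """Hypothetical health snapshot taken after a traffic shift."""
    p99_latency_ms: float
    error_rate: float


def run_load_test(
    measure: Callable[[int], Health],
    start_pct: int = 10,
    step_pct: int = 5,
    max_pct: int = 100,
    latency_limit_ms: float = 500.0,   # assumed threshold
    error_limit: float = 0.01,         # assumed threshold
) -> Tuple[int, List[Tuple[int, bool]]]:
    """Increase the traffic share stepwise until a health check fails.

    Returns the highest share (in percent) that stayed healthy, plus the
    history of (share, healthy) observations.
    """
    last_healthy = 0
    history: List[Tuple[int, bool]] = []
    pct = start_pct
    while pct <= max_pct:
        health = measure(pct)  # observe latency and errors at this share
        healthy = (
            health.p99_latency_ms <= latency_limit_ms
            and health.error_rate <= error_limit
        )
        history.append((pct, healthy))
        if not healthy:
            break  # back off: the previous share is the empirical limit
        last_healthy = pct
        pct += step_pct
    return last_healthy, history


def simulated_measure(pct: int) -> Health:
    """Toy stand-in for real monitoring: errors appear above 60% share."""
    latency = 100.0 + 10.0 * max(0, pct - 50)
    error_rate = 0.0 if pct <= 60 else 0.05
    return Health(p99_latency_ms=latency, error_rate=error_rate)


if __name__ == "__main__":
    limit, history = run_load_test(simulated_measure)
    print(f"highest healthy traffic share: {limit}%")
```

With the toy health model, the loop stops at the first step above 60% and reports 60% as the highest healthy share. A production system would read real monitoring data and would shift traffic back after a failed check; both are left out here for brevity.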

Biography

Mike Chow is a software engineer at Facebook who builds systems for testing the performance and efficiency of backend systems. He received his Ph.D. from the University of Michigan in 2016.

Daniel Peek is a software engineer at Facebook where he has worked on disaster readiness, data integrity, and distributed storage performance. He received his Ph.D. from the University of Michigan in 2009.

Additional Information

Contact: Stephen Reger

Phone: 734-764-9401

Email: sereger@eecs.umich.edu

Sponsor(s): SSL

Faculty Sponsor: Professor Jason Flinn

Open to: Public