DevOps at Massive Scale

When you have a billion users, people notice. That’s where our story about DevOps and Yahoo! starts. For Kishore Jalleda and Gopal Mor, both engineers at Yahoo!, when something goes wrong on a Yahoo! page, people will notice. Correction: a lot of people will notice.

Of course, Yahoo!, like all services on the Internet, constantly improves its products. In fact, they have 100+ iterations and experiments happening at any given time. Some changes bring new innovation to the forefront and others alter the user experience.

When iterations and experiments are served to loyal users who have become comfortable with a specific user experience, those users sometimes react with a natural resistance to the change. When a change causes, or appears to cause, breaks in service, the backlash can be crippling. Frequent spurts of backlash resulting from these changes became known to some Yahoo! insiders as “the bad years”.

At the recent All Day DevOps conference, Kishore and Gopal shared how Yahoo! turned to DevOps practices to recover from “the bad years”. In their presentation, Launching Products at Massive Scale: The DevOps Way, Kishore said, “DevOps is about eliminating technical, process, and cultural barriers between idea and execution - using software.” Specific to each of these, Kishore recommends the following:

Culture of ownership and excellence: own lifecycles within Development, fix root causes, and have pride in the product.

Processes: design or engineer processes to be fast and agile. Work should be iterative, support learning, and provide fast feedback cycles. Let the machines do the heavy lifting.

Tools: solve operations problems with software. Use open source tools that are self-service, reusable, friendly, and easy to use.

At Yahoo!, the DevOps practice is built on three functional pillars: deliver products to market quickly, prevent defects from reaching customers, and repair production issues quickly.

Speaking of repairing production issues quickly, Gopal discussed resiliency. With a user base of this size, avoiding downtime is critical to Yahoo!, and the challenges are many:

Distributed multilayer architecture
100s of subsystems
Complex request flow
Change is the only constant

While it may be counterintuitive, Gopal demonstrates how the combined system is weaker than the weakest subsystem.
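One way to see why, as a rough sketch rather than Gopal's own derivation: assume a request has to pass through every subsystem and that subsystems fail independently. Both are assumptions of this example, and the availability figures are made up. Under those assumptions the combined availability is the product of the individual ones, so it always lands below the weakest subsystem.

```python
from math import prod


def combined_availability(availabilities):
    """Availability of a chain in which every subsystem must succeed.

    Assumes failures are independent, which real systems rarely guarantee.
    """
    return prod(availabilities)


# Hypothetical figures: three subsystems, the weakest at 99.5%.
subsystems = [0.999, 0.995, 0.9999]
combined = combined_availability(subsystems)

print(f"weakest subsystem: {min(subsystems):.4%}")
print(f"combined system:   {combined:.4%}")
```

Under the same assumptions, 100 subsystems that are each 99.9% available combine to roughly 90.5%, which is why hundreds of subsystems make resiliency layers necessary.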

To ensure optimum uptime, Gopal tells us we need to:

Analyze the entire range of failure types
Understand their rate and impact level
Plan to cover all failure types
Conduct fire drills - test, test, and test.

Specifically, how does Yahoo! ensure high availability? They maintain four layers of resiliency in the serving stack:

Speculative retry - Request the page again once a predefined latency threshold is exceeded. This addresses long-tail latency and intermittent failures.
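A minimal sketch of the idea, not Yahoo!'s implementation: the function name `speculative_fetch`, the `fetch` callable, and the `hedge_after_s` threshold are all hypothetical. If the first attempt has not answered within the threshold (or has already failed), a second attempt is started and whichever succeeds first wins.

```python
import concurrent.futures as cf


def speculative_fetch(fetch, hedge_after_s, pool):
    """Run fetch(); if it is slow or fails, start a second attempt.

    Returns the result of whichever attempt succeeds first and raises the
    last error only if both attempts fail.
    """
    first = pool.submit(fetch)
    try:
        # Fast path: the first attempt answers within the latency budget.
        return first.result(timeout=hedge_after_s)
    except Exception:
        # Timed out or failed: fall through and issue a second attempt.
        pass

    second = pool.submit(fetch)
    pending = {first, second}
    last_error = None
    while pending:
        done, pending = cf.wait(pending, return_when=cf.FIRST_COMPLETED)
        for attempt in done:
            try:
                return attempt.result()
            except Exception as exc:
                last_error = exc
    raise last_error
```

The trade-off is extra load: every request slower than the threshold is sent twice, so the threshold is usually set near a high latency percentile rather than the median.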

Per-module fallback - Cache non-personalized modules on front-end servers and serve the cached content for modules that fail. You need to refresh the cache often, validate the cached data strictly, and check for backward compatibility if the TTL is high.
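As an illustration only, the fallback could look like the sketch below. `ModuleCache`, `render_module`, and the `is_valid` check are hypothetical names, and the sketch assumes `render` returns non-personalized markup. Only output that passes validation refreshes the cache, and entries older than the TTL are not served.

```python
import time


class ModuleCache:
    """Last known good output per module, discarded after ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # module name -> (stored_at, html)

    def put(self, name, html):
        self._entries[name] = (self.clock(), html)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        stored_at, html = entry
        if self.clock() - stored_at > self.ttl_s:
            return None  # too stale to serve
        return html


def render_module(name, render, cache, is_valid):
    """Render one module; on failure serve the cached copy, else an empty slot."""
    try:
        html = render()
        if not is_valid(html):
            raise ValueError("module output failed validation")
    except Exception:
        cached = cache.get(name)
        return cached if cached is not None else ""
    cache.put(name, html)  # only validated output refreshes the cache
    return html
```

Because each module is handled separately, one failing backend costs the page a single stale or empty slot instead of the whole response.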

Fullpage failsafe - Cache the entire page without personalized data or ads and with minimal interaction. This is used when the entire page cannot be served. Yahoo! uses auto-scaling AWS servers to serve these pages so its own servers are not negatively affected.

Failwhale page - The “we will be right back” page that lets users know, yes, your Internet is working, but our service isn’t - temporarily. Please try again. Obviously, a last resort, but a necessary one.
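Taken together, the layers behave like an ordered chain in which each layer is tried only when the one before it cannot produce a page. The sketch below shows that ordering under assumed names (`serve_page`, `layers`, `failwhale`); it is not a description of Yahoo!'s serving code. In practice, speculative retry and per-module fallback would sit inside the normal rendering layer, with the full-page failsafe next and the failwhale page last.

```python
def serve_page(request, layers, failwhale):
    """Try each layer in order and return the first page produced.

    A layer signals that it cannot serve by raising or by returning None.
    The failwhale page is static, so it is always available as a last resort.
    """
    for layer in layers:
        try:
            page = layer(request)
        except Exception:
            continue  # this layer failed; drop down to the next one
        if page is not None:
            return page
    return failwhale
```

A call might look like `serve_page(request, [render_personalized, cached_full_page], FAILWHALE_HTML)`, where all three arguments are placeholders for this example.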

There is a lot here, and you don’t need a service at the scale of Yahoo! to benefit from the experience. You can dive into Kishore and Gopal’s full All Day DevOps conference session (just 30 minutes) to learn more about what they learned on the DevOps front lines. The other 56 presentations from the All Day DevOps conference are also available online, free of charge, here.

This blog series is reviewing sessions from November’s All Day DevOps conference, which hosted over 13,500 registered attendees. Last week I discussed “Docker, the New Ordinary”. Next week, look for “System Hardening Using Ansible”, from Akash Mahajan.
