In our eleventh issue of the Architects’ Newsletter we are continuing to explore the emerging field of chaos engineering and what it can teach us about building resilient distributed systems.
Chaos Engineering: Why the World Needs More Resilient Systems
The video is now available for the QCon London talk “Chaos Engineering: Why the World Needs More Resilient Systems” by Tammy Butow, Principal SRE at Gremlin, which explores how resilience can be achieved with the practice of chaos engineering. The talk suggests that three primary prerequisites for chaos engineering must be implemented before additional work can begin: high severity “SEV” incident management; effective monitoring; and the ability to measure the impact of a failure (in both technical and business terms). Butow also presents a series of guidelines, tools and principles for creating a chaos testing practice.
Additionally, Butow recently wrote an informative guide on “Getting Started with Chaos Engineering” and took part in a Software Engineering Radio podcast that discussed the factors that caused Chaos Engineering to emerge, the different types of chaos that can be introduced to a system, and how to structure experiments. Her colleague, Ana Medina, a chaos engineer at Gremlin, has also presented a similar talk about getting started with chaos engineering at SREcon Australia, for which the slides are available.
Five Mistakes That Teams New to Chaos Engineering Make
An informative blog post by Tyler Lund, Director of Software Development at Audible, discusses how the teams who excel with chaos engineering use frequent, small experiments to find issues that affect the user experience, “rather than waiting for all users to experience a problem”. The five mistakes he often sees new chaos engineering teams make include: not monitoring enough; breaking things just to break them; lacking a proper shutoff switch; never running in production; and replacing other kinds of tests with chaos tests. A constant theme of the discussion is that chaos engineering must focus primarily on the user experience of the system under test:
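The “proper shutoff switch” Lund mentions can be as simple as a global abort flag that every experiment checks before injecting another fault. A minimal sketch in Python (all names here are illustrative, not from any particular chaos tool):

```python
import threading

class KillSwitch:
    """Global abort flag that every chaos experiment must consult."""
    def __init__(self):
        self._halted = threading.Event()

    def halt(self):
        self._halted.set()

    @property
    def halted(self):
        return self._halted.is_set()

def run_experiment(inject_fault, kill_switch, rounds=5):
    """Run small, repeated fault injections, stopping immediately if halted."""
    completed = 0
    for _ in range(rounds):
        if kill_switch.halted:
            break
        inject_fault()
        completed += 1
    return completed

switch = KillSwitch()
faults = []

def inject():
    # Simulate an operator pulling the plug after the second injection.
    faults.append("latency+100ms")
    if len(faults) == 2:
        switch.halt()

completed = run_experiment(inject, switch)
print(completed)  # 2 of 5 rounds ran before the switch halted the experiment
```

Checking the flag before every injection, rather than only at experiment start, is what makes the switch an effective emergency stop.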
“Software development teams tend to get excited about chaos engineering and go all in a bit too quickly without really thinking about how to best use it to improve the experience for their users”.
Lund concludes by stating that effective teams “use Chaos Engineering to find Chaos, not to cause it.”
Purple Testing and Chaos Engineering in Security Experimentation
On opensource.com, Aaron Rinehart and Andrew Weidenhamer have explained how red and “purple” team testing and chaos engineering complement each other to form a strong security experimentation strategy. The post begins by stating that testing “seeks to assess and validate the presence of previously known system attributes”, whereas experimentation “seeks to derive new information about a system by utilizing the scientific method”. Both approaches are important in order to create and maintain secure systems.
The post explores security experimentation and how it emerged from the application of chaos engineering, and examines the challenges of the traditional red team vs blue team security approach. It also introduces the concept of purple team exercises, which attempt to create a more cohesive testing experience between offensive and defensive security techniques through increased transparency, education, and better feedback loops:
“By integrating the defensive tactics and controls from the blue team with the threats and vulnerabilities found by the red team into a single narrative, the goal is to maximize the efforts of each”.
Continuous Chaos: Never Stop Iterating
Philip Gebhardt, software engineer at Gremlin, discussed how the approach to chaos engineering is similar to software testing patterns. “You wouldn’t write software without an iterative testing cycle”, he says, so why would you “design production systems without one?” The majority of the post explores a challenging DNS issue that the Gremlin team found within their production systems, and explains how an iterative approach to designing and running a series of chaos experiments helped the team mitigate that issue in the future.
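The “iterative testing cycle” Gebhardt describes can be framed much like a test suite: each experiment pairs a hypothesis with an observed outcome, and mismatches feed the next iteration. A hedged sketch (experiment names and outcomes are invented for illustration):

```python
def run_iterative_experiments(hypotheses, inject_and_measure):
    """Treat chaos experiments like tests: run, record mismatches, iterate.

    hypotheses: list of (experiment_name, expected_outcome) pairs.
    inject_and_measure: callable returning the observed outcome for a name.
    """
    failures = []
    for name, expected in hypotheses:
        observed = inject_and_measure(name)
        if observed != expected:
            failures.append((name, expected, observed))
    return failures

def inject_and_measure(name):
    # Pretend the service keeps serving cached entries during a DNS outage,
    # but shows an error page when a node is lost.
    outcomes = {"dns-outage": "served-from-cache", "node-loss": "error-page"}
    return outcomes[name]

hypotheses = [("dns-outage", "served-from-cache"), ("node-loss", "rerouted")]
failures = run_iterative_experiments(hypotheses, inject_and_measure)
print(failures)  # [('node-loss', 'rerouted', 'error-page')]
```

Each failed hypothesis becomes the starting point for a fix and a refined experiment in the next cycle.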
Chaos Engineering at LinkedIn: The “LinkedOut” Failure Injection Testing Framework
The LinkedIn Engineering team have recently discussed their “LinkedOut” failure injection testing framework in more detail. Hypotheses about service resilience can be formulated and failure triggers injected via the LinkedIn LiX A/B testing framework or via data in a cookie that is passed through the call stack via an Invocation Context (IC) framework. Failure scenarios include errors, delays and timeouts. The LinkedOut project is part of the larger “Waterbear” initiative to encourage every team at LinkedIn to contribute to resilience engineering efforts.
Logan Rosen, senior engineer, Site Reliability at LinkedIn, recently wrote “LinkedOut: A Request-Level Failure Injection Framework” on the LinkedIn Engineering blog. The post began by stating that in a complex, distributed technology stack, it is important to understand the points where things can go wrong and also to know how these failures might manifest themselves to end users. Engineers should assume that “Anything that can go wrong, will go wrong.”
There are many ways to inject failures into a distributed system, but the most fine-grained way to do it is at the request level. The Netflix chaos/resilience engineering team have previously discussed how they created the Failure Injection Testing (FIT) framework that eventually evolved into the Chaos Automation Platform (ChAP), which injected failure in just this way. Similarly the LinkedIn Site Reliability Engineering (SRE) team established the Waterbear project in late 2017, which is an effort to help developers “hit resiliency problems head-on” by both replicating system failures and adjusting frameworks to handle failures gracefully and transparently. Out of this work emerged the LinkedOut failure injection testing framework which enables request-level failure injection.
At its core, LinkedOut is a “disrupter” request filter in the organisation’s Rest.li stack, a Java framework that allows developers to easily create clients and servers that use a REST-style of communication. The open-source portion of this work can be found in the r2-disruptor and restli-disruptor modules within the project’s GitHub repository. LinkedOut is currently able to create three types of failures: error — the Rest.li framework has several default exceptions thrown when there are communication or data issues with the requested resource; delay — engineers can specify an amount of latency before the filter will pass the request downstream; and timeout — the filter waits for the timeout period specified.
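The three failure types can be pictured as a single request filter that either fails fast, adds latency, or exhausts a timeout before handing the request downstream. The sketch below is a simplified Python analogue of that behaviour, not the actual restli-disruptor API (which is Java and lives in the project’s GitHub repository):

```python
import time

class DisruptError(Exception):
    """Stands in for the framework's default communication exceptions."""

def disrupt_filter(request, mode, downstream, latency_s=0.0, timeout_s=0.0):
    """Sketch of a disrupter-style filter: fail, delay, or time out a request.

    mode is one of 'error', 'delay', 'timeout', or None (pass through).
    """
    if mode == "error":
        raise DisruptError(f"injected failure for {request}")
    if mode == "delay":
        time.sleep(latency_s)       # add latency, then continue downstream
        return downstream(request)
    if mode == "timeout":
        time.sleep(timeout_s)       # wait out the full timeout period...
        raise TimeoutError(f"{request} timed out after {timeout_s}s")
    return downstream(request)      # no disruption configured
```

The key design point is that the filter sits in the request path itself, so failures are injected per request rather than per host or per process.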
At development time, engineers use the LinkedOut framework to validate that their code is robust. This validation is extended to production scenarios to provide external parties the confidence and evidence of robustness. There are two primary mechanisms to invoke the disruptor while limiting impact to the end-user experience. One of these is LiX, the LinkedIn framework for A/B testing and feature gating. The second is the Invocation Context (IC), a LinkedIn-specific, internal component of the Rest.li framework that allows keys and values to be passed into requests and propagated to all of the services involved in handling them.
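The Invocation Context mechanism amounts to a key-value map that rides along with a request through every downstream call, so a disruption spec placed in it can target one specific service deep in the call graph. A hedged sketch of the idea (the key names and response shape are invented; the real IC is internal to Rest.li):

```python
def call_service(name, request, ic):
    """Handle a request, consulting the forwarded invocation context (IC).

    If the IC carries a disruption spec targeting this service, fail here
    instead of doing real work; otherwise respond normally. A real handler
    would fan out to further services, passing the same IC along.
    """
    spec = ic.get("disrupt", {})
    if spec.get("target") == name and spec.get("mode") == "error":
        return {"service": name, "status": 500}
    return {"service": name, "status": 200}

ic = {"disrupt": {"target": "profile-service", "mode": "error"}}
print(call_service("feed-service", "req-1", ic))     # untouched: status 200
print(call_service("profile-service", "req-1", ic))  # injected: status 500
```

Because every hop forwards the same context, a single value set at the edge can disrupt exactly one service anywhere in the stack.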
As the service call graph is large and complicated at LinkedIn — the latest home page depends on more than 550 different endpoints — it is very difficult for engineers to ensure expected “graceful” degradation on the home page for every failure scenario. Therefore the SRE team created a service account (not associated with a real member) and gave it access to all of the LinkedIn products.
To automatically test web pages, the team leverages an internal framework that allows for Selenium testing at scale. They inject the disruption information into the invocation context (IC) via a cookie (which only functions on their internal network), authenticate the user, and then load the URL defined in the test. The team considered several ways to determine success after injecting failures, but for the first iteration of the framework they decided to simply provide default matchers for “oops” (error) pages and blank pages. If the page loaded by Selenium matched one of these default patterns, they would consider the page to not have gracefully degraded.
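The default matchers boil down to a simple classifier over the loaded page source. A minimal sketch of that first-iteration check (the actual matchers are internal to LinkedIn; these string patterns are stand-ins):

```python
def degraded_gracefully(page_source):
    """First-iteration style check: flag blank pages and 'oops' error pages.

    Returns False when the page matches a default failure pattern,
    i.e. the page did NOT degrade gracefully.
    """
    body = page_source.strip().lower()
    if not body:
        return False  # blank page
    if "oops" in body or "something went wrong" in body:
        return False  # error ("oops") page
    return True       # no default failure pattern matched

print(degraded_gracefully(""))                           # False (blank)
print(degraded_gracefully("Oops! Something went wrong")) # False (error page)
print(degraded_gracefully("<html>Your feed</html>"))     # True
```

Crude pattern matching like this catches total failures cheaply; subtler degradation (a missing module on an otherwise rendered page) would need richer matchers.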
At LinkedIn the mechanism of triggering failures via feature targeting (flagging) is simple due to the maturity and power of the LiX experimentation framework. Engineers create a targeting experiment based on the failure parameters that they specify. Once the experiment is activated, the disruption filter picks up the change, via a LiX client, and fails the corresponding requests. Using LiX also allows an engineer to easily terminate failure plans (“within minutes”) that have gone wrong or are impacting end-users inappropriately.
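Conceptually, the LiX mechanism is a feature-flag lookup in the request path: the disruption filter asks the experimentation client whether the current member is targeted, and fails the request only if so. A hedged Python sketch (the `FlagStore` class, flag names and parameters are invented stand-ins for LiX, which is internal to LinkedIn):

```python
class FlagStore:
    """Stand-in for an experimentation/feature-gating client such as LiX."""
    def __init__(self):
        self._flags = {}

    def activate(self, flag, params):
        self._flags[flag] = params

    def terminate(self, flag):
        # Terminating the experiment stops all injection "within minutes".
        self._flags.pop(flag, None)

    def treatment(self, flag, member_id):
        params = self._flags.get(flag)
        if params and member_id in params["targets"]:
            return params
        return None

def maybe_disrupt(flags, member_id, request):
    """Fail the request only if the member is targeted by an active plan."""
    spec = flags.treatment("disrupt-downstream", member_id)
    if spec:
        raise RuntimeError(f"injected {spec['mode']} for {request}")
    return "ok"
```

Driving injection through the flag store is what makes failure plans easy to scope (only targeted members are affected) and easy to abort (terminate the experiment and the filter stops firing).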
Additional details on the LinkedOut framework, including additional references and a discussion of the importance of the human and cultural side of resilience testing, can be found within the recent “Chaos Engineering at LinkedIn: The ‘LinkedOut’ Failure Injection Testing Framework” article on InfoQ.
To get notifications when InfoQ publishes content on this topic follow Chaos Engineering on InfoQ.
Missed a newsletter? You can find all of the previous issues on InfoQ.
This edition of The Software Architects’ Newsletter is brought to you by:
Load Balancing in the Cloud
Cloud load balancing refers to distributing load across a number of application servers or containers running on cloud infrastructure. Cloud providers offer Infrastructure as a Service (IaaS), which renders virtual machines and network provisioning through use of an application programming interface (API). In the cloud it’s easy and natural to scale horizontally as new application servers are just an API call away. With dynamic environments, where new machines are provisioned and decommissioned to meet user demand, there is a greater need for a load balancer to intelligently distribute traffic across your machines.
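The simplest intelligent-distribution strategy is round robin over a server pool that can grow or shrink as machines are provisioned and decommissioned. A minimal sketch (server addresses are illustrative; production balancers also handle health checks, weighting and connection draining):

```python
class RoundRobinBalancer:
    """Minimal round-robin load balancer over a mutable server pool."""
    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def add(self, server):
        # In the cloud, a new backend is "just an API call away".
        self.servers.append(server)

    def remove(self, server):
        self.servers.remove(server)

    def next_server(self):
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2"])
picks = [lb.next_server() for _ in range(3)]
print(picks)  # ['10.0.0.1', '10.0.0.2', '10.0.0.1']
lb.add("10.0.0.3")  # scale out: the new machine joins the rotation
```

The dynamic `add`/`remove` pool is the essential cloud ingredient: the rotation keeps working as the autoscaler changes the set of backends underneath it.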
InfoQ strives to facilitate the spread of knowledge and innovation within this space, and in this newsletter we aim to curate and summarise key learnings from news items, articles and presentations created by industry peers, both on InfoQ and across the web. We aim to keep readers informed and educated about emerging trends, peer-validated early adoption of technologies, and architectural best practices, and are always keen to receive feedback from our readers. We hope you find it useful, but if not you can unsubscribe using the link below.
Forwarded email? Subscribe and get your own copy.