Building a Resilient Service: A Three-Pronged Approach

Parashar Borkotoky
5 min readJun 28, 2021

The ability to recover quickly from failures has culturally been an infrastructure concern. Enterprises have traditionally hosted their applications and services on centralized infrastructure like mainframes or vertically scaled databases, which made it easier for infrastructure teams and system admins to manage and, to a great extent, guarantee very high reliability. The need for agility and performance, along with growing user bases, has led enterprises to adopt more distributed architectures, a shift further propelled by the advent of cloud and microservices.

The growing complexity and number of system components within enterprises, especially services, has made resilience as much a software design concern as an infrastructure one. Conversations around resilience have also driven cultural shifts, with many organizations becoming more open to the ideas of failing fast, testing in production and rolling out features in very small increments. This has also created the need for resiliency offices and resilience stories in sprint backlogs, especially in the context of web services.

If you are a Chief Resilience Officer, an Enterprise Architect or a Chief Information Officer, you may have been exposed to the divergent views around resilience: the executive who views resilience through the lens of priority issues on her dashboard, the software engineer who emphasizes design patterns that make services resilient, the site reliability engineer focused on system health and monitoring, and the infrastructure engineer focused on hardware bottlenecks. You may have also realized that building a resilient application or service is not a magic bullet or a few checkboxes to be ticked off a list, but a 360-degree, long-term, incremental process that needs a cultural shift and collaboration across the lifecycle.

A three-pronged Approach

One approach to resilience is to look at it from three perspectives:

Design for Failures

Services should be designed with the assumption that hardware, network, application or service failures can happen at any time; designing for failures helps a service recover quickly or fail gracefully. This principle puts the onus of resilience on the software developer, with the paradigm that good software can work well even with hardware limitations or failures.

Relaxed Temporal Constraints : Many real use cases do not require strict consistency. Embracing eventual consistency alleviates the need for a centralized data store and loosens the coupling of services to a single data center or availability zone.

Asynchronous Communication : Asynchronous communication can make services more responsive. Developers can break every service down to its bare capabilities and ask which parts can be done asynchronously without making the service unresponsive.
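A minimal sketch of this idea in Python using asyncio. The `fetch_profile` and `fetch_orders` calls are hypothetical stand-ins for slow downstream dependencies; the point is that independent calls run concurrently, so the service's latency is bounded by the slowest call rather than their sum.

```python
import asyncio

async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.05)  # stand-in for a slow downstream call
    return {"user": user_id}

async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0.05)  # another independent downstream call
    return ["ord-1", "ord-2"]

async def get_dashboard(user_id: str) -> dict:
    # Independent downstream calls run concurrently rather than
    # serially, keeping the service responsive under load.
    profile, orders = await asyncio.gather(
        fetch_profile(user_id), fetch_orders(user_id)
    )
    return {"profile": profile, "orders": orders}

result = asyncio.run(get_dashboard("u-1"))
print(len(result["orders"]))
```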

Parameter Checking : One of the most obvious, often neglected, yet simple ways to make a service more resilient is to go back to basics and make sure all parameters are validated before requests are processed.
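A small sketch of validation at the service boundary. The `transfer_funds` handler and its `acct-` prefix rule are hypothetical; the pattern is to reject malformed input with a clear error before any work is done, rather than letting it fail deep inside the service.

```python
def transfer_funds(account_id: str, amount: float) -> str:
    # Validate every parameter at the boundary, before processing.
    if not account_id or not account_id.startswith("acct-"):
        raise ValueError("account_id must look like 'acct-...'")
    if not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("amount must be a positive number")
    return f"transferred {amount} to {account_id}"

print(transfer_funds("acct-1", 10.0))
```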

Recoverability : How does a service recover from failures? Design your service to retry, fall back and roll back from failures so it recovers to the best acceptable state as soon as possible. Along with a service recovering from its own failures, it is also important to design it to be idempotent so that repeated requests with the same inputs produce a consistent response. Limiting the effect of failures is just as important, and design patterns like the circuit breaker, timeout or bulkhead can help mitigate cascading and catastrophic failures.
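The retry part of this can be sketched as a small decorator with exponential backoff. The `flaky_lookup` function is a hypothetical dependency that fails transiently; real implementations would typically also add jitter and cap the total delay.

```python
import time

def retry(max_attempts: int = 3, base_delay: float = 0.01):
    """Retry a transient failure with exponential backoff."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts - 1:
                        raise  # give up; let the caller fall back
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

calls = {"count": 0}

@retry(max_attempts=3)
def flaky_lookup() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

print(flaky_lookup())  # recovers on the third attempt
```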

Build for Agility

A significant portion of outages in an organization happen during planned events like feature releases or software upgrades. These events need not be high-risk go/no-go decisions. A culture of agility, supported by build processes and frameworks, can make software deployments almost a non-event and greatly reduce the chance of failures. This often requires embracing concepts like incremental rollouts and testing in production.

Feature Flags : Big-bang releases are recipes for failure. One great way to increase deployment frequency while lowering risk is to use feature flags, which can turn features on or off dynamically at runtime.
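In its simplest form, a feature flag is a runtime lookup that gates a code path. This sketch uses an in-memory dictionary for illustration; in practice the flags would live in a config store or a dedicated flag service so they can be flipped without a redeploy.

```python
# In-memory stand-in for a flag service or config store.
FLAGS = {"new_checkout": False}

def set_flag(name: str, enabled: bool) -> None:
    FLAGS[name] = enabled

def checkout(cart: list) -> str:
    # The deployed binary contains both paths; the flag decides
    # which one runs, so the feature can be disabled instantly.
    if FLAGS.get("new_checkout"):
        return "new-flow"
    return "legacy-flow"

print(checkout([]))              # legacy-flow
set_flag("new_checkout", True)   # flipped at runtime, no redeploy
print(checkout([]))              # new-flow
```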

Rollout Strategy : It is important to distinguish and decouple deployments from rollouts. Deployments can be a continuous process, with software deployed multiple times a day, while rollouts should be driven by product or feature roadmaps. It is also important to roll out incrementally to reduce risk. Engineering teams should consider various rollout strategies: shadowing — taking production traffic in parallel to test for resilience; piloting — rolling out to a low-risk set of users like employees; and sampling — rolling out incrementally to a small set of users by demography, location or other parameters, using techniques including canary rollouts and blue-green deployments.
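The sampling strategy is often implemented by deterministic bucketing: hash the user and feature together, and admit only the users whose bucket falls under the rollout percentage. This sketch (with a hypothetical `in_rollout` helper) keeps each user's assignment stable across requests, which matters for a canary.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    # Deterministic bucketing: the same user always lands in the
    # same bucket, so a 10% canary stays stable across requests.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# Ramping up is just raising the percentage; earlier users stay in.
print(in_rollout("user-42", "new-search", 100))  # True: everyone at 100%
print(in_rollout("user-42", "new-search", 0))    # False: no one at 0%
```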

Testing in Production : Engineering managers and product managers are often wary of the term “testing in production”. However, done well and in controlled environments, strategies like chaos testing and failure injection can battle-test a system and help make large systems resilient to failures.

Infrastructure for the Cloud

An infrastructure that can automatically adapt to positive and negative disruptions is at the heart of resiliency efforts. Building scalable infrastructure, whether by accelerating the adoption of a public cloud like AWS within the enterprise or by adopting cloud patterns within on-premises data centers, often requires executive commitment and significant investment. A cloud-native infrastructure platform is paramount for the resiliency of distributed applications and services.

Auto Scaling : The ability to automatically adjust the capacity of computational resources is paramount for the scalability and availability of applications and services. If certain parts of the infrastructure cannot auto-scale, it is important to identify such components proactively and build excess capacity so they do not become failure points in an otherwise resilient system.
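The core scaling decision can be sketched in a few lines. This is a simplification, similar in spirit to the proportional formula used by horizontal autoscalers such as Kubernetes' HPA: replicas grow or shrink in proportion to observed load versus a target utilization, bounded by a floor and a ceiling.

```python
import math

def desired_replicas(current: int, utilization: float,
                     target: float = 0.6, max_replicas: int = 10) -> int:
    # Scale replica count in proportion to observed load versus the
    # target utilization, clamped to [1, max_replicas].
    want = math.ceil(current * utilization / target)
    return max(1, min(want, max_replicas))

print(desired_replicas(4, 0.9))  # overloaded at 90% CPU: scale out to 6
print(desired_replicas(6, 0.3))  # underused at 30% CPU: scale in to 3
```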

Availability Zones : Building availability zones with redundancy and segregated power, networking and connectivity is key to guaranteeing the availability of systems. It is just as important to distribute traffic efficiently using best-in-class load balancing and networking strategies.

In Conclusion

Resiliency in modern applications is becoming a revenue metric for enterprises, as outages have direct profit-and-loss implications beyond brand and reputational impact. Consider a comprehensive approach spanning developers, infrastructure engineers and product management when envisioning resiliency. While scalable infrastructure is at the heart of resiliency, building resilient services and applications also requires deep analysis and planning around design considerations and agility in managing change.


Parashar Borkotoky

Observer, learner. Interested in Architecture, MicroServices and People. Views are mine and do not reflect my employer, previous or present