Liveness and Readiness Check

When designing a distributed system, one of the things you will need to track is how healthy your services are. In a monolithic system this is relatively easy: when the system goes down it stops servicing requests entirely, so failures give immediate feedback. Moving to a microservice architecture splits that problem into many sub-problems, because you now have many different services that can fail independently, carry uneven load, and are generally small enough that you want to scale out many small instances rather than rely on a few large ones. This is where health checks come in: each instance reports its status back to the infrastructure so that the infrastructure can decide what to do to keep the system performing well.

Two common health checks are readiness and liveness.

Readiness

Readiness signals that the application has reached a state where it can begin servicing requests. During startup, the application may perform various setup steps such as loading caches, checking connectivity with hard dependencies like a database, or registering itself with the rest of the system. Once the application passes the readiness check, it signals to the overall system that traffic may be sent to it. This check is usually performed at the beginning of the application's life cycle and, once passed, remains inactive until a new instance is spun up.
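As a rough illustration (a minimal sketch, not tied to any particular orchestration platform), the Go program below exposes a /ready endpoint for a service whose only hard dependency is assumed to be a SQL database. The path, port, connection string, and Postgres driver are all placeholders, not anything prescribed by the post.

```go
package main

import (
	"database/sql"
	"log"
	"net/http"
	"sync/atomic"

	_ "github.com/lib/pq" // hypothetical Postgres driver; use whatever the service actually needs
)

func main() {
	// The database is treated as the service's only hard dependency here;
	// the connection string is a placeholder.
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	var warmedUp atomic.Bool // flipped once startup work finishes

	go func() {
		// ... load caches, register with the rest of the system, etc. ...
		warmedUp.Store(true)
	}()

	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !warmedUp.Load() {
			http.Error(w, "still warming up", http.StatusServiceUnavailable)
			return
		}
		// Ping verifies the database is reachable, opening a connection if needed.
		if err := db.PingContext(r.Context()); err != nil {
			http.Error(w, "database not reachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

An orchestrator or load balancer would call /ready during startup and only begin routing traffic once it returns 200.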

Liveness

Even after an application has started successfully and is ready to accept requests, it may run into difficulties during its lifetime and end up in a state where it can no longer handle requests. That is where the liveness check comes into play. On some interval, your infrastructure should routinely check the applications it considers healthy to see whether they can still handle a simple request. Best practice is that this check contains no business logic or heavy computation, as you are merely determining whether the service can take new requests. If the application cannot respond, the infrastructure may apply a specified retry policy before forcefully restarting it.
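As a rough sketch (again not tied to any particular framework), a liveness endpoint in Go can be as small as the handler below; the /live path is just an illustrative convention.

```go
package health

import "net/http"

// LivenessHandler answers 200 unconditionally. It deliberately contains no
// business logic or heavy computation: responding at all is the evidence
// that the process can still accept and serve requests.
func LivenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}
```

A server would register it with http.HandleFunc("/live", health.LivenessHandler), and the infrastructure would poll that path on a fixed interval, counting consecutive failures against its retry policy before restarting the instance.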

Lessons Learned

  • Only put hard dependencies in the readiness check. There was discussion about adding connectivity checks on the other services our application used as part of the readiness check. This may seem like a good idea because it adds a layer of assurance that the systems you need to serve requests are up and running, but in practice it causes a lot of problems. The issue is that you are creating a dependency on other services when you are merely stating that your own application is ready to run. Some situations this pattern can put you in (see the sketch after this list):
    • A brand new environment is spun up and two services have a circular dependency, each waiting for the other to become healthy. Neither service can reach a ready state because the other has not been exposed yet.
    • You are trying to spin up a service that relies on a high-traffic system that cannot handle the extra request at that moment. Your new service will be brought down, which may cascade through the rest of your system, since you were probably spinning up the new instance to meet growing demand in the first place.
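To make that concrete, here is a hedged Go sketch of a readiness handler that checks only a hard dependency (the service's own database) and deliberately does not probe any downstream service; billing-service is a made-up name used purely for illustration.

```go
package health

import (
	"database/sql"
	"net/http"
)

// ReadyHandler reports ready when the service's own hard dependency (here,
// its database) is reachable. It intentionally makes no call to downstream
// services such as a hypothetical billing-service: coupling readiness to
// another service's uptime is what leads to the circular-dependency and
// cascade problems described above.
func ReadyHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := db.PingContext(r.Context()); err != nil {
			http.Error(w, "hard dependency unavailable", http.StatusServiceUnavailable)
			return
		}
		// Deliberately absent: anything like http.Get("http://billing-service/live").
		w.WriteHeader(http.StatusOK)
	}
}
```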
