Currently health check failure events are only log if the HealthFlag for a host transition from non-FAILED_ACTIVE_HC to FAILED_ACTIVE_HC. However, since hosts are initialized in the FAILED_ACTIVE_HC state, hosts that never became healthy have no events associated with it.
Since the current health check events only log transitions, we'll have to scan the entire log in order to find the hosts in a current failing state. Then we'll still have to filter the hosts permanently removed from the cluster by the discovery service. This makes the events very difficult to use in operations.
Proposed solution
Both of these 2 issues can be solved by emitting a health check failure event if either of these conditions are true:
If the active health check failed and it's the first health check for a host. This ensures we have events for hosts that never became healthy.
If the active health check failed and a AlwaysLogFailures configuration is set to true, by default this flag is set to false. This makes it very easy to find the hosts currently failing by looking at the last few seconds of logs.
Signed-off-by: Henry Yang <hyang@lyft.com>
Mirrored from https://github.com/envoyproxy/envoy @ 11e196b67ee9124f33c45f5adf542841386e3c39