Update doc for failing healthcheck monitoring

brablc · Jul 9, 2024 · commit 665dac1
1 parent 2a73346
Showing 1 changed file with 13 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -2,8 +2,8 @@
 
 Detect unhealthy containers using two methods:
 
-1. **🚪Opened ports** - uses auto discovery and checks whether services with non zero replicas are available on those ports.
-2. **💔 Failing services** - uses [Docker Events API](https://docs.docker.com/engine/api/v1.45/#tag/System/operation/SystemEvents) to detect containers, that are restarted too often.
+1. **🚪 Opened ports** - uses auto discovery and checks whether services with non zero replicas are available on those ports.
+2. **📜 Docker events** - analyzes events generated by swarm when containers are created/destroyed 🔁 or have a failing healthcheck 💔.
 
 ## Configuration
 
@@ -23,11 +23,19 @@ services:
                 - "swarm-health-alerter.port=5672,15672"
 ```
 
-### 💔 Failing services
+The monitoring does not verify the number of instances; it is satisfied when at least one instance is running on the port.
 
-Sometimes your service would fail (or be killed by healthcheck) and restart. This would be seen as event `destroy` and `create`.
+### 📜 Docker events
 
-If both the number of `destroy` and `create` events exceed configured `EVENTS_THRESHOLD` within `EVENTS_WINDOW`, the service is deemed unhealthy and alert is created. If there was no event from the service withing the window, the problem is deemed resolved.
+Uses [Docker Events API](https://docs.docker.com/engine/api/v1.45/#tag/System/operation/SystemEvents) to monitor two conditions:
+
+#### 🔁 Restarting services
+
+Sometimes your service fails (or is killed by a healthcheck) and restarts. This is seen as a `destroy` event followed by a `create` event. If the numbers of both `destroy` and `create` events exceed the configured `EVENTS_THRESHOLD` within `EVENTS_WINDOW`, the service is deemed unhealthy and an alert is created. If there was no event from the service within the window, the problem is deemed resolved.
+
+#### 💔 Failing healthcheck
+
+When a healthcheck fails for the given number of retries, it would normally lead to a service restart. In certain situations this is better avoided, as it can lead to loss of data (imagine RabbitMQ being killed while recovering queues from disk). In such situations you may prefer to set a high number of retries for the healthcheck (`retries: 9999`) and get alerted when the number of failed healthcheck retries exceeds the configured `EVENTS_THRESHOLD`.
 
 ## Installation
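
For the failing-healthcheck case described in the diff above, a service definition might look roughly like the sketch below. Only `retries: 9999` comes from the README text. The service name, image tag, check command, and timings are illustrative assumptions, not part of this commit.

```yaml
# Hypothetical service; everything except `retries: 9999` is an assumption.
services:
  rabbitmq:
    image: rabbitmq:3-management
    healthcheck:
      # Assumed check command; rabbitmq-diagnostics ships with the official image.
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: 30s
      timeout: 10s
      # Very high retry count so failed checks do not trigger a restart;
      # the alerter is then expected to report failures past EVENTS_THRESHOLD.
      retries: 9999
```

With a setup like this, the container keeps running through repeated failed checks, so the alert is the only signal that something is wrong.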