Commit

Update doc for failing healthcheck monitoring
brablc authored Jul 9, 2024
1 parent 2a73346 commit 665dac1
Showing 1 changed file with 13 additions and 5 deletions.
18 changes: 13 additions & 5 deletions README.md
@@ -2,8 +2,8 @@

Detect unhealthy containers using two methods:

1. **🚪Opened ports** - uses auto discovery and checks whether services with non zero replicas are available on those ports.
2. **💔 Failing services** - uses [Docker Events API](https://docs.docker.com/engine/api/v1.45/#tag/System/operation/SystemEvents) to detect containers, that are restarted too often.
1. **🚪 Opened ports** - uses auto discovery and checks whether services with non zero replicas are available on those ports.
2. **📜 Docker events** - analyzes events generated by swarm when containers are created/destroyed 🔁 or have a failing healthcheck 💔.

## Configuration

@@ -23,11 +23,19 @@ services:
- "swarm-health-alerter.port=5672,15672"
```
### 💔 Failing services
The monitoring does not verify the number of instances; it is satisfied when at least one instance is running on the port.
Sometimes your service would fail (or be killed by healthcheck) and restart. This would be seen as event `destroy` and `create`.
### 📜 Docker events
If both the number of `destroy` and `create` events exceed configured `EVENTS_THRESHOLD` within `EVENTS_WINDOW`, the service is deemed unhealthy and an alert is created. If there was no event from the service within the window, the problem is deemed resolved.
Uses [Docker Events API](https://docs.docker.com/engine/api/v1.45/#tag/System/operation/SystemEvents) to monitor two conditions:
#### 🔁 Restarting services
Sometimes a service fails (or is killed by its healthcheck) and restarts. This shows up as `destroy` and `create` events. If the numbers of both `destroy` and `create` events exceed the configured `EVENTS_THRESHOLD` within `EVENTS_WINDOW`, the service is deemed unhealthy and an alert is created. If no event arrives from the service within the window, the problem is deemed resolved.
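
As a rough illustration of how `EVENTS_THRESHOLD` and `EVENTS_WINDOW` might be supplied, the fragment below assumes they are read from the alerter's environment. The service name, image reference, and both values are placeholders, and the unit of `EVENTS_WINDOW` is not stated in this section, so check the Configuration section before copying anything.

```yaml
services:
  swarm-health-alerter:                  # hypothetical service name
    image: swarm-health-alerter:latest   # placeholder image reference
    environment:
      # Illustrative values only; assumed to be environment variables
      EVENTS_THRESHOLD: "3"   # alert once both destroy and create counts exceed this
      EVENTS_WINDOW: "60"     # window over which events are counted (unit assumed)
```

With values like these, a service that is destroyed and recreated more than three times inside one window would be reported, and the alert would clear after a full window without events.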

#### 💔 Failing healthcheck

When a healthcheck fails for the given number of retries, it normally leads to a service restart. In certain situations this is better avoided, as it can lead to loss of data (imagine RabbitMQ being killed while recovering queues from disk). In such situations you may prefer to set a high number of retries for the healthcheck (`retries: 9999`) and get alerted when the number of failed healthcheck retries exceeds the configured `EVENTS_THRESHOLD`.
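
A minimal sketch of such a healthcheck for the RabbitMQ case, using the standard Compose `healthcheck` keys. The image tag, check command, and timing values are assumptions chosen for illustration and are not taken from this README.

```yaml
services:
  rabbitmq:
    image: rabbitmq:3-management   # example image, an assumption
    healthcheck:
      # rabbitmq-diagnostics ships with the official RabbitMQ image
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: 30s
      timeout: 10s
      # Practically unreachable, so the container is not marked unhealthy
      # and restarted while it is still recovering
      retries: 9999
```

The restart decision is effectively handed over to whoever receives the alert, which suits services where an interrupted recovery is worse than a slow one.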

## Installation
