
Too many restarts in K8s cluster deployment #179

Open
c0c0n3 opened this issue Apr 16, 2019 · 5 comments
Labels
ops infrastructure and scm related stuff

Comments

@c0c0n3
Member

c0c0n3 commented Apr 16, 2019

We've been experiencing an unusually high number of restarts in our K8s cluster. For example, in the last 3 days K8s restarted QL 103 times in one pod and 99 times in the other.

c0c0n3 added the "ops infrastructure and scm related stuff" label on Apr 16, 2019
@chicco785
Contributor

I think it happens when QL becomes unresponsive, and so it's killed by k8s:

  Warning  Unhealthy  54m (x1061 over 4d21h)    kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Normal   Pulling    54m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  pulling image "smartsdk/quantumleap:rc"
  Normal   Killing    54m (x100 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Killing container with id docker://quantumleap:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Successfully pulled image "smartsdk/quantumleap:rc"
  Normal   Created    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Created container
  Normal   Started    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Started container
  Warning  Unhealthy  53m (x3 over 4d21h)       kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: dial tcp 172.20.44.1:8668: connect: connection refused
  Warning  Unhealthy  8m50s (x1127 over 4d21h)  kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Readiness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
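
(For context: the events above show the kubelet killing the container because the HTTP liveness probe on /v2/health times out. One mitigation is a more tolerant probe; the sketch below reuses the path and port from the events, but the numbers are purely illustrative, not a project recommendation.)

  livenessProbe:
    httpGet:
      path: /v2/health
      port: 8668
      scheme: HTTP
    initialDelaySeconds: 60   # let QL finish starting before the first probe
    periodSeconds: 30         # probe every 30 seconds
    timeoutSeconds: 30        # tolerate slow /v2/health responses
    failureThreshold: 5       # restart only after 5 consecutive failures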

@chicco785
Contributor

I believe this was solved by allowing for the yellow state of the Crate cluster.

@pooja1pathak
Collaborator

@c0c0n3 We are facing this issue in our Kubernetes deployment of QuantumLeap with the WQ configuration. We have used the liveness probe settings below in both deployment files, for quantumleap and quantumleap-wq:

        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 180
          periodSeconds: 60
          successThreshold: 1
          httpGet:
            path: /health
            port: 8668
            scheme: HTTP
          timeoutSeconds: 60

Please find our observations below:

  1. The quantumleap pod does not restart.
  2. The quantumleap-wq pod restarts many times.
  3. The liveness probe fails for the quantumleap-wq pod.
  4. The health API returns status:pass.

Liveness probe failed for quantumleap-wq:
[screenshot: health check output showing the liveness probe failure]

We have checked Crate health in our environment, and it is GREEN.
If we remove the livenessProbe from the quantumleap-wq deployment file, then the pod does not restart.

Please confirm our understanding: a livenessProbe is not required in the quantumleap-wq deployment file.

pooja1pathak reopened this on Oct 28, 2024
@pooja1pathak
Collaborator

@c0c0n3 we have the following observations on why the livenessProbe is not working with the quantumleap-wq deployment file:

Two deployments of QuantumLeap are running in our environment: one for the master and the other for the worker.

In the master deployment file we can have a livenessProbe which calls QuantumLeap's health API and restarts the pod if any error occurs.

The worker QuantumLeap, however, can only handle the notify API, as mentioned in https://github.com/orchestracities/ngsi-timeseries-api/blob/master/docs/manuals/admin/wq.md.

As per our understanding, QuantumLeap's health API cannot be executed on the worker QuantumLeap, so the probe returns a connection error and the pod restarts.
We can remove the livenessProbe from the worker QuantumLeap because the check is already handled in the master QuantumLeap, which verifies the Crate status.

Please correct my understanding if there is anything I am missing.
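
(For reference: if some form of probe on the worker pod is still wanted, an exec-based check that the worker process is alive avoids calling a Web API the worker does not serve. This is only a sketch; it assumes the worker image ships pgrep and that the worker's command line contains "rq", neither of which is confirmed in this thread.)

  livenessProbe:
    exec:
      command: ["pgrep", "-f", "rq"]   # passes while an RQ worker process is running
    initialDelaySeconds: 60
    periodSeconds: 60
    failureThreshold: 3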

@c0c0n3
Member Author

c0c0n3 commented Nov 21, 2024

hi @pooja1pathak :-)

> As per our understanding, QuantumLeap's health API cannot be executed on the worker QuantumLeap, so the probe returns a connection error and the pod restarts.

Correct. Each Worker process is a standalone RQ instance; there's no QL Web API there.

> We can remove the livenessProbe from the worker QuantumLeap

Yes. I suggest you start Workers using Supervisor with our config:

That will give you the reliability you're after, I guess.

More about it here:

Hope this helps!
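
(For illustration: one way to follow the Supervisor suggestion above is to have the worker container run its processes under supervisord, which restarts crashed workers inside the container itself, so no Kubernetes probe is needed. The fragment below is a hypothetical sketch: the config path is a placeholder and the image tag is only copied from the events earlier in this thread, not the project's actual setup.)

  containers:
    - name: quantumleap-wq
      image: smartsdk/quantumleap:rc   # tag copied from the events above
      # supervisord runs in the foreground (-n) as the container's main process
      # and restarts any worker process that dies, so no livenessProbe is set here.
      command: ["supervisord", "-n", "-c", "/etc/supervisord.conf"]   # config path is a placeholder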
