
Too many restarts in K8s cluster deployment #179

Open
c0c0n3 opened this issue Apr 16, 2019 · 5 comments
Labels
ops infrastructure and scm related stuff

Comments

@c0c0n3
Member

c0c0n3 commented Apr 16, 2019

We've been experiencing an unusually high number of restarts in our K8s cluster. For example, in the last 3 days K8s restarted QL 103 times in one pod and 99 times in the other.

c0c0n3 added the "ops infrastructure and scm related stuff" label on Apr 16, 2019
@chicco785
Contributor

I think it happens when QL becomes unresponsive, and so it's killed by k8s:

  Warning  Unhealthy  54m (x1061 over 4d21h)    kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Normal   Pulling    54m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  pulling image "smartsdk/quantumleap:rc"
  Normal   Killing    54m (x100 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Killing container with id docker://quantumleap:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Successfully pulled image "smartsdk/quantumleap:rc"
  Normal   Created    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Created container
  Normal   Started    53m (x101 over 4d21h)     kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Started container
  Warning  Unhealthy  53m (x3 over 4d21h)       kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Liveness probe failed: Get http://172.20.44.1:8668/v2/health: dial tcp 172.20.44.1:8668: connect: connection refused
  Warning  Unhealthy  8m50s (x1127 over 4d21h)  kubelet, ip-172-20-60-68.eu-central-1.compute.internal  Readiness probe failed: Get http://172.20.44.1:8668/v2/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
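
(For context: the events above show the kubelet killing the container because the HTTP liveness probe on /v2/health times out. One mitigation is a more tolerant probe; the sketch below reuses the path and port from the events, but the numbers are purely illustrative, not a project recommendation.)

  livenessProbe:
    httpGet:
      path: /v2/health
      port: 8668
      scheme: HTTP
    initialDelaySeconds: 60   # let QL finish starting before the first probe
    periodSeconds: 30         # probe every 30 seconds
    timeoutSeconds: 30        # tolerate slow /v2/health responses
    failureThreshold: 5       # restart only after 5 consecutive failures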

@chicco785
Contributor

I believe this was solved by allowing for the yellow state of the Crate cluster.

@pooja1pathak
Collaborator

@c0c0n3 We are facing this issue in our Kubernetes deployment of QuantumLeap with the WQ configuration. We have used the liveness probe settings below in both deployment files, for quantumleap and quantumleap-wq:

        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 180
          periodSeconds: 60
          successThreshold: 1
          httpGet:
            path: /health
            port: 8668
            scheme: HTTP
          timeoutSeconds: 60

Please find our observations below:

  1. The quantumleap pod does not restart.
  2. The quantumleap-wq pod restarts many times.
  3. The liveness probe fails for the quantumleap-wq pod.
  4. The health API returns status:pass.

Liveness probe failed for quantumleap-wq:
[screenshot: health check output showing the liveness probe failure]

We have checked Crate health in our environment, and it is GREEN.
If we remove the livenessProbe from the quantumleap-wq deployment file, then the pod does not restart.

Please confirm our understanding: a livenessProbe is not required in the quantumleap-wq deployment file.

pooja1pathak reopened this on Oct 28, 2024
@pooja1pathak
Collaborator

@c0c0n3 we have the following observations on why the livenessProbe is not working with the quantumleap-wq deployment file:

Two deployments of QuantumLeap are running in our environment: one for the master and the other for the worker.

In the master deployment file we can have a livenessProbe which calls QuantumLeap's health API and restarts the pod if any error occurs.

The worker QuantumLeap, however, can only handle the notify API, as mentioned in https://github.com/orchestracities/ngsi-timeseries-api/blob/master/docs/manuals/admin/wq.md.

As per our understanding, QuantumLeap's health API cannot be executed on the worker QuantumLeap, so the probe returns a connection error and the pod restarts.
We can remove the livenessProbe from the worker QuantumLeap because the check is already handled in the master QuantumLeap, which verifies the Crate status.

Please correct my understanding if there is anything I am missing.
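
(For reference: if some form of probe on the worker pod is still wanted, an exec-based check that the worker process is alive avoids calling a Web API the worker does not serve. This is only a sketch; it assumes the worker image ships pgrep and that the worker's command line contains "rq", neither of which is confirmed in this thread.)

  livenessProbe:
    exec:
      command: ["pgrep", "-f", "rq"]   # passes while an RQ worker process is running
    initialDelaySeconds: 60
    periodSeconds: 60
    failureThreshold: 3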

@c0c0n3
Member Author

c0c0n3 commented Nov 21, 2024

hi @pooja1pathak :-)

> As per our understanding, QuantumLeap's health API cannot be executed on the worker QuantumLeap, so the probe returns a connection error and the pod restarts.

Correct. Each Worker process is a standalone RQ instance; there's no QL Web API there.

> We can remove the livenessProbe from the worker QuantumLeap

Yes. I suggest you start Workers using Supervisor with our config:

That will give you the reliability you're after, I guess.

More about it here:

Hope this helps!
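
(For illustration: one way to follow the Supervisor suggestion above is to have the worker container run its processes under supervisord, which restarts crashed workers inside the container itself, so no Kubernetes probe is needed. The fragment below is a hypothetical sketch: the config path is a placeholder and the image tag is only copied from the events earlier in this thread, not the project's actual setup.)

  containers:
    - name: quantumleap-wq
      image: smartsdk/quantumleap:rc   # tag copied from the events above
      # supervisord runs in the foreground (-n) as the container's main process
      # and restarts any worker process that dies, so no livenessProbe is set here.
      command: ["supervisord", "-n", "-c", "/etc/supervisord.conf"]   # config path is a placeholder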
