Delay in determining a KVM host is down #10477

Closed
rajujith opened this issue Feb 27, 2025 · 2 comments

Comments

@rajujith

problem

With the default configuration, CloudStack takes 15-20 minutes to determine that a KVM host is down. HA-enabled instances are started on another host only after this determination. While reviewing the host-state investigation that follows a ping timeout, I see one command, 'com.cloud.agent.api.CheckOnHostCommand', that takes about 10 minutes before the log prints the message 'timed out after 3600'. After that, the host is quickly determined to be down via the neighbouring host.

I suspect there is an issue in this specific implementation; if it is fixed, the VM HA delay on KVM could be reduced by 10 minutes.

2025-01-28 06:22:30,041 DEBUG [c.c.a.t.Request] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Sending  { Cmd , MgmtId: 32988184186020, via: 2(ref-trl-5786-k-Mu22-jithin-raju-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"439751ba-a6eb-3103-b60d-8321f53224fb-LibvirtComputingResource","privateNetwork":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"},"storageNetwork1":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"}},"reportCheckFailureIfOneStorageIsDown":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-01-28 06:32:14,792 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Waiting some more time because this is the current command
2025-01-28 06:32:14,792 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Waiting some more time because this is the current command
2025-01-28 06:32:14,792 WARN  [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Timed out on Seq 2-4979573812988215360:  { Cmd , MgmtId: 32988184186020, via: 2(ref-trl-5786-k-Mu22-jithin-raju-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"439751ba-a6eb-3103-b60d-8321f53224fb-LibvirtComputingResource","privateNetwork":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"},"storageNetwork1":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"}},"reportCheckFailureIfOneStorageIsDown":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-01-28 06:32:14,793 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Cancelling.
2025-01-28 06:32:14,793 WARN  [c.c.a.m.AgentManagerImpl] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Operation timed out: Commands 4979573812988215360 to Host 2 timed out after 3600
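
For reference, the wait between the "Sending" line and the "Timed out" line can be computed directly from the two timestamps in the excerpt above. This is only a small sketch; the helper name `elapsed_seconds` is made up for illustration and is not part of CloudStack.

```python
from datetime import datetime

# Timestamp layout of the management server log lines quoted above
# (milliseconds follow the comma).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t_start = datetime.strptime(start, LOG_TS_FORMAT)
    t_end = datetime.strptime(end, LOG_TS_FORMAT)
    return (t_end - t_start).total_seconds()


sent = "2025-01-28 06:22:30,041"       # CheckOnHostCommand sent
timed_out = "2025-01-28 06:32:14,792"  # AgentAttache reports the timeout

wait = elapsed_seconds(sent, timed_out)
print(f"{wait:.3f} s (~{wait / 60:.1f} min)")  # prints "584.751 s (~9.7 min)"
```

So the command blocked for just under 10 minutes, which matches the delay described above.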

https://gist.github.com/rajujith/9a51c52163eb4862b497057a40e8b812#file-acs-kvm-vm-ha-host-down

versions

4.19.1.3

The steps to reproduce the bug

  1. Power off the host via ILO/IDRAC or power off the nested hypervisor through the base hypervisor.
  2. Observe the delay in VM HA and review the logs

...

What to do about it?

Reduce the delay in the VM HA on KVM.

@DaanHoogland
Contributor

@rajujith, can you try with the latest version? There are configurable timeouts for agent commands.

@rajujith
Author

@DaanHoogland I tried the latest version and there is a significant improvement: the default timeout of CheckOnHostCommand is now set to 20 seconds, resulting in a timeout of 40 seconds. I tried a few configurations and the results are below. I used the commands.timeout from PR #9659.

  1. Default settings: about 5 minutes.

  2. commands.wait=CheckHealthCommand=5,CheckOnHostCommand=5: about 2 minutes, 30 seconds.

  3. ping.interval=30,ping.timeout=2,commands.wait=CheckHealthCommand=5,CheckOnHostCommand=5: about 2 minutes.
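
To make the per-command values in configurations 2 and 3 easier to read, here is a minimal sketch that splits such a value into a name-to-seconds map. It assumes the value is a comma-separated list of `CommandName=seconds` pairs, exactly as written above; `parse_command_timeouts` is a hypothetical helper for illustration and is not CloudStack's own parser.

```python
def parse_command_timeouts(value: str) -> dict[str, int]:
    """Split 'CmdA=5,CmdB=10' into {'CmdA': 5, 'CmdB': 10}.

    Assumption: the format is a comma-separated list of
    CommandName=seconds pairs, as shown in the comment above.
    """
    timeouts: dict[str, int] = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma or empty value
        name, sep, seconds = pair.partition("=")
        if not sep:
            raise ValueError(f"expected CommandName=seconds, got {pair!r}")
        timeouts[name.strip()] = int(seconds)
    return timeouts


tested_value = "CheckHealthCommand=5,CheckOnHostCommand=5"
for command, seconds in parse_command_timeouts(tested_value).items():
    print(f"{command}: {seconds} s")
```

With the value from configuration 2 this lists a 5-second entry for both CheckHealthCommand and CheckOnHostCommand.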

cc: @harikrishna-patnala
