OMD 5.10 shows very different gearman worker status #157

Open
infraweavers opened this issue Mar 2, 2023 · 15 comments

Comments

@infraweavers
Contributor

Hello,

So since our OMD 4.40 -> OMD 5.10 upgrade we've been seeing occasions where our gearman server appears to have large numbers of running or waiting checks. On investigation, the behaviour of service checks through gearman is very different under OMD 5.10. To do some diagnostics we've downgraded one of our OMD boxes to OMD 4.60, but we have "transplanted" the version of mod_gearman_worker-go and the epn into the 4.60 box so we're not running into ConSol-Monitoring/mod-gearman-worker-go#19. This has the added benefit of exonerating mod_gearman_worker-go, which is nice. I'm leaning towards there being a change in naemon-core.

OMD config:

    omd config set GEARMAND on
    omd config set GEARMAND_PORT 0.0.0.0:4730
    omd config set GEARMAN_WORKER on
    omd config set LIVESTATUS_TCP on
    omd config set LIVESTATUS_TCP_PORT 6557
    omd config set MOD_GEARMAN on
    omd config set PNP4NAGIOS gearman
    omd config set THRUK_COOKIE_AUTH off
    omd config set GRAFANA on

Graph of /omd/sites/default/lib/monitoring-plugins/check_gearman -H OMD101.man.cwserverfarm.local -W 501 -C 750 -w 501 -c 750 where we can see the differing behaviour.

image
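
For reference, the queue figures that check_gearman graphs can also be read straight from gearmand's text admin protocol. This is a minimal sketch, not the plugin's own code. It assumes gearmand answers the `status` command on the port configured above (4730) with tab-separated `function, total, running, available_workers` lines terminated by a lone `.`. The helper names `parse_status` and `fetch_status` and the sample numbers are made up for illustration.

```python
import socket


def parse_status(text):
    """Parse gearmand 'status' output into {queue: (waiting, running, workers)}."""
    queues = {}
    for line in text.splitlines():
        line = line.strip()
        if line == ".":          # a lone dot terminates the response
            break
        if not line:
            continue
        name, total, running, workers = line.split("\t")
        total, running, workers = int(total), int(running), int(workers)
        # 'total' counts queued plus running jobs, so waiting = total - running
        queues[name] = (total - running, running, workers)
    return queues


def fetch_status(host, port=4730, timeout=5.0):
    """Send 'status' to gearmand and return the raw text response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:        # server closed the connection early
                break
            data += chunk
    return data.decode("ascii", errors="replace")


if __name__ == "__main__":
    # Illustrative sample; replace with fetch_status("<your gearmand host>")
    sample = "service\t620\t120\t150\nhost\t3\t3\t150\n.\n"
    for name, (waiting, running, workers) in parse_status(sample).items():
        print(f"{name}: waiting={waiting} running={running} workers={workers}")
```

Polling this at a fixed interval on both a 4.60 and a 5.10 box would give a side-by-side series of waiting and running jobs per queue.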

@infraweavers
Contributor Author

The Load Average (not that it means much) is also significantly higher under 5.10.
image

I'll keep digging and see what else shows up. We did notice that the core scheduling graph also looks "weird" under 5.10 compared to 4.60 (much spikier and not as even), however it's difficult to get a side-by-side comparison on that. I'll see what turns up.

@sni
Contributor

sni commented Mar 2, 2023

try disabling embedded perl in the etc/mod-gearman/worker.cfg. I noticed an issue yesterday in the epn connector if the plugin output exceeds 8kb.

@infraweavers
Contributor Author

> try disabling embedded perl in the etc/mod-gearman/worker.cfg. I noticed an issue yesterday in the epn connector if the plugin output exceeds 8kb.

Cool, we'll give that a shot on an un-touched 5.10

@sni
Contributor

sni commented Mar 2, 2023

yeah, but wait till tomorrow, still working on that fix.

@infraweavers
Contributor Author

Hmm, I disabled embedded perl yesterday (about where the red line is); can't really see a difference so far:
image

@sni
Contributor

sni commented Mar 3, 2023

today's daily looks fine. epn should run much smoother now.

@infraweavers
Contributor Author

Cool, I'll build one of our boxes onto that and give it a test

@infraweavers
Contributor Author

Hmm, I would say it doesn't look massively different at "big scale":
image

On the 1 week scale you can see where we upgraded to the nightly build (red line). It does arguably look a little bit better, maybe?
image

@infraweavers
Contributor Author

OK so we've downgraded one of them to OMD4.60 as well to see if we can narrow it down. It looks like the change in behaviour is between 4.60 and 5.10

image

@sni
Contributor

sni commented Mar 10, 2023

could you try the latest OMD daily, it should work quite well now. I also added something in the gearman neb module to flatten out the number of concurrent started checks.

@infraweavers
Contributor Author

> could you try the latest OMD daily, it should work quite well now. I also added something in the gearman neb module to flatten out the number of concurrent started checks.

Yep we'll do that on Monday

@infraweavers
Contributor Author

We've just rolled out omd-5.11.20230314-labs-edition onto one of the servers to test that now

@infraweavers
Contributor Author

infraweavers commented Mar 17, 2023

So from what we can see, it seems to be improved but not really back to where it was in 4.60. I think we will have to increase the workers to see if that will remove some of the noise and pressure that we're seeing. We do also keep getting pnp4nagios errors with the interval being too short between updates (similar to #156 but for other checks). We have decreased the pnp_gearman_worker down to 1 to eliminate a race condition there and it still does it, so we're thinking that something is running the same check back-to-back, as it were.

This sort of feels to us like checks aren't being run at regular intervals under 5+ (most of our checks are once per minute). We're going to investigate whether we have evidence to support that assertion, but it certainly feels like that's what's going on.
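
One way to gather that kind of evidence is to compare the gaps between consecutive executions of each check against the configured interval. This is a rough sketch. It assumes the execution times have already been extracted (for example from logs or Livestatus) into `(check_name, unix_timestamp)` pairs. The function name, the 60 second interval, the 25% tolerance, and the sample data are all illustrative.

```python
from collections import defaultdict


def irregular_gaps(executions, expected=60.0, tolerance=0.25):
    """Return {check: [gap, ...]} for gaps that deviate from `expected`
    by more than `tolerance` (a fraction of the expected interval)."""
    by_check = defaultdict(list)
    for name, ts in executions:
        by_check[name].append(ts)

    flagged = {}
    for name, stamps in by_check.items():
        stamps.sort()
        # Gap between each execution and the one before it
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        bad = [g for g in gaps if abs(g - expected) > expected * tolerance]
        if bad:
            flagged[name] = bad
    return flagged


if __name__ == "__main__":
    # Made-up sample: the second check ran twice within 5 seconds
    sample = [
        ("web01;HTTP", 0), ("web01;HTTP", 60), ("web01;HTTP", 121),
        ("db01;Load", 0), ("db01;Load", 5), ("db01;Load", 65),
    ]
    print(irregular_gaps(sample))
```

A check that really is being run back-to-back would show up here with gaps far below the expected interval.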

@infraweavers
Contributor Author

So we looked into the naemon suspicions there and have found absolutely no evidence to support the idea that checks are being run more frequently than they should be. We have therefore bumped our thresholds up from 500 to 2500 for the time being whilst we try to ascertain whether the change is actually a problem for gearman/OMD etc. or not.

@sni
Contributor

sni commented Jun 22, 2023

btw, load average might seem to increase if you use check_load in the scaled-by-cpu mode. check_load now has a scaled_load perf counter, and the previous "scaled" metric now reports the absolute value. So it might be that the cpu usage did not increase at all, but the check_load check now reports different numbers.
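
To illustrate the difference between the two figures: a CPU-scaled load is the absolute load average divided by the number of CPUs. This is a minimal sketch, not check_load's actual code, and the function name is an assumption made for the example.

```python
import os


def scaled_load(load_averages, cpu_count):
    """Divide each absolute load average by the CPU count."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in load_averages)


if __name__ == "__main__":
    absolute = os.getloadavg()        # (1 min, 5 min, 15 min); Unix only
    cpus = os.cpu_count() or 1
    print("absolute:", absolute)
    print("scaled:  ", scaled_load(absolute, cpus))
```

On a multi-core box the absolute numbers are several times the scaled ones, so a graph that switches from one to the other will show a jump even when the actual load is unchanged.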
