Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/component] - A component's health should also depend on health of units #5386

Open
VihasMakwana opened this issue Aug 29, 2024 · 4 comments
Labels
Team:Elastic-Agent Label for the Agent team

Comments

@VihasMakwana
Copy link
Contributor

Describe the enhancement:

  • As per current implementation, a component's health is determined by checkins/missed checkins.
    case checkin := <-comm.CheckinObserved():
    sendExpected := false
    changed := false
    if c.state.State == client.UnitStateStarting {
    // first observation after start set component to healthy
    c.state.State = client.UnitStateHealthy
    c.state.Message = fmt.Sprintf("Healthy: communicating with pid '%d'", c.proc.PID)
    changed = true
  • I believe this should also take all the individual units into the account.

Describe a specific use case for the enhancement or feature:

  • Consider the following image:
    Image
  • The beats-monitoring component is Healthy (state 2), but the monitoring unit is actually in a degraded state.
  • I believe the component should also be degraded.

What is the definition of done?

  • A component should also be considered degraded if any of the underlying units are degraded.
@VihasMakwana VihasMakwana added the Team:Elastic-Agent Label for the Agent team label Aug 29, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@VihasMakwana
Copy link
Contributor Author

Please share your thoughts!

@blakerouse
Copy link
Contributor

I think the original idea on the split health status of an overall component versus a unit was that we can see the component itself is healthy but the unit that is running is not.

I think in practice most users just review the component health and don't look at individual units. I think taking an aggregated approach of make the unit status reflect the component status does make sense.

I am +1 for this type of change, if done correctly. I don't want to lose the context of the component health, so maybe we need to have two status levels for a component. The overall health of the component (including the aggregation of the units) and then a single health state for the components communication with the Elastic Agent.

@cmacknz
Copy link
Member

cmacknz commented Sep 3, 2024

I think in practice most users just review the component health and don't look at individual units. I think taking an aggregated approach of make the unit status reflect the component status does make sense.

From a user perspective I agree. The only case that worries me is the upgrade watcher:

// agent is healthy; but a component might not be healthy
// upgrade tracks unhealthy component as an issue with the upgrade
var errs []error
for _, comp := range state.Components {
if comp.State == client.Failed {
errs = append(errs, fmt.Errorf("component %s[%v] failed: %s", comp.Name, comp.ID, comp.Message))
}

If we made this change today, and had a single failed unit set the component state to failed, agent would begin rolling back upgrades because of unit level errors. We need to decide if this is behavior we want. My preference is to leave this unchanged so we ignore unit errors when deciding to roll back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

No branches or pull requests

4 participants