[pkg/component] - A component's health should also depend on health of units #5386

VihasMakwana · 2024-08-29T14:55:19Z

Describe the enhancement:

In the current implementation, a component's health is determined by check-ins and missed check-ins.

elastic-agent/pkg/component/runtime/command.go

Lines 185 to 192 in fd477ec

    case checkin := <-comm.CheckinObserved():
        sendExpected := false
        changed := false
        if c.state.State == client.UnitStateStarting {
            // first observation after start set component to healthy
            c.state.State = client.UnitStateHealthy
            c.state.Message = fmt.Sprintf("Healthy: communicating with pid '%d'", c.proc.PID)
            changed = true

I believe this should also take all the individual units into account.

Describe a specific use case for the enhancement or feature:

Consider the following image:
The beats-monitoring component is Healthy (state 2), but the monitoring unit is actually in a degraded state.
I believe the component should also be degraded.

What is the definition of done?

A component should also be considered degraded if any of the underlying units are degraded.
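The definition of done above could be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: `UnitState`, its constants, and `componentState` are hypothetical stand-ins for the real `client.UnitState` values, and the rule only covers the degraded case described in this issue.

```go
package main

import "fmt"

// UnitState is a hypothetical stand-in for client.UnitState.
// The names and values are assumptions made for this sketch.
type UnitState int

const (
	StateHealthy UnitState = iota
	StateDegraded
	StateFailed
)

// componentState returns the state to report for a component.
// observed is the state derived from check-ins; units holds the
// states reported by the component's units.
func componentState(observed UnitState, units []UnitState) UnitState {
	// Only a healthy component is downgraded; any other state
	// derived from check-ins is reported unchanged.
	if observed != StateHealthy {
		return observed
	}
	for _, u := range units {
		if u == StateDegraded {
			return StateDegraded
		}
	}
	return observed
}

func main() {
	units := []UnitState{StateHealthy, StateDegraded}
	fmt.Println(componentState(StateHealthy, units) == StateDegraded) // prints "true"
}
```

How failed units should be treated is deliberately left out of this sketch, since the issue only asks for the degraded case.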


elasticmachine · 2024-08-29T14:55:37Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

VihasMakwana · 2024-08-29T14:55:54Z

Please share your thoughts!

blakerouse · 2024-08-30T15:46:03Z

I think the original idea on the split health status of an overall component versus a unit was that we can see the component itself is healthy but the unit that is running is not.

I think in practice most users just review the component health and don't look at individual units. I think taking an aggregated approach of making the unit status reflect the component status does make sense.

I am +1 for this type of change, if done correctly. I don't want to lose the context of the component health, so maybe we need to have two status levels for a component: the overall health of the component (including the aggregation of the units), and a single health state for the component's communication with the Elastic Agent.
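One way to picture the two status levels is the sketch below. Everything in it is hypothetical: `ComponentHealth`, its fields, and the `State` ordering (a larger value means less healthy) are assumptions made for illustration and do not exist in the agent today.

```go
package main

import "fmt"

// State is a hypothetical health enum; the ordering
// Healthy < Degraded < Failed is an assumption of this sketch.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// ComponentHealth keeps both levels side by side so that the
// communication context is not lost when units are aggregated.
type ComponentHealth struct {
	Comm    State // health of the component's communication with the Elastic Agent
	Overall State // Comm combined with the states of the component's units
}

// newComponentHealth computes Overall as the least healthy of the
// communication state and all unit states.
func newComponentHealth(comm State, units []State) ComponentHealth {
	overall := comm
	for _, u := range units {
		if u > overall {
			overall = u
		}
	}
	return ComponentHealth{Comm: comm, Overall: overall}
}

func main() {
	h := newComponentHealth(Healthy, []State{Healthy, Degraded})
	fmt.Println(h.Comm == Healthy, h.Overall == Degraded) // prints "true true"
}
```

With this shape, a user-facing view could show `Overall`, while code that only cares about whether the process is checking in could keep reading `Comm`.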

cmacknz · 2024-09-03T17:15:56Z

I think in practice most users just review the component health and don't look at individual units. I think taking an aggregated approach of making the unit status reflect the component status does make sense.

From a user perspective I agree. The only case that worries me is the upgrade watcher:

elastic-agent/internal/pkg/agent/application/upgrade/watcher.go

Lines 222 to 228 in d8bdd71

    // agent is healthy; but a component might not be healthy
    // upgrade tracks unhealthy component as an issue with the upgrade
    var errs []error
    for _, comp := range state.Components {
        if comp.State == client.Failed {
            errs = append(errs, fmt.Errorf("component %s[%v] failed: %s", comp.Name, comp.ID, comp.Message))
        }

If we made this change today, and had a single failed unit set the component state to failed, agent would begin rolling back upgrades because of unit-level errors. We need to decide whether this is the behavior we want. My preference is to leave this unchanged so we ignore unit errors when deciding to roll back.
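A rough sketch of that preference, assuming a component exposed a communication-level state separately from an aggregated one: the `Component` type, its `Comm` and `Overall` fields, the `upgradeErrors` helper, and the component ID `filestream-default` are all hypothetical and are not part of the watcher's current code.

```go
package main

import "fmt"

// State is a hypothetical health enum used only in this sketch.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// Component is a hypothetical view of a component's state.
type Component struct {
	ID      string
	Comm    State // state derived from check-ins only
	Overall State // state that also aggregates unit health
	Message string
}

// upgradeErrors collects errors that should count against an upgrade.
// It looks only at Comm, so unit-level failures reflected in Overall
// do not trigger a rollback.
func upgradeErrors(components []Component) []error {
	var errs []error
	for _, comp := range components {
		if comp.Comm == Failed {
			errs = append(errs, fmt.Errorf("component %s failed: %s", comp.ID, comp.Message))
		}
	}
	return errs
}

func main() {
	comps := []Component{
		{ID: "beats-monitoring", Comm: Healthy, Overall: Failed, Message: "unit failed"},
		{ID: "filestream-default", Comm: Failed, Overall: Failed, Message: "process exited"},
	}
	for _, err := range upgradeErrors(comps) {
		fmt.Println(err) // prints "component filestream-default failed: process exited"
	}
}
```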

VihasMakwana added the Team:Elastic-Agent Label for the Agent team label Aug 29, 2024
