Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal:add UnhealthyNodeNames feild in HyperNode status and common HyperNodeConditionType #155

Open
wants to merge 4 commits into
base: network-topology-dev
Choose a base branch
from

Conversation

fishingfly
Copy link

What is the problem you're trying to solve

  • HyperNode functions similarly to a switch or tor, requiring switch vendors to update both the Spec and Status fields on the HyperNode. Currently, the Status field only contains Conditions and NodeCount.
  • When an RDMA network card issue occurs on a node connected to the leaf switch or tor —such as high BER (Bit Error Rate) or link flapping (as discussed in the paper "RDMA over Ethernet for Distributed Training at Meta Scale")—vendors can only update the network card status in the Conditions field.
  • The Conditions field reflects the overall status of the HyperNode, which may be connected to multiple nodes. However, there is no existing mechanism to indicate the health status of individual nodes. This lack of granularity prevents the scheduler from accurately identifying and handling unhealthy nodes.
  • Since HyperNode essentially functions as a switch or tor, it is also necessary to introduce standard switch condition types for vendors to use in HyperNodeStatus.Conditions.

Describe the solution you'd like

  • Introduce a new field under Status, such as UnhealthyNodeNames, to explicitly list nodes that are currently unschedulable under the given HyperNode.
  • Define two common switch condition types to standardize switch health reporting:
    • HyperNodeSystemFailure: Indicates a system-level issue on the switch or tor, such as CPU or memory overload, power failure, fan malfunction, or other critical system faults.
    • HyperNodeNetworkUnavailable: Indicates a network-related issue on the switch or tor, such as abnormal link status, interface failures, or other network disruptions.

Expected HyperNode Structure

The final HyperNode status should incorporate these enhancements to provide granular node-level health insights and common switch condition types, improving scheduler awareness and resource allocation efficiency.

The final expected HyperNode is as follows:

apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  creationTimestamp: "2025-02-05T09:35:50Z"
  generation: 2
  name: leaf1
  resourceVersion: "341389665"
  uid: 0be6f513-0c58-4845-97e9-7da84f04a4d4
spec:
  members:
  - selector:
      exactMatch:
        name: worker-28
    type: Node
  - selector:
      exactMatch:
        name: worker-29
    type: Node
  - selector:
      exactMatch:
        name: worker-30
    type: Node
  tier: 1
status:
  conditions:
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: There are network-related problems with the switch
    reason: OPTICAL_LINK_SUBHEALTH_FourHundredGigE1_0_9
    status: "True"
    type: HyperNodeNetworkUnavailable
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: The switch is healthy
    reason: SwitchSystemIsHealthy
    status: "False"
    type: HyperNodeSystemFailure
  unhealthyNodeNames:
    - worker-28
    - worker-29
  nodeCount: 3

Signed-off-by: fishingfly <zhouyuzf@163.com>
Signed-off-by: fishingfly <zhouyuzf@163.com>
@volcano-sh-bot
Copy link
Collaborator

Welcome @fishingfly!

It looks like this is your first PR to volcano-sh/apis.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign kevin-wangzefeng
You can assign the PR to them by writing /assign @kevin-wangzefeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yeahdongcn
Copy link

yeahdongcn commented Feb 13, 2025

Hi @Monokaix
As we discussed offline on Tuesday, this is the initial proposal.

@fishingfly fishingfly changed the title Proposal:add UnhealthyNodeNames feild in HyperNode status and common SwitchConditionType Proposal:add UnhealthyNodeNames feild in HyperNode status and common HyperNodeConditionType Feb 13, 2025

const (
// HyperNodeSystemFailureType means the switch or tor has system issues, such as CPU or memory overload, abnormal power or fan status, or any other critical system malfunctions.
HyperNodeSystemFailureType HyperNodeConditionType = "HyperNodeSystemFailure "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These condition types are under HyperNode, so maybe "SystemFailure" is ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there is a redundant space at the end.

Copy link
Author

@fishingfly fishingfly Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These condition types are under HyperNode, so maybe "SystemFailure" is ok.
like this ?

- HyperNodeSystemFailureType HyperNodeConditionType = "HyperNodeSystemFailure "
- HyperNodeNetworkUnavailableType HyperNodeConditionType = "HyperNodeNetworkUnavailable"
+SystemFailureType HyperNodeConditionType = "SystemFailure"
+NetworkUnavailableType HyperNodeConditionType = "NetworkUnavailable"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.

Got it.

@@ -123,6 +134,9 @@ type RegexMatch struct {

// HyperNodeStatus represents the observed state of a HyperNode.
type HyperNodeStatus struct {
// UnhealthyNodeNames is a list of node names that have unhealthy RDMA (Remote Direct Memory Access) NICs,
// such as those experiencing high bit error rates (BER) or flapping issues or any other issues on node nic.
UnhealthyNodeNames []string `json:"unhealthyNodeNames,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also show the unhealthy reason or message?

Copy link
Author

@fishingfly fishingfly Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also show the unhealthy reason or message?

Ok, Add a new struct UnhealthyNode to set node name and unhealthy reason. @Monokaix

// HyperNodeStatus represents the observed state of a HyperNode.
type HyperNodeStatus struct {
+	// UnhealthyNodes is a list of nodes that have unhealthy RDMA (Remote Direct Memory Access) NICs,
+	// such as those experiencing high bit error rates (BER) or flapping issues or any other issues on node nic.
+	UnhealthyNodes []UnhealthyNode `json:"unhealthyNodes,omitempty"`
	// Conditions provide details about the current state of the HyperNode.
	Conditions []metav1.Condition `json:"conditions,omitempty"`

	// NodeCount is the total number of nodes currently in the HyperNode.
	// +kubebuilder:validation:Minimum=0
	NodeCount int64 `json:"nodeCount,omitempty"`
}

+ // UnhealthyNode represents unhealthy node name and unhealthy reason
+type UnhealthyNode struct {
+	// Name represents the name of the unhealthy node.
+	Name string `json:"name,omitempty"`
+	// Reason describes the specific issue affecting the node's NIC or any other related problems.
+	Reason string `json:"reason,omitempty"`
+}

…tring

Signed-off-by: fishingfly <zhouyuzf@163.com>
… detail

Signed-off-by: fishingfly <zhouyuzf@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants