-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal:add UnhealthyNodeNames feild in HyperNode status and common HyperNodeConditionType #155
base: network-topology-dev
Are you sure you want to change the base?
Proposal:add UnhealthyNodeNames feild in HyperNode status and common HyperNodeConditionType #155
Conversation
Signed-off-by: fishingfly <zhouyuzf@163.com>
Signed-off-by: fishingfly <zhouyuzf@163.com>
Welcome @fishingfly! |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @Monokaix |
|
||
const ( | ||
// HyperNodeSystemFailureType means the switch or tor has system issues, such as CPU or memory overload, abnormal power or fan status, or any other critical system malfunctions. | ||
HyperNodeSystemFailureType HyperNodeConditionType = "HyperNodeSystemFailure " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These condition types are under HyperNode, so maybe "SystemFailure" is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And there is a redundant space at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These condition types are under HyperNode, so maybe "SystemFailure" is ok.
like this ?
- HyperNodeSystemFailureType HyperNodeConditionType = "HyperNodeSystemFailure "
- HyperNodeNetworkUnavailableType HyperNodeConditionType = "HyperNodeNetworkUnavailable"
+SystemFailureType HyperNodeConditionType = "SystemFailure"
+NetworkUnavailableType HyperNodeConditionType = "NetworkUnavailable"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah.
Got it.
@@ -123,6 +134,9 @@ type RegexMatch struct { | |||
|
|||
// HyperNodeStatus represents the observed state of a HyperNode. | |||
type HyperNodeStatus struct { | |||
// UnhealthyNodeNames is a list of node names that have unhealthy RDMA (Remote Direct Memory Access) NICs, | |||
// such as those experiencing high bit error rates (BER) or flapping issues or any other issues on node nic. | |||
UnhealthyNodeNames []string `json:"unhealthyNodeNames,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also show the unhealthy reason or message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also show the unhealthy reason or message?
Ok, Add a new struct UnhealthyNode
to set node name and unhealthy reason. @Monokaix
// HyperNodeStatus represents the observed state of a HyperNode.
type HyperNodeStatus struct {
+ // UnhealthyNodes is a list of nodes that have unhealthy RDMA (Remote Direct Memory Access) NICs,
+ // such as those experiencing high bit error rates (BER) or flapping issues or any other issues on node nic.
+ UnhealthyNodes []UnhealthyNode `json:"unhealthyNodes,omitempty"`
// Conditions provide details about the current state of the HyperNode.
Conditions []metav1.Condition `json:"conditions,omitempty"`
// NodeCount is the total number of nodes currently in the HyperNode.
// +kubebuilder:validation:Minimum=0
NodeCount int64 `json:"nodeCount,omitempty"`
}
+ // UnhealthyNode represents unhealthy node name and unhealthy reason
+type UnhealthyNode struct {
+ // Name represents the name of the unhealthy node.
+ Name string `json:"name,omitempty"`
+ // Reason describes the specific issue affecting the node's NIC or any other related problems.
+ Reason string `json:"reason,omitempty"`
+}
…tring Signed-off-by: fishingfly <zhouyuzf@163.com>
… detail Signed-off-by: fishingfly <zhouyuzf@163.com>
What is the problem you're trying to solve
Spec
andStatus
fields on the HyperNode. Currently, theStatus
field only containsConditions
andNodeCount
.Conditions
field.Conditions
field reflects the overall status of the HyperNode, which may be connected to multiple nodes. However, there is no existing mechanism to indicate the health status of individual nodes. This lack of granularity prevents the scheduler from accurately identifying and handling unhealthy nodes.HyperNodeStatus.Conditions
.Describe the solution you'd like
Status
, such asUnhealthyNodeNames
, to explicitly list nodes that are currently unschedulable under the given HyperNode.HyperNodeSystemFailure
: Indicates a system-level issue on the switch or tor, such as CPU or memory overload, power failure, fan malfunction, or other critical system faults.HyperNodeNetworkUnavailable
: Indicates a network-related issue on the switch or tor, such as abnormal link status, interface failures, or other network disruptions.Expected HyperNode Structure
The final HyperNode status should incorporate these enhancements to provide granular node-level health insights and common switch condition types, improving scheduler awareness and resource allocation efficiency.
The final expected HyperNode is as follows: