Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataplane: Add gRPC health check #182

Merged
merged 2 commits into from
Feb 12, 2024

Conversation

Shunpoco
Copy link
Contributor

Currently our dataplane server doesn't have health check, so we can't confirm whether the dataplane daemonset runs properly or not.

This PR introduce health check for gRPC to the dataplane server, and add liveness, readiness, and startup probe for the daemonset.

Fixes: #51

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 30, 2024
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 30, 2024
@Shunpoco Shunpoco force-pushed the add-dataplane-health-check branch from 94020b5 to c2836d4 Compare January 30, 2024 00:09
@shaneutt shaneutt force-pushed the add-dataplane-health-check branch from c2836d4 to c3da400 Compare January 30, 2024 13:20
Copy link
Member

@shaneutt shaneutt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for adding this @Shunpoco! 🥳

One thing we'll need to add is somewhere in the test a simple verification. I think right after deploying the dataplane just a requires.Eventually() that ensures we reached the expected state would be suffiient for now.

@Shunpoco
Copy link
Contributor Author

Shunpoco commented Jan 31, 2024

@shaneutt
Let me ask about the verification for readiness of the dataplane container.
WaitForBlixtReadiness already checks if the dataplane pod is ready at least in one node here.
I think before adding the readiness probe, k8s decided that the dataplane pod is ready if a container in the pod is not down (it ignores whether a container shows error or not).
After adding the readiness probe, k8s decides if the dataplane pod is ready using the probe.
So the correctness of the result of WaitForBlixtReadiness will be improved by the readiness probe, but I'm wondering we should add more verification or current implementation is enough.

@shaneutt
Copy link
Member

@shaneutt Let me ask about the verification for readiness of the dataplane container. WaitForBlixtReadiness already checks if the dataplane pod is ready at least in one node here. I think before adding the readiness probe, k8s decided that the dataplane pod is ready if a container in the pod is not down (it ignores whether a container shows error or not). After adding the readiness probe, k8s decides if the dataplane pod is ready using the probe. So the correctness of the result of WaitForBlixtReadiness will be improved by the readiness probe, but I'm wondering we should add more verification or current implementation is enough.

To me this seems like enough of an iterative improvement for now that we can move forward with this, see how it works out, and maybe down the road considering if we want to do more.

@shaneutt
Copy link
Member

shaneutt commented Feb 5, 2024

Hey @Shunpoco just checking in, everything going OK? Need any assistance?

@Shunpoco
Copy link
Contributor Author

Shunpoco commented Feb 5, 2024

@shaneutt
Thanks for the ping!
I think once @aryan9600 reviews this change and approves it, this PR is ready for merge 😄

@shaneutt
Copy link
Member

shaneutt commented Feb 5, 2024

@Shunpoco, did you see my comment above about adding a test? Are you comfortable with adding a test.Eventually() somewhere to the existing integration tests that verifies that the health checks are present and that the expected state is eventually reached? Let us know if you have any questions or confusion on that, the test suite is mostly pretty straightforward, but as I understand it, it has some trouble running on non-Linux systems.

@Shunpoco
Copy link
Contributor Author

Shunpoco commented Feb 5, 2024

@shaneutt
OK, let me clarify, I think I misunderstood.
In my comment, I wanted to say that WaitForBlixtReadiness already checks whether pod is ready, so we don't need to put requires.Eventually() for the readiness probe.
But do you want me to add an additional codes for check the pod's status?

@shaneutt
Copy link
Member

shaneutt commented Feb 6, 2024

@shaneutt OK, let me clarify, I think I misunderstood. In my comment, I wanted to say that WaitForBlixtReadiness already checks whether pod is ready, so we don't need to put requires.Eventually() for the readiness probe. But do you want me to add an additional codes for check the pod's status?

Yeah, my thinking was this: you're right that the current tests effectively exercise the health checks, but nothing is verifying their presence or configuration. A simple couple line check could help us to avoid regressions (though we would hope they would never happen!) or mistakes in test configuration later on and just double check that the health checks are present, and are configured as intended. Let me know what you think?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 6, 2024
@Shunpoco
Copy link
Contributor Author

Shunpoco commented Feb 6, 2024

@shaneutt

I got it. I will add the verification.

I pushed da1be5f as a draft of the verification (I will blush up after I confirm we have the same view).

  • I put waitForDataplaneReadiness function in suite_test.go. The function does: 1) check if the dataplane pod has readinessprobe (if not, return an error), and then 2) check if the all dataplane pods are ready.
  • Call the function in TestMain after the dataplane is deployed.

Note: Because TestMain doesn't take testing.T, I don't use require.Eventually.

Does this change suit for your intention?

@shaneutt
Copy link
Member

shaneutt commented Feb 7, 2024

Yep, this looks good. Might seem a little superfluous but I appreciate you adding it as a documentation of intent, and on the off chance the health checks fail, now we can fail at a very specific place with specific logging which will be easier for debugging.

I'll mark this draft temporarily since you said you wanted to do a couple more things, please mark as ready and call me in for a final review whenever you're ready. Very much appreciate your contribution!

@shaneutt shaneutt marked this pull request as draft February 7, 2024 13:47
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 7, 2024
@Shunpoco Shunpoco force-pushed the add-dataplane-health-check branch from da1be5f to c7fc302 Compare February 7, 2024 23:59
@Shunpoco Shunpoco marked this pull request as ready for review February 8, 2024 00:01
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 8, 2024
@Shunpoco
Copy link
Contributor Author

Shunpoco commented Feb 8, 2024

@shaneutt
Now this PR is ready for review! please check c7fc302.

@shaneutt shaneutt self-assigned this Feb 8, 2024
@shaneutt shaneutt added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 8, 2024
Copy link
Member

@shaneutt shaneutt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

Couple minor comments to sort out then I think we're ready to 🚢

Copy link
Member

@aryan9600 aryan9600 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good, just have a couple of nits. also it'd be great if you could squash all your commits into 1-2 concise commits, thanks!

Shunpoco and others added 2 commits February 10, 2024 12:06
Currently our dataplane server doesn't have health check, so we can't
confirm whether the dataplane daemonset runs properly or not.

This PR introduce health check for gRPC to the dataplane server, and
add liveness, readiness, and startup probe for the daemonset.

Signed-off-by: Shunsuke Tokunaga <tkngsnsk313320@gmail.com>
Co-authored-by: Shane Utt <shane@shaneutt.com>
This commit adds a validation function for dataplane.
The function first checks whethre a dataplane pod has readiness probe setting. Then it checks all dataplane pods' readiness status. If all pods are ready, it finishes with no error.

Signed-off-by: Shunsuke Tokunaga <tkngsnsk313320@gmail.com>
Co-authored-by: Shane Utt <shane@shaneutt.com>
Co-authored-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
@Shunpoco Shunpoco force-pushed the add-dataplane-health-check branch from e2aa15d to d1351f4 Compare February 10, 2024 12:09
Copy link
Member

@aryan9600 aryan9600 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
Thanks @Shunpoco 🎖️

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2024
@shaneutt shaneutt self-requested a review February 12, 2024 15:42
Copy link
Member

@shaneutt shaneutt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aryan9600, shaneutt, Shunpoco

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaneutt shaneutt merged commit f62c215 into kubernetes-sigs:main Feb 12, 2024
11 checks passed
@Shunpoco Shunpoco deleted the add-dataplane-health-check branch February 12, 2024 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

Healthchecks for dataplane API
4 participants