Add catboost integration tests #17931
base: branch-25.04
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
/ok to test
/ok to test
common:
  - output_types: conda
    packages:
      # TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
See this paragraph from the NumPy 2 release notes:

Breaking changes to the NumPy ABI. As a result, binaries of packages that use the NumPy C API and were built against a NumPy 1.xx release will not work with NumPy 2.0. On import, such packages will see an ImportError with a message about binary incompatibility.
/ok to test
For the reviewer: These were just for testing. I'll remove before I merge.
/ok to test
Giving you a ci-codeowners / packaging-codeowners approval because the description says that this is just bringing back tests that already used to exist, and that's a net gain for test coverage here.
But please do see my suggestions about more thoroughly testing the CatBoost integration.
@@ -0,0 +1,128 @@
# Copyright (c) 2023-2025, NVIDIA CORPORATION.
Suggested change:
- # Copyright (c) 2023-2025, NVIDIA CORPORATION.
+ # Copyright (c) 2025, NVIDIA CORPORATION.
This is a brand new file, shouldn't the copyright date only be 2025? Or was it copied from somewhere else?
Copied and pasted from another test; this should be 2025.
model = CatBoostRegressor(iterations=10, verbose=0)
model.fit(X.values, y.values)
predictions = model.predict(X.values)
return predictions
Sorry in advance, I'm not that familiar with these tests, but... I'm surprised to see pytest test cases with a return statement. What is the interaction between these test cases and this line a few lines up?

pytestmark = pytest.mark.assert_eq(fn=assert_catboost_equal)

Did you mean for there to be some kind of testing assertion here? Or does that custom marker somehow end up invoking that function and comparing the output of the test case with pandas inputs to its output with cudf inputs?
The assertion function is used to check that results from "cudf.pandas on" and "cudf.pandas off" are equal. The logic to handle that is in the conftest file.
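For readers unfamiliar with the harness, here is a rough, simplified sketch of how a conftest.py hook could pick up that marker and apply the comparison function. It is not the actual conftest logic; names such as RESULTS_DIR and the pickle-per-test layout are assumptions for illustration only.

# Simplified sketch, not the real conftest.py: the suite is run once with
# cudf.pandas enabled and once without; each test's return value is saved,
# and the function passed to the assert_eq marker compares the two results.
import inspect
import pickle
from pathlib import Path

import pytest

RESULTS_DIR = Path("integration_results")  # hypothetical storage location


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "assert_eq(fn): compare test results across cudf.pandas runs"
    )


@pytest.hookimpl(tryfirst=True)
def pytest_pyfunc_call(pyfuncitem):
    marker = pyfuncitem.get_closest_marker("assert_eq")
    if marker is None:
        return None  # not one of ours; let pytest call the test normally

    # call the test ourselves so we can capture its return value
    testfunction = pyfuncitem.obj
    argnames = inspect.signature(testfunction).parameters
    got = testfunction(**{name: pyfuncitem.funcargs[name] for name in argnames})

    result_file = RESULTS_DIR / (pyfuncitem.nodeid.replace("/", "-").replace("::", ".") + ".pkl")
    if result_file.exists():
        # second run: compare against the result saved by the first run
        expect = pickle.loads(result_file.read_bytes())
        marker.kwargs["fn"](expect, got)
    else:
        # first run: stash the result for the comparison run
        RESULTS_DIR.mkdir(exist_ok=True)
        result_file.write_bytes(pickle.dumps(got))
    return True  # tell pytest we handled the call

Because pytestmark = pytest.mark.assert_eq(fn=assert_catboost_equal) is set at module level, the marker (and therefore the comparison) applies to every test in the catboost test module.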
def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
    )
make_classification() returns a dataset that has only continuous features.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=10, n_classes=2, random_state=42
)
X
array([[-1.14052601, 1.35970566, 0.86199147, 0.84609208, 0.60600995,
        -1.55662917, 1.75479418, 1.69645637, -1.28042935, -2.08192941],
       ...
For catboost in particular, I strongly suspect you'll get better effective test coverage of this integration by including some categorical features.
Encoding and decoding categorical features is critical to how CatBoost works (docs), and there are lots of things that have to go exactly right when providing pandas-like categorical input. Basically, everything here: https://pandas.pydata.org/docs/user_guide/categorical.html
I really think you should provide an input dataset that has some categorical features, ideally in 2 forms:
- integer-type columns
- pandas.Categorical-type columns
And ideally with varying cardinality.
You could consider adapting this code used in xgboost's tests: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L29
And here are some docs on how to tell CatBoost which features are categorical: https://catboost.ai/docs/en/concepts/python-usages-examples#class-with-array-like-data-with-numerical,-categorical-and-embedding-features
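As a hedged illustration of this suggestion (not code from the PR; the helper name, column names, and cardinalities are invented), a fixture could mix continuous features with an integer-coded categorical column and a pandas.Categorical column, and tell CatBoost which columns are categorical via cat_features:

# Illustrative sketch only; assumes catboost, pandas, numpy, and scikit-learn are installed.
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification


def make_categorical_classification(n_samples=1_000, random_state=42):
    rng = np.random.default_rng(random_state)
    X_num, y = make_classification(
        n_samples=n_samples, n_features=5, n_classes=2, random_state=random_state
    )
    X = pd.DataFrame(X_num, columns=[f"num_{i}" for i in range(X_num.shape[1])])
    # low-cardinality categorical feature encoded as plain integers
    X["cat_int"] = rng.integers(0, 3, size=n_samples)
    # higher-cardinality categorical feature as a pandas.Categorical column
    X["cat_pd"] = pd.Categorical(
        rng.choice([f"level_{i}" for i in range(20)], size=n_samples)
    )
    return X, pd.Series(y, name="target")


X, y = make_categorical_classification()
model = CatBoostClassifier(iterations=50, verbose=0)
# cat_features marks which DataFrame columns CatBoost should treat as categorical
model.fit(X, y, cat_features=["cat_int", "cat_pd"])
predictions = model.predict(X)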
@pytest.fixture
def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
Suggested change:
- n_samples=100, n_features=10, n_classes=2, random_state=42
+ n_samples=1_000, n_features=10, n_classes=2, random_state=42
You may want to use slightly more data, here and in regression_data(). There are some types of encoding and data access bugs that will only show up in certain codepaths in CatBoost that are exercised when there are enough splits per tree.
I've seen this before in LightGBM and XGBoost... someone will write a test that fits on a very small dataset and it'll look like nothing went wrong, only to later find that actually the dataset was so small that the model was just a collection of decision stumps (no splits), and so the test could never catch issues like "this encoding doesn't preserve NAs" or "these outputs are different because of numerical precision issues".
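Purely as an illustration of that point (not from the PR; the fixture shape and threshold are arbitrary assumptions), one way to confirm a fitted model is more than a pile of constant-prediction stumps is to check that several features end up with non-zero importance, since a feature only gets non-zero importance if it was actually used in a split:

# Illustrative sketch: with ~1,000 samples the trees get enough splits that
# encoding or precision bugs would actually change the compared predictions.
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_features=10, random_state=42)
model = CatBoostRegressor(iterations=10, depth=6, verbose=0)
model.fit(X, y)

# more than one feature with non-zero importance implies real splits were made
importances = model.get_feature_importance()
assert (importances > 0).sum() > 1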
Thanks for the suggestions on improving these tests @jameslamb! This library is new to me, so I appreciate the time you took to investigate some of the APIs. I think what's in this PR is a good starting point, but I agree with your suggestions, so I'll include them in a follow-up PR. I think I'll also ask others offline who are more familiar with CatBoost/XGBoost for their suggestions.
Description
Part of #17490. This PR adds back the catboost integration tests, which were originally added in #17267 but were later removed due to ABI incompatibility between the version of numpy that catboost is compiled against and the version of numpy installed in the test environment. This PR restores those tests and pins a compatible numpy version for the catboost test environment.
Checklist