Add catboost integration tests #17931
base: branch-25.04
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
/ok to test
/ok to test
common:
  - output_types: conda
    packages:
      # TODO: Remove numpy pinning once https://github.com/catboost/catboost/issues/2671 is resolved
See this paragraph from the NumPy 2 release notes:

Breaking changes to the NumPy ABI. As a result, binaries of packages that use the NumPy C API and were built against a NumPy 1.xx release will not work with NumPy 2.0. On import, such packages will see an ImportError with a message about binary incompatibility.
/ok to test
For the reviewer: These were just for testing. I'll remove before I merge.
/ok to test
Giving you a ci-codeowners / packaging-codeowners approval because the description says that this is just bringing back tests that already used to exist, and that's a net gain for test coverage here.
But please do see my suggestions about more thoroughly testing the CatBoost integration.
@@ -0,0 +1,128 @@
# Copyright (c) 2023-2025, NVIDIA CORPORATION.
Suggested change:
- # Copyright (c) 2023-2025, NVIDIA CORPORATION.
+ # Copyright (c) 2025, NVIDIA CORPORATION.
This is a brand new file, shouldn't the copyright date only be 2025? Or was it copied from somewhere else?
Copied and pasted from another test; this should be 2025.
model = CatBoostRegressor(iterations=10, verbose=0)
model.fit(X.values, y.values)
predictions = model.predict(X.values)
return predictions
Sorry in advance, I'm not that familiar with these tests, but... I'm surprised to see pytest test cases with a return statement. What is the interaction between these test cases and this line a few lines up?

pytestmark = pytest.mark.assert_eq(fn=assert_catboost_equal)

Did you mean for there to be some kind of testing assertion here? Or does that custom marker somehow end up invoking that function and comparing the output of the test case with pandas inputs to its output with cudf inputs?
The assertion function is used to check that results from "cudf.pandas on" and "cudf.pandas off" are equal. The logic to handle that is in the conftest file.
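For readers unfamiliar with the harness, here is a rough, simplified sketch of how a conftest.py hook could pick up that marker and apply the comparison function. It is not the actual conftest logic; names such as RESULTS_DIR and the pickle-per-test layout are assumptions for illustration only.

# Simplified sketch, not the real conftest.py: the suite is run once with
# cudf.pandas enabled and once without; each test's return value is saved,
# and the function passed to the assert_eq marker compares the two results.
import inspect
import pickle
from pathlib import Path

import pytest

RESULTS_DIR = Path("integration_results")  # hypothetical storage location


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "assert_eq(fn): compare test results across cudf.pandas runs"
    )


@pytest.hookimpl(tryfirst=True)
def pytest_pyfunc_call(pyfuncitem):
    marker = pyfuncitem.get_closest_marker("assert_eq")
    if marker is None:
        return None  # not one of ours; let pytest call the test normally

    # call the test ourselves so we can capture its return value
    testfunction = pyfuncitem.obj
    argnames = inspect.signature(testfunction).parameters
    got = testfunction(**{name: pyfuncitem.funcargs[name] for name in argnames})

    result_file = RESULTS_DIR / (pyfuncitem.nodeid.replace("/", "-").replace("::", ".") + ".pkl")
    if result_file.exists():
        # second run: compare against the result saved by the first run
        expect = pickle.loads(result_file.read_bytes())
        marker.kwargs["fn"](expect, got)
    else:
        # first run: stash the result for the comparison run
        RESULTS_DIR.mkdir(exist_ok=True)
        result_file.write_bytes(pickle.dumps(got))
    return True  # tell pytest we handled the call

Because pytestmark = pytest.mark.assert_eq(fn=assert_catboost_equal) is set at module level, the marker (and therefore the comparison) applies to every test in the catboost test module.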
def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
    )
make_classification() returns a dataset that has only continuous features.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=10, n_classes=2, random_state=42
)
X
array([[-1.14052601, 1.35970566, 0.86199147, 0.84609208, 0.60600995,
        -1.55662917, 1.75479418, 1.69645637, -1.28042935, -2.08192941],
       ...
For catboost in particular, I strongly suspect you'll get better effective test coverage of this integration by including some categorical features.
Encoding and decoding categorical features is critical to how CatBoost works (docs), and there are lots of things that have to go exactly right when providing pandas-like categorical input. Basically, everything here: https://pandas.pydata.org/docs/user_guide/categorical.html
I really think you should provide an input dataset that has some categorical features, ideally in 2 forms:
- integer-type columns
- pandas.Categorical-type columns
And ideally with varying cardinality.
You could consider adapting this code used in xgboost's tests: https://github.com/dmlc/xgboost/blob/105aa4247abb3ce787be2cef2f9beb4c24b30049/demo/guide-python/categorical.py#L29
And here are some docs on how to tell CatBoost which features are categorical: https://catboost.ai/docs/en/concepts/python-usages-examples#class-with-array-like-data-with-numerical,-categorical-and-embedding-features
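As a hedged illustration of this suggestion (not code from the PR; the helper name, column names, and cardinalities are invented), a fixture could mix continuous features with an integer-coded categorical column and a pandas.Categorical column, and tell CatBoost which columns are categorical via cat_features:

# Illustrative sketch only; assumes catboost, pandas, numpy, and scikit-learn are installed.
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification


def make_categorical_classification(n_samples=1_000, random_state=42):
    rng = np.random.default_rng(random_state)
    X_num, y = make_classification(
        n_samples=n_samples, n_features=5, n_classes=2, random_state=random_state
    )
    X = pd.DataFrame(X_num, columns=[f"num_{i}" for i in range(X_num.shape[1])])
    # low-cardinality categorical feature encoded as plain integers
    X["cat_int"] = rng.integers(0, 3, size=n_samples)
    # higher-cardinality categorical feature as a pandas.Categorical column
    X["cat_pd"] = pd.Categorical(
        rng.choice([f"level_{i}" for i in range(20)], size=n_samples)
    )
    return X, pd.Series(y, name="target")


X, y = make_categorical_classification()
model = CatBoostClassifier(iterations=50, verbose=0)
# cat_features marks which DataFrame columns CatBoost should treat as categorical
model.fit(X, y, cat_features=["cat_int", "cat_pd"])
predictions = model.predict(X)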
@pytest.fixture
def classification_data():
    X, y = make_classification(
        n_samples=100, n_features=10, n_classes=2, random_state=42
Suggested change:
- n_samples=100, n_features=10, n_classes=2, random_state=42
+ n_samples=1_000, n_features=10, n_classes=2, random_state=42
You may want to use slightly more data, here and in regression_data(). There are some types of encoding and data access bugs that will only show up in certain codepaths in CatBoost that are exercised when there are enough splits per tree.
I've seen this before in LightGBM and XGBoost... someone will write a test that fits on a very small dataset and it'll look like nothing went wrong, only to later find that actually the dataset was so small that the model was just a collection of decision stumps (no splits), and so the test could never catch issues like "this encoding doesn't preserve NAs" or "these outputs are different because of numerical precision issues".
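Purely as an illustration of that point (not from the PR; the fixture shape and threshold are arbitrary assumptions), one way to confirm a fitted model is more than a pile of constant-prediction stumps is to check that several features end up with non-zero importance, since a feature only gets non-zero importance if it was actually used in a split:

# Illustrative sketch: with ~1,000 samples the trees get enough splits that
# encoding or precision bugs would actually change the compared predictions.
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_features=10, random_state=42)
model = CatBoostRegressor(iterations=10, depth=6, verbose=0)
model.fit(X, y)

# more than one feature with non-zero importance implies real splits were made
importances = model.get_feature_importance()
assert (importances > 0).sum() > 1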
Thanks for the suggestions on improving these tests @jameslamb! This library is new to me, so I appreciate the time you took to investigate some of the APIs. I think what's in this PR is a good starting point, but I agree with your suggestions, so I'll include them in a follow-up PR. I think I'll also ask others offline who are more familiar with CatBoost/XGBoost for their suggestions.
Description
Part of #17490. This PR adds back the catboost integration tests, which were originally added in #17267 but were later removed due to ABI incompatibility between the version of numpy that catboost is compiled against and the version of numpy installed in the test environment. This PR restores those tests and pins a compatible numpy version for the catboost test environment.
Checklist