Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate categorical matrix into data setup and fix downstream errors #233

Merged
merged 21 commits into from
Jul 6, 2020

Conversation

ElizabethSantorellaQC
Copy link
Contributor

@ElizabethSantorellaQC ElizabethSantorellaQC commented Jun 25, 2020

  • Added storage="cat" CLI option to create a splitmatrix with dense and categorical parts. Involves moving OneHotEncoder transformations downstream, from create.py to problems.py
  • Improve test coverage for categorical functionality
  • Visualize sparsity and performance for split sandwich
  • Extend this so the matrix can have mixed sparse, dense, and categorical parts
  • Add a threshold so that only categorical matrices of a certain size get stored as categorical rather than sparse or dense
  • Moved one-hot encoding out of data/create.py and moved it into problems.py so it can be either one
  • Test CategoricalMatrix.tocsc function
  • Cythonized function for categorical sandwiched with other categorical
  • Fixed bug where check_array doesn't work on a CategoricalMatrix
  • Added tests for a SplitMatrix with multiple categorical components
  • Added golden master tests for different storage formats (for [Major] Golden master tests: Storage and weights vs. offset shouldn't change results #164 )
  • Continue to speed up slow spots, like categorical sandwiched with dense
  • Cythonize row and column limiting for categorical sandwiches (currently just have rows)
  • Make documentation clearer
  • Make sandwich functions methods of CategoricalMatrix for clarity

@ElizabethSantorellaQC ElizabethSantorellaQC changed the title Incorporate categorical matrix into data setup and fix downstream errors [WIP] Incorporate categorical matrix into data setup and fix downstream errors Jun 25, 2020
@ElizabethSantorellaQC ElizabethSantorellaQC changed the title [WIP] Incorporate categorical matrix into data setup and fix downstream errors Incorporate categorical matrix into data setup and fix downstream errors Jul 6, 2020
README.md Outdated
We support several types of matrix storage, passed with the argument "--storage".
"dense" is the default. "sparse" stores data as a csc sparse matrix. "cat" splits
the matrix into a dense component and categorical components. "split0.1" splits the
matrix into sparse and dense parts, where
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dangling clause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no u


if not other.flags["C_CONTIGUOUS"]:
warnings.warn(
"""CategoricalMatrix._cross_dense(other, ...) is optimized for the case
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make an issue for implementing a F-ordered _cross_dense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this doesn't seem to give much of a performance hit. I would implement the F-ordered version simply because taking your X from a pandas dataframe will give you an F-ordered matrix, which is a pretty important use case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -89,7 +91,10 @@ def getcol(self, i: int):
mult = None
if self.mult is not None:
mult = [self.mult[i]]
return StandardizedMat(self.mat.getcol(i), [self.shift[i]], mult)
col = self.mat.getcol(i)
if isinstance(col, sps.csc_matrix) and not isinstance(col, MatrixBase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CategoricalMatrix.getcol returns a csc sparse matrix

Copy link
Collaborator

@tbenthompson tbenthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to go. Can you make sure to either make a new PR or an issue for the remaining bits and pieces? I'd like to get this merged in sooner rather than later though so we don't have a huge PR.

Copy link
Member

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code seems to be great. I did some profiling with a wide problem (encoding 'Density' x 'DrivAge' as a categorical, giving you 1,716 OHE columns). Performance is good but no much better than sparse matrix. Currently the main bottleneck is cross_sandwich: MKLSparse - Dense is much faster than Categorical - Dense. One point in favor of the CategoricalMatrix is the simplification and speedup of the data pipeline.

@@ -225,7 +230,7 @@ def __getitem__(self, item):
if isinstance(row, int):
out = mat_part.A
if mult_part is not None:
out *= mult_part
out = out * mult_part

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this solving an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if out is of int dtype and mult is of float dtype, the *= won't work.

Pn: str, P: Problem, expected_all: dict,
):
@pytest.mark.parametrize("storage", ["cat", "split0.1", "sparse", "dense"])
def test_gm_benchmarks(Pn: str, P: Problem, expected_all: dict, storage: str):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage is not used

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I would use the artificial golden master tests to do this. Benchmarks golden masters are much slower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and we already have this for test_golden_master.py,

@@ -225,7 +230,7 @@ def __getitem__(self, item):
if isinstance(row, int):
out = mat_part.A
if mult_part is not None:
out *= mult_part
out = out * mult_part

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply curious: why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if out is of int dtype and mult is of float dtype, the *= won't work.


if not other.flags["C_CONTIGUOUS"]:
warnings.warn(
"""CategoricalMatrix._cross_dense(other, ...) is optimized for the case

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this doesn't seem to give much of a performance hit. I would implement the F-ordered version simply because taking your X from a pandas dataframe will give you an F-ordered matrix, which is a pretty important use case.

@ElizabethSantorellaQC
Copy link
Contributor Author

@MarcAntoineSchmidtQC I want to add a test to ensure we get the same answers with a categorical setup into test_golden_master.py, but this currently isn't possible because the data setup drops one column from each categorical, which we don't currently support. And adding the dropped column back in gives a singular matrix error. OK to defer that to a later PR? glm_benchmarks_run shows that the categorical setup is not affecting the results.

@ElizabethSantorellaQC ElizabethSantorellaQC merged commit db6826d into master Jul 6, 2020
@ElizabethSantorellaQC ElizabethSantorellaQC deleted the incorporate_cat_mat branch July 6, 2020 22:24
tbenthompson pushed a commit that referenced this pull request Oct 8, 2021
…ors (#233)

* Running into a bug with duplicate columns

* incorporated categorical into splitmatrix in problems.py

* Categorical-categorical sandwich

* Speedups for categorical x categorical sandwich; 'cat' CLI storage option

* Added dense-categorical Cython sandwich code, but it's very slow

* Updated readme

* Incorporated row limits into Cython stuff

* Fixed L_cols bug

* Cross-sandwich tests and refactoring

* Removed CategoricalMatrix.col_mult since we don't do _scale_cols_inplace anymore

* Fixed example

* Added visualizations to matrix readme

* Removed unused code; fixed int8

* Minor suggestions from PR

* Removed reference to file not included

Co-authored-by: Elizabeth Santorella <elizabeth.santorella@gmail.com>
tbenthompson pushed a commit that referenced this pull request Oct 8, 2021
…ors (#233)

* Running into a bug with duplicate columns

* incorporated categorical into splitmatrix in problems.py

* Categorical-categorical sandwich

* Speedups for categorical x categorical sandwich; 'cat' CLI storage option

* Added dense-categorical Cython sandwich code, but it's very slow

* Updated readme

* Incorporated row limits into Cython stuff

* Fixed L_cols bug

* Cross-sandwich tests and refactoring

* Removed CategoricalMatrix.col_mult since we don't do _scale_cols_inplace anymore

* Fixed example

* Added visualizations to matrix readme

* Removed unused code; fixed int8

* Minor suggestions from PR

* Removed reference to file not included

Co-authored-by: Elizabeth Santorella <elizabeth.santorella@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants