-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorporate categorical matrix into data setup and fix downstream errors #233
Conversation
README.md
Outdated
We support several types of matrix storage, passed with the argument "--storage". | ||
"dense" is the default. "sparse" stores data as a csc sparse matrix. "cat" splits | ||
the matrix into a dense component and categorical components. "split0.1" splits the | ||
matrix into sparse and dense parts, where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dangling clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no u
|
||
if not other.flags["C_CONTIGUOUS"]: | ||
warnings.warn( | ||
"""CategoricalMatrix._cross_dense(other, ...) is optimized for the case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make an issue for implementing a F-ordered _cross_dense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing this doesn't seem to give much of a performance hit. I would implement the F-ordered version simply because taking your X from a pandas dataframe will give you an F-ordered matrix, which is a pretty important use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -89,7 +91,10 @@ def getcol(self, i: int): | |||
mult = None | |||
if self.mult is not None: | |||
mult = [self.mult[i]] | |||
return StandardizedMat(self.mat.getcol(i), [self.shift[i]], mult) | |||
col = self.mat.getcol(i) | |||
if isinstance(col, sps.csc_matrix) and not isinstance(col, MatrixBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CategoricalMatrix.getcol
returns a csc sparse matrix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to go. Can you make sure to either make a new PR or an issue for the remaining bits and pieces? I'd like to get this merged in sooner rather than later though so we don't have a huge PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code seems to be great. I did some profiling with a wide problem (encoding 'Density' x 'DrivAge' as a categorical, giving you 1,716 OHE columns). Performance is good but no much better than sparse matrix. Currently the main bottleneck is cross_sandwich: MKLSparse - Dense is much faster than Categorical - Dense. One point in favor of the CategoricalMatrix is the simplification and speedup of the data pipeline.
@@ -225,7 +230,7 @@ def __getitem__(self, item): | |||
if isinstance(row, int): | |||
out = mat_part.A | |||
if mult_part is not None: | |||
out *= mult_part | |||
out = out * mult_part |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this solving an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if out is of int dtype and mult is of float dtype, the *=
won't work.
Pn: str, P: Problem, expected_all: dict, | ||
): | ||
@pytest.mark.parametrize("storage", ["cat", "split0.1", "sparse", "dense"]) | ||
def test_gm_benchmarks(Pn: str, P: Problem, expected_all: dict, storage: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I would use the artificial golden master tests to do this. Benchmarks golden masters are much slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, and we already have this for test_golden_master.py
,
@@ -225,7 +230,7 @@ def __getitem__(self, item): | |||
if isinstance(row, int): | |||
out = mat_part.A | |||
if mult_part is not None: | |||
out *= mult_part | |||
out = out * mult_part |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply curious: why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if out is of int dtype and mult is of float dtype, the *= won't work.
|
||
if not other.flags["C_CONTIGUOUS"]: | ||
warnings.warn( | ||
"""CategoricalMatrix._cross_dense(other, ...) is optimized for the case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing this doesn't seem to give much of a performance hit. I would implement the F-ordered version simply because taking your X from a pandas dataframe will give you an F-ordered matrix, which is a pretty important use case.
@MarcAntoineSchmidtQC I want to add a test to ensure we get the same answers with a categorical setup into |
…ors (#233) * Running into a bug with duplicate columns * incorporated categorical into splitmatrix in problems.py * Categorical-categorical sandwich * Speedups for categorical x categorical sandwich; 'cat' CLI storage option * Added dense-categorical Cython sandwich code, but it's very slow * Updated readme * Incorporated row limits into Cython stuff * Fixed L_cols bug * Cross-sandwich tests and refactoring * Removed CategoricalMatrix.col_mult since we don't do _scale_cols_inplace anymore * Fixed example * Added visualizations to matrix readme * Removed unused code; fixed int8 * Minor suggestions from PR * Removed reference to file not included Co-authored-by: Elizabeth Santorella <elizabeth.santorella@gmail.com>
…ors (#233) * Running into a bug with duplicate columns * incorporated categorical into splitmatrix in problems.py * Categorical-categorical sandwich * Speedups for categorical x categorical sandwich; 'cat' CLI storage option * Added dense-categorical Cython sandwich code, but it's very slow * Updated readme * Incorporated row limits into Cython stuff * Fixed L_cols bug * Cross-sandwich tests and refactoring * Removed CategoricalMatrix.col_mult since we don't do _scale_cols_inplace anymore * Fixed example * Added visualizations to matrix readme * Removed unused code; fixed int8 * Minor suggestions from PR * Removed reference to file not included Co-authored-by: Elizabeth Santorella <elizabeth.santorella@gmail.com>
create.py
toproblems.py
data/create.py
and moved it intoproblems.py
so it can be either one