Incorporate categorical matrix into data setup and fix downstream errors #233

ElizabethSantorellaQC · 2020-06-25T22:55:52Z

tbenthompson · 2020-07-06T16:39:28Z

README.md

+We support several types of matrix storage, passed with the argument "--storage". 
+"dense" is the default. "sparse" stores data as a csc sparse matrix. "cat" splits
+the matrix into a dense component and categorical components. "split0.1" splits the
+matrix into sparse and dense parts, where 


dangling clause

tbenthompson · 2020-07-06T16:43:01Z

src/quantcore/glm/matrix/categorical_matrix.py

+
+        if not other.flags["C_CONTIGUOUS"]:
+            warnings.warn(
+                """CategoricalMatrix._cross_dense(other, ...) is optimized for the case


Can you make an issue for implementing an F-ordered _cross_dense?

Testing this doesn't seem to give much of a performance hit. I would implement the F-ordered version simply because taking your X from a pandas dataframe will give you an F-ordered matrix, which is a pretty important use case.
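The memory-order check discussed in this thread can be sketched with plain NumPy. This is a minimal illustration, not the PR's code: `ensure_c_contiguous` is a hypothetical helper, and the warn-and-copy fallback is an assumption about one way to handle F-ordered input in the absence of a dedicated F-ordered kernel.

```python
import warnings

import numpy as np


def ensure_c_contiguous(other: np.ndarray) -> np.ndarray:
    """Return `other` in C order, warning if a copy was needed (hypothetical helper)."""
    if not other.flags["C_CONTIGUOUS"]:
        warnings.warn("input is not C-contiguous; copying to C order")
        return np.ascontiguousarray(other)
    return other


c_ordered = np.arange(6, dtype=float).reshape(3, 2)  # row-major layout
f_ordered = np.asfortranarray(c_ordered)  # same values, column-major layout

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same = ensure_c_contiguous(c_ordered)  # no warning, no copy
    converted = ensure_c_contiguous(f_ordered)  # warns and copies
```

The copy costs one extra pass over the data, which is the overhead an F-ordered implementation would avoid.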

src/quantcore/glm/matrix/dense_glm_matrix.py

tbenthompson · 2020-07-06T16:44:51Z

src/quantcore/glm/matrix/standardized_mat.py

@@ -89,7 +91,10 @@ def getcol(self, i: int):
        mult = None
        if self.mult is not None:
            mult = [self.mult[i]]
-        return StandardizedMat(self.mat.getcol(i), [self.shift[i]], mult)
+        col = self.mat.getcol(i)
+        if isinstance(col, sps.csc_matrix) and not isinstance(col, MatrixBase):


When is this necessary?

CategoricalMatrix.getcol returns a csc sparse matrix

tbenthompson

Looks good to go. Can you make sure to either make a new PR or an issue for the remaining bits and pieces? I'd like to get this merged in sooner rather than later though so we don't have a huge PR.

MarcAntoineSchmidtQC

Code seems to be great. I did some profiling with a wide problem (encoding 'Density' x 'DrivAge' as a categorical, giving you 1,716 OHE columns). Performance is good but not much better than with a sparse matrix. Currently the main bottleneck is cross_sandwich: MKLSparse - Dense is much faster than Categorical - Dense. One point in favor of the CategoricalMatrix is the simplification and speedup of the data pipeline.
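To make the Categorical - Dense bottleneck concrete, here is a rough NumPy sketch. It assumes "cross sandwich" means the product A^T diag(d) B, as the term is commonly used in GLM code. Because each row of a one-hot matrix holds a single 1, the product reduces to a scatter-add over the integer category codes. The function name, arguments, and data are hypothetical, and this is not the Cython implementation from the PR.

```python
import numpy as np


def cat_dense_cross_sandwich(codes, n_levels, d, dense):
    """Compute one_hot(codes).T @ diag(d) @ dense without building the one-hot matrix."""
    out = np.zeros((n_levels, dense.shape[1]))
    # Row i of the weighted dense matrix is added to the output row of its category.
    np.add.at(out, codes, d[:, None] * dense)
    return out


rng = np.random.default_rng(0)
n_rows, n_levels, n_dense = 50, 4, 3
codes = rng.integers(0, n_levels, size=n_rows)
d = rng.random(n_rows)
dense = rng.random((n_rows, n_dense))

fast = cat_dense_cross_sandwich(codes, n_levels, d, dense)

# Reference result using an explicit one-hot matrix.
one_hot = np.zeros((n_rows, n_levels))
one_hot[np.arange(n_rows), codes] = 1.0
reference = one_hot.T @ (d[:, None] * dense)
```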

MarcAntoineSchmidtQC · 2020-07-06T17:15:50Z

src/quantcore/glm/matrix/standardized_mat.py

@@ -225,7 +230,7 @@ def __getitem__(self, item):
        if isinstance(row, int):
            out = mat_part.A
            if mult_part is not None:
-                out *= mult_part
+                out = out * mult_part


Is this solving an issue?

Yeah, if out is of int dtype and mult is of float dtype, the *= won't work.
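A small NumPy sketch of the failure mode described here, using made-up arrays in place of `mat_part.A` and `mult_part`: the in-place multiply would have to cast the float result back into the integer array, which NumPy refuses under its default casting rule, while the out-of-place form allocates a new float array.

```python
import numpy as np

out = np.array([1, 2, 3])  # integer dtype
mult = np.array([0.5, 1.5, 2.0])  # float dtype

try:
    out *= mult  # in-place: the float result cannot be stored in an int array
    in_place_failed = False
except TypeError:
    in_place_failed = True

scaled = out * mult  # out-of-place: result is a new float array
```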

MarcAntoineSchmidtQC · 2020-07-06T17:22:38Z

tests/sklearn_fork/test_benchmark_golden_master.py

-    Pn: str, P: Problem, expected_all: dict,
-):
+@pytest.mark.parametrize("storage", ["cat", "split0.1", "sparse", "dense"])
+def test_gm_benchmarks(Pn: str, P: Problem, expected_all: dict, storage: str):


storage is not used

Also, I would use the artificial golden master tests to do this. Benchmark golden masters are much slower.

Ah, and we already have this for test_golden_master.py.

MarcAntoineSchmidtQC · 2020-07-06T17:27:20Z

src/quantcore/glm/matrix/standardized_mat.py

@@ -225,7 +230,7 @@ def __getitem__(self, item):
        if isinstance(row, int):
            out = mat_part.A
            if mult_part is not None:
-                out *= mult_part
+                out = out * mult_part


Simply curious: why?

If out is of int dtype and mult is of float dtype, the *= won't work.

MarcAntoineSchmidtQC · 2020-07-06T17:33:31Z

src/quantcore/glm/matrix/categorical_matrix.py

+
+        if not other.flags["C_CONTIGUOUS"]:
+            warnings.warn(
+                """CategoricalMatrix._cross_dense(other, ...) is optimized for the case


Testing this doesn't seem to give much of a performance hit. I would implement the F-ordered version simply because taking your X from a pandas dataframe will give you an F-ordered matrix, which is a pretty important use case.

ElizabethSantorellaQC · 2020-07-06T21:25:43Z

@MarcAntoineSchmidtQC I want to add a test to test_golden_master.py to ensure we get the same answers with a categorical setup, but this currently isn't possible because the data setup drops one column from each categorical, which we don't currently support. And adding the dropped column back in gives a singular matrix error. OK to defer that to a later PR? glm_benchmarks_run shows that the categorical setup is not affecting the results.
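One common reason a fully one-hot-encoded categorical produces a singular matrix is collinearity: its indicator columns sum to a column of ones, which duplicates an intercept (or the column sum of another fully encoded categorical). The thread does not say which columns collide in this case, so the following is only a sketch of that general mechanism, assuming an intercept column and made-up category codes.

```python
import numpy as np

codes = np.array([0, 1, 2, 0, 1, 2, 2, 0])  # made-up category codes
n_rows, n_levels = codes.size, 3

one_hot = np.zeros((n_rows, n_levels))
one_hot[np.arange(n_rows), codes] = 1.0
intercept = np.ones((n_rows, 1))

# All levels kept: the indicator columns sum to the intercept column.
full = np.hstack([intercept, one_hot])
# First level dropped: the remaining columns are linearly independent.
dropped = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_dropped = np.linalg.matrix_rank(dropped)
```

Here `full` has four columns but rank three, so `full.T @ full` cannot be inverted, while `dropped` has full column rank.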

…ors (#233)

* Running into a bug with duplicate columns
* incorporated categorical into splitmatrix in problems.py
* Categorical-categorical sandwich
* Speedups for categorical x categorical sandwich; 'cat' CLI storage option
* Added dense-categorical Cython sandwich code, but it's very slow
* Updated readme
* Incorporated row limits into Cython stuff
* Fixed L_cols bug
* Cross-sandwich tests and refactoring
* Removed CategoricalMatrix.col_mult since we don't do _scale_cols_inplace anymore
* Fixed example
* Added visualizations to matrix readme
* Removed unused code; fixed int8
* Minor suggestions from PR
* Removed reference to file not included

Co-authored-by: Elizabeth Santorella <elizabeth.santorella@gmail.com>

esantorella added 2 commits June 25, 2020 18:13

Running into a bug with duplicate columns

185fd21

incorporated categorical into splitmatrix in problems.py

bcf23d9

ElizabethSantorellaQC changed the title ~~Incorporate categorical matrix into data setup and fix downstream errors~~ [WIP] Incorporate categorical matrix into data setup and fix downstream errors Jun 25, 2020

esantorella added 14 commits June 26, 2020 11:43

Categorical-categorical sandwich

bd8610a

Speedups for categorical x categorical sandwich; 'cat' CLI storage option

e206346

Added dense-categorical Cython sandwich code, but it's very slow

01711b9

Merge remote-tracking branch 'origin' into incorporate_cat_mat

a477f85

Merge remote-tracking branch 'origin' into incorporate_cat_mat

1b75693

Updated readme

d83585f

Incorporated row limits into Cython stuff

116058d

Fixed L_cols bug

4b6a5f5

Cross-sandwich tests and refactoring

f7b78d8

Merged in master

4828edd

Removed CategoricalMatrix.col_mult since we don't do _scale_cols_inplace anymore

80b5da3

Merge remote-tracking branch 'origin' into incorporate_cat_mat

00f6555

Fixed example

4bc2255

Added visualizations to matrix readme

a3c1cda

ElizabethSantorellaQC changed the title ~~[WIP] Incorporate categorical matrix into data setup and fix downstream errors~~ Incorporate categorical matrix into data setup and fix downstream errors Jul 6, 2020

tbenthompson reviewed Jul 6, 2020

tbenthompson reviewed Jul 6, 2020

tbenthompson approved these changes Jul 6, 2020

MarcAntoineSchmidtQC approved these changes Jul 6, 2020

esantorella added 2 commits July 6, 2020 16:04

Removed unused code; fixed int8

d227cc8

Merge remote-tracking branch 'origin' into incorporate_cat_mat

1cba329

ElizabethSantorellaQC mentioned this pull request Jul 6, 2020

[Major] Categorical matrix optimizations #246

Closed

8 tasks

esantorella added 2 commits July 6, 2020 17:27

Minor suggestions from PR

dbf6398

Merge remote-tracking branch 'origin' into incorporate_cat_mat

ea8bdba

Removed reference to file not included

ee0bd10

ElizabethSantorellaQC merged commit db6826d into master Jul 6, 2020

ElizabethSantorellaQC deleted the incorporate_cat_mat branch July 6, 2020 22:24
