
[Major] Add a class for efficient operations on categorical features (sparse/dense split done) #76

Closed
ElizabethSantorellaQC opened this issue Apr 23, 2020 · 10 comments
Assignees
Labels
performance (speed, memory, or accuracy) · this week's work (we're doing it now)

Comments

@ElizabethSantorellaQC
Contributor

One-hot encoding categorical variables generates matrices where all nonzero elements are 1, and there is only one nonzero element per row. It is possible to store these matrices with much less memory than a general sparse matrix and to operate on them more efficiently. We could improve performance a lot by adding a class that represents our data as a partitioned matrix composed of several one-hot encoded matrices (and perhaps also a dense block).
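As a hypothetical sketch (illustrative names, not this repo's eventual API): because each row has exactly one nonzero and it is always 1, the whole matrix can be stored as a single integer array of column codes, one per row, and a matrix-vector product becomes a simple gather.

```python
# Sketch of a compact one-hot matrix: one integer per row instead of the
# data/indices/indptr arrays of a general sparse matrix.
import numpy as np

class CategoricalMatrix:
    """Stores an (n_rows, n_categories) one-hot matrix in O(n_rows) memory."""

    def __init__(self, codes, n_categories):
        self.codes = np.asarray(codes, dtype=np.int64)  # column index per row
        self.shape = (len(self.codes), n_categories)

    def dot(self, vec):
        # (X @ v)[i] = v[codes[i]] -- a gather, no multiplications needed
        return np.asarray(vec)[self.codes]

    def todense(self):
        out = np.zeros(self.shape)
        out[np.arange(self.shape[0]), self.codes] = 1.0
        return out

X = CategoricalMatrix([0, 2, 1, 2], 3)
v = np.array([10.0, 20.0, 30.0])
# X.dot(v) -> [10., 30., 20., 30.]
```

Memory drops from three arrays (values, column indices, row pointers) to one index array, and the dot product does no arithmetic at all.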

@tbenthompson
Collaborator

Relatedly, we should modify the way sklearn interacts with sparse matrices so that it operates through something like the scipy.sparse.linalg.LinearOperator class. We need:

  1. A sandwich product function
  2. A matrix-vector function
  3. A function to get an individual column, for coordinate descent (CD).

Probably another few things that I'm forgetting.
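A rough illustration of those three operations (hypothetical names, sketched over a plain scipy.sparse matrix rather than any existing class):

```python
# Sketch of the minimal operator interface: sandwich product, matvec,
# and single-column access for coordinate descent.
import numpy as np
import scipy.sparse as sps

class MatrixOperator:
    def __init__(self, mat):
        self.mat = sps.csc_matrix(mat)

    def sandwich(self, d):
        # X.T @ diag(d) @ X -- the building block of an IRLS-style Hessian
        return (self.mat.T @ sps.diags(d) @ self.mat).toarray()

    def matvec(self, v):
        return self.mat @ v

    def get_col(self, j):
        # Single column as a 1-d array, for coordinate descent updates
        return self.mat[:, j].toarray().ravel()

X = MatrixOperator(np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]))
d = np.array([1.0, 1.0, 2.0])
H = X.sandwich(d)  # 2x2 dense result
```

Specialized subclasses (dense, sparse, categorical) can then each implement these methods in whatever way is fastest for their storage format.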

@tbenthompson
Collaborator

We might as well include dense matrices in this API too.

@ElizabethSantorellaQC
Contributor Author

Yeah, I think we'll need wrappers for np.ndarray and scipy.sparse matrices, since they already differ a bit in syntax and in the shapes returned by matrix operations. With wrappers, we could probably simplify downstream logic a lot (by removing many `if sparse...` branches).
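One concrete shape mismatch (a hypothetical wrapper sketch): reductions like `.mean(axis=0)` return a 2-d `np.matrix` for scipy.sparse inputs but a 1-d array for ndarrays, so downstream code ends up branching on the type.

```python
# Sketch of a thin wrapper that makes dense and sparse inputs return
# identically shaped results.
import numpy as np
import scipy.sparse as sps

class MatrixWrapper:
    def __init__(self, mat):
        self.mat = mat

    def col_means(self):
        # sparse .mean(axis=0) yields a 2-d np.matrix; normalize to 1-d
        return np.asarray(self.mat.mean(axis=0)).ravel()

    def matvec(self, v):
        return np.asarray(self.mat @ v).ravel()

A = np.array([[1.0, 2.0], [3.0, 4.0]])
dense = MatrixWrapper(A)
sparse = MatrixWrapper(sps.csr_matrix(A))
# dense.col_means() and sparse.col_means() now agree: [2., 3.]
```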

@tbenthompson tbenthompson changed the title Add a class for efficient operations on one-hot encoded categorical variables [Critical] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents Apr 26, 2020
@tbenthompson
Collaborator

The basics of a DataMatrix API are in PR #86.

@ElizabethSantorellaQC ElizabethSantorellaQC changed the title [Critical] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents [Major] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents Apr 30, 2020
@tbenthompson
Collaborator

See PR #105 for a first draft of a dense-sparse split matrix.

@ElizabethSantorellaQC ElizabethSantorellaQC added performance speed, memory, or accuracy and removed enhancement labels May 7, 2020
@tbenthompson tbenthompson changed the title [Major] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents [Major] Add a class for efficient operations on separated dense and sparse subcomponents May 7, 2020
@tbenthompson tbenthompson changed the title [Major] Add a class for efficient operations on separated dense and sparse subcomponents [Major] Add a class for efficient operations on separated dense and sparse subcomponents and on categorical features May 7, 2020
@ElizabethSantorellaQC
Contributor Author

References on fixed effects from Daniel:

@ElizabethSantorellaQC ElizabethSantorellaQC changed the title [Major] Add a class for efficient operations on separated dense and sparse subcomponents and on categorical features [Major] Add a class for efficient operations on categorical features (sparse/dense split done) May 20, 2020
@ElizabethSantorellaQC
Contributor Author

ElizabethSantorellaQC commented May 26, 2020

  • Prototype something and make sure it is actually faster
  • Write up some documentation on how to store a categorical matrix in both csr format and csc format, and think about whether other formats are meaningful or helpful
  • dot product
  • Sandwich product (Python and Cython)
  • Implement column scaling
  • Minimal implementation of other methods that need to be supported
  • Test a run-through of sklearn-fork with CategoricalMatrix data vs. MKLSparseMatrix
  • Change SplitMatrix classes to allow a categorical component
  • Incorporate into benchmarks
  • Profile to see where other hotspots are (transpose_dot? CSR-to-CSC conversion?) and whether those can be sped up. (Looks like we should speed up transpose_dot.)
  • Speed up transpose_dot
  • Think about whether there's an easy way to detect categorical-ness in an existing matrix
  • Failing that, add functionality to tag features as categorical as in h2o. Change our data-generating scripts to not one-hot encode those variables. Tag them as categorical with both sklearn-fork and h2o
  • See if CD is faster when it deals with all indicators in a categorical matrix in parallel (doesn't require having this implementation, only tagging sets of dummies)
  • See what h2o is doing with categoricals
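To illustrate why the sandwich-product item in the list above is cheap for a categorical matrix stored as integer codes (a hypothetical sketch, not the repo's implementation): for one-hot X, X.T @ diag(d) @ X is diagonal, and its entries fall out of a single `np.bincount` pass.

```python
# Sandwich product for a one-hot matrix stored as integer category codes.
import numpy as np

def categorical_sandwich(codes, d, n_categories):
    # Entry j of the (diagonal) result is the sum of d over rows whose
    # category is j -- one O(n_rows) pass, no sparse triple product.
    return np.bincount(codes, weights=d, minlength=n_categories)

codes = np.array([0, 2, 1, 2])
d = np.array([1.0, 2.0, 3.0, 4.0])
diag = categorical_sandwich(codes, d, 3)  # [1., 3., 6.]
```

For a split matrix with dense and categorical blocks, only the cross-block terms of the sandwich product need anything beyond this kind of grouped sum.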

@ElizabethSantorellaQC
Contributor Author

@lbittarello , @jtilly , can you remind me if this needs to support dropping columns?

@jtilly
Member

jtilly commented Jun 10, 2020

@lbittarello , @jtilly , can you remind me if this needs to support dropping columns?

I think it's fine not to support dropping columns here. If one really cares about base levels and whatnot, one can always just use one-hot encoding.
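For reference, the fallback mentioned above is just ordinary one-hot encoding with a reference category, e.g. via pandas:

```python
# Dropping a base level via standard one-hot encoding.
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
full = pd.get_dummies(s)                      # columns: a, b, c
dropped = pd.get_dummies(s, drop_first=True)  # columns: b, c ("a" is the base level)
```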

@ElizabethSantorellaQC
Contributor Author

Closed in favor of #246
