[Major] Add a class for efficient operations on categorical features (sparse/dense split done) #76
Comments
Relatedly, we should modify the way sklearn interacts with sparse matrices so that it operates through something like the scipy.sparse.linalg.LinearOperator class. We need:
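As a minimal sketch of what "operating through something like LinearOperator" might look like (using only the existing scipy API, not anything sklearn-specific):

```python
import numpy as np
import scipy.sparse as sps
from scipy.sparse.linalg import aslinearoperator

# Wrap a sparse design matrix as a LinearOperator: downstream code
# then depends only on matvec/rmatvec, not on the storage format.
X = sps.random(5, 3, density=0.4, format="csr", random_state=0)
op = aslinearoperator(X)

v = np.ones(3)
w = np.ones(5)
print(np.allclose(op.matvec(v), X @ v))    # X @ v via the operator
print(np.allclose(op.rmatvec(w), X.T @ w)) # X.T @ w via the operator
```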
Probably another few things that I'm forgetting.
We might as well include dense matrices in this API too.
Yeah, I think we will need some wrappers for either np.ndarray or scipy.sparse matrices, since they already differ a bit in terms of syntax and the shapes of the results of matrix operations. If we did that, we could probably simplify downstream logic a lot (by removing a lot of …)
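To make the comment above concrete, here is a small illustration (not from the thread) of how np.ndarray and scipy.sparse matrices disagree on result shapes and on the meaning of `*`:

```python
import numpy as np
import scipy.sparse as sps

X_dense = np.ones((2, 3))
X_sparse = sps.csr_matrix(X_dense)

# Reductions differ in shape: ndarray gives a 1-d array,
# sparse gives a 2-d np.matrix.
print(X_dense.sum(axis=0).shape)   # (3,)
print(X_sparse.sum(axis=0).shape)  # (1, 3)

# '*' is elementwise (broadcasting) for ndarrays, but it is
# matrix multiplication for scipy.sparse matrices.
print((X_dense * np.arange(3)).shape)   # (2, 3): elementwise
print((X_sparse * np.arange(3)).shape)  # (2,): matrix-vector product
```

A wrapper class that normalizes these behaviors would let downstream code stop branching on the matrix type.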
The basics of a DataMatrix API are in PR #86.
See PR #105 for a first draft of a dense-sparse split matrix.
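The dense-sparse split idea could be sketched roughly as follows. This is a hypothetical illustration, not the actual API from PR #105; the class and method names are assumptions:

```python
import numpy as np
import scipy.sparse as sps

class SplitMatrix:
    """Hypothetical sketch: the horizontal concatenation
    [dense | sparse], never materialized, with each block routed
    to its own (fastest) kernel."""

    def __init__(self, dense, sparse):
        self.dense = np.asarray(dense)
        self.sparse = sps.csc_matrix(sparse)
        self.shape = (self.dense.shape[0],
                      self.dense.shape[1] + self.sparse.shape[1])

    def matvec(self, v):
        # Split v to match the column blocks, then sum the products.
        d = self.dense.shape[1]
        return self.dense @ v[:d] + self.sparse @ v[d:]

    def rmatvec(self, w):
        # X.T @ w is just the two blocks' transposed products stacked.
        return np.concatenate([self.dense.T @ w, self.sparse.T @ w])
```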
References on fixed effects from Daniel:
@lbittarello , @jtilly , can you remind me if this needs to support dropping columns?
I think it's fine not to support dropping columns for this. If one really cares about base levels and whatnot, one can always just use one-hot encoding.
Closed in favor of #246 |
One-hot encoding categorical variables generates matrices where all nonzero elements are 1, and there is only one nonzero element per row. It is possible to store these matrices with much less memory than a general sparse matrix and to operate on them more efficiently. We could improve performance a lot by adding a class that represents our data as a partitioned matrix composed of several one-hot encoded matrices (and perhaps also a dense block).
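A minimal sketch of such a class, assuming only the structure described above (exactly one nonzero, equal to 1, per row). The names are illustrative, not the class that was eventually implemented:

```python
import numpy as np

class OneHotMatrix:
    """Hypothetical n x k one-hot matrix stored as one column index
    per row: n integers instead of a general sparse matrix."""

    def __init__(self, indices, n_cols):
        self.indices = np.asarray(indices, dtype=np.intp)
        self.shape = (len(self.indices), n_cols)

    def matvec(self, v):
        # (X @ v)[i] = v[indices[i]]: a pure gather, no multiplications.
        return v[self.indices]

    def rmatvec(self, w):
        # (X.T @ w)[j] = sum of w[i] over rows i with indices[i] == j.
        return np.bincount(self.indices, weights=w,
                           minlength=self.shape[1])

    def toarray(self):
        out = np.zeros(self.shape)
        out[np.arange(self.shape[0]), self.indices] = 1.0
        return out
```

A partitioned design matrix would then hold several such blocks (plus perhaps a dense block) and dispatch matvec/rmatvec to each.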