
[Major] Add a class for efficient operations on categorical features (sparse/dense split done) #76

Closed
ElizabethSantorellaQC opened this issue Apr 23, 2020 · 10 comments
Assignees
Labels
performance (speed, memory, or accuracy) · this week's work (we're doing it now)

Comments

@ElizabethSantorellaQC
Contributor

One-hot encoding categorical variables generates matrices where all nonzero elements are 1, and there is only one nonzero element per row. It is possible to store these matrices with much less memory than a general sparse matrix and to operate on them more efficiently. We could improve performance a lot by adding a class that represents our data as a partitioned matrix composed of several one-hot encoded matrices (and perhaps also a dense block).
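As a hypothetical sketch (illustrative names, not this repo's eventual API): because each row has exactly one nonzero and it is always 1, the whole matrix can be stored as a single integer array of column codes, one per row, and a matrix-vector product becomes a simple gather.

```python
# Sketch of a compact one-hot matrix: one integer per row instead of the
# data/indices/indptr arrays of a general sparse matrix.
import numpy as np

class CategoricalMatrix:
    """Stores an (n_rows, n_categories) one-hot matrix in O(n_rows) memory."""

    def __init__(self, codes, n_categories):
        self.codes = np.asarray(codes, dtype=np.int64)  # column index per row
        self.shape = (len(self.codes), n_categories)

    def dot(self, vec):
        # (X @ v)[i] = v[codes[i]] -- a gather, no multiplications needed
        return np.asarray(vec)[self.codes]

    def todense(self):
        out = np.zeros(self.shape)
        out[np.arange(self.shape[0]), self.codes] = 1.0
        return out

X = CategoricalMatrix([0, 2, 1, 2], 3)
v = np.array([10.0, 20.0, 30.0])
# X.dot(v) -> [10., 30., 20., 30.]
```

Memory drops from three arrays (values, column indices, row pointers) to one index array, and the dot product does no arithmetic at all.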

@tbenthompson
Collaborator

Relatedly, we should modify the way sklearn interacts with sparse matrices so that it operates through something like the scipy.sparse.linalg.LinearOperator class. We need:

  1. A sandwich product function
  2. A matrix-vector function
  3. A function to get an individual column, for coordinate descent (CD).

Probably another few things that I'm forgetting.
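A rough illustration of those three operations (hypothetical names, sketched over a plain scipy.sparse matrix rather than any existing class):

```python
# Sketch of the minimal operator interface: sandwich product, matvec,
# and single-column access for coordinate descent.
import numpy as np
import scipy.sparse as sps

class MatrixOperator:
    def __init__(self, mat):
        self.mat = sps.csc_matrix(mat)

    def sandwich(self, d):
        # X.T @ diag(d) @ X -- the building block of an IRLS-style Hessian
        return (self.mat.T @ sps.diags(d) @ self.mat).toarray()

    def matvec(self, v):
        return self.mat @ v

    def get_col(self, j):
        # Single column as a 1-d array, for coordinate descent updates
        return self.mat[:, j].toarray().ravel()

X = MatrixOperator(np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]))
d = np.array([1.0, 1.0, 2.0])
H = X.sandwich(d)  # 2x2 dense result
```

Specialized subclasses (dense, sparse, categorical) can then each implement these methods in whatever way is fastest for their storage format.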

@tbenthompson
Collaborator

We might as well include dense matrices in this API too.

@ElizabethSantorellaQC
Contributor Author

Yeah, I think we'll need wrappers for np.ndarray and scipy.sparse matrices, since they already differ a bit in syntax and in the shapes returned by matrix operations. With wrappers, we could probably simplify downstream logic a lot (by removing many `if sparse...` branches).
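One concrete shape mismatch (a hypothetical wrapper sketch): reductions like `.mean(axis=0)` return a 2-d `np.matrix` for scipy.sparse inputs but a 1-d array for ndarrays, so downstream code ends up branching on the type.

```python
# Sketch of a thin wrapper that makes dense and sparse inputs return
# identically shaped results.
import numpy as np
import scipy.sparse as sps

class MatrixWrapper:
    def __init__(self, mat):
        self.mat = mat

    def col_means(self):
        # sparse .mean(axis=0) yields a 2-d np.matrix; normalize to 1-d
        return np.asarray(self.mat.mean(axis=0)).ravel()

    def matvec(self, v):
        return np.asarray(self.mat @ v).ravel()

A = np.array([[1.0, 2.0], [3.0, 4.0]])
dense = MatrixWrapper(A)
sparse = MatrixWrapper(sps.csr_matrix(A))
# dense.col_means() and sparse.col_means() now agree: [2., 3.]
```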

@tbenthompson tbenthompson changed the title Add a class for efficient operations on one-hot encoded categorical variables [Critical] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents Apr 26, 2020
@tbenthompson
Collaborator

The basics of a DataMatrix API are in PR #86.

@ElizabethSantorellaQC ElizabethSantorellaQC changed the title [Critical] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents [Major] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents Apr 30, 2020
@tbenthompson
Collaborator

See PR #105 for a first draft of a dense-sparse split matrix.

@ElizabethSantorellaQC ElizabethSantorellaQC added performance speed, memory, or accuracy and removed enhancement labels May 7, 2020
@tbenthompson tbenthompson changed the title [Major] Add a class for efficient operations on one-hot encoded categorical variables and for separating dense and sparse subcomponents [Major] Add a class for efficient operations on separated dense and sparse subcomponents May 7, 2020
@tbenthompson tbenthompson changed the title [Major] Add a class for efficient operations on separated dense and sparse subcomponents [Major] Add a class for efficient operations on separated dense and sparse subcomponents and on categorical features May 7, 2020
@ElizabethSantorellaQC
Contributor Author

References on fixed effects from Daniel:

@ElizabethSantorellaQC ElizabethSantorellaQC changed the title [Major] Add a class for efficient operations on separated dense and sparse subcomponents and on categorical features [Major] Add a class for efficient operations on categorical features (sparse/dense split done) May 20, 2020
@ElizabethSantorellaQC
Contributor Author

ElizabethSantorellaQC commented May 26, 2020

  • Prototype something and make sure it is actually faster
  • Write up some documentation on how to store a categorical matrix in both csr format and csc format, and think about whether other formats are meaningful or helpful
  • dot product
  • Sandwich product (Python and Cython)
  • Implement column scaling
  • Minimal implementation of other methods that need to be supported
  • Test a run-through of sklearn-fork with CategoricalMatrix data vs. MKLSparseMatrix
  • Change SplitMatrix classes to allow a categorical component
  • Incorporate into benchmarks
  • Profile to see where other hotspots are (transpose_dot? CSR-to-CSC conversion?) and whether those can be sped up. (Looks like we should speed up transpose_dot.)
  • Speed up transpose_dot
  • Think about whether there's an easy way to detect categorical-ness in an existing matrix
  • Failing that, add functionality to tag features as categorical as in h2o. Change our data-generating scripts to not one-hot encode those variables. Tag them as categorical with both sklearn-fork and h2o
  • See if CD is faster when it deals with all indicators in a categorical matrix in parallel (doesn't require having this implementation, only tagging sets of dummies)
  • See what h2o is doing with categoricals
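To illustrate why the sandwich-product item in the list above is cheap for a categorical matrix stored as integer codes (a hypothetical sketch, not the repo's implementation): for one-hot X, X.T @ diag(d) @ X is diagonal, and its entries fall out of a single `np.bincount` pass.

```python
# Sandwich product for a one-hot matrix stored as integer category codes.
import numpy as np

def categorical_sandwich(codes, d, n_categories):
    # Entry j of the (diagonal) result is the sum of d over rows whose
    # category is j -- one O(n_rows) pass, no sparse triple product.
    return np.bincount(codes, weights=d, minlength=n_categories)

codes = np.array([0, 2, 1, 2])
d = np.array([1.0, 2.0, 3.0, 4.0])
diag = categorical_sandwich(codes, d, 3)  # [1., 3., 6.]
```

For a split matrix with dense and categorical blocks, only the cross-block terms of the sandwich product need anything beyond this kind of grouped sum.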

@ElizabethSantorellaQC
Contributor Author

@lbittarello , @jtilly , can you remind me if this needs to support dropping columns?

@jtilly
Member

jtilly commented Jun 10, 2020

@lbittarello , @jtilly , can you remind me if this needs to support dropping columns?

I think it's fine not to support dropping columns here. If one really cares about base levels and whatnot, one can always just use one-hot encoding.
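For reference, the fallback mentioned above is just ordinary one-hot encoding with a reference category, e.g. via pandas:

```python
# Dropping a base level via standard one-hot encoding.
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
full = pd.get_dummies(s)                      # columns: a, b, c
dropped = pd.get_dummies(s, drop_first=True)  # columns: b, c ("a" is the base level)
```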

@ElizabethSantorellaQC
Contributor Author

Closed in favor of #246
