
predict() method for models? #182

Open
dmeliza opened this issue Sep 14, 2023 · 6 comments

Comments

@dmeliza

dmeliza commented Sep 14, 2023

The scikit-learn implementations of PLS and CCA have predict() methods that are very useful for cross-validation and forecasting. Is it possible to add these to cca-zoo models where appropriate?

@jameschapman19
Owner

Pushed a version of this to main

@jameschapman19
Owner

It works slightly differently from scikit-learn: you pass the views (with any missing views as None) and it reconstructs all of the views from the learnt latent dimensions.

@dmeliza
Author

dmeliza commented Sep 14, 2023

Thanks! I'll check it out.

@dmeliza
Author

dmeliza commented Sep 19, 2023

This works well with my data, but only if the view data are whitened first. I'm not enough of an expert in these methods to say why this might be, but it looks like the methods for generating predictions are quite different in cca-zoo compared to sklearn's PLSRegression.

@jameschapman19
Owner

jameschapman19 commented Sep 19, 2023

If you come back to me in a week and a half, I think I will be able to come up with a more detailed response and fix.

Basically, your observation is exactly what I would expect, and a colleague of mine has been thinking about this in some depth recently.

We learn weights W_x which transform XW_x = Z_x, and W_y which transform YW_y = Z_y. Going from data to latent space is usually known as a backward problem.

For prediction (or 'generation') we need a forward problem.

For PLS it turns out the forward problem is X = ZW_x^T and Y = ZW_y^T.

But for CCA the forward problem is actually X = ZW_x^T\Sigma_X and Y = ZW_y^T\Sigma_Y.

The predict function I wrote up quickly for you uses the PLS forward problem (because that's what scikit-learn appears to do).

But notice that if \Sigma_X is the identity then the two forward problems coincide. \Sigma_X is the identity when your data is whitened, and that's why you are seeing what you are seeing.
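To make the whitening connection concrete, here is a small NumPy sketch (no cca-zoo involved; `W_x` is just a random stand-in for learnt weights) showing that the PLS and CCA forward problems give the same reconstruction exactly when the data covariance is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
W_x = rng.standard_normal((5, 2))   # stand-in for learnt weights

# Whiten X via the SVD of the centered data: covariance becomes the identity
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_white = U * np.sqrt(X.shape[0] - 1)

Z = X_white @ W_x                       # backward problem: data -> latents
Sigma = np.cov(X_white, rowvar=False)   # identity (up to float error)

X_pls = Z @ W_x.T           # PLS forward problem
X_cca = Z @ W_x.T @ Sigma   # CCA forward problem

print(np.allclose(Sigma, np.eye(5)))   # covariance is the identity
print(np.allclose(X_pls, X_cca))       # so the two reconstructions agree
```

On unwhitened data, `Sigma` is no longer the identity and the two reconstructions diverge, which matches the behaviour reported above.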

Based on the above you might be able to implement a CCA prediction function without my help, and if you do get a chance feel free to send a PR :) otherwise I'll do it when I get a moment.

@dmeliza
Author

dmeliza commented Sep 20, 2023

I've been digging through the code and looking at weights, scores, and loadings with my data, and I'm starting to think prediction may be broken for some models in scikit-learn.

To set the context, Y is 58000 by 40 and X is 58000 by 1500. sklearn's PLSRegression works reasonably well with about 10 components; sklearn.cross_decomposition.PLSCanonical, cca_zoo.linear.PLS and cca_zoo.linear.CCA all produce horrible in-sample predictions unless I whiten the inputs. However, whitening totally destroys out-of-sample performance, so it's not an option.

For PLSRegression (i.e. PLS2), prediction works great for unwhitened data. The class computes a "rotation matrix" Pₓ that gives Zₓ = XPₓ, using Pₓ = Wₓ(ΓᵀWₓ)^{-1} rather than just Wₓ as in your example above, where Γ is the matrix of X loadings. The prediction is then Y = XPₓΔᵀ, where Δ is the matrix of loadings for Y. This works because Z_y ≈ Zₓα with α = 1: if I fit a line through the X and Y scores it has an intercept of 0 and a slope of 1.

For PLSCanonical, which I think is the same flavor of PLS as cca_zoo.linear.PLS, α is not equal to 1, and it differs for each component. So the predictions from the different components are not being scaled appropriately, and the overall predictions look like garbage, because the first component accounts for the lion's share of the variance. I am guessing that this α plays the same role as your Σ_X in your post above?

The reason I think there's an error in sklearn is that, according to the User Guide, this factor α needs to be inferred from the data, but I don't see anywhere in the code where it does this. This is my very naive way of trying to fix it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regress the Y scores on the X scores to estimate the scaling
fm = LinearRegression()
fm.fit(model._x_scores, model._y_scores)
# Keep only the diagonal of the coefficient matrix: one α per component
alpha = np.diag(np.diag(fm.coef_))

pred = X_test_scaled @ model.x_rotations_ @ alpha @ model.y_loadings_.T
```

It seems to work, although I'm sure there's a better way to get α than multiple regression. I haven't tried it yet with CCA. If you have a more sophisticated solution I'm happy to write up a PR, and I can submit an issue to sklearn as well.
