
predict() method for models? #182

Open
dmeliza opened this issue Sep 14, 2023 · 6 comments

Comments

@dmeliza

dmeliza commented Sep 14, 2023

The scikit-learn implementations of PLS and CCA have predict() methods that are very useful for cross-validation and forecasting. Is it possible to add these to cca-zoo models where appropriate?

@jameschapman19
Owner

Pushed a version of this to main

@jameschapman19
Owner

It works slightly differently from scikit-learn: you pass the views (with any missing views as None) and it reconstructs all of the views from the learnt latent dimensions.

@dmeliza
Author

dmeliza commented Sep 14, 2023

Thanks! I'll check it out.

@dmeliza
Author

dmeliza commented Sep 19, 2023

This works well with my data, but only if the view data are whitened first. I'm not enough of an expert in these methods to say why this might be, but it looks like the methods for generating predictions are quite different in cca-zoo compared to sklearn's PLSRegression.

@jameschapman19
Owner

jameschapman19 commented Sep 19, 2023

If you come back to me in a week and a half, I think I will be able to come up with a more detailed response and fix.

Basically, your observation is exactly what I would expect, and a colleague of mine has been thinking about this in some depth recently.

We learn weights W_x which transform XW_x = Z_x, and W_y which transform YW_y = Z_y. Going from data to latent space is usually known as a backward problem.

For prediction (or 'generation') we need a forward problem.

For PLS it turns out the forward problem is X = ZW_x^T and Y = ZW_y^T.

But for CCA the forward problem is actually X = ZW_x^T\Sigma_X and Y = ZW_y^T\Sigma_Y.

The predict function I wrote up quickly for you uses the PLS forward problem (because that's what scikit-learn appears to do).

But notice that if \Sigma_X is the identity then the two forward problems coincide. \Sigma_X is the identity when your data is whitened, and that's why you are seeing what you are seeing.
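To make the whitening connection concrete, here is a small NumPy sketch (no cca-zoo involved; `W_x` is just a random stand-in for learnt weights) showing that the PLS and CCA forward problems give the same reconstruction exactly when the data covariance is the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
W_x = rng.standard_normal((5, 2))   # stand-in for learnt weights

# Whiten X via the SVD of the centered data: covariance becomes the identity
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_white = U * np.sqrt(X.shape[0] - 1)

Z = X_white @ W_x                       # backward problem: data -> latents
Sigma = np.cov(X_white, rowvar=False)   # identity (up to float error)

X_pls = Z @ W_x.T           # PLS forward problem
X_cca = Z @ W_x.T @ Sigma   # CCA forward problem

print(np.allclose(Sigma, np.eye(5)))   # covariance is the identity
print(np.allclose(X_pls, X_cca))       # so the two reconstructions agree
```

On unwhitened data, `Sigma` is no longer the identity and the two reconstructions diverge, which matches the behaviour reported above.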

Based on the above you might be able to implement a CCA prediction function without my help, and if you do get a chance feel free to send a PR :) otherwise I'll do it when I get a moment.

@dmeliza
Author

dmeliza commented Sep 20, 2023

I've been digging through the code and looking at weights, scores, and loadings with my data, and I'm starting to think prediction may be broken for some models in scikit-learn.

To set the context, Y is 58000 by 40 and X is 58000 by 1500. sklearn's PLSRegression works reasonably well with about 10 components; sklearn.cross_decomposition.PLSCanonical, cca_zoo.linear.PLS and cca_zoo.linear.CCA all produce horrible in-sample predictions unless I whiten the inputs. However, whitening totally destroys out-of-sample performance, so it's not an option.

For PLSRegression (i.e. PLS2), prediction works great for unwhitened data. The class computes a "rotation matrix" Pₓ that gives Zₓ = XPₓ, using Pₓ = Wₓ(ΓᵀWₓ)^{-1} rather than just Wₓ as in your example above, where Γ is the matrix of X loadings. The prediction is then Y = XPₓΔᵀ, where Δ is the matrix of loadings for Y. This works because Z_y ≈ Zₓα with α = 1: if I fit a line through the X and Y scores it has an intercept of 0 and a slope of 1.

For PLSCanonical, which I think is the same flavor of PLS as cca_zoo.linear.PLS, α is not equal to 1, and it differs for each component. So the predictions from the different components are not being scaled appropriately, and the overall predictions look like garbage, because the first component accounts for the lion's share of the variance. I am guessing that this α plays the same role as your Σ_X in your post above?

The reason I think there's an error in sklearn is that, according to the User Guide, this factor α needs to be inferred from the data, but I don't see anywhere in the code where it does this. This is my very naive way of trying to fix it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regress the Y scores on the X scores to estimate the scaling
fm = LinearRegression()
fm.fit(model._x_scores, model._y_scores)
# Keep only the diagonal of the coefficient matrix: one α per component
alpha = np.diag(np.diag(fm.coef_))

pred = X_test_scaled @ model.x_rotations_ @ alpha @ model.y_loadings_.T
```

It seems to work, although I'm sure there's a better way to get α than multiple regression. I haven't tried it yet with CCA. If you have a more sophisticated solution I'm happy to write up a PR, and I can submit an issue to sklearn as well.
