[BUG] Problems with the size of data #1606

Closed
rruizdeaustri opened this issue Jun 5, 2024 · 11 comments
Labels
bug Something isn't working classification Classification package transformations Transformations package

Comments

@rruizdeaustri

rruizdeaustri commented Jun 5, 2024

Describe the bug

Hi,

I want to use the rocket algorithms to classify gravitational waves. My data has shape (400000, 2, 2048), where 2 is the number of channels and 2048 is the length of each time series. Fitting fails with the error shown under "Actual results".

Thank you !

Roberto

Steps/Code to reproduce the bug

import os
import sys
import numpy as np
import h5py
import time
from pathlib import Path

import tensorflow as tf
import matplotlib.pyplot as plt

from utils.configfiles import get_config
from utils.datasets import load_data_into_numpy, InjectionSNR
from utils.metrics import auc_snr_eval
import json

from aeon.classification.deep_learning import LITETimeClassifier
from aeon.classification.hybrid import HIVECOTEV1, HIVECOTEV2
from aeon.classification.convolution_based import Arsenal

from sklearn.metrics import roc_auc_score

# -----------------------------------------------------------------------------
# MAIN CODE
# -----------------------------------------------------------------------------

model = 'multirocket'

if __name__ == '__main__':

    # -------------------------------------------------------------------------
    # Preliminaries
    # -------------------------------------------------------------------------
    script_start = time.time()  # start of the run, used by the timing report at the end
    print(tf.config.list_physical_devices('GPU'))

    # Example usage with your configuration settings
    config = get_config()
    xtrain, ytrain = load_data_into_numpy(config['data']['training'])
    xtest, ytest = load_data_into_numpy(config['data']['testing'])
 
    injections_snr = InjectionSNR()

    if model == 'LITETime': 
     clf = LITETimeClassifier(batch_size=32, n_classifiers=5, n_epochs=50, file_path='checkpoints/',
                              save_best_model=True, best_file_name="best_model", verbose=True)
    elif model == 'hivecote':
     clf = HIVECOTEV2(time_limit_in_minutes=0.2, verbose=1)   
    elif model == 'multirocket':
     clf = Arsenal(rocket_transform="multirocket")
    else:
     print('wrong model')   
     sys.exit()   

    # Check unique values in data
    #unique_values, counts = np.unique(ytrain, return_counts=True)
    #print(f"Unique values in predictions: {unique_values}")
    #print(f"Counts of unique values: {counts}")
    #sys.exit()
    print(f"xtrain shape: {xtrain.shape}, type: {xtrain.dtype}")
    clf.fit(xtrain, ytrain)

    #Compute AUC versus SNR and plot
    ypred = clf.predict(xtest)

    # Check unique values in predictions
    unique_values, counts = np.unique(ypred, return_counts=True)
    print(f"Unique values in predictions: {unique_values}")
    print(f"Counts of unique values: {counts}")
    #print(ypred.shape, ypred, ytest)
    sys.exit()
    # Assume non-signal data are those with a true label of 0
    non_signal_indices = np.where(ytest == 0)[0]

    # Function receives the scores and computes the bin AUCs
    auc_snr = auc_snr_eval(injections_snr, ypred, ytest, non_signal_indices)

    print(clf.score(xtest, ytest))

    # -------------------------------------------------------------------------
    # Save results as a JSON file
    # -------------------------------------------------------------------------
    print('Saving auc versus snr results to JSON file...', end=' ', flush=True)
    with open('results/metrics/auc_over_snr_aeon.json', 'w') as json_file:
        json.dump(auc_snr, json_file, sort_keys=True, indent=2)
        
    print('Done!') 

    # Extract data from results
    snr_bins = [(float(a), float(b)) for (a, b) in auc_snr['snr_bins']]
    auc_ratios = np.array(auc_snr['auc']).astype(float)
    grid = [np.mean(_) for _ in snr_bins]

    # Initialize a color cycle for plotting
    #colors = plt.cm.jet(np.linspace(0, 1, 1))

    # Plot the data with a different color for each curve
    plt.plot(grid, auc_ratios, marker='o', ms=2, mew=0.5, linestyle='-', label='aeon')

    # Initialize the legend list
    legend_labels = []

    # Add the model name to the legend
    legend_labels.append('CNN')

    # Configure the plot and add a legend
    plt.xlabel('SNR')
    plt.ylabel('AUC')
    plt.legend(legend_labels, loc='best')
    plt.grid(True)

    # Construct path to save this plot
    plots_dir = './plots'
    Path(plots_dir).mkdir(exist_ok=True)
    file_path = os.path.join(plots_dir, 'auc_snr_combined.pdf')

    # Save the plot as a PDF
    print('Saving plot as PDF...', end=' ', flush=True)
    plt.savefig(file_path, bbox_inches='tight', pad_inches=0)
    print('Done!', flush=True)
    #plt.show()

    auc = roc_auc_score(ytest, ypred)

    print('Test set AUC: {:.2f}%'.format(100.*auc))

    AUC = []
    AUC.append(auc)

    np.savetxt('results/metrics/auc_aeon.txt', AUC)

    print(80 * '-' + '\n\n' + 'Testing complete!')

    # -------------------------------------------------------------------------
    # Postliminaries
    # -------------------------------------------------------------------------

    print('')
    print(f'This took {time.time() - script_start:.1f} seconds!')
    print('')

Expected results

The classifier fits and predicts without errors.

Actual results

 Traceback (most recent call last):
  File "/lustre/home/ific/rruiz/projects/gws/aeon/main.py", line 83, in <module>
    clf.fit(xtrain, ytrain)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/base.py", line 129, in fit
    self._fit(X, y)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/convolution_based/_arsenal.py", line 171, in _fit
    self._fit_arsenal(X, y)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/convolution_based/_arsenal.py", line 335, in _fit_arsenal
    fit = Parallel(n_jobs=self._n_jobs, prefer="threads")(
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/classification/convolution_based/_arsenal.py", line 367, in _fit_ensemble_estimator
    transformed_x = rocket.fit_transform(X)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/transformations/collection/base.py", line 161, in fit_transform
    Xt = self._fit_transform(X=X_inner, y=y_inner)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/transformations/collection/base.py", line 326, in _fit_transform
    return self._transform(X, y)
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/aeon/transformations/collection/convolution_based/_multirocket_multivariate.py", line 168, in _transform
    X = _transform(
  File "/lustre/home/ific/rruiz/.conda/envs/tf/lib/python3.10/site-packages/numba/core/dispatcher.py", line 703, in _explain_matching_error
    raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float32, 3d, C), array(float32, 3d, C), Tuple(array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)), Tuple(array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C)), int64

Versions

0.8.1

@rruizdeaustri rruizdeaustri added the bug Something isn't working label Jun 5, 2024
@baraline
Member

baraline commented Jun 5, 2024

I suspect that this might be because of the strict definition of the signature of the transform function in multirocket, which only accepts float64 arrays. In the fit method, X is converted to float64, but not in transform.

What is the datatype of your input? If it's float32, would converting it to float64 work? (The size of the data might become an issue if you don't have enough RAM, but for testing purposes you can reduce it.)

If this is the cause of the bug, we would need to discuss why float64 has been made mandatory in the function signature, and if we can relax it to allow other types.
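As a rough check of the sizes involved, here is a minimal sketch. The helper name `nbytes_for` is made up for illustration, the shape is the one from the report, and the small random array only stands in for `xtrain`. Nothing of the full size is allocated.

```python
import numpy as np

def nbytes_for(shape, dtype):
    """Bytes needed for a dense array of this shape and dtype (nothing is allocated)."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize

full_shape = (400_000, 2, 2048)  # shape reported in this issue
gb32 = nbytes_for(full_shape, np.float32) / 1e9
gb64 = nbytes_for(full_shape, np.float64) / 1e9
print(f"float32: {gb32:.2f} GB, float64: {gb64:.2f} GB")

# Small random stand-in for xtrain, cast the way the test above suggests
x_small = np.random.rand(100, 2, 2048).astype(np.float32)
x_small64 = x_small.astype(np.float64)
print(x_small.dtype, "->", x_small64.dtype)
```

For the reported shape this gives about 6.55 GB in float32 and 13.11 GB in float64, before any transformed features are created, so testing the cast on a reduced set first is the safer route.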

@MatthewMiddlehurst MatthewMiddlehurst added transformations Transformations package classification Classification package labels Jun 5, 2024
@TonyBagnall
Contributor

TonyBagnall commented Jun 5, 2024

Thanks for the bug report. From the trace, this comes from fit called on Arsenal. This works:

from aeon.classification.convolution_based import Arsenal
import numpy as np
shape = (40, 2, 2000)
X = np.random.rand(*shape).astype(np.float32)
y = np.random.randint(0, 2, size=40)
afc = Arsenal()
afc.fit(X, y)

what is the data type for your xtrain?

@TonyBagnall
Contributor

I would also recommend putting a time limit on HC2 if you want to run it on a problem of that size.

@TonyBagnall
Contributor

TonyBagnall commented Jun 5, 2024

ah ignore that, as @baraline pointed out on slack, I missed that you had set it to multirocket. This does indeed crash, definitely a bug.

from aeon.classification.convolution_based import Arsenal
from aeon.transformations.collection.convolution_based import MultiRocket
from aeon.classification.hybrid import HIVECOTEV2
import numpy as np
shape = (40, 2, 200)
X = np.random.rand(*shape).astype(np.float32)
print(X.shape)
y = np.random.randint(0, 2, size=40)
afc = Arsenal(rocket_transform="multirocket")
afc.fit(X, y)
print("Finished fit for arsenal")
print(afc.predict(X))

Wait, it's more complex. This crashes with multivariate series:

TypeError: No matching definition for argument type(s) array(float32, 3d, C), array(float32, 3d, C)

but not with univariate
shape = (40, 1, 200)

@TonyBagnall
Contributor

TonyBagnall commented Jun 5, 2024

For some bizarre reason we have both MultiRocketMultivariate and MultiRocket, and the problem lies with the former (don't ask why we have these weird versions, it's legacy!)

from aeon.transformations.collection.convolution_based import MultiRocketMultivariate
mr = MultiRocketMultivariate()
mr.fit(X)
Xt = mr.transform(X)

gives the same type error. The problem occurs in the numba internal method _transform (confusingly not the one implementing the abstract class).

it has this numba signature

@njit(
    "float32[:,:](float64[:,:,:],float64[:,:,:],"
    "Tuple((int32[:],int32[:],int32[:],int32[:],float32[:])),"
    "Tuple((int32[:],int32[:],int32[:],int32[:],float32[:])),int32)",
    fastmath=True,
    parallel=True,
    cache=True,
)
def _transform(X, X1, parameters, parameters1, n_features_per_kernel=4):
    num_examples, num_channels, input_length = X.shape

the univariate version has this

@njit(
    "float32[:,:](float64[:,:],float64[:,:],Tuple((int32[:],int32[:],float32[:])),"
    "Tuple((int32[:],int32[:],float32[:])),int32)",
    fastmath=True,
    parallel=True,
    cache=True,
)
def _transform(X, X1, parameters, parameters1, n_features_per_kernel):

@rruizdeaustri
Author

Hi,

The issue disappeared with the trick of converting the data to float64, but after some time the code stopped with a memory error. The input shape was (400000, 2, 2048), which is probably too much to hold in RAM, and worse if the numbers are 64 bits. Is there a way of training in batches to avoid this?

Thanks !

Rbt

@baraline
Member

baraline commented Jun 6, 2024

Hey, I think the right way of handling this on our side would be to make those functions support both float64 and float32 inputs. We'll discuss the best approach and work on a fix. In the meantime, I see two options if you want to use your full dataset, which unfortunately will involve some tinkering:

  1. Edit the sources and change float64 to float32 in the signature of the _transform function. This fixes the problem locally and hopefully avoids the memory error.

  2. Otherwise, if you can fit the multirocket transformer on the whole data, you can then transform the data and save it in batches to avoid the memory error in transform.

    • If fit throws a memory error, you could fit on only part of the input and then do the batch transform. I'm not sure about the impact of not fitting on the full data, but as rocket kernels are mostly random, it should be somewhat “fine”.

    To learn a classifier from this batch-transformed data, if memory is still an issue in the transformed format, you would need a sklearn classifier that can be updated incrementally; otherwise you're fine using a RidgeClassifierCV, as in the individual rocket classifiers.

The second option of course covers only one rocket transformer. To mimic Arsenal's behaviour, you would need to do this n_estimators times and combine the predictions of all of them using the ensemble scheme used in Arsenal (i.e. this function).
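The batch route in option 2 could look roughly like the sketch below. It makes several assumptions: `fit_in_batches` and `toy_transform` are made-up names, `toy_transform` is a plain random projection that merely stands in for the `transform` method of an already fitted rocket transformer, and `SGDClassifier` is just one sklearn estimator that offers `partial_fit`. Feature scaling is left out for brevity.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_in_batches(transform, X, y, batch_size=1000):
    """Transform X chunk by chunk and update a linear classifier incrementally."""
    clf = SGDClassifier(random_state=0)
    classes = np.unique(y)  # partial_fit needs the full label set up front
    for start in range(0, len(X), batch_size):
        stop = start + batch_size
        Xt = transform(X[start:stop])  # only this chunk's features are in memory
        clf.partial_fit(Xt, y[start:stop], classes=classes)
    return clf

# Hypothetical stand-in for a fitted transformer's `transform` method.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2 * 64, 50))

def toy_transform(X):
    # Random projection of the flattened series, NOT a rocket transform.
    return X.reshape(len(X), -1) @ weights

X = rng.normal(size=(200, 2, 64)).astype(np.float32)
y = rng.integers(0, 2, size=200)

clf = fit_in_batches(toy_transform, X, y, batch_size=50)
preds = clf.predict(toy_transform(X[:10]))
print(preds.shape)
```

In real use, `transform` would be the bound method of a transformer fitted on the data (or on a subset of it), and each transformed chunk could also be written to disk instead of being consumed immediately.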

@TonyBagnall
Contributor

Personally I would just train it on a subset. Ultimately, Rocket classifiers are pipelines which generate very large feature spaces. The flip side is that you probably don't need that much data to train; it is, after all, mostly random. Predict can of course be done in batches.
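A minimal sketch of that approach, assuming nothing beyond numpy: `stratified_subset`, `predict_in_batches` and the toy `MeanThreshold` classifier are all made-up names, and the toy classifier is only there so the example runs. Any estimator with `fit` and `predict` could take its place.

```python
import numpy as np

def stratified_subset(y, n_per_class, seed=0):
    """Indices of up to n_per_class randomly chosen cases for each label."""
    rng = np.random.default_rng(seed)
    picked = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        take = min(n_per_class, len(idx))
        picked.append(rng.choice(idx, size=take, replace=False))
    return np.sort(np.concatenate(picked))

def predict_in_batches(clf, X, batch_size=1000):
    """Call clf.predict on consecutive chunks of X and join the results."""
    parts = [clf.predict(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

class MeanThreshold:
    """Hypothetical toy classifier, only here to make the sketch runnable."""
    def fit(self, X, y):
        self.cut_ = X.mean()
        return self

    def predict(self, X):
        return (X.mean(axis=(1, 2)) > self.cut_).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2, 32)).astype(np.float32)
y = rng.integers(0, 2, size=500)

idx = stratified_subset(y, n_per_class=50)
clf = MeanThreshold().fit(X[idx], y[idx])
ypred = predict_in_batches(clf, X, batch_size=128)
print(len(idx), ypred.shape)
```

Sampling per class keeps the label balance under control when the subset is much smaller than the full training set.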

@TonyBagnall
Contributor

in terms of the code, I think for now just do what rocket does, cast to 32 bits
X.astype(np.float32)
I think it may have been @dguijo who did that? Whole module needs reworking tbh
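For reference, a cast like this does not have to copy data that is already 32 bit. This is a small sketch with a made-up helper name (`as_float32`), not the code used in aeon:

```python
import numpy as np

def as_float32(X):
    """Return X as float32, reusing the same memory when it already is float32."""
    return X.astype(np.float32, copy=False)

a32 = np.zeros((4, 2, 16), dtype=np.float32)
a64 = np.zeros((4, 2, 16), dtype=np.float64)

# Already float32: no new buffer is allocated.
print(np.shares_memory(a32, as_float32(a32)))

# float64 input: a new array with half the bytes is created.
print(as_float32(a64).dtype, as_float32(a64).nbytes, a64.nbytes)
```

Downcasting float64 input does lose precision, so whether this is acceptable depends on the data.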

@TonyBagnall
Contributor

TonyBagnall commented Jun 7, 2024

@rruizdeaustri this should be fixed by #1612, at least in terms of 32 bit/64 bit. We plan to redesign the whole rocket package, but it will always be memory intensive (see #1126). I don't think you can avoid creating an (n_cases, n_kernels) array if you use it as the authors proposed. I suggest reducing either the train size or the number of kernels.

ok to close this issue?
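To get a feel for that trade-off, the size of the transformed array can be estimated up front. In the sketch below, the helper names are made up and `n_features = 10_000` is an illustrative value only, not a library default:

```python
import numpy as np

def feature_matrix_gb(n_cases, n_features, dtype=np.float32):
    """Size in GB of a dense (n_cases, n_features) feature matrix."""
    return n_cases * n_features * np.dtype(dtype).itemsize / 1e9

def max_cases_for_budget(budget_gb, n_features, dtype=np.float32):
    """Largest n_cases whose feature matrix fits in budget_gb."""
    return int(budget_gb * 1e9 // (n_features * np.dtype(dtype).itemsize))

n_features = 10_000  # illustrative value only, not a library default

print(feature_matrix_gb(400_000, n_features))  # full training set from this issue
print(max_cases_for_budget(8, n_features))     # cases that fit in an 8 GB budget
```

Under these assumed numbers the full training set would need 16 GB for the features of a single transformer, while 200,000 cases would fit in 8 GB. An ensemble such as Arsenal builds one such matrix per estimator, although not necessarily all at once.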

@rruizdeaustri
Author

Yes and thanks a lot for the quick feedback!

4 participants