How to add ID column to the output #2

Open
agrodet opened this issue Jan 25, 2022 · 3 comments

Comments

@agrodet

agrodet commented Jan 25, 2022

I'd like to have the ID column mol_id, which is present in my input file, included in the output file after prediction. I tried adding mol_id to target_column_names:

"loader": {
        "type": "csv",
        "input_path": "/mnt/chembl_new.csv",
        "input_column_names": ["smiles"],
        "target_column_names": ["mol_id", ...]
    }

but this results in the following error:

Starting featurization...
...
100%|██████████| 5790/5790 [00:57<00:00, 100.86it/s]
Restoring from Checkpoint: /mnt/checkpoint.182
Traceback (most recent call last):
  File "/root/miniconda3/envs/kmol/bin/kmol", line 33, in <module>
    sys.exit(load_entry_point('kmol', 'console_scripts', 'kmol')())
  File "/kmol/src/kmol/run.py", line 370, in main
    Executor(config=Config.from_json(args.config), config_path=args.config).run(args.job)
  File "/kmol/src/mila/factories.py", line 27, in run
    getattr(self, job)()
  File "/kmol/src/kmol/run.py", line 265, in predict
    for batch in data_loader.dataset:
  File "/root/miniconda3/envs/kmol/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/kmol/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 403, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/miniconda3/envs/kmol/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/kmol/src/kmol/data/resources.py", line 68, in apply
    batch = self._unpack(batch)
  File "/kmol/src/kmol/data/resources.py", line 54, in _unpack
    outputs = torch.FloatTensor(outputs)
TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

When mol_id is an integer, it is predicted as 0s and 1s like any other target value, so I suppose it shouldn't be added to target_column_names. The input_column_names are not included in the output, so I don't know which part of the config I should modify.

Thank you.

@romeoc
Collaborator

romeoc commented Jan 28, 2022

The target_column_names are used as output values in the selected model architecture. They are tensorized and used for loss calculations. Furthermore, these values are predicted during inference (rather than printed directly).

If you would like to store additional data, I would suggest adding it to "input_column_names" instead, but there is currently no option to print these values during inference. We may consider adding this functionality in a future release.

However, there might be an easier solution for your needs. The data loader reads the lines of the input file in the order in which they appear (an assumption I made based on the other issue you created), so the lines of the original file should match the lines of the predicted output. You can open both files in a program like Excel and copy additional columns from one file to the other. On Linux, this can be done automatically with a one-liner, as illustrated below by appending the SMILES column for the Tox21 dataset:

kmol predict data/configs/tox21.json > output.csv && paste -d',' output.csv <(cat data/input/tox21.csv | awk -F',' '{print $14}')
  • we first run inference and save the predicted values to "output.csv"
  • then we load the original dataset and keep only column 14 (the SMILES field)
  • then we concatenate the two files column-wise

Of course, this only works if the rows of the two files match. If a splitter was used, this is no longer the case.
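
For reference, the same column-pasting can also be done in Python. The following is only a rough sketch (the file names output.csv and chembl_new.csv and the column name mol_id are placeholders taken from this issue), assuming both files keep the same rows in the same order:

import pandas as pd

# Rough sketch: re-attach the ID column to the predictions by row order.
# "output.csv" (predictions) and "chembl_new.csv" (original input containing
# a "mol_id" column) are placeholder names; both files are assumed to have
# the same rows in the same order.
predictions = pd.read_csv("output.csv")
original = pd.read_csv("chembl_new.csv")

predictions.insert(0, "mol_id", original["mol_id"].values)
predictions.to_csv("output_with_ids.csv", index=False)

The command-line approach above remains the simplest option; the pandas version is just an alternative when the files are already being handled in Python.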

@agrodet
Author

agrodet commented Feb 1, 2022

If a splitter was used, this is no longer the case.

Is it also the case when using the following?

"splitter": {
    "type": "index",
    "splits": { "test": 1.0 }
},

@romeoc
Collaborator

romeoc commented Feb 1, 2022

This is equivalent to not using a split, because the index splitter does not shuffle samples and 100% of the samples are kept in the "test" split. It is the perfect setup for the above-mentioned use case.
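
As a quick sanity check before pasting columns, you can compare the row counts of the two files; a mismatch means the rows are definitely not aligned. This sketch reuses the placeholder file names from the example above:

import pandas as pd

# Sanity check before pasting columns: a row-count mismatch means the two
# files are not aligned. File names are placeholders from the sketch above.
n_input = len(pd.read_csv("chembl_new.csv"))
n_output = len(pd.read_csv("output.csv"))
print("row counts match" if n_input == n_output else f"mismatch: {n_input} vs {n_output}")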

vincrichard added a commit that referenced this issue Mar 3, 2023
Adding support for recursive config