How to add ID column to the output #2

Open
agrodet opened this issue Jan 25, 2022 · 3 comments

Comments

@agrodet

agrodet commented Jan 25, 2022

I'd like to have the ID column mol_id, which is present in my input file, included in the output file after prediction. I tried adding mol_id to target_column_names:

"loader": {
        "type": "csv",
        "input_path": "/mnt/chembl_new.csv",
        "input_column_names": ["smiles"],
        "target_column_names": ["mol_id", ...]
    }

but this results in the following error:

Starting featurization...
...
100%|██████████| 5790/5790 [00:57<00:00, 100.86it/s]
Restoring from Checkpoint: /mnt/checkpoint.182
Traceback (most recent call last):
  File "/root/miniconda3/envs/kmol/bin/kmol", line 33, in <module>
    sys.exit(load_entry_point('kmol', 'console_scripts', 'kmol')())
  File "/kmol/src/kmol/run.py", line 370, in main
    Executor(config=Config.from_json(args.config), config_path=args.config).run(args.job)
  File "/kmol/src/mila/factories.py", line 27, in run
    getattr(self, job)()
  File "/kmol/src/kmol/run.py", line 265, in predict
    for batch in data_loader.dataset:
  File "/root/miniconda3/envs/kmol/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/kmol/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 403, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/root/miniconda3/envs/kmol/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/kmol/src/kmol/data/resources.py", line 68, in apply
    batch = self._unpack(batch)
  File "/kmol/src/kmol/data/resources.py", line 54, in _unpack
    outputs = torch.FloatTensor(outputs)
TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

When mol_id is an integer, it is predicted as 0s and 1s like any other target value, so I suppose it shouldn't be added to target_column_names. The input_column_names are not included in the output, so I don't know which part of the config I should modify.

Thank you.

@romeoc
Collaborator

romeoc commented Jan 28, 2022

The target_column_names are used as output values in the selected model architecture. They are tensorized and used for loss calculations. Furthermore, these values are predicted during inference (rather than printed directly).

If you would like to store additional data, I would suggest adding it to "input_column_names" instead, but there is currently no option to print these values during inference. We may consider adding this functionality in a future release.

However, there might be an easier solution for your needs. The data loader reads the lines of the input file in the order in which they appear (an assumption I made based on the other issue you created), so the lines of the original file should match the lines of the predicted output. You can open both files in a program like Excel and copy additional columns from one file to the other. On Linux, this can be done automatically with a one-liner, as illustrated below by appending the SMILES column for the Tox21 dataset:

kmol predict data/configs/tox21.json > output.csv && paste -d',' output.csv <(cat data/input/tox21.csv | awk -F',' '{print $14}')
  • we first run inference and save the predicted values to "output.csv"
  • then we load the original dataset and keep only column 14 (the SMILES field)
  • then we concatenate the two files column-wise

Of course, this only works if the rows of the two files match. If a splitter was used, this is no longer the case.
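
For reference, the same column-pasting can also be done in Python. The following is only a rough sketch (the file names output.csv and chembl_new.csv and the column name mol_id are placeholders taken from this issue), assuming both files keep the same rows in the same order:

import pandas as pd

# Rough sketch: re-attach the ID column to the predictions by row order.
# "output.csv" (predictions) and "chembl_new.csv" (original input containing
# a "mol_id" column) are placeholder names; both files are assumed to have
# the same rows in the same order.
predictions = pd.read_csv("output.csv")
original = pd.read_csv("chembl_new.csv")

predictions.insert(0, "mol_id", original["mol_id"].values)
predictions.to_csv("output_with_ids.csv", index=False)

The command-line approach above remains the simplest option; the pandas version is just an alternative when the files are already being handled in Python.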

@agrodet
Author

agrodet commented Feb 1, 2022

If a splitter was used, this is no longer the case.

Is it also the case when using the following?

"splitter": {
    "type": "index",
    "splits": { "test": 1.0 }
},

@romeoc
Collaborator

romeoc commented Feb 1, 2022

This is equivalent to not using a split, because the index splitter does not shuffle samples and 100% of the samples are kept in the "test" split. It is the perfect setup for the above-mentioned use case.
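
As a quick sanity check before pasting columns, you can compare the row counts of the two files; a mismatch means the rows are definitely not aligned. This sketch reuses the placeholder file names from the example above:

import pandas as pd

# Sanity check before pasting columns: a row-count mismatch means the two
# files are not aligned. File names are placeholders from the sketch above.
n_input = len(pd.read_csv("chembl_new.csv"))
n_output = len(pd.read_csv("output.csv"))
print("row counts match" if n_input == n_output else f"mismatch: {n_input} vs {n_output}")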

vincrichard added a commit that referenced this issue Mar 3, 2023
Adding support for recursive config