Unsuccessful loading MNIST data for ImageClassifier example: training #41

hmf · 2023-07-19T10:57:18Z

hmf
Jul 19, 2023

I am trying to run the ImageClassifier example and have run into several issues. One of them is that no data is loaded for the training session. I am using a Mill build to do this, but I don't think that is the cause. Here is what I get:

$ ./mill examples.runMain Train --dataset-dir ./data/mnist
[build.sc] [44/49] cliImports 
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
[114/114] examples.runMain 
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())
Using device: Device(CPU,-1)
Found 0 classes: []
Found 0 examples

Train size: 0
Eval size:  0
Model architecture: ResNet50
Exception in thread "main" java.lang.RuntimeException: PytorchStreamReader failed reading zip archive: failed finding central directory
Exception raised from valid at /__w/javacpp-presets/javacpp-presets/pytorch/cppbuild/linux-x86_64-gpu/pytorch/caffe2/serialize/inline_container.cc:178 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f51a00ac4d7 in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f51a007636b in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 0x8e (0x7f4f886ba99e in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamReader::init() + 0x9e (0x7f4f886badfe in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>) + 0x7f (0x7f4f886bc68f in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #5: torch::jit::pickle_load(std::vector<char, std::allocator<char> > const&) + 0x15e (0x7f4f8981bbbe in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #6: Java_org_bytedeco_pytorch_global_torch_pickle_1load___3B + 0xc9 (0x7f4f7f1601a9 in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libjnitorch.so)
frame #7: [0x7f51bc94453a]

	at org.bytedeco.pytorch.global.torch.pickle_load(Native Method)
	at torch.ops.CreationOps.pickleLoad(CreationOps.scala:326)
	at torch.ops.CreationOps.pickleLoad$(CreationOps.scala:37)
	at torch.package$.pickleLoad(package.scala:32)
	at torch.ops.CreationOps.pickleLoad(CreationOps.scala:340)
	at torch.ops.CreationOps.pickleLoad$(CreationOps.scala:37)
	at torch.package$.pickleLoad(package.scala:32)
	at torch.hub$.loadStateDictFromUrl(hub.scala:40)
	at ImageClassifier$.train(ImageClassifier.scala:110)
	at Train$.run(ImageClassifier.scala:320)
	at Train$.run(ImageClassifier.scala:320)
	at caseapp.core.app.CaseApp.main(CaseApp.scala:162)
	at caseapp.core.app.CaseApp.main(CaseApp.scala:133)
	at Train$.main(ImageClassifier.scala:326)
	at Train.main(ImageClassifier.scala)
1 targets failed

Note that I ran all the tests and they all passed, including the downloads of the vision module. This is what I have:

$ ls -lh ./data/mnist/
total 53M
-rw-rw-r-- 1 hmf hmf 7,5M jul 18 16:06 t10k-images-idx3-ubyte
-rw-rw-r-- 1 hmf hmf 9,8K jul 18 16:06 t10k-labels-idx1-ubyte
-rw-rw-r-- 1 hmf hmf  45M jul 18 16:06 train-images-idx3-ubyte
-rw-rw-r-- 1 hmf hmf  59K jul 18 16:06 train-labels-idx1-ubyte

Just in case, I also used the full path, but got the same error.

$ ./mill examples.runMain Train --dataset-dir /mnt/ssd2/hmf/VSCodeProjects/storch/data/mnist/
[build.sc] [44/49] cliImports 
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
[114/114] examples.runMain 
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())
Using device: Device(CPU,-1)
Found 0 classes: []
Found 0 examples

Train size: 0
Eval size:  0
Model architecture: ResNet50

Looking at the code, it seems I should have a set of paths containing "jpg" and "png" files. I tried to load and extract data from the files listed above, but had no success. I tried adding the extensions "gz" and "zip", but on Linux the archives were not recognized.

Is there something I have to do prior to running the training?

TIA

Answered by sbrunk

sbrunk · 2023-07-20T19:58:12Z

sbrunk
Jul 20, 2023
Maintainer

Yeah, the examples definitely need better usage documentation.

The ImageClassifier expects images in directories with the following structure (each folder is a class with examples):

.
└── PetImages
    ├── Cat
    │   ├── 1.jpg
    │   ├── 2.jpg
    │   └── ...
    └── Dog
        ├── 1.jpg
        ├── 2.jpg
        └── ...
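
For anyone who wants to check a folder before starting a training run, here is a minimal sketch of that layout rule using only the JDK file API. It is not the example's actual loader: the object name `DatasetLayoutCheck`, the helper names, and the accepted extensions (`jpg`, `jpeg`, `png`) are assumptions made for illustration. It treats every direct subfolder of the given directory as a class and counts the image files directly inside it.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

// Hypothetical helper, not part of storch: checks the "one folder per class" layout.
object DatasetLayoutCheck {
  // Assumption: only files with these extensions count as examples.
  private val imageExtensions = Set("jpg", "jpeg", "png")

  // Lists the direct children of a directory and closes the underlying stream.
  private def children(dir: Path): List[Path] = {
    val stream = Files.list(dir)
    try stream.iterator().asScala.toList
    finally stream.close()
  }

  private def isImage(p: Path): Boolean = {
    val name = p.getFileName.toString.toLowerCase
    Files.isRegularFile(p) && imageExtensions.exists(ext => name.endsWith("." + ext))
  }

  /** Maps each class folder name to the number of image files directly inside it. */
  def countExamples(root: Path): Map[String, Int] =
    children(root)
      .filter(p => Files.isDirectory(p))
      .map(dir => dir.getFileName.toString -> children(dir).count(isImage))
      .toMap

  def main(args: Array[String]): Unit = {
    val root = Paths.get(args.headOption.getOrElse("."))
    val counts = countExamples(root).toList.sortBy(_._1)
    println(s"Found ${counts.size} classes: ${counts.map(_._1).mkString("[", ", ", "]")}")
    counts.foreach { case (name, n) => println(s"Found $n examples for class $name") }
  }
}
```

Pointed at a folder that holds only flat files, such as the four MNIST files listed in the question, this reports zero classes, because there are no subfolders to treat as classes.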

I've trained a model using this example on the Cat VS Dog dataset (it can be downloaded without a Kaggle account). It uses a ResNet implementation.

The simpler LeNet example downloads the MNIST dataset, which has its own format.
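
To illustrate why the MNIST files are not picked up as images: they use the IDX binary format, which starts with a big-endian 32-bit magic number (2051 for image files, 2049 for label files) followed by the item count and, for images, the row and column counts. The sketch below reads only that header with the JDK. The object and case class names are hypothetical, and this is not how the LeNet example itself loads the data.

```scala
import java.io.{BufferedInputStream, DataInputStream, InputStream}
import java.nio.file.{Files, Paths}

// Hypothetical helper, not part of storch: reads just the header of an IDX file.
object IdxHeader {
  sealed trait Header
  final case class Images(count: Int, rows: Int, cols: Int) extends Header
  final case class Labels(count: Int) extends Header

  def read(in: InputStream): Header = {
    // DataInputStream.readInt reads big-endian integers, which is what IDX uses.
    val data = new DataInputStream(in)
    data.readInt() match {
      case 2051 => // 0x00000803: unsigned bytes, 3 dimensions (images)
        val count = data.readInt()
        val rows = data.readInt()
        val cols = data.readInt()
        Images(count, rows, cols)
      case 2049 => // 0x00000801: unsigned bytes, 1 dimension (labels)
        Labels(data.readInt())
      case other =>
        throw new IllegalArgumentException(f"Unexpected IDX magic number: 0x$other%08x")
    }
  }

  def main(args: Array[String]): Unit = {
    // Expects the path of an uncompressed IDX file as the first argument.
    val in = new BufferedInputStream(Files.newInputStream(Paths.get(args(0))))
    try println(read(in))
    finally in.close()
  }
}
```

The pixel or label bytes follow the header directly, one unsigned byte per value, so there are no per-class folders or image files for a directory-based loader to find.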

1 reply

hmf Jul 21, 2023
Author

Thank you. Will try this out later. Maybe simply adding the above links to the source comment would be enough for now.

hmf · 2023-07-21T10:48:18Z

hmf
Jul 21, 2023
Author

Just a heads up on this. I downloaded the data and repeated the test. Below is what I got with the following command line (slight adaptation to work with Mill):

./mill examples.runMain commands.Train --dataset-dir /mnt/ssd2/hmf/datasets/computer_vision/kaggle_cats_and_dogs/pet_images --checkpoint-dir ~/.cache/storch/hub/checkpoints

I will have to set up a container to run more tests, including the prediction. But it seems to be working fine.

Thank you.

[114/114] examples.runMain 
Using device: Device(CUDA,-1)
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/home/hmf/.cache/coursier/v1/https/repo1.maven.org/maven2/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Found 2 classes: [Cat -> 0, Dog -> 1]
Found 24959 examples
Found 12490 examples for class Cat
Found 12469 examples for class Dog
Train size: 22463
Eval size:  2496
Model architecture: ResNet50
Downloading: https://github.com/sbrunk/storch/releases/download/pretrained-weights/resnet50-11ad3fa6.pth to /home/hmf/.cache/storch/hub/checkpoints/resnet50-11ad3fa6.pth
Evaluating         100% │█│ 312/312 (0:00:56 / 0:00:00)     Loss: 0,04658, Accu
Evaluating         100% │█│ 312/312 (0:00:46 / 0:00:00)     Loss: 0,02935, Accu
Evaluating         100% │█│ 312/312 (0:00:46 / 0:00:00)     Loss: 0,15334, Accu
Evaluating         100% │█│ 312/312 (0:00:47 / 0:00:00)     Loss: 0,02362, Accu
Training epoch 1/1 100% │█│ 2808/2808 (0:21:35 / 0:00:00)                      
Epoch 1/1, Training loss: 0,00000, Evaluation loss: 0,02362, Accuracy: 0,98878
