-
I am trying to run the example with the following command:
$ ./mill examples.runMain Train --dataset-dir ./data/mnist
[build.sc] [44/49] cliImports
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
[114/114] examples.runMain
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())
Using device: Device(CPU,-1)
Found 0 classes: []
Found 0 examples
Train size: 0
Eval size: 0
Model architecture: ResNet50
Exception in thread "main" java.lang.RuntimeException: PytorchStreamReader failed reading zip archive: failed finding central directory
Exception raised from valid at /__w/javacpp-presets/javacpp-presets/pytorch/cppbuild/linux-x86_64-gpu/pytorch/caffe2/serialize/inline_container.cc:178 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f51a00ac4d7 in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f51a007636b in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libc10.so)
frame #2: caffe2::serialize::PyTorchStreamReader::valid(char const*, char const*) + 0x8e (0x7f4f886ba99e in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamReader::init() + 0x9e (0x7f4f886badfe in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::shared_ptr<caffe2::serialize::ReadAdapterInterface>) + 0x7f (0x7f4f886bc68f in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #5: torch::jit::pickle_load(std::vector<char, std::allocator<char> > const&) + 0x15e (0x7f4f8981bbbe in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libtorch_cpu.so)
frame #6: Java_org_bytedeco_pytorch_global_torch_pickle_1load___3B + 0xc9 (0x7f4f7f1601a9 in /home/hmf/.javacpp/cache/pytorch-2.0.1-1.5.9-linux-x86_64-gpu.jar/org/bytedeco/pytorch/linux-x86_64-gpu/libjnitorch.so)
frame #7: [0x7f51bc94453a]
at org.bytedeco.pytorch.global.torch.pickle_load(Native Method)
at torch.ops.CreationOps.pickleLoad(CreationOps.scala:326)
at torch.ops.CreationOps.pickleLoad$(CreationOps.scala:37)
at torch.package$.pickleLoad(package.scala:32)
at torch.ops.CreationOps.pickleLoad(CreationOps.scala:340)
at torch.ops.CreationOps.pickleLoad$(CreationOps.scala:37)
at torch.package$.pickleLoad(package.scala:32)
at torch.hub$.loadStateDictFromUrl(hub.scala:40)
at ImageClassifier$.train(ImageClassifier.scala:110)
at Train$.run(ImageClassifier.scala:320)
at Train$.run(ImageClassifier.scala:320)
at caseapp.core.app.CaseApp.main(CaseApp.scala:162)
at caseapp.core.app.CaseApp.main(CaseApp.scala:133)
at Train$.main(ImageClassifier.scala:326)
at Train.main(ImageClassifier.scala)
1 targets failed
Note that I ran all the tests and they all passed. This includes the downloads of the following files:
$ ls -lh ./data/mnist/
total 53M
-rw-rw-r-- 1 hmf hmf 7,5M jul 18 16:06 t10k-images-idx3-ubyte
-rw-rw-r-- 1 hmf hmf 9,8K jul 18 16:06 t10k-labels-idx1-ubyte
-rw-rw-r-- 1 hmf hmf 45M jul 18 16:06 train-images-idx3-ubyte
-rw-rw-r-- 1 hmf hmf 59K jul 18 16:06 train-labels-idx1-ubyte
Just in case, I also used the full path, but got the same error.
$ ./mill examples.runMain Train --dataset-dir /mnt/ssd2/hmf/VSCodeProjects/storch/data/mnist/
[build.sc] [44/49] cliImports
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
List((pytorch,2.0.1), (mkl,2023.1), (openblas,0.3.23))
[114/114] examples.runMain
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())
Using device: Device(CPU,-1)
Found 0 classes: []
Found 0 examples
Train size: 0
Eval size: 0
Model architecture: ResNet50
Looking at the code, it seems I should have a set of paths containing "jpg" and "png" files. I tried to load and extract data from the files listed above, but had no success. I tried adding the extensions "gz" and "zip", but on Linux the archives were not recognized. Is there something I have to do prior to running the training? TIA
-
Just a heads up on this. I downloaded the data and repeated the test. Below is what I got with the following command line (slightly adapted to work with Mill):
$ ./mill examples.runMain commands.Train --dataset-dir /mnt/ssd2/hmf/datasets/computer_vision/kaggle_cats_and_dogs/pet_images --checkpoint-dir ~/.cache/storch/hub/checkpoints
I will have to set up a container to run more tests, including the prediction. But it seems to be working fine. Thank you.
[114/114] examples.runMain
Using device: Device(CUDA,-1)
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/home/hmf/.cache/coursier/v1/https/repo1.maven.org/maven2/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Found 2 classes: [Cat -> 0, Dog -> 1]
Found 24959 examples
Found 12490 examples for class Cat
Found 12469 examples for class Dog
Train size: 22463
Eval size: 2496
Model architecture: ResNet50
Downloading: https://github.com/sbrunk/storch/releases/download/pretrained-weights/resnet50-11ad3fa6.pth to /home/hmf/.cache/storch/hub/checkpoints/resnet50-11ad3fa6.pth
Evaluating 100% │█│ 312/312 (0:00:56 / 0:00:00) Loss: 0,04658, Accu
Evaluating 100% │█│ 312/312 (0:00:46 / 0:00:00) Loss: 0,02935, Accu
Evaluating 100% │█│ 312/312 (0:00:46 / 0:00:00) Loss: 0,15334, Accu
Evaluating 100% │█│ 312/312 (0:00:47 / 0:00:00) Loss: 0,02362, Accu
Training epoch 1/1 100% │█│ 2808/2808 (0:21:35 / 0:00:00)
Epoch 1/1, Training loss: 0,00000, Evaluation loss: 0,02362, Accuracy: 0,98878
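A side note on the "PytorchStreamReader failed reading zip archive: failed finding central directory" error from the first run: the stack trace shows it being raised while loadStateDictFromUrl passes the checkpoint to pickle_load, which means the data it read was not a complete zip archive. One possible cause, not confirmed anywhere in this thread, is an empty or partially downloaded checkpoint file. The Python sketch below is a quick way to test for that. The helper name checkpoint_looks_valid is made up for illustration, and the path is the one from the "Downloading: ... to ..." log line above.

```python
import zipfile
from pathlib import Path


def checkpoint_looks_valid(path: Path) -> bool:
    """Return True if `path` is a non-empty file that is a readable zip archive.

    Hypothetical helper for illustration; it is not part of storch or PyTorch.
    """
    if not path.is_file() or path.stat().st_size == 0:
        return False
    # is_zipfile searches for the end-of-central-directory record, which is
    # the structure the "failed finding central directory" error refers to.
    return zipfile.is_zipfile(path)


if __name__ == "__main__":
    # Assumption: the checkpoint lives where the log above says it was saved.
    ckpt = Path.home() / ".cache/storch/hub/checkpoints/resnet50-11ad3fa6.pth"
    if checkpoint_looks_valid(ckpt):
        print(f"{ckpt} looks like a complete zip archive")
    else:
        print(f"{ckpt} is missing, empty, or not a valid zip archive")
```

If the check fails, deleting the file so that it is downloaded again on the next run may be worth trying.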
-
Yeah, the examples definitely need better usage documentation.
The ImageClassifier expects images in a directory structure where each folder is a class containing its examples. I've trained a model using this example on the Cat VS Dog dataset (download without requiring a Kaggle account). It uses a ResNet implementation.
The simpler LeNet example downloads the MNIST dataset, which has its own format.
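To see whether a directory matches the one-folder-per-class layout before starting a training run, a small pre-flight check can help. The Python sketch below is only an illustration and not storch code: the function name scan_dataset, the file names in the comment, and the exact set of accepted extensions (taken from the "jpg" and "png" mentioned in the question) are assumptions. Mapping sorted folder names to indices matches the "Cat -> 0, Dog -> 1" line in the log above, but that ordering is also an assumption about the loader.

```python
import sys
from pathlib import Path

# Assumption: only these extensions count as images (based on the question).
IMAGE_EXTENSIONS = {".jpg", ".png"}

# Assumed layout (file names are hypothetical):
#   pet_images/
#     Cat/0.jpg, Cat/1.jpg, ...
#     Dog/0.jpg, Dog/1.jpg, ...


def scan_dataset(root: Path) -> dict[str, int]:
    """Map each class folder directly under `root` to its image count."""
    counts: dict[str, int] = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    if not root.is_dir():
        sys.exit(f"{root} is not a directory")
    counts = scan_dataset(root)
    print(f"Found {len(counts)} classes")
    for index, (name, n) in enumerate(counts.items()):
        print(f"  {name} -> {index}: {n} images")
```

Run against a flat directory such as ./data/mnist, which holds only the four idx files and no subfolders, this reports zero classes, consistent with the "Found 0 classes: []" line in the original run.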