A Python implementation of a system that generates a textual description of a given photo. For a demonstration of how the API works, check ImageCaptioning.ipynb.
It shows the whole pipeline:
- repository preparation
- input data preparation
- model preparation
- model training
- model validation
It also shows the use case of single-image evaluation: the predicted caption is compared with the true caption, together with attention weights that visualize the model's "thought process".
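For illustration, the attention visualization can be sketched roughly as follows (a minimal sketch, not the notebook's exact code; the `plot_attention` name and the 8x8 attention-grid shape are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def plot_attention(image_path, caption_words, attention_weights):
    """Overlay one attention map per generated word on the input image."""
    image = np.array(Image.open(image_path))
    fig = plt.figure(figsize=(10, 10))
    for i, word in enumerate(caption_words):
        # Assumed: one 8x8 attention map per generated word.
        att = np.resize(attention_weights[i], (8, 8))
        ax = fig.add_subplot((len(caption_words) + 1) // 2, 2, i + 1)
        ax.set_title(word)
        ax.imshow(image)
        ax.imshow(att, cmap="gray", alpha=0.6,
                  extent=(0, image.shape[1], image.shape[0], 0))
    plt.tight_layout()
    plt.show()
```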
Required dependencies are listed in the requirements.txt file. You can install them via the provided Makefile, or simply run:
pip install -r requirements.txt
The model is trained on the COCO dataset. I prepared a bash script to acquire it. Because the download is large, the script relies on a few external tools:
- gsutil
- unzip
- curl
If you have them installed, you can fetch the dataset by running getdataset.sh:
./coco/getdataset.sh
The script will acquire the data and store it in the right location. WARNING: it is a large amount of data, so be prepared.

The overall processing pipeline is:
- Load images
- Extract features from each image (see the sketch after this list)
- Store extracted features
- Train/validation split
- Prepare model
- Train model:
  - save results (loss plot, training history)
  - checkpoint the model state
  - optimize the trainable variables
- Validate the trained model:
  - compare the generated caption with the ground truth
  - prepare the attention validation plot
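The feature-extraction step can be sketched as follows (a minimal sketch assuming an ImageNet-pretrained InceptionV3 backbone; the backbone actually used in this repository may differ):

```python
import tensorflow as tf

# Assumed backbone: InceptionV3 without its classification head.
base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
extractor = tf.keras.Model(base.input, base.output)

def extract_features(image_path):
    # Load and preprocess a single image to the size InceptionV3 expects.
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    # Output shape (1, 8, 8, 2048); flatten the spatial grid to (1, 64, 2048).
    features = extractor(tf.expand_dims(img, 0))
    return tf.reshape(features, (features.shape[0], -1, features.shape[3]))
```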
The model is built from the following components (sketched after this list):
- encoder - encodes each feature with a CNN dense layer
- decoder - decodes the encoded features with a GRU reinforced with attention
- optimizer - Adam
- loss object - sparse categorical cross-entropy
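In TensorFlow, these components typically look roughly like this (an illustrative sketch, not the repository's exact class definitions; the padding-token id 0 in the loss mask is an assumption):

```python
import tensorflow as tf

class CNN_Encoder(tf.keras.Model):
    """Passes pre-extracted image features through a single dense layer."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        return tf.nn.relu(self.fc(x))

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real, pred):
    # Mask out padding tokens (assumed id 0) so they don't affect the loss.
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, loss_.dtype)
    return tf.reduce_mean(loss_)
```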
The results of image captioning are stored under the results/ location. They are as follows:
- training loss list: results/training/loss_plot.pkl
- training history text file: results/training/loss_history.txt
- validation results, a list of dictionaries with keys image_id and caption (see the loading example after this list): results/validation/captions_val2014.json
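The validation file can be inspected like this (a small sketch assuming the list-of-dicts layout described above; the printed entry is illustrative):

```python
import json

with open("results/validation/captions_val2014.json") as f:
    results = json.load(f)

# Each entry pairs a COCO image id with a generated caption, e.g.
# {"image_id": 123456, "caption": "a dog sitting on a couch"}
print(results[0])
```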
The 'results' folder grows quite big, so please create your own results directory or download mine.
To avoid redundant computations, the script stores values needed in later steps in the image_captioning/repository.pkl file (a loading sketch follows below). repository.pkl contents:
- train feature path list
- train caption list
- tokenizer
- caption maximum length
- starting epoch
- validation image id list
- validation feature path list
- test image id list
- test feature path list
To restore my trained model, please download my repository.pkl and place it in the directory given above.
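Loading the repository file is a plain pickle read (a minimal sketch; the exact structure of the pickled object, e.g. a dict keyed by the entries above, is an assumption):

```python
import pickle

with open("image_captioning/repository.pkl", "rb") as f:
    repository = pickle.load(f)

# Illustrative access, assuming a dict-like layout:
# tokenizer = repository["tokenizer"]
# max_length = repository["caption_maximum_length"]
```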
Image captioning is a computationally demanding task, so TensorFlow checkpoints are utilized. A checkpoint stores the state of the following TensorFlow objects:
- encoder
- optimizer
- decoder
This makes it possible to interrupt training at any moment and later resume from the trained models. The checkpoint system is also used for validation and evaluation, allowing previously trained models to be reused. Checkpoints are saved in the image_captioning/training_checkpoints/
directory.
To restore my trained model, please download my training checkpoints or train the model yourself; a restore sketch follows below.
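Restoring follows the standard tf.train.Checkpoint pattern, roughly like this (a sketch with stand-in models; the real encoder and decoder are the objects listed above):

```python
import tensorflow as tf

# Stand-ins; in the real code these are the captioning encoder/decoder.
encoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])
decoder = tf.keras.Sequential([tf.keras.layers.GRU(512)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder,
                           optimizer=optimizer)
manager = tf.train.CheckpointManager(
    ckpt, "image_captioning/training_checkpoints", max_to_keep=5)

if manager.latest_checkpoint:
    # Resume from the most recent saved state instead of retraining.
    ckpt.restore(manager.latest_checkpoint)
```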
Possible future improvements:
- train the models on more data
- replace the GRU with an LSTM