This repo contains some scripts written to automate the indexing of a large collection of records. Specifically 78 RPM shellac records. The process for doing this was approximately:
- Photograph each side of every record in a predetermined set (e.g. records from the same label).
- Place these photos into a folder for that set (e.g. 'Set_1'). Then move them into a subdirectory named 'Original'.
- Run the
auto_crop_rename.sh
script in the set directory. - Run the
automatedExtract.sh
script (from the parent directory) to extract text for all the records in that set. - Run the
processText.py
program to collate the details for all the records in the set into a single tsv file. - Manually edit the tsv file to correct any mistakes with the label and catalogue number columns and save this as an ODS.
- Repeat above steps for all record sets.
- Run the
automatedDiscogs.sh
script to create a new ODS sheet for each set with additional information from and links to Discogs. - Run
mergeDiscogs.py
to merge together all the Discogs ODS files. This creates one sheet with all details together.
These scripts are provided only because they might be useful to someone else. They are incomplete in many ways and are not 'production ready' - e.g. error handling is not implemented for the most part.
Sometimes when taking photos a mistake might happen - such as taking two photos of one side or missing a side. In this case the naming sequence will get out of sync. Fixing this manually can be a lot of work depending on how many files need to be renamed. This script automated the process; it doesn't take any arguments and needs to be edited depending on the renaming to be done.
Removes EXIF metadata from all JPG files in the current directly and
reduce the quality to 80. It uses the convert
utility from ImageMagick
for this. This is useful when uploading images as not all websites
automatically remove EXIF details.
Takes all files in the 'Original' directory, crops them to zoom in on the centre label and saves them to a new file in the current directory. The new file has a filename of the current datetime plus a sequence number and the record side (A/B).
Takes a directory name as input and runs the extractText.py
Google
Cloud Vision text extraction program on every jpg file in the directory.
It also checks the required environment variable is set before doing
this.
Runs the requestsDiscogsSearch.py
ODS Discogs program on every ODS
file in the directory. The only special bit in the script is setting the
'IFS' environment variable to consider only newlines (and not spaces)
when iterating through the output of 'ls'.
Using Google Cloud Vision, this takes a filename of an image in the current directory, sends it to Google, takes the recognised text response and puts it into into two text files. One contains the recognised text and the likely language, the other contains each word plus its location (x,y) in the image.
Takes a directory name as input, gets all the records and adds their details to a tsv file (tabbed separated sheet). These details include the filename, text extracted from side A/B, links to the images for both sides and a guess for the label and catalogue number (based on numbers appearing on both A & B sides).
Does the same as processText.py
but is designed for Russian records.
The primary difference is Google Cloud translation is used to create an
English translation of the detected Russian text. This is then added as
an extra column.
Takes an existing ODS file containing processed (and manually corrected) record information. It then sends a request to the Discogs API to search for the record (using the label and catalogue number). If there is a match on Discogs, links to the pages(s) on Discogs are added along with the title for the first hit. This enhanced ODS file is saved with an identical name but with 'Discogs' appended.
The purpose of this script is to merge together multiple ODS files
created by the requestsDiscogsSearch.py
program into a single ODS file -
named 'summary.ODS'.