DEV-1087 Argo Workflows monthly and daily (#41)
- Miscellaneous ease of use changes for the hathifiles generation script.
- Add environment variables for history and redirect directories.
- Remove deprecated docker-compose version tag.
moseshll authored Jun 10, 2024
1 parent a57a54f commit f398155
Showing 6 changed files with 22 additions and 11 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -71,7 +71,7 @@ replaced by.
## Basic usage: add_monthly_and_dump_redirects.rb

```shell
bundle exec ruby bin/add_monthly_and_dump_redirect.rb \
bundle exec ruby bin/add_monthly_and_dump_redirects.rb \
../archive/hathi_full_20211101.txt.gz
```

15 changes: 9 additions & 6 deletions bin/add_monthly_and_dump_redirects.rb
@@ -4,14 +4,17 @@
here = Pathname.new(__dir__)
libdir = here.parent + 'lib'
$LOAD_PATH.unshift libdir
root = here.parent

require 'date'
require 'date_named_file'
require 'logger'
require 'hathifile_history/records'

require "settings"
require "services"

STDOUT.sync = true
LOGGER = Logger.new(STDOUT)
LOGGER = Services[:logger]

def usage
$stderr.puts %Q{
@@ -46,7 +49,7 @@ def usage
end

if hathifile.nil?
hathifile = DateNamedFile.new('hathi_full_%Y%m%d.txt.gz').in_dir('../archive').last.to_s
hathifile = DateNamedFile.new('hathi_full_%Y%m%d.txt.gz').in_dir(Settings.hathifiles_dir).last.to_s
LOGGER.info "No input file given. Using #{hathifile}"
end

@@ -57,9 +60,9 @@ def usage
last_month = DateTime.parse("#{yyyy}-#{mm}-01").prev_month
last_yyyymm = last_month.strftime '%Y%m'

old_history_file ||= root + "history_files" + "#{last_yyyymm}.ndj.gz"
new_history_file ||= root + "history_files" + "#{yyyymm}.ndj.gz"
redirects_file ||= root + "redirects" + "redirects_#{yyyymm}.txt.gz"
old_history_file ||= File.join(Settings.history_files_dir, "#{last_yyyymm}.ndj.gz")
new_history_file ||= File.join(Settings.history_files_dir, "#{yyyymm}.ndj.gz")
redirects_file ||= File.join(Settings.redirects_dir, "redirects_#{yyyymm}.txt.gz")

unless File.exist?(old_history_file)
LOGGER.error "Can't find #{old_history_file} for loading historical data. Aborting."
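The path changes in this file can be tried in isolation. The following is a minimal sketch, not code from the commit: `monthly_paths` is a hypothetical helper, and extracting `yyyy` and `mm` from the file name with a regular expression is an assumption, since the diff does not show how the script obtains them. The `prev_month` and `File.join` calls mirror the changed lines.

```ruby
require "date"

# Hypothetical helper, not part of the commit: derive the three file paths
# from the date embedded in a hathifile name.
def monthly_paths(hathifile, history_dir:, redirects_dir:)
  # Assumption: the year and month come from the hathi_full_YYYYMMDD file name.
  match = File.basename(hathifile).match(/hathi_full_(\d{4})(\d{2})\d{2}/)
  raise ArgumentError, "no date in #{hathifile}" unless match

  yyyy, mm = match[1], match[2]
  yyyymm = "#{yyyy}#{mm}"
  # Same month arithmetic as the script: first of the month, then step back one.
  last_yyyymm = DateTime.parse("#{yyyy}-#{mm}-01").prev_month.strftime("%Y%m")

  {
    old_history_file: File.join(history_dir, "#{last_yyyymm}.ndj.gz"),
    new_history_file: File.join(history_dir, "#{yyyymm}.ndj.gz"),
    redirects_file: File.join(redirects_dir, "redirects_#{yyyymm}.txt.gz")
  }
end

paths = monthly_paths(
  "../archive/hathi_full_20220101.txt.gz",
  history_dir: "/usr/src/app/history_files/",
  redirects_dir: "/usr/src/app/redirects/"
)
puts paths[:old_history_file]
```

`File.join` does not double the separator when a configured directory ends in a slash, as the values in `test.yml` do. It returns a `String` where the old `root + "history_files" + ...` expression produced a `Pathname`; `File.exist?` accepts either.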
2 changes: 2 additions & 0 deletions config/environments/production.yml
@@ -8,3 +8,5 @@ database:
hathifiles_web_path: <%= ENV['HATHIFILES_WEB_PATH'] %>
zephir_dir: <%= ENV['ZEPHIR_DIR'] %>
hathifiles_dir: <%= ENV['HATHIFILES_DIR'] %>
history_files_dir: <%= ENV['HISTORY_FILES_DIR'] %>
redirects_dir: <%= ENV['REDIRECTS_DIR'] %>
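How the two new keys resolve can be checked with the standard library alone. This is a sketch under assumptions: the diff does not show how `Settings` loads this file, so `load_settings` is a hypothetical stand-in that renders the ERB tags and then parses the YAML, and the directory values are made up.

```ruby
require "erb"
require "yaml"

# Hypothetical stand-in for the project's Settings loader:
# render ERB tags first, then parse the result as YAML.
def load_settings(yaml_text)
  YAML.safe_load(ERB.new(yaml_text).result)
end

# The two keys added in this commit, quoted so the heredoc does not interpolate.
PRODUCTION_YML = <<~'YAML'
  history_files_dir: <%= ENV['HISTORY_FILES_DIR'] %>
  redirects_dir: <%= ENV['REDIRECTS_DIR'] %>
YAML

# Illustrative values only; a real deployment would set these in its environment.
ENV["HISTORY_FILES_DIR"] = "/data/history_files"
ENV["REDIRECTS_DIR"] = "/data/redirects"

settings = load_settings(PRODUCTION_YML)
redirects_file = File.join(settings["redirects_dir"], "redirects_202406.txt.gz")
puts redirects_file
```

With this loader, an unset variable renders as an empty string and the key parses as `nil`, so a later `File.join` would raise a `TypeError`. Whether the real `Settings` object behaves the same way is not visible in the diff.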
2 changes: 2 additions & 0 deletions config/environments/test.yml
@@ -2,3 +2,5 @@
hathifiles_web_path: /usr/src/app/tmp_web_hathifiles/
zephir_dir: /usr/src/app/tmp_zephir_dir/
hathifiles_dir: /usr/src/app/tmp_hathifiles_archive/
history_files_dir: /usr/src/app/history_files/
redirects_dir: /usr/src/app/redirects/
2 changes: 0 additions & 2 deletions docker-compose.yml
@@ -1,5 +1,3 @@
version: '3'

x-condition-healthy: &healthy
condition: service_healthy

10 changes: 8 additions & 2 deletions jobs/generate_hathifile.rb
@@ -40,8 +40,12 @@ def run_file(zephir_file)
outfile = File.join(Settings.hathifiles_dir, zephir_file.hathifile)
Services[:logger].info "Outfile: #{outfile}"

Tempfile.create do |fout|
fin.each do |line|
Tempfile.create("hathifiles") do |fout|
Services[:logger].info "writing to tempfile #{fout.path}"
fin.each_with_index do |line, i|
if i % 100_000 == 0
Services[:logger].info "writing line #{i}"
end
BibRecord.new(line).hathifile_records.each do |rec|
fout.puts record_from_bib_record(rec).join("\t")
end
@@ -92,4 +96,6 @@ def record_from_bib_record(rec)
end
end

# Force logger to flush STDOUT on write so we can see what our Argo Workflows are doing.
$stdout.sync = true
GenerateHathifile.new.run if __FILE__ == $PROGRAM_NAME
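The progress-logging pattern added to `run_file` can be exercised on its own. The sketch below is an illustration rather than the job's code: `copy_with_progress` and its `every:` parameter are hypothetical, and a `StringIO`-backed `Logger` stands in for `Services[:logger]`. It keeps the named tempfile prefix and the `each_with_index` modulo check from the diff.

```ruby
require "logger"
require "stringio"
require "tempfile"

# Hypothetical helper: write lines to a named tempfile, logging every Nth index.
def copy_with_progress(lines, logger, every: 100_000)
  Tempfile.create("hathifiles") do |fout|
    logger.info "writing to tempfile #{fout.path}"
    lines.each_with_index do |line, i|
      # The index is zero-based, so the first line always produces a log entry.
      logger.info "writing line #{i}" if (i % every).zero?
      fout.puts line
    end
    fout.flush
    # Return the number of lines written; the file is removed when the block exits.
    File.readlines(fout.path).size
  end
end

log = StringIO.new
count = copy_with_progress(%w[a b c d e], Logger.new(log), every: 2)
puts count
puts log.string
```

A small `every:` value makes the behaviour visible on short inputs; the job itself logs once per 100,000 input lines.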
