Baggers

What do Baggers do?

Baggers do some quality assurance on the dataset to make sure the content is correct and corresponds to what was described in the spreadsheet. Then they package the data into a BagIt file (or "bag"), which includes basic technical metadata, and upload it to its final DataRefuge destination.

Getting set up as a Bagger

  • Apply to become a Bagger
    • By asking your DataRescue guide or by filling out this form
    • Skills recommended: in general, Baggers need to have some tech skills and a good understanding of harvesting goals.
    • Note that an email address is required to apply.
    • Note also that you should be willing to have your real name associated with the datasets, in keeping with archival best practices (see the guidelines on archival best practices for Data Refuge for more information).
  • Credentials, Slack invite, Uncrawlable spreadsheet URL, and other details will be provided once your application is approved.
  • Test the Uploader application http://drp-upload-bagger.herokuapp.com with the credentials provided
    • Make sure to select the right event in the dropdown
  • Verify that you have write access to the #Baggers tab in the Uncrawlable spreadsheet
  • Get set up with Python and the bagit-python command-line script, which you will use to create bags: https://github.com/LibraryOfCongress/bagit-python (an installation sketch follows this list)
  • If you need any assistance:
    • Talk to your DataRescue Guide if you are at an in-person event
    • Or post questions on Slack in the #Baggers channel.
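
For reference, a minimal way to install bagit-python, assuming you already have Python and pip on your machine (if your setup differs, ask your guide or the #Baggers channel):

    # install the bagit package from PyPI, which provides the bagit.py script
    pip install bagit
    # confirm the script is available on your path
    bagit.py --help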

Claiming a dataset for bagging

  • You will work on datasets that have been harvested and then checked by Checkers.
  • Go to the Uncrawlable spreadsheet, click the Baggers tab, and look for a dataset to bag
    • Available datasets are the ones whose cell "Baggers Handle" is empty
    • If an item is already claimed but its "Date Opened or Closed" cell has turned red, it is also available for you to claim (for more details see the last section of this document)
  • Claim it by entering your Slack handle along with the status "Open" and today's date, for instance:
@khdelphine Open 1/22/2017

Downloading & opening the dataset

  • Go to the URL containing the zipped dataset (provided in cell "URL from upload of zip")

  • Download the zip file to your laptop, and unzip it.

  • Quality assurance: spot check to ensure that the UUID and the downloaded materials match the spreadsheet row

  • Confirm the content of the JSON file

    • The JSON should match the information from the Harvester and use the following format:
    {
      "Individual source or seed URL": "http://www.eia.gov/renewable/data.cfm",
      "UUID": "E30FA3CA-C5CB-41D5-8608-0650D1B6F105",
      "id_agency": 2,
      "id_subagency": null,
      "id_org": null,
      "id_suborg": null,
      "Institution facilitating the data capture creation and packaging": "Penn Data Refuge",
      "Date of capture": "2017-01-17",
      "Federal agency data acquired from": "Department of Energy/U.S. Energy Information Administration",
      "Name of resource": "Renewable and Alternative Fuels",
      "File formats contained in package": ".pdf, .zip",
      "Type(s) of content in package": "datasets, codebooks",
      "Free text description of capture process": "Metadata was generated by viewing page and using spreadsheet descriptions where necessary, data was bulk downloaded from the page using wget -r on the seed URL and then bagged.",
      "Name of package creator": "Mallick Hossain and Ben Goldman"
    }

    • If you make any changes, make sure to save the file as .json.
    • Confirm that the JSON file is inside the package along with the dataset(s) (a quick way to check that the file is still valid JSON is sketched after this list)
  • Creating the bag

    • Run the bagit-python command-line script, which creates the bag:
    bagit.py --contact-name '[your name]' /directory/to/bag
    
    • You should be left with a 'data' folder (which contains the downloaded content and metadata file) and four separate BagIt tag files (you can optionally validate the result, as sketched after this list):
      • bag-info.txt
      • bagit.txt
      • manifest-md5.txt
      • tagmanifest-md5.txt
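
If you edited the metadata file, one quick, optional way to confirm that it is still valid JSON is Python's built-in json.tool module (the filename below is just a placeholder; use the actual name of the JSON file in your package):

    # prints the JSON back out if it parses, or an error message if it does not
    python -m json.tool [name-of-metadata-file].json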
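
After creating the bag, you can also ask bagit-python to validate it, i.e. re-check that every file listed in the manifests is present and that its checksum matches (using the same placeholder path as the command above):

    # verifies bag completeness and checksums
    bagit.py --validate /directory/to/bag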

Creating the Zip file and uploading it

  • Zip this entire collection (data folder and bagit files) and confirm that the zip file is named with the row's UUID (see the sketch after this list)
  • Upload the zipped bag using the application http://drp-upload-bagger.herokuapp.com/
    • Make sure to select the name of your event in the dropdown (and "remote" if you are working remotely)
    • Note that files larger than 5 GB cannot be uploaded through this method
      • Please talk to your DataRescue guide or post on Slack in the #Baggers channel if you have a larger file
  • Enter the URL in the "Bag URL" cell of the Uncrawlable spreadsheet
    • The application will return the location URL for your zip file.
    • The syntax will be "[UrlStub]/[UUID].zip"
  • Enter file size in cell "Size of bag"
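
As an illustration only (the directory name is the same placeholder used earlier and the UUID is the one from the example JSON; adapt both to your dataset), creating the zip from the command line could look like:

    # zip the contents of the bag directory (data folder plus the bagit text files)
    cd /directory/to/bag
    zip -r ../E30FA3CA-C5CB-41D5-8608-0650D1B6F105.zip .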

Quality assurance and finishing up

  • To ensure that the bag was uploaded successfully, go to the URL and download the bag back to your laptop.
  • Unzip it, open it, and spot check to make sure that the bag looks well formed and the files seem valid (one optional way to do this is sketched at the end of this document).
  • In the Uncrawlable spreadsheet, make sure you document all the actions you have taken by filling out all the cells.
  • In the Uncrawlable spreadsheet, change the status to "Closed" in the cell "Current Status", for instance:
@khdelphine Closed 1/22/2017
    • If a day or more has passed since you originally claimed the item, update the date to today's date.
    • Note that if more than 2 days have passed since you claimed the dataset and it is still not closed, the **Date field will turn red**, signaling that someone else can claim it in your place and start working on it
      • This avoids datasets getting stuck in the middle of the workflow and never being finalized.
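
For the spot check, one optional approach, assuming you still have bagit-python installed, is to re-validate the bag you downloaded back (the UUID below matches the earlier example; substitute your own):

    # unpack the downloaded bag into a scratch folder and re-check its manifests
    unzip E30FA3CA-C5CB-41D5-8608-0650D1B6F105.zip -d check
    bagit.py --validate check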