This repository has been archived by the owner on Jun 25, 2022. It is now read-only.

Workflow

Karen Majewicz edited this page Jan 5, 2021 · 7 revisions

Gather initial list of landing pages

  1. Collect an inventory of landing page links from https://www.pasda.psu.edu/ as a CSV file, using this script: https://github.com/BTAA-Geospatial-Data-Project/pasda/blob/main/datasetURLs.py

  2. Compare this list with the PASDA records already in the geoportal to identify which items are new.
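
The comparison in the last step can be sketched as a set difference between two CSV exports. This is a minimal illustration, not the project's script: the column names (`landing_page`, `Information`), the sample URLs, and the helper names are all hypothetical and should be swapped for whatever the real files use.

```python
import csv
import io


def normalize(url):
    """Trim whitespace and a trailing slash so equivalent URLs compare equal."""
    return url.strip().rstrip("/")


def read_urls(csv_file, column):
    """Return the set of normalized URLs found in one column of a CSV file."""
    return {normalize(row[column]) for row in csv.DictReader(csv_file) if row.get(column)}


def new_landing_pages(harvested, existing):
    """Return harvested URLs that are not yet in the geoportal, sorted."""
    return sorted(harvested - existing)


# Hypothetical sample data standing in for the two real CSV files.
harvested_csv = io.StringIO(
    "landing_page\n"
    "https://example.org/data?dataset=1\n"
    "https://example.org/data?dataset=2\n"
    "https://example.org/data?dataset=3\n"
)
geoportal_csv = io.StringIO(
    "Title,Information\n"
    "Roads,https://example.org/data?dataset=1\n"
    "Streams,https://example.org/data?dataset=3 \n"
)

harvested = read_urls(harvested_csv, "landing_page")
existing = read_urls(geoportal_csv, "Information")
new_items = new_landing_pages(harvested, existing)
print(new_items)
```

With real files, replace the `io.StringIO` objects with `open(path, newline="", encoding="utf-8")`. Swapping the arguments to the set difference would instead list geoportal records whose landing pages no longer appear on PASDA.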

Process New Records

  1. For NEW items, use HTML parsing (https://github.com/BTAA-Geospatial-Data-Project/pasda/blob/main/soupHTML-to-CSV-Pasda.py) to query PASDA and gather this information:
  • Title
  • Date
  • Publisher
  • Description
  • Metadata Link
  2. Download the metadata files as HTML and store them in this folder: https://github.com/BTAA-Geospatial-Data-Project/pasda/tree/main/metadata

  3. Use HTML parsing to query the HTML files for bounding boxes

  4. Process metadata into the BTAA GeoBlacklight Schema with these differences:

  • For the Download link, duplicate the landing page
  • For the Metadata link, choose HTML and link to the HTML metadata files stored in GitHub.
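
The bounding-box step above might look roughly like the sketch below. It assumes the stored HTML files render FGDC-style labels such as `West_Bounding_Coordinate` followed by a decimal value; that layout is an assumption and should be checked against the actual files. It also uses the standard library's `re` module instead of BeautifulSoup so that it runs without extra dependencies, and the function name `extract_bbox` is hypothetical.

```python
import re

# Assumed FGDC-style labels; the actual PASDA HTML may label these differently.
LABELS = {
    "west": "West_Bounding_Coordinate",
    "east": "East_Bounding_Coordinate",
    "north": "North_Bounding_Coordinate",
    "south": "South_Bounding_Coordinate",
}


def extract_bbox(html):
    """Return (west, south, east, north) as floats, or None if any value is missing."""
    # Crude tag stripping; enough to expose "label: value" text for matching.
    text = re.sub(r"<[^>]+>", " ", html)
    values = {}
    for key, label in LABELS.items():
        match = re.search(re.escape(label) + r"\s*:?\s*(-?\d+(?:\.\d+)?)", text)
        if match is None:
            return None
        values[key] = float(match.group(1))
    return (values["west"], values["south"], values["east"], values["north"])


# Hypothetical fragment standing in for one downloaded metadata file.
sample_html = """
<dl>
  <dt><em>West_Bounding_Coordinate:</em> -80.5199</dt>
  <dt><em>East_Bounding_Coordinate:</em> -74.6895</dt>
  <dt><em>North_Bounding_Coordinate:</em> 42.2698</dt>
  <dt><em>South_Bounding_Coordinate:</em> 39.7198</dt>
</dl>
"""

bbox = extract_bbox(sample_html)
print(bbox)
```

Returning `None` when any coordinate is missing makes incomplete records easy to flag for manual review rather than silently writing a partial bounding box.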

Process Updated Records

TBD
