This repository has been archived by the owner on Jun 25, 2022. It is now read-only.

Workflow

Karen Majewicz edited this page Jan 5, 2021 · 7 revisions

Gather initial list of landing pages

Collect inventory of landing page links from https://www.pasda.psu.edu/ as a CSV file. Use this script: https://github.com/BTAA-Geospatial-Data-Project/pasda/blob/main/datasetURLs.py
Compare this list with the PASDA records already in the geoportal to identify which landing pages are new
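The comparison step above can be sketched as a set difference between two CSV files. This is a minimal illustration, not the project's actual script: the `landing_page` column name and the function names are assumptions, and the geoportal export is assumed to hold one landing-page URL per row.

```python
import csv
import io


def read_urls(csv_file, column="landing_page"):
    """Return the set of landing-page URLs found in one CSV column.

    The column name is hypothetical; adjust it to match the real files.
    """
    reader = csv.DictReader(csv_file)
    # Strip whitespace so stray spaces do not make a known page look new
    return {row[column].strip() for row in reader if row.get(column)}


def find_new(inventory_file, geoportal_file, column="landing_page"):
    """List URLs present in the PASDA inventory but absent from the geoportal."""
    inventory = read_urls(inventory_file, column)
    existing = read_urls(geoportal_file, column)
    return sorted(inventory - existing)


# Placeholder data standing in for the two CSV files
inventory_csv = io.StringIO(
    "landing_page\nhttps://example.org/a\nhttps://example.org/b\n"
)
geoportal_csv = io.StringIO("landing_page\nhttps://example.org/a \n")

new_pages = find_new(inventory_csv, geoportal_csv)
print(new_pages)
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.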

Process New Records

For NEW items, use HTML parsing (https://github.com/BTAA-Geospatial-Data-Project/pasda/blob/main/soupHTML-to-CSV-Pasda.py) to query PASDA and gather this information:

Title
Date
Publisher
Description
Metadata Link

Download the metadata files as HTML and store them in this folder: https://github.com/BTAA-Geospatial-Data-Project/pasda/tree/main/metadata
Use HTML parsing to extract each record's bounding box from the downloaded HTML files
Process metadata into the BTAA GeoBlacklight Schema with these differences:

For the Download link, duplicate the landing page
For the Metadata link, choose HTML and link to the HTML metadata files stored in GitHub.
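The bounding-box and link steps above might look roughly like the sketch below. It rests on several assumptions: that the stored metadata pages label coordinates with the FGDC element names (`West_Bounding_Coordinate` and so on), that tags can be stripped with a simple regular expression instead of a full HTML parser, and that links are written as a GeoBlacklight-style references object. The function names and sample values are hypothetical, and the BTAA schema may name or store these fields differently.

```python
import json
import re

# FGDC element names assumed to appear as labels in the metadata HTML
COORD_LABELS = {
    "west": "West_Bounding_Coordinate",
    "east": "East_Bounding_Coordinate",
    "north": "North_Bounding_Coordinate",
    "south": "South_Bounding_Coordinate",
}


def extract_bbox(html):
    """Return a dict of the four coordinates, or None if any is missing."""
    text = re.sub(r"<[^>]+>", " ", html)  # drop tags, keep visible text
    bbox = {}
    for side, label in COORD_LABELS.items():
        match = re.search(label + r"\s*:?\s*(-?\d+(?:\.\d+)?)", text)
        if match is None:
            return None
        bbox[side] = float(match.group(1))
    return bbox


def build_references(landing_page, metadata_html_url):
    """Serialize the links; the download link duplicates the landing page."""
    return json.dumps(
        {
            "http://schema.org/url": landing_page,
            "http://schema.org/downloadUrl": landing_page,
            "http://www.w3.org/1999/xhtml": metadata_html_url,
        }
    )


# Hypothetical metadata fragment and placeholder URLs
sample_html = """
<dt><em>West_Bounding_Coordinate:</em> -80.52</dt>
<dt><em>East_Bounding_Coordinate:</em> -74.69</dt>
<dt><em>North_Bounding_Coordinate:</em> 42.27</dt>
<dt><em>South_Bounding_Coordinate:</em> 39.72</dt>
"""
bbox = extract_bbox(sample_html)
references = build_references(
    "https://example.org/landing", "https://example.org/metadata/sample.html"
)
print(bbox)
print(references)
```

In practice the second argument to `build_references` would point to the HTML file stored in the GitHub metadata folder. Returning `None` when a coordinate is missing makes incomplete records easy to flag for manual review.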

Process Updated Records

TBD