
Commit

Merge branch 'main' of https://github.com/lorae/roundup
lorae committed Dec 4, 2023
2 parents 93946c4 + 5e0f0d9 commit e50ce44
Showing 2 changed files with 6 additions and 63 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -1,4 +1,6 @@
# The website is active!

View it here: https://roundup.streamlit.app/

# About

@@ -12,7 +14,7 @@ The scripts in this project gather six pieces of information on the most recent
- URL
- Paper number (according to each website's own numbering system)
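
For illustration, one record could be modeled as below (a hypothetical sketch; the field names are illustrative, and the project itself works with data frames rather than this class):

```python
from dataclasses import dataclass

@dataclass
class WorkingPaper:
    """The six pieces of information gathered for each paper."""
    title: str
    authors: str
    abstract: str
    date_published: str
    url: str
    number: str  # per each website's own numbering system
```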

The primary script used in this project is `runall.py`. It cycles through a collection of Python scripts, each tailored to a single website, such as the National Bureau of Economic Research or the International Monetary Fund. The number of scripts in this project is constantly expanding.
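
A minimal sketch of that cycle (the module names and the `scrape()` entry point are illustrative assumptions, not the project's confirmed API):

```python
import importlib

# Illustrative names; the real scraper modules live in roundup_scripts/scrapers/.
for name in ["NBER", "IMF"]:
    scraper = importlib.import_module(f"roundup_scripts.scrapers.{name}")
    df = scraper.scrape()  # assumed: returns a data frame of the site's newest papers
```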

Websites that are scraped for data, as of September 2023, are:

@@ -97,7 +99,7 @@ See below for instructions on how to run the project for the first time and any

4. **View results:**

Open `historic/weekly_data/YYYY-MM-DD-HHMM.html`, where `YYYY-MM-DD-HHMM` is the date, hour, and minute at which you ran the code.
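
A filename stem of that shape can be produced like so (a sketch of one plausible approach, not necessarily the project's exact code):

```python
from datetime import datetime

stamp = datetime.now().strftime("%Y-%m-%d-%H%M")  # e.g. "2023-12-04-0930"
print(f"historic/weekly_data/{stamp}.html")
```
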
# Project Structure
The schematic below illustrates the basic file structure of the project.

@@ -110,17 +112,14 @@ The project directory.
- **runall.py**:
The main script in this project. It loops through each of the scripts in `roundup_scripts/scrapers/XXX.py`, first consulting `scraper_status.txt` to see whether any of the scrapers are turned off. If a scraper is off, it is skipped; if it is on, `runall.py` attempts to run it (if an error occurs during execution, the scraper is switched off for future runs). Running each scraper script gathers a data frame of all the new data available from that website. `runall.py` then invokes the `compare_historic(df)` function from `roundup_scripts/compare.py` to determine which of the working papers have already been seen and which are truly novel; `compare_historic(df)` uses data from `papers_we_have_seen.txt` to make this determination. Once `compare_historic(df)` has executed successfully, new date- and time-stamped files are saved as `historic/weekly_data/YYYY-MM-DD-HHMM.csv`, `historic/weekly_data/YYYY-MM-DD-HHMM.txt`, and `historic/weekly_data/YYYY-MM-DD-HHMM.html`, containing metadata (title, authors, abstract, URL, date published, paper number, and unique paper ID number) on only the working papers that have not previously been scraped by `runall.py`. (A sketch of this flow appears after this file list.)

- **README.md**:
The document you are currently reading.

- **requirements.txt**:
The file needed to set up your venv for this project.

- **scraper_status.txt**:
A file that lists whether each scraper is turned on or off. If a scraper is turned off, `runall.py` will not attempt to run it. `runall.py` also writes to this file, and switches scrapers off when it encounters an error trying to run them.
The purpose of this file is to enable the code to run, even if a few of the scrapers are broken. The changing nature of the websites means that even the most well-coded web scrapers will fail eventually.

- **historic**:
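
As a rough sketch of the `runall.py` flow described in this list (the status-file format and the per-scraper `scrape()` entry point are assumptions for illustration; only the file names and `compare_historic()` come from this README):

```python
import importlib
from datetime import datetime

import pandas as pd

from roundup_scripts.compare import compare_historic  # described above

def load_status(path="scraper_status.txt"):
    # Assumed format: one "Name: on" or "Name: off" entry per line.
    status = {}
    with open(path) as f:
        for line in f:
            name, _, state = line.strip().partition(":")
            status[name.strip()] = state.strip().lower() == "on"
    return status

def save_status(status, path="scraper_status.txt"):
    with open(path, "w") as f:
        for name, is_on in status.items():
            f.write(f"{name}: {'on' if is_on else 'off'}\n")

status = load_status()
frames = []
for name in [n for n, on in status.items() if on]:
    try:
        scraper = importlib.import_module(f"roundup_scripts.scrapers.{name}")
        frames.append(scraper.scrape())  # scrape() is an assumed entry point
    except Exception:
        status[name] = False  # switch a failing scraper off for future runs
save_status(status)

df = pd.concat(frames, ignore_index=True)
new_papers = compare_historic(df)  # assumed here to return only unseen papers

stamp = datetime.now().strftime("%Y-%m-%d-%H%M")
new_papers.to_csv(f"historic/weekly_data/{stamp}.csv", index=False)
```

The real script also writes `.txt` and `.html` outputs with the same filename stem.
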
56 changes: 0 additions & 56 deletions troubleshooter.py

This file was deleted.
