Annual GBIF stats #44

Open
PipBrewer opened this issue Dec 22, 2024 · 6 comments

@PipBrewer

PipBrewer commented Dec 22, 2024

Every year we need to write a report to the Ministry. For this, we need statistics.

List of publications citing specimens from the following institutions in 2024:
• Natural History Museum Denmark / Statens Naturhistorisk Museum / NHMD / SNM
• Natural History Museum Aarhus / Naturhistorisk Museum Aarhus / NHMA
• Aarhus University Herbarium / AAU Herbarium / Aarhus Universitet Herbarium / Science Museums Aarhus / Science Museerne Aarhus / AAU / AU
• Naturama
• Fiskeri- og Søfartsmuseet / Fishing and Maritime Museum / Fisheries and Maritime Museum
• Museum Salling / Fur Museum
• Fossil and Mo-clay Museum, Museum Mors
• Museum Sønderjylland / Museum of Southern Jutland
• Østsjællands Museum / Geomuseum Faxe / East Zealand Museum

The list should contain the following details in a spreadsheet:
• Author last name(s)
• Author first name(s)
• Year of publication
• Title of article
• Title of publication
• Publisher of research (organizing/publishing house)
• Source
• Journal citation details (volume, issue, page numbers)
• DOI

The list should be specific to each institution if possible. The purpose of the list is to let us interrogate the results rather than simply accept the number produced by GBIF, which may include publications we shouldn't count. The list can also be used to answer queries about impact.
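As a rough illustration of the spreadsheet layout described above, the requested columns could be written out as a CSV file, one file per institution. This is only a sketch: the column names are taken from the list, while the function name `write_publication_rows` and the example row are hypothetical and not a real citation.

```python
import csv
import io

# Column headers taken from the list of requested details.
COLUMNS = [
    "Author last name(s)",
    "Author first name(s)",
    "Year of publication",
    "Title of article",
    "Title of publication",
    "Publisher of research",
    "Source",
    "Journal citation details",
    "DOI",
]


def write_publication_rows(rows, handle):
    """Write publication rows (dicts keyed by COLUMNS) as CSV.

    Fields missing from a row are left blank; unknown keys are ignored.
    """
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical placeholder row, not a real publication.
example_rows = [
    {
        "Author last name(s)": "Example",
        "Year of publication": 2024,
        "DOI": "10.0000/example",
    }
]
buffer = io.StringIO()
write_publication_rows(example_rows, buffer)
```

Writing to a real file instead of `io.StringIO` only requires passing a handle opened with `open(path, "w", newline="", encoding="utf-8")`.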

Number of specimens in GBIF from the following institutions as of 01/02/2025:
• Natural History Museum Denmark / Statens Naturhistorisk Museum / NHMD / SNM
• Natural History Museum Aarhus / Naturhistorisk Museum Aarhus / NHMA
• Aarhus University Herbarium / AAU Herbarium / Aarhus Universitet Herbarium / Science Museums Aarhus / Science Museerne Aarhus / AAU / AU
• Naturama
• Fiskeri- og Søfartsmuseet / Fishing and Maritime Museum / Fisheries and Maritime Museum
• Museum Salling / Fur Museum
• Fossil and Mo-clay Museum, Museum Mors
• Museum Sønderjylland / Museum of Southern Jutland
• Østsjællands Museum / Geomuseum Faxe / East Zealand Museum

The results should say the number of physical specimens (i.e., a preserved specimen) held at the above institutions and should be listed for each institution.

There will be a mix of static and dynamic (i.e., derived directly from Specify) datasets. Fedor estimated that the number of NHMD specimens published from Specify was 817,153 at the end of 2023, so the total number should now be higher. Datasets that include specimens not owned by NHMD should be excluded, and static datasets that overlap heavily with dynamic ones should be discarded to avoid duplication.

Astrid helped produce some of these stats previously and so should be able to advise. I suspect that Kim S will be putting the report together this year, so if this takes until after 20th January, the stats should be sent directly to him.

@beckerah

beckerah commented Jan 2, 2025

Institution codes pulled from GRSciColl:

• Natural History Museum Denmark / Statens Naturhistorisk Museum : NHMD
• Natural History Museum Aarhus / Naturhistorisk Museum Aarhus : NHMA
• Aarhus University Herbarium / AAU Herbarium / Aarhus Universitet Herbarium / Science Museums Aarhus / Science Museerne Aarhus : AAU
• Naturama (Switzerland) : NAAG
• Fiskeri- og Søfartsmuseet / Fishing and Maritime Museum / Fisheries and Maritime Museum : ???
• Museum Salling / Fur Museum : ???
• Fossil and Mo-clay Museum, Museum Mors : ???
• Museum Sønderjylland / Museum of Southern Jutland : ???
• Østsjællands Museum / Geomuseum Faxe / East Zealand Museum : ???

@beckerah

Most of these institutions will not exist in GBIF yet, so there will be no stats for them.
The DaSSCo/Reports and statistics folder contains info from previous years. See especially the GBIF Statistics and Annual reporting to Ministry subfolders.
NHMD and NHMA are publishing dynamically to GBIF but AU is not, so the AU datasets will all still be static.
For NHMA: there is always an offset between Specify and GBIF. We may need to ask Fedor to take out dummy records for the Specify report. We want most of the data in GBIF; only things we can't publish (embargoes, etc.) shouldn't be there.
In GBIF, there will also be static datasets for NHMD and NHMA (e.g., data from before everyone was using Specify). Isabel and Fedor are actively trying to replace these static datasets with dynamic ones, but not all have been removed yet. This is where we might see some duplication of specimens.
Make sure to only include stats for 2024, and nothing from the beginning of 2025.
Pip is going to add additional documentation (past reports, etc.) to the reports and stats folder mentioned above.

@beckerah

I have put together a Python script to download the specimen data. It should be easy to re-use for the publications: one just needs to update the variables for institutions, date, and GBIF login. Now I need to take a look at the data and do some exploration and processing. Scripts and downloads are currently on my local machine at Documents\scripts\annualStatsReporting
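The script itself is not included in this thread. For readers who want a starting point, here is a minimal sketch of how a request body for the GBIF occurrence download API could be assembled. It is an assumption about one possible approach, not the script described above; the function name `build_download_request`, the username, and the e-mail address are placeholders. The body would be POSTed to `https://api.gbif.org/v1/occurrence/download/request` with HTTP basic authentication using a GBIF account.

```python
import json


def build_download_request(creator, email, institution_codes):
    """Build the JSON body for a GBIF occurrence download request.

    Filters on preserved specimens from the given institution codes.
    """
    return {
        "creator": creator,  # GBIF username (placeholder in this sketch)
        "notificationAddresses": [email],
        "sendNotification": True,
        "format": "SIMPLE_CSV",
        "predicate": {
            "type": "and",
            "predicates": [
                {
                    "type": "equals",
                    "key": "BASIS_OF_RECORD",
                    "value": "PRESERVED_SPECIMEN",
                },
                {
                    "type": "in",
                    "key": "INSTITUTION_CODE",
                    "values": list(institution_codes),
                },
            ],
        },
    }


# Institution codes as listed from GRSciColl earlier in this thread.
payload = build_download_request(
    "my_gbif_username", "me@example.org", ["NHMD", "NHMA", "AAU"]
)
print(json.dumps(payload, indent=2))
```

Keeping the payload construction separate from the network call makes it easy to swap the institution filter for a different predicate key, such as a publisher filter, without touching the login handling.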

@beckerah

beckerah commented Jan 31, 2025

Problems to solve:

  • Some datasets published by NHMD have only some occurrences with the institution code NHMD
  • Some datasets published by NHMD have 0 occurrences with the institution code NHMD (these don't come up in my initial search at all)
  • It seems like I may need a complete list of datasets to look for

@beckerah

I have a workflow for occurrences of preserved specimens:

  • Search by publisher instead of institution
  • Create list of duplicate occurrences by catalogNumber
  • Group occurrences by publisher and dataset
  • Return counts (total, unique, duplicated) by publisher and dataset
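The workflow above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the actual processing script: it assumes each occurrence is a dict with `publisher`, `datasetKey`, and `catalogNumber` keys, and it treats any catalogNumber that appears more than once in the whole download as duplicated. The function name `summarise` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def summarise(occurrences):
    """Return {(publisher, datasetKey): {"total", "duplicated", "unique"}}."""
    # Count how often each catalogNumber appears across the whole download.
    seen = Counter(o["catalogNumber"] for o in occurrences)
    summary = defaultdict(lambda: {"total": 0, "duplicated": 0, "unique": 0})
    for o in occurrences:
        row = summary[(o["publisher"], o["datasetKey"])]
        row["total"] += 1
        if seen[o["catalogNumber"]] > 1:
            row["duplicated"] += 1
        else:
            row["unique"] += 1
    return dict(summary)


# Hypothetical sample records standing in for a GBIF download.
sample = [
    {"publisher": "NHMD", "datasetKey": "static-1", "catalogNumber": "A1"},
    {"publisher": "NHMD", "datasetKey": "dynamic-1", "catalogNumber": "A1"},
    {"publisher": "NHMD", "datasetKey": "dynamic-1", "catalogNumber": "A2"},
    {"publisher": "NHMA", "datasetKey": "dynamic-2", "catalogNumber": "B1"},
]
counts = summarise(sample)
```

Matching on catalogNumber alone may flag unrelated specimens from different collections that happen to share a number, so a combined key (for example institution code plus catalogNumber) might be worth considering.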

I need to add into the processing script:

  • Get count of duplicates by dataset, add this to dataset summary df
  • Create 'unique' column in dataset summary df that subtracts count of dupes from total count for each dataset
  • Create 'unique' column in publisher summary df that subtracts count of dupes from total count for each publisher
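A minimal pandas sketch of these three additions is shown below. The column names (`publisher`, `datasetKey`, `catalogNumber`, `is_dupe`) and the sample rows are assumptions for illustration; the real summary DataFrames may be structured differently.

```python
import pandas as pd

# Hypothetical occurrence rows; real data would come from the GBIF download.
occ = pd.DataFrame(
    {
        "publisher": ["NHMD", "NHMD", "NHMD", "NHMA"],
        "datasetKey": ["static-1", "dynamic-1", "dynamic-1", "dynamic-2"],
        "catalogNumber": ["A1", "A1", "A2", "B1"],
    }
)

# Flag every row whose catalogNumber occurs more than once in the download.
occ["is_dupe"] = occ.duplicated("catalogNumber", keep=False)

# Dataset summary: total rows and duplicate rows per publisher and dataset.
dataset_summary = (
    occ.groupby(["publisher", "datasetKey"])
    .agg(total=("catalogNumber", "size"), dupes=("is_dupe", "sum"))
    .reset_index()
)
dataset_summary["unique"] = dataset_summary["total"] - dataset_summary["dupes"]

# Publisher summary: roll the dataset counts up, then subtract the same way.
publisher_summary = (
    dataset_summary.groupby("publisher")[["total", "dupes"]].sum().reset_index()
)
publisher_summary["unique"] = (
    publisher_summary["total"] - publisher_summary["dupes"]
)
```

Note that subtracting every duplicated row removes all copies of a specimen, so a specimen present in both a static and a dynamic dataset is not counted at all. If the goal is the number of physical specimens, counting distinct values (for example `occ.groupby("publisher")["catalogNumber"].nunique()`) would keep one copy of each.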

@beckerah

Once the processing script is updated and the workflow is better documented, I will pull the list of publications.
