-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPDD-data-vis.Rmd
490 lines (357 loc) · 18.7 KB
/
PDD-data-vis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
---
title: "PDD data visualisation"
author:
name: Bevan Weir
affiliation: "Manaaki Whenua - Landcare Research"
date: "data as of 11 November 2024"
output:
html_document:
code_folding: hide
df_print: default
toc: true
toc_float: true
pdf_document:
toc: true
always_allow_html: true
subject: Nationally significant collections
---
```{r setup, include=FALSE}
#load all packages needed
library(tidyverse)
library(lubridate)
library(gridExtra)
library(maps)
library(mapdata)
library(dplyr)
library(knitr)
library(kableExtra)
library(sf)
library(ggspatial)
```
## Introduction
This is an R Markdown document for statistics derived from specimen data of the PDD Fungarium <https://www.landcareresearch.co.nz/pdd>. This is an experimental showcase of presenting visualisations of data analysed in R in a user-friendly format. The PDD is the national fungal collection of Aotearoa New Zealand with a focus on plant pathogens and fungi including 'mushrooms'.
```{r import, echo=TRUE, message=FALSE, warning=FALSE}
#import the files
PDD.as.imported.df <- read_csv("PDD-export-11-nov-2024.csv",
guess_max = Inf,
show_col_types = FALSE
) |>
distinct(AccessionNumber, .keep_all= TRUE) #remove dupes
PDD.df <- PDD.as.imported.df |>
mutate(Class_C1 = str_remove(Class_C1, " .*")) |>
mutate(Phylum_C1 = str_remove(Phylum_C1, " .*")) |>
mutate(specimen_kind = case_when(Phylum_C1 == 'Ascomycota' ~ 'ascomycetes',
Class_C1 == 'Agaricomycetes' ~ 'mushrooms',
Class_C1 == 'Agaricostilbomycetes' ~ 'other',
Class_C1 == 'Atractiellomycetes' ~ 'other',
Class_C1 == 'Basidiomycetes' ~ 'mushrooms',
Class_C1 == 'Classiculomycetes' ~ 'rusts and smuts',
Class_C1 == 'Cystobasidiomycetes' ~ 'rusts and smuts',
Class_C1 == 'Dacrymycetes Doweld' ~ 'mushrooms',
Class_C1 == 'Exobasidiomycetes' ~ 'rusts and smuts',
Class_C1 == 'Microbotryomycetes' ~ 'rusts and smuts',
Class_C1 == 'Pucciniomycetes' ~ 'rusts and smuts',
Class_C1 == 'Tremellomycetes' ~ 'mushrooms',
Class_C1 == 'Ustilaginomycetes' ~ 'rusts and smuts',
TRUE ~ 'other')) |>
filter(Deaccessioned == "FALSE") |>
mutate(date.collected = ymd(CollectionDateFromISO_CE1, truncated = 3))
#how many specimens are public
PDD.public <- PDD.as.imported.df |>
filter(SpecimenSecurityLevelText == "Public")
#create a NZ filter
PDD.NZ.df <- PDD.df |>
filter(StandardCountry_CE1 == "New Zealand")
```
## General statistics
The PDD database has `r count(PDD.as.imported.df)` entries, but after removing specimens those that are 'deaccessioned' there are **`r count(PDD.df)` specimens**. The total number of publicly available specimens is `r count(PDD.public)`. Most of the stats below use the 'total minus deaccessioned' number as the base data.
The table below lists the number of *Ascomycetes*; *'mushrooms'* as a broad category including mushrooms, brackets, paint-splash, clubs, puffballs, stinkhorns and allies (subphylum agaricomycotina); *rusts and smuts*, and *'other'* including chromists, slime moulds, bacteria... These stats include the entire collection and split by type specimens and New Zealand specimens.
```{r tables, echo=TRUE, message=FALSE, warning=FALSE}
total.table <- PDD.df |> # this one gets a total count
group_by(specimen_kind) |>
dplyr::summarize(
All.PDD = n(),
.groups = "drop"
)
types.table <- PDD.df |> # this one gets a types count
filter(TypeStatus != "") |>
group_by(specimen_kind) |>
dplyr::summarize(
Types = n(),
.groups = "drop"
)
NZ.table <- PDD.df |> # this one gets a New Zealand count
filter(StandardCountry_CE1 == "New Zealand") |>
group_by(specimen_kind) |>
dplyr::summarize(
NZ = n(),
.groups = "drop"
)
NZtype.table <- PDD.df |> # NZ and types
filter(StandardCountry_CE1 == "New Zealand" & TypeStatus != "") |>
group_by(specimen_kind) |>
dplyr::summarize(
NZ.Types = n(),
.groups = "drop"
)
PDD.count.table <- left_join(total.table, types.table) %>% #need old pipe due to extract left dot
left_join(NZ.table) %>%
left_join(NZtype.table) %>%
bind_rows(summarise(.,
across(where(is.numeric), sum),
across(where(is.character), ~"Total")))
kable(PDD.count.table, digits=0, "html",
#caption = "table of specimen types",
col.names = gsub("[.]", " ", names(PDD.count.table)))|> #replaces dots in names with spaces
kable_styling("striped", "bordered") |>
add_header_above(c(" " = 1, "number of specimens" = 4))
```
## Type specimens
```{r types, echo=TRUE, warning=FALSE}
PDD.types <- subset(PDD.df,!(TypeStatus == ""))
```
The PDD has **`r count(PDD.types)` type specimens** specimens. There are different 'kinds' of types listed below:
```{r types chart, echo=TRUE}
#barchart of all PDD types sorted by 'kind' of type, coloured by kind of organism
positions <- c("Type", "Holotype", "Paratype", "Isotype", "Neotype", "Isoneotype", "Epitype", "Isoepitype", "Syntype", "Lectotype", "Isolectotype")
ggplot(PDD.types, aes(TypeStatus, fill=specimen_kind)) +
labs(title = "Types in the PDD collection") +
labs(x = "'Kind' of type", y = "number of specimens") +
geom_bar() +
coord_flip() +
theme_bw()+
theme(legend.position.inside = c(0.85, 0.75)) +
scale_fill_brewer(palette = "Set2") +
scale_x_discrete(limits = positions)
#breaks = c("0","25","50","100","200","300","350")
```
## Extended specimen data
Extended specimen data is data beyond the basic taxonomic and and location. See: <https://academic.oup.com/bioscience/article/70/1/23/5637849> The charts below show the proportion of PDD specimens that have DNA sequences lodged in GenBank global repository, images (usually of specimen plates), and when a specimen is specifically cited in literature.
```{r Extended specimen data, echo=TRUE}
#sub set out the interesting columns using "select" in dpylr
only.ext.specimen <- select(PDD.df, "AccessionNumber","specimen_kind", "GenBank", "Literature", "Images")
#create two new columns using "pivot_longer" into a tibble table
tibble.ext.specimen <-pivot_longer(only.ext.specimen, cols=3:5, names_to="ExtendedSpecimen", values_to="present")
# ** specimens with extended specimen data
ggplot(tibble.ext.specimen, aes(ExtendedSpecimen, fill=present)) +
labs(title = "specimens with GenBank DNA sequences, images, and citations in literature") +
labs(x = "extended specimen data", y = "proportion of specimens with this data") +
theme_bw() +
scale_fill_brewer(palette = "Paired") +
geom_bar(position = "fill")
```
This data can be split by the type of specimen (Bacteria, Fungi, Chromist, Yeast) to show the relative proportions of extended specimen data for each of these.
```{r Extended specimen data 2, echo=TRUE}
# ** specimens with extended specimen data by ExtendedSpecimen with specimen_kind facet
ggplot(tibble.ext.specimen, aes(ExtendedSpecimen, fill=present)) +
labs(title = "specimens with GenBank DNA sequences, images, and citations in literature") +
labs(x = "Taxon", y = "number of specimens") +
scale_fill_brewer(palette = "Paired") +
geom_bar() +
theme_bw() +
theme(legend.position.inside = c(0.92, 0.6)) +
facet_grid(cols = vars(specimen_kind))
```
## Host plants
Many PDD specimens are associated with a host plant, for example ectomycorrhiza or plant pathogens. Below are the top 10 most common host plant families in the PDD with type of organism. The large number of Nothofagaceae hosts likely relates to the abundant ectomycorrhizal mushrooms on these trees.
```{r top 10 host families, echo=TRUE}
#barchart of to 10 host families sorted by 'kind' of type, coloured by kind of organism
positions.10hosts <- c("Nothofagaceae", "Gramineae", "Myrtaceae", "Compositae", "Leguminosae", "Pinaceae", "Rosaceae", "Cyperaceae", "Podocarpaceae", "Arecaceae")
PDD.10hosts <- subset(PDD.df, (Family_C2 == "Nothofagaceae" | Family_C2 == "Gramineae" | Family_C2 == "Myrtaceae" | Family_C2 == "Compositae" | Family_C2 == "Leguminosae" | Family_C2 == "Pinaceae" | Family_C2 == "Rosaceae" | Family_C2 == "Cyperaceae" | Family_C2 == "Podocarpaceae" | Family_C2 == "Arecaceae"))
ggplot(PDD.10hosts, aes(Family_C2, fill=specimen_kind)) +
labs(title = "Top 10 host plant Families in the PDD specimen collection") +
labs(x = "Host plant Family", y = "number of specimens") +
geom_bar() +
coord_flip() +
theme_bw() +
theme(legend.position.inside = c(0.85, 0.75)) +
scale_fill_brewer(palette = "Set2") +
scale_x_discrete(limits = positions.10hosts)
```
## Geography
PDD specimens have been collected from around the world.
### World distribution
```{r world map, echo=TRUE, warning=FALSE}
world <- map_data("world")
ggplot() +
geom_polygon(data = world, aes(x=long, y = lat, group = group), fill = "grey") +
theme_void() +
geom_point(data = PDD.df, aes(x = DecimalLongitude_CE1, y = DecimalLatitude_CE1), color = "red", size = 1, alpha = 0.2) +
coord_fixed(1.3)
```
The top 10 countries represented in the PDD with type of samples are shown below.
```{r top countries, echo=TRUE}
positions <- c("United States", "Australia", "Germany", "United Kingdom",
"Canada", "Austria", "Fiji", "Cook Islands", "Solomon Islands", "Vanuatu")
PDD.10county <- subset(PDD.df, (StandardCountry_CE1 == "United States" |
StandardCountry_CE1 == "Australia" | StandardCountry_CE1 == "Germany" |
StandardCountry_CE1 == "United Kingdom" | StandardCountry_CE1 == "Canada" |
StandardCountry_CE1 == "Austria" | StandardCountry_CE1 == "Fiji" |
StandardCountry_CE1 == "Cook Islands" | StandardCountry_CE1 == "Solomon Islands" |
StandardCountry_CE1 == "Vanuatu"))
ggplot(PDD.10county, aes(StandardCountry_CE1, fill=specimen_kind)) +
labs(title = "Top 10 Countries in the PDD (other than NZ)") +
labs(x = "StandardCountry_CE1", y = "number of specimens") +
theme(axis.text.x=element_text(angle=-90, hjust=0)) +
geom_bar() +
coord_flip() +
theme_bw() +
theme(legend.position.inside = c(0.85, 0.75)) +
scale_fill_brewer(palette = "Set2") +
scale_x_discrete(limits = positions)
```
### New Zealand distribution
Each of the specimens from New Zealand that have a geocoordinate are plotted below. An interactive map can be found on GBIF at: <https://www.gbif.org/dataset/3c6e7390-3b56-11dc-8c19-b8a03c50a862>.
```{r nz map, echo=TRUE, message=FALSE, warning=FALSE}
# Loading in data
#Reading in a NZ specific map
nz.sf <- st_read(dsn = "./data/coastline/coastline.shp", quiet = TRUE) |>
st_transform(2193) #Setting map projection - NZGD2000
#Transforming to an SF object
rhizoNZ.sf <- PDD.df |>
filter(!is.na(DecimalLatitude_CE1)) |> #Removing missing obs as sf doesn't play with these
st_as_sf(coords = c("DecimalLongitude_CE1", "DecimalLatitude_CE1")) |> #Defining what the coord columns are
st_set_crs(4326) |> #Telling sf it is in WSG84 projection
st_transform(2193) |> #Changing it to NZGD2000 to match coastline polygon
st_crop(st_bbox(nz.sf)) #Cropping out points that are outside the coastline polygons bounding box (e.g. not NZ)
#Plotting - quite slow
ggplot() +
geom_sf(data = nz.sf) +
geom_sf(data = rhizoNZ.sf, aes(colour = StandardNewZealandAreaCodes_CE1),
size = 1, alpha = 0.2, show.legend = FALSE) +
annotation_scale(location = "br") +
annotation_north_arrow(location = "tl",
which_north = "true",
style = north_arrow_fancy_orienteering) +
theme_minimal()
```
There are various ways to divide Aotearoa New Zealand in regions. The PDD uses 'Crosby Codes' which were developed for for recording specimen localities in the New Zealand. The PDD is based in Auckland explaining the large number of collections from this region.
```{r top NZ area codes, echo=TRUE, warning=FALSE}
#New Zealand Area codes
positions <- c("New Zealand", "Campbell Island", "Auckland Islands", "Snares Islands", "Chatham Islands", "Stewart Island", "Southland", "Fiordland", "Dunedin", "Central Otago", "Otago Lakes", "South Canterbury", "Mackenzie", "Westland", "Mid Canterbury", "North Canterbury", "Buller", "Kaikoura", "Marlborough", "Nelson", "Marlborough Sounds", "South Island", "Wairarapa", "Wellington", "Hawkes Bay", "Rangitikei", "Wanganui", "Gisborne", "Taupo", "Taranaki", "Bay of Plenty", "Waikato", "Coromandel", "Auckland", "Northland", "North Island", "Three Kings Islands", "Kermadec Islands")
ggplot(PDD.NZ.df, aes(StandardNewZealandAreaCodes_CE1)) +
labs(title = "PDD specimens by NZ region") +
labs(x = "Crosby Region", y = "number of specimens") +
theme(axis.text.x=element_text(angle=-90, hjust=0)) +
geom_bar() +
theme_bw() +
coord_flip() +
scale_x_discrete(limits = positions)
```
## Timeline
```{r calulate earliest specimen dates, echo=TRUE, message=FALSE}
#this is just for data checking e.g. ranking the early ones don't display in the final version
#all specimens sorted by date. Add a new column date.collected
PDD.df |>
arrange(date.collected) |>
select("AccessionNumber","specimen_kind", "CurrentNamePart_C1", "StandardCountry_CE1", "date.collected") |>
slice_head(n=5)
#New Zealand specimens sorted by date. Add a new column date.collected
PDD.NZ.df |>
arrange(date.collected) |>
select("AccessionNumber","specimen_kind", "CurrentNamePart_C1", "StandardCountry_CE1", "date.collected") |>
slice_head(n=5)
```
- The earliest specimen in PDD is [*Neodictyopus dictyopus* PDD 28052](https://scd.landcareresearch.co.nz/Specimen/PDD_28052) collected in 1840 from Brazil
- The earliest New Zealand ascomycete specimen is [*Pseudocyphellaria billardierei* PDD 71605](https://scd.landcareresearch.co.nz/Specimen/PDD_71605) collected in 1843
- The earliest New Zealand mushroom specimen is [*Peniophora polygonia* PDD 39832](https://scd.landcareresearch.co.nz/Specimen/PDD_39832) collected in 1848
- The earliest New Zealand rust specimen is [*Puccinia geranii-pilosi* PDD 40803](https://scd.landcareresearch.co.nz/Specimen/PDD_40803) collected in 1850
### Collections and deposits over time
```{r all collected dates chart, echo=TRUE, warning=FALSE}
#standard all PDD overtime stats faceted
ggplot(PDD.df, aes(date.collected, fill = specimen_kind)) +
labs(title = "Collection dates of PDD specimens") +
labs(x = "Date of Collection", y = "Number of specimens" , fill = "") +
scale_fill_brewer(palette = "Set2") +
scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
theme_bw() +
geom_histogram(binwidth=365.25, show.legend = FALSE) +
facet_grid(specimen_kind ~ ., scales = "free")
```
### Collection month
In the New Zealand autumn, May is the peak mushroom fruiting season explain the number of fungi collected in this month. In general there is a trend of fewer Collections in Winter and more in Summer. The high numbers in January may be an artifact of a prior way recording the data in the database which if the month was unknown it would be recorded as January.
```{r NZ collected month chart, echo=TRUE, message=FALSE, warning=FALSE}
#new month PDD specimen collected (in NZ)
month.collected <- ymd(PDD.NZ.df$CollectionDateFromISO_CE1, truncated = 1)
mergemonths <- floor_date(month.collected, unit = "month")
ggplot(PDD.NZ.df, aes(month(mergemonths, label = TRUE), fill = specimen_kind)) +
labs(title = "Collection month of PDD specimens from NZ") +
labs(x = "Month of Collection", y = "Number of specimens" , fill = "") +
scale_fill_brewer(palette = "Set2") +
theme_bw() +
geom_bar() +
scale_x_discrete(na.translate = FALSE) # this removes NAs
```
## People
Top 10 collectors and identifiers of PDD specimens
```{r top collectors etc., echo=TRUE, message=FALSE, warning=FALSE}
#should combine these to a single table using full_join ?
#topCollector <- PDD.df |>
# count(StandardCollector, sort = TRUE) |>
# drop_na() |>
# slice_head(n=10)
#
#kable(topCollector, digits=0, "html",
# caption = "top collectors") |>
# kable_styling("striped", "bordered")
topCollector <- PDD.df |>
count(StandardCollector_CE1, sort = TRUE) |>
drop_na() |>
slice_head(n=10)
topDeterminer <- PDD.df |>
count(StandardDeterminer_C1, sort = TRUE) |>
drop_na() |>
slice_head(n=10)
people.table <- full_join(topCollector, topDeterminer,
by = c("StandardCollector_CE1" = "StandardDeterminer_C1" ))
kable(people.table, digits=0, "html",
#caption = "table of specimen types",
col.names = c('Person name', 'Collected', 'Determined')
)|> #replaces dots in names with spaces
kable_styling("striped", "bordered") |>
add_header_above(c(" " = 1, "number of events" = 2))
```
## Curation stats
These stats may be useful for curation or data checking for PDD staff.
### Last updated user
```{r last updated user chart, echo=TRUE}
PDD.df$LastUpdatedBy <- sapply(PDD.df$UpdatedBy, tolower)
ggplot(PDD.df, aes(LastUpdatedBy, fill = specimen_kind)) +
labs(title = "Last user to update an PDD record") +
labs(x = "User", y = "number of specimens") +
theme(axis.text.x=element_text(angle=-90, hjust=0)) +
theme_bw() +
geom_bar() +
scale_fill_brewer(palette = "Set2") +
coord_flip()
```
### Security levels
```{r table of security levels, echo=TRUE}
#table of security levels
table.security <- PDD.df |>
group_by(SpecimenSecurityLevelText) |>
tally()
kable(table.security, caption = "table of security levels") #knitr
```
### Biostatus
```{r table of biostatus, echo=TRUE}
#table of security levels
table.security <- PDD.df |>
group_by(OccurrenceDescription_C1) |>
tally()
kable(table.security, caption = "table of biostatus Occurrence") #knitr
#table of security levels
table.security <- PDD.df |>
group_by(BiostatusDescription_C1) |>
tally()
kable(table.security, caption = "table of biostatus description") #knitr
```
```{r testing3, echo=TRUE}
#knitr::include_url("https://public.flourish.studio/visualisation/7745803/")
```
### Head of the dataset
This table shows the first few rows of the full dataset. use the arrows to explore across the columns.
```{r testing, echo=TRUE}
rmarkdown::paged_table(head(PDD.df))
```
This document was compiled on `r date()` using R `r getRversion()`, ggplot2 `r packageVersion("ggplot2")`, and knitr `r packageVersion("knitr")`. All figures and code are licensed as [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) attributed to Bevan Weir. Source code is available at <https://github.com/onco-p53/CIS>.