---
title: "genesysr Tutorial"
author: "Matija Obreza & Nora Castaneda"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{genesysr Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Querying Genesys PGR
=====================

[Genesys PGR](https://www.genesys-pgr.org) is the global database on plant genetic resources
maintained *ex situ* in national, regional and international genebanks around the world.

**genesysr** uses the [Genesys API](https://www.genesys-pgr.org/documentation/apis) to query Genesys data.
The API is accessible at https://api.genesys-pgr.org.

Accessing data with **genesysr** is similar to downloading data in CSV or Excel format and loading
it into R.

## For the impatient

Accession passport data is retrieved with the `get_accessions` function.

The database is queried by providing a `filter` (see Filters below):

```r
## Setup: use Genesys Sandbox environment
# genesysr::setup_sandbox() # Use this to connect to our test environment https://sandbox.genesys-pgr.org
# genesysr::setup_production() # This is initialized by default when loading genesysr

# Open a browser: login to Genesys and authorize access
genesysr::user_login()

# Retrieve at least 1000 accessions for genus *Musa*
musa <- get_accessions(filters = list(taxonomy = list(genus = c('Musa'))), at.least = 1000)
# Or retrieve all accession data for genus *Musa*
musa <- get_accessions(filters = list(taxonomy = list(genus = c('Musa'))))

# Retrieve all accession data for the Musa International Transit Center, Bioversity International
itc <- get_accessions(list(institute = list(code = c('BEL084'))))

# Retrieve all accession data for the Musa International Transit Center, Bioversity International (BEL084) and the International Center for Tropical Agriculture (COL003)
some <- get_accessions(list(institute = list(code = c('BEL084','COL003'))))
```

**genesysr** provides utility functions to create `filter` objects using [Multi-Crop Passport Descriptors (MCPD)](https://www.genesys-pgr.org/documentation/basics) definitions:

```r
# Retrieve data by country of origin (MCPD)
get_accessions(mcpd_filter(ORIGCTY = c("DEU", "SVN")))
```

# Processing fetched data

Genesys provides the data as CSV. Where a field can hold multiple values, the values are
spread across numbered columns. For example, accession `STORAGE` may be provided as:

|...|storage1|storage2|storage3|
|--|--|--|--|
|...|10|20|30|
|...|30|40|*NA*|
|...|30|*NA*|*NA*|
|...|10|20|30|
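
The multi-column layout can be handled with base R. The following is a minimal sketch that rebuilds the table above as a toy data frame (the data is typed in by hand, not fetched from Genesys) and shows two common operations: testing whether any `storage` column holds a given value, and collapsing the columns into a single string per accession. The names `accessions`, `storage_cols`, `has_20` and the combined `storage` column are illustrative only.

```r
# Toy data frame mirroring the table above (not fetched from Genesys)
accessions <- data.frame(
  storage1 = c(10, 30, 30, 10),
  storage2 = c(20, 40, NA, 20),
  storage3 = c(30, NA, NA, 30)
)

# Find the numbered columns that hold STORAGE values
storage_cols <- grep("^storage[0-9]+$", names(accessions), value = TRUE)

# TRUE for rows where any storage column equals 20 (NA values are ignored)
has_20 <- apply(accessions[storage_cols] == 20, 1, any, na.rm = TRUE)

# Collapse the non-missing storage values into one string per row
accessions$storage <- apply(
  accessions[storage_cols], 1,
  function(x) paste(x[!is.na(x)], collapse = ";")
)
```

Matching the column names with a pattern rather than listing them avoids hard-coding how many numbered columns a particular download contains.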

# Filters

The `filter` object is a named `list()` where names match a Genesys filter and the value
specifies the criteria to match.

The records returned by Genesys match all filters provided (*AND* operation), while individual filters
allow for specifying multiple criteria (*OR* operation):

```r
# (GENUS == Musa) AND (SPECIES == aa) AND ((ORIGCTY == NGA) OR (ORIGCTY == CIV))
filters <- list(taxonomy = list(genus = c('Musa'), species = c('aa')), countryOfOrigin = list(iso3 = c('NGA', 'CIV')))

# Alternatively, build the same filter step by step
filters <- list()
filters$taxonomy$genus <- c('Musa')
filters$taxonomy$species <- c('aa')
filters$countryOfOrigin$iso3 <- c('NGA', 'CIV')

# See filter object as JSON
jsonlite::toJSON(filters)
```
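
To make the *AND*/*OR* semantics concrete, here is a small local sketch that applies the same logic to a toy data frame. It does not call Genesys; the `records` data and its column names are hypothetical and only illustrate which rows a filter on genus and country of origin would keep.

```r
# Toy records (hypothetical, not fetched from Genesys)
records <- data.frame(
  genus   = c("Musa", "Musa", "Hordeum", "Musa"),
  origcty = c("NGA", "DEU", "NGA", "CIV"),
  stringsAsFactors = FALSE
)

# OR within one filter: a value matches if it is any of the listed criteria
# AND across filters: a record must satisfy every filter
matches <- records$genus %in% c("Musa") &
  records$origcty %in% c("NGA", "CIV")

# Keep only the matching records (rows 1 and 4 in this toy data)
selected <- records[matches, ]
```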

Genesys offers many filtering options. The best way to explore them is to use the filters
on the website at https://www.genesys-pgr.org/a/overview, inspect the HTTP requests your browser
sends to the API server, and then replicate those requests here.

### Taxonomy

`taxonomy$genus` filters by a *list* of genera.

```r
filters <- list(taxonomy = list(genus = c('Hordeum', 'Musa')))
# Print
jsonlite::toJSON(filters)
```

`taxonomy$species` filters by a *list* of species.

```r
filters <- list(taxonomy = list(genus = c('Hordeum'), species = c('vulgare')))
# Print
jsonlite::toJSON(filters)
```

### Origin of material

`countryOfOrigin$iso3` filters by a *list* of ISO 3166-1 alpha-3 (ISO3) codes of the country of origin of the PGR material.

```r
# Material originating from Germany (DEU) or France (FRA)
filters <- list(countryOfOrigin = list(iso3 = c('DEU', 'FRA')))
```

`geo$latitude` and `geo$longitude` filter by the latitude and longitude (in decimal degrees) of the
collecting site.

```r
# Latitude between -10 and 30, longitude between 30 and 50 (decimal degrees)
filters <- list(geo = list(latitude = genesysr::range(-10, 30), longitude = genesysr::range(30, 50)))
```


### Holding institute

`institute$code` filters by a *list* of FAO WIEWS institute codes of the holding institutes.

```r
# Filter for ITC (BEL084) and CIAT (COL003)
list(institute = list(code = c('BEL084', 'COL003')))
```

`institute$country$iso3` filters by a *list* of ISO3 codes of the countries where the holding institutes are located.

```r
# Filter for genebanks in Slovenia (SVN) and Belgium (BEL)
list(institute = list(country = list(iso3 = c('SVN', 'BEL'))))
```

# Selecting columns

The Genesys API returns many variables for accession passport data.
To reduce the amount of data to be processed and kept in memory, select the columns of interest with the `fields` vector:

```r
# Fetch only accession id, storage and taxonomic data for *Musa*
musa <- genesysr::get_accessions(list(taxonomy = list(genus = c('Musa'))), fields = c("taxonomy", "storage", "id"))
```

To list the variable names returned by the Genesys API, inspect a sample response and pick the columns of interest:

```r
# fetch_accessions uses the JSON format
accn <- fetch_accessions(filters = list(), at.least = 100)

# Print names used in JSON response from Genesys
sort(unique(names(unlist(accn$content))))
```
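
The `sort(unique(names(unlist(...))))` idiom works because `unlist()` flattens a nested list and joins nested names with a dot. The sketch below demonstrates this on a hand-written nested list; the `content` object and its field names are hypothetical stand-ins for a JSON response, not actual Genesys output.

```r
# Hypothetical nested structure resembling a parsed JSON response
content <- list(
  list(id = 1, taxonomy = list(genus = "Musa", species = "acuminata")),
  list(id = 2, taxonomy = list(genus = "Musa"), institute = list(code = "BEL084"))
)

# unlist() flattens the nesting and builds dotted names such as "taxonomy.genus";
# unique() removes names repeated across records
field_names <- sort(unique(names(unlist(content))))
field_names
```

Records do not need to share the same fields: a name appears in the result if at least one record carries it.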


# Step-by-step example

Let's walk through the whole process of fetching accession passport data from Genesys.

1. Load genesysr

```r
library(genesysr)
```

2. Set up using user credentials

```r
setup_sandbox()
user_login()
```

3. Fetch data

```r
musa <- genesysr::get_accessions(list(taxonomy = list(genus = c('Musa'))), at.least = 1000)
```

4. Identify columns of interest

```r
names(musa)
```
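
Once the column names are known, the data can be reduced to the columns of interest with ordinary data frame subsetting. The following sketch uses a small stand-in data frame, `musa_toy`, with hypothetical column names; the names actually returned by `names(musa)` may differ and should be substituted.

```r
# Stand-in for the fetched data; column names here are assumptions
musa_toy <- data.frame(
  id               = c(101, 102),
  accessionNumber  = c("ITC0001", "ITC0002"),
  taxonomy.genus   = c("Musa", "Musa"),
  taxonomy.species = c("acuminata", "balbisiana"),
  storage1         = c(30, 30),
  stringsAsFactors = FALSE
)

# Pick one column by name plus every column whose name starts with "taxonomy."
cols <- c("accessionNumber", grep("^taxonomy\\.", names(musa_toy), value = TRUE))

# drop = FALSE keeps the result a data frame even if only one column is selected
musa_subset <- musa_toy[, cols, drop = FALSE]
```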