Kernel spatial smoothing : transforming points pattern to continuous coverage

Representing mass data (inhabitants, livestock, …) on a map is not always easy : choropleth maps are clearly a no-go, except if you normalize by area, and then you stumble on the MAUP… A nice solution is smoothing, which produces a raster. You even get freebies like (potential) statistical confidentiality, a better geographic synthesis and easy computations between multiple layers.

Kernel smoothing should not be confused with interpolation or kriging : the aim here is to « spread » and sum point values ; see Loonis and de Bellefon (2018) for a comprehensive explanation.
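
In a generic form (not necessarily the exact kernel implemented by {btb}), the smoothed value at a grid point x is a kernel-weighted sum of the neighbouring observations :

$$\hat{m}(x) = \sum_i w_i \, K\!\left(\frac{\lVert x - x_i \rVert}{h}\right)$$

where w_i is the weight of the observation located at x_i (here its population), K a kernel function and h the bandwidth.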

We could use the {RSAGA} package, which provides a « Kernel Density Estimation » in the « grid_gridding » lib for use with rsaga.geoprocessor(). However, we’ll use the {btb} package (Santos et al. 2018), which has the great advantage of providing a way to specify a geographical study zone, preventing our values from bleeding into a neighbouring country or into the sea, for example.

We’ll map the French population :

  • the data is available on the IGN site
  • a 7z decompress utility must be available in your $PATH ;
  • the shapefile COMMUNE.shp has a POPULATION field ;
  • this is a polygon coverage, so we’ll take the « centroids ».
library(raster)  # load before dplyr (against select conflict)
library(tidyverse)
library(httr)
library(sf)
library(btb)

# download and extract data
zipped_file <- tempfile()
GET("ftp://Admin_Express_ext:Dahnoh0eigheeFok@ftp3.ign.fr/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14.7z.001", 
    write_disk(zipped_file),
    progress())

system(paste("7z x -odata", zipped_file))

# create a unique polygon for France (our study zone)
fr <- read_sf("data/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14/ADMIN-EXPRESS/1_DONNEES_LIVRAISON_2019-03-14/ADE_2-0_SHP_LAMB93_FR/REGION.shp") %>% 
  st_union() %>%
  st_sf() %>% 
  st_set_crs(2154)

# load communes ; convert to points
comm <- read_sf("data/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14/ADMIN-EXPRESS/1_DONNEES_LIVRAISON_2019-03-14/ADE_2-0_SHP_LAMB93_FR/COMMUNE.shp") %>% 
  st_set_crs(2154) %>% 
  st_point_on_surface()
 

We create a function :

#' Kernel weighted smoothing with arbitrary bounding area
#'
#' @param df sf object (points)
#' @param field weight field in the df
#' @param bandwidth kernel bandwidth (map units)
#' @param resolution output grid resolution (map units)
#' @param zone sf study zone (polygon)
#' @param out_crs EPSG (should be an equal-area projection)
#'
#' @return a raster object
#' @import btb raster fasterize dplyr plyr sf
lissage <- function(df, field, bandwidth, resolution, zone, out_crs = 3035) {
  
  if (st_crs(zone)$epsg != out_crs) {
    message("reprojecting data...")
    zone <- st_transform(zone, out_crs)
  }
  
  if (st_crs(df)$epsg != out_crs) {
    message("reprojecting study zone...")
    df <- st_transform(df, out_crs)
  }
  
  zone_bbox <- st_bbox(zone)
  
  # grid generation
  message("generating reference grid...")
  zone_xy <- zone %>% 
    dplyr::select(geometry) %>% 
    st_make_grid(cellsize = resolution,
                 offset = c(plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor),
                            plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor)),
                 what = "centers") %>%
    st_sf() %>% 
    st_join(zone, join = st_intersects, left = FALSE) %>% 
    st_coordinates() %>% 
    as_tibble() %>% 
    dplyr::select(x = X, y = Y)
  
  # kernel
  message("computing kernel...")
  kernel <- df %>% 
    cbind(., st_coordinates(.)) %>%
    st_set_geometry(NULL) %>% 
    dplyr::select(x = X, y = Y, field) %>% 
    btb::kernelSmoothing(dfObservations = .,
                         sEPSG = out_crs,
                         iCellSize = resolution,
                         iBandwidth = bandwidth,
                         vQuantiles = NULL,
                         dfCentroids = zone_xy)
  
  # rasterization
  message("\nrasterizing...")
  raster::raster(xmn = plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor),
                 ymn = plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor),
                 xmx = plyr::round_any(zone_bbox[3] + bandwidth, resolution, f = ceiling),
                 ymx = plyr::round_any(zone_bbox[4] + bandwidth, resolution, f = ceiling),
                 resolution = resolution) %>% 
    fasterize::fasterize(kernel, ., field = field)
}

Instead of a raw choropleth map like this (don’t do this at home) :

Inhabitants, quantile classification ; see the red arrow : a big commune with a rather low population (2100 inhabitants) stands out only because of its large area

… we should rather use proportional symbols. But it gets quite cluttered in urban areas :

You’ve got measles

We can also get the polygon density :

Density, quantile classification. Our previous commune now looks more coherent ; however the map doesn’t summarize the pattern well, due to the heterogeneous sizes of the communes

We just have to run our function, for example with a bandwidth of 20 km and a cell size of 2 km. We use an equal-area projection : LAEA Europe (EPSG:3035).

comm %>% 
  lissage("POPULATION", 20000, 2000, fr, 3035) %>%
  raster::writeRaster("pop.tif")

And lastly our smoothing :

Smoothing, discrete quantile classification

That’s a nice synthetic representation !
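
For a quick look at the result in R (just a rough base plot of the raster we wrote, not the styled map shown here) :

plot(raster("pop.tif"))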

After that it’s easy in R to do raster algebra ; for example dividing a grid of crop yields by a grid of agricultural area, computing a percent change between two dates, etc.
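
A minimal sketch, assuming two smoothed grids sharing the same extent, resolution and CRS (the file names are hypothetical) :

yields <- raster("crop_yields.tif")            # smoothed grid of crop yields
agri_area <- raster("agricultural_area.tif")   # smoothed grid of agricultural area
ratio <- yields / agri_area                    # cell-by-cell division
writeRaster(ratio, "yield_by_area.tif")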

As a conclusion, a quote :

Behind the aesthetic quality of the smoothed maps, however, lies a major trap. By construction, smoothing methods mitigate breakdowns and borders and induce continuous representation of geographical phenomena. The smoothed maps therefore show the spatial autocorrelation locally. Two points close to the smoothing radius have mechanically comparable characteristics in this type of analysis. As a result, there is little point in drawing conclusions from a smoothed map of geographical phenomena whose spatial scale is of the order of the smoothing radius.

Loonis and de Bellefon (2018)

References

Loonis, Vincent, and Marie-Pierre de Bellefon, eds. 2018. Handbook of Spatial Analysis: Theory and Application with R. INSEE Méthodes 131. Montrouge, France: Institut national de la statistique et des études économiques. https://www.insee.fr/en/information/3635545.

Santos, Arlindo Dos, Francois Semecurbe, Auriane Renaud, Cynthia Faivre, Thierry Cornely and Farida Marouchi. 2018. btb: Beyond the Border – Kernel Density Estimation for Urban Geography (version 0.1.30). https://CRAN.R-project.org/package=btb.


Metadata : from PostgreSQL comments to R labels

Metadata are an essential part of a robust data science workflow ; they record the description of the tables and the meaning of each variable : its units, quality, allowed range, how it was collected, when it was recorded, etc. Data without metadata are practically worthless. Here we will show how to transfer metadata from PostgreSQL to R.

In PostgreSQL metadata can be stored in comments with the statements COMMENT ON TABLE ... IS '...' or COMMENT ON COLUMN ... IS '...'. So I hope your tables and columns have these nice comments and you can see them in psql or PgAdmin for example. But what about R ?

In R, metadata can be assigned as attributes of any object and mainly as « labels » for the columns. You may have seen labels when importing labelled data from SPSS for example.

We will use the {Hmisc} package which provides functions to manage labels. Another interesting package is {sjlabelled}.
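
For instance, a label is simply an attribute that {Hmisc} helps to manage (a toy example, unrelated to our data) :

x <- c(12.5, 13.1)
Hmisc::label(x) <- "Water temperature (°C)"
Hmisc::label(x)
#> [1] "Water temperature (°C)"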

library(Hmisc)
library(tidyverse)
library(RPostgreSQL)
library(glue)
library(httr)
library(rvest)
library(janitor)

cnx <- dbConnect(dbDriver("PostgreSQL"),
                 user = "***",
                 password = "***",
                 host = "***",
                 dbname = "***",
                 port = 5432
)

We’ll start by adding a table and its metadata in PostgreSQL. I chose to use the list of all the French communes from the code officiel géographique (COG). The data is provided as a zipped CSV file ; luckily a data dictionary appears on the same page, so we’ll scrape it.

# data from https://www.insee.fr/fr/information/3720946

# download
dl <- tempfile()
GET("https://www.insee.fr/fr/statistiques/fichier/3720946/commune2019-csv.zip",
    write_disk(dl))
unzip(dl)

# import in PostgreSQL
read_csv("commune2019.csv",
         col_types = cols(.default = col_character())) %>% 
  clean_names() %>% 
  dbWriteTable(cnx, c("ref_cog", "commune2019"), ., row.names = FALSE)
dbSendQuery(cnx, "ALTER TABLE ref_cog.commune2019 ADD PRIMARY KEY (com, typecom);")

# get the data dictionary from INSEE
page <- GET("https://www.insee.fr/fr/information/3720946") %>% 
  content() 

table_info <- page %>% 
  html_node("#titre-bloc-3 + div > p") %>% 
  html_text() %>% 
  str_trim() %>% 
  str_replace_all("\\s+", " ")

columns_info <- page %>% 
  html_node("table") %>% 
  html_table() %>% 
  clean_names() %>% 
  mutate(designation_et_modalites_de_la_variable = str_trim(designation_et_modalites_de_la_variable),
         designation_et_modalites_de_la_variable = str_replace_all(designation_et_modalites_de_la_variable, "\\s+", " "),
         nom_de_la_variable = make_clean_names(nom_de_la_variable))

# add table metadata in PostgreSQL
dbSendQuery(cnx, glue_sql("COMMENT ON TABLE ref_cog.commune2019 IS {table_info}", .con = cnx))

# add columns metadata in PostgreSQL
walk2(columns_info$nom_de_la_variable,
      columns_info$designation_et_modalites_de_la_variable,
      ~ dbSendQuery(cnx, glue_sql("COMMENT ON COLUMN ref_cog.commune2019.{`.x`} IS {.y}", .con = cnx)))

Now we have this nice table :

=> \dt+ ref_cog.*
                                                      List of relations
 Schema  |    Name     | Type  |  Owner  |  Size   |                Description                                                                                            
---------+-------------+-------+---------+---------+-------------------------------------------------------------------
 ref_cog | commune2019 | table | xxxxxxx | 3936 kB | Liste des communes, arrondissements municipaux, communes déléguées et
                                                     communes associées au 1er janvier 2019, avec le code des niveaux
                                                     supérieurs (canton ou pseudo-canton, département, région)
(1 row)


=> \d+ ref_cog.commune2019
                                                      Table "ref_cog.commune2019"
  Column   | Type | Modifiers | Storage  | Stats target |                         Description                                                           
-----------+------+-----------+----------+--------------+----------------------------------------------------------------
 typecom   | text |           | extended |              | Type de commune
 com       | text |           | extended |              | Code commune
 reg       | text |           | extended |              | Code région
 dep       | text |           | extended |              | Code département
 arr       | text |           | extended |              | Code arrondissement
 tncc      | text |           | extended |              | Type de nom en clair
 ncc       | text |           | extended |              | Nom en clair (majuscules)
 nccenr    | text |           | extended |              | Nom en clair (typographie riche)
 libelle   | text |           | extended |              | Nom en clair (typographie riche) avec article
 can       | text |           | extended |              | Code canton. Pour les communes « multi-cantonales » code décliné 
                                                            de 99 à 90 (pseudo-canton) ou de 89 à 80 (communes nouvelles)
 comparent | text |           | extended |              | Code de la commune parente pour les arrondissements municipaux et
                                                            les communes associées ou déléguées.
Indexes:
    "commune2019_pkey" PRIMARY KEY, btree (com, typecom)

Usually we query the data into R this way :

cog <- dbGetQuery(cnx,
"SELECT
  *
FROM ref_cog.commune2019
LIMIT 10")

We can create a function that queries the table metadata in information_schema.columns and adds it to the data frame ; the function expects a data frame, the "schema.table" name from which to read the comments and a connection handle. It returns the data frame with column labels and a metadata attribute holding the table description.

#' Add attributes to a dataframe from metadata read in the PostgreSQL database
#'
#' @param df dataframe
#' @param schema_table "schema.table" from which to read the comments
#' @param cnx a database connection from RPostgreSQL::dbConnect()
#'
#' @return a dataframe with attributes
#'
#' @examples \dontrun{add_metadata(iris, "public.iris", cnx)}
add_metadata <- function(df, schema_table, cnx) {
  
  # get the table description and add it to a data frame attribute called "metadata"
  attr(df, "metadata") <- dbGetQuery(cnx, 
                                     glue_sql("SELECT obj_description({schema_table}::regclass) AS table_description;",
                                              .con = cnx)) %>% 
    pull(table_description)
  
  # get columns comments
  meta <- str_match(schema_table, "^(.*)\\.(.*)$") %>% 
    glue_sql(
      "SELECT 
        column_name,    
        pg_catalog.col_description(
          format('%s.%s', isc.table_schema, isc.table_name)::regclass::oid,
                 isc.ordinal_position) AS column_description
      FROM information_schema.columns AS isc
      WHERE isc.table_schema = {s[2]}
        AND isc.table_name = {s[3]};",
      s = .,
      .con = cnx) %>% 
    dbGetQuery(cnx, .)
  
  # match the columns comments to the variables
  label(df, self = FALSE) <- colnames(df) %>% 
    enframe() %>% 
    left_join(meta, by = c("value" = "column_name")) %>% 
    pull(column_description)
  
  df
}

Now we would do :

cog <- dbGetQuery(cnx,
  "SELECT
    *
  FROM ref_cog.commune2019
  LIMIT 10") %>% 
  add_metadata("ref_cog.commune2019", cnx)

The table description is available with :

attr(cog, "metadata")

[1] "Liste des communes, arrondissements municipaux, communes déléguées et communes associées au 1er janvier 2019, avec le code des niveaux supérieurs (canton ou pseudo-canton, département, région)"

And you can see the metadata in the column headings of the RStudio viewer with View(cog) :

… the headings now show the metadata !

We can also use contents(cog) :

Data frame:cog  10 observations and 11 variables    Maximum # NAs:10

          Labels                                                              Class     Storage NAs
typecom   Type de commune                                                     character character   0
com       Code commune                                                        character character   0
reg       Code région                                                         character character   0
dep       Code département                                                    character character   0
arr       Code arrondissement                                                 character character   0
tncc      Type de nom en clair                                                character character   0
ncc       Nom en clair (majuscules)                                           character character   0
nccenr    Nom en clair (typographie riche)                                    character character   0
libelle   Nom en clair (typographie riche) avec article                       character character   0
can       Code canton. Pour les communes « multi-cantonales » code décliné... character character   0
comparent Code de la commune parente pour les arrondissements municipaux...   character character  10

Or :

cog %>% 
  label() %>%
  enframe()

# A tibble: 11 x 2
   name      value                                                                                           
   <chr>     <chr>                                                                                           
 1 typecom   Type de commune                                                                                 
 2 com       Code commune                                                                                    
 3 reg       Code région                                                                                     
 4 dep       Code département                                                                                
 5 arr       Code arrondissement                                                                             
 6 tncc      Type de nom en clair                                                                            
 7 ncc       Nom en clair (majuscules)                                                                       
 8 nccenr    Nom en clair (typographie riche)                                                                
 9 libelle   Nom en clair (typographie riche) avec article                                                   
10 can       Code canton. Pour les communes « multi-cantonales » code décliné de 99 à 90 (pseudo-canton) ou …
11 comparent Code de la commune parente pour les arrondissements municipaux et les communes associées ou dél…

Or lastly, for one column :

label(cog$tncc)

[1] "Type de nom en clair"

We can also search for information in the variable names or in the labels with another function that can be helpful when we have a few hundred columns…

search_var <- function(df, keyword) {
  df %>% 
    label() %>%
    enframe() %>% 
    rename(variable = name,
           metadata = value) %>% 
    filter_all(any_vars(str_detect(., regex(keyword, ignore_case = TRUE))))
}

search_var(cog, "canton")

# A tibble: 1 x 2
  variable metadata                                                                                          
  <chr>    <chr>                                                                                             
1 can      Code canton. Pour les communes « multi-cantonales » code décliné de 99 à 90 (pseudo-canton) ou de…

Open and merge multiple shapefiles

or more precisely, union many spatial tables in R in a tidy way.

This post is a mess. See the updated version.

  • dplyr::bind_rows doesn’t work on {sf} objects (it seems that as of 2020-12-27, it works) ;
  • base::rbind needs the objects to have the same structure and only works on two tables at a time, so it’s not straightforward to use*.

→ So we’ll use purrr::map and tidyr::unnest.

NB : now in 2021, I advise simply using :

 fs::dir_ls("my_folder", regexp = ".*\\.shp$") %>% 
    purrr::map_dfr(sf::read_sf) 

or now in 2023, because map_dfr() is being superseded :

fs::dir_ls("my_folder", regexp = ".*\\.shp$") %>% 
  purrr::map(sf::read_sf) %>% 
  dplyr::bind_rows()

or :

fs::dir_ls("my_folder", regexp = ".*\\.shp$") %>% 
  purrr::map(sf::read_sf) %>%
  purrr::list_rbind() %>%
  sf::st_sf()

As of October 2019 this method doesn’t work any longer, due to an update in the vctrs package.
https://github.com/r-spatial/sf/issues/1172.
So your best bet is to have the same structure in your shapefiles and use :

dir_ls("shp", glob = "*.shp") %>%
map(read_sf) %>%
do.call(rbind, .)

It works again with vctrs 0.2.2 (February 2020)


First get some data, the communes of three French départements :

library(tidyverse)
library(sf)
library(fs)
library(httr)
library(leaflet)

# https://fr.actualitix.com/blog/shapefiles-des-departements-de-france.html

url <-  c("https://fr.actualitix.com/blog/actgeoshap/01-Ain.zip",
          "https://fr.actualitix.com/blog/actgeoshap/73-savoie.zip",
          "https://fr.actualitix.com/blog/actgeoshap/74-haute-savoie.zip")

dep <- str_extract(url, "\\d{2}.*$")

list(url, dep) %>% 
  pwalk(~ GET(.x, write_disk(.y)))

walk(dep, unzip, junkpaths = TRUE, exdir = "shp")

We can then create a 3-row data frame containing a list-column in which we store the sf objects. Then we just unnest it. This operation drops the sf class, so we have to add it back.

res <- dir_ls("shp", glob = "*.shp") %>% 
  tibble(fname = .) %>%
  mutate(data = map(fname, read_sf)) %>%
  unnest(data) %>%
  st_sf() %>%
  st_set_crs("EPSG:2154")

write_sf(res, "shp/3dep.shp")

res %>% 
  st_transform(4326) %>% 
  leaflet() %>%
    addPolygons() %>% 
    addTiles()

Bonus : we have the source filename stored in the resulting shapefile.

* We could have used

dir_ls("shp", glob = "*.shp") %>% 
  map(read_sf) %>%
  do.call(rbind, .)

but the column structure doesn’t match here…
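
If we really wanted to use rbind, we would first have to harmonize the structures, for example by keeping only the geometry (a sketch ; any set of columns common to all the layers would do) :

dir_ls("shp", glob = "*.shp") %>% 
  map(read_sf) %>%
  map(~ select(.x)) %>%   # select() on an sf object always keeps the geometry column
  do.call(rbind, .)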


Post to Mastodon from R

Mastodon is a decentralized microblogging platform.

We can analyse some data and directly post our findings to a Mastodon instance.

For example we can plot the different TLDs used by the Mastodon Fediverse and publish the plot on mastodon.cloud.

library(tidyverse)
library(httr)
library(jsonlite)
library(scales)
library(ggthemes)
library(ggrepel)

# data source
url <- "https://instances.mastodon.xyz/"

mastodon <- GET(paste0(url, "instances.json")) %>% 
  content(as = "text") %>% 
  fromJSON() %>% 
  select(-info) %>% 
  filter(!str_detect(name, "((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)")) %>% # remove IP
  mutate(statuses = as.integer(statuses),
         name_clean = str_remove(str_to_lower(name), "/$|\\.$|:\\d*$"), # remove trailing dots, slashes and port number
         tld = str_replace_all(name_clean, "(.*?)(\\.[a-z0-9]*)$", "\\2")) %>% # extract TLD
  filter(! tld %in% c("0")) # remove special cases

download_time <- format(Sys.time(), tz = "UTC", usetz = TRUE)

# saveRDS(mastodon, paste0("data/mastodon_", download_time, ".rds"))

nbi <- nrow(mastodon)
nbu <- sum(mastodon$users, na.rm = TRUE)

# clean and plot the TLDs (top level domains) of Mastodon instances ---------------

mastodon %>% 
  filter(! str_detect(tld, "xn--")) %>% 
  group_by(tld) %>% 
  summarise(nb = n(),
            users = sum(users, na.rm = TRUE),
            statuses = sum(statuses, na.rm = TRUE)) %>% 
  filter(nb > 0 & users > 0 & statuses > 0) %>% {
  ggplot(., aes(users, statuses, label = tld, size = nb, color = nb)) +
    geom_text_repel(segment.size = 0, force = 0.5) +
    #geom_point(alpha = 0.4) +
    labs(title = "Mastodon Top Level Domains",
         subtitle = paste(nrow(.), "TLDs for", nbi, "instances having", nbu, "users -", download_time),
         x = "users", 
         y = "statuses",
         caption = paste("r.iresmi.net\ndata from", url, "\nIDN removed ; positions adapted for clarity")) +
    scale_x_log10(labels = comma) +
    scale_y_log10(labels = comma) +
    scale_color_viridis_c(trans = "log", name = "# of\ninstances", labels = function(x) round(x, 0)) +
    scale_size(range = c(3, 10), guide = FALSE) +
    theme(plot.caption = element_text(size = 7)) }

(plot_file <- paste0("img/mastodon_tld_", download_time, ".png")) %>% 
  ggsave(width = 15, height = 10, units = "cm", dpi = 100, scale = 2)

Before posting, we have to register our application once to get a client id and a client secret.

# Registration
# run this part once and write down client_id and client_secret,
# you can then comment this part
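# NB: the "instance" URL must already be defined (see the Login section below)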

r <- POST(paste0(instance , "api/v1/apps"),
     body = list(client_name = "my_application_name",
                 redirect_uris = "urn:ietf:wg:oauth:2.0:oob",
                 scopes = "write"))
stop_for_status(r)
apps <- content(r)

paste("client_id :", apps[["client_id"]])
paste("client_secret", apps[["client_secret"]])

# end of registration ; set your client id/secret below

Login

# Your instance, login and password

instance <- "https://mastodon.cloud/"
user <- "*******"
pass <- "*******"

client_id <- "********************"
client_secret <- "******************"

# Login -------------------------------------------------------------------

r <- POST(paste0(instance , "oauth/token"),
          body = list(client_id = client_id,
                      client_secret = client_secret,
                      grant_type = "password",
                      username = user,
                      password = pass,
                      scope = "write"))
stop_for_status(r)
token <- content(r)

We can then post the created image and its accompanying status.

# post image
r <- POST(paste0(instance , "api/v1/media"),
          add_headers(Authorization = paste("Bearer", token[["access_token"]])),
          body = list(file = upload_file(plot_file)))
stop_for_status(r)
media <- content(r)

# post status
r <- POST(paste0(instance, "api/v1/statuses"),
          add_headers(Authorization = paste("Bearer", token[["access_token"]])),
          body = list(status = paste0("#Mastodon Top Level Domains\n#Rstats\n", media[["text_url"]]),
                      "media_ids[]" = media[["id"]]))
stop_for_status(r)
statuses <- content(r)
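
If everything went well, the returned status entity should contain, among other fields, the public URL of the new post :

statuses[["url"]]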

Generate multiple language version plots

The use case is to create the same plot in different languages. I used this technique for Wikipedia plots.

We are going to build a list containing all the translations ; we will then loop over each language, generating and saving the plot.

# Mauna Loa atmospheric CO2 change
# multi language plot for Wikipedia

# Required packages
library(tidyverse)
library(gridExtra)
library(scales)
library(lubridate)
library(Hmisc)

# Translations ------------------------------------------------------------

language <- list(
  en_US = list(
    locale_lc_time = "en_US.UTF-8",
    title = expression(paste("Monthly mean ", CO[2], " concentration ")),
    caption = paste("Data : R. F. Keeling, S. J. Walker, S. C. Piper and A. F. Bollenbacher\nScripps CO2 Program (http://scrippsco2.ucsd.edu). Accessed ", Sys.Date()),
    x = "Year",
    y = expression(paste(CO[2], " fraction in dry air (", mu, "mol/mol)")),
    x2 = "Month",
    y2 = expression(atop(paste(CO[2], " fraction in dry air (", mu, "mol/mol)"), "Departure from yearly average")),
    title2 = "Seasonal variation"
  ),
  fr_FR = list(
    locale_lc_time = "fr_FR.UTF-8",
    title = expression(paste("Moyenne mensuelle de la concentration de ", CO[2])),
    caption = paste("données : R. F. Keeling, S. J. Walker, S. C. Piper et A. F. Bollenbacher\nScripps CO2 Program (http://scrippsco2.ucsd.edu). Accédé le", Sys.Date()),
    x = "année",
    y = expression(paste("fraction de ", CO[2], " dans l'air sec (", mu, "mol/mol)")),
    x2 = "mois",
    y2 = expression(atop(paste("fraction de ", CO[2], " dans l'air sec (", mu, "mol/mol)"), "en écart à la moyenne annuelle")),
    title2 = "Variation saisonnière"
  ),
  de_DE = list(
    locale_lc_time = "de_DE.UTF-8",
    title = expression(paste("Monatliche durchschnittliche ", CO[2], "-Konzentration")),
    caption = paste("Datei : R. F. Keeling, S. J. Walker, S. C. Piper und A. F. Bollenbacher\nScripps CO2 Program (http://scrippsco2.ucsd.edu). Zugänglich am", Sys.Date()),
    x = "Jahr",
    y = expression(paste(CO[2], "-Anteil in trockener Luft (", mu, "mol/mol)")),
    x2 = "Monate",
    y2 = expression(atop(paste(CO[2], "-Anteil in trockener Luft (", mu, "mol/mol)"), "Abweichung vom Jahresmittel")),
    title2 = "Monatliche Variation"
  ),
  es_ES = list(
    locale_lc_time = "es_ES.UTF-8",
    title = expression(paste("Media mensual de la concentración de ", CO[2])),
    caption = paste("dato : R. F. Keeling, S. J. Walker, S. C. Piper y A. F. Bollenbacher\nScripps CO2 Program (http://scrippsco2.ucsd.edu). Visitada", Sys.Date()),
    x = "Año",
    y = expression(paste("Fraccion de ", CO[2],  " en aire secco (", mu, "mol/mol)")),
    x2 = "Mes",
    y2 = expression(atop(paste("Fraccion de ", CO[2],  " en aire secco (", mu, "mol/mol)"), "Desviación de la media anual")),
    title2 = "Variación mensual"
  ),
  cs_CZ = list(
    locale_lc_time = "cs_CZ.UTF-8",
    title = expression(paste("Průměrné měsíční koncentrace oxidu uhličitého")),
    caption = paste("data : R. F. Keeling, S. J. Walker, S. C. Piper a A. F. Bollenbacher\nScripps CO2 Program (http://scrippsco2.ucsd.edu). Přístupné", Sys.Date()),
    x = "rok",
    y = expression(paste("koncentrace ", CO[2], " v suchém vzduchu (", mu, "mol/mol)")),
    x2 = "měsíc",
    y2 = expression(atop(paste("koncentrace ", CO[2], " v suchém vzduchu (", mu, "mol/mol)"), "odchylka od ročního průměru")),
    title2 = "Měsíční změna (průměrná roční odchylka)"
  ),
  nn_NO = list(
    locale_lc_time = "nn_NO.UTF-8",
    title = expression(paste("Gjennomsnittlig månedlig ", CO[2], "-konsentrasjon")),
    caption = paste("data : R. F. Keeling, S. J. Walker, S. C. Piper og A. F. Bollenbacher\nScripps CO2 Program (http://scrippsco2.ucsd.edu). Vist", Sys.Date()),
    x = "År",
    y = expression(paste(CO[2],"-andel i tørr luft (", mu, "mol/mol)")),
    x2 = "Måned",
    y2 = expression(atop(paste(CO[2],"-andel i tørr luft (", mu, "mol/mol)"),
                         "Avvik fra årlig gjennomsnitt")),
    title2 = "Årlig variasjon"
  )
)


# Data --------------------------------------------------------------------

# http://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record
# used during US gov shutdown
co2ml <- read_csv("http://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv",
                  col_names = c("year", "month", "xls_date", "decimal",
                                "co2", "co2_seas_adj", "fit", "fit_seas_adj",
                                "co2_filled", "co2_filled_seas_adj"),
                  col_types = "iiiddddddd",
                  skip = 57,
                  na = "-99.99",
                  comment = "\"") %>% 
  group_by(year) %>% 
  mutate(year_mean = mean(co2_filled, na.rm = TRUE),
         delta = co2_filled - year_mean,
         vdate = ymd(paste0("2015-", month, "-01")))

# Generate the plot for each language -------------------------------------

for (l in names(language)) {
  message(l)
  current <- language[[l]]
  
  # format the date in local names
  Sys.setlocale("LC_TIME", current$locale_lc_time)
  
  # main plot
  p1 <- ggplot(co2ml, aes(decimal, co2_filled)) + 	
    geom_line(color = "pink") +
    geom_point(color = "red", size = 0.6) +
    stat_smooth(span = 0.1) +
    scale_x_continuous(breaks = pretty_breaks()) +
    scale_y_continuous(breaks = pretty_breaks(4), minor_breaks = pretty_breaks(8)) +
    labs(
      x = current$x,
      y = current$y,
      title = current$title,
      subtitle = paste("Mauna Loa", min(co2ml$year), "-", max(co2ml$year)),
      caption = current$caption) +
    theme_bw() +
    theme(plot.caption = element_text(size = 7))
  
  # inset plot
  p2 <- ggplot(co2ml, aes(vdate, delta)) +
    geom_hline(yintercept = 0) +
    stat_smooth(span = 0.4, se = FALSE) +
    stat_summary(fun.data = "mean_cl_boot", colour = "red", size = 0.3) + 
    scale_x_date(breaks = pretty_breaks(4), minor_breaks = pretty_breaks(12), labels = date_format("%b")) +
    labs(
      x = current$x2,
      y = current$y2,
      title = current$title2) +
    theme_bw()
  
  # merge the plots and export in SVG
  p1 + annotation_custom(grob = ggplotGrob(p2), xmin = 1957, xmax = 1991, ymin = 361, ymax = 412)
  ggsave(file = paste("co2_mauna_loa", l, Sys.Date(), "wp.svg", sep = "_"), width = 20, height = 20, units = "cm", device = svg)
  
}

Our plots, one per language, make a nice gallery.

The latest version of the code is on Gitlab.


Mapping multiple trends with confidence

A tutorial to compute trends by groups and plot/map the results

We will use tidyr::nest to create a list-column and apply a model (with purrr::map) to each row ; then we will extract each slope and its p-value with map and broom::tidy.

Setup

library(tidyverse)
library(httr)
library(sf)
library(readxl)
library(janitor)
library(fs)
library(broom)
library(scales)
library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)

fk <- function(x) format(x, big.mark = " ")

Data

Map data : département polygons from OSM.

if(!file_exists("departements-20140528-100m.shp")) {
  f <- tempfile()
  GET("http://osm13.openstreetmap.fr/~cquest/openfla/export/departements-20140528-100m-shp.zip",
      write_disk(f))
  unzip(f)
}

dep <- st_read("departements-20140528-100m.shp")

Population data by département, 1990-2018, from INSEE.

if(!file_exists("pop.xls")) {
  GET("https://www.insee.fr/fr/statistiques/fichier/2012713/TCRD_004.xls", 
      write_disk("pop.xls"))
}
  
pop <- read_xls("pop.xls", skip = 3) %>% 
  clean_names() %>% 
  head(-1) %>% 
  rename(insee_dep = x1,
         dep = x2,
         x2018 = x2018_p) %>% 
  select(-4) %>% 
  gather(annee, pop, 3:7) %>% 
  mutate(annee = as.integer(str_replace(annee, "x", "")))

Population trends for each département

pop_model <- function(df) {
  lm(pop ~ annee, data = df)
}

trends <- pop %>% 
  group_by(insee_dep, dep) %>% 
  nest() %>% 
  mutate(model = map(data, pop_model),
         glance = map(model, glance),
         coeff = map(model, tidy, conf.int = TRUE)) 
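
Before extracting the slopes, we can quickly check the quality of each fit, for example via the R² returned by broom::glance() (a quick sketch) :

trends %>% 
  unnest(glance) %>% 
  select(insee_dep, dep, r.squared, p.value) %>% 
  arrange(r.squared)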

Plot

trends %>% 
  unnest(coeff) %>% 
  filter(term == "annee",
         !insee_dep %in% c("F", "M")) %>% 
  ggplot(aes(fct_reorder(insee_dep, estimate), estimate,
             color = if_else(p.value <= .05,
                             if_else(estimate >= 0, "positive", "négative"),
                             "n.s."))) +
    geom_point() +
    geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .5) +
    scale_color_manual(name = "Tendance",
                       values = c("positive" = "red",  "n.s." = "lightgray", "négative" = "blue")) +
    scale_y_continuous(labels = fk) +
    labs(title = "Évolution des populations départementales françaises",
         subtitle = "1990-2018",
         x = "dép.",
         y = bquote(Delta[population] ~ (habitant %.% an^{-1})),
         caption = "r.iresmi.net\ndonnées INSEE") +
    guides(color = guide_legend(reverse = TRUE)) +
    theme(plot.caption = element_text(size = 7))
Only 9 départements have a clearly decreasing population

Map

pop_dep <- trends %>% 
  unnest(coeff) %>% 
  filter(term == "annee") %>% 
  right_join(dep, by = c("insee_dep" = "code_insee")) %>%
  left_join(filter(pop, annee == 2018, !insee_dep %in% c("F", "M")), by = "insee_dep") %>% 
  st_as_sf() %>% 
  st_transform(2154) 

moy_fr <- trends %>% 
  unnest(coeff) %>% 
  filter(term == "annee",  !insee_dep %in% c("F", "M")) %>% 
  summarise(mean(estimate, na.rm = TRUE)) %>% 
  pull()

world <- ne_countries(scale = "medium", returnclass = "sf") %>% 
  filter(continent == "Europe") %>% 
  st_transform(2154) 

pop_dep %>% 
  ggplot() +
    geom_sf(data = world, fill = "grey97", color = 0) +
    geom_sf(color = "lightgrey", fill = "floralwhite", size = .2) +
    stat_sf_coordinates(data = filter(pop_dep, p.value > .05),
                        aes(size = pop),
                        fill = "lightgrey", color = "lightgrey", shape = 21, alpha = 0.8) +
    stat_sf_coordinates(data = filter(pop_dep, p.value <= .05),
                        aes(size = pop, fill = estimate),
                        color = "lightgrey", shape = 21, alpha = 0.8) +
    coord_sf(xlim = c(100000, 1200000), ylim = c(6000000, 7200000)) +
    scale_fill_gradient2(name = bquote(atop(displaystyle(atop(Delta ~ population[1990-2018],
                                                               (habitant %.% an^{-1}))), 
                                             moy. == .(round(moy_fr)) ~ ", en gris : n.s.")),
                          labels = fk,
                          low = "darkblue", mid = "white", high = "darkred", midpoint = moy_fr) +
    scale_size_area(name = "habitants (2018)", labels = fk, max_size = 10) +
    labs(title = "Évolution des populations départementales françaises",
         subtitle = "Métropole, 1990-2018",
         x = "",
         y = "",
         caption = "r.iresmi.net\ndonnées INSEE 2018\nfond cartographique : contributeurs Openstreetmap 2014\nNaturalearth") +
    theme_bw() +
    theme(plot.caption = element_text(size = 7),
          legend.text.align = 1)
Population is growing faster in the Paris suburbs and in peripheral southern départements


Bibliography with knitr : cite your references and packages

A tutorial to use your Zotero references with {rmarkdown} : easily add the citations, automatically generate your bibliography, including the references of the packages used in your analysis.

It’s a good practice to cite the R packages you use in your analysis. However it can be cumbersome to maintain the list of your packages’ references in Zotero, since the packages used can change whenever you update your script. Here we use Zotero’s automatic export to generate our main bibliography, auto-generate the packages’ bibliography from R, and then merge the two.

Prerequisites

Install R, Rstudio, Zotero.

Install Better BibTex for Zotero.

To easily add citations in the rmarkdown text in RStudio, we can also add the {citr} package.

Build the Zotero library

Add references to Zotero.

Create a collection for your project and drag and drop the references needed in this collection.


Export the collection (BibLatex format). Tick the update option.


Save it in your R project folder as zotero.bib. Each time you modify this Zotero collection, the zotero.bib file will be updated ; you can check this behaviour in the Zotero preferences.


You can also add a personalized CSL file (used to format references) in your R project folder (see the repository). Below I chose ISO-690 (author-date, no abstract, French).

In your rmarkdown text

In the setup chunk, we load the required packages, generate the .bib file for the packages, merge it with our Zotero bibliography and add the packages as nocite (they are not cited in the text but have to appear in the references).

We must run the setup chunk at least once to generate the files and check which citation to use for the packages that provide multiple references.

We can then carry on our analysis with R chunks and some text with citations.

The citations can be added with [@key] where key is the citation key generated by Better Bibtex and found at the top of the Zotero info panel. Or we can use {citr} which will help us find it (Addins menu in Rstudio > Insert citations).

The bibliography will be added at the end of the document.

---
title: "Bibliography with knitr : cite your references and packages"
author: "r.iresmi.net"
date: "`r Sys.Date()`"
output: html_document
bibliography: biblio.bib
csl: iso690-author-date-fr-no-abstract.csl
link-citations: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# all your necessary packages
packages <- c("tidyverse", 
             "knitr",     
             "bibtex"
             # add your other packages here     
             )  

# install if needed and load packages
to_install <- packages[! packages %in% installed.packages()[, "Package"]]
if (length(to_install)) { 
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
invisible(lapply(packages, library, character.only = TRUE))

# get the packages version 
packages_versions <- function(p) {
  paste(packageDescription(p)$Package, packageDescription(p)$Version, sep = " ")
}

# Get the packages references
write.bib(packages, "packages.bib")

# merge the Zotero references and the packages references
cat(paste("% Automatically generated", Sys.time()), "\n% DO NOT EDIT",
    { readLines("zotero.bib") %>% 
      paste(collapse = "\n") },
    { readLines("packages.bib") %>% 
      paste(collapse = "\n") },
    file = "biblio.bib",
    sep = "\n")


# Some packages reference keys must be modified
# (their key is not the package name)
# check in packages.bib
packages_keys <- packages %>% 
  enframe() %>% 
  mutate(value = case_when(value == "knitr" ~ "@knitr1",
                           #value == "boot" ~ "@boot1",
                           TRUE ~ paste0("@", value)))
```

---
nocite: |
  @Svenssonguideornitho2007
  `r paste(packages_keys$value, collapse = "\n  ")`
---

Any text...

Data analyzed with R [@r_development_core_team_r:_2010][^r_version].

[^r_version]: `r version$version.string`, with : `r paste(lapply(sort(packages), packages_versions), collapse = ", ")`.

We can easily cite references (menu Addins / Insert citations) after having run the setup chunk to create the bibliography file [@moreno_measuring_2017].

Lorem Ipsum

## References

After knitting we get the rendered document.


See the result.