The current French administrative limits database (ADMIN EXPRESS 2021 edition, from IGN) is much more detailed than the original version (from 50 MB zipped in 2017 to 500 MB zipped now), its geometry now being derived from the BD TOPO. However, we don't always need large-scale details, especially for web applications. The commune layer alone is a huge 400 MB shapefile, not really usable in a small-scale Leaflet map, for example. Even the light version (ADMIN EXPRESS COG CARTO) is 120 MB for the commune layer.
Using sf::st_simplify() in R, or a similar command in QGIS, on these shapefiles would create holes or overlapping polygons, since shapefiles are not topologically aware. We could probably convert to lines, build the topology, simplify, clean, and rebuild the polygons in GRASS or ArcGIS, but it's quite a hassle…
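To see why, here is a minimal, self-contained sketch (toy polygons, not the IGN data): two polygons sharing a jagged boundary are simplified independently, so the shared edge is typically simplified differently on each side, leaving slivers or gaps.

```r
library(sf)

# two toy polygons tiling a 2 x 1 rectangle, sharing a jagged boundary
shared <- rbind(c(1, 0), c(1.2, 0.25), c(0.8, 0.5), c(1.2, 0.75), c(1, 1))
a <- st_polygon(list(rbind(c(0, 0), shared, c(0, 1), c(0, 0))))
b <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1),
                           c(1.2, 0.75), c(0.8, 0.5), c(1.2, 0.25), c(1, 0))))
pair <- st_sfc(a, b)

# each polygon is simplified on its own: nothing guarantees the shared
# boundary is simplified the same way on both sides
pair_simpl <- st_simplify(pair, dTolerance = 0.3)

# any difference between the union of the originals (area 2) and the union
# of the simplified polygons reveals gaps or overlaps
st_area(st_union(pair)) - st_area(st_union(pair_simpl))
```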
A nice solution is Mapshaper, either on mapshaper.org or, better for reproducibility, with the {rmapshaper} package in R. For such a large dataset it is advised to use a node.js installation instead of relying on the package's embedded version.
In red the original, in black the simplified version, with départements in bold
On Debian-like systems:
> sudo apt-get install nodejs npm
Or on Windows: install https://nodejs.org/. If needed, add C:\Users\xxxxxxxx\AppData\Roaming\npm to your $PATH.
> npm config set proxy "http://login:password@proxy:8080" # if necessary
> npm install -g mapshaper
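Once node and mapshaper are installed, the same kind of simplification can also be run directly from the command line (file names here are illustrative):
> mapshaper COMMUNE.shp -simplify visvalingam 3% keep-shapes -o COMMUNE_simpl.shp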
For ms_simplify() we will set sys = TRUE to take advantage of the node.js executable. Experiment with the other parameters to find a resolution that suits you. Here we use the Visvalingam algorithm at 3 %, squeezing the commune layer from 400 MB down to 30 MB. From there we rebuild the département, région and EPCI layers with ms_dissolve() calls. We then join the original attributes back and export everything to a GeoPackage with some metadata.
library(tidyverse)
library(sf)
library(rmapshaper)
library(geojsonio)
library(janitor)
library(fs)
# ADMIN EXPRESS COG France entière édition 2021 (in WGS84)
# ftp://Admin_Express_ext:Dahnoh0eigheeFok@ftp3.ign.fr/ADMIN-EXPRESS-COG_3-0__SHP__FRA_WM_2021-05-19.7z
# also available on :
# http://files.opendatarchives.fr/professionnels.ign.fr/adminexpress/ADMIN-EXPRESS-COG_3-0__SHP__FRA_WM_2021-05-19.7z
# originals ---------------------------------------------------------------
source_ign <- "~/sig/ADMINEXPRESS/ADMIN-EXPRESS-COG_3-0__SHP__FRA_2021-05-19/ADMIN-EXPRESS-COG/1_DONNEES_LIVRAISON_2021-05-19/ADECOG_3-0_SHP_WGS84G_FRA"
com <- source_ign %>%
  path("COMMUNE.shp") %>%
  read_sf() %>%
  clean_names()
dep <- source_ign %>%
  path("DEPARTEMENT.shp") %>%
  read_sf() %>%
  clean_names()
reg <- source_ign %>%
  path("REGION.SHP") %>%
  read_sf() %>%
  clean_names()
epci <- source_ign %>%
  path("EPCI.shp") %>%
  read_sf() %>%
  clean_names()
# simplify ---------------------------------------------------------------
check_sys_mapshaper()
# 6 min
# using a conversion to geojson_json to avoid encoding problems
com_simpl <- com %>%
  geojson_json(lat = "lat", lon = "long", group = "INSEE_COM", geometry = "polygon", precision = 6) %>%
  ms_simplify(keep = 0.03, method = "vis", keep_shapes = TRUE, sys = TRUE)
dep_simpl <- com_simpl %>%
  ms_dissolve(field = "insee_dep", sys = TRUE)
reg_simpl <- com_simpl %>%
  ms_dissolve(field = "insee_reg", sys = TRUE)
epci_simpl <- com_simpl %>%
  ms_dissolve(field = "siren_epci", sys = TRUE)
# add attributes and export ----------------------------------------------
destination <- "~/donnees/ign/adminexpress_simpl.gpkg"
com_simpl %>%
  geojson_sf() %>%
  st_write(destination, layer = "commune",
           layer_options = c("IDENTIFIER=Communes Adminexpress 2021 simplifiées",
                             "DESCRIPTION=France WGS84 version COG (2021-05). Simplification mapshaper."))
dep_simpl %>%
  geojson_sf() %>%
  left_join(st_drop_geometry(dep), by = "insee_dep") %>%
  st_write(destination, layer = "departement",
           layer_options = c("IDENTIFIER=Départements Adminexpress 2021 simplifiés",
                             "DESCRIPTION=France WGS84 version COG (2021-05). Simplification mapshaper."))
reg_simpl %>%
  geojson_sf() %>%
  left_join(st_drop_geometry(reg), by = "insee_reg") %>%
  st_write(destination, layer = "region",
           layer_options = c("IDENTIFIER=Régions Adminexpress 2021 simplifiées",
                             "DESCRIPTION=France WGS84 version COG (2021-05). Simplification mapshaper."))
epci_simpl %>%
  geojson_sf() %>%
  mutate(siren_epci = str_remove(siren_epci, "200054781/")) %>% # remove Grand Paris
  left_join(st_drop_geometry(epci), by = c("siren_epci" = "code_siren")) %>%
  st_write(destination, layer = "epci",
           layer_options = c("IDENTIFIER=EPCI Adminexpress 2021 simplifiés",
                             "DESCRIPTION=Établissement public de coopération intercommunale France WGS84 version COG (2021-05). Simplification mapshaper."))
Download the GeoPackage (simplified communes, EPCI, départements and régions, 2021-05) in WGS84 (EPSG:4326) – 22 MB
The {geofacet} package allows you to "arrange a sequence of plots of data for different geographical entities into a grid that strives to preserve some of the original geographical orientation of the entities".
As in the previous post, this approach is interesting if you view each entity as a unit, don't care about its real size or weight, and don't want to spend too much time manually finding the best grid.
We will again use the same COVID-19 dataset. We manually add the overseas départements once we have found the right grid (by trying different seeds) and adjust Corsica's position.
COVID-19 deaths in hospital, by département, per 100 000 inhab.
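In outline, the workflow looks like this (a minimal sketch, not the full code of the post: a covid data frame with columns date, deaths_per_100k and dep, and a departements sf layer are assumed to exist; grid_auto() and facet_geo() are the {geofacet} functions used):

```r
library(tidyverse)
library(sf)
library(geofacet)

# generate a candidate grid from the département polygons; try several
# seeds, keep the most pleasing layout, then edit it by hand
# (overseas départements, Corsica)
dep_grid <- grid_auto(departements, seed = 42)

# one small time series per département, laid out on the grid
covid %>%
  ggplot(aes(date, deaths_per_100k)) +
  geom_line() +
  facet_geo(~ dep, grid = dep_grid)
```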
Hexagon tessellation using the great {geogrid} package.
The départements are the second level of administrative divisions in France. They have neither the same area nor the same population, and this heterogeneity poses a few challenges for a fair and accurate map representation (see the post on smoothing).
However, if we are just interested in the départements as units, we can use a regular grid for visualization. Since France is often called "the hexagon", we could even use a hexagonal tiling (a fractal map!)…
Creating such a grid while preserving minimal topological relations and the general shape can be time-consuming, but thanks to {geogrid} it's quite easy. The geogrid dev page provides nice examples. We will reuse our code from the COVID-19 animation. The resulting GIS file is provided below.
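The core of the {geogrid} workflow is just two calls (a sketch assuming a departements sf object; calculate_grid() and assign_polygons() are the two functions from the package):

```r
library(sf)
library(geogrid)

# compute a hexagonal grid with the same number of cells as there are
# départements; different seeds give different arrangements
new_cells <- calculate_grid(shape = departements, grid_type = "hexagonal", seed = 3)

# assign each département to its "best" hexagon while preserving
# the neighborhood relations as much as possible
dep_hex <- assign_polygons(departements, new_cells)
```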
The overall shape and the neighborhood relations are well rendered. Deformations are quite important for the small départements around Paris, but the map remains legible.
From the official data by Santé Publique France, we spatially smooth the deaths (published by SPF at the département scale) and normalize by a similarly smoothed population grid. For that we use the {btb} package.
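The smoothing step can be sketched like this (a hedged outline: column names, cell size and bandwidth are illustrative, and the classic kernelSmoothing() interface of {btb} is assumed):

```r
library(btb)

# deaths and pop: point data frames with Lambert-93 coordinates,
# columns x, y and deces / pop respectively (illustrative names)
# 1 km grid cells, 2 km kernel bandwidth
deaths_smooth <- kernelSmoothing(deaths, sEPSG = "2154",
                                 iCellSize = 1000, iBandwidth = 2000)
pop_smooth <- kernelSmoothing(pop, sEPSG = "2154",
                              iCellSize = 1000, iBandwidth = 2000)

# rate per 100 000 inhabitants, assuming both outputs share the same grid
deaths_smooth$rate <- deaths_smooth$deces / pop_smooth$pop * 1e5
```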
Metadata are an essential part of a robust data science workflow; they record the description of the tables and the meaning of each variable: its units, quality, allowed range, how it was collected, when it was recorded, etc. Data without metadata are practically worthless. Here we will show how to transfer metadata from PostgreSQL to R.
In PostgreSQL, metadata can be stored in comments with the statements COMMENT ON TABLE ... IS '...' or COMMENT ON COLUMN ... IS '...'. So I hope your tables and columns have these nice comments, and that you can see them in psql or pgAdmin for example. But what about R?
In R, metadata can be assigned as attributes of any object, mainly as "labels" on the columns. You may have seen labels when importing labelled data from SPSS, for example.
We will use the {Hmisc} package, which provides functions to manage labels. Another interesting package is {sjlabelled}.
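As a minimal illustration (toy data; label() is the actual {Hmisc} getter/setter):

```r
library(Hmisc)

df <- data.frame(com = "01001", libelle = "L'Abergement-Clémenciat")
label(df$com) <- "Code commune"
label(df$libelle) <- "Nom en clair (typographie riche) avec article"

# the label is stored as a "label" attribute on the column
label(df$com)
attributes(df$com)$label
```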
We’ll start by adding a table and its metadata in PostgreSQL. I chose to use the list of all the french communes from the code officiel géographique (COG). The data is provided as a zipped CSV file ; luckily a data dictionary appears on the page so we’ll scrape it.
# data from https://www.insee.fr/fr/information/3720946
library(tidyverse)
library(httr)
library(rvest)
library(janitor)
library(glue)
library(RPostgreSQL)
# cnx: an existing database connection from RPostgreSQL::dbConnect()
# download
dl <- tempfile()
GET("https://www.insee.fr/fr/statistiques/fichier/3720946/commune2019-csv.zip",
    write_disk(dl))
unzip(dl)
# import into PostgreSQL
read_csv("commune2019.csv",
         col_types = cols(.default = col_character())) %>%
  clean_names() %>%
  dbWriteTable(cnx, c("ref_cog", "commune2019"), ., row.names = FALSE)
dbSendQuery(cnx, "ALTER TABLE ref_cog.commune2019 ADD PRIMARY KEY (com, typecom);")
# get the data dictionary from INSEE
page <- GET("https://www.insee.fr/fr/information/3720946") %>%
  content()
table_info <- page %>%
  html_node("#titre-bloc-3 + div > p") %>%
  html_text() %>%
  str_trim() %>%
  str_replace_all("\\s+", " ")
columns_info <- page %>%
  html_node("table") %>%
  html_table() %>%
  clean_names() %>%
  mutate(designation_et_modalites_de_la_variable = str_trim(designation_et_modalites_de_la_variable),
         designation_et_modalites_de_la_variable = str_replace_all(designation_et_modalites_de_la_variable, "\\s+", " "),
         nom_de_la_variable = make_clean_names(nom_de_la_variable))
# add table metadata in PostgreSQL
dbSendQuery(cnx, glue_sql("COMMENT ON TABLE ref_cog.commune2019 IS {table_info}", .con = cnx))
# add columns metadata in PostgreSQL
walk2(columns_info$nom_de_la_variable,
      columns_info$designation_et_modalites_de_la_variable,
      ~ dbSendQuery(cnx, glue_sql("COMMENT ON COLUMN ref_cog.commune2019.{`.x`} IS {.y}", .con = cnx)))
Now we have this nice table:
=> \dt+ ref_cog.*
List of relations
Schema | Name | Type | Owner | Size | Description
---------+-------------+-------+---------+---------+-------------------------------------------------------------------
ref_cog | commune2019 | table | xxxxxxx | 3936 kB | Liste des communes, arrondissements municipaux, communes déléguées et
communes associées au 1er janvier 2019, avec le code des niveaux
supérieurs (canton ou pseudo-canton, département, région)
(1 row)
=> \d+ ref_cog.commune2019
Table "ref_cog.commune2019"
Column | Type | Modifiers | Storage | Stats target | Description
-----------+------+-----------+----------+--------------+----------------------------------------------------------------
typecom | text | | extended | | Type de commune
com | text | | extended | | Code commune
reg | text | | extended | | Code région
dep | text | | extended | | Code département
arr | text | | extended | | Code arrondissement
tncc | text | | extended | | Type de nom en clair
ncc | text | | extended | | Nom en clair (majuscules)
nccenr | text | | extended | | Nom en clair (typographie riche)
libelle | text | | extended | | Nom en clair (typographie riche) avec article
can | text | | extended | | Code canton. Pour les communes « multi-cantonales » code décliné
de 99 à 90 (pseudo-canton) ou de 89 à 80 (communes nouvelles)
comparent | text | | extended | | Code de la commune parente pour les arrondissements municipaux et
les communes associées ou déléguées.
Indexes:
"commune2019_pkey" PRIMARY KEY, btree (com, typecom)
Usually we query the data into R this way:
cog <- dbGetQuery(cnx,
                  "SELECT *
                   FROM ref_cog.commune2019
                   LIMIT 10")
We can create a function that queries the table metadata in information_schema.columns and adds it to the data frame. The function expects a data frame, the name of the schema.table from which to read the comments, and a connection handle. It returns the data frame with column labels and a metadata attribute holding the table description.
#' Add attributes to a dataframe from metadata read in the PostgreSQL database
#'
#' @param df a dataframe
#' @param schema_table "schema.table" from which to read the comments
#' @param cnx a database connection from RPostgreSQL::dbConnect()
#'
#' @return a dataframe with attributes
#'
#' @examples \dontrun{add_metadata(iris, "public.iris", cnx)}
add_metadata <- function(df, schema_table, cnx) {
  # get the table description and add it to a data frame attribute called "metadata"
  attr(df, "metadata") <- dbGetQuery(cnx,
                                     glue_sql("SELECT obj_description({schema_table}::regclass) AS table_description;",
                                              .con = cnx)) %>%
    pull(table_description)
  # get columns comments
  meta <- str_match(schema_table, "^(.*)\\.(.*)$") %>%
    glue_sql(
      "SELECT
         column_name,
         pg_catalog.col_description(
           format('%s.%s', isc.table_schema, isc.table_name)::regclass::oid,
           isc.ordinal_position) AS column_description
       FROM information_schema.columns AS isc
       WHERE isc.table_schema = {s[2]}
         AND isc.table_name = {s[3]};",
      s = .,
      .con = cnx) %>%
    dbGetQuery(cnx, .)
  # match the columns comments to the variables
  label(df, self = FALSE) <- colnames(df) %>%
    enframe() %>%
    left_join(meta, by = c("value" = "column_name")) %>%
    pull(column_description)
  df
}
cog <- add_metadata(cog, "ref_cog.commune2019", cnx)
attr(cog, "metadata")
[1] "Liste des communes, arrondissements municipaux, communes déléguées et communes associées au 1er janvier 2019, avec le code des niveaux supérieurs (canton ou pseudo-canton, département, région)"
And you can see the metadata in the column headings of the RStudio viewer with View(cog):
… the headings now show the metadata !
We can also use contents(cog):
Data frame:cog 10 observations and 11 variables Maximum # NAs:10
- Labels Class Storage NAs
typecom Type de commune character character 0
com Code commune character character 0
reg Code région character character 0
dep Code département character character 0
arr Code arrondissement character character 0
tncc Type de nom en clair character character 0
ncc Nom en clair (majuscules) character character 0
nccenr Nom en clair (typographie riche) character character 0
libelle Nom en clair (typographie riche) avec article character character 0
can Code canton. Pour les communes « multi-cantonales » code décliné... character character 0
comparent Code de la commune parente pour les arrondissements municipaux... character character 10
Or:
cog %>%
label() %>%
enframe()
# A tibble: 11 x 2
name value
<chr> <chr>
1 typecom Type de commune
2 com Code commune
3 reg Code région
4 dep Code département
5 arr Code arrondissement
6 tncc Type de nom en clair
7 ncc Nom en clair (majuscules)
8 nccenr Nom en clair (typographie riche)
9 libelle Nom en clair (typographie riche) avec article
10 can Code canton. Pour les communes « multi-cantonales » code décliné de 99 à 90 (pseudo-canton) ou …
11 comparent Code de la commune parente pour les arrondissements municipaux et les communes associées ou dél…
Or, lastly, for one column:
label(cog$tncc)
[1] "Type de nom en clair"
We can also search for information in the variable names or in the labels with another function, which can be helpful when we have a few hundred columns…
search_var <- function(df, keyword) {
  df %>%
    label() %>%
    enframe() %>%
    rename(variable = name,
           metadata = value) %>%
    filter_all(any_vars(str_detect(., regex(keyword, ignore_case = TRUE))))
}
search_var(cog, "canton")
# A tibble: 1 x 2
variable metadata
<chr> <chr>
1 can Code canton. Pour les communes « multi-cantonales » code décliné de 99 à 90 (pseudo-canton) ou de…