What is the longest year number (so far) when written in Roman numerals?
library(tidyverse)
tibble(y = 1:2020) %>%
  mutate(r = as.roman(y),
         l = str_length(r)) %>%
  slice_max(l)
# A tibble: 1 x 3
      y r                 l
  <int> <roman>       <int>
1  1888 MDCCCLXXXVIII    13
It is year 1888, with 13 characters…
And since the largest writable number is 3899 according to the strict rules implemented in R (although some say it's 3999),
tibble(y = 1:5000) %>%
  mutate(r = as.roman(y),
         l = str_length(r)) %>%
  filter(lead(is.na(l))) %>%
  slice_min(l)
# A tibble: 1 x 3
      y r               l
  <int> <roman>     <int>
1  3899 MMMDCCCXCIX    11
the longest year overall will be 3888, with 15 characters.
tibble(y = 1:3899) %>%
  mutate(r = as.roman(y),
         l = str_length(r)) %>%
  slice_max(l)
# A tibble: 1 x 3
      y r                   l
  <int> <roman>         <int>
1  3888 MMMDCCCLXXXVIII    15
Nice pattern:
tibble(y = 1:3899) %>%
  mutate(r = as.roman(y),
         l = str_length(r)) %>%
  ggplot(aes(y, l)) +
  # geom_col(width = 1) +
  geom_point(alpha = .2) +
  # geom_line(alpha = .5) +
  geom_smooth() +
  labs(title = "Character length of Roman numeral years",
       x = "year",
       y = "characters")
Character length of Roman numeral years
And there are only eleven palindromic years:
tibble(y = 1:3899) %>%
  mutate(r = as.character(as.roman(y)),
         rr = stringi::stri_reverse(r)) %>%
  filter(r == rr,
         str_length(r) > 1)
# A tibble: 11 x 3
       y r     rr
   <int> <chr> <chr>
 1     2 II    II
 2     3 III   III
 3    19 XIX   XIX
 4    20 XX    XX
 5    30 XXX   XXX
 6   190 CXC   CXC
 7   200 CC    CC
 8   300 CCC   CCC
 9  1900 MCM   MCM
10  2000 MM    MM
11  3000 MMM   MMM
Trail running can be hard (Malliaropoulos, Mertyri, and Tsaklis 2015; Hespanhol Junior, van Mechelen, and Verhagen 2017; Istvan et al. 2019; Messier et al. 2018). During one race, I was wondering whether we could avoid running altogether and predict the race time from distance, elevation change, sex and performance index (so yes, we still have to run sometimes to measure that index).
We need a dataset to play with. ITRA maintains a quite comprehensive database (International Trail Running Association 2020), but no up-to-date dataset is openly available. However, we can use the hidden JSON API to scrape a sample. I created an R package, {itra}, for this purpose; with it we can get some runners and their results:
itra_runners("jornet")
#> # A tibble: 5 x 15
#>   id_coureur     id  cote nom   prenom quartz team      M L     XL      XXL
#>        <int>  <int> <int> <chr> <chr>   <int> <chr> <int> <lgl> <lgl> <int>
#> 1       2704 2.70e3   950 JORN… Kilian     20 ""      898 NA    NA      896
#> 2     707322 7.07e5   538 RENU… Arturo      0 ""       NA NA    NA       NA
#> 3    2667939 2.67e6   520 PARE… Samuel      0 ""       NA NA    NA       NA
#> 4    2673933 2.67e6   468 PÉRE… Anna        0 ""       NA NA    NA       NA
#> 5    2673935 2.67e6   468 RIUS… Sergi       0 ""       NA NA    NA       NA
#> # … with 4 more variables: S <int>, XS <int>, XXS <lgl>, palmares <chr>
It works…
Get the data
So, we're going to build a dataframe of single letters and random page indices to create our dataset of runners. I'll manually add some low indices to oversample the elite athletes; otherwise we would probably miss them entirely.
# helpers ---------------------------------------------------------------------

# compute the mode (to find the runner's sex from their details; see below)
stat_mode <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# sample ----------------------------------------------------------------------

# progress bar versions of the ITRA functions
# (scraping will be slow, we want to know if it works)
pb_itra_runners <- function(...) {
  pb$tick()
  itra_runners(...)
}

pb_itra_runner_results <- function(...) {
  pb$tick()
  itra_runner_results(...)
}

pb_itra_runner_details <- function(...) {
  pb$tick()
  itra_runner_details(...)
}
# dataset generation from a "random" set of indices
oversample_high_score <- c(0, 50, 100, 250, 500)

my_sample <- tibble(l = letters,
                    n = c(oversample_high_score,
                          sample(0:100000,
                                 length(letters) - length(oversample_high_score)))) %>%
  expand(l, n) %>%
  distinct(l, n)
We have 676 tuples of (letter, index). We can now start querying the website; we should get 5 runners (by default) for each tuple. But we will have some duplicates, because for example at index 0 we will get "Kilian Jornet" for the letters "k", "i", "l"…
I use purrr::map2() to iterate over our sample, wrapped in slowly() to avoid overloading the website and in possibly() to get NA instead of errors that would stop the whole scraping.
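A minimal sketch of that scraping loop (hedged: the exact arguments of itra_runners(), here a hypothetical start offset, the progress bar setup and the use of NULL rather than NA are assumptions):

library(progress)

pb <- progress_bar$new(total = nrow(my_sample))

# wait between calls; return NULL (simply dropped by map2_dfr()) instead of failing
get_runners <- possibly(slowly(pb_itra_runners, rate = rate_delay(1)),
                        otherwise = NULL)

runners <- map2_dfr(my_sample$l, my_sample$n,
                    ~ get_runners(.x, start = .y)) %>%
  distinct(id, .keep_all = TRUE)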
You'll see that the JSON variables have French or English names!
Some preprocessing is necessary. Sex is not a direct property of the runners; we must get it from runners_details. We then join our 3 dataframes and compute some new variables.
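A hedged sketch of that preprocessing, assuming the three dataframes are runners, runners_details and results, and that the detail column holding sex is sexe and the race time field is temps (both assumptions):

# a runner's sex = the most frequent value found in their details
runners_sex <- runners_details %>%
  group_by(id) %>%
  summarise(sex = stat_mode(sexe, na.rm = TRUE))

results_clean <- results %>%
  left_join(runners, by = "id") %>%
  left_join(runners_sex, by = "id") %>%
  mutate(hours = as.numeric(hms::as_hms(temps)) / 3600)  # race time in hours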
# analysis --------------------------------------------------------------------

# runners' ITRA performance index
runners %>%
  left_join(runners_sex) %>%
  ggplot(aes(cote, fill = sex)) +
  geom_histogram(binwidth = 10)
Most athletes have an ITRA score between 300 and 700. We can also see the results of our oversampling, with many athletes added above 700. And we can notice that trail running is, sadly, a predominantly male ("H") sport.
# distance ran
results_clean %>%
  ggplot(aes(dist_tot)) +
  geom_histogram(binwidth = 2)
We can see the common distances at 25, 50, 100 km and 160 km (100 miles).
# times ran
results_clean %>%
  ggplot(aes(hours)) +
  geom_histogram(binwidth = 1 / 6)
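For reference, a hedged sketch of how the model below could have been fitted with {caret} and {ranger}; the formula, predictor names and seed are assumptions consistent with the printed output (5 predictors, 10-fold CV):

library(caret)

set.seed(42)
model <- train(hours ~ dist_tot + deniv_tot + deniv_neg_tot + sex + cote,  # hypothetical predictors
               data = results_clean,
               method = "ranger",
               trControl = trainControl(method = "cv", number = 10))
model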
Random Forest

18339 samples
    5 predictors

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16649, 16704, 16338, 16524, 16407, 16223, ...
Resampling results across tuning parameters:

  mtry  splitrule   RMSE      Rsquared   MAE
  2     variance    1.822388  0.9472348  0.8747978
  2     extratrees  1.844228  0.9463288  0.8810525
  3     variance    1.827556  0.9469227  0.8801065
  3     extratrees  1.847274  0.9459055  0.8800262
  5     variance    1.873645  0.9440697  0.9021319
  5     extratrees  1.853316  0.9455854  0.8882530

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 5.
So with a root mean square error of 1.82 hours, I can't say that my model is very precise. However, in my case, my last 16 km race (01:40:00) is predicted pretty well!
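Something along these lines (hedged: the predictor names follow the hypothetical formula above and the values are purely illustrative):

predict(model, tibble(dist_tot = 16, deniv_tot = 600, deniv_neg_tot = 600,
                      sex = "H", cote = 500))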
So next time I’ll just run the model instead of running the race.
I'll leave the tuning to the specialists. While writing this post I found another potential, similar use for the {itra} package (Fogliato, Oliveira, and Yurko 2020).
References
Fogliato, Riccardo, Natalia L. Oliveira, and Ronald Yurko. 2020. "TRAP: A Predictive Framework for Trail Running Assessment of Performance". arXiv:2002.01328 [stat], July 12, 2020. http://arxiv.org/abs/2002.01328.
Hespanhol Junior, Luiz Carlos, Willem van Mechelen, and Evert Verhagen. 2017. "Health and Economic Burden of Running-Related Injuries in Dutch Trailrunners: A Prospective Cohort Study". Sports Medicine 47 (2): 367-77. https://doi.org/10.1007/s40279-016-0551-8.
International Trail Running Association. 2020. "ITRA". https://itra.run/.
Istvan, Albertus Oosthuizen, Paul Yvonne, Jeremy Ellapen Terry, Barnard Marco, Bongani Qumbu Timothy, Valery Hammill Henriëtte, and Lukas Strydom Gert. 2019. "Common ultramarathon trail running injuries and illnesses: A review (2007-2016)". International Journal of Medicine and Medical Sciences 11 (4): 36-42. https://doi.org/10.5897/IJMMS2018.1386.
Malliaropoulos, Nikolaos, Dimitra Mertyri, and Panagiotis Tsaklis. 2015. "Prevalence of Injury in Ultra Trail Running". Human Movement 16 (2). https://doi.org/10.1515/humo-2015-0026.
Messier, Stephen P., David F. Martin, Shannon L. Mihalko, Edward Ip, Paul DeVita, D. Wayne Cannon, Monica Love, et al. 2018. "A 2-Year Prospective Cohort Study of Overuse Running Injuries: The Runners and Injury Longitudinal Study (TRAILS)". The American Journal of Sports Medicine 46 (9): 2211-21. https://doi.org/10.1177/0363546518773755.
The {geofacet} package allows you to "arrange a sequence of plots of data for different geographical entities into a grid that strives to preserve some of the original geographical orientation of the entities".
As in the previous post, this is interesting if you view each entity as a unit, don't care about its real size or weight, and don't want to spend too much time manually finding the best grid.
We will again use the same COVID-19 dataset. We manually add the overseas départements once we have found the right grid (by trying different seeds) and adjust Corsica's position.
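A minimal sketch of the approach (hedged: covid_deaths and its columns are assumptions, and the grid excerpt is purely illustrative; the real grid was generated by trying different seeds and then adjusted by hand):

library(geofacet)
library(tidyverse)

# custom grid: one row per département with its (row, col) position (excerpt)
dep_grid <- tribble(
  ~code, ~name,            ~row, ~col,
  "59",  "Nord",              1,    6,
  "62",  "Pas-de-Calais",     1,    5,
  "2A",  "Corse-du-Sud",     13,   12,
  "2B",  "Haute-Corse",      12,   12
)

covid_deaths %>%
  ggplot(aes(date, deaths_per_100k)) +
  geom_line() +
  facet_geo(~ code, grid = dep_grid, label = "name")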
COVID-19 deaths in hospital, by département, per 100,000 inhabitants
Hexagon tessellation using the great {geogrid} package.
The départements are the second level of administrative government in France. They have neither the same area nor the same population, and this heterogeneity poses a few challenges for a fair and accurate map representation (see the post on smoothing).
However, if we are just interested in the départements as units, we can use a regular grid for visualization. Since France is often called "the hexagon", we could even use a hexagonal tiling (a fractal map!)…
Creating the grid while conserving minimal topological relations and the general shape can be time-consuming, but thanks to {geogrid} it's quite easy. The geogrid dev page provides nice examples. We will reuse our code from the COVID-19 animation. The resulting GIS file is provided below.
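A sketch of the {geogrid} workflow (hedged: the file path, seed and final plotting are illustrative, not the exact code used for the map below):

library(geogrid)
library(sf)
library(ggplot2)

deps <- read_sf("departements.shp")  # hypothetical source layer

# compute a hexagonal grid preserving the overall shape,
# then assign each département to a cell
cells <- calculate_grid(shape = deps, grid_type = "hexagonal", seed = 3)
deps_hex <- assign_polygons(deps, cells)

ggplot(deps_hex) +
  geom_sf()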
The overall shape and neighborhood relations are well rendered. Deformation is quite significant for the small départements around Paris, but the map remains legible.
From the official data published by Santé Publique France, we spatially smooth the deaths (provided by SPF at the département scale) and normalize by a similarly smoothed population grid. For that we use the {btb} package.
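In raster terms it boils down to something like this (a hedged sketch reusing the lissage() helper defined in the smoothing post below; deaths_points, its field name and the bandwidth/resolution values are assumptions):

deaths_grid <- lissage(deaths_points, "deaths", 20000, 2000, fr)
pop_grid    <- lissage(comm, "POPULATION", 20000, 2000, fr)

deceased_per_100k <- deaths_grid / pop_grid * 100000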
After the classic plot of the Keeling curve used on Wikipedia (see our former post), we can explore another data visualization. The atmospheric CO₂ concentration, the main cause of climate warming, follows a seasonal cycle, so it could be interesting (or ironic?) to use a polar plot.
Config and data
We only keep two translations for brevity here…
# Required packages
library(tidyverse)
library(scales)
library(lubridate)

# Translations ------------------------------------------------------------
language <- list(
  en_US = list(
    locale_lc_time = "en_US.UTF-8",
    title = bquote("Monthly mean"~CO[2]~"concentration"),
    caption = paste("Data : P. Tans, NOAA/ESRL (www.esrl.noaa.gov/gmd/ccgg/trends/)\nand R. Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/). Accessed", Sys.Date()),
    x = "Year",
    y = bquote(CO[2]~"fraction in dry air ("*mu*"mol/mol)"),
    x2 = "Month",
    y2 = bquote(atop(CO[2]~"fraction in dry air ("*mu*"mol/mol)", "Departure from yearly average")),
    title2 = "Seasonal variation"
  ),
  fr_FR = list(
    locale_lc_time = "fr_FR.UTF-8",
    title = bquote("Moyenne mensuelle de la concentration de"~CO[2]),
    caption = paste("données : P. Tans, NOAA/ESRL (www.esrl.noaa.gov/gmd/ccgg/trends/)\net R. Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/). Accédé le", Sys.Date()),
    x = "année",
    y = bquote("fraction de"~CO[2]~"dans l'air sec ("*mu*"mol/mol)"),
    x2 = "mois",
    y2 = bquote(atop("fraction de"~CO[2]~"dans l'air sec ("*mu*"mol/mol)", "en écart à la moyenne annuelle")),
    title2 = "Variation saisonnière"
  ))
# Data --------------------------------------------------------------------
# https://www.esrl.noaa.gov/gmd/ccgg/trends/
co2ml <- read_delim("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt",
                    delim = " ",
                    locale = locale(decimal_mark = "."),
                    na = c("-99.99", "-1"),
                    col_types = "iiddddi",
                    col_names = c("year", "month", "decimal", "co2",
                                  "co2_interpol", "co2_trend", "days"),
                    comment = "#",
                    trim_ws = TRUE) %>%
  group_by(year) %>%
  mutate(year_mean = mean(co2_interpol, na.rm = TRUE),
         delta = co2_interpol - year_mean,
         vdate = ymd(paste0("2015-", month, "-01"))) %>%
  ungroup()
Create the plot for each language and save
We use a virtual date to keep the data in the same January-December interval and we add a partial dataframe to smooth the Dec./Jan. transition and build the spiral.
# Polar plot
for (l in names(language)) {
  message(l)
  current <- language[[l]]

  # format the date in local names
  Sys.setlocale("LC_TIME", current$locale_lc_time)

  p3 <- co2ml %>%
    filter(vdate == "2015-01-01") %>%
    mutate(vdate = ymd("2015-12-31"),
           year = year - 1) %>%
    bind_rows(co2ml) %>%
    ggplot(aes(vdate, co2_interpol, group = year, color = year)) +
    geom_line(size = 1.2) +
    scale_x_date(breaks = pretty_breaks(12), labels = date_format("%b")) +
    scale_color_viridis_c() +
    labs(subtitle = current$title,
         x = "",
         y = current$y,
         color = current$x,
         title = paste("Mauna Loa", min(co2ml$year), "-", max(co2ml$year)),
         caption = current$caption) +
    coord_polar() +
    theme_bw() +
    theme(axis.title.y = element_text(hjust = .85),
          panel.grid.major.y = element_blank(),
          panel.grid.minor.x = element_blank(),
          panel.border = element_blank(),
          plot.caption = element_text(size = 7))

  ggsave(p3,
         file = paste("co2_mauna_loa_polar", l, Sys.Date(), "wp.svg", sep = "_"),
         width = 20, height = 20, units = "cm", device = svg)
}
60 years of CO₂ increase
Does this new year look good? Will the spiral cross its path soon?
Representing mass data (inhabitants, livestock, …) on a map is not always easy: choropleth maps are clearly a no-go, except if you normalize by area, and then you stumble on the MAUP… A nice solution is smoothing, producing a raster. You even get freebies like (potential) statistical confidentiality, a better geographic synthesis and easy computations between multiple layers.
Kernel smoothing should not be confused with interpolation or kriging: the aim here is to "spread" and sum point values; see Loonis and de Bellefon (2018) for a comprehensive explanation.
We could use the {RSAGA} package, which provides a "Kernel Density Estimation" tool in its "grid_gridding" library for use with rsaga.geoprocessor(). However, we'll use the {btb} package (Santos et al. 2018), which has the great advantage of letting us specify a geographical study zone, preventing our values from bleeding into another country or into the sea, for example.
A few prerequisites and assumptions:

- a 7z decompression utility must be available in your $PATH;
- the shapefile COMMUNE.shp has a POPULATION field;
- it is a polygon coverage, so we'll take the "centroids".
library(raster) # load before dplyr/tidyverse so that dplyr::select() wins the name conflict
library(tidyverse)
library(httr)
library(sf)
library(btb)

# download and extract data
zipped_file <- tempfile()
GET("ftp://Admin_Express_ext:Dahnoh0eigheeFok@ftp3.ign.fr/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14.7z.001",
    write_disk(zipped_file),
    progress())
system(paste("7z x -odata", zipped_file))

# create a unique polygon for France (our study zone)
fr <- read_sf("data/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14/ADMIN-EXPRESS/1_DONNEES_LIVRAISON_2019-03-14/ADE_2-0_SHP_LAMB93_FR/REGION.shp") %>%
  st_union() %>%
  st_sf() %>%
  st_set_crs(2154)

# load communes; convert to points
comm <- read_sf("data/ADMIN-EXPRESS_2-0__SHP__FRA_2019-03-14/ADMIN-EXPRESS/1_DONNEES_LIVRAISON_2019-03-14/ADE_2-0_SHP_LAMB93_FR/COMMUNE.shp") %>%
  st_set_crs(2154) %>%
  st_point_on_surface()
We create a function:
#' Kernel weighted smoothing with arbitrary bounding area
#'
#' @param df sf object (points)
#' @param field weight field in the df
#' @param bandwidth kernel bandwidth (map units)
#' @param resolution output grid resolution (map units)
#' @param zone sf study zone (polygon)
#' @param out_crs EPSG (should be an equal-area projection)
#'
#' @return a raster object
#' @import btb raster fasterize dplyr plyr sf
lissage <- function(df, field, bandwidth, resolution, zone, out_crs = 3035) {
  if (st_crs(zone)$epsg != out_crs) {
    message("reprojecting study zone...")
    zone <- st_transform(zone, out_crs)
  }
  if (st_crs(df)$epsg != out_crs) {
    message("reprojecting data...")
    df <- st_transform(df, out_crs)
  }

  zone_bbox <- st_bbox(zone)

  # grid generation
  message("generating reference grid...")
  zone_xy <- zone %>%
    dplyr::select(geometry) %>%
    st_make_grid(cellsize = resolution,
                 offset = c(plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor),
                            plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor)),
                 what = "centers") %>%
    st_sf() %>%
    st_join(zone, join = st_intersects, left = FALSE) %>%
    st_coordinates() %>%
    as_tibble() %>%
    dplyr::select(x = X, y = Y)

  # kernel smoothing on the reference grid
  message("computing kernel...")
  kernel <- df %>%
    cbind(., st_coordinates(.)) %>%
    st_set_geometry(NULL) %>%
    dplyr::select(x = X, y = Y, field) %>%
    btb::kernelSmoothing(dfObservations = .,
                         sEPSG = out_crs,
                         iCellSize = resolution,
                         iBandwidth = bandwidth,
                         vQuantiles = NULL,
                         dfCentroids = zone_xy)

  # rasterization
  message("\nrasterizing...")
  raster::raster(xmn = plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor),
                 ymn = plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor),
                 xmx = plyr::round_any(zone_bbox[3] + bandwidth, resolution, f = ceiling),
                 ymx = plyr::round_any(zone_bbox[4] + bandwidth, resolution, f = ceiling),
                 resolution = resolution) %>%
    fasterize::fasterize(kernel, ., field = field)
}
Instead of a raw choropleth map like this (don't do this at home):
Inhabitants, quantile classification; see the red arrow: a big commune with a rather low population (2,100 inhabitants) stands out because of its large area
… we should use proportional symbols. But they get quite cluttered in urban areas:
You’ve got measles
We can also compute the density per polygon:
Density, quantile classification. Our previous commune is now more coherent, but the map is not very synthetic due to the heterogeneous size of the communes
We just have to run our function, for example with a bandwidth of 20 km and a cell size of 2 km. We use an equal-area projection: LAEA Europe (EPSG:3035).
comm %>%
  lissage("POPULATION", 20000, 2000, fr, 3035) %>%
  raster::writeRaster("pop.tif")
And lastly, our smoothing:
Smoothing, discrete quantile classification
That's a nice, synthetic representation!
After that, raster algebra is easy in R; for example, dividing a grid of crop yields by a grid of agricultural area, creating a percent change between two dates, etc.
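A hedged sketch of the first example (hypothetical file names, both rasters assumed to share the same grid):

library(raster)

yields  <- raster("crop_yields.tif")
surface <- raster("agricultural_area.tif")

yield_per_ha <- yields / surface
writeRaster(yield_per_ha, "yield_per_ha.tif")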
As a conclusion, a quote:
Behind the aesthetic quality of the smoothed maps, however, lies a major trap. By construction, smoothing methods mitigate breakdowns and borders and induce continuous representation of geographical phenomena. The smoothed maps therefore show the spatial autocorrelation locally. Two points close to the smoothing radius have mechanically comparable characteristics in this type of analysis. As a result, there is little point in drawing conclusions from a smoothed map of geographical phenomena whose spatial scale is of the order of the smoothing radius.
Loonis and de Bellefon (2018)
References
Loonis, Vincent, and Marie-Pierre de Bellefon, eds. 2018. Handbook of Spatial Analysis. Theory and Application with R. INSEE Méthodes 131. Montrouge, France: Institut national de la statistique et des études économiques. https://www.insee.fr/en/information/3635545.
Santos, Arlindo Dos, Francois Semecurbe, Auriane Renaud, Cynthia Faivre, Thierry Cornely, and Farida Marouchi. 2018. btb: Beyond the Border - Kernel Density Estimation for Urban Geography (version 0.1.30). https://CRAN.R-project.org/package=btb.