The CORINE Land Cover dataset is distributed as a geopackage weighing more than 8 GB. To limit the memory used when we only work on a subset, we can clip it at opening time. Here we will map the island of Cyprus:
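A minimal sketch of such a clipped read (the geopackage file name is an assumption): we build a WKT bounding box around Cyprus in the layer's CRS (EPSG:3035 for CORINE) and pass it to read_sf(), so only intersecting features are loaded.

```r
library(sf)

# Bounding box around Cyprus, built in WGS84, projected to the layer's
# CRS (EPSG:3035) and serialized as WKT
cyprus_bbox <- st_bbox(c(xmin = 32.2, ymin = 34.4, xmax = 34.7, ymax = 35.8),
                       crs = st_crs(4326)) |>
  st_as_sfc() |>
  st_transform(3035) |>
  st_as_text()

# Only features intersecting the box are read into memory
clc_cyprus <- read_sf("U2018_CLC2018_V2020_20u1.gpkg",
                      wkt_filter = cyprus_bbox)
```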
Photo: Mikael Leppä – CC BY – https://www.flickr.com/photos/54544400@N00
The Suunto watches (Spartan, Suunto 9,…) can record waypoints (or POIs), but while these can be visualized in the Suunto app (or on the watch), they cannot be exported for use with other tools. It used to be possible to access them from the Movescount website, but that service was discontinued a few months ago.
Now it’s just a matter of copying the database file (via USB, Bluetooth, etc.) to a PC, connecting to it and finding where the POIs are stored (in a table called… pois); we can use QGIS or R for that…
With R, we can select this year’s POIs (dates are stored as UNIX timestamps) and export them as GeoJSON:
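As a hedged sketch (only the pois table name comes from the text above; the database file name and the column names are assumptions), reading the POIs with DBI/RSQLite, filtering on the UNIX timestamp and writing GeoJSON could look like:

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(sf)

con <- dbConnect(SQLite(), "suunto_app.db")   # file name: assumption

pois_2021 <- dbReadTable(con, "pois") %>%
  # hypothetical column names: created (UNIX timestamp), longitude, latitude
  mutate(created = as.POSIXct(created, origin = "1970-01-01")) %>%
  filter(created >= as.POSIXct("2021-01-01")) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

st_write(pois_2021, "pois_2021.geojson")

dbDisconnect(con)
```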
The current 2021 French administrative limits database (ADMIN EXPRESS from IGN) is more detailed than the original version (from 50 MB zipped in 2017 to 500 MB zipped now), thanks to a more detailed geometry now based on the BD TOPO. However, we don’t always need large-scale details, especially for web applications. The commune layer itself is a huge 400 MB shapefile, not really usable in a small-scale Leaflet map for example. Even the light version (ADMIN EXPRESS COG CARTO) is 120 MB for the commune layer.
Using sf::st_simplify() in R or a similar command in QGIS on these shapefiles would create holes or overlapping polygons, shapefiles not being topologically aware. We could probably convert to lines, build the topology, simplify, clean, and rebuild polygons in GRASS or ArcGIS, but it’s quite a hassle…
A nice solution is using Mapshaper on mapshaper.org, or better for reproducibility, using {rmapshaper} in R. For such a large dataset, it is advised to use a Node.js installation instead of relying on the package’s embedded version.
in red the original, in black the simplified version with départements in bold
On Debian-like systems:
> sudo apt-get install nodejs npm
Or on Windows: install https://nodejs.org/. If needed, add C:\Users\xxxxxxxx\AppData\Roaming\npm to your $PATH.
> npm config set proxy "http://login:password@proxy:8080" # if necessary
> npm install -g mapshaper
For ms_simplify() we will set sys = TRUE to take advantage of the Node.js executable. Experiment with the other parameters to get a resolution that suits you. Here we use the Visvalingam algorithm keeping 3% of points, squeezing the commune layer from 400 MB to 30 MB. From there we rebuild the département, région and EPCI layers with ms_dissolve() commands. Then we join back the original attributes and export everything to a geopackage with some metadata.
library(tidyverse)
library(sf)
library(rmapshaper)
library(geojsonio)
library(janitor)
library(fs)
# ADMIN EXPRESS COG France entière édition 2021 (in WGS84)
# ftp://Admin_Express_ext:Dahnoh0eigheeFok@ftp3.ign.fr/ADMIN-EXPRESS-COG_3-0__SHP__FRA_WM_2021-05-19.7z
# also available on :
# http://files.opendatarchives.fr/professionnels.ign.fr/adminexpress/ADMIN-EXPRESS-COG_3-0__SHP__FRA_WM_2021-05-19.7z
# originals ---------------------------------------------------------------
source_ign <- "~/sig/ADMINEXPRESS/ADMIN-EXPRESS-COG_3-0__SHP__FRA_2021-05-19/ADMIN-EXPRESS-COG/1_DONNEES_LIVRAISON_2021-05-19/ADECOG_3-0_SHP_WGS84G_FRA"
com <- source_ign %>%
path("COMMUNE.shp") %>%
read_sf() %>%
clean_names()
dep <- source_ign %>%
path("DEPARTEMENT.shp") %>%
read_sf() %>%
clean_names()
reg <- source_ign %>%
path("REGION.SHP") %>%
read_sf() %>%
clean_names()
epci <- source_ign %>%
path("EPCI.shp") %>%
read_sf() %>%
clean_names()
# simplify ---------------------------------------------------------------
check_sys_mapshaper()
# 6 min
# using a conversion to geojson_json to avoid encoding problems
com_simpl <- com %>%
geojson_json(lat = "lat", lon = "long", group = "INSEE_COM", geometry = "polygon", precision = 6) %>%
ms_simplify(keep = 0.03, method = "vis", keep_shapes = TRUE, sys = TRUE)
dep_simpl <- com_simpl %>%
ms_dissolve(field = "insee_dep", sys = TRUE)
reg_simpl <- com_simpl %>%
ms_dissolve(field = "insee_reg", sys = TRUE)
epci_simpl <- com_simpl %>%
ms_dissolve(field = "siren_epci", sys = TRUE)
# add attributes and export ----------------------------------------------
destination <- "~/donnees/ign/adminexpress_simpl.gpkg"
com_simpl %>%
geojson_sf() %>%
st_write(destination, layer = "commune",
layer_options = c("IDENTIFIER=Communes Adminexpress 2021 simplifiées",
"DESCRIPTION=France WGS84 version COG (2021-05). Simplification mapshaper."))
dep_simpl %>%
geojson_sf() %>%
left_join(st_drop_geometry(dep), by = "insee_dep") %>%
st_write(destination, layer = "departement",
layer_options = c("IDENTIFIER=Départements Adminexpress 2021 simplifiés",
"DESCRIPTION=France WGS84 version COG (2021-05). Simplification mapshaper."))
reg_simpl %>%
geojson_sf() %>%
left_join(st_drop_geometry(reg), by = "insee_reg") %>%
st_write(destination, layer = "region",
layer_options = c("IDENTIFIER=Régions Adminexpress 2021 simplifiées",
"DESCRIPTION=France WGS84 version COG (2021-05). Simplification mapshaper."))
epci_simpl %>%
geojson_sf() %>%
mutate(siren_epci = str_remove(siren_epci, "200054781/")) %>% # remove Grand Paris
left_join(st_drop_geometry(epci), by = c("siren_epci" = "code_siren")) %>%
st_write(destination, layer = "epci",
layer_options = c("IDENTIFIER=EPCI Adminexpress 2021 simplifiés",
"DESCRIPTION=Établissement public de coopération intercommunale France WGS84 version COG (2021-05). Simplification mapshaper."))
Download the geopackage (simplified communes, EPCI, départements, régions, 2021-05) in WGS84 (EPSG:4326) – 22 MB
In trail running or orienteering, people say that running 100 m of elevation up and down takes the same time as running 1,000 m on the flat. Similar old rules of thumb exist (such as Naismith’s rule), and more recently some researchers (Davey, Hayes & Norman, 1994; Scarf, 2007) added a little science to this vernacular knowledge, but with few data points (2 athletes and 300 race results, respectively).
Could we add some modern, massive data to check this saying? Using our dataset scraped from ITRA, with 16,949 race results from 2,802 runners, we can fit a basic non-linear model to estimate the parameters:
> (model <- results %>%
+    nls(heures ~ (dist_tot + deniv_tot / k) / v, data = .,
+        start = list(k = 100, v = 8)))
Nonlinear regression model
model: heures ~ (dist_tot + deniv_tot/k)/v
data: .
k v
87.239 9.565
residual sum-of-squares: 313542
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.292e-06
> confint(model)
Waiting for profiling to be done...
2.5% 97.5%
k 81.672165 93.296957
v 9.332102 9.809812
So we see that the average flat speed sustainable over a long period for our sample (which is biased towards elite runners) is around 9.6 km·h⁻¹, and that 1 km flat is equivalent to 87 m [82–93] of height gain, not far from the old 100 m. Of course these values will vary according to the athlete’s shape, the total race length and profile, and many other parameters…
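Plugging the fitted values back into the model formula gives a quick race-time predictor; a small sketch (the example race is made up):

```r
# Fitted parameters from the nls() output above:
# k = metres of climb equivalent to 1 km flat, v = flat speed in km/h
k <- 87.239
v <- 9.565

race_hours <- function(dist_km, climb_m) (dist_km + climb_m / k) / v

race_hours(42, 2500)  # a hilly marathon: about 7.4 hours
```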
References
Davey, R. C., Hayes, M., & Norman, J. M. (1994). Running uphill: An experimental result and its applications. The Journal of the Operational Research Society, 45(1), 25. https://doi.org/10.2307/2583947
Scarf, P. (2007). Route choice in mountain navigation, Naismith’s rule, and the equivalence of distance and climb. Journal of Sports Sciences, 25(6), 719‑726. https://doi.org/10.1080/02640410600874906
What is the longest year number (yet) written in Roman numerals?
library(tidyverse)
tibble(y = 1:2020) %>%
mutate(r = as.roman(y),
l = str_length(r)) %>%
slice_max(l)
# A tibble: 1 x 3
y r l
<int> <roman> <int>
1 1888 MDCCCLXXXVIII 13
It is year 1888, with 13 characters…
And since the largest writable number is 3899, according to the strict rules in R (however some say it’s 3999),
tibble(y = 1:5000) %>%
mutate(r = as.roman(y),
l = str_length(r)) %>%
filter(lead(is.na(l))) %>%
slice_min(l)
# A tibble: 1 x 3
y r l
<int> <roman> <int>
1 3899 MMMDCCCXCIX 11
the longest year overall will be 3888, with 15 characters.
tibble(y = 1:3899) %>%
mutate(r = as.roman(y),
l = str_length(r)) %>%
slice_max(l)
# A tibble: 1 x 3
y r l
<int> <roman> <int>
1 3888 MMMDCCCLXXXVIII 15
Nice pattern:
tibble(y = 1:3899) %>%
mutate(r = as.roman(y),
l = str_length(r)) %>%
ggplot(aes(y, l)) +
# geom_col(width = 1) +
geom_point(alpha = .2) +
# geom_line(alpha = .5) +
geom_smooth() +
labs(title = "Characters length of roman numeral years",
x = "year",
y = "characters")
Characters length of roman numeral years
And there are only eleven palindromic years:
tibble(y = 1:3899) %>%
mutate(r = as.character(as.roman(y)),
rr = stringi::stri_reverse(r)) %>%
filter(r == rr,
str_length(r) > 1)
# A tibble: 11 x 3
y r rr
<int> <chr> <chr>
1 2 II II
2 3 III III
3 19 XIX XIX
4 20 XX XX
5 30 XXX XXX
6 190 CXC CXC
7 200 CC CC
8 300 CCC CCC
9 1900 MCM MCM
10 2000 MM MM
11 3000 MMM MMM
Trail running can be hard (Malliaropoulos, Mertyri and Tsaklis 2015; Hespanhol Junior, van Mechelen and Verhagen 2017; Istvan et al. 2019; Messier et al. 2018). During one of them, I was wondering if we could avoid the race completely and predict the race time from distance, elevation change, sex and performance index (so yes, we still have to run sometimes to measure it).
We need a dataset to play with. ITRA maintains a quite comprehensive database (International Trail Running Association 2020), but no current dataset is openly available. However, we can use the hidden JSON API to scrape a sample. I created an R package, {itra}, for this purpose; with it we can get some runners and their results:
NOTE (2021-09): the ITRA website has changed, so the package no longer works.
itra_runners("jornet")
#> # A tibble: 5 x 15
#> id_coureur id cote nom prenom quartz team M L XL XXL
#> <int> <int> <int> <chr> <chr> <int> <chr> <int> <lgl> <lgl> <int>
#> 1 2704 2.70e3 950 JORN… Kilian 20 "" 898 NA NA 896
#> 2 707322 7.07e5 538 RENU… Arturo 0 "" NA NA NA NA
#> 3 2667939 2.67e6 520 PARE… Samuel 0 "" NA NA NA NA
#> 4 2673933 2.67e6 468 PÉRE… Anna 0 "" NA NA NA NA
#> 5 2673935 2.67e6 468 RIUS… Sergi 0 "" NA NA NA NA
#> # … with 4 more variables: S <int>, XS <int>, XXS <lgl>, palmares <chr>
It works…
Get the data
So, we’re going to build a dataframe of single letters and random page indexes to create our dataset of runners. I’ll manually add some low indexes to oversample the elite athletes, otherwise we will probably miss them entirely.
# helpers -----------------------------------------------------------------
# compute the mode (to find the runner's sex from their details; see below)
stat_mode <- function(x, na.rm = FALSE) {
if(na.rm) {
x <- x[!is.na(x)]
}
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
# sample -------------------------------------------------------------
# progress bar versions of ITRA functions (scraping will be slow, we want to know if it works)
pb_itra_runners <- function(...) {
pb$tick()
itra_runners(...)
}
pb_itra_runner_results <- function(...) {
pb$tick()
itra_runner_results(...)
}
pb_itra_runner_details <- function(...) {
pb$tick()
itra_runner_details(...)
}
# dataset generation from a "random" set of index
#
oversample_high_score <- c(0, 50, 100, 250, 500)
my_sample <- tibble(l = letters,
n = c(oversample_high_score,
sample(0:100000,
length(letters) - length(oversample_high_score)))) %>%
expand(l, n) %>%
distinct(l, n)
We have 676 tuples of (letter, index). We can now start querying the website; we should get 5 runners (by default) for each tuple. But we will have some duplicates: for example at index 0 we will get "Kilian Jornet" for the letters "k", "i", "l"…
I use purrr::map2() to iterate along our sample, with slowly() to avoid overloading the website and possibly() to get NA instead of stopping on errors.
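That pattern can be illustrated with a self-contained sketch (fetch_page() is a hypothetical stand-in for the {itra} functions; its signature is an assumption):

```r
library(purrr)
library(dplyr)

# A stand-in for itra_runners(); the real signature is an assumption
fetch_page <- function(letter, index) {
  if (index < 0) stop("bad index")   # simulate a server error
  tibble(l = letter, n = index)
}

# rate-limited (slowly) and error-tolerant (possibly, returning NULL)
safe_slow_fetch <- slowly(possibly(fetch_page, otherwise = NULL),
                          rate = rate_delay(0.01))

sample_df <- tibble(l = c("a", "b", "c"), n = c(0, -1, 2))

runners <- map2_dfr(sample_df$l, sample_df$n, safe_slow_fetch)
# the failing call ("b") is silently dropped; runners keeps "a" and "c"
```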
You’ll see that the JSON variables have French or English names!
Some preprocessing is necessary. Sex is not a direct property of the runners; we must get it from runners_details. We then join our 3 dataframes and compute some new variables.
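The sex lookup can be sketched with toy data standing in for runners_details (the real column names are assumptions), using the stat_mode() helper, repeated here so the sketch is self-contained:

```r
library(dplyr)

# identical to the stat_mode() helper defined in this post
stat_mode <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# toy stand-in for runners_details; column names are assumptions
runners_details <- tibble(
  id  = c(1, 1, 1, 2, 2),
  sex = c("H", "H", NA, "F", "F"))

# one row per runner, keeping the most frequent sex value
runners_sex <- runners_details %>%
  group_by(id) %>%
  summarise(sex = stat_mode(sex, na.rm = TRUE))
```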
# analysis -----------------------------------------------------------------
# runners' ITRA performance index
runners %>%
left_join(runners_sex) %>%
ggplot(aes(cote, fill = sex)) +
geom_histogram(binwidth = 10)
Most athletes have an ITRA score between 300 and 700. We can also see the results of our oversampling, with many athletes added above 700. And we can notice that trail running is sadly a mostly male (H) sport.
# distance ran
results_clean %>%
ggplot(aes(dist_tot)) +
geom_histogram(binwidth = 2)
We can see the common distances at 25, 50, 100 and 160 km (100 miles).
# times ran
results_clean %>%
ggplot(aes(hours)) +
geom_histogram(binwidth = 1 / 6)
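The random forest whose resampling summary follows was fit with {caret}; a hedged sketch of the call (the exact predictor names are assumptions, chosen to match the 5 predictors reported below):

```r
library(caret)

# 10-fold cross-validated random forest via {ranger};
# predictor names are assumptions
set.seed(42)
model_rf <- train(
  hours ~ dist_tot + deniv_tot + deniv_neg_tot + sex + cote,
  data = results_clean,
  method = "ranger",
  trControl = trainControl(method = "cv", number = 10))

model_rf
```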
Random Forest
18339 samples
5 predictors
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16649, 16704, 16338, 16524, 16407, 16223, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 1.822388 0.9472348 0.8747978
2 extratrees 1.844228 0.9463288 0.8810525
3 variance 1.827556 0.9469227 0.8801065
3 extratrees 1.847274 0.9459055 0.8800262
5 variance 1.873645 0.9440697 0.9021319
5 extratrees 1.853316 0.9455854 0.8882530
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 5.
So with a root mean square error of 1.82 hours, I can’t say that my model is super precise. However, in my case, my last 16 km race (finished in 01:40:00) is pretty well predicted!
So next time I’ll just run the model instead of running the race.
I’ll leave the tuning to the specialists. While writing this post I found another potential, similar use (Fogliato, Oliveira, and Yurko 2020) for the {itra} package.
References
Fogliato, Riccardo, Natalia L. Oliveira, and Ronald Yurko. 2020. "TRAP: A Predictive Framework for Trail Running Assessment of Performance". arXiv:2002.01328 [stat], 12 July 2020. http://arxiv.org/abs/2002.01328.
Hespanhol Junior, Luiz Carlos, Willem van Mechelen, and Evert Verhagen. 2017. "Health and Economic Burden of Running-Related Injuries in Dutch Trailrunners: A Prospective Cohort Study". Sports Medicine 47 (2): 367‑77. https://doi.org/10.1007/s40279-016-0551-8.
International Trail Running Association. 2020. "ITRA". https://itra.run/.
Istvan, Albertus Oosthuizen, Paul Yvonne, Jeremy Ellapen Terry, Barnard Marco, Bongani Qumbu Timothy, Valery Hammill Henriëtte, and Lukas Strydom Gert. 2019. "Common ultramarathon trail running injuries and illnesses: A review (2007-2016)". International Journal of Medicine and Medical Sciences 11 (4): 36‑42. https://doi.org/10.5897/IJMMS2018.1386.
Malliaropoulos, Nikolaos, Dimitra Mertyri, and Panagiotis Tsaklis. 2015. "Prevalence of Injury in Ultra Trail Running". Human Movement 16 (2). https://doi.org/10.1515/humo-2015-0026.
Messier, Stephen P., David F. Martin, Shannon L. Mihalko, Edward Ip, Paul DeVita, D. Wayne Cannon, Monica Love, et al. 2018. "A 2-Year Prospective Cohort Study of Overuse Running Injuries: The Runners and Injury Longitudinal Study (TRAILS)". The American Journal of Sports Medicine 46 (9): 2211‑21. https://doi.org/10.1177/0363546518773755.
The {geofacet} package allows one to "arrange a sequence of plots of data for different geographical entities into a grid that strives to preserve some of the original geographical orientation of the entities".
Like in the previous post, it’s interesting if you view each entity as a unit, don’t care about its real size or weight, and don’t want to spend too much time manually finding the best grid.
We will again use the same COVID-19 dataset. We manually add the overseas départements once we have found the right grid (by trying different seeds) and adjust Corsica’s position.
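A hedged sketch of that workflow (grid_auto() is the {geofacet} helper; the 'dep' sf layer, the 'covid' dataset and their column names are assumptions):

```r
library(geofacet)
library(ggplot2)

# Auto-generate a grid from a départements sf layer, trying a seed;
# the resulting data frame can then be edited by hand to add the
# overseas départements and move Corsica
grid_fr <- grid_auto(dep, names = "nom", codes = "insee_dep", seed = 42)

# Small multiples arranged roughly like the map
ggplot(covid, aes(date, deaths_per_100k)) +
  geom_line() +
  facet_geo(~ insee_dep, grid = grid_fr)
```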
COVID-19 deceased in hospital, by département, for 100 000 inhab.