<- "https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet"
dataset
library(duckdb)
library(tidyverse)
library(glue)
<- dbConnect(duckdb())
cnx
# To do once:
# dbExecute(cnx, "INSTALL httpfs")
dbExecute(cnx, "LOAD httpfs")
dbSendQuery(cnx, glue("
CREATE VIEW bureaux AS
SELECT *
FROM '{dataset}'"))
Querying a remote parquet file via HTTP with DuckDB.
The french statistical service (INSEE) has made available its first parquet file on data.gouv.fr in June.
It’s a 470 MB file (from a 1.8 GB CSV) with 16·106 rows, showing for each address in France which polling station it belongs to.
It would take at least 23 s to only download the file (at 20 MB·s-1) and some more time to open it and aggregate the data. But with DuckDB coupled with parquet we don’t need to download the file; it’s just magical!
Config
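Why is no full download needed? DuckDB talks to the server with HTTP range requests: it first fetches the parquet footer, then only the row groups and columns that a given query touches. As a quick sanity check, here is a minimal sketch using DuckDB's built-in parquet_metadata() table function, which reads only the footer, not the 470 MB of data:

# count the row groups by reading only the parquet footer over HTTP
dbGetQuery(cnx, glue("
  SELECT COUNT(DISTINCT row_group_id) AS row_groups
  FROM parquet_metadata('{dataset}')"))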
With {DBI} and SQL
It takes less than 2 s to query the top ten communes!
# available columns
dbGetQuery(cnx, "
DESCRIBE bureaux")
column_name column_type null key default extra
1 code_commune_ref VARCHAR YES <NA> <NA> NA
2 reconstitution_code_commune VARCHAR YES <NA> <NA> NA
3 id_brut_bv_reu VARCHAR YES <NA> <NA> NA
4 id VARCHAR YES <NA> <NA> NA
5 geo_adresse VARCHAR YES <NA> <NA> NA
6 geo_type VARCHAR YES <NA> <NA> NA
7 geo_score DOUBLE YES <NA> <NA> NA
8 longitude DOUBLE YES <NA> <NA> NA
9 latitude DOUBLE YES <NA> <NA> NA
10 api_line VARCHAR YES <NA> <NA> NA
11 nb_bv_commune UINTEGER YES <NA> <NA> NA
12 nb_adresses UINTEGER YES <NA> <NA> NA
# number of rows
dbGetQuery(cnx,"
SELECT COUNT(*)
FROM bureaux")
count_star()
1 15970992
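This count comes back almost instantly: DuckDB can answer a plain COUNT(*) from the row-group statistics stored in the parquet footer, without scanning the data pages. If you're curious, EXPLAIN shows the plan (exact operators vary across DuckDB versions):

# inspect the plan; the count should be satisfied by the parquet reader
dbGetQuery(cnx, "EXPLAIN SELECT COUNT(*) FROM bureaux")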
# top communes by address number
dbGetQuery(cnx,"
SELECT
code_commune_ref,
sum(nb_adresses) AS total_nb_adresses
FROM bureaux
GROUP BY code_commune_ref
ORDER BY total_nb_adresses DESC
LIMIT 10")
code_commune_ref total_nb_adresses
1 31555 183488
2 34172 124866
3 06088 123217
4 33063 106018
5 59350 83468
6 97411 70738
7 13001 64095
8 44109 60255
9 72181 56159
10 83137 54995
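To verify the "less than 2 s" claim on your own machine (timings depend on your bandwidth and on what DuckDB has already cached), wrap the query in system.time():

system.time(
  dbGetQuery(cnx, "
    SELECT
      code_commune_ref,
      sum(nb_adresses) AS total_nb_adresses
    FROM bureaux
    GROUP BY code_commune_ref
    ORDER BY total_nb_adresses DESC
    LIMIT 10")
)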
With {dbplyr}
The same with {dbplyr}!

bureaux <- tbl(cnx, "bureaux")
# available columns
colnames(bureaux)
[1] "code_commune_ref" "reconstitution_code_commune"
[3] "id_brut_bv_reu" "id"
[5] "geo_adresse" "geo_type"
[7] "geo_score" "longitude"
[9] "latitude" "api_line"
[11] "nb_bv_commune" "nb_adresses"
# number of rows
bureaux %>%
  summarize(rows = n())
# Source: SQL [1 x 1]
# Database: DuckDB 0.8.1 [unknown@Linux 5.15.0-87-generic:R 4.3.1/:memory:]
rows
<dbl>
1 15970992
# top communes by address number
bureaux %>%
  group_by(code_commune_ref) %>%
  summarise(tot_nb_adresses = sum(nb_adresses)) %>%
  arrange(desc(tot_nb_adresses))
# Source: SQL [?? x 2]
# Database: DuckDB 0.8.1 [unknown@Linux 5.15.0-87-generic:R 4.3.1/:memory:]
# Ordered by: desc(tot_nb_adresses)
code_commune_ref tot_nb_adresses
<chr> <dbl>
1 31555 183488
2 34172 124866
3 06088 123217
4 33063 106018
5 59350 83468
6 97411 70738
7 13001 64095
8 44109 60255
9 72181 56159
10 83137 54995
# ℹ more rows
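Under the hood {dbplyr} never pulls the data into R: the pipeline is translated to SQL and executed by DuckDB, and rows are fetched only when you print or collect() the result. show_query() reveals the generated SQL:

bureaux %>%
  group_by(code_commune_ref) %>%
  summarise(tot_nb_adresses = sum(nb_adresses)) %>%
  arrange(desc(tot_nb_adresses)) %>%
  show_query()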
Cleaning
dbDisconnect(cnx, shutdown = TRUE)
It’s so fast, I’m not sure I can believe it…