Fast remote parquet

Magic!
R
Parquet
DuckDB
Author

Michaël

Published

2023-09-14

Modified

2023-11-02

A photo of a parquet

Parquet – CC BY-SA by tetue

Querying a remote parquet file via HTTP with DuckDB.

The french statistical service (INSEE) has made available its first parquet file on data.gouv.fr in June.

It’s a 470 MB file (from a 1.8 GB CSV) with 16·106 rows, showing for each address in France which polling station it belongs to.

It would take at least 23 s to only download the file (at 20 MB·s-1) and some more time to open it and aggregate the data. But with DuckDB coupled with parquet we don’t need to download the file; it’s just magical!

Config

dataset <- "https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet"

library(duckdb)
library(tidyverse)
library(glue)

cnx <- dbConnect(duckdb())

# To do once:
# dbExecute(cnx, "INSTALL httpfs")
dbExecute(cnx, "LOAD httpfs")

dbSendQuery(cnx, glue("
  CREATE VIEW bureaux AS
    SELECT * 
    FROM '{dataset}'"))

With {DBI} and SQL

It takes less than 2 s to query the top ten communes!

# available columns
dbGetQuery(cnx, "
  DESCRIBE bureaux")
                   column_name column_type null  key default extra
1             code_commune_ref     VARCHAR  YES <NA>    <NA>    NA
2  reconstitution_code_commune     VARCHAR  YES <NA>    <NA>    NA
3               id_brut_bv_reu     VARCHAR  YES <NA>    <NA>    NA
4                           id     VARCHAR  YES <NA>    <NA>    NA
5                  geo_adresse     VARCHAR  YES <NA>    <NA>    NA
6                     geo_type     VARCHAR  YES <NA>    <NA>    NA
7                    geo_score      DOUBLE  YES <NA>    <NA>    NA
8                    longitude      DOUBLE  YES <NA>    <NA>    NA
9                     latitude      DOUBLE  YES <NA>    <NA>    NA
10                    api_line     VARCHAR  YES <NA>    <NA>    NA
11               nb_bv_commune    UINTEGER  YES <NA>    <NA>    NA
12                 nb_adresses    UINTEGER  YES <NA>    <NA>    NA
# number of rows
dbGetQuery(cnx,"
  SELECT COUNT(*)
  FROM bureaux")
  count_star()
1     15970992
# top communes by address number
dbGetQuery(cnx,"
  SELECT 
    code_commune_ref, 
    sum(nb_adresses) AS total_nb_adresses
  FROM bureaux
  GROUP BY code_commune_ref 
  ORDER BY total_nb_adresses DESC
  LIMIT 10")
   code_commune_ref total_nb_adresses
1             31555            183488
2             34172            124866
3             06088            123217
4             33063            106018
5             59350             83468
6             97411             70738
7             13001             64095
8             44109             60255
9             72181             56159
10            83137             54995

With {dbplyr}

Idem with {dbplyr}!

bureaux <- tbl(cnx, "bureaux") 

# available columns
colnames(bureaux)
 [1] "code_commune_ref"            "reconstitution_code_commune"
 [3] "id_brut_bv_reu"              "id"                         
 [5] "geo_adresse"                 "geo_type"                   
 [7] "geo_score"                   "longitude"                  
 [9] "latitude"                    "api_line"                   
[11] "nb_bv_commune"               "nb_adresses"                
# number of rows
bureaux %>%
  summarize(rows = n())
# Source:   SQL [1 x 1]
# Database: DuckDB 0.8.1 [unknown@Linux 5.15.0-87-generic:R 4.3.1/:memory:]
      rows
     <dbl>
1 15970992
# top communes by address number
bureaux %>% 
  group_by(code_commune_ref) %>% 
  summarise(tot_nb_adresses = sum(nb_adresses)) %>% 
  arrange(desc(tot_nb_adresses))
# Source:     SQL [?? x 2]
# Database:   DuckDB 0.8.1 [unknown@Linux 5.15.0-87-generic:R 4.3.1/:memory:]
# Ordered by: desc(tot_nb_adresses)
   code_commune_ref tot_nb_adresses
   <chr>                      <dbl>
 1 31555                     183488
 2 34172                     124866
 3 06088                     123217
 4 33063                     106018
 5 59350                      83468
 6 97411                      70738
 7 13001                      64095
 8 44109                      60255
 9 72181                      56159
10 83137                      54995
# ℹ more rows

Cleaning

dbDisconnect(cnx, shutdown = TRUE)

It’s so fast, I’m not sure I can believe it…