library(osmdata) # get data from OSM
library(WikidataR) # get data from Wikidata
library(sf)
library(dplyr)
library(purrr)
library(tidyr)
library(glue)
library(janitor)
Some OpenStreetMap elements carry a Wikidata entity as a tag. For example, the summit of Mont Blanc has the key/value pair wikidata=Q583, pointing to the Mont Blanc entity, which in turn has the correct identifier OpenStreetMap node ID (P11693) pointing back to node 281399025. In that case everything is fine.
However, sometimes Wikidata entities are missing the OSM ID(s). Here is my workflow to find these entities in a defined area to check and complete them.
Config
Data
First we choose an area of interest, here around Termignon, and get the OSM data, keeping only the elements that have a wikidata tag.
osm_wd <- getbb("Termignon, France") |>
  opq() |>
  add_osm_feature(key = "wikidata") |>
  osmdata_sf()
Utilities
Then we write two functions to query the Wikidata API for the OSM identifiers associated with a Wikidata entity, plus a small helper that makes the OSM IDs comparable. Three properties are involved (one each for nodes, ways and relations). An entity can have zero, one, two or three of these properties, and each property can hold one or several IDs.
#' Get the OpenStreetMap IDs of a wikidata item
#'
#' @param item (char) wikidata ID (e.g. "Q19368619")
#'
#' @returns (vec<char>)
#' OSM ID (possibly several), prefixed with
#' "n" for "node",
#' "w" for "way" and
#' "r" for relation
#' NULL if no OSM ID or
#' NA if not available,
#' @examples get_wd_osm_id("Q19368619")
get_wd_osm_id <- purrr::possibly(function(item) {
  # avoid spending time querying if no wikidata item
  if (is.na(item) | item == "" | !stringr::str_detect(item, "Q[0-9]+")) {
    return(NA_character_)
  } else {
    i <- WikidataR::get_item(item)

    # P402 relation
    relation <- purrr::pluck(i,
      1, "claims", "P402", "mainsnak", "datavalue", "value")
    relation <- if (!is.null(relation)) { paste0("r", relation) } else { NULL }

    # P10689 way
    way <- purrr::pluck(i,
      1, "claims", "P10689", "mainsnak", "datavalue", "value")
    way <- if (!is.null(way)) { paste0("w", way) } else { NULL }

    # P11693 node
    node <- purrr::pluck(i,
      1, "claims", "P11693", "mainsnak", "datavalue", "value")
    node <- if (!is.null(node)) { paste0("n", node) } else { NULL }

    return(purrr::compact(c(relation, way, node)))
  }
}, otherwise = NA_character_)

#' Add a column with the OpenStreetMap IDs from wikidata
#'
#' For all features of an osmdata sf object having a wikidata ID, get the
#' associated OSM IDs recorded in wikidata
#'
#' @param osmdata_features (sf) osmdata sub-object (osm_points, osm_lines,...)
#' with a `wikidata` column
#'
#' @returns (sf) input object with a new column `wd_osm_id` as a list-column of
#' character IDs (or NULL)
#' @examples add_osm_id_from_wikidata(my_osmdata_sf$osm_points)
add_osm_id_from_wikidata <- function(osmdata_features) {
  geom_type <- sf::st_geometry_type(osmdata_features, by_geometry = FALSE)

  osmdata_features |>
    tibble::as_tibble(.name_repair = janitor::make_clean_names) |>
    dplyr::mutate(
      wd_osm_id = purrr::map(
        wikidata,
        slowly(get_wd_osm_id),
        .progress = glue::glue("Getting OSM ID for {geom_type}...")))
}

#' Add a prefix to the OSM ID according to the element geometry type
#'
#' It will allow us to compare `osm_id` and `wd_osm_id`
#'
#' @param x (sf) OSM data sub-object
#'
#' @returns x with the `osm_id` field prefixed with "n" for "node",
#' "w" for "way" and "r" for relation
#' @examples prefix_osm_id(osm_wd$osm_points)
prefix_osm_id <- function(x) {
  geom_type <- sf::st_geometry_type(x, by_geometry = FALSE)

  p <- case_when(geom_type == "POINT" ~ "n",
                 geom_type %in% c("LINESTRING", "POLYGON") ~ "w",
                 geom_type %in% c("MULTILINESTRING", "MULTIPOLYGON") ~ "r",
                 .default = "")
  x |>
    mutate(osm_id = glue("{p}{osm_id}"))
}
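To see what the extraction step in get_wd_osm_id() does without calling the API, here is a minimal sketch on a mock object. The nested list `mock_item` and the helper `extract_prefixed()` are hypothetical: the list only mimics the path plucked above and is not the full structure returned by WikidataR::get_item(), and the IDs are made up.

```r
library(purrr)

# Hypothetical stand-in for the result of WikidataR::get_item(): a plain
# nested list that only mimics the path plucked in get_wd_osm_id().
mock_item <- list(
  list(claims = list(
    # one way ID
    P10689 = list(mainsnak = list(datavalue = list(value = "111"))),
    # two node IDs for the same entity
    P11693 = list(mainsnak = list(datavalue = list(value = c("222", "333"))))
    # no P402 claim: this mock entity has no relation ID
  ))
)

# Hypothetical helper: pluck the values of one property and prefix them;
# pluck() returns NULL when the property is absent
extract_prefixed <- function(item, property, prefix) {
  value <- pluck(item, 1, "claims", property, "mainsnak", "datavalue", "value")
  if (is.null(value)) NULL else paste0(prefix, value)
}

# c() silently drops the NULL coming from the missing relation property
ids <- c(
  extract_prefixed(mock_item, "P402", "r"),
  extract_prefixed(mock_item, "P10689", "w"),
  extract_prefixed(mock_item, "P11693", "n")
)
ids
```

The result is a single character vector holding one way ID and two node IDs, which is why the column built from it has to be a list-column: each entity can contribute a different number of IDs.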
With these functions we can take our OSM data, keep the elements that have a Wikidata tag, and fetch the OSM IDs recorded in each linked entity. This lets us check whether the two sides agree and, where an ID is missing, add it manually in Wikidata. Since osmdata splits the result into separate objects according to geometry type, we need to repeat the process for each of them.
osm_wd_augmented <- list(
  prefix_osm_id(osm_wd$osm_points),
  prefix_osm_id(osm_wd$osm_lines),
  prefix_osm_id(osm_wd$osm_polygons),
  prefix_osm_id(osm_wd$osm_multilines),
  prefix_osm_id(osm_wd$osm_multipolygons)) |>
  map(\(x) {
    x |>
      filter(!is.na(wikidata) & wikidata != "") |>
      add_osm_id_from_wikidata() |>
      select(osm_id, name, wikidata, wd_osm_id) |>
      unnest(wd_osm_id, keep_empty = TRUE)}) |>
  list_rbind() |>
  distinct()
We get a new variable wd_osm_id, which means: “this is one of the OSM identifiers recorded in the Wikidata entity that the OSM element points to”.
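The role of unnest(..., keep_empty = TRUE) in the pipeline may be easier to see on a toy table. This is only a sketch: the tibble `toy` and all its IDs are made up for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical rows; the IDs are made up for illustration
toy <- tibble(
  osm_id    = c("n1", "w2", "r3"),
  wikidata  = c("Q1", "Q2", "Q3"),
  # list-column: one ID, no ID at all, two IDs
  wd_osm_id = list("n1", NULL, c("r3", "n9"))
)

# One row per (element, Wikidata-side ID) pair; keep_empty = TRUE keeps
# the element without any ID as a row with NA instead of dropping it
long <- unnest(toy, wd_osm_id, keep_empty = TRUE)
long
```

The element with two IDs is duplicated on two rows, and the element with none survives with an NA. Note that in the real pipeline an NA can also come from a failed query, since get_wd_osm_id() falls back to NA_character_ on error.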
Use cases
For example if we want to see the OSM elements having a Wikidata entity not linking back:
osm_wd_augmented |>
  filter(is.na(wd_osm_id)) |>
  arrange(name)
# A tibble: 29 × 4
   osm_id      name                                 wikidata   wd_osm_id
   <glue>      <chr>                                <chr>      <chr>
 1 w194448267  Cenischia                            Q3539637   <NA>
 2 n41645050   Champagny-en-Vanoise                 Q34791526  <NA>
 3 w108980863  Chapelle Notre-Dame de la Visitation Q13518652  <NA>
 4 w131213010  Chapelle Saint-Antoine               Q22975798  <NA>
 5 w131213054  Chapelle Saint-Sébastien             Q22968509  <NA>
 6 n4573291011 Cinéma Chantelouve                   Q61858809  <NA>
 7 r2149907    Communauté de Communes Terra Modana  Q17355571  <NA>
 8 w37792978   Dora di Bardonecchia                 Q3714186   <NA>
 9 w131215535  Espace Baroque                       Q22968504  <NA>
10 r377905     GR 55 La Vanoise                     Q124149580 <NA>
# ℹ 19 more rows
Some of these may be legitimate, but others probably need editing…
Another example: checking for inconsistencies between osm_id and wd_osm_id:
osm_wd_augmented |>
  filter(osm_id != wd_osm_id) |>
  arrange(name)
# A tibble: 25 × 4
   osm_id       name                                          wikidata wd_osm_id
   <glue>       <chr>                                         <chr>    <chr>
 1 n26691864    Albertville                                   Q159469  r111528
 2 n26691864    Albertville                                   Q159469  r17160712
 3 n41644953    Aussois                                       Q567783  r89823
 4 r2149905     Communauté de Communes de Haute Maurienne-Va… Q2987514 r6876759
 5 n11646993705 Dent Parrachée                                Q1189850 n6705389…
 6 w1257820968  Ferrovia del Moncenisio                       Q950823  r15987785
 7 w1257820969  Ferrovia del Moncenisio                       Q950823  r15987785
 8 w1257820970  Ferrovia del Moncenisio                       Q950823  r15987785
 9 w1257820971  Ferrovia del Moncenisio                       Q950823  r15987785
10 w993297614   Glacier de Méan-Martin                        Q348352… w42239115
# ℹ 15 more rows
A mismatch could indicate that the OSM elements have been heavily edited, or deleted and recreated, without updating the corresponding Wikidata entity. It also shows up when the Wikidata entity has several OSM IDs (think of a train station: the building and the train stop).
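Because an entity with several OSM IDs produces a mismatch row for every ID that is not the element's own, a stricter check can be useful: flag an element only when none of the IDs recorded in Wikidata is its own. This is one possible sketch, not part of the workflow above; the tibble `toy` is hypothetical, shaped like osm_wd_augmented, with made-up IDs.

```r
library(dplyr)

# Hypothetical rows shaped like osm_wd_augmented; IDs are made up
toy <- tibble::tibble(
  osm_id    = c("n10", "n10", "w20", "w30"),
  name      = c("Station", "Station", "Chapel", "River"),
  wikidata  = c("Q1", "Q1", "Q2", "Q3"),
  wd_osm_id = c("w11", "n10", "w21", NA)
)

mismatched <- toy |>
  # elements whose entity has no OSM ID at all are a separate use case
  filter(!is.na(wd_osm_id)) |>
  group_by(osm_id, wikidata) |>
  # keep an element only if none of the IDs listed in Wikidata is its own
  filter(!any(osm_id == wd_osm_id)) |>
  ungroup()
mismatched
```

Here the station is dropped from the report because one of its two Wikidata-side IDs points back to it, while the chapel remains as a genuine mismatch.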
Corrections require manual back and forth between R and the OSM and Wikidata websites, but these utilities make it quite easy to improve data quality.