Cross check OSM IDs between OSM and Wikidata

Improving data quality
Categories: R, OSM, Wikidata

Author: Michaël

Published: 2025-05-03

Modified: 2025-05-04

Mont Blanc – CC-BY-NC-ND by Viktor Kirilko

Some OpenStreetMap (OSM) elements carry a Wikidata entity as a tag. For example, the summit of Mont Blanc has the key/value pair wikidata=Q583, which points to the Wikidata entity Mont Blanc. That entity in turn has the property OpenStreetMap node ID (P11693), which points back to node 281399025; and everything is fine…

However, sometimes Wikidata entities are missing the OSM ID(s). Here is my workflow to find these entities in a defined area to check and complete them.

Config

library(osmdata)   # get data from OSM
library(WikidataR) # get data from Wikidata
library(sf)
library(dplyr)
library(purrr)
library(tidyr)
library(glue)
library(janitor)

Data

First we choose an area of interest, here around Termignon, and get the OSM elements that have a wikidata tag.

osm_wd <- getbb("Termignon, France") |> 
  opq() |> 
  add_osm_feature(key = "wikidata") |> 
  osmdata_sf()

Utilities

Then we write three functions. The first two query the Wikidata API to get the OSM identifiers associated with a Wikidata entity; the third prefixes the OSM IDs so that both sides can be compared. Three Wikidata properties are needed: P402 for relations, P10689 for ways and P11693 for nodes. An entity can have zero, one, two or three of these properties, and each property can hold one or several IDs.

#' Get the OpenStreetMap IDs of a wikidata item
#'
#' @param item (char) wikidata ID (e.g. "Q19368619")
#'
#' @returns (vec<char>) 
#'   OSM ID (possibly several), prefixed with 
#'    "n" for "node",
#'    "w" for "way" and
#'    "r" for relation
#'  NULL if no OSM ID or
#'  NA if not available,
#' @examples get_wd_osm_id("Q19368619")
get_wd_osm_id <- purrr::possibly(function(item) {
  
  # avoid spending time querying if no wikidata item
  if (is.na(item) || item == "" || !stringr::str_detect(item, "Q[0-9]+")) { 
    return(NA_character_)
  } else {
    i <- WikidataR::get_item(item)
    
    # P402 relation
    relation <- purrr::pluck(i, 
                             1, "claims", "P402", "mainsnak", "datavalue", "value")
    relation <- if (!is.null(relation)) { paste0("r", relation) } else { NULL }
    
    # P10689 way
    way <- purrr::pluck(i, 
                        1, "claims", "P10689", "mainsnak", "datavalue", "value")
    way <- if (!is.null(way)) { paste0("w", way) } else { NULL }
    
    # P11693 node
    node <- purrr::pluck(i, 
                         1, "claims", "P11693", "mainsnak", "datavalue", "value")
    node <- if (!is.null(node)) { paste0("n", node) } else { NULL }
    
    return(purrr::compact(c(relation, way, node)))
  }
}, otherwise = NA_character_)

#' Add a column with the OpenStreetMap IDs from wikidata
#' 
#' For all features of an osmdata sf object having a wikidata ID, get the 
#' associated OSM IDs recorded in wikidata
#'
#' @param osmdata_features (sf) osmdata sub-object (osm_points, osm_lines,...)
#'   with a `wikidata` column
#'
#' @returns (sf) input object with a new column `wd_osm_id` as a list-column of
#'   character IDs (or NULL)
#' @examples add_osm_id_from_wikidata(my_osmdata_sf$osm_points)
add_osm_id_from_wikidata <- function(osmdata_features) {
  geom_type <- sf::st_geometry_type(osmdata_features, by_geometry = FALSE)
  osmdata_features |>
    tibble::as_tibble(.name_repair = janitor::make_clean_names) |> 
    dplyr::mutate(
      wd_osm_id = purrr::map(
        wikidata,
        slowly(get_wd_osm_id), 
        .progress = glue::glue("Getting OSM ID for {geom_type}...")))
}

#' Add a prefix to the OSM ID according to the element geometry type
#' 
#' It will allow us to compare `osm_id` and `wd_osm_id`
#'
#' @param x (sf) OSM data sub-object
#'
#' @returns x with the `osm_id` field prefixed with "n" for "node",
#'    "w" for "way" and "r" for relation
#' @examples prefix_osm_id(osm_wd$osm_points)
prefix_osm_id <- function(x) {
  geom_type <- sf::st_geometry_type(x, by_geometry = FALSE)
  
  p <- case_when(geom_type == "POINT" ~ "n",
                 geom_type %in% c("LINESTRING", "POLYGON") ~ "w",
                 geom_type %in% c("MULTILINESTRING", "MULTIPOLYGON") ~ "r",
                 .default = "")
  
  x |> 
    mutate(osm_id = glue("{p}{osm_id}"))
}
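The extraction step inside get_wd_osm_id() can be tried offline. The following is only a sketch: extract_osm_ids() is a hypothetical helper, not part of the workflow above, and mock_item is a simplified stand-in (plain nested lists with made-up IDs) for what WikidataR::get_item() returns. It loops over the three properties instead of repeating the pluck() call.

```r
library(purrr)

# Wikidata properties used in this post, named by their OSM prefix
osm_props <- c(r = "P402", w = "P10689", n = "P11693")

# Hypothetical helper: prefixed OSM IDs from an already fetched item
extract_osm_ids <- function(item_data, props = osm_props) {
  ids <- imap(props, \(prop, prefix) {
    value <- pluck(item_data,
                   1, "claims", prop, "mainsnak", "datavalue", "value")
    if (is.null(value)) NULL else paste0(prefix, value)
  })
  # NULL entries (missing properties) are dropped by unlist()
  unlist(ids, use.names = FALSE)
}

# Simplified mock item with made-up IDs: two relations, no way, one node
mock_item <- list(list(claims = list(
  P402   = list(mainsnak = list(datavalue = list(value = c("101", "102")))),
  P11693 = list(mainsnak = list(datavalue = list(value = "201")))
)))

extract_osm_ids(mock_item)
```

With this mock, the result holds the two relation IDs prefixed with "r" followed by the node ID prefixed with "n"; an item without any of the three properties gives NULL.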

With these functions we can take the OSM elements that have a Wikidata tag and, for each of them, get the OSM IDs recorded in the Wikidata entity. This lets us check whether the IDs match or, where they are missing, add them manually in Wikidata. Since the OSM data is split into different objects according to the geometry type, we need to do this for each of them.

osm_wd_augmented <- list(
  prefix_osm_id(osm_wd$osm_points),
  prefix_osm_id(osm_wd$osm_lines),
  prefix_osm_id(osm_wd$osm_polygons),
  prefix_osm_id(osm_wd$osm_multilines),
  prefix_osm_id(osm_wd$osm_multipolygons)) |>
  map(\(x) {
    x |> 
      filter(!is.na(wikidata) & wikidata != "") |> 
      add_osm_id_from_wikidata() |> 
      select(osm_id, name, wikidata, wd_osm_id) |> 
      unnest(wd_osm_id, keep_empty = TRUE)}) |> 
  list_rbind() |> 
  distinct()
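The pipeline above assumes that all five sub-objects exist. As far as I know, osmdata_sf() leaves a sub-object as NULL when the query returns nothing for that geometry type, in which case prefix_osm_id() would probably fail. A minimal sketch of a guard, using a mock list (mock_osm, made-up values) in place of a real osmdata result:

```r
library(purrr)
library(tibble)

# Stand-in for an osmdata result where one geometry type came back empty
mock_osm <- list(
  osm_points     = tibble(osm_id = "1", wikidata = "Q1"),
  osm_lines      = tibble(osm_id = "2", wikidata = NA_character_),
  osm_multilines = NULL
)

layers <- c("osm_points", "osm_lines", "osm_multilines")

# compact() drops the NULL entries, so later steps only see real tables
available <- compact(mock_osm[layers])

map_int(available, nrow)
```

On real data the same idea would be applied to osm_wd before calling prefix_osm_id() on each remaining sub-object.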

We get a new variable wd_osm_id, which means “one of the OSM identifiers recorded in the Wikidata entity that the OSM element points to”.
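To see how the list-column turns into rows, here is a small sketch on a toy table (made-up IDs, not real data). It shows what unnest(keep_empty = TRUE) does with several IDs, with no ID (NULL) and with a failed lookup (NA).

```r
library(tibble)
library(tidyr)

toy <- tibble(
  osm_id    = c("n1", "w2", "r3"),
  wd_osm_id = list(
    c("r10", "n1"),  # entity with two OSM IDs
    NULL,            # entity without any OSM ID
    NA_character_    # lookup failed or no valid Wikidata ID
  )
)

# One row per (OSM element, Wikidata OSM ID) pair;
# keep_empty = TRUE keeps the NULL case as a row with NA
flat <- unnest(toy, wd_osm_id, keep_empty = TRUE)
flat
```

The element n1 appears twice, once per ID, while w2 and r3 each keep a single row with a missing wd_osm_id. After unnesting, "no ID in Wikidata" and "lookup not possible" can therefore no longer be told apart.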

Use cases

For example if we want to see the OSM elements having a Wikidata entity not linking back:

osm_wd_augmented |> 
  filter(is.na(wd_osm_id)) |> 
  arrange(name)
# A tibble: 29 × 4
   osm_id      name                                 wikidata   wd_osm_id
   <glue>      <chr>                                <chr>      <chr>    
 1 w194448267  Cenischia                            Q3539637   <NA>     
 2 n41645050   Champagny-en-Vanoise                 Q34791526  <NA>     
 3 w108980863  Chapelle Notre-Dame de la Visitation Q13518652  <NA>     
 4 w131213010  Chapelle Saint-Antoine               Q22975798  <NA>     
 5 w131213054  Chapelle Saint-Sébastien             Q22968509  <NA>     
 6 n4573291011 Cinéma Chantelouve                   Q61858809  <NA>     
 7 r2149907    Communauté de Communes Terra Modana  Q17355571  <NA>     
 8 w37792978   Dora di Bardonecchia                 Q3714186   <NA>     
 9 w131215535  Espace Baroque                       Q22968504  <NA>     
10 r377905     GR 55 La Vanoise                     Q124149580 <NA>     
# ℹ 19 more rows

Some may be legitimate, but others may need editing…

Another example: check for inconsistencies between osm_id and wd_osm_id:

osm_wd_augmented |> 
  filter(osm_id != wd_osm_id) |> 
  arrange(name)
# A tibble: 25 × 4
   osm_id       name                                          wikidata wd_osm_id
   <glue>       <chr>                                         <chr>    <chr>    
 1 n26691864    Albertville                                   Q159469  r111528  
 2 n26691864    Albertville                                   Q159469  r17160712
 3 n41644953    Aussois                                       Q567783  r89823   
 4 r2149905     Communauté de Communes de Haute Maurienne-Va… Q2987514 r6876759 
 5 n11646993705 Dent Parrachée                                Q1189850 n6705389…
 6 w1257820968  Ferrovia del Moncenisio                       Q950823  r15987785
 7 w1257820969  Ferrovia del Moncenisio                       Q950823  r15987785
 8 w1257820970  Ferrovia del Moncenisio                       Q950823  r15987785
 9 w1257820971  Ferrovia del Moncenisio                       Q950823  r15987785
10 w993297614   Glacier de Méan-Martin                        Q348352… w42239115
# ℹ 15 more rows

Such mismatches could indicate that the OSM elements have been heavily edited, or deleted and recreated, without the corresponding Wikidata entity being updated. They also appear when the Wikidata entity has several OSM IDs (think of a train station: the building and the train stop), or when several OSM elements carry the same Wikidata tag (as with the ways of the Ferrovia del Moncenisio above).

Corrections require manual back and forth between R and the OSM and Wikidata websites, but these utilities make it quite easy to improve data quality.
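Because the filter osm_id != wd_osm_id works row by row, it also lists elements that are correctly linked through another of the entity's IDs. A stricter check, sketched here on a toy table with made-up values, keeps only the elements for which none of the Wikidata IDs points back:

```r
library(dplyr)
library(tibble)

toy <- tibble(
  osm_id    = c("n1", "n1", "w2", "w3"),
  name      = c("Station", "Station", "River", "Chapel"),
  wikidata  = c("Q1", "Q1", "Q2", "Q3"),
  wd_osm_id = c("n1", "w9", "r5", NA)
)

mismatches <- toy |>
  filter(!is.na(wd_osm_id)) |>          # missing IDs are a separate case
  group_by(osm_id) |>
  filter(!any(osm_id == wd_osm_id)) |>  # no ID of the entity links back
  ungroup()

mismatches
```

Here only the River row remains: the Station is linked back by one of its two IDs, and the Chapel has no ID at all, so it belongs to the first use case. The same steps should apply to osm_wd_augmented, although I have only sketched them on toy data.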