SQLBits 2024

Data Linkage Options: MDM vs Splink (PySpark)

Comparison of various data linkage solutions.
Linking multiple disconnected datasets is a challenging problem. It arises when there is no common ID field on which to join the datasets. Instead, less precise methods must be used, such as fuzzy matching across multiple fields.
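As a rough illustration of fuzzy matching across multiple fields, the sketch below scores two records by averaging a per-field string similarity. It uses only Python's standard-library `difflib`. The record layout, the field names and the simple unweighted average are assumptions made for this example, not a method described in the session.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(rec_a: dict, rec_b: dict, fields: list) -> float:
    """Unweighted mean of the per-field similarities (an illustrative choice)."""
    scores = [field_similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores)


# Hypothetical records from two datasets that share no ID field.
customer = {"first_name": "Jonathan", "surname": "Smith", "city": "London"}
patient = {"first_name": "Jonathon", "surname": "Smith", "city": "london"}
stranger = {"first_name": "Maria", "surname": "Garcia", "city": "Leeds"}

FIELDS = ["first_name", "surname", "city"]

score_match = record_similarity(customer, patient, FIELDS)
score_nonmatch = record_similarity(customer, stranger, FIELDS)

print(f"likely same person: {score_match:.2f}")
print(f"likely different:   {score_nonmatch:.2f}")
```

A score like this still needs a threshold to turn it into a link decision, and every field counts equally, which is one reason probabilistic models are attractive.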

In this session we will compare options for linking data, including out-of-the-box options (e.g. MDM), code-heavy options (e.g. Spark) and solutions in between. We will then introduce some of the theory behind probabilistic data linkage models and how they differ from deterministic models. For our examples we will use Splink, an open-source Python package developed by the UK Ministry of Justice for fast probabilistic record linkage and deduplication at scale.
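To make the deterministic versus probabilistic distinction concrete, here is a minimal sketch in plain Python. It is not Splink's API. It follows the general Fellegi-Sunter idea that probabilistic linkage tools such as Splink build on: each field contributes a match weight derived from an m probability (the field agrees given that the records truly match) and a u probability (the field agrees given that they do not). All parameter values, field names and records below are invented for illustration. In practice these parameters are estimated from the data.

```python
import math

# Hypothetical parameters, chosen only for illustration.
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
PARAMS = {
    "first_name": {"m": 0.90, "u": 0.010},
    "surname": {"m": 0.95, "u": 0.005},
    "dob": {"m": 0.98, "u": 0.001},
}
PRIOR = 0.001  # assumed probability that a randomly chosen pair is a match


def deterministic_match(a: dict, b: dict) -> bool:
    """Rule-based linkage: a pair links only if every field agrees exactly."""
    return all(a[f] == b[f] for f in PARAMS)


def match_probability(a: dict, b: dict) -> float:
    """Sum log2 match weights over the fields and convert to a probability."""
    weight = math.log2(PRIOR / (1 - PRIOR))  # prior odds as a weight
    for field, p in PARAMS.items():
        if a[field] == b[field]:
            weight += math.log2(p["m"] / p["u"])  # agreement is evidence for a match
        else:
            weight += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement counts against
    odds = 2 ** weight
    return odds / (1 + odds)


rec_a = {"first_name": "jon", "surname": "smith", "dob": "1980-01-01"}
rec_b = {"first_name": "john", "surname": "smith", "dob": "1980-01-01"}
rec_c = {"first_name": "maria", "surname": "garcia", "dob": "1975-06-30"}

print(deterministic_match(rec_a, rec_b))  # the first names differ, so the rule rejects the pair
print(round(match_probability(rec_a, rec_b), 3))
print(match_probability(rec_a, rec_c))
```

The deterministic rule discards the first pair because of a single spelling difference, while the probabilistic score still rates it as a likely match because agreement on surname and date of birth outweighs the disagreement on first name.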

You may also want to watch our in-depth session on Splink, available on YouTube:
https://www.youtube.com/watch?v=1ijNR3V4v3w