Statistical matching, or when one single data source is not enough

data-science

Published

July 19, 2019

I was recently asked how to go about matching several datasets where different samples of individuals were interviewed. This sounds like a big problem; say that you have dataset A and B, and that A contain one sample of individuals, and B another sample of individuals, then how could you possibly match the datasets? Matching datasets requires a common identifier, for instance, suppose that A contains socio-demographic information on a sample of individuals I, while B, contains information on wages and hours worked on the same sample of individuals I, then yes, it will be possible to match/merge/join both datasets.

But that was not what I was asked about; I was asked about a situation where the same population gets sampled twice, and each sample answers to a different survey. For example the first survey is about labour market information and survey B is about family structure. Would it be possible to combine the information from both datasets?

To me, this sounded a bit like missing data imputation problem, but where all the information about the variables of interest was missing! I started digging a bit, and found that not only there was already quite some literature on it, there is even a package for this, called {StatMatch} with a very detailed vignette. The vignette is so detailed, that I will not write any code, I just wanted to share this package!