Estimation
This class estimates the parameters of the Fellegi-Sunter model, a naive Bayes classifier commonly used in probabilistic record linkage, using the output generated by either the Comparison or the Deduplication class.
For reference, the Fellegi-Sunter model is described below.
Suppose we want to join observations from two datasets, \(\mathcal{A}\) and \(\mathcal{B}\), with sizes \(N_\mathcal{A}\) and \(N_\mathcal{B}\), respectively. Both datasets have \(K\) variables in common. We evaluate all possible pairwise comparisons of the values of these variables. Specifically, for each of the \(N_\mathcal{A} \times N_\mathcal{B}\) pairs of observations, we define an agreement vector of length \(K\), denoted \(\mathbf{\gamma}_{ij}\). The \(k^{\textrm{th}}\) element of this vector indicates the discrete level of similarity for the \(k^{\textrm{th}}\) variable between the \(i^{\textrm{th}}\) observation from dataset \(\mathcal{A}\) and the \(j^{\textrm{th}}\) observation from dataset \(\mathcal{B}\).
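As a concrete illustration of the agreement vectors, the sketch below compares every record of a toy dataset A with every record of a toy dataset B on K = 2 variables. The records, the `agreement_level` helper, and the two-level exact comparison (0 = disagree, 1 = agree) are assumptions made for this example. They are not part of the `faster` API, which may compute similarity levels differently.

```python
import numpy as np

# Toy datasets (hypothetical): each record has K = 2 variables (first name, birth year).
A = [("ana", 1990), ("bob", 1985)]
B = [("ana", 1990), ("bob", 1986), ("eve", 1970)]


def agreement_level(a, b):
    """Hypothetical two-level comparison: 1 if the values are equal, else 0."""
    return int(a == b)


K = 2
# gamma[i, j, k] is the similarity level of variable k for the pair (A[i], B[j]).
gamma = np.zeros((len(A), len(B), K), dtype=int)
for i, rec_a in enumerate(A):
    for j, rec_b in enumerate(B):
        for k in range(K):
            gamma[i, j, k] = agreement_level(rec_a[k], rec_b[k])

print(gamma.shape)  # (N_A, N_B, K)
print(gamma[0, 0])  # agreement vector for the first record of A and the first of B
```

With fuzzy comparisons, the same array would simply hold more than two levels per variable.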
The model presumes the existence of a latent variable \(M_{ij}\), which indicates whether the \(i^{\textrm{th}}\) observation from dataset \(\mathcal{A}\) and the \(j^{\textrm{th}}\) observation from dataset \(\mathcal{B}\) constitute a match. The model follows a simple finite mixture structure:
\[\gamma_{ij}(k) \mid M_{ij} = m \;\overset{\textrm{iid}}{\sim}\; \textrm{Discrete}(\mathbf{\pi}_{km}), \qquad M_{ij} \;\overset{\textrm{iid}}{\sim}\; \textrm{Bernoulli}(\lambda)\]
The vector \(\mathbf{\pi}_{km}\), of length \(L\), encodes the probability of each discrete similarity level being observed for the \(k^{\textrm{th}}\) variable conditional on whether the pair is a match (\(m=1\)) or not (\(m=0\)). The parameter \(\lambda\) denotes the overall probability of a match across all pairwise comparisons. The model’s estimands are the parameters \(\lambda\) and \(\mathbf{\pi}\). Once estimated, these parameters can be used to calculate the conditional match probability for all pairs of observations.
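Under the conditional-independence (naive Bayes) assumption, the conditional match probability of a pair follows from Bayes' rule: the prior \(\lambda\) is weighted by the product, over variables, of the probabilities of the observed similarity levels. The sketch below works this out with NumPy. The parameter values are made up for illustration, `match_probability` is a hypothetical helper rather than a `faster` function, and `pi` is assumed to be indexed as `[match status, variable, level]`.

```python
import numpy as np


def match_probability(gamma, lam, pi):
    """Posterior probability that a pair with agreement vector `gamma` is a match.

    gamma : sequence of K ints, the observed similarity level of each variable
    lam   : float, unconditional match probability (lambda)
    pi    : array of shape (2, K, L); pi[m, k, l] = P(level l for variable k | M = m)
    """
    gamma = np.asarray(gamma)
    k_idx = np.arange(gamma.size)
    # Likelihood of the observed levels under each match status (product over variables).
    lik_nonmatch = np.prod(pi[0, k_idx, gamma])
    lik_match = np.prod(pi[1, k_idx, gamma])
    # Bayes' rule for the two-component mixture.
    return lam * lik_match / (lam * lik_match + (1.0 - lam) * lik_nonmatch)


# Made-up parameters: K = 2 variables, L = 2 levels (0 = disagree, 1 = agree).
lam = 0.1
pi = np.array([
    [[0.9, 0.1], [0.8, 0.2]],  # non-matches (m = 0)
    [[0.2, 0.8], [0.1, 0.9]],  # matches (m = 1)
])

print(match_probability([1, 1], lam, pi))  # both variables agree
print(match_probability([0, 0], lam, pi))  # both variables disagree
```

Because the probability depends on a pair only through its agreement vector, it needs to be computed once per distinct pattern rather than once per pair.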
For more details on the Fellegi-Sunter model, refer to this excellent paper.
- class faster.estimation.Estimation(K_Fuzzy: int, K_Exact: int, Counts: array)
A class for estimating the parameters of the Fellegi-Sunter model based on observed patterns of discrete similarity levels across multiple variables.
- Parameters:
- Gamma
Holds the matrix of observed patterns of discrete similarity levels across variables.
- Returns:
Matrix encoding all observed combinations of discrete similarity levels across variables.
Each row represents a combination of discrete similarity levels.
Each column represents a variable.
Each element represents the discrete similarity level for a specific variable in the given pattern.
- Return type:
numpy.ndarray
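To make the layout of such a pattern matrix concrete, the sketch below collapses a small, made-up set of pairwise agreement vectors into its distinct patterns with `numpy.unique`. The data and variable names are hypothetical, and this only illustrates the structure described above. It is not necessarily how `faster` builds `Gamma` internally.

```python
import numpy as np

# Made-up agreement vectors: one row per compared pair, one column per variable.
pairs = np.array([
    [0, 0],
    [2, 1],
    [0, 0],
    [1, 0],
    [0, 0],
    [2, 1],
])

# Distinct rows (patterns) and the number of pairs exhibiting each pattern.
patterns, counts = np.unique(pairs, axis=0, return_counts=True)

for pattern, count in zip(patterns, counts):
    print(pattern, count)
```

Each row of `patterns` is one combination of similarity levels, and each column corresponds to one variable.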
- property Ksi
Holds the conditional match probabilities for each combination of discrete levels of similarity across variables, given the estimated parameters of the Fellegi-Sunter model.
- Returns:
Array containing the conditional match probabilities for each pattern of discrete similarity levels across variables.
- Return type:
numpy.ndarray
- Raises:
Exception – The model must be fitted first.
- Lambda
Holds the estimated overall probability that a pair of observations is a match.
- Returns:
Unconditional match probability.
- Return type:
- Pi
Holds the estimated probability of observing each discrete level of similarity for each variable, conditional on the latent match status.
- Returns:
Three-dimensional tensor containing the estimated probabilities of observing each discrete level of similarity for each variable, conditional on latent match status.
The first index denotes the latent match status, where 0 represents a non-match and 1 represents a match.
The second index denotes the variable.
The third index denotes the discrete level of similarity, with higher values reflecting greater similarity.
- Return type: