Estimation

This class estimates the parameters of the Fellegi-Sunter model, a Naive Bayes classifier commonly used in probabilistic record linkage, using the output generated by either the Comparison or Deduplication classes.

For reference, below is a brief description of the Fellegi-Sunter model.

Suppose we want to join observations from two datasets, \(\mathcal{A}\) and \(\mathcal{B}\), with sizes \(N_\mathcal{A}\) and \(N_\mathcal{B}\), respectively. Both datasets have \(K\) variables in common. We evaluate all possible pairwise comparisons of the values for these variables. Specifically, for each of the \(N_\mathcal{A} \times N_\mathcal{B}\) pairs of observations, we define an agreement vector of length \(K\), denoted \(\mathbf{\gamma}_{ij}\). The \(k^{\textrm{th}}\) element of this vector indicates the discrete level of similarity for the \(k^{\textrm{th}}\) variable between the \(i^{\textrm{th}}\) observation from dataset \(\mathcal{A}\) and the \(j^{\textrm{th}}\) observation from dataset \(\mathcal{B}\).
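As an illustration of the agreement vector, the sketch below builds \(\mathbf{\gamma}_{ij}\) for every pair of records from two tiny datasets. The helper names (similarity_level, agreement_vector), the similarity cutoffs, and the use of difflib as the string metric are all assumptions made for this example; they are not how the Comparison class computes similarity.

```python
from difflib import SequenceMatcher


def similarity_level(a: str, b: str, cuts=(0.7, 0.95)) -> int:
    """Map a string similarity in [0, 1] to a discrete level (0, 1 or 2).

    The cutoffs are hypothetical; the level is the number of cutoffs reached.
    """
    score = SequenceMatcher(None, a, b).ratio()
    return sum(score >= c for c in cuts)


def agreement_vector(rec_a: dict, rec_b: dict, fuzzy_vars, exact_vars) -> list:
    """Return one discrete similarity level per compared variable."""
    gamma = [similarity_level(rec_a[v], rec_b[v]) for v in fuzzy_vars]
    gamma += [int(rec_a[v] == rec_b[v]) for v in exact_vars]
    return gamma


A = [{"name": "Jonathan Smith", "zip": "02139"}]
B = [
    {"name": "Jonathan Smith", "zip": "02139"},
    {"name": "Maria Lopez", "zip": "02139"},
]

# One agreement vector for each of the N_A x N_B pairs.
gammas = {
    (i, j): agreement_vector(a, b, fuzzy_vars=["name"], exact_vars=["zip"])
    for i, a in enumerate(A)
    for j, b in enumerate(B)
}
```

Here each vector has length K = 2: the first element is the fuzzy comparison of name, the second the exact comparison of zip.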

The model presumes the existence of a latent variable \(M_{ij}\), which captures whether the pair of observations consisting of the \(i^{\textrm{th}}\) observation from dataset \(\mathcal{A}\) and the \(j^{\textrm{th}}\) observation from dataset \(\mathcal{B}\) constitutes a match. The model follows a simple finite mixture structure:

\[\gamma_{ij}(k) | M_{ij} = m \sim \textrm{Discrete}(\mathbf{\pi}_{km})\]
\[M_{ij} \sim \textrm{Bernoulli}(\lambda).\]

The vector \(\mathbf{\pi}_{km}\), of length \(L\), where \(L\) is the number of discrete similarity levels, encodes the probability of each discrete similarity level being observed for the \(k^{\textrm{th}}\) variable conditional on whether the pair is a match (\(m=1\)) or not (\(m=0\)). The parameter \(\lambda\) denotes the overall probability of a match across all pairwise comparisons. The model’s estimands are the parameters \(\lambda\) and \(\mathbf{\pi}\). Once estimated, these parameters can be used to calculate the conditional match probability for all pairs of observations.
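The conditional match probability follows from Bayes' rule. Assuming the similarity levels are independent across variables given the match status (the Naive Bayes structure mentioned above), a minimal sketch looks like this. The function name match_probability and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def match_probability(gamma, lam, pi):
    """Posterior probability that a pair with agreement vector `gamma` is a match.

    pi[m][k] is the probability vector over similarity levels of variable k
    given match status m (0 = non-match, 1 = match).
    """
    like_match = np.prod([pi[1][k][g] for k, g in enumerate(gamma)])
    like_non = np.prod([pi[0][k][g] for k, g in enumerate(gamma)])
    numerator = lam * like_match
    return numerator / (numerator + (1.0 - lam) * like_non)


# Hypothetical parameters: one three-level variable and one two-level variable.
lam = 0.1
pi = [
    [np.array([0.80, 0.15, 0.05]), np.array([0.9, 0.1])],  # m = 0 (non-match)
    [np.array([0.05, 0.15, 0.80]), np.array([0.1, 0.9])],  # m = 1 (match)
]

xi_high = match_probability([2, 1], lam, pi)  # strong agreement on both variables
xi_low = match_probability([0, 0], lam, pi)   # disagreement on both variables
```

With these made-up numbers, strong agreement on both variables gives a posterior well above the prior \(\lambda\), while disagreement on both pushes it close to zero.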

For more details on the Fellegi-Sunter model, refer to this excellent paper.

class faster.estimation.Estimation(K_Fuzzy: int, K_Exact: int, Counts: array)

A class for estimating the parameters of the Fellegi–Sunter model based on observed patterns of discrete similarity levels across multiple variables.

Parameters:
  • K_Fuzzy (int) – Number of variables compared for fuzzy matching.

  • K_Exact (int) – Number of variables compared for exact matching.

  • Counts (numpy.ndarray) – Array containing the observed counts for each pattern of discrete similarity levels across the compared variables.

Gamma

Holds the matrix of observed patterns of discrete similarity levels across variables.

Returns:

Matrix encoding all observed combinations of discrete similarity levels across variables.

  • Each row represents a combination of discrete similarity levels.

  • Each column represents a variable.

  • Each element represents the discrete similarity level for a specific variable in the given pattern.

Return type:

numpy.ndarray
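To make the relationship between a pattern matrix and its counts concrete, the sketch below collapses pair-level agreement vectors into unique patterns and their frequencies with NumPy. The row ordering and the pair-level input array are assumptions for this example; the library may build and order its own Gamma and Counts differently.

```python
import numpy as np

# Hypothetical agreement vectors for six record pairs (K = 2 variables).
pair_gammas = np.array([
    [2, 1],
    [0, 0],
    [0, 0],
    [2, 1],
    [1, 0],
    [0, 0],
])

# Each row of `patterns` is one observed combination of similarity levels;
# `counts[p]` is the number of pairs that exhibit pattern p.
patterns, counts = np.unique(pair_gammas, axis=0, return_counts=True)
```

Working with patterns and counts rather than individual pairs keeps the estimation cost tied to the number of distinct patterns, which is usually far smaller than the number of pairs.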

property Ksi

Holds the conditional match probabilities for each combination of discrete levels of similarity across variables, given the estimated parameters of the Fellegi-Sunter model.

Returns:

Array containing the conditional match probabilities for each pattern of discrete similarity levels across variables.

Return type:

numpy.ndarray

Raises:

Exception – If the model has not been fitted yet.

Lambda

Holds the estimated overall probability that a pair of observations is a match.

Returns:

Unconditional match probability.

Return type:

float

Pi

Holds the estimated probability of observing each discrete level of similarity for each variable, conditional on the latent match status.

Returns:

Three-dimensional tensor containing the estimated probabilities of observing each discrete level of similarity for each variable, conditional on latent match status.

  • The first index denotes the latent match status, where 0 represents a non-match and 1 represents a match.

  • The second index denotes the variable.

  • The third index denotes the discrete level of similarity, with higher values reflecting greater similarity.

Return type:

list[list[numpy.ndarray]]
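The indexing order described above can be sketched with a hand-built nested list. The values below are invented, and the assumption that different variables may have different numbers of levels (which would explain a nested list rather than a single array) is not confirmed by this page.

```python
import numpy as np

# Hypothetical structure: pi_example[m][k][l]
#   m = latent match status, k = variable, l = similarity level.
pi_example = [
    [np.array([0.80, 0.15, 0.05]), np.array([0.9, 0.1])],  # m = 0 (non-match)
    [np.array([0.05, 0.15, 0.80]), np.array([0.1, 0.9])],  # m = 1 (match)
]

# Probability of the highest similarity level on variable 0 among matches.
p_top_given_match = pi_example[1][0][2]

# Each inner vector is a probability distribution over levels.
sums_to_one = all(
    np.isclose(vec.sum(), 1.0) for row in pi_example for vec in row
)
```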

fit(Tolerance=0.0001, Max_Iter=5000)

Estimates the parameters of the Fellegi–Sunter model using the Expectation–Maximization (EM) algorithm.

Parameters:
  • Tolerance (float, optional) – Convergence threshold: the algorithm stops when the largest change in Pi is smaller than this value. Defaults to 1e-4.

  • Max_Iter (int, optional) – Maximum number of EM iterations to perform. Defaults to 5000.

Raises:

Exception – If the model has already been fitted, it cannot be fitted again.
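For readers who want to see the EM steps spelled out, here is a minimal, self-contained sketch of EM for the mixture model described at the top of this page, operating on patterns and counts. It is not the library's implementation: the function name em_fellegi_sunter, the starting values, and the example data are assumptions. Only the stopping rule (largest change in the level probabilities below a tolerance) mirrors the documented behaviour of Tolerance.

```python
import numpy as np


def em_fellegi_sunter(gamma, counts, n_levels, tol=1e-4, max_iter=5000):
    """Illustrative EM for the two-class mixture; returns (lam, pi, xi)."""
    K = gamma.shape[1]
    # Starting values (an assumption): non-matches favour low similarity
    # levels, matches favour high ones.
    pi = [
        [np.linspace(3, 1, L) if m == 0 else np.linspace(1, 3, L) for L in n_levels]
        for m in (0, 1)
    ]
    pi = [[v / v.sum() for v in row] for row in pi]
    lam = 0.1

    for _ in range(max_iter):
        # E-step: posterior match probability for each pattern.
        like = np.ones((2, len(counts)))
        for m in (0, 1):
            for k in range(K):
                like[m] *= pi[m][k][gamma[:, k]]
        num = lam * like[1]
        xi = num / (num + (1.0 - lam) * like[0])

        # M-step: count-weighted updates of lambda and the level probabilities.
        w = [counts * (1.0 - xi), counts * xi]
        lam = w[1].sum() / counts.sum()
        new_pi = [
            [
                np.bincount(gamma[:, k], weights=w[m], minlength=n_levels[k]) / w[m].sum()
                for k in range(K)
            ]
            for m in (0, 1)
        ]

        # Stop when the largest change in any level probability is below tol.
        delta = max(
            np.abs(new_pi[m][k] - pi[m][k]).max() for m in (0, 1) for k in range(K)
        )
        pi = new_pi
        if delta < tol:
            break
    return lam, pi, xi


# Hypothetical patterns (one three-level and one two-level variable) and counts.
gamma = np.array([[0, 0], [1, 0], [2, 1], [0, 1], [2, 0]])
counts = np.array([900.0, 50.0, 40.0, 5.0, 5.0])
lam_hat, pi_hat, xi_hat = em_fellegi_sunter(gamma, counts, n_levels=[3, 2])
```

The returned xi_hat holds one posterior match probability per pattern, which is the same kind of quantity the Ksi property is documented to expose after fitting.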