Linkage

This class combines the outputs of the Comparison and Deduplication classes with the Fellegi-Sunter model parameters estimated by the Estimation class, and uses those parameters to identify the record pairs most likely to refer to the same unit of observation.

Tip

To find duplicates within a single dataset, pass the same DataFrame as both the df_A and df_B arguments.

class faster.linkage.Linkage(df_A: DataFrame, df_B: DataFrame, Indices, Ksi: array)

A class for linking records between two Pandas DataFrames based on previously estimated conditional match probabilities.

Parameters:
  • df_A (pandas.DataFrame) – First DataFrame to be linked.

  • df_B (pandas.DataFrame) – Second DataFrame to be linked.

  • Indices (list[cupy.ndarray]) – List of arrays, where each array contains the indices of record pairs from df_A and df_B corresponding to a specific pattern of discrete similarity levels across variables.

  • Ksi (numpy.ndarray) – Array of conditional match probabilities for all combinations of discrete similarity levels across variables.
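
To make the relationship between Indices and Ksi concrete, here is a minimal sketch in plain Python that needs neither pandas nor a GPU. It assumes that the k-th entry of Indices holds the record pairs exhibiting similarity pattern k, stored as (row in df_A, row in df_B) tuples, and that the k-th entry of Ksi is the match probability for that same pattern. The layout the library actually uses is not specified here and may differ (pairs could, for instance, be encoded as single flattened integers). The helper name pair_probabilities is hypothetical.

```python
# Hypothetical toy inputs; the layout is an assumption, not the library's format.
indices = [
    [(0, 2), (1, 0)],  # pairs showing pattern 0 (little agreement)
    [(2, 1)],          # pairs showing pattern 1 (partial agreement)
    [(0, 0), (1, 1)],  # pairs showing pattern 2 (strong agreement)
]
ksi = [0.02, 0.40, 0.97]  # one conditional match probability per pattern


def pair_probabilities(indices, ksi):
    """Map each (row_A, row_B) pair to the match probability of its pattern."""
    if len(indices) != len(ksi):
        raise ValueError("indices and ksi must have one entry per pattern")
    return {pair: prob for pairs, prob in zip(indices, ksi) for pair in pairs}


probs = pair_probabilities(indices, ksi)
```

Every pair that shares a pattern receives the same probability, which is why the inputs can be stored per pattern rather than per pair.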

transform(Threshold=0.85)

Returns a DataFrame containing all pairs of records from df_A and df_B whose conditional match probabilities exceed a specified threshold.

Parameters:

Threshold (float, optional) – Threshold value above which pairs of records from df_A and df_B are considered matches. Defaults to 0.85.

Returns:

DataFrame linking all pairs of records from df_A and df_B with conditional match probabilities greater than the specified threshold.

Return type:

pandas.DataFrame

Raises:

Exception – If no pairs of records have conditional match probabilities exceeding the threshold.
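
The thresholding behaviour described for transform can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: link_pairs is a hypothetical helper, record pairs are assumed to be stored as (row in df_A, row in df_B) tuples grouped by pattern, plain dictionaries stand in for DataFrame rows, and ValueError (a subclass of Exception) stands in for the unspecified exception type.

```python
def link_pairs(rows_a, rows_b, indices, ksi, threshold=0.85):
    """Return one linked record per pair whose pattern probability exceeds threshold."""
    linked = []
    for pairs, prob in zip(indices, ksi):
        if prob > threshold:  # strict comparison: the probability must exceed the threshold
            for i, j in pairs:
                linked.append({"A": rows_a[i], "B": rows_b[j], "prob": prob})
    if not linked:
        # Mirrors the documented behaviour of raising when nothing passes.
        raise ValueError("no record pairs exceed the threshold")
    return linked


# Hypothetical toy data standing in for df_A, df_B, Indices and Ksi.
rows_a = [{"name": "Ana Silva"}, {"name": "Joao Souza"}]
rows_b = [{"name": "Ana Silva"}, {"name": "J. Souza"}, {"name": "Maria Lima"}]
indices = [[(0, 2), (1, 0)], [(1, 1)], [(0, 0)]]
ksi = [0.02, 0.90, 0.99]

matches = link_pairs(rows_a, rows_b, indices, ksi)  # default threshold of 0.85
```

With these toy values, the patterns with probabilities 0.90 and 0.99 pass the default threshold, so two pairs are returned. Raising the threshold to 0.95 would keep only the pair from the last pattern, and a threshold above 0.99 would trigger the exception.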