Linkage
This class integrates the outputs from the Comparison and Deduplication classes with the parameters of the Fellegi-Sunter model estimated by the Estimation class, using the latter to identify the records most likely to refer to the same unit of observation.
Tip
To extract duplicates in a dataset, simply provide the same dataframe as input for both the df_A and df_B arguments.
- class faster.linkage.Linkage(df_A: DataFrame, df_B: DataFrame, Indices, Ksi: array)
A class for linking records between two Pandas DataFrames based on previously estimated conditional match probabilities.
- Parameters:
df_A (pandas.DataFrame) – First DataFrame to be linked.
df_B (pandas.DataFrame) – Second DataFrame to be linked.
Indices (list[cupy.ndarray]) – List of arrays, where each array contains the indices of record pairs from df_A and df_B corresponding to a specific pattern of discrete similarity levels across variables.
Ksi (numpy.ndarray) – Array of conditional match probabilities for all combinations of discrete similarity levels across variables.
- transform(Threshold=0.85)
Returns a DataFrame containing all pairs of records from df_A and df_B whose conditional match probabilities exceed a specified threshold.
- Parameters:
Threshold (float, optional) – Threshold value above which pairs of records from df_A and df_B are considered matches. Defaults to 0.85.
- Returns:
DataFrame linking all pairs of records from df_A and df_B with conditional match probabilities greater than the specified threshold.
- Return type:
pandas.DataFrame
- Raises:
Exception – If no pairs of records have conditional match probabilities exceeding the threshold.
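The thresholding step described for transform can be illustrated with a small, self-contained sketch. This is not the library's implementation: the helper link_pairs is hypothetical, NumPy stands in for CuPy so the example runs without a GPU, and the layout of each array in Indices (one row per pair, holding a row position in df_A and a row position in df_B) is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd


def link_pairs(df_A, df_B, indices, ksi, threshold=0.85):
    """Hypothetical stand-in for the thresholding step; not the library's code.

    Assumption: indices[p] is an (n_p, 2) integer array whose rows are
    (row position in df_A, row position in df_B) for similarity pattern p,
    and ksi[p] is the conditional match probability of that pattern.
    """
    ksi = np.asarray(ksi, dtype=float)
    # Patterns whose match probability strictly exceeds the threshold.
    keep = np.flatnonzero(ksi > threshold)
    pairs = [np.asarray(indices[p]).reshape(-1, 2) for p in keep]
    pairs = [p for p in pairs if len(p) > 0]
    if not pairs:
        # Mirrors the documented behaviour of raising when nothing qualifies.
        raise Exception("No pairs exceed the threshold.")
    pairs = np.vstack(pairs)
    # Take the matched rows from each frame and place them side by side.
    left = df_A.iloc[pairs[:, 0]].reset_index(drop=True).add_suffix("_A")
    right = df_B.iloc[pairs[:, 1]].reset_index(drop=True).add_suffix("_B")
    return pd.concat([left, right], axis=1)


# Toy inputs invented for this example.
df_A = pd.DataFrame({"name": ["Ann Lee", "Bob Roy", "Cy Poe"]})
df_B = pd.DataFrame({"name": ["Ann Lee", "Robert Roy", "Dee Fox"]})
indices = [
    np.array([[0, 0]]),          # pattern 0: exact agreement
    np.array([[1, 1]]),          # pattern 1: partial agreement
    np.array([[2, 2], [0, 2]]),  # pattern 2: disagreement
]
ksi = np.array([0.99, 0.90, 0.01])

linked = link_pairs(df_A, df_B, indices, ksi, threshold=0.85)
print(linked)
```

With these toy inputs, only the pairs from the first two patterns are retained at the default threshold of 0.85. Raising the threshold to 0.95 keeps only the first pattern, and a threshold above every entry of ksi triggers the exception.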