Comparison
This class represents our main contribution, as it performs the GPU-accelerated computation of the Jaro-Winkler similarity metric for each pair of values between two datasets. In addition to fuzzy matching based on the Jaro-Winkler similarity metric, the class also supports comparing variables for exact matching.
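For reference, the Jaro-Winkler similarity is a continuous measure that ranges from 0 to 1. The similarity between two strings, \(s_1\) and \(s_2\), is calculated using the following formula:
\[\mathrm{JW}(s_1, s_2) = J(s_1, s_2) + \ell \, w \, \bigl(1 - J(s_1, s_2)\bigr)\]
where:
\[J(s_1, s_2) = \frac{1}{3}\left(\frac{m}{\left|s_1\right|} + \frac{m}{\left|s_2\right|} + \frac{m - t}{m}\right)\]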
In these equations, \(\left|s\right|\) denotes the length of the string \(s\), \(m\) is the number of matching characters between the strings, and \(t\) is the number of transpositions between matching characters. Furthermore, \(\ell\) (ranging from 0 to 4) is the number of consecutive matching characters at the beginning of both strings, and \(w\) (ranging from 0 to 0.25) is the weight assigned to \(\ell\); it corresponds to the argument p of fit.
We discretize the Jaro-Winkler similarity so that the values of the agreement vectors \(\boldsymbol{\gamma}\) are integers between 0 and \(L-1\), with higher values reflecting greater similarity. In practice, we use three levels (\(L = 3\)), separated by two thresholds.
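The definitions above can be made concrete with a plain-Python reference. This is a minimal CPU sketch, not the package's GPU kernel: the function names are hypothetical, the matching window of floor(max length / 2) - 1 is the usual convention and is assumed here, \(t\) is assumed to be half the number of matched characters that appear in a different order, and treating a similarity exactly equal to a threshold as belonging to the higher level is also an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity J(s1, s2) in [0, 1]; returns 0.0 if either string is empty."""
    if not s1 or not s2:
        return 0.0
    # Assumed convention: characters match only within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0  # number of matching characters
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    matched1 = [c for c, u in zip(s1, used1) if u]
    matched2 = [c for c, u in zip(s2, used2) if u]
    # Assumption: t is half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, w: float = 0.1) -> float:
    """Jaro similarity boosted by the common prefix length ell (capped at 4)."""
    j = jaro(s1, s2)
    ell = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        ell += 1
    return j + ell * w * (1 - j)


def discretize(sim: float, lower_thr: float = 0.88, upper_thr: float = 0.94) -> int:
    """Map a similarity to one of three levels: 0 (low), 1 (middle), 2 (high)."""
    if sim >= upper_thr:
        return 2
    if sim >= lower_thr:
        return 1
    return 0
```

With the default thresholds, "MARTHA" vs "MARHTA" lands in the highest level, "SMITH" vs "SMYTH" in the middle level, and "DIXON" vs "DICKSONX" in the lowest.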
- class faster.comparison.Comparison(df_A: DataFrame, df_B: DataFrame, Vars_Fuzzy_A, Vars_Fuzzy_B, Vars_Exact_A=[], Vars_Exact_B=[])
A class for comparing the values of selected variables between two pandas DataFrames.
This class supports fuzzy and exact comparisons. Variables to be compared must be specified in corresponding lists for each DataFrame.
- Parameters:
df_A (pandas.DataFrame) – First DataFrame to compare.
df_B (pandas.DataFrame) – Second DataFrame to compare.
Vars_Fuzzy_A (list[str]) – List of variable names in df_A to be compared using fuzzy matching.
Vars_Fuzzy_B (list[str]) – List of variable names in df_B corresponding to Vars_Fuzzy_A, in the same order.
Vars_Exact_A (list[str], optional) – List of variable names in df_A to be compared using exact matching. Defaults to an empty list.
Vars_Exact_B (list[str], optional) – List of variable names in df_B corresponding to Vars_Exact_A, in the same order. Defaults to an empty list.
- Raises:
Exception – If the lengths of Vars_Fuzzy_A and Vars_Fuzzy_B differ.
Exception – If the lengths of Vars_Exact_A and Vars_Exact_B differ.
Exception – If any name in Vars_Fuzzy_A or Vars_Fuzzy_B is not found in df_A or df_B, respectively.
Exception – If any name in Vars_Exact_A or Vars_Exact_B is not found in df_A or df_B, respectively.
- property Counts
Holds the count of record pairs corresponding to each combination of discrete similarity levels across all compared variables.
- Returns:
Array containing the number of pairs for each combination of discrete similarity levels across variables.
- Return type:
numpy.ndarray
- Indices
Holds a list of index arrays representing pairs of records from df_A and df_B that correspond to each combination of discrete similarity levels across all compared variables.
- Returns:
List of arrays, where each array contains the indices of the record pairs associated with a specific combination of discrete similarity levels.
Each index encodes a pair as i * len(df_B) + j, where i is the record's position in df_A and j is the record's position in df_B.
Similarity patterns are defined iteratively across variables (both fuzzy and exact), following the order specified by the user. Variables listed later in the sequence define faster-changing discrete levels of similarity.
The pattern representing no similarity between records is omitted.
- Return type:
list[cupy.ndarray]
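The ordering of similarity patterns described above can be illustrated by enumerating them explicitly. This is a sketch under stated assumptions: fuzzy variables are taken to have three levels and exact variables two, the function name is hypothetical, and the assumption that the k-th array in Indices lines up with the k-th pattern listed here is not confirmed by the text.

```python
from itertools import product


def similarity_patterns(levels_per_variable):
    """Enumerate level combinations, with later variables changing fastest.

    levels_per_variable: number of discrete levels for each compared variable,
    in the order the user listed them (assumed: 3 for fuzzy, 2 for exact).
    """
    ranges = [range(n) for n in levels_per_variable]
    # itertools.product varies its last argument fastest, which mirrors
    # "variables listed later define faster-changing levels".
    all_patterns = list(product(*ranges))
    # The first pattern is all zeros (no similarity on any variable); drop it.
    return all_patterns[1:]


# Two fuzzy variables followed by one exact variable.
patterns = similarity_patterns([3, 3, 2])
for position, pattern in enumerate(patterns[:4]):
    print(position, pattern)
```

For two fuzzy variables and one exact variable this yields 3 * 3 * 2 - 1 = 17 patterns, starting with (0, 0, 1) and ending with (2, 2, 1).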
- fit(p=0.1, Lower_Thr=0.88, Upper_Thr=0.94, Num_Threads=256, Max_Chunk_Size=2.0)
Compares all pairs of observations across the selected variables in both data frames. The result is stored in the Indices attribute.
- Parameters:
p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity. Defaults to 0.1.
Lower_Thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.
Upper_Thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.
Num_Threads (int, optional) – Number of threads per block. Defaults to 256.
Max_Chunk_Size (float, optional) – Maximum memory allocation per processing chunk, in gigabytes (GB). Defaults to 2.0.
- Raises:
Exception – If the model has already been fitted; a fitted instance cannot be fitted again.
Utility Functions
These functions are used internally by the Comparison class. You can use them to build your own record linkage pipelines.
- faster.comparison.jaro_winkler_gpu(str1, str2, offset=0, p=0.1, lower_thr=0.88, upper_thr=0.94, num_threads=256)
Computes the Jaro-Winkler similarity between all pairs of strings in two arrays and returns the indices corresponding to pairs of strings whose Jaro-Winkler similarity falls within specified thresholds.
- Parameters:
str1 (numpy.ndarray) – First array of strings.
str2 (numpy.ndarray) – Second array of strings.
offset (int, optional) – Value added to all output indices. Defaults to 0.
p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity. Defaults to 0.1.
lower_thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.
upper_thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.
num_threads (int, optional) – Number of threads per block. Defaults to 256.
- Returns:
- List containing two arrays of indices:
Indices with Jaro-Winkler similarity between lower_thr and upper_thr.
Indices with Jaro-Winkler similarity above upper_thr.
Indices represent i * len(str2) + j (plus offset), where i is the element's index in str1 and j is the element's index in str2.
- Return type:
list[cupy.ndarray]
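The flattened pair index described above can be converted back to a pair of positions with integer division. This is a minimal sketch with hypothetical helper names; it assumes offset is simply added to every index, as the parameter description states, and it works on plain Python integers rather than on the cupy arrays the function returns.

```python
def encode_pair(i: int, j: int, n_second: int, offset: int = 0) -> int:
    """Flatten (i, j) into a single index, given the length of the second array."""
    return i * n_second + j + offset


def decode_pair(index: int, n_second: int, offset: int = 0) -> tuple[int, int]:
    """Recover (i, j) from a flattened index produced by encode_pair."""
    return divmod(index - offset, n_second)


# Element 2 of the first array paired with element 3 of a 4-element second array.
flat = encode_pair(2, 3, n_second=4)
print(flat, decode_pair(flat, n_second=4))
```

A single integer per pair keeps the output compact, and the same decoding applies to every returned index array as long as the length of the second input is known.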
- faster.comparison.jaro_winkler_unique_gpu(str_A, str_B, p=0.1, lower_thr=0.88, upper_thr=0.94, num_threads=256, max_chunk_size=2.0)
Computes the Jaro-Winkler similarity between all pairs of strings in two arrays and returns the indices corresponding to pairs of strings whose Jaro-Winkler similarity falls within specified thresholds.
To speed up processing, this function restricts comparisons to the unique values in each input array.
- Parameters:
str_A (numpy.ndarray) – First array of strings.
str_B (numpy.ndarray) – Second array of strings.
p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity. Defaults to 0.1.
lower_thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.
upper_thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.
num_threads (int, optional) – Number of threads per block. Defaults to 256.
max_chunk_size (float, optional) – Maximum memory allocation per processing chunk, in gigabytes (GB). Defaults to 2.0.
- Returns:
- List containing two arrays of indices:
Indices with Jaro-Winkler similarity between lower_thr and upper_thr.
Indices with Jaro-Winkler similarity above upper_thr.
Indices represent i * len(str_B) + j, where i is the element's index in str_A and j is the element's index in str_B.
- Return type:
list[cupy.ndarray]
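The idea of comparing only unique values and then expanding the result to all original pairs can be sketched on the CPU. The helper names and the is_match predicate below are hypothetical stand-ins for the thresholded similarity; how the package actually organizes this step on the GPU is not described in this section.

```python
from collections import defaultdict


def group_positions(values):
    """Map each distinct value to the list of positions where it occurs."""
    groups = defaultdict(list)
    for pos, value in enumerate(values):
        groups[value].append(pos)
    return groups


def matched_pairs_via_unique(str_a, str_b, is_match):
    """Return sorted flattened indices i * len(str_b) + j of matching pairs.

    is_match is evaluated once per pair of unique values; the result is then
    expanded to every pair of original positions holding those values.
    """
    rows_a = group_positions(str_a)
    rows_b = group_positions(str_b)
    n_b = len(str_b)
    out = []
    for value_a, positions_a in rows_a.items():
        for value_b, positions_b in rows_b.items():
            if is_match(value_a, value_b):
                out.extend(i * n_b + j for i in positions_a for j in positions_b)
    return sorted(out)


calls = []


def same_prefix(a, b):
    """Toy predicate: the first three characters agree. Also records each call."""
    calls.append((a, b))
    return a[:3] == b[:3]


result = matched_pairs_via_unique(
    ["ann", "bob", "ann"], ["ann", "anne", "bob"], same_prefix
)
print(result, len(calls))
```

Here the predicate runs on 2 * 3 = 6 unique pairs instead of all 3 * 3 = 9 pairs; the saving grows with the amount of duplication in the inputs, which is typically large for fields such as first names.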
- faster.comparison.exact_gpu(str_A, str_B, num_threads=256)
Compares all pairs of strings in two arrays and returns the indices of exact matches.
- Parameters:
str_A (numpy.ndarray) – First array of strings.
str_B (numpy.ndarray) – Second array of strings.
num_threads (int, optional) – Number of threads per block. Defaults to 256.
- Returns:
Array of indices corresponding to pairs with an exact match.
Indices represent i * len(str_B) + j, where i is the element's index in str_A and j is the element's index in str_B.
- Return type:
list[cupy.ndarray]
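For small inputs, a CPU reference that produces the same kind of flattened indices can be useful for checking results. The sketch below uses a dictionary lookup instead of comparing all pairs; the function name is hypothetical, it returns a plain sorted Python list, and the ordering of indices in the GPU output is not specified here, so sorting before comparing is assumed to be necessary.

```python
def exact_reference(str_a, str_b):
    """Sorted flattened indices i * len(str_b) + j where str_a[i] == str_b[j]."""
    n_b = len(str_b)
    # Index the second array once: value -> positions holding that value.
    positions_b = {}
    for j, value in enumerate(str_b):
        positions_b.setdefault(value, []).append(j)
    out = []
    for i, value in enumerate(str_a):
        for j in positions_b.get(value, []):
            out.append(i * n_b + j)
    return sorted(out)


matches = exact_reference(["x", "y", "x"], ["y", "x", "z", "x"])
print(matches)
```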