Comparison

This class is our main contribution: it performs the GPU-accelerated computation of the Jaro-Winkler similarity for every pair of values across two datasets. Besides fuzzy matching based on the Jaro-Winkler similarity, the class also supports exact matching of variables.

For reference, the Jaro-Winkler similarity is a continuous measure that ranges from 0 to 1. The similarity between two strings, \(s_1\) and \(s_2\), is calculated using the following formula:

\[\mathcal{S}(s_1, s_2) = \mathcal{J}(s_1, s_2) + w \times \ell \times \left(1 - \mathcal{J}(s_1, s_2)\right),\]

where:

\[\mathcal{J}(s_1, s_2) = \frac{1}{3} \left( \frac{m}{\left|s_1\right|} + \frac{m}{\left|s_2\right|} + \frac{m-\frac{t}{2}}{m}\right).\]

In these equations, \(\left|s\right|\) denotes the length of the string \(s\), \(m\) is the number of matching characters between the two strings, and \(t\) is the number of transpositions between matching characters. Furthermore, \(\ell\) (ranging from 0 to 4) is the number of consecutive matching characters at the beginning of both strings, and \(w\) (ranging from 0 to 0.25) is the weight assigned to \(\ell\). We discretize the Jaro-Winkler similarity so that the entries of the agreement vectors \(\boldsymbol{\gamma}\) are integers between 0 and \(L-1\), where \(L\) is the number of similarity levels and higher values reflect greater similarity. In practice, we use \(L = 3\) levels, separated by two thresholds.
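The two formulas and the discretization step can be checked on the CPU with a short pure-Python sketch. This is a reference for reading the equations, not the GPU kernel used by the class. The function names `jaro`, `jaro_winkler` and `discretize` are hypothetical. The matching window (half the longer string's length, minus one) is the usual Jaro definition and is not stated in the text above. The sketch counts \(t\) as the number of matched characters that appear in a different order, so that \(t/2\) is the usual transposition count. The handling of values that fall exactly on a threshold is an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity J(s1, s2), following the formula above."""
    if not s1 or not s2:
        return 0.0  # assumption: an empty string matches nothing
    # Usual Jaro convention (assumed): equal characters count as matching
    # only if their positions differ by at most `window`.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t: matched characters that appear in a different order in s1 and s2.
    matched1 = [c for c, u in zip(s1, used1) if u]
    matched2 = [c for c, u in zip(s2, used2) if u]
    t = sum(a != b for a, b in zip(matched1, matched2))
    return (m / len(s1) + m / len(s2) + (m - t / 2) / m) / 3


def jaro_winkler(s1: str, s2: str, w: float = 0.1) -> float:
    """S(s1, s2) = J + w * ell * (1 - J), with ell capped at 4."""
    j = jaro(s1, s2)
    ell = 0
    for a, b in zip(s1, s2):
        if a != b or ell == 4:
            break
        ell += 1
    return j + w * ell * (1 - j)


def discretize(sim: float, lower_thr: float = 0.88, upper_thr: float = 0.94) -> int:
    """Map a similarity to a level in {0, 1, 2} (boundary handling assumed)."""
    if sim >= upper_thr:
        return 2
    if sim >= lower_thr:
        return 1
    return 0


for a, b in [("MARTHA", "MARHTA"), ("MARTHA", "MARTHE"), ("DIXON", "DICKSONX")]:
    s = jaro_winkler(a, b)
    print(f"{a} / {b}: similarity {s:.4f}, level {discretize(s)}")
```

With the default thresholds, the three pairs land in levels 2, 1 and 0 respectively, which illustrates how the two thresholds split the unit interval into three agreement levels.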

class faster.comparison.Comparison(df_A: DataFrame, df_B: DataFrame, Vars_Fuzzy_A, Vars_Fuzzy_B, Vars_Exact_A=[], Vars_Exact_B=[])

A class for comparing the values of selected variables between two pandas DataFrames.

This class supports fuzzy and exact comparisons. Variables to be compared must be specified in corresponding lists for each DataFrame.

Parameters:

df_A (pandas.DataFrame) – First DataFrame to compare.
df_B (pandas.DataFrame) – Second DataFrame to compare.
Vars_Fuzzy_A (list[str]) – List of variable names in df_A to be compared using fuzzy matching.
Vars_Fuzzy_B (list[str]) – List of variable names in df_B corresponding to Vars_Fuzzy_A, in the same order.
Vars_Exact_A (list[str], optional) – List of variable names in df_A to be compared using exact matching. Defaults to an empty list.
Vars_Exact_B (list[str], optional) – List of variable names in df_B corresponding to Vars_Exact_A, in the same order. Defaults to an empty list.

Raises:

Exception – If the lengths of Vars_Fuzzy_A and Vars_Fuzzy_B differ.
Exception – If the lengths of Vars_Exact_A and Vars_Exact_B differ.
Exception – If any name in Vars_Fuzzy_A or Vars_Fuzzy_B is not found in df_A or df_B respectively.
Exception – If any name in Vars_Exact_A or Vars_Exact_B is not found in df_A or df_B respectively.
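The conditions listed under Raises can be checked before constructing the object, which gives a clearer error message early in a pipeline. The following is a minimal sketch of such a pre-check. The helper `check_variable_lists` is hypothetical and is not part of the package. It only mirrors the documented conditions and makes no claim about how the class implements them.

```python
def check_variable_lists(columns_A, columns_B, vars_A, vars_B, kind="fuzzy"):
    """Hypothetical pre-check mirroring the documented Raises conditions.

    columns_A / columns_B: column names of the two tables,
    for example df_A.columns and df_B.columns.
    """
    if len(vars_A) != len(vars_B):
        raise Exception(
            f"{kind} variable lists differ in length: "
            f"{len(vars_A)} vs {len(vars_B)}"
        )
    known_A, known_B = set(columns_A), set(columns_B)
    missing_A = [v for v in vars_A if v not in known_A]
    missing_B = [v for v in vars_B if v not in known_B]
    if missing_A or missing_B:
        raise Exception(
            f"{kind} variables not found: df_A {missing_A}, df_B {missing_B}"
        )


# Illustrative column names only.
cols_A = ["first", "last", "birth_year"]
cols_B = ["fname", "lname", "yob"]

check_variable_lists(cols_A, cols_B, ["first", "last"], ["fname", "lname"])
check_variable_lists(cols_A, cols_B, ["birth_year"], ["yob"], kind="exact")

try:
    check_variable_lists(cols_A, cols_B, ["first", "last"], ["fname"])
except Exception as err:
    print("rejected:", err)
```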

property Counts

Holds the count of record pairs corresponding to each combination of discrete similarity levels across all compared variables.

Returns:

Array containing the number of pairs for each combination of discrete similarity levels across variables.

Return type:

numpy.ndarray

Indices

Holds a list of index arrays representing pairs of records from df_A and df_B that correspond to each combination of discrete similarity levels across all compared variables.

Returns:

List of arrays, where each array contains indices of record pairs associated with a specific combination of discrete similarity levels.

Indices represent i * len(df_B) + j, where i is the record’s position in df_A and j is the record’s position in df_B.

Similarity patterns are defined iteratively across variables (both fuzzy and exact), following the order specified by the user. Variables listed later in the sequence define faster-changing discrete levels of similarity.

The pattern representing no similarity between records is omitted.

Return type:

list[cupy.ndarray]
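The flat pair encoding and the ordering of similarity patterns can be illustrated with a short sketch. It rests on several assumptions that the text does not confirm: fuzzy variables have three levels and exact variables two, fuzzy variables precede exact ones in the ordering, and the k-th array in the list corresponds to the k-th pattern once the all-zero pattern is dropped. The function names are hypothetical.

```python
from itertools import product


def similarity_patterns(n_fuzzy, n_exact, fuzzy_levels=3, exact_levels=2):
    """Enumerate level combinations with the last variable changing fastest.

    Assumption: fuzzy variables come first, then exact variables.
    """
    ranges = [range(fuzzy_levels)] * n_fuzzy + [range(exact_levels)] * n_exact
    patterns = list(product(*ranges))  # product varies the last factor fastest
    return patterns[1:]  # drop the all-zero "no similarity" pattern


def decode_pair(flat_index, n_B):
    """Invert flat_index = i * n_B + j into the pair (i, j)."""
    return divmod(flat_index, n_B)


# One fuzzy variable (levels 0-2) and one exact variable (levels 0-1).
for k, pattern in enumerate(similarity_patterns(n_fuzzy=1, n_exact=1)):
    print(k, pattern)

# With 3 records in df_B, flat index 7 refers to row 2 of df_A and row 1 of df_B.
print(decode_pair(7, n_B=3))
```

For arrays held on the GPU, the same decoding applies elementwise with integer division and modulo.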

fit(p=0.1, Lower_Thr=0.88, Upper_Thr=0.94, Num_Threads=256, Max_Chunk_Size=2.0)

Compares all pairs of observations across the selected variables in both data frames. The result is stored in the Indices attribute.

Parameters:

p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity (the weight \(w\) in the formula above). Defaults to 0.1.
Lower_Thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.
Upper_Thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.
Num_Threads (int, optional) – Number of threads per block. Defaults to 256.
Max_Chunk_Size (float, optional) – Maximum memory allocation per processing chunk, in gigabytes (GB). Defaults to 2.0.

Raises:

Exception – If fit has already been called on this object; a fitted object cannot be fitted again.
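A typical call sequence, based only on the signatures documented above, might look like the sketch below. The data and column names are purely illustrative, and storing the exact-match variable as strings is an assumption. Running the function requires the package and a CUDA-capable GPU, so the imports are deferred into the function body.

```python
# Illustrative toy records; names and values are made up for this sketch.
RECORDS_A = {
    "first": ["martha", "john", "dwayne"],
    "last": ["jones", "smith", "dixon"],
    "birth_year": ["1980", "1975", "1990"],
}
RECORDS_B = {
    "fname": ["marhta", "jon", "duane"],
    "lname": ["jones", "smyth", "dickson"],
    "yob": ["1980", "1975", "1991"],
}
VARS_FUZZY_A, VARS_FUZZY_B = ["first", "last"], ["fname", "lname"]
VARS_EXACT_A, VARS_EXACT_B = ["birth_year"], ["yob"]


def run_comparison_example():
    # Deferred imports: this sketch needs pandas, the faster package and a GPU.
    import pandas as pd
    from faster.comparison import Comparison

    df_A = pd.DataFrame(RECORDS_A)
    df_B = pd.DataFrame(RECORDS_B)

    comp = Comparison(
        df_A, df_B,
        Vars_Fuzzy_A=VARS_FUZZY_A, Vars_Fuzzy_B=VARS_FUZZY_B,
        Vars_Exact_A=VARS_EXACT_A, Vars_Exact_B=VARS_EXACT_B,
    )
    # Documented defaults, written out for clarity.
    comp.fit(p=0.1, Lower_Thr=0.88, Upper_Thr=0.94,
             Num_Threads=256, Max_Chunk_Size=2.0)

    # Calling comp.fit(...) a second time is documented to raise an Exception.
    return comp.Counts, comp.Indices
```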

Utility Functions

These functions are used internally by the Comparison class. You can use them to build your own record linkage pipelines.

faster.comparison.jaro_winkler_gpu(str1, str2, offset=0, p=0.1, lower_thr=0.88, upper_thr=0.94, num_threads=256)

Computes the Jaro-Winkler similarity between all pairs of strings in two arrays and returns the indices corresponding to pairs of strings whose Jaro-Winkler similarity falls within specified thresholds.

Parameters:

str1 (numpy.ndarray) – First array of strings.
str2 (numpy.ndarray) – Second array of strings.
offset (int, optional) – Value added to all output indices. Defaults to 0.
p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity (the weight \(w\) in the formula above). Defaults to 0.1.
lower_thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.
upper_thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.
num_threads (int, optional) – Number of threads per block. Defaults to 256.

Returns:

List containing two arrays of indices:

Indices with Jaro-Winkler similarity between lower_thr and upper_thr.
Indices with Jaro-Winkler similarity above upper_thr.

Indices represent offset + i * len(str2) + j, where i is the element’s index in str1 and j is the element’s index in str2.

Return type:

list[cupy.ndarray]
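The return convention, including the role of offset, can be emulated on the CPU. The sketch below assumes that offset exists so that the first array can be processed in chunks while still producing global flat indices; the text only says that the value is added to all output indices. The function `threshold_indices` and the toy similarity are hypothetical stand-ins, and the toy similarity is not the Jaro-Winkler metric.

```python
def threshold_indices(str1, str2, similarity, offset=0,
                      lower_thr=0.88, upper_thr=0.94):
    """CPU stand-in returning [middle, high] lists of flat pair indices."""
    middle, high = [], []
    n2 = len(str2)
    for i, a in enumerate(str1):
        for j, b in enumerate(str2):
            s = similarity(a, b)
            flat = offset + i * n2 + j
            if s >= upper_thr:       # boundary handling is an assumption
                high.append(flat)
            elif s >= lower_thr:
                middle.append(flat)
    return [middle, high]


def toy_similarity(a, b):
    """Toy stand-in for a similarity score, NOT Jaro-Winkler."""
    if a == b:
        return 1.0
    return 0.9 if a[:3] == b[:3] else 0.0


names1 = ["anna", "annie", "bob"]
names2 = ["anna", "bob"]

# All of names1 at once.
full = threshold_indices(names1, names2, toy_similarity)

# Same work in two chunks of names1; the second chunk starts at row 2,
# so its indices are shifted by 2 * len(names2).
part1 = threshold_indices(names1[:2], names2, toy_similarity)
part2 = threshold_indices(names1[2:], names2, toy_similarity,
                          offset=2 * len(names2))
merged = [part1[0] + part2[0], part1[1] + part2[1]]

print(full)
print(merged == full)
```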

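faster.comparison.jaro_winkler_unique_gpu(str_A, str_B, p=0.1, lower_thr=0.88, upper_thr=0.94, num_threads=256, max_chunk_size=2.0)

Computes the Jaro-Winkler similarity between pairs of strings in two arrays and returns the indices of the pairs whose similarity falls within the specified thresholds. To speed up processing, this function restricts comparisons to the unique values in each input array.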

Parameters:

str_A (numpy.ndarray) – First array of strings.
str_B (numpy.ndarray) – Second array of strings.
p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity (the weight \(w\) in the formula above). Defaults to 0.1.
lower_thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.
upper_thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.
num_threads (int, optional) – Number of threads per block. Defaults to 256.
max_chunk_size (float, optional) – Maximum memory allocation per processing chunk, in gigabytes (GB). Defaults to 2.0.

Returns:

List containing two arrays of indices:

Indices with Jaro-Winkler similarity between lower_thr and upper_thr.
Indices with Jaro-Winkler similarity above upper_thr.

Indices represent i * len(str_B) + j, where i is the element’s index in str_A and j is the element’s index in str_B.

Return type:

list[cupy.ndarray]
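The idea of comparing only unique values and then expanding the result back to the original positions can be sketched in pure Python. This illustrates the general technique under stated assumptions and is not the package's GPU implementation. The helper names are hypothetical, and `is_match` stands in for any pairwise test, such as a thresholded similarity.

```python
from collections import defaultdict


def positions_by_value(values):
    """Map each distinct value to the list of positions where it occurs."""
    groups = defaultdict(list)
    for pos, value in enumerate(values):
        groups[value].append(pos)
    return groups


def matches_via_unique(str_A, str_B, is_match):
    """Flat indices i * len(str_B) + j of matching pairs.

    is_match is evaluated once per pair of unique values, and every hit
    is expanded to all original positions sharing those values.
    """
    groups_A = positions_by_value(str_A)
    groups_B = positions_by_value(str_B)
    n_B = len(str_B)
    out = []
    for a, rows in groups_A.items():
        for b, cols in groups_B.items():
            if is_match(a, b):
                out.extend(i * n_B + j for i in rows for j in cols)
    return sorted(out)


calls = 0


def counted_equal(a, b):
    """Equality test that counts how often it is evaluated."""
    global calls
    calls += 1
    return a == b


last_A = ["smith", "jones", "smith"]
last_B = ["smith", "smith", "brown"]

result = matches_via_unique(last_A, last_B, counted_equal)
print(result)   # flat indices of matching pairs
print(calls)    # comparisons made, versus 9 for all pairs
```

The saving grows with the amount of duplication: columns such as surnames or cities often have far fewer distinct values than rows.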

faster.comparison.exact_gpu(str_A, str_B, num_threads=256)

Compares all pairs of strings in two arrays and returns the indices of exact matches.

Parameters:

str_A (numpy.ndarray) – First array of strings.
str_B (numpy.ndarray) – Second array of strings.
num_threads (int, optional) – Number of threads per block. Defaults to 256.

Returns:

Array of indices corresponding to pairs with an exact match.

Indices represent i * len(str_B) + j, where i is the element’s index in str_A and j is the element’s index in str_B.

Return type:

list[cupy.ndarray]
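For small inputs, the expected output of an exact comparison can be reproduced on the CPU, which is handy for checking results. The sketch below is a brute-force reference that uses the same flat index convention. The function name `exact_indices` is hypothetical, and the data is illustrative.

```python
def exact_indices(str_A, str_B):
    """Flat indices i * len(str_B) + j of all exactly matching pairs."""
    n_B = len(str_B)
    return [
        i * n_B + j
        for i, a in enumerate(str_A)
        for j, b in enumerate(str_B)
        if a == b
    ]


zip_A = ["10001", "94105"]
zip_B = ["94105", "10001", "10001"]

matches = exact_indices(zip_A, zip_B)
print(matches)

# Recover (i, j) for each match from the flat index.
pairs = [divmod(idx, len(zip_B)) for idx in matches]
print(pairs)
```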