Deduplication

The Deduplication class is our main contribution: it performs the GPU-accelerated computation of the Jaro-Winkler similarity for every pair of values in the dataset. In addition to fuzzy matching based on the Jaro-Winkler similarity, the class supports exact matching of variables.

For reference, the Jaro-Winkler similarity is a continuous measure that ranges from 0 to 1. The similarity between two strings, \(s_1\) and \(s_2\), is calculated using the following formula:

\[\mathcal{S}(s_1, s_2) = \mathcal{J}(s_1, s_2) + w \times \ell \times \left(1 - \mathcal{J}(s_1, s_2)\right),\]

where:

\[\mathcal{J}(s_1, s_2) = \frac{1}{3} \left( \frac{m}{\left|s_1\right|} + \frac{m}{\left|s_2\right|} + \frac{m-\frac{t}{2}}{m}\right).\]

In these equations, \(\mathcal{J}(s_1, s_2)\) is the Jaro similarity, \(\left|s\right|\) denotes the length of the string \(s\), \(m\) is the number of matching characters between the strings, and \(t\) is the number of transpositions between matching characters. Furthermore, \(\ell\) (ranging from 0 to 4) is the length of the common prefix of the two strings, and \(w\) (ranging from 0 to 0.25) is the weight assigned to \(\ell\).

We discretize the Jaro-Winkler similarity so that the values of the agreement vectors \(\mathbf{\gamma}\) are integers between 0 and \(L-1\), with higher values reflecting greater similarity. In practice, we use three levels (\(L = 3\)), separated by two thresholds.
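
To make the two formulas concrete, here is a minimal pure-Python sketch of the Jaro-Winkler similarity and its discretization into three levels. It is a CPU reference for illustration only, not the GPU kernel used by the package. The matching window (half the longer string's length, minus one) follows the standard Jaro definition and is not stated above; the function names and the handling of values exactly equal to a threshold are assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity J(s1, s2)."""
    if not s1 or not s2:
        return 0.0
    # Standard Jaro matching window (assumption: not stated in the text above).
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)
    matches1 = []
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matches1.append(ch)
                break
    m = len(matches1)
    if m == 0:
        return 0.0
    matches2 = [s2[j] for j in range(len(s2)) if used[j]]
    # t counts matched characters that appear in a different order.
    t = sum(a != b for a, b in zip(matches1, matches2))
    return (m / len(s1) + m / len(s2) + (m - t / 2) / m) / 3


def jaro_winkler(s1, s2, w=0.1):
    """Jaro-Winkler similarity S(s1, s2) with prefix weight w."""
    j = jaro(s1, s2)
    ell = 0  # length of the common prefix, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        ell += 1
    return j + w * ell * (1 - j)


def discretize(sim, lower_thr=0.88, upper_thr=0.94):
    """Map a similarity to level 0, 1, or 2 (boundary handling is an assumption)."""
    if sim >= upper_thr:
        return 2
    if sim >= lower_thr:
        return 1
    return 0
```

For the classic pair "MARTHA" and "MARHTA", there are \(m = 6\) matching characters, \(t = 2\), and \(\ell = 3\), so the sketch gives a Jaro similarity of about 0.944 and a Jaro-Winkler similarity of about 0.961, which falls in the highest level under the default thresholds.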

class faster.deduplication.Deduplication(df: DataFrame, Vars_Fuzzy, Vars_Exact=[])[source]

A class for comparing the values of selected variables in one pandas DataFrame.

Parameters:
  • df (pandas.DataFrame) – DataFrame to deduplicate.

  • Vars_Fuzzy (list[str]) – List of variable names to be compared using fuzzy matching.

  • Vars_Exact (list[str], optional) – List of variable names to be compared using exact matching. Defaults to an empty list.

Raises:

Exception – If any name in Vars_Fuzzy or Vars_Exact is not found in df.

property Counts

This property stores the count of record pairs corresponding to each combination of discrete similarity levels across all compared variables.

Returns:

Array containing the number of pairs for each pattern of discrete similarity levels across variables.

Return type:

numpy.ndarray

Indices

This attribute stores a list of index arrays identifying the pairs of records in df that correspond to each combination of discrete similarity levels across all compared variables.

Returns:

List of arrays, where each array contains indices of record pairs associated with a specific combination of discrete similarity levels.

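Indices represent i * len(str_B) + j, where i is the element’s index in str_A and j is the element’s index in str_B. In deduplication, str_A and str_B both refer to the same array of values, so len(str_B) is the number of records in df.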

Similarity patterns are defined iteratively across variables (both fuzzy and exact), following the order specified by the user. Variables listed later in the sequence define faster-changing discrete levels of similarity.

The pattern representing no similarity between records is omitted.

Return type:

list[cupy.ndarray]
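
The index encoding and the pattern ordering described above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: the helper names are hypothetical, fuzzy variables are assumed to have three levels and exact variables two, and the k-th entry of Indices is assumed to correspond to the k-th pattern once the all-zero pattern is dropped.

```python
from itertools import product


def decode_pair(flat_index, n):
    """Invert flat_index = i * n + j for an array of n records."""
    return divmod(flat_index, n)


def enumerate_patterns(levels_per_variable):
    """List similarity patterns with the last variable changing fastest.

    The all-zero pattern (no similarity on any variable) is omitted.
    """
    patterns = product(*(range(k) for k in levels_per_variable))
    return [p for p in patterns if any(p)]


# One fuzzy variable (3 levels) followed by one exact variable (2 levels).
patterns = enumerate_patterns([3, 2])
# patterns == [(0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

# With n = 5 records, the flat index 7 refers to records 1 and 2.
pair = decode_pair(7, 5)
```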

fit(p=0.1, Lower_Thr=0.88, Upper_Thr=0.94, Num_Threads=256, Max_Chunk_Size=2.0)[source]

Compares all pairs of observations across the selected variables in the dataframe. The result is stored in the Indices attribute.

Parameters:
  • p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity (\(w\) in the formula above). Defaults to 0.1.

  • Lower_Thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.

  • Upper_Thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.

  • Num_Threads (int, optional) – Number of threads per block. Defaults to 256.

  • Max_Chunk_Size (float, optional) – Maximum memory allocation per processing chunk, in gigabytes (GB). Defaults to 2.0.

Raises:

Exception – If the model has already been fitted, it cannot be fitted again.
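
A typical call sequence, based only on the signatures documented above, might look like the following sketch. The DataFrame contents and column names are made up for illustration, and storing every compared variable as strings is an assumption. The imports are placed inside the function so that the sketch can be read or imported on a machine without a GPU; running it requires the faster package and a CUDA-capable environment.

```python
def run_dedup_example():
    # Local imports: pandas and the GPU-backed package are only needed
    # when the example is actually executed.
    import pandas as pd
    from faster.deduplication import Deduplication

    # Hypothetical toy data; all columns are stored as strings (assumption).
    df = pd.DataFrame({
        "first_name": ["Anna", "Ana", "Robert", "Rob"],
        "last_name": ["Smith", "Smith", "Jones", "Jones"],
        "birth_year": ["1980", "1980", "1975", "1975"],
    })

    dedup = Deduplication(
        df,
        Vars_Fuzzy=["first_name", "last_name"],
        Vars_Exact=["birth_year"],
    )

    # fit() can be called only once per instance.
    dedup.fit(p=0.1, Lower_Thr=0.88, Upper_Thr=0.94)

    # Counts: number of pairs per similarity pattern.
    # Indices: flat pair indices per similarity pattern.
    return dedup.Counts, dedup.Indices
```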

Utility Functions

These functions are used internally by the Deduplication class. You can use them to build your own record linkage pipelines.

faster.deduplication.jaro_winkler_dedup_gpu(string, p=0.1, lower_thr=0.88, upper_thr=0.94, num_threads=256, max_chunk_size=2.0)[source]

Computes the Jaro-Winkler similarity between all pairs of strings in an array and returns the indices corresponding to pairs of strings whose Jaro-Winkler similarity falls within specified thresholds.

Parameters:
  • string (numpy.ndarray) – Array of strings.

  • p (float, optional) – Scaling factor applied to the common prefix in the Jaro-Winkler similarity (\(w\) in the formula above). Defaults to 0.1.

  • lower_thr (float, optional) – Lower threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.88.

  • upper_thr (float, optional) – Upper threshold for discretizing the Jaro-Winkler similarity. Defaults to 0.94.

  • num_threads (int, optional) – Number of threads per block. Defaults to 256.

  • max_chunk_size (float, optional) – Maximum memory allocation per processing chunk, in gigabytes (GB). Defaults to 2.0.

Returns:

List containing two arrays of indices:
  1. Indices with Jaro-Winkler similarity between lower_thr and upper_thr.

  2. Indices with Jaro-Winkler similarity above upper_thr.

Indices represent i * len(str_B) + j, where i is the element’s index in str_A and j is the element’s index in str_B. Here, str_A and str_B both refer to the input array string.

Return type:

list[cupy.ndarray]
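
The shape of the return value can be illustrated with a brute-force CPU sketch that buckets flat pair indices by threshold. It is not the GPU implementation. The names bucket_pairs and toy_similarity are hypothetical, the similarity function is a deliberately crude stand-in rather than Jaro-Winkler, and two details are assumptions: each unordered pair is visited once with i < j, and a value equal to a threshold is assigned to the higher bucket.

```python
def bucket_pairs(strings, similarity, lower_thr=0.88, upper_thr=0.94):
    """Return [middle, high] lists of flat indices i * n + j."""
    n = len(strings)
    middle, high = [], []
    for i in range(n):
        for j in range(i + 1, n):  # assumption: each unordered pair once
            sim = similarity(strings[i], strings[j])
            flat = i * n + j
            if sim >= upper_thr:
                high.append(flat)
            elif sim >= lower_thr:
                middle.append(flat)
    return [middle, high]


def toy_similarity(a, b):
    """Stand-in score: 1.0 if equal, 0.9 if same first letter, else 0.0."""
    if a == b:
        return 1.0
    if a[:1] == b[:1]:
        return 0.9
    return 0.0


names = ["anna", "anne", "bob", "anna"]
middle, high = bucket_pairs(names, toy_similarity)
# middle == [1, 7]  (pairs (0, 1) and (1, 3))
# high == [3]       (pair (0, 3))
```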

faster.deduplication.exact_dedup_gpu(string, num_threads=256)[source]

Compares all pairs of strings in an array and returns the indices of exact matches.

Parameters:
  • string (numpy.ndarray) – Array of strings.

  • num_threads (int, optional) – Number of threads per block. Defaults to 256.

Returns:

Array of indices corresponding to pairs with an exact match.

Indices represent i * len(str_B) + j, where i is the element’s index in str_A and j is the element’s index in str_B. Here, str_A and str_B both refer to the input array string.

Return type:

list[cupy.ndarray]
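
For small inputs, the result of an exact comparison can be cross-checked on the CPU. The sketch below groups equal values with a dictionary instead of comparing all pairs, which is a different technique from the GPU function documented above. The name exact_pairs is hypothetical, and reporting each unordered pair once with i < j is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def exact_pairs(strings):
    """Return sorted flat indices i * n + j (i < j) of identical strings."""
    n = len(strings)
    positions = defaultdict(list)
    for idx, value in enumerate(strings):
        positions[value].append(idx)
    flat = [
        i * n + j
        for group in positions.values()
        for i, j in combinations(group, 2)
    ]
    return sorted(flat)


matches = exact_pairs(["a", "b", "a", "a"])
# matches == [2, 3, 11]  (pairs (0, 2), (0, 3), and (2, 3) with n = 4)
```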