Uses stringdist::stringdist() to compute distances and a
data.table-orchestrated backend with compiled 'C++' assembly to produce
the final result. This is the main user-facing entry point for fuzzy joins on
strings.
Usage
fuzzystring_join(
  x,
  y,
  by = NULL,
  max_dist = 2,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
    "soundex"),
  mode = "inner",
  ignore_case = FALSE,
  distance_col = NULL,
  ...
)
fuzzystring_inner_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_left_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_right_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_full_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_semi_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_anti_join(x, y, by = NULL, distance_col = NULL, ...)
Arguments
- x
A data.frame or data.table.
- y
A data.frame or data.table.
- by
Columns by which to join the two tables. You can supply a character vector of common names (e.g. c("name")), or a named vector mapping x to y (e.g. c(name = "approx_name")).
- max_dist
Maximum distance to use for joining. Smaller values are stricter.
- method
Method for computing string distance, see ?stringdist::stringdist and the stringdist package vignettes.
- mode
One of "inner", "left", "right", "full", "semi", or "anti".
- ignore_case
Logical; if TRUE, comparisons are case-insensitive.
- distance_col
If not NULL, adds a column with this name containing the computed distance for each matched pair (or NA for unmatched rows in outer joins).
- ...
Additional arguments passed to stringdist.
Value
A joined table that preserves the container class of x:
data.table inputs return data.table, tibble inputs return
tibble, and plain data.frame inputs return plain
data.frame. See fuzzystring_join_backend for details
on output structure.
Details
If method = "soundex", max_dist is automatically set to 0.5,
since Soundex distance is 0 (match) or 1 (no match).
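The 0.5 cutoff can be checked against stringdist directly. The snippet below is a minimal sketch, assuming the stringdist package is installed; the example names are illustrative and do not come from the package.

```r
if (requireNamespace("stringdist", quietly = TRUE)) {
  # "Robert" and "Rupert" share the Soundex code R163; "Rubin" encodes to R150.
  d <- stringdist::stringdist(
    c("Robert", "Robert"),
    c("Rupert", "Rubin"),
    method = "soundex"
  )
  print(d)

  # Soundex distances are either 0 or 1, so any cutoff strictly between
  # the two (such as 0.5) keeps exactly the pairs with identical codes.
  print(d <= 0.5)
}
```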
When by maps multiple columns, the same method,
max_dist, and any additional stringdist arguments are applied
independently to each mapped column, and a row pair is kept only when all
mapped columns satisfy the distance threshold.
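The "all mapped columns must satisfy the threshold" rule can be sketched in base R. This is only an illustration of the semantics, not the package's backend: it uses utils::adist() (Levenshtein distance) as a stand-in for stringdist with method = "lv", and the tables and column names are hypothetical.

```r
# Hypothetical toy tables; the join maps first -> first and last -> last.
x <- data.frame(first = c("Jon", "Anna"), last = c("Smith", "Lee"),
                stringsAsFactors = FALSE)
y <- data.frame(first = c("John", "Ana"), last = c("Smyth", "Leeson"),
                stringsAsFactors = FALSE)
max_dist <- 1

# One distance matrix per mapped column: rows index x, columns index y.
d_first <- utils::adist(x$first, y$first)
d_last  <- utils::adist(x$last,  y$last)

# A row pair is kept only when every mapped column is within max_dist.
keep <- d_first <= max_dist & d_last <= max_dist

# Row/column indices of the surviving (x row, y row) pairs.
which(keep, arr.ind = TRUE)
```

Here "Anna"/"Ana" is within the threshold on the first column, but "Lee"/"Leeson" is not on the second, so that row pair is dropped.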
For single-column joins, fuzzystring uses adaptive candidate planning before
calling stringdist::stringdist(). For Levenshtein-like methods
("osa", "lv", "dl"), a fast prefilter is applied: these distances are never
smaller than the difference in string lengths, so if
abs(nchar(v1) - nchar(v2)) > max_dist, the pair cannot match and its
distance is not computed. For low-duplication workloads, the
planner can also evaluate larger dense blocks of unique values to reduce
orchestration overhead while preserving the same matching semantics.
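The length prefilter described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's planner: utils::adist() stands in for stringdist with method = "lv", and the vectors v1 and v2 are hypothetical.

```r
v1 <- c("kitten", "cat", "sitting")
v2 <- c("sitten", "catalogue")
max_dist <- 2

# Enumerate all candidate pairs (i indexes v1, j indexes v2).
pairs <- expand.grid(i = seq_along(v1), j = seq_along(v2))

# Prefilter: a length gap larger than max_dist rules the pair out
# before any edit distance is computed.
len_gap <- abs(nchar(v1[pairs$i]) - nchar(v2[pairs$j]))
candidates <- pairs[len_gap <= max_dist, ]

# Compute the distance only for the surviving candidates.
candidates$dist <- mapply(
  function(a, b) utils::adist(a, b)[1, 1],
  v1[candidates$i], v2[candidates$j],
  USE.NAMES = FALSE
)
matched <- candidates[candidates$dist <= max_dist, ]
matched

# Sanity check: the prefilter discards no true match.
stopifnot(nrow(matched) == sum(utils::adist(v1, v2) <= max_dist))
```

In this toy input the prefilter halves the number of distance computations (three of six pairs survive) while the set of matches is unchanged.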
Examples
# \donttest{
if (requireNamespace("ggplot2", quietly = TRUE)) {
  d <- data.table::data.table(approximate_name = c("Idea", "Premiom"))
  # Match diamonds$cut to d$approximate_name
  res <- fuzzystring_inner_join(ggplot2::diamonds, d,
    by = c(cut = "approximate_name"),
    max_dist = 1
  )
  head(res)
}
#> # A tibble: 6 × 11
#> carat cut color clarity depth table price x y z approximate_name
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 Idea
#> 2 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 Premiom
#> 3 0.29 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 Premiom
#> 4 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 Idea
#> 5 0.22 Prem… F SI1 60.4 61 342 3.88 3.84 2.33 Premiom
#> 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 Idea
# }