Uses stringdist::stringdist() to compute distances and a data.table-based
backend to assemble the final result. This is the main user-facing entry point
for fuzzy joins on strings.
Usage
fuzzystring_join(
x,
y,
by = NULL,
max_dist = 2,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
mode = "inner",
ignore_case = FALSE,
distance_col = NULL,
...
)
fuzzystring_inner_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_left_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_right_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_full_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_semi_join(x, y, by = NULL, distance_col = NULL, ...)
fuzzystring_anti_join(x, y, by = NULL, distance_col = NULL, ...)Arguments
- x
A
data.frameordata.table.- y
A
data.frameordata.table.- by
Columns by which to join the two tables. You can supply a character vector of common names (e.g.
c("name")), or a named vector mappingxtoy(e.g.c(name = "approx_name")).- max_dist
Maximum distance to use for joining. Smaller values are stricter.
- method
Method for computing string distance, see
?stringdist::stringdistand thestringdistpackage vignettes.- mode
One of
"inner","left","right","full","semi", or"anti".- ignore_case
Logical; if
TRUE, comparisons are case-insensitive.- distance_col
If not
NULL, adds a column with this name containing the computed distance for each matched pair (orNAfor unmatched rows in outer joins).- ...
Additional arguments passed to
stringdist.
Value
A joined table (same container type as x). See
fuzzystring_join_backend for details on output structure.
Details
If method = "soundex", max_dist is automatically set to 0.5,
since Soundex distance is 0 (match) or 1 (no match).
For Levenshtein-like methods ("osa", "lv", "dl"), a fast
prefilter is applied: if abs(nchar(v1) - nchar(v2)) > max_dist, the pair
cannot match, so distance is not computed for that pair.
Examples
# \donttest{
if (requireNamespace("ggplot2", quietly = TRUE)) {
d <- data.table::data.table(approximate_name = c("Idea", "Premiom"))
# Match diamonds$cut to d$approximate_name
res <- fuzzystring_inner_join(ggplot2::diamonds, d,
by = c(cut = "approximate_name"),
max_dist = 1
)
head(res)
}
#> carat cut color clarity depth table price x y z
#> <num> <ord> <ord> <ord> <num> <num> <int> <num> <num> <num>
#> 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
#> 4: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
#> 5: 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
#> 6: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
#> approximate_name
#> <char>
#> 1: Idea
#> 2: Premiom
#> 3: Premiom
#> 4: Idea
#> 5: Premiom
#> 6: Idea
# }