Runs a matching pipeline with exact and partial matching for binomial and trinomial names, including infraspecific rank validation.
Usage
wcvp_matching(
df,
target_df = NULL,
prefilter_genus = TRUE,
allow_duplicates = FALSE,
max_dist = 1,
method = "osa",
add_name_distance = FALSE,
name_distance_method = "osa",
profile = FALSE,
output_name_style = c("snake_case", "legacy")
)
Arguments
- df
Input tibble/data.frame with either Genus/Species or Orig.Genus/Orig.Species. For trinomials, include Infra.Rank and Infraspecies (or Orig.Infra.Rank/Orig.Infraspecies).
- target_df
Optional custom target table. If NULL, data are read from the optional wcvpdata checklist when available; otherwise pass target_df explicitly.
- prefilter_genus
Logical. If TRUE, prefilter target_df to candidate genera (exact + fuzzy) before running the matching pipeline.
- allow_duplicates
Logical. If TRUE, duplicated taxon keys are deduplicated internally for matching and then expanded back to the original rows. Output includes input_index for traceability to the original input.
- max_dist
Maximum distance used in all fuzzy matching stages (genus, species, infraspecies).
- method
A string indicating the fuzzy matching method (passed to fozziejoin). Supported methods:
  "levenshtein": Levenshtein edit distance.
  "osa": Optimal string alignment (the default in this function).
  "damerau_levensthein" or "dl": Damerau-Levenshtein distance.
  "hamming": Hamming distance (equal-length strings only).
  "lcs": Longest common subsequence.
  "qgram": Q-gram similarity (requires q).
  "cosine": Cosine similarity (requires q).
  "jaccard": Jaccard similarity (requires q).
  "jaro": Jaro similarity.
  "jaro_winkler" or "jw": Jaro-Winkler similarity.
  "soundex": Soundex codes based on the National Archives standard.
- add_name_distance
Logical. If TRUE, add matched_dist as the pairwise distance between the input name (Input.Name, falling back to Orig.Name) and matched_taxon_name.
- name_distance_method
Method passed to stringdist::stringdist when add_name_distance = TRUE (for example "osa").
- profile
Logical. If TRUE, attach a timing table in the "timings" attribute of the returned tibble, with elapsed seconds per pipeline stage.
- output_name_style
Naming style for output columns:
  "snake_case" returns standardized lower snake_case names.
  "legacy" keeps the historical mixed naming convention.
Value
Tibble with matched names, process flags, and taxonomic context
columns: matched_plant_name_id, matched_taxon_name, taxon_status,
accepted_plant_name_id, accepted_taxon_name, is_accepted_name.
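The allow_duplicates argument describes deduplicating taxon keys, matching once per key, and expanding back to the original rows, with input_index recording where each output row came from. The base R sketch below illustrates that general idea only. It is not the package's internal code: the key construction, the stand-in "matched" column, and the object names are all assumptions made for illustration.

```r
# Hypothetical input with a repeated taxon key (rows 1 and 3).
df <- data.frame(
  Genus   = c("Opuntia", "Aniba", "Opuntia"),
  Species = c("yanganucensis", "heterotepala", "yanganucensis"),
  stringsAsFactors = FALSE
)
df$input_index <- seq_len(nrow(df))

# Build one key per row and keep a single row per distinct key.
key        <- paste(df$Genus, df$Species)
unique_df  <- df[!duplicated(key), c("Genus", "Species")]
unique_key <- paste(unique_df$Genus, unique_df$Species)

# Stand-in for the matching step (assumption): one result row per unique key.
unique_df$matched <- TRUE

# Expand back: look up each original row's key among the unique keys,
# then restore the original row positions via input_index.
expanded <- unique_df[match(key, unique_key), ]
expanded$input_index <- df$input_index
rownames(expanded) <- NULL

expanded
```

In this sketch the matching work is done for two keys while the result still has three rows, one per input row, which is the behaviour the input_index column is meant to make traceable.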
Examples
# \donttest{
library(wcvpmatch)
# Match a single name
wcvp_matching(data.frame(Genus = "Opuntia", Species = "yanganucensis"))
#> ℹ Input was converted from <data.frame> to a <tibble>.
#> See <https://tibble.tidyverse.org/> for more details.
#> # A tibble: 1 × 43
#> input_index input_name orig_name orig_genus orig_species infra_rank
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Opuntia yanganucensis NA Opuntia yanganucens… NA
#> # ℹ 37 more variables: orig_infraspecies <chr>, matched_genus <chr>,
#> # matched_species <chr>, matched_infra_rank <chr>,
#> # matched_infraspecies <chr>, author <chr>, matched_plant_name_id <dbl>,
#> # matched_taxon_name <chr>, matched_taxon_authors <chr>, taxon_status <chr>,
#> # accepted_plant_name_id <dbl>, accepted_taxon_name <chr>,
#> # accepted_taxon_authors <chr>, is_accepted_name <lgl>, matched <lgl>,
#> # direct_match <lgl>, genus_match <lgl>, fuzzy_match_genus <lgl>, …
# Match multiple names with snake_case output
names <- c("Aniba heterotepala", "Anthurium quipuscoae")
df <- classify_spnames(names)
wcvp_matching(df, output_name_style = "snake_case")
#> # A tibble: 2 × 43
#> input_index input_name orig_name orig_genus orig_species infra_rank
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Aniba heterotepala Aniba het… Aniba heterotepala NA
#> 2 2 Anthurium quipuscoae Anthurium… Anthurium quipuscoae NA
#> # ℹ 37 more variables: orig_infraspecies <chr>, matched_genus <chr>,
#> # matched_species <chr>, matched_infra_rank <chr>,
#> # matched_infraspecies <chr>, author <chr>, matched_plant_name_id <dbl>,
#> # matched_taxon_name <chr>, matched_taxon_authors <chr>, taxon_status <chr>,
#> # accepted_plant_name_id <dbl>, accepted_taxon_name <chr>,
#> # accepted_taxon_authors <chr>, is_accepted_name <lgl>, matched <lgl>,
#> # direct_match <lgl>, genus_match <lgl>, fuzzy_match_genus <lgl>, …
# Attach per-stage timings for profiling
out <- wcvp_matching(df, output_name_style = "snake_case", profile = TRUE)
attr(out, "timings")
# }
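The choice of method interacts with max_dist. As a rough illustration, the sketch below compares two distances for a name with two adjacent letters swapped. The misspelling "Opuntai" is a made-up example, and the calls go directly to base R's utils::adist and, if it is installed, to stringdist::stringdist. This is not how wcvp_matching computes distances internally (it passes method to fozziejoin), so treat it only as a way to build intuition.

```r
a <- "Opuntia"
b <- "Opuntai"  # hypothetical typo: adjacent letters "i" and "a" swapped

# Base R generalized Levenshtein distance: a swap counts as two edits.
lev <- drop(utils::adist(a, b))
lev

# Optional comparison with stringdist, only if the package is available.
if (requireNamespace("stringdist", quietly = TRUE)) {
  # Optimal string alignment counts an adjacent transposition as one edit.
  osa <- stringdist::stringdist(a, b, method = "osa")
  # "lv" is stringdist's name for plain Levenshtein distance.
  lv  <- stringdist::stringdist(a, b, method = "lv")
  print(c(osa = osa, lv = lv))
}
```

Under these assumptions, a threshold of max_dist = 1 would admit this kind of transposition typo with an OSA-style distance but not with plain Levenshtein, which is one reason to pick the method deliberately.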