Takes a character vector and "simplifies" it by uppercasing, removing most non-alphabetic (or, with digits = TRUE, non-alphanumeric) characters, removing accents, forcing UTF-8 encoding, removing excess spaces, and optionally removing stop words. Useful when you need to compare two large vectors of person or business names in which misspellings may be common.

simplify_string(
  x,
  alpha = TRUE,
  digits = FALSE,
  unaccent = TRUE,
  utf8_only = TRUE,
  case = "upper",
  trim = TRUE,
  stopwords = NA
)

Arguments

x

A character vector.

alpha

Should alphabetic characters be included in the cleaned-up string? (Default: TRUE)

digits

Should digits be included in the cleaned-up string? (Default: FALSE)

unaccent

Should characters be de-accented? (Default: TRUE)

utf8_only

Should characters be UTF-8 only? (Default: TRUE)

case

Which casing should the characters use? One of 'upper', 'lower', 'sentence', 'title', or 'keep' (to retain the existing casing). (Default: 'upper')

trim

Should strings be trimmed of excess spaces? (Default: TRUE)

stopwords

An optional character vector of stop words to be removed. (Default: NA)

Value

A character vector of simplified strings.

Examples

simplify_string(c('J. Jonah Jameson', 'j jonah jameson',
  'j   jonah 123   jameson', 'J Jónah Jameson...'))
#> [1] "J JONAH JAMESON" "J JONAH JAMESON" "J JONAH JAMESON" "J JONAH JAMESON"
simplify_string(c('123 Business Inc.', '123 business incorporated',
  '123 ... Business ... Inc.'), digits = TRUE, stopwords = c('INC', 'INCORPORATED'))
#> [1] "123 BUSINESS" "123 BUSINESS" "123 BUSINESS"
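
The use case named in the description, comparing two vectors of names, might look roughly like the sketch below. So that it runs without the package, it uses a hypothetical base-R stand-in, simplify_stub(), which only uppercases, drops non-letters, and squeezes spaces. It is not the package's implementation, and it does not de-accent or remove stop words. The name lists are made-up data.

```r
# Hypothetical stand-in for simplify_string(), written in base R so the
# example is self-contained. It only uppercases, keeps ASCII letters and
# spaces, and squeezes whitespace; no de-accenting, no stop words.
simplify_stub <- function(x) {
  x <- toupper(x)
  x <- gsub("[^A-Z ]", "", x)  # drop everything except A-Z and spaces
  x <- gsub(" +", " ", x)      # collapse runs of spaces
  trimws(x)                    # strip leading/trailing spaces
}

# Two illustrative name lists (made-up data)
list_a <- c("J. Jonah Jameson", "Peter Parker")
list_b <- c("peter   parker", "j jonah jameson", "Mary Jane Watson")

# For each name in list_a, find its position in list_b,
# comparing on the simplified keys rather than the raw strings
idx <- match(simplify_stub(list_a), simplify_stub(list_b))

# Pair each original name with its match (NA where none is found)
matches <- data.frame(a = list_a, b = list_b[idx])
print(matches)
```

With the package loaded, simplify_string() would presumably take the place of simplify_stub() here. Exact matching on simplified keys only absorbs differences in case, punctuation, and spacing; genuine misspellings would still need a fuzzy comparison on top of it.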