Skip to content

Walla Walla

Latest
Compare
Choose a tag to compare
@albarrentine albarrentine released this 09 May 18:37
· 53 commits to master since this release
9c97597

This release adds three important groups of functions to libpostal's C API to support the lieu address/venue deduping project. The APIs are somewhat low-level at this point, but should still be useful in a wide range of geo applications, particularly for batch geocoding large data sets. This is the realization of some of the earlier work on address expansion.

Near-duplicate detection

Near-dupe hashing builds on the expand_address functionality to allow hashing a parsed address into strings suitable for direct comparison and automatic clustering. The hash keys are used to group similar records together prior to pairwise deduping so that we don't need to compare every record to every other record (i.e. N² comparisons). Instead, if we have a function that can generate the same hash key for records that are possible dupes (like "100 Main" and "100 E Main St"), while also being highly selective, we can ensure that most duplicates will be captured for further comparison downstream, and that dissimilar records can be safely considered non-dupes. In a MapReduce context, near-dupe hashes can be used as keys to ensure that possible dupes will be grouped together on the same shard for pairwise checks, and in a search/database context, they can be used as an index for quick lookups of candidate dupes before running more thorough comparisons with the few records that match the hash. This is the first step in the deduping process to identify candidate dupes, and can be thought of as the blocking function in record linkage (this is a highly selective version of a blocking function) or as a form of locally sensitive hashing in the near-duplicate detection literature. Libpostal's near-dupe hashes use a combination of several new features of the library:

  1. Address root expansions: removes tokens that are ignorable such as "Ave", "Pl", "Road", etc. in street names so that something like "West 125th St" can potentially match "W 125". This also allows for exact comparison of apartment numbers where "Apt 2" and "#2" mean the same thing. Every address component uses certain dictionaries in libpostal to determine what is ignorable or not, and although the method is rule-based and deterministic, it can also identify the correct root tokens in many complex cases like "Avenue Rd", "Avenue E", "E St SE", "E Ctr St", etc. While many of the test cases used so far are for English, libpostal's dictionary structure also allows it to work relatively well around the world, e.g. matching Spanish street names where "Calle" might be included in a government data set but is rarely used colloquially or in self-reported addresses. For street names, we additionally strip whitespace from the root expansion so "Sea Grape Ln" and "Seagrape Ln" will both normalize to "seagrape".

  2. Phonetic matching for names: the near-dupe hashes for venue/place/company names written in Latin script include a modified version of the double metaphone algorithm which can be useful for comparing misspelled human names, as well as comparing machine transliterations against human ones in languages where names might written in multiple scripts in different data sets e.g. Arabic or Japanese.

  3. Geo qualifiers: for address data sets with lat/lons, geohash tiles (with a precision of 6 characters by default) and their 8 neighbors (to avoid faultlines) are used to narrow down the comparisons to addresses/places in a similar location. If there's no lat/lon, and the data are known to be from a single country, the postal code or the city name can optionally be used as the geo qualifier. Future improvements include disambiguating toponyms and mapping them to IDs in a hierarchy, such that multiple names for cities, etc. can resolve to one or more IDs, and e.g. an NYC address that uses a neighborhood name in place of the city e.g. "Harlem, NY" could match "New York, NY" by traversing the hierarchy and outputting the city's ID instead.

  4. Acronym generation: when generating hash keys for names, we also try to guess acronyms when there are two or more tokens. While the acronym alignments downstream can be more exact, there are cases like "Brooklyn Academy of Music" vs "BAM" which would not match a single token using the previous methods, but are the same place nonetheless. Acronym generation attempts to solve this problem for venue names, human initials, etc. It is stopword-aware in various languages, but generates multiple potential forms of the acronym for cases like "Museum of Modern Art" vs. "MoMA" where the stopword is included in the acronym. The method also includes rudimentary support for sub-acronyms i.e. generating new tokens at every other non-contiguous stopword (so "University of Texas at Austin" will produce "UT"), as well as at punctuation marks like commas or colons (e.g. "University of Texas, Austin"). Acronym generation also leaves intact any known dictionary phrases that resolve to multi-word phrases so e.g. "Foo High School" and "Foo HS" can share a hash key, and also keeps any acronyms identified as part of the tokenizer from the internal period structure like "A.B.C." All of this also works across scripts for any non-ideographic languages. Though this method will not cover all valid acronyms directly, we use a double metaphone on the output in Latin scripts so with certain types of more complex acronyms like "sonar" vs. "Sound Navigation and Ranging", where more than the initial character of each word is used, the generated acronym may still be phonetically the same as the real acronym (in this case, both resolve to "SNR"), especially if the other letters used are vowels.

  5. Name quad-grams: in order to handle a variety of spelling differences, as well as languages with longer concatenated words as in German, all of the name hashes mentioned above are converted in to unique sequences of 4 characters. Simple numeric tokens consisting of digits and hyphens, etc. are included as-is, without the quadgrams or double metaphone encoding. For the words that do use quadgrams, there is no token boundary disambiguation i.e. the double metaphone for "Nationalgalerie" would be "NXNLKLR" which would then generate strings ["NXNL", "XNLK", "NLKL", "LKLR"] (as opposed to the behavior of our languages classifier features which would generate ["NXNL_", "_XNLK_", "_NLKL_", "_LKLR"], using the underscore to indicate whether the 4-gram is at the beginning, middle, or end of the word). Since the fully concatenated name without whitespace is also double metaphone encoded and hashed to 4-grams in this fashion, it should also account for variance in the phonetic output resulting from spacing/concatenation differences (vowels at the beginning of a word are usually converted to "A" in double metaphone, whereas they're ignored in the middle of a word). Overall, though this method does slightly reduce the precision of near-dupe hashing (causes us to identify more potential dupes and thus make more pairwise comparisons in the next stage), it also increases recall of the process (we don't miss as many true dupes).

Component-wise deduping

Once we have potential candidate dupe pairs, we provide per-component methods for comparing address/name pairs and determining if they're duplicates. Each relevant address component has it own function, with certain logic for each, including which libpostal dictionaries to use, and whether a root expansion match counts as an exact duplicate or not. For instance, in a secondary unit, "# 2", "Apt 2", and "Apt # 2" can be considered an exact match in English whereas we wouldn't want to make that kind of assumption for street names e.g. "Park Ave" and "Park Pl". In the latter case, we can still classify the street names as needing to be reviewed by a human.

The duplicate functions return one of the following values:

  • LIBPOSTAL_NULL_DUPLICATE_STATUS
  • LIBPOSTAL_NON_DUPLICATE
  • LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW
  • LIBPOSTAL_LIKELY_DUPLICATE
  • LIBPOSTAL_EXACT_DUPLICATE

The likely and exact classifications can be considered duplicates and merged automatically, whereas the needs_review response is for flagging possible duplicates.

Having special functions for each component can also be useful down the line e.g. for deduping with compound house numbers/ranges (though this is not implemented yet).

Since identifying the correct language is crucial to effective matching, and individual components like house_number and unit may not provide any useful information about the language, we also provide a function that returns the language(s) for an entire parsed/labeled address using all of its textual components. The returned language codes can be reused for subsequent calls.

Fuzzy deduping for names

For venue/street names, we also want to be able to handle inexact name matches, minor spelling differences, words out of order (see this often with human names, which can sometimes be listed as Last, First Middle), and removing tokens that may not be ignorable in terms of libpostal's dictionaries but are very common, or very common in a particular geography.

In this release, we build on the idea of Soft-TFIDF, which blends a local similarity function (usually Jaro-Winkler in the literature, though we use a hybrid method), with global corpus statistics (TFIDF weights or other similar term weights, supplied by the user in our case, see the lieu project for details on constructing indices based on TFIDF or information gain (performs better in the venue deduping setting), and their per-geo variants.

Here's how it works:

  1. for strings s1 and s2, each token in s1 is aligned with its most similar token in s2 in terms of a user-specified local similarity metric, provided that it meets a specified similarity threshold. This allows for small spelling mistakes in the individual words and also makes the method invariant to word-order.

  2. given a vector of scores for each string, the final similarity is, for each token t1 in s1 and its closest match t2 in s2 (if local_sim >= theta): local_sim * scores1[t1] * scores2[t2]. In our case, the final fuzzy dot product is then normalized by the product of the L2 norms of each score vector, so it can be thought of as a form of soft cosine similarity. The two main weighting schemes are information gain and TFIDF. Using TFIDF means that rare words are given more weight in the similarity metric than very common words like "Salon" or "Barbershop". Using information gain means terms are weighted by relevance i.e. how much the given word tells us about other words, and can be thought of as running a mini feature selection algorithm or organizing the branches of a small decision tree for each string. Overall information gain tends to widen the gap between high-information, core name words and simple low-information words which can be safely ignored if the rest of the tokens match. Finally, it's possible to give all words equal weight using a uniform distribution (give each token a weight of 1 / # of tokens), and this is the approach we take for fuzzy street name comparisons.

  3. Assuming the chosen token scores add up to 1, the total similarity score for the string that's between 0 and 1, and there are user-specified thresholds for when to consider the records as various classes of dupes. The default threshold is 0.9 for likely dupes and 0.7 for needs_review, but may be changed depending on tolerance for false positives.

Note: for the lieu project we use a linear combination of information gain (or TFIDF) and a geo-specific variant score where the index is computed for a specific, roughly city-sized geohash tile, and smaller tiles are lumped in with their larger neighbors. The geo-specific scores mean that something like "San Francisco Department of Building Inspection" and "Department of Building Inspection" can match because the words "San Francisco" are very common in the region. This approach was inspired by some of the research in https://research.fb.com/publications/deduplicating-a-places-database/.

Unique to this implementation, we use a number of different local similarity metrics to qualify a given string for inclusion in the final similarity score:

  1. Jaro-Winkler similarity: this is a string similarity metric developed for comparing names in the U.S. Census. It detects small spelling differences in words based on the number of matches and transpositions relative to the lengths of the two strings. The Winkler variant gives a more favorable score to words with a shared common prefix. This is the local similarity metric used in most of the Soft-TFIDF literature, and we use the commonly-cited value of 0.9 for the inclusion threshold, which works reasonably well in practice. Note: all of our string similarity methods use unicode characters rather than bytes in their calculations.

  2. Damerau-Levenshtein distance: the traditional edit distance metric, where transpositions of two characters count as a single edit. If a string does not meet the Jaro-Winkler threshold, but has a maximum edit distance of 1 (could be that the first character was transposed), and a minimum length of 4 (many short strings are within edit distance 1 of each other, so don't want to generate too many false positives). Note: since distances and similarities are not on the same scale, we use the Damerau-Levenshtein only as a qualifying threshold, and use the Jaro-Winkler similarity value (even though it did not meet the threshold) for the qualifying pair in the final similarity calculation.

  3. Sequence alignment with affine gap penalty and edit operation subtotals: a new, efficient method for sequence alignment and abbreviation detection. This builds on the Smith-Waterman-Gotoh algorithm with affine gap penalties, which was originally used for alignment of DNA sequences, but works well for other types of text. When we see a rare abbreviation that's not in the libpostal dictionaries, say "Service" and "Svc", the alignment would be "S--v-c-". In other words, we match "S", open a gap, extend that gap for two characters, then match "v", open another gap, extend it one character, match "c", open a gap, and extend it one more character at the end. In the original Smith-Waterman, O(mn) time and space was required to compute this alignment (where m is the length of the first string and n is the length of the second). Gotoh's improvement still needs O(mn) time and O(m) space (where m is the length of the longer string), but it does not store the sequence of operations, only a single cost where each type of edit pays a particular penalty, where the affine gap penalty is the idea that we should pay more for opening a gap than extending it. The problem with the single cost is it's not always clear what to make of that single combined score. The new method we use in libpostal stores and returns a breakdown of the counts and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return or compute the full alignment as in Needleman-Wunsch or Hirschberg's variant. Using this method we know that for "Service" and "Svc", the number of matches is equal to the length of the shorter string, regardless of how many gaps were opened, and the two share a common prefix of "S", so "Svc" can be considered a possible abbreviation for "Service". When we find one of these possible abbreviations, and none of the other thresholds are met (which can easily happen with abbreviations), it qualifies both tokens for inclusion in the final similarity, again using their Jaro-Winkler similarity as the weight in the final calculation. For strict abbreviations (match the criteria for possible abbreviations and also share a common suffix e.g. "Festival" and "Fstvl") that are greater than a certain length, an optional static score is used (if it's higher than the Jaro-Winkler score), so that cases that really look like abbreviations can be weighted as more similar than the Jaro-Winkler score might indicate. We also use the square of the larger of the two term scores in place of their product, since we're effectively considering these words an exact match, and we calculate an appropriate offset for the lower-scoring vector's L2 norm to compensate.

  4. Acronym alignments: especially prevalent in universities, non-profits, museums, government agencies, etc. We provide a language-based stopword-aware acronym alignment method which can match "Museum of Modern Art" to "moma" (no capitalization needed), "University of California Berkeley" to "UC Berkeley", etc. If tokens in the shorter string are an acronym for tokens in the longer string, all of the above are included in the similarity score with a 1.0 local similarity (so those tokens' scores will be counted as evidence for a match, not against it). For cases like "AB" vs. "A B" where the scores may be significantly different between the combined version and the separate single letters version, we take the greater of a) the acronym score in the shorter string or b) the L2 norm of the individual words for the longer string, and again use an offset for the norm of the lower-scoring vector.

  5. Known-phrase alignments: similar to acronym alignment, libpostal's dictionaries are used at deduping time as well, and it two strings contain known phrases with the same canonical, those phrase spans are given the maximum similarity (product of L2 norms of both).

  6. Multi-word phrase alignments for spacing variations: again similar to acronym alignment, but for handling spacing variations like "de la Cruz" vs. "dela Cruz" or "Sea Grape Dr" vs. "Seagrape Dr" as exact matches. In a token-based approach, the first word in a phrase might match on Jaro-Winkler similarity alone, but the subsequent words in the string with spaces would count against the similarity value. Having multiple words align to their concatenated form pays no penalty for the whitespace.

  7. Match demotion for differing sets of initials: depending on the term weighting scheme used and the type of names in the corpus, the weights for single letters (found in human initials) may have a low enough weight that they don't affect the similarity much. This is especially true for information gain, where a single letter like "A" or "B" may cooccur with many different names and end up conveying no more information than a common stopword like "the" or "of". When this is true, "A & B Jewelry" vs. "B & C Jewelry" might generate a false positive because the scores for "A" and "C" are so low compared to "Jewelry" that they are ignored. In cases where there's a likely dupe and, given sets of single letter tokens S1 and S2 in the two strings respectively, we demote the match to needs_review if there is a symmetric difference and both sides (S1 - S2 and S2 - S1) are non-empty. Note: the typical case of human names with/without a middle initial will still match under this heuristic i.e. "Yvette Clarke" matches "Yvette D Clarke" because the initial only exists in one string and there is no conflict. However, something like "J Dilla" vs. "K Dilla" would be classified as needing human review.


The above assumes non-ideographic strings. In Chinese, Japanese, Korean, etc. we currently use the Jaccard similarity of the set of individual ideograms instead. In future versions it might be useful to weight the Jaccard similarity by TFIDF/information gain scores as well, and if we ever add a statistical word segmentation model for CJK languages, the word boundaries from that model could be used instead of ideograms.

The fuzzy duplicate methods are currently implemented for venue names and street names, which seemed to make the most sense. The output for these methods is a struct containing the dupe classification as well as the similarity value itself.

For fuzzy deduplication of street names, we implement a special case of "soft set containment". Here, the Soft-TFIDF/Soft-Information-Gain method reports the number of tokens that were matched (where matched means the soft notion of equality under local similarity functions, minus acronym alignments and the sequence alignment method so i.e. only Jaro Winkler and Levenshtein). When one set of tokens is contained within the other i.e. the number of matched tokens is equal to the number of tokens in the shorter of the two strings, we classify the pair as a likely duplicate regardless of the similarity value. This allows e.g. "Park" and "Park Ave" to match without allowing "Park Ave" and "Park St" to match (though they would be classified as needing review under the exact dupe method since they share a root expansion).