Conflict Checking Names

Using The Name Pattern Search Form

by Tanczos Istvan

Introduction

This article started as a handout on how to use the Name Pattern Search Form on the online O&A (currently http://oanda.sca.org/oanda_np.cgi). It is a collection of ideas on one way that the form can be used, not the only possible way the form can be used. This method is what's known as a "phonetic algorithm", similar to the Soundex algorithm that OSCAR uses to suggest possible conflicts (also NYSIIS, Metaphone, and the like), which works by grouping letters that frequently have similar sounds when pronounced.

For this article, we need to understand the concept of a 'character'. A 'character' is one possibly accented glyph from the Latin alphabet, plus Arabic digits (which you won't need), punctuation symbols, and the blank we use for a space. You can think of it as "one glyph that shows up on the screen when I'm typing" if you want to. Accents are an integral part of the character: 'Ü' is a character. It is distinct from 'Ú'. Both are distinct from 'U'. Note that 'space' is included as a character. This all can become important later. Ligatures (Æ, œ) are considered a single character, but see below about the difference between broad and narrow searches.

How The O&A Says The Form Works

The form is about building 'patterns' to 'match' the name we're interested in, while leaving them sufficiently general that conflicting names will also match. For example - "A 'g' sound, followed by a vowel, followed by a nasal" is a decent description of the name "Jon". It is also a description of the name "Jim" and the word 'Gum'. This is roughly what we can get the form to do. Matching is what the computer does - it takes every single name in the O&A and checks it against the pattern it is given. If the pattern could describe that name, the name is said to match. A pattern that matches both 'Jim' and 'Jon' could be written as 'j[aeiou][mn]'. Note that, unless told otherwise, the pattern can appear anywhere in the matched name - the pattern also matches the names 'Arslan Batujin', 'Logan Mersc Macjenkyne', and 'Benjamin Xanthus Ruthendale'.
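To see "matching" in action away from the form, here is a minimal sketch in Python. Python's re module stands in for the form's own pattern engine, the short list of names stands in for the O&A, and ignoring upper/lower case is my assumption, made so that 'Jon' and 'Batujin' are both found.

```python
import re

# A tiny stand-in for the O&A; the real form checks every registered name.
names = [
    "Jon",
    "Jim",
    "Arslan Batujin",
    "Benjamin Xanthus Ruthendale",
    "Tanczos Istvan",
]

# re.search looks for the pattern anywhere in the name, which is the
# "unless told otherwise" behaviour described above.
# IGNORECASE is an assumption, not something the hints page states.
pattern = re.compile(r"j[aeiou][mn]", re.IGNORECASE)

matches = [name for name in names if pattern.search(name)]
print(matches)
```

Only "Tanczos Istvan" is left out, because it has no 'j' followed by a vowel and a nasal.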

If we check the hints page for the name pattern search form (http://oanda.sca.org/hints_np.html), we find a few things that could help us do this conflict checking. These are the clues it gives us:

Letters Letters each match the letter specified in the pattern.

^ A caret character at the beginning of the pattern matches the start of a name. '^Jo' matches "Johan" but not "Cujo". Note that this matches the full name, so '^Jo' would not match "George Johnson".

$ A dollar sign at the end of the pattern matches the end of the name. The pattern 'th$' matches "Smith" but not "Theodore".

. A period character anywhere in the pattern matches a single character of any type, including spaces.

+ One or more of the previous expression, total. So 'A+' matches "A" or "AA" or "AAAAAAAA", but not "B". The pattern "xa+x" matches "xax", "xaax", "xaaax", etc. It does not match "xx".

\S Any non-space character.

[...] a single character from the list that appears inside the brackets. '[ab]+' will match "a" and "b" and "aababba", but not "abc".

Broad searches strip accents and separate ligatures into their component characters ("Æ" is considered as "AE"), narrow searches do not strip accents and keep the ligature as a single character. A broad search for "Jan" will find "Ján", a narrow one will not. Note that this means that the “.” specifier will match a ligature in a narrow search but not in a broad search.

A narrow search also excludes those items in which the name appears only as the owner of an order, title, household, or alternate name, as the target of a cross-reference, name-change, or transfer, or as the designation or joint holder of a badge. This means that "Non Scripta Herald" will not appear in a narrow search for "Istvan". This part doesn't matter for purposes of this article - we're just looking for primary or alternate names, not the owners of items.

Fortunately, the method shown below is usable with both a broad and a narrow search: the pattern used for vowels will match the ligatures in either case, and we don't care about the extra records that the broad search matches.
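The difference between broad and narrow can be imitated with a small Python sketch. This is only my approximation of what a broad search does: the function name broad_form and the ligature table are hypothetical, and the form's real implementation may differ.

```python
import re
import unicodedata

# Hypothetical ligature table; only the ligatures mentioned in this article.
LIGATURES = {"\u00c6": "AE", "\u00e6": "ae", "\u0152": "OE", "\u0153": "oe"}

def broad_form(name):
    """Approximate a broad search: split ligatures, then strip accents."""
    for ligature, letters in LIGATURES.items():
        name = name.replace(ligature, letters)
    # NFD separates a letter from its accent; the accents are then dropped.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

jan = "J\u00e1n"            # "Jan" with an acute accent on the a
aelfric = "\u00c6lfric"     # begins with the AE ligature

print(bool(re.search("Jan", jan)))                      # narrow
print(bool(re.search("Jan", broad_form(jan))))          # broad
print(bool(re.search("^.lfric", aelfric)))              # narrow
print(bool(re.search("^.lfric", broad_form(aelfric))))  # broad
```

The last two lines show the "." note from the hints page: a single "." covers the ligature in the narrow case, but not the two letters it becomes in the broad case.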

How The Form Really Works

Programmers, you have almost certainly been thinking that this all sounds familiar. That's because this form uses regular expressions in Perl, and all the basics of Perl regular expressions work.

Non-programmers, what's going to follow is a quick course in regular expressions. It's not that hard for most people once you get some basic concepts, most of which we've covered already.

Regular Expressions

A regular expression is a pattern that is tested against a series of characters to see if it matches. Really, that's all of it. There are some more things that you can use to specify matches, but not many are useful for this form. Here's a list of the ones not listed above that I find occasionally useful:

[a-m] Matches a single character that is one of the first 13 letters of the alphabet (aka "in the range of a through m")

[^abc] Matches any character except a, b, or c.

* Zero or more of the previous character. The pattern "xa*x" matches "xx", "xax", "xaax", etc.

\b The name should have a word boundary here. "J\S+n\b" will match Jon, John, Jan, Jen, and Juergen but not Jennifer or Jonas.

| Alternative - "a|b" will match either a or b.

( ) Grouping - "s(am|un)" will match "sam" and "sun", but not "san" or "sum". Even more tricky, "ba(na)+" will match "bana", "banana", "bananana", ...

\B The name should not have a word boundary here. "\Bate" will match "Catelin" but not "Atenveldt"
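If you want to convince yourself of the examples in this list, they can be tried outside the form. The sketch below uses Python's re module, which behaves the same way as Perl for these basic pieces; the helper name hits is mine, and ignoring case is an assumption.

```python
import re

def hits(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    rx = re.compile(pattern, re.IGNORECASE)  # ignoring case is an assumption
    return [c for c in candidates if rx.search(c)]

# * allows zero repeats, + needs at least one
print(hits(r"xa*x", ["xx", "xax", "xaax"]))
print(hits(r"xa+x", ["xx", "xax", "xaax"]))

# \b: the n must be the last letter of a word
print(hits(r"J\S+n\b", ["Jon", "John", "Juergen", "Jennifer", "Jonas"]))

# ( ) and |: grouping and alternatives
print(hits(r"s(am|un)", ["sam", "sun", "san", "sum"]))
print(hits(r"ba(na)+", ["ba", "bana", "banana"]))

# \B: "ate" must not be at the start of a word
print(hits(r"\Bate", ["Catelin", "Atenveldt"]))
```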

Checking sound/spelling using Regular Expressions

So, how do you use this?

The easy first way is to check for spelling variants. If we want to conflict check "Tanczos Istvan" for spelling, knowing that the given name (Istvan, since it's Hungarian) was spelled Istvan, Istwan, and Ystvan means that we could give the form a pattern of "[IY]st[vw]an" and it would match all those possible spellings. Note that "(I|Y)st(v|w)an" is an equivalent expression.
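A quick way to check that the bracket form and the grouped form really are equivalent is to run both against the same spellings. This is a sketch in Python rather than on the form itself; "Estvan" is a made-up spelling included only to show something that should not match.

```python
import re

bracket = re.compile(r"[IY]st[vw]an")
grouped = re.compile(r"(I|Y)st(v|w)an")

# The last spelling is invented for this example and should be rejected.
spellings = ["Istvan", "Istwan", "Ystvan", "Ystwan", "Estvan"]

for spelling in spellings:
    print(spelling, bool(bracket.search(spelling)), bool(grouped.search(spelling)))
```

Both patterns also accept "Ystwan", a combination nobody asked for. That is the price of a short pattern, and it is usually harmless.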

Another way to use it is to check for what we'll call sound variants. This method will never be exact until and unless we have a fully phonetic alphabet. The key for this is to cast our net broadly, to find anything which might sound the same, rather than trying to get everything the same. This also lets us wiggle around with the rules, which require either changes to two syllables or substantial changes to one syllable.

Generally, don't bother to specify leading or trailing repeats of non-space characters. The pattern "^\S+[nm]\S* " is effectively the same as the pattern "[nm]\S* " for our purposes. Patterns will match anywhere in the word.

Another thing to think about is places where letters may or may not appear, such as the difference between "Jon" and "John", or between "Eirikr" and "Eric". Since the terminal r in Norse names is frequently silent, these would be pronounced identically, so the pattern needs to be sprinkled with indicators that there could be silent or missing letters.

So, here's the list that I frequently use:

Sounds which do not pair with anything: r, l
Groups: [cskxqz], [td], [pbgj], [nm], [fv]
Vowels: \S+ or \S* (So that you catch all of them, including ligatures.)
Possibly silent letters or missing vowels: \S*
Anything else: \S+ or \S*
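For the programmers: the list above can be turned into a mechanical first draft of a pattern. The sketch below is not how I actually build patterns (I do it by hand and adjust as I go), and the function name draft_pattern is hypothetical. It keeps r and l, replaces each grouped consonant with its group, and lets \S* stand in for vowels and everything else.

```python
import re

GROUPS = ["cskxqz", "td", "pbgj", "nm", "fv"]
UNPAIRED = "rl"  # sounds which do not pair with anything

def draft_pattern(name):
    """Build a deliberately loose first-draft pattern from the sound groups."""
    words = []
    for word in name.lower().split():
        tokens = []
        for ch in word:
            if ch in UNPAIRED:
                tokens.append(ch)
            else:
                group = next((g for g in GROUPS if ch in g), None)
                if group:
                    tokens.append("[" + group + "]")
            # vowels and any other letters are skipped here;
            # the \S* between tokens absorbs them
        words.append(r"\S*" + r"\S*".join(tokens) + r"\S*")
    return " ".join(words)

pattern = draft_pattern("Jon")
print(pattern)
for candidate in ["Jon", "John", "Jim", "Gum", "Karl"]:
    # ignoring case is an assumption about how the form compares names
    print(candidate, bool(re.search(pattern, candidate, re.IGNORECASE)))
```

A draft like this still needs the judgment described in the rest of this article: it knows nothing about silent letters (it would keep the terminal r in "Eirikr"), and it will usually be far too broad or too narrow until you trim it.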

Note that you don't have to fully specify each possible sound in the name - using the form is a balance between over-specifying and under-specifying. Also, be smart about it - the larger groups (particularly cskxqz) do not always need to all be specified, due to pronunciations of the specific letters.

I generally try to get between 30 and 60 matches - that's not too many to look through quickly, and the pattern isn't so specific that we're missing potential conflicts.

Some examples

All of these names are real examples taken straight from the East Kingdom's November 30, 2013 Letter of Intent; they are not contrived or invented for this article. I'll explain the process as I go. Also, note that when I say that the search form returns N matches, that's the number of matching records, including armory (which is stored under the submitter's primary name, so it adds to the count for that name). The pattern "Tanczos Istvan" matches 5 items because Istvan has 5 items registered, only one of which is his primary name.

Any three-word name can be represented by the pattern "^\S+ \S+ \S+$". That reads "The start of the name followed by one or more non-space characters, followed by a space, followed by one or more non-space characters, followed by a space, finishing with one or more non-space characters, followed by the end of the name." We can't do this without the ^ and $, or it would match any name that had three or more parts. Examples: Ronald Wilson Reagan. Barack Hussein Obama. One that does not fit is "Tanczos Istvan" since that has only two words.
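The effect of the ^ and $ is easy to see in a small Python sketch. The four-word name is invented for this example; the other two appear elsewhere in this article.

```python
import re

anchored = re.compile(r"^\S+ \S+ \S+$")
unanchored = re.compile(r"\S+ \S+ \S+")

names = [
    "Tanczos Istvan",                  # two words
    "Diederik von Wolffhagen",         # three words
    "Diederik von Wolffhagen of York", # hypothetical name with more than three
]

for name in names:
    print(name, bool(anchored.search(name)), bool(unanchored.search(name)))
```

Only the three-word name passes the anchored pattern, while the unanchored one also accepts the longer name.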

Borujin Acilaldai - I started with "^[pb]\S+r\S+ [jgc]\S+[nm]$", then I realized that the name wasn't "Boru Jin". I next tried "^[pb]\S+r\S+ [jgc]\S+[nm] \S*l\S+l[td]", which returns no names and still isn't "Boru Jin." Finally, "^[pb]\S+r\S*\S+[nm]* \S*l\S*l" returns 50 matches, which is not a bad list to visually inspect.

Danika of Stonemarche - The pattern "^[dt]\S+[nm]\S+ *\S+ *[csk]\S*[td]\S+[nm] " produced 431 items, too many to check. Adding a very little bit to the end, "^[dt]\S+[nm]\S+ \S+ [csk]\S*[td]\S+[nm]\S*r" produces 20 matches. "Daniel of Stonemarche" is not a conflict - a two-syllable name element (Daniel) is substantially different than a three-syllable name element (Danika). Because of the rules around prepositions, we must also check "^[dt]\S+[nm]\S+ \S* *[csk]\S*[td]\S+[nm]\S*r"

Diederik von Wolffhagen - "^[dt]\S+r\S+ \S+ \S*[fvw]\S*[nm]" produced 240 matches. "^[dt]\S+r\S+ \S+ \S*[fvw]\S*l\S*[nm]" produces a manageable 53. Because of the rules around prepositions, we must also check "^[dt]\S+r\S+ \S* *\S*[fvw]\S*l\S*[nm]".

Else vom Schnee - "l\S*[csz]\S* \S+ \S*[cskxqz]\S*[nm]" produces 1365 matches, only 500 of which will be displayed. "^\S*l\S*[cskxqz]\S+ \S+ \S*[cskxqz]\S*[nm]\S+$" produces 456. Sometimes, you just have to slog through. Alternately, we can be smarter: "^\S*l\S*[csz]\S+ \S+ \S*[cskxqz]\S*[nm]\S+$" produces 397 matches. Closer. "^\S*l\S*[csz]\S+ [vf]\S+[nm] \S*[cskxqz]\S*[nm]\S+$" produces 50. Prepositions again: "^\S*l\S*[csz]\S+ \S* *[cskxqz]\S*[nm]\S+$" - which is too many.

Francisco Sanchez Pancho - "^[fv]\S*r\S*[nm]\S+ \S+[nm]\S* [pb]\S+[nm]" matches three items.

Karl von Weisbaden - I started with "[ckq]\S*r\S*l \S+ \S+[bp]\S*[td]\S*[nm]$", which matched nothing. The pattern "[ckq]\S*r\S*l \S* *\S+[bp]\S*[nm]$" matches 2 - probably ok, but let's try to be a bit more broad. "[ckq]\S*r\S*l \S* *\S+[pb]\S+[nm]" matches 4. "[ckq]\S*r\S*l \S* *\S+[pb]" matches 34.

Nikolaus Johann Claus - "[nm]\S*[ckq]\S*l\S* \S*[nm] [ckq]" matches 19 items.

Shoshanah Gryffyth - Tricky, since there are so many sibilants. I first tried "[nm]\S* [pbjg]\S*r\S*[fv]", which returned too many matches. "^[scz]\S+[nm]\S* [pbjg]\S*r\S*[fv]" produced 72 matches.

Þórý Veðardóttir - How do I manage the Norse characters? I avoid them. The pattern "r\S* *[fv]\S+r\S+[dt]\S+r$" produces 65 matches.

Eirikr Þorisson - Unvoiced and possibly missing letters. Whee! Fortunately, I can combine some of them with the adjacent \S+, since the combination \S+\S* is equivalent to \S+. Let's try "^\S*r\S+[ck]\S* \S+r\S+[nm]$". Unfortunately, this produces 365 records (lots of armory). It shows us why we want to check these: Eirikr Thorinsson was registered in 1987 in An Tir, Eiríkr Thórisson was registered in 1997 in Calontir. "^\S*r\S+[ck]\S* \S+r\S*[cskxqz]\S+[nm]$" only gives 239.

Things that it takes time to learn

One of the easier ones is which part of the name to concentrate on: concentrate on the less common part of the name. 'John Mazjewska' should have most of the checking done around the surname, not the given name.

How much of the name should you specify in the pattern? This is also a balancing act. Too much specification can lead to missing conflicts.

How to handle articles and word breaks (SENA: "Names are compared as complete items, so that Lisa Betta Gonzaga conflicts with Lisabetta Gonzaga, although the elements are different.") This is where spaces matter. A pattern of " *" anywhere there could be a space can work for an optional space. My usual practice is to check both options. A recent case involved "Rónán Lestrange" and "Rowland le Strange" - I found the conflict by checking the surname both with and without the space. This is also a situation where concentrating on the other part of the name can make things easier.
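Here is the optional-space trick tried on that pair of names, as a sketch in Python. The surname patterns are ones I wrote for this illustration using the sound groups, not the exact patterns from the original search, and ignoring case is an assumption.

```python
import re

names = ["R\u00f3n\u00e1n Lestrange", "Rowland le Strange"]

# Three versions of the same surname pattern, differing only in the space.
no_space = re.compile(r"l\S*[cskxqz][td]r\S*[nm]", re.IGNORECASE)
with_space = re.compile(r"l\S* [cskxqz][td]r\S*[nm]", re.IGNORECASE)
optional_space = re.compile(r"l\S* *[cskxqz][td]r\S*[nm]", re.IGNORECASE)

for name in names:
    print(
        name,
        bool(no_space.search(name)),
        bool(with_space.search(name)),
        bool(optional_space.search(name)),
    )
```

Each of the first two patterns finds only one of the names; the " *" version finds both, which is the point of checking with and without the space.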

Which letters match which sounds? This is actually a very, very difficult problem, since much of it depends on which language you're speaking, and we have to take into account all possible languages, including the Slavic, Oriental, and Uralic families. This is why there's a whole list of algorithms at the top of this article. Even Soundex, the simplest of them, is a multi-step process, and Double Metaphone, a much later and more elaborate algorithm, has far more rules. Give it your best guess. This is part of why commentary is a group activity - we're not all going to catch everything. Having multiple different people doing things multiple different ways is a feature in this situation.

Another of the things that takes a long time to learn is knowing which spelling variations a particular name might be found with in a given language. This one is pure experience. (See the 'Istvan' example from earlier.)

Pitfalls

The pattern "\S" must always be capitalized for it to work properly. The pattern "\s" matches "space characters" - space, tab, newline, and the like. Only spaces usually appear in names.
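The difference between the two is quick to demonstrate. This is a Python sketch, but \S and \s mean the same thing in Perl.

```python
import re

name = "Tanczos Istvan"

# \S+ picks out the runs of non-space characters: the words.
words = re.findall(r"\S+", name)

# \s+ picks out the runs of space characters: the gap between them.
gaps = re.findall(r"\s+", name)

print(words)
print(gaps)
```

With a lowercase \s where \S was meant, a pattern ends up asking for spaces in the middle of a word and will match next to nothing.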

While we do not currently register abbreviations, such as "Mc" for "Mac" and "St." for "Saint", this was not always the case and several of each have been registered. Take care to write a pattern that captures both.

Using ^ anywhere other than the beginning of a pattern, or $ anywhere but at the end of a pattern will not produce the results that you think you want. The pattern "$ab^" is not a meaningless pattern, but I know of no names with $ and ^ in them.

Spaces at the beginning or end of a pattern that you write are hard to notice and will frequently mess up your search.
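Both of the last two pitfalls can be reproduced with a short Python sketch (again standing in for the form's own pattern engine):

```python
import re

name = "Tanczos Istvan"

clean = re.search(r"Istvan", name)
trailing_space = re.search(r"Istvan ", name)   # stray space after the pattern
leading_space = re.search(r" Tanczos", name)   # stray space before the pattern
misplaced_caret = re.search(r"Jo^", "Johan")   # ^ somewhere other than the start

print(bool(clean), bool(trailing_space), bool(leading_space), bool(misplaced_caret))
```

Only the first search finds anything. The stray spaces ask for a space that the name does not have in that position, and the misplaced ^ asks for the start of the name to come after "Jo".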

2020 Addendum: OSCAR and the NP link

Near the mid-point of the year, OSCAR added a link present on all new name submissions for Laurel commenters. This link reads "NP" if you can see it and will take you to the name pattern search form already filled out for the given name. It doesn't use the same letter patterns used here, and it does the entire name. Use it as a starting point. Edit the pattern to get what you need.

Last revision: Sunday, Feb 2, 2025
