[Updated 2023-02-17 with chart]
Finding names automatically in texts is hard! (This is a more technical post.)
Natural language processing (NLP) has come a long way, and what it can do is amazing. However, it still has quite a ways to go, at least for the kinds of things I would like to do.
Previously I talked about my Photo-Era Search tool, which allows you to search the Photo-Era magazine, published from 1898 to 1932. One of my motivations in looking at Photo-Era is to get a better idea of the role that women photographers played in one aspect of the history of photography by examining how they are mentioned in this magazine. In order to do that, I need to be able to identify the names of photographers and determine (guess, really) which of them were women. This was a long and tedious process.
I started by downloading the text from Internet Archive and parsing it using spacy. (In what follows I’ll talk about various problems, but these are NOT criticisms of spacy, but of the state of the art in natural language processing.) Spacy will identify some strings as the names of people (as an aside, it is really hard not to anthropomorphize and say that spacy identifies strings it “thinks” are names — but I’ll try). However, there are many examples of things that aren’t names that need to be excluded. For example:
- Names of non-people, like “Kodak”
- Common noun phrases referring to people, like “Concert Pianist”
- Phrases that don’t refer to people, like “Good Luck”, “Inner Tube”
- Run-ons with multiple names, like “Madonna Leonardo Da Vinci”
- Run-ons containing names of people with non-people, like “Arthur M. Underwood Second Place”
- Names combined with OCR errors, like “Raymond A. Jhe” (instead of “Raymond A. Wohlrabe (newline) The traveler bound for China …”)
Many of these false positive errors are understandable, especially the run-ons, which are often due to OCR problems, including not handling tables of contents properly. Eliminating the false positives involved an interactive process of seeing some in the output and filtering them out, then seeing ones I had overlooked, and repeating. Many times, until I was cross-eyed.
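To give a flavor of what this filtering looks like, here is a minimal sketch (not the actual code behind the tool). It assumes the candidate strings have already been pulled out of spacy's output, for example from the entities labeled PERSON. The blocklist, the trailing-phrase pattern, and the function name `clean_candidate` are all made up for illustration, using the examples above.

```python
import re

# Hypothetical blocklist, built up by hand while reviewing the output.
# The entries are the false positives mentioned above.
NOT_PEOPLE = {"kodak", "concert pianist", "good luck", "inner tube"}

# Hypothetical pattern for junk that gets glued onto a real name,
# e.g. from a contest listing or a table of contents.
TRAILING_JUNK = re.compile(r"\s+(?:First|Second|Third) Place$")


def clean_candidate(text):
    """Return a cleaned-up name, or None if the candidate should be dropped."""
    name = " ".join(text.split())        # collapse newlines and extra spaces
    name = TRAILING_JUNK.sub("", name)   # strip "Second Place"-style suffixes
    if name.lower() in NOT_PEOPLE:
        return None
    return name


candidates = ["Kodak", "Good Luck", "Arthur M. Underwood Second Place",
              "Gertrude Kasebier"]
kept = [c for c in map(clean_candidate, candidates) if c]
print(kept)
```

Rules like these only catch what you have already noticed. A run-on of two real names, like “Madonna Leonardo Da Vinci”, sails right through, which is why the loop of inspecting and filtering has to be repeated.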
There are also false negatives, names that spacy does not identify as names. These are harder to deal with than false positives, since we can see the false positives in the output, but (obviously) not the false negatives. One class of false negatives is Title + Name (like “Mrs. Kasebier”). However, we can guess these ourselves by taking the common titles (Miss, Mrs., Mr., etc.) and looking for sequences of capitalized words following those titles. The search tool actually does a simplified version of this by optionally combining titles with known last names. Much harder false negatives are names that for some reason just slip by spacy. Some of these are due to OCR errors (like “Mary 0 [zero] Sampson” instead of “Mary O Sampson”). Others are baffling, like not identifying “Frances Benjamin Johnston”. That one is particularly annoying because not only was she a woman, but she was very prominent in her time, which is why I had my eye out for her. When I didn’t see her in the output, I suspected there was a problem. I have identified a few other overlooked names, but of course I don’t know which ones I missed, a case of the (in)famous “unknown unknowns”.
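The Title + Name heuristic can be sketched with a regular expression. This is only an illustration under simple assumptions: the title list is the three titles named above, a "capitalized word" is an uppercase letter followed by lowercase letters (or a single initial with a period), and the sample sentence is invented. The names `TITLE_NAME` and `find_titled_names` are hypothetical.

```python
import re

# Titles to look for; extend as needed.
TITLES = ["Miss", "Mrs", "Mr"]

# A title, an optional period, then one or more capitalized words or initials.
TITLE_NAME = re.compile(
    r"\b(?:" + "|".join(TITLES) + r")\.?"
    r"(?:\s+(?:[A-Z][a-z]+|[A-Z]\.))+"
)


def find_titled_names(text):
    """Return every Title + Name sequence found in text, whitespace-normalized."""
    return [" ".join(m.group(0).split()) for m in TITLE_NAME.finditer(text)]


# Invented sample sentence, not a quotation from Photo-Era.
sample = "Prints by Mrs. Kasebier hung beside those of Mrs. Henry Snowden Ward."
print(find_titled_names(sample))
```

A pattern this simple will run on when the next sentence or heading starts with a capitalized word and there is no punctuation in between, and it cuts off names with internal capitals, so its output still needs the same kind of filtering as spacy's.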
My aim here is not to do a complete error analysis, but just to show some of the common types of false positives and false negatives.
The case of Title + Name brings up the issue of variants: different forms of a name (and different names) which refer to the same person. For the purposes of keeping track of the photographers, it is important to know that “Mrs. Kasebier” and “Gertrude Kasebier” are the same person. In fact all of these names refer to the same person, Gertrude Kasebier (these variants are not used in the public Photo-Era Search tool, but they are used in our private version):
Mrs. Kisebier, Gertrude Kase, Mrs. Gertrude Kassebier, Mrs. Gertrude Kiasebier, Mrs. Kase, Mrs. Gertrude Kiisebier, Mrs. Gertrude Kasebier, Mrs. Kiasebier, Mrs. Gertrude Kaesebier, Mrs. Kasebier, Mrs. Kasestates, Gertrude Kisebier, Gertrude Kaesebier, Mrs. Kgsebier, Mrs. Kaesebier, Mrs. Kiisebier, Mrs. Gertrude Kesabier, Mrs Kasebier, Gertrude Kiisebier, Gertrude Ka, Mrs. Kesebier, Gertrude Kiasebier, Gertrude Kasebier, Mrs. Gertrude Kesebier, Mrs. Keesebier
Some of the variants are typos or other errors in the original text (like “Kisebier” instead of “Kasebier”); some of the variants are OCR errors (like “Kgsebier”); and some are just the normal variation of names (First name + Last name, Title + First name + Last name, Title + Last name). The typos and errors have to be checked manually, while the normal variation can be done automatically. This goes for finding variants in the first place as well as identifying them. To resolve Title + Last name, I look for the full name in the same issue, checking not only the last name, but the gender. One particularly difficult type is when a woman uses her husband’s name (like Mrs. Henry Snowden Ward = Catherine Weed Barnes Ward). Sometimes that resolution can be done automatically, but most often it has to be done through (my) world knowledge.
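Resolving Title + Last name within an issue might look roughly like the sketch below. It is an illustration, not the tool's code: it assumes each issue comes with a list of (full name, gender) pairs, and the function name `resolve_title_last`, the gender codes, and the sample issue contents are all made up.

```python
# Gender implied by each title (period stripped before lookup).
TITLE_GENDER = {"Miss": "F", "Mrs": "F", "Mr": "M"}


def resolve_title_last(short_name, issue_names):
    """Match e.g. "Mrs. Kasebier" to a full name found in the same issue.

    issue_names is a list of (full_name, gender) pairs.
    Returns the full name if exactly one candidate matches on both
    last name and gender; otherwise None (left for manual review).
    """
    tokens = short_name.split()
    gender = TITLE_GENDER.get(tokens[0].rstrip("."))
    last = tokens[-1]
    matches = [full for full, g in issue_names
               if full.split()[-1] == last and g == gender]
    return matches[0] if len(matches) == 1 else None


# Invented issue contents, for illustration only.
issue = [("Gertrude Kasebier", "F"),
         ("Henry Snowden Ward", "M"),
         ("Catherine Weed Barnes Ward", "F")]

print(resolve_title_last("Mrs. Kasebier", issue))
print(resolve_title_last("Mr. Ward", issue))
```

Checking the gender as well as the last name is what keeps “Mr. Ward” and “Mrs. Ward” apart here. Returning None whenever there is no match, or more than one, leaves those cases for a human to settle.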
The final issue for the names is trying to figure out which ones refer to women and which ones to men. Good clues are titles like Miss, Mrs., Mr. Other titles such as Doctor and Professor also seem to refer only to men in Photo-Era, even though there were female doctors, at least, in that time period. First and middle names are a bit trickier: while many names are used for one gender or the other, others are used for both (like Chris and Lee, and family names used as middle names, like Weed above), including ones that we (well, I) find surprising. For example, M. Frank Kimball was a woman. I semi-automatically compiled a list of name-gender associations, and used that in conjunction with titles to resolve genders. So if the text had “Miss M. Frank Kimball” (it doesn’t), then we would know that the person was a woman, despite Frank usually referring to a man.
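Here is a minimal sketch of how titles and a name-gender list can be combined, with the title taking priority. The tiny dictionaries, the manual-override table, and the function name `guess_gender` are hypothetical stand-ins for the much larger, semi-automatically compiled list.

```python
# Gender implied by a title (lowercased, period stripped).
TITLE_GENDER = {"miss": "F", "mrs": "F", "mr": "M"}

# Hypothetical fragment of a name-gender list. Ambiguous names
# such as Chris and Lee are deliberately left out.
NAME_GENDER = {"gertrude": "F", "catherine": "F", "frank": "M", "henry": "M"}

# Hypothetical manual overrides for people known from outside knowledge.
KNOWN_PEOPLE = {"M. Frank Kimball": "F"}


def guess_gender(name):
    """Return "F", "M", or None if the name gives no usable clue."""
    if name in KNOWN_PEOPLE:
        return KNOWN_PEOPLE[name]
    tokens = [t.strip(".").lower() for t in name.split()]
    if not tokens:
        return None
    if tokens[0] in TITLE_GENDER:        # a title outranks the first name
        return TITLE_GENDER[tokens[0]]
    for tok in tokens[:-1]:              # skip the last name
        if tok in NAME_GENDER:
            return NAME_GENDER[tok]
    return None


for n in ["Miss M. Frank Kimball", "M. Frank Kimball",
          "Gertrude Kasebier", "R.W. Dawson"]:
    print(n, "->", guess_gender(n))
```

Without the override, “M. Frank Kimball” would come out as a man, which is exactly the kind of case that needs a human check. Names with only initials return None and stay unidentified.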
[Update with chart]
Here’s the distribution of names by gender over the duration of Photo-Era. The big drop-off at the end is largely due to the final volume (volume 68, 1932) having only 3 issues. As you can see, there are still a lot of names that I have not identified for gender. These are ones with only initials and a last name, like R.W. Dawson, who is mentioned 30 times, the most of any unidentified name.
So, to find the names of women and men in Photo-Era, it took state-of-the-art NLP (spacy), plus additional programming (by me), plus a lot of manual identification and verification (also by me). Whew! However, I am optimistic that this work will pay off by enabling various kinds of analyses. In the meantime, it has already helped us identify some interesting women, both photographers and non-photographers. Lee will have more to say about them. So stay tuned!