Back when I was starting to study linguistics (in the 1980s), the syntax professors (Ivan Sag and Tom Wasow) made a point of using gender-neutral names in their examples, names that can be used for both males and females, like Chris (Chris Evert and Chris Columbus), rather than John and Mary, which had typically been used in examples. The other day I was curious to see whether Google Ngrams can give us an idea of the extent to which a name or a noun is used for females and males. It can, and here’s what I did.
The basic idea is that there is a strong tendency for pronouns to refer to nearby nouns (and other pronouns, but they aren’t relevant here). For example, in the phrase “Chris and her friends”, her is likely to be interpreted as referring to Chris, unless there is contextual information to the contrary. In “Mary was disappointed that her father Chris and her friends did not get along”, by contrast, her doesn’t refer to Chris but to Mary. With this idea, we can look for sequences of X and (her|his) _NOUN_, where X is the name or noun we are interested in, and _NOUN_ stands for any noun. (See the notes for other types of sequences we could look for.) Here’s what we get when we compare woman and her _NOUN_ with woman and his _NOUN_. (See the notes for how I made the charts.) Not surprisingly, her occurs a lot more with woman than his does.
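To make the search pattern concrete, here is a minimal sketch in Python that builds the two query strings for a given word. The function name `possessive_queries` is hypothetical (I ran my own queries by hand), and the strings simply follow the X and (her|his) _NOUN_ template described above.

```python
def possessive_queries(word):
    """Return the (her, his) Ngram Viewer query strings for a name or noun.

    Hypothetical helper: it only fills in the "X and (her|his) _NOUN_"
    template; it does not contact the Ngram Viewer.
    """
    return tuple(f"{word} and {pronoun} _NOUN_" for pronoun in ("her", "his"))


her_query, his_query = possessive_queries("woman")
print(her_query)  # woman and her _NOUN_
print(his_query)  # woman and his _NOUN_
```

Each string can then be pasted into the Ngram Viewer search box as a separate query.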
Since we’ll want to compare different nouns and names that occur with different frequencies, it is more useful to look at the ratio of one possessive pronoun to the other. In order to make comparison easier, I will always choose as the numerator the possessive pronoun which is (generally) more frequent. So in this case, we’ll look at the ratio of woman and her _NOUN_ to woman and his _NOUN_. (The dashed black line indicates a ratio of 1:1, when the two terms would be used equally frequently.) Now we can see that her is used with woman anywhere from 20 to over 60 times as often as his.
There’s one more factor we should take into account, and that is that his is more frequent than her. Here’s their ratio:
In order to take the disproportionate use of his into account, we need to normalize our noun-possessive ratio by the his/her ratio (or her/his if her is more frequent with the noun we are interested in). These two steps let us compare words of different frequencies, adjusting for the preponderance of his over her.
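The two steps can be written out as a small calculation. This is only a sketch in Python: the function name `normalized_ratio` and its parameter names are my own labels for illustration, and the numbers in the example are made up rather than taken from the Ngram data.

```python
def normalized_ratio(noun_numerator, noun_denominator,
                     pronoun_numerator, pronoun_denominator):
    """Noun-possessive ratio divided by the overall pronoun ratio.

    For a word like "woman", the numerator values are for "her" and the
    denominator values are for "his"; for a word like "man" they are swapped.
    """
    raw = noun_numerator / noun_denominator             # e.g. "woman and her" / "woman and his"
    baseline = pronoun_numerator / pronoun_denominator  # e.g. "her" / "his" overall
    return raw / baseline


# Made-up frequencies for illustration only (not actual Ngram values):
# "X and her _NOUN_" = 400, "X and his _NOUN_" = 10, "her" = 2000, "his" = 4000
print(normalized_ratio(400, 10, 2000, 4000))  # 80.0
```

In this made-up case the raw ratio is 40:1, and because his is twice as frequent as her overall, the normalized ratio doubles to 80:1. A value of 1.0 corresponds to the dashed 1:1 line in the charts.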
Here’s a comparison for woman and man. Again, the dashed line indicates a ratio of 1:1. What we see is that the normalized ratio of her to his with woman (the blue line) ranges from almost 275:1 (in 1860) down to about 35:1 (in 2019). While that variation remains a mystery, the constant is that her is always much more strongly associated with woman than his is. In other words, woman is associated with female gender. Similarly, man is associated with male gender, and the effect is strong (75:1 to 25:1), though not as strong as that of woman with her.
Next up: Mary and John. No surprises here — we see the same kind of pattern as with woman and man.
Before we get to the gender-neutral names, let’s look at a pair of homophones, where one spelling is used for females (Frances) and one for males (Francis). Again, there are no surprises, but it is nice to see that our approach seems to be working.
Now let’s try the gender-neutral names of Chris and Pat (as in Pat Nixon and Pat Riley). Finally we get something different. Let’s take Pat first (the blue line). What we can see is that before 1920, Pat is associated more with the male his (the blue line is below the dashed black 1:1 line), but afterwards Pat is more strongly associated with the female her. Chris is a lot more variable, largely hovering around the 1:1 ratio (indicating a truly gender-neutral usage), but then after about 1985 becoming more strongly associated with the male his. While neither effect is as strong as the others we have seen (with the normalized ratio maxing out at a little more than 4, as opposed to the ratios of over 100 that we saw above), we still see some gender effects, even for these gender-neutral names.
Turning back to nouns, we can compare friend and teacher. Here we see two similar patterns, where both nouns start out weakly associated with his by 1865, but both change to be weakly associated with her, teacher by 1905 and friend by 1940. While we might think there were more female teachers in the first half of the 20th century, the associations are still weak, in fact the weakest we’ve seen.
Of course we aren’t limited to people. Not surprisingly mare is associated with her while stallion is associated with his. More interesting is cat and dog, where cat starts off strongly associated with her, though declining after about 1945. On the other hand, dog is fairly neutral, especially after 1915 or so. In other words cats are commonly associated with female gender while dogs are not particularly associated with either.
For the last chart, we can look at what is probably personification of the moon and the sun. Here we see that moon is strongly associated with her until roughly the 1970s (perhaps because of space exploration?) and remains weakly associated with her after that, while sun moves from a mild association with his to being neutral by 1935.
One thing that is striking across these examples is that association with her seems to be much stronger than association with his, even with the normalization. I have no good idea why that might be. In addition, although we might guess at motives for certain patterns (as with moon), pretty much every noun and name will have its own story.
In the end, it was fun to use this ngram technique to discover gender associations, but the results raise even more questions.
I have known people with all of the names mentioned here, including many family members, in addition to myself (Chris). See also the notes for more discussion.
1. There are certainly other environments that we could use to check for gender association, like Noun + Preposition + Possessor + Noun (cat with her kittens). Reflexive pronouns are another possibility, as in Noun + Verb + Reflexive (man washed himself), but they are less common than possessors. For example, Noun + and + Possessor + Noun is 1.5 – 2.5 times more common than Noun + Verb + Reflexive. The reflexives are also not foolproof, since we find sequences like:
- “The average modern woman married to the average man finds herself constantly trying to adjust herself to changing demands on the part of her husband”
- “Bassianus, the son of a Roman puppet ruler through a British woman, finds himself raised to the insular throne because his people prefer him over his brother of pure Roman descent”
While we could use the syntactic dependencies of the Ngram Viewer to eliminate these and other undesired examples, they are even less accurate than the part of speech tags, which is why I avoid them.
2. To make the charts I manually did simple queries in Google Ngram Viewer using the “English 2019” corpus and using a smoothing of 5 (which calculates a running average over 5 years before and after the target year, so over about a decade). I took 1860 as an arbitrary starting date when published sources seem to be more reliable in Google Books. I then extracted the data using an Apple Shortcut that I wrote, following which I used ggplot2 (in R) to plot the calculated data. While it is possible to make very similar charts directly in the Ngram Viewer, using this two phase process allows me more control over the colors and labels. In addition, I have found that the Ngram Viewer graph isn’t always visually accurate, due to its use of interpolated curves instead of straight lines to connect the data points.
3. Not all names lend themselves to this kind of ngram analysis. For example, Lee, which is another gender-neutral given name, is also a family name, and the prominence of Robert E. Lee in particular throws off any search for Lee as a given name. Another interesting type of example is names that are typically associated with different genders in different languages. For example, Andrea is typically female in many languages but male in Italian and a few other languages.
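For readers curious about the smoothing mentioned in note 2, here is a rough sketch in Python of a running average that looks a fixed number of years before and after each target year. The function name `smooth` is hypothetical, and the choice to average over fewer values at the two ends of the series is my assumption for illustration, not a statement of how the Ngram Viewer is implemented.

```python
def smooth(values, window=5):
    """Centered running average over `window` values on each side.

    Assumption: near the start and end of the series, where fewer
    neighbors exist, the average is taken over whatever values are available.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)                # first index in the window
        hi = min(len(values), i + window + 1)  # one past the last index
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Tiny illustration with a window of 1 (one value on either side)
print(smooth([1, 2, 3, 4, 5], window=1))  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

With a window of 5, each interior point is an average over 11 yearly values, which is the "about a decade" mentioned in note 2.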