Map, Authors in Study by Country, Jack Elliott
Jack Elliott, "Authors in Study by Country," courtesy of author.

The category romances of Harlequin Mills and Boon are an international phenomenon, sold everywhere from cell phones in Japan to railway stations in Europe and written by authors from around the world. The most popular of these categories is Harlequin Presents, which began with authors drawn only from England but is now dominated by writers from the United Kingdom, Australia, and the United States. Because Harlequins are a mass-market product sold around the world, some argue that regional differences in the writing have been erased by Harlequin’s tight corporate control. 1

At a superficial level, this is certainly not the case: writers from North America use North American spelling, and writers from elsewhere prefer British English. But are there other, more significant underlying regional differences? To find out, I’ve taken 185 Harlequin Presents novels published in the lead-up to April this year and grouped them into three regions. The Antipodes region encompasses writers from Australia, South Africa, and New Zealand. The European region combines all writers from the United Kingdom and mainland Europe. The North American region includes both Canadian and U.S. authors.

Now, we could read all 185 novels and try to compare how authors from the different regions use language. This might be a little time-consuming, and it would certainly be out of date by the time we finished! Luckily, one can train a computer to spot differences in word-use. The program I’ve used, Breiman and Cutler’s Random Forests, gives me everything I need to know—an estimate of the accuracy, an explanation of the results, and—most importantly of all—a picture of what it’s found.
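Before a classifier of any kind can compare word-use, each novel has to be turned into numbers. The sketch below shows one common way of doing that, counting how often a handful of marker words occur per 1,000 words of text. It is an illustration only, not the pipeline used for this study: the marker list, the function name `word_rates`, and the two snippets standing in for novels are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical marker words, borrowed from examples mentioned in this article.
MARKERS = ["sydney", "gotten", "toward", "towards"]

def word_rates(text, markers=MARKERS, per=1000):
    """Return each marker's frequency per `per` words of `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on an empty text
    return [per * counts[m] / total for m in markers]

# Invented snippets standing in for whole novels (assumption, not study data).
corpus = {
    "antipodes_sample": "She flew back to Sydney and walked towards the harbour.",
    "north_america_sample": "He had gotten tired and turned toward the door.",
}

# One row of numbers per text; a classifier would be trained on rows like these.
table = {name: word_rates(text) for name, text in corpus.items()}
```

Each row of `table` lines up with `MARKERS`, so the first number is the rate for “sydney,” the second for “gotten,” and so on. A real study would use far more words and whole novels rather than single sentences.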

Graph, Location of Origin, Jack Elliott

It turns out that a machine can easily tell the difference between the three regions, with texts from the Antipodes and Europe closer together and a couple of novels attempting to “break away” from Europe and join the North American group. These are the novels of Karen van der Zee. Born in Holland but married to an American, she has spent “a while” in the United States, and her language use reflects this. (Poor machine! I’d promised it strict categories with clear delineation, something real life cannot provide.)

Graph, Dependency Plot for Antipodes, Jack Elliott

The Random Forests also provide a way to explain how the writing differs by region. The next graph shows the most important differences between the regions from the point of view of the Antipodes. Notice how the red dotted line (the word “Sydney,” a city in Australia) and the orange dotted line (the word “Australia”) move upwards? This means that writers using those two words are much more likely to be from the Antipodes. This isn’t just writing by Australians; it is also writing about Australians.
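For readers curious how a line on such a plot might be computed, here is a minimal sketch of the usual idea behind a dependency plot: force one word's frequency to a series of values in every text, ask the model for its prediction each time, and average the answers. The function `toy_model` is a made-up stand-in for a trained classifier, and its numbers are invented; only the averaging procedure in `partial_dependence` is the point.

```python
def partial_dependence(model, rows, feature, grid):
    """Average model output when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Copy every row, overriding just the one feature being examined.
        scores = [model({**row, feature: value}) for row in rows]
        curve.append(sum(scores) / len(scores))
    return curve

def toy_model(row):
    """Hypothetical 'probability of Antipodes'; NOT a real trained forest."""
    score = 0.2 + 0.15 * row["sydney"] - 0.05 * row["gotten"]
    return max(0.0, min(1.0, score))  # keep the result between 0 and 1

# Invented per-1,000-word rates for three imaginary novels.
rows = [
    {"sydney": 0.0, "gotten": 0.0},
    {"sydney": 1.0, "gotten": 0.0},
    {"sydney": 0.0, "gotten": 2.0},
]

curve = partial_dependence(toy_model, rows, "sydney", grid=[0.0, 1.0, 2.0, 3.0])
```

With this toy model, `curve` rises as the rate of “sydney” increases, which is the same upward movement described for the red dotted line.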

Graph, Dependency Plot for North America, Jack Elliott

On the other hand, the North American region has its own distinctive characteristics. Using the same 10 words I tracked in the previous graph, here is a plot that clearly shows distinctive North American usage. Some of these don’t have much bearing on theme or ideas—the use of “toward” rather than “towards,” for example, or the frequent use of the word “gotten”—but look at the striking preoccupation with time in the North American novels! “Forever” and “anymore” are both words favored by North Americans—although the more workaday “afterwards” is not (that’s a European word).

Incidentally, “never” and “endless” are also favored by North American writers, but they are less important for differentiation.

Sometimes it’s just as instructive to look at the words writers don’t use. We can train Random Forests to differentiate writers based not on the words they use, but on the synonyms they prefer. For example, a writer describing the eye color of their heroine might prefer the word “emerald” to “green,” an important stylistic difference that Random Forests can pick up on. North American writers, for instance, prefer “near” to “close.”
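One simple way to turn a synonym preference into a number is to measure what share of a writer's choices between two rival words goes to one of them. The sketch below does that for “near” and “close.” It is an assumption about how such a feature could be built, not a description of the method used here, and it ignores a real complication: “close” is also a verb, so a careful study would need to separate the senses first.

```python
import re
from collections import Counter

def preference(text, word, rival):
    """Share of `word` among all uses of `word` or `rival` (None if neither occurs)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    total = counts[word] + counts[rival]
    return counts[word] / total if total else None

# Invented sentence standing in for a novel (assumption, not study data).
sample = "He stood near the window, near enough to touch, but not close to her."
share = preference(sample, "near", "close")
```

A share near 1 means the writer almost always picks the first word, a share near 0 means the rival wins, and `None` signals that the text uses neither.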

Graph, Dependency Plot for North America: "Near" "Close" Comparison, Jack Elliott

When talking about subway or mass-transit systems, European writers seem to be setting their writing mostly in London and surrounds, as they prefer “underground” or “tube.”

Graph, Dependency Plot for Europe: Comparison of Transport Terms, Jack Elliott

Antipodean writers, by contrast, prefer the more neutral “metro.”

Graph, Dependency Plot for Antipodes: Comparison of Transport Terms, Jack Elliott

This finding is also a key that unlocks another difference in the writing. Harlequin Presents emphasizes glamorous, exotic locations. For Europeans that often means European capitals with mass transport systems. Australians can imagine nothing more exotic than the quiet efficiency of the Parisian Metro. North American writers, on the other hand, do not heavily employ any synonym for metro in their writing—even the North American “subway” is rarely used. For North Americans, it seems, there is nothing glamorous about commuting.

Jack Elliott

Jack Elliott is a PhD student at the Centre for Literary and Linguistic Computing, University of Newcastle.

