What makes a good methodology? Well, the first step is to have a clear methodology. And it probably pays to understand the topic. Matthew Britt and Jaaron Wingo, authors of “Christ Before Jesus,” fall short in both departments. This is part two of my critique of their work.
One of the things that gave some promise to Britt and Wingo’s work was that they anticipated potential concerns others would have about their data. They wanted to provide some assurance that their analysis was legitimate. However, their answers to these concerns only raised further issues.
The first question they tackle is whether the texts they are using are too short. They attempt to justify their choice to run a chapter-level analysis, but what becomes clear is that they don’t know the limitations of stylometric analysis.
Britt and Wingo even mention that texts can be too short, and to calm such objections, they state that in some cases, such as Second and Third John, they combined the texts. Yet they imply that the average of 400 words per chapter in the Pauline texts is suitable.
Is it suitable though? No, not even close. The first issue is that their numbers are off. Along with the two Johannine Epistles, they claim that Philemon also fell below the 400-word average. Yet that’s not true. In most English versions of the Bible, Philemon is somewhere between 430 and 490 words.
When looking at the Pauline texts in English, that would fall within the average range, which sits right around 489 words per chapter on the lower end of the spectrum; depending on the translation, the average can rise to over 500 words per chapter. But maybe they were actually talking about the original Greek text.
If we ignore the English and go to the Greek, we get an average of around 372 words per chapter. Philemon would be below this average, coming in at 335 words, but it’s not a significant difference; the gap is about the same size as the gap between the actual average word count and the 400 words Britt and Wingo claimed.
Now, to be transparent here, to arrive at the average I’m using, I’m including all of the epistles traditionally said to have been written by Paul, excluding Hebrews. Could one get the average closer to 400 by excluding certain letters? Sure. And this is one of the problems that Britt and Wingo’s work suffers from: they aren’t clear on how they arrived at their averages, or what they are including or excluding.
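Just to make that point concrete, the calculation at issue is a simple one, and here is a minimal R sketch of it. I’m not supplying any counts; the whole point is that the resulting average depends entirely on which letters, and which text, you decide to feed in.

```r
# A minimal sketch of the per-chapter average in question. 'word_counts' and
# 'chapter_counts' are named vectors you fill in from whichever edition or
# Greek text you are counting; no figures are assumed here.
avg_words_per_chapter <- function(word_counts, chapter_counts) {
  sum(word_counts) / sum(chapter_counts)
}
# Dropping the shortest letters (or Hebrews) changes both sums, and therefore
# the average -- which is why spelling out what was included matters.
```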
But moving to a second point, even if we take an average of 400 words as accurate, or even if we double it, we still fall short of the word counts we actually need. To get an accurate read, we need around ten times as many words.
Looking at Eder, one of the creators of Stylo, in his 2015 article, “Does Size Matter?,” he argues that samples below 5,000 words “provide a poor ‘guessing,’ because they can be immensely affected by random noise.” For samples below 3,000 words, he says the results are “simply disastrous.”
The success point he found was between 5,000 and 10,000 words. In a later 2017 study, “Short samples in authorship attribution,” Eder does show that in certain cases one can reduce the number to around 2,000 words and still, at times, get accurate readings. But this is only when one is comparing a single text to an entire corpus.
So in practice, this would be taking the entire Pauline corpus, and then comparing single texts to the whole. As a further step, each text is re-sampled multiple times. However, an important point here is that this method doesn’t always work.
First, for this method to be successful, you need texts that have a clear authorial fingerprint. In some cases, that doesn’t exist, and much larger sample sizes are needed; even then, correct attribution may still not happen. Second, you need a sizable corpus known to be by a certain author that a text can be compared to. You also have to know what category an analyzed text belongs to. And again, if the text being analyzed happens to lack a clear authorial fingerprint, regardless of whether it truly is by the same author, there is a severe risk of misclassification.
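To make that re-sampling step a bit more concrete, here is a rough R sketch of the general idea, my own illustration rather than Eder’s exact procedure, with the sample sizes as arbitrary placeholders: draw repeated random word samples from a short text, attribute each one against the reference corpus, and see how consistently they land with the same candidate author.

```r
# A rough illustration of re-sampling a short text. 'tokens' is a character
# vector of the text's words; each replicate is one random sample that would
# then be attributed against the reference corpus. Consistency across the
# samples is what gives (or withholds) confidence in the attribution.
resample_text <- function(tokens, sample_size = 2000, n_samples = 100) {
  replicate(n_samples,
            sample(tokens, size = sample_size, replace = TRUE),
            simplify = FALSE)
}
```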
So even though one could theoretically use a smaller sample size (which is still multiples higher than what Britt and Wingo suggest), it wouldn’t matter here as this specific sort of analysis isn’t being used by the authors.
Pivoting from here, we can also look specifically at most frequent words, or MFW. In the 2017 article, “Understanding and explaining Delta measures for authorship attribution,” published in Digital Scholarship in the Humanities, the authors found that Eder’s Delta peaks at around 1,000-1,500 MFW. Obviously, if the samples being examined are less than 1,000 words, and average closer to 400, there is going to be a serious problem here.
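For readers who haven’t worked with these measures, it may help to see what a Delta actually computes. Below is a minimal R sketch of classic Burrows’s Delta, the measure this whole family builds on (Eder’s variants tweak the weighting and the frequency transform, but the dependence on the MFW list is the same): take the relative frequencies of the top n most frequent words, z-score each word across the corpus, and average the absolute differences between two texts.

```r
# A minimal sketch of classic Burrows's Delta over the top-n MFW.
# 'freqs' is a matrix: rows = texts, columns = relative frequencies of the
# n most frequent words in the corpus.
burrows_delta <- function(freqs, text_a, text_b) {
  z <- scale(freqs)                     # z-score each word (column) across all texts
  mean(abs(z[text_a, ] - z[text_b, ]))  # mean absolute difference of z-scores
}
```

The number of columns in that matrix is the MFW setting, which is exactly why the gap between 100 and 1,000-1,500 matters so much.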
The research is quite clear here: 400 words is not enough. It’s nowhere close to enough. Yet this is something that Britt and Wingo appear to understand. Earlier in Chapter 7, when discussing the analysis they ran on their own book, they state clearly that “when a text or number of texts are too short for Stylo to analyze, they will cluster together far away from everything else.”
The chapters in their book that they claim are too short were their “Note from the Authors” and the Introduction. Yet the Note is 494 words, while the Introduction is 434 words. Both fall within the average number of words they claim for the chapters they were analyzing in the Pauline Epistles. Both are longer than the Greek text of Philemon, with even the Introduction being about 100 words longer. Even in English, both are longer than Philemon.
Now, let’s be fair here. At the beginning of the book, they do claim to be using the NIV translation, and in that case, Philemon is at 480 words, so more than the Introduction, but still short of their Note. Yet, in Chapter 7, they also use the American Standard Version, which has Philemon at 438 words, so basically on par with their Introduction.
Clearly there is a problem here, one that Britt and Wingo acknowledge yet don’t actually factor into their study. And that point really needs to be hammered home: they state that those two chapters of their book were too short for Stylo. This in itself invalidates their entire argument.
One final thing before we move on to the next concern. As discussed, when looking at MFW, to get an accurate read with Eder’s Delta, the formula Britt and Wingo state they are using, you need between 1,000 and 1,500 most frequent words. How many do the authors use? If the analysis they did on their own book is any clue, just 100.
Concern over Methodology
The second concern that the authors addressed was in regards to how they ran their analysis. They anticipated that people would ask whether they shouldn’t have run different variables. To counter this, they claim that they didn’t start off trying to prove themselves right, but just wanted to see what they could find.
Because of this, they claim that the approach they developed came out of trial runs they conducted and “a number of academic papers regarding Stylo.”
So what did these trials consist of? It seems that they were largely just the two of them playing around with Stylo. They claimed to have run every delta option and a variety of options for n-grams (both for characters and words), as well as different MFW values.
There are a whole lot of issues here, and much of it comes back to transparency. First, we aren’t told what academic papers on Stylo they are referencing. We are simply told that, yes, they’ve done their research. In many ways, this seems like a logical fallacy: an appeal to authority, and in this case, to some vague sense of authority.
What makes this worse is that the authors continually state that the reader can replicate their analysis and see if the results are accurate. Yet, we aren’t given any data that would allow us to do that.
Having tried to look for academic papers that dive into Stylo, I haven’t found anything besides the documentation by the authors of the program, provided on GitHub, which consists of a somewhat basic how-to manual. There is definitely an expectation that you have some background in stylometry in general. So while Stylo is still a relatively low bar for entry into computational stylometrics, there is still a large learning curve, and simply playing around with the program will not suffice.
Expanding my search outside of academic papers, I was able to find a few basic blog posts that mention some initial testing, but they didn’t dive into methodology. In order to get some clarification, I did reach out to the authors. Their response did lead me to uncover two studies, though only indirectly.
For the most part, though, the answer was the same as in the book. They claimed they have a large folder on their desktop, but because people have tried to catch them on copyright violations for sharing the articles, they can’t share them anymore. The bigger issue, though, is that their response moved the goalposts.
In their book, they make the very specific claim that they used academic papers on Stylo. In their response to me, they broadened the claim to cover stylometry as a whole, as well as other analyses of texts.
They did share two links to collections of works by the Computational Stylistics Group, who created Stylo. However, much of that work had nothing to do with Stylo itself, beyond basic introductions. Other pieces were intensely focused on interesting material that was not relevant here. Still others were more focused on the topic of translation, which can be fascinating, but again, not really relevant.
Now, for anyone who would be interested in this sort of information, a quick search for “Computational Stylistics Group GitHub” will get you to their site.
If you’re interested in stylometry, I highly recommend their website. In fact, it has been one of the sources I frequently went back to while doing my own research. The problem for Britt and Wingo, though, is that most of the relevant information in these articles calls their own methodology into question. If we recall, Eder’s own articles demonstrated that for an effective analysis, word counts had to be several times higher than what these authors were trying to use.
However, Britt and Wingo did send me one other link, to an article by Jacques Savoy titled “Authorship of Pauline Epistles Revisited,” published in 2019. While this article does not mention Stylo in any form, Britt and Wingo assured me that the author must have known about the program because he cited one of Eder’s papers.
There’s a problem though. The paper Savoy cited is the same 2015 article I cite above, the one that demonstrates that a length of 5,000 words is needed to make a reliable attribution. That is the very sense in which Savoy cites the paper, because he recognized that even the full texts of the Pauline Epistles were short. For this very reason, he states in his conclusion that “these attribution results must be taken with caution because ten epistles over fourteen have less than 5,000 words.”
Complicating this, Britt and Wingo’s response to my email stated that it wasn’t necessarily about finding a paper on Stylo, but more about finding one that used Stylo and then looking at the methodology. While Savoy didn’t use Stylo, I figure it’s fair to compare his methodology with that of these authors, as they were, after all, the ones who suggested the study.
Interestingly enough, according to Savoy and the Greek text he used, Philemon is 388 words. Based on this, with it being only 12 words off the average they claimed, there should have been no issue with running it in the analysis. Setting that aside, and acknowledging that Savoy’s study would differ from what would be done in Stylo, we still see some glaring issues.
One, when displaying his figures, Savoy includes the entire figure. That is, he doesn’t cut off the metrics, while Britt and Wingo do. Savoy explains on page 6 of his article that the distance between texts, ranging from 0 to 1, indicates how similar they are: texts at 0 are identical, while texts at 1 have nothing in common. Savoy likens a distance of 1 to comparing a text in English with one in Ancient Greek. Clearly, the distance scale tells us a lot. Why Britt and Wingo would cut it out is still a mystery.
Savoy also argues on page 8 that a minimum of 200 most frequent terms is needed, and that somewhere between 200 and 500 is ideal. This again is in contrast to Britt and Wingo, who, as best as one can tell from the one full graph they provided, used just 100.
Based on even just this small sample of methodology, there are some clear issues. If Savoy’s is a legitimate study according to Britt and Wingo, and one should assume that it is since they reference it, one comes to the conclusion that their own methodology is sketchy at best, as it completely ignores what he highlights.
But as I mentioned, there were two studies that I stumbled upon after speaking with Britt and Wingo. Both studies were by James O’Sullivan, with one being coauthored by Rachel McCarthy.
The first study, published in 2020 by O’Sullivan and McCarthy, is titled “Who Wrote Wuthering Heights.” The second, published in 2021 by O’Sullivan alone, is titled “The Sociology of Style.” What can we gather from these two works about how we should conduct our methodology? Well, again, it’s not all that promising for Britt and Wingo.
Before we get into the bad, though, let’s look at a somewhat positive similarity. O’Sullivan does subscribe to the view that 100 MFW is enough. In his 2020 study, he does state this is a theoretical view, based on the idea that using a small sample of high-frequency words provides a set of words that are “especially resistant to intentional authorial manipulation.”
Here, he is citing David L. Hoover and his 2009 article, “Word Frequency, Statistical Stylistics and Authorship Attribution.” The problem is that O’Sullivan is misrepresenting what Hoover said. The full quote adds the words, “it was assumed that their frequencies should be…” Hoover goes on to say, in the introduction of his article, that recent work has changed this.
O’Sullivan repeats this claim in his 2021 article as well, misrepresenting Hoover in exactly the same manner. So what does Hoover actually conclude? Larger word frequency lists almost always improve the results. Hoover states that in his earlier work, he had expanded his list to 800 MFW, then to 1,200 MFW, and when working with long texts, up to 4,000 MFW. This begins to fall in line with what we saw earlier: with Eder’s Delta, the ideal was 1,000-1,500 MFW.
It also falls in line with Hoover’s specific testing with Delta, which showed that 2,000 MFW produced much more accurate results. Obviously, that isn’t always possible with short texts, but Hoover also shows that 700-800 MFW can greatly improve the results, though more words are still deemed better.
On top of this, Hoover also describes other ways he processes the texts in order to remove background noise. For instance, he talks about removing dialogue from novels, as it prevents a number of problems (except when it doesn’t, because it can add other errors or, at times, take away too much). He also removes pronouns, though he mentions that’s not standard practice. And he removes words that are frequent in a corpus only because they have an exceptionally high frequency in a single text. Doing so removes items that are closely tied to a particular piece of content, which can make a sizable difference.
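To give a sense of what that sort of preprocessing looks like in practice, here is a hedged R sketch of two of the steps Hoover describes. This is my own simplification, not Hoover’s (or Britt and Wingo’s) actual procedure, and the 0.5 cutoff is an arbitrary placeholder: drop personal pronouns, and drop any word whose corpus-wide count is dominated by a single text.

```r
# 'counts' is a matrix of raw word counts: rows = texts, columns = words.
# 'pronouns' is a character vector of the pronouns to exclude.
cull_features <- function(counts, pronouns, max_share = 0.5) {
  keep  <- !(colnames(counts) %in% pronouns)         # drop pronouns
  share <- apply(counts, 2, max) / colSums(counts)   # largest single-text share per word
  keep  <- keep & (share <= max_share)               # drop words driven by one text
  counts[, keep, drop = FALSE]
}
```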
It appears that Britt and Wingo don’t take any of this into consideration, as no culling seems to be occurring. Going back to O’Sullivan, though, and moving past the MFW issue, he also states that their study uses Support Vector Machine classification. SVM is a supervised machine learning algorithm. We don’t need to dive much into this, as we can be certain that Britt and Wingo aren’t using it.
In a YouTube discussion with the Godless Engineer, John Gleason, the authors make it very clear that what they are doing is “not AI, it’s not machine learning, it’s not that kind of stuff.” In addition, they make it clear that they are not running supervised analyses, but instead chose to work in an unsupervised fashion: one that outputs graphs visualizing the distance between texts, and then relies on a researcher to establish the meaning of the data based on their knowledge of the texts.
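To make that contrast a little more tangible, here is a rough R sketch of the unsupervised side of things, an illustration of the general workflow the authors describe rather than Stylo’s exact internals: compute pairwise distances between texts from their MFW frequencies and draw a dendrogram, which the researcher then has to interpret.

```r
# 'freqs' is the same texts-by-MFW frequency matrix as before.
z <- scale(freqs)                             # z-score each word across texts
d <- dist(z, method = "manhattan") / ncol(z)  # pairwise mean |z| differences (Delta-like)
plot(hclust(d, method = "ward.D2"))           # dendrogram; interpretation is left to the researcher
```

Nothing in that output says who wrote what; it only says which texts sit closer together under the chosen settings.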
In contrast, O’Sullivan takes the work of known authors and runs it against a disputed text. However, we do run into some problems here when it comes to Britt and Wingo’s study.
O’Sullivan mentions a few problems in creating a corpus, the primary one being genre. Contrary to Britt and Wingo’s claim, O’Sullivan makes clear that genre can make texts unsuitable samples. More than that, he also highlights the fact that the size of the texts does matter. While he claims that in certain circumstances a word count as low as 1,000 can provide decent results, it’s still imperfect.
Moreover, looking at O’Sullivan’s 2014 study, “Finn’s Hotel and the Joycean Canon,” he cites Eder, who posited the 5,000-word threshold while also stating that samples below 1,000 will not provide reliable results. From this, O’Sullivan claimed that the lowest reliable threshold was 1,000 words, which in his specific case appeared to work. But as he has to admit, there were limitations to his approach.
Suffice it to say, the methodology here does not appear to have influenced Britt and Wingo’s work, and it largely contradicts the effectiveness of their methodology, or what we can gather of their methodology. As a side note, O’Sullivan does provide the metrics with his graphs, which again lends weight to the view that this was not a study used to inform Britt and Wingo.
If we move to the 2021 study, we see much the same thing. In this case, known authors were examined to determine whether families of writers share similar styles, and the answer was a tentative yes. This could in fact complicate such studies even more when dealing with works of debated or unknown authorship, as it suggests that authorial fingerprints may be shared to some degree. If so, it would call into question various views that Britt and Wingo have put forth, largely around the supposed uniqueness of a person’s linguistic style.
What all of this illuminates is that we cannot replicate, or even really gain insight into, Britt and Wingo’s methodology. Instead, it highlights problems with the little we can garner about it.
Beyond this, there is also reason to be suspicious about their understanding of stylometry in general. As I think has been demonstrated, the established methods and guidelines for stylometric analysis have largely been ignored by Britt and Wingo. But it goes further than that. For instance, on page 216, they make the claim that they ran every delta option in Stylo.
This is a slight thing, but it would be more accurate to say they ran every distance option in Stylo. Unless they specifically meant they ran each variation of Delta (and it should be noted that Delta is capitalized, as it’s a proper noun, not a general term in the way Britt and Wingo seem to have used it). But why then ignore other distance measures, such as Manhattan? Based on their phrasing, and the setup of Stylo, it would appear that they simply assumed each distance measure was a “delta option.”
After considering all of this, it’s hard for me not to conclude that they simply found the setup that confirmed their views and ran with it; that, in total, they don’t fully understand stylometry or how to form a proper methodology here, which is why they come across as evasive and vague.
But on top of all of this, they still claim that they “were able to determine the most effective way for each language and not publish an image in this book that uses a method lower than 96% accuracy.” They really push the idea that their results are reliable and that they ran the texts in the best way possible. And to bolster this confidence a bit more, they even encourage their readers to follow up on their research.
The question is how? They don’t explain their methodology. They don’t give citations of the studies they used to develop their methodology. As I’ve shown, the actual studies that use Stylo contradict the basic ideas that Britt and Wingo are suggesting, and the studies done on stylometry in general make it clear that the little we can glean about their methodology is simply wrong.
It would seem, in fact, that they don’t want their readers to try to replicate their results, but instead to simply take them at their word. Even in my own correspondence with them, while they claim this is easy, they also overly complicate the process. For instance, right before claiming it’s easy, they stated that:
“It took a year or two of studying the data science literature, sourcing and cleaning files, understanding the program both from a theoretical perspective and a hands-on-approach, reading the program documentation, and even digging into the codebase in order to develop our approach (we did not modify the software in any way). To get it up and running and test it is easy… but to actually work to validate and promote specific methods took a ton of work.”
So it’s easy, but it also takes years of work, especially since a reader of their book is really left on their own to create their own methodology and approach. Even for someone who has reached out and tried to get additional answers, something they encourage, the responses are vague and do little to clarify anything.
Now, after going through all of this, I decided to ask Britt and Wingo one final question. I was just looking for a bit of clarification about their methodology. To be honest, I expected more vagueness, and because of that, I asked more pointed questions.
To be very fair to Britt and Wingo, they answered all of my questions in this regard. And even though I clearly disagree with much of their work, I have to say that they appeared to be rather friendly, and even though I’ve dissected some of their responses in our email correspondence, they have shown a willingness to actually have these discussions, or so I thought. As for their clarification on their methodology, though, it didn’t reveal a lot.
So what is their methodology here? Well, it’s basically the default procedure in Stylo. They used 100 MFW, Eder’s Simple Delta, no culling, and 1-word n-grams, which is what Stylo does by default. They don’t use sampling either. If you were to open Stylo, the only thing you’d have to change to get these settings is the distance measure, switching it to Eder’s Simple.
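For anyone who wants to try it, this is roughly what that setup looks like when Stylo is run from R rather than through the point-and-click interface. To be clear, this is my reconstruction from the settings they described and from the current Stylo documentation, not code the authors provided, and parameter names can differ between Stylo versions.

```r
library(stylo)

# Roughly the configuration described above: 100 MFW, word 1-grams, no culling,
# no sampling, with only the distance measure switched to Eder's Simple.
# A reconstruction under the assumptions stated above, not the authors' script.
stylo(gui = FALSE,
      corpus.dir = "corpus",             # one plain-text file per text/chapter
      analyzed.features = "w",           # words, not characters
      ngram.size = 1,                    # 1-word n-grams
      mfw.min = 100, mfw.max = 100,      # 100 most frequent words
      culling.min = 0, culling.max = 0,  # no culling
      sampling = "no.sampling",          # no sampling
      distance.measure = "dist.simple",  # Eder's Simple distance
      analysis.type = "CA")              # cluster analysis dendrogram
```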
For me, this was a bit wild, because why isn’t this spelled out in the book? Why not be transparent about it? Instead, they make it seem as if their methodology is more in-depth, as if it came from having studied other research and run a whole mess of tests. Yet what they arrive at is basically: open up Stylo and run it on the defaults.
Now, one may be wondering: did they do any pre- or post-processing of the data? No, they didn’t. They state the reason for this is largely that they didn’t want to complicate the process any further, and they didn’t want to do anything extra that would draw even more criticism.
And that may very well be the truth. But it seems like a crutch, an easy excuse that hides the fact that their study is lacking in a variety of ways, while also trying to build up the idea that the study is legitimate. Because again, their methodology is nothing more than opening up Stylo, changing just the distance measure, and running the program. And even then, they couldn’t spell this out in their actual book. So for me, it raises some suspicions. But I want to give them the benefit of the doubt, as they at least seemed sincere in the email exchange.
