I have gone down many blind alleys in my text-mining project, and I don’t think of them as failures; they are experiments that didn’t work out. The project involves mining the text of the U.S. horror genre television show Supernatural (the CW network, 2005-present) to try to demonstrate in an objective fashion the quality of the dialogue. I’ve hit roadblocks and met dead ends in every phase of the project so far, from creating the data, to learning computer-assisted analysis and other tools, to defining the research questions. And of course the darkest alley of all is the one I haven’t reached yet: what if the results don’t back me up?
The first and by far the most time- and labor-intensive step in the process was creating the datasets: in this case, the dialogue from the first 10 years of the program, comprising 217 episodes (the SPN Corpus), the dialogue from one of the main characters (the Dean Corpus), and the dialogue from the other main character (the Sam Corpus). I attempted to create the SPN Corpus a few different ways: using transcripts of aired episodes made by fans and stripping out everything that wasn’t dialogue and associated speakers; starting with files used in closed captioning and dubbing, stripping the time stamps (which could be done automatically using one of the programs available on the internet for that purpose), and adding the speakers; and enlisting the help of a graduate student who valiantly tried to write a program to automate the process by magically comparing the transcript and caption files. The first method is time-consuming. The second method turned out to be even more time-consuming and not reliably accurate. The third method worked to some extent, but only, at best, about 75% of the time; going through and fixing errors was the most time-consuming of all. This last experiment is one I regret, because it took up quite a bit of the programmer’s time. I hope he is able to make use of what he learned writing the program.
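For the curious, the timestamp-stripping step can be sketched in a few lines of Python. This is only an illustration, assuming the caption files are in the common SRT format (a cue number, a timestamp line, then the caption text); the sample lines are made up for demonstration:

```python
import re

def strip_srt_timestamps(text: str) -> str:
    """Remove cue numbers and timestamp lines from SRT-style captions,
    keeping only the caption text. (A caption consisting solely of a
    number would be dropped too; rare in dialogue, but worth knowing.)"""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue               # blank separator between cues
        if line.isdigit():
            continue               # cue number, e.g. "1"
        if "-->" in line:
            continue               # timestamp, e.g. 00:00:01,000 --> 00:00:03,000
        kept.append(line)
    return "\n".join(kept)

sample = """1
00:00:01,000 --> 00:00:03,000
Dad's on a hunting trip.

2
00:00:03,500 --> 00:00:05,000
And he hasn't been home in a few days."""

print(strip_srt_timestamps(sample))
```

The speaker names still have to be added by hand afterward, which is where the real labor was.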
There were other stumbling blocks, the most notable of which I encountered when creating the separate Sam and Dean corpora. There are numerous times in the series when one or both of the characters were possessed, supernaturally influenced, impersonated, hallucinatory, sent to an alternate reality, or body-swapped. I wavered considerably while deciding what exactly to count as dialogue spoken by Sam or Dean, and not some version of them. I consulted some fans, got a few opinions, and tried to devise a set of criteria. In the end, there were no rules I could apply universally, and some of the decisions I made were based on instinct. But there was quite a bit of mind-changing and stress in the decision-making.
Incidentally, from what I can tell, an actual corpus linguist wouldn’t sweat these small details. It likely wouldn’t make much difference to the results. But whether it’s the librarian in me or the fan or a combination of the two, I want precision to the extent possible.
At various points in this project I have shared my experience. I published a chapter on issues with creating the data and finding and learning a corpus analysis toolkit. I gave a few presentations at a couple of THATCamps and a Research Bazaar (ResBaz), mostly on how to use developer Laurence Anthony’s AntConc, the tool I selected to analyze my corpora. And at each of these points, I learned something. Before writing about or presenting on AntConc I needed to learn how to use it, and at each stage I learned a bit more about computer-assisted textual analysis and corpus linguistics generally. I did a good deal of reading, playing around, and corresponding with Dr. Anthony. I progressed from being able to demonstrate only the most basic concordance and word frequency tools to becoming sufficiently knowledgeable, though not expert, at explaining the more advanced tools in the AntConc kit. I recall trying to show how to compare the text of Frankenstein to Dracula at my first THATCamp and getting completely confused about which was the reference corpus and what I was seeing in the results, until someone in the audience showed me. It was a complete “duh” moment, more so because I was the presenter. However, no one left the room in a huff, and I didn’t hear any murmurs to the effect that the presenter was incompetent. I credit the atmosphere of the THATCamp for this; the nature of the format allows for some stumbling. We’re all learning, after all.
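For anyone who, like me, has puzzled over what a keyword comparison is actually doing, the idea can be sketched directly. This is a minimal illustration, not AntConc’s actual code, of the log-likelihood keyness statistic (Dunning, 1993) that AntConc offers for its Keyword List: the reference corpus supplies the expected frequency for each word, and words that occur far more often than expected in the target corpus score highest.

```python
import math
from collections import Counter

def keyness(target_tokens, reference_tokens):
    """Score each word in the target corpus by log-likelihood keyness
    relative to the reference corpus; returns (word, score) pairs,
    highest (most 'key') first."""
    tf, rf = Counter(target_tokens), Counter(reference_tokens)
    t_size, r_size = len(target_tokens), len(reference_tokens)
    scores = {}
    for word, a in tf.items():
        b = rf.get(word, 0)
        # Expected counts if the word were spread evenly across both corpora
        e1 = t_size * (a + b) / (t_size + r_size)
        e2 = r_size * (a + b) / (t_size + r_size)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        scores[word] = ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

So if Frankenstein is the target and Dracula the reference, the top-scoring words are the ones distinctive of Frankenstein; swap the two, and you get Dracula’s. Keeping that straight was exactly my “duh” moment.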
It was while presenting at the ResBaz that I learned some extremely useful skills, and I only learned them because I was open about what I didn’t know. I wanted to create word lists of adjectives, nouns, and verbs used by Sam and Dean, separately. I’d found an easy part-of-speech tagger, but I had two problems: first, the tagger created individual files and placed them in the same folder as the untagged files. That was 217 files I had to pull out of a folder. Second, I didn’t want the same word to appear more than once, since I’m concerned with the breadth of the vocabulary, not word frequency. But I didn’t know how to deal with the duplicates. I could make out some sotto voce discussion between two attendees. They obviously knew something I didn’t, but seemed hesitant to say so. I’m sure they were being polite and didn’t want to call me on my ignorance. But I assured them that I’d be more than grateful to hear what they had to say, no matter how elementary it might seem to them. If these two guys hadn’t happened to be at my little presentation, I would never have learned how to concatenate files using Git Bash or how to deduplicate using Excel.
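For anyone facing the same two chores, here is roughly what those tricks amount to in Python (the folder and file names are invented for illustration):

```python
from pathlib import Path

def concatenate(folder: str, pattern: str, out_path: str) -> int:
    """Join every file matching `pattern` in `folder` into one file,
    returning the number of files joined -- the same result as the
    Git Bash one-liner `cat *.txt > corpus.txt`."""
    files = sorted(Path(folder).glob(pattern))
    with open(out_path, "w", encoding="utf-8") as out:
        for f in files:
            out.write(f.read_text(encoding="utf-8") + "\n")
    return len(files)

def dedupe(words):
    """Keep only the first occurrence of each word, case-insensitively,
    preserving order -- analogous to Excel's Remove Duplicates."""
    seen = set()
    unique = []
    for w in words:
        key = w.lower()
        if key not in seen:
            seen.add(key)
            unique.append(w)
    return unique

print(dedupe(["hunt", "Hunt", "demon", "hunt", "angel"]))  # → ['hunt', 'demon', 'angel']
```

Neither is hard once you know it exists, which was precisely my problem.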
At the very least, sharing ignorance can confirm that what you don’t know, others also don’t know. Maybe that tool you were hoping for doesn’t exist yet. I asked a few text analysis experts at my second THATCamp if they knew of a tool that could identify synonyms in a word list. No one did. But of course, if anyone reading this knows of one, I hope you’ll share it with me!
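In case it helps anyone thinking about building such a tool, the core logic might look something like this. The mini-thesaurus here is invented purely for demonstration; a real implementation would draw its synonym sets from a lexical database such as WordNet (accessible in Python through the nltk package):

```python
# Toy synonym sets standing in for a real lexical resource like WordNet.
SYNONYM_SETS = [
    {"scared", "afraid", "frightened"},
    {"weird", "strange", "odd"},
]

def group_synonyms(words):
    """Group words from a list that share a synonym set; words with no
    listed synonyms come back as singletons."""
    groups = []
    remaining = set(words)
    for syns in SYNONYM_SETS:
        hit = syns & remaining
        if len(hit) > 1:            # at least two list words are synonyms
            groups.append(sorted(hit))
            remaining -= hit
    groups.extend([w] for w in sorted(remaining))
    return groups

print(group_synonyms(["scared", "weird", "afraid", "demon"]))
```

The hard part, of course, is not the grouping but the lexical resource behind it, which is why a ready-made tool would be so welcome.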
By Liorah Golomb, Humanities Librarian, University of Oklahoma