Text Mining Oklahoma’s Newspapers

The Oklahoma Historical Society Research Division’s website, Gateway to Oklahoma History, is a fantastic, freely available source that includes digitized Oklahoma newspapers from the 1840s to the 1920s. Like many digitization projects, it is a work in progress, but it already contains a wealth of valuable information.

The basic search screen, shown here, allows you to search the database’s full text, metadata, title, subject, and creator fields. The “Explore” function in the upper right corner allows you to browse locations, dates, and titles.

[Screenshot: Gateway to Oklahoma History basic search screen]

When I first encountered this database, I did what many of us do, but may be ashamed to admit: looked up my family name. Unfortunately, the first newspaper result is about a bank robber who shares my last name and who may or may not be related. I quickly tried to think of ways to refine my search to the decent Scriveners, but saw no easy way out. Additionally, “scrivener” is a legal term and a profession that appears fairly frequently in historic newspapers.

Luckily, I remembered my father’s mother’s maiden name, Malahy, and quickly found a much nicer feel-good story: my grandmother had written a letter to Santa Claus that was published in a Shawnee newspaper in 1917, when she was 7 years old.

[Screenshot: letter to Santa Claus published in a Shawnee newspaper, 1917]

Whatever type of research you are doing, here are some helpful hints for the Gateway to Oklahoma History:

  • When you do a keyword search, your terms will be highlighted in yellow within the digitized document.
  • If the term doesn’t appear on the first page, be sure to check the other pages.
    • I prefer to do this by clicking on the front page of the newspaper, and then clicking on “zoom/full page.”
    • At that point, I can easily scroll through the pages until I find the yellow highlighting, or zoom in on the page so I can actually read it.
  • Keep in mind that optical character recognition is not perfect. For example, when I searched for the name Malahy, the word “salary” came up a lot (see the short sketch after this list for why).
  • In this database, as in most that offer digitized materials, there is very little content after the year 1923, largely due to copyright restrictions.
  • Even though my last name is not nearly as common as some, I still run into many “scriveners” when I do a keyword search because the word was more commonly used in the past.
    • When doing research in any digital database, have patience and be flexible with your keywords.
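Why do near-misses like “Malahy”/“salary” happen? The two words share most of their letters, so a smudged column of newsprint and a few misread characters are enough for OCR to turn one into the other. Here is a minimal, purely illustrative Python sketch of that similarity (it has nothing to do with the Gateway’s own search code):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Share of characters two lowercased strings have in common."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# "Malahy" and "salary" overlap heavily ("-ala-" plus the final "y"),
# so a few OCR misreads can easily conflate them.
print(similarity("Malahy", "salary"))   # about 0.67
print(similarity("Malahy", "window"))   # 0.0, no letters in common
```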

By Laurie Scrivener, History and Area Studies Librarian

On the Value of Sharing Failure and Ignorance

I have gone down many blind alleys in my text-mining project, and I don’t think of them as failures; they are experiments that didn’t work out. The project involves mining the text of the U.S. horror genre television show Supernatural (the CW network, 2005-present) to try to demonstrate in an objective fashion the quality of the dialogue. I’ve hit roadblocks and met dead ends in every phase of the project so far, from creating the data, to learning computer-assisted analysis and other tools, to defining the research questions. And of course the darkest alley of all is the one I haven’t reached yet: what if the results don’t back me up?

The first and by far the most time- and labor-intensive step in the process was creating the datasets – in this case, the dialogue from the first 10 years of the program, comprising 217 episodes (the SPN Corpus), the dialogue from one of the main characters (the Dean Corpus), and the dialogue from the other main character (the Sam Corpus). I attempted to create the SPN Corpus a few different ways: using transcripts of aired episodes made by fans and stripping out everything that wasn’t dialogue and associated speakers; starting with files used in closed captioning and dubbing, stripping the time stamps (which could be done automatically using one of the programs available on the internet for that purpose), and adding the speakers; and enlisting the help of a graduate student who valiantly tried to write a program to automate the process by magically comparing the transcript and caption files. The first method is time-consuming. The second method turned out to be even more time-consuming and not reliably accurate. The third method worked to some extent, but only, at best, about 75% of the time; going through and fixing errors was the most time-consuming of all. This last experiment is one I regret, because it took up quite a bit of the programmer’s time. I hope he is able to make use of what he learned writing the program.
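For the curious, the timestamp-stripping step is mechanically simple. Here is a minimal Python sketch, assuming SubRip (.srt) caption files (an assumption on my part; the programs available online handle this automatically):

```python
import re
from pathlib import Path

# SubRip cue timing lines look like: 00:01:02,500 --> 00:01:05,000
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")

def srt_to_dialogue(path: Path) -> str:
    """Return only the caption text from an .srt file, dropping
    cue numbers, timestamp lines, and blank separators."""
    kept = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)

print(srt_to_dialogue(Path("episode_101.srt")))  # hypothetical file name
```

Note that this still leaves the hardest part, attributing each line to a speaker, to be done by hand.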

There were other stumbling blocks, the most notable of which I encountered when creating the separate Sam and Dean corpora. There are numerous times in the series when one or both of the characters was possessed, supernaturally influenced, impersonated, hallucinatory, sent to an alternate reality, or body-swapped. I wavered considerably while deciding what exactly to count as dialogue spoken by Sam or Dean, and not some version of them. I consulted some fans, got a few opinions, and tried to devise a set of criteria. In the end, there were no rules I could apply universally, and some of the decisions I made were based on instinct. But there was quite a bit of mind-changing and stress in the decision-making.

Incidentally, from what I can tell, an actual corpus linguist wouldn’t sweat these small details. It likely wouldn’t make much difference to the results. But whether it’s the librarian in me or the fan or a combination of the two, I want precision to the extent possible.

At various points in this project I have shared my experience. I published a chapter on issues with creating the data and finding and learning a corpus analysis toolkit. I gave a few presentations at a couple of THATCamps and a Research Bazaar (ResBaz), mostly on how to use developer Laurence Anthony’s AntConc, the tool I selected to analyze my corpora. And at each of these points, I learned something. Before writing about or presenting on AntConc I needed to learn how to use it, and at each stage I learned a bit more about computer-assisted textual analysis and corpus linguistics generally. I did a good deal of reading, playing around, and corresponding with Dr. Anthony. I progressed from being able to demonstrate only the most basic concordance and word frequency tools to becoming sufficiently knowledgeable – but not expert – at explaining the more advanced tools in the AntConc kit. I recall trying to show how to compare the text of Frankenstein to Dracula at my first THATCamp and getting completely confused about which was the reference corpus and what I was seeing in the results, until someone in the audience showed me. It was a complete “duh” moment, more so because I was the presenter. However, no one left the room in a huff, and I didn’t hear any murmurs to the effect that the presenter was incompetent. I credit the atmosphere of the THATCamp for this; the nature of the format allows for some stumbling. We’re all learning, after all.
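For anyone similarly puzzled by keyword comparisons: the reference corpus supplies the expected word frequencies, and “keywords” are words that turn up in the target corpus markedly more often than the reference would predict. Here is a minimal Python sketch of a log-likelihood keyness statistic of the kind keyword tools such as AntConc typically use (a generic illustration, not AntConc’s actual code; the word counts below are made up):

```python
import math
from collections import Counter

def keyness(word: str, target: Counter, reference: Counter) -> float:
    """Dunning log-likelihood: how much more often `word` occurs in the
    target corpus than the reference corpus would lead us to expect."""
    a, b = target[word], reference[word]                  # observed counts
    c, d = sum(target.values()), sum(reference.values())  # corpus sizes
    e1 = c * (a + b) / (c + d)   # expected count in target
    e2 = d * (a + b) / (c + d)   # expected count in reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Toy example: is "monster" key in Frankenstein relative to Dracula?
frankenstein = Counter({"monster": 33, "the": 4100})
dracula = Counter({"monster": 5, "the": 8000})
print(keyness("monster", frankenstein, dracula))  # large value = very "key"
```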

It was while presenting at the ResBaz that I learned some extremely useful skills, and I only learned them because I was open about what I didn’t know. I wanted to create word lists of adjectives, nouns, and verbs used by Sam and Dean, separately. I’d found an easy part-of-speech tagger, but I had two problems: first, the tagger created individual files and placed them in the same folder as the untagged files. That was 217 files I had to pull out of a folder. Second, I didn’t want the same word to appear more than once, since I’m concerned with the breadth of the vocabulary, not word frequency. But I didn’t know how to deal with the duplicates. I could make out some sotto voce discussion between two attendees. They obviously knew something I didn’t, but seemed hesitant to say so. I’m sure they were being polite and didn’t want to call me on my ignorance. But I assured them that I’d be more than grateful to hear what they had to say, no matter how elementary it might seem to them. If these two guys hadn’t happened to be at my little presentation, I would never have learned how to concatenate files using Git Bash or how to deduplicate using Excel.
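For anyone facing the same two chores, here is a minimal Python sketch of both steps; the folder and file names are hypothetical placeholders, and the Git Bash equivalent of the first step is simply `cat tagged/*.txt > corpus.txt`:

```python
from pathlib import Path

# Hypothetical layout: one tagged episode per .txt file in tagged/
tagged_files = sorted(Path("tagged").glob("*.txt"))

# Step 1: concatenate all episode files into a single corpus file.
with open("corpus.txt", "w", encoding="utf-8") as out:
    for f in tagged_files:
        out.write(f.read_text(encoding="utf-8"))
        out.write("\n")

# Step 2: build a deduplicated, alphabetized word list; breadth of
# vocabulary, not frequency, is what matters here.
unique_words = set(Path("corpus.txt").read_text(encoding="utf-8").split())
Path("wordlist.txt").write_text("\n".join(sorted(unique_words)), encoding="utf-8")
```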

At the very least, sharing ignorance can confirm that what you don’t know, others also don’t know. Maybe that tool you were hoping for doesn’t exist yet. I asked a few text analysis experts at my second THATCamp if they knew of a tool that could identify synonyms in a word list. No one did. But of course, if anyone reading this knows of one, I hope you’ll share it with me!

By Liorah Golomb, Humanities Librarian, University of Oklahoma

A Perfect Fit: Citizen Science Soil Collection and SHAREOK

The University of Oklahoma Natural Products Discovery Group has teamed up with the OU Libraries to make data from its innovative Citizen Science Soil Collection program open to the public.

Citizen scientists support natural products drug discovery

The Natural Products Discovery Group (NPDG), led by Principal Investigator Dr. Robert Cichewicz, works to discover and develop new drugs that can be used to treat infections and cancers. The starting point for its drug discovery research is chemicals called natural products, which are produced by organisms like the fungi found in soils.

In the beginning, they collected their own soil samples and tested the fungi that could be found locally. Soon, though, they realized that they would need help if they ever hoped to collect samples spanning the diverse ecosystems of the United States. To expand their reach, the group created the Citizen Science Soil Collection Program, through which citizen scientists from across the country sign up to send in samples from their own backyards. Volunteers go online to request a soil collection kit. The kit includes everything needed to collect a soil sample, from a tiny plastic shovel to return postage. Citizen scientists mail their soil samples back to the NPDG lab, where fungal isolates are plated and tested. Chemical products produced by fungi found in the soil samples are screened to identify disease-fighting compounds that can be used to develop drugs to treat human diseases.

The Citizen Science Soil Collection Program was launched in 2010, and since then over 10,000 citizen scientists have requested sample kits.

Sharing research data

With the influx of soil samples and the resultant increase in data produced, the NPDG needed a robust method to communicate data gathered from soil samples. When they reached out to the University of Oklahoma Libraries for help, the libraries’ repository services team—David Corbly, Tao Zhao, Zhongda Zhang, and Logan Cox—rose to the challenge. They designed a custom site to host the Citizen Science project data in SHAREOK, the joint institutional repository of the University of Oklahoma and Oklahoma State University. The SHAREOK site allows citizen scientists to look up the data generated from their soil samples and also serves as a permanent archive of research data that can be mined by current and future researchers.

How to help

If you are interested in becoming a citizen scientist yourself, visit the Citizen Science Soil Collection Program website to learn more and request a sample kit.

If you would like to donate to support the Citizen Science Soil Collection Program, visit the Thousands Strong donation site. The Thousands Strong fundraiser runs from May 16 to June 16, 2016.

Carolyn Mead-Harvey, Science Librarian, University of Oklahoma

Historical Sheet Music Online

Before the radio burst into millions of homes during the 1920s, the piano was often the focus of the evening’s entertainment for many middle-class families. The sheet music of the 19th and early 20th centuries provides revealing glimpses into the attitudes and culture of earlier times. Many digital collections of this repertoire are available; a few starting places include:

While some of these scores require specialized skills to read, many could be used by researchers in other fields without difficulty.

– Matthew Stock, Fine and Applied Arts Librarian, University of Oklahoma

Digital Scholarship in Libraries

Stewart Varner and Patricia Hswe wrote an informative piece about the status of digital humanities in academic libraries, based on the results of an American Libraries/Gale Cengage survey. This is not the only survey that has been conducted to gauge how libraries are supporting and contributing to digital humanities projects. In the past two years, as Digital Scholarship Specialist, I have participated in four surveys conducted by other universities or non-profit digital library organizations — all trying to assemble an environmental scan of how academic libraries are responding to this growing discipline.

About ResBaz 2016

You’re invited! OU Libraries’ Digital Scholarship Lab is hosting a ResBaz (Research Bazaar) February 2–4, 2016, in conjunction with the University of Melbourne’s ResBaz. The focus of a ResBaz is to bring together researchers from a variety of disciplines and invite them to participate in a range of workshops on digital tools. The format of a ResBaz is in the spirit of a market or bazaar – it is meant to be informal and inviting (see http://melbourne.resbaz.edu.au/about). We have lined up a number of technologists (GIS, text analysis, data visualization) and subject librarians to participate in ResBaz, and I hope to invite several faculty to offer workshops on tools that they have used.