HathiTrust Digital Library recently announced that it now “provides access to the text of the complete 16.7-million-item HathiTrust corpus for non-consumptive research, such as data mining and computational analysis, including items protected by copyright.” This is a major development for HathiTrust, one that took years of planning and negotiation to ensure compliance with copyright law. For those unfamiliar with the term, non-consumptive research uses computational methods to analyze textual or visual digital content without downloading or displaying substantial portions of the digitized work. (An analogy would be studying birds in the field rather than bringing them back to a lab.) Computational tasks in non-consumptive research may include assembling collections, text mining, textual analysis, data extraction and visualization, linguistic analysis, image analysis, or file manipulation. So now, regardless of an item’s copyright status, researchers can use the analytics services and tools developed by the HathiTrust Research Center (HTRC) to mine and analyze digitized content in the repository for research and educational purposes.
What is in the HathiTrust Digital Library?
HathiTrust is a partnership of approximately 140 research institutions and libraries that contribute their digitized collections to a shared repository. The repository comprises 16.7 million items dating back to the 10th century, of which 6.2 million can be viewed in full. Items in HathiTrust Digital Library include books, serials, musical scores, manuscripts, records, reference materials, statistics, and conference proceedings in more than 27 languages.
How does this new development affect research at the University of Oklahoma?
The University’s students, faculty and staff can access analytics tools offered through the HathiTrust Research Center (HTRC). The HTRC, run by Indiana University and the University of Illinois at Urbana-Champaign, develops computational tools for analyzing content in the HathiTrust Digital Library and makes them available to researchers of all skill levels on its website.
For novice users, HathiTrust’s Bookworm makes it possible to discover trends and patterns by entering single keywords and filtering the search in a number of ways. The Bookworm Playground allows one to conduct comparative searches on the geographic locations of publications, while the bar chart and heat map applications provide a higher-level view of the contents of the HathiTrust corpus.
For those who want to dive deeper, the HTRC Analytics portal is designed for users who have some programming experience and an understanding of the basic concepts of text analysis. The featured services in the Analytics portal offer several ways to conduct more complex and customized searches. HathiTrust’s latest announcement highlighted its Extracted Features tool, which allows one to download metadata and word counts from the HathiTrust Digital Library. Another featured service is HTRC Algorithms, a set of “click-and-run” tools that let users run an algorithm on a subcollection or workset without having to write code. While the HathiTrust Research Center has improved its services and documentation significantly, all three of the HTRC Analytics tools require an investment of time to learn and use effectively. At OU Libraries we have specialists who can work with faculty and students interested in exploring both the Bookworm applications and HTRC’s analytics services.
Head, Digital Scholarship Lab