Textual Analysis & Mining

Within the humanities, textual analysis has always been, and will continue to be, an integral field of study. Literature is the richest record humanity possesses of its past, and it is arguably the most intelligible of all the evidence available for understanding where we come from. Until the digital age, despite minor differences in procedure, scholars broadly agreed that there was only one method for understanding texts from a humanities viewpoint. That method is known as close reading: a careful, mindful, and thorough analysis of the writings in question. The arrival of the digital era, however, brought about new, more detached approaches to textual analysis, which came to be known collectively as distant reading. Though these new techniques may be the antithesis of close reading, they need not be its archrival. On the contrary, the two can be complementary.

With the arrival of the computational power offered by digital tools, novel forms of literary analysis became feasible. A computer can perform far more calculations than a human reader could ever manage. Applying that analytical power to textual analysis allowed texts to be abstracted in ways never before imagined. Such practices include tracking the rise in use of a word over time [1], counting the occurrences of particular words in a piece of literature, or, more specifically, examining an author’s use of various pronouns [2].
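To make the idea of counting concrete, here is a minimal sketch in Python of tallying an author's pronouns. It is not how Voyant or any tool cited here works internally; the pronoun list, the function names, and the sample sentence are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# A small, hand-picked pronoun list (an assumption; a real study would
# justify and document its own list).
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your", "he", "him",
            "his", "she", "her", "they", "them", "their", "it", "its"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(text):
    """Count how many times each word occurs in the text."""
    return Counter(tokenize(text))

def pronoun_profile(text):
    """Return counts for only those pronouns that actually occur."""
    counts = word_counts(text)
    return {p: counts[p] for p in sorted(PRONOUNS) if counts[p] > 0}

# An invented sentence standing in for a real literary text.
sample = "I told her that we would visit them, and she said they would welcome us."
profile = pronoun_profile(sample)
print(profile)
```

Run on a whole novel rather than one sentence, the same few lines would produce in seconds a tally that would take a human reader days to compile by hand.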

Seeing texts from vantage points that had previously been out of reach gave scholars room to pose innovative questions. With these new data, one could ask when a word came into use [3], how the use of certain words related to current events or geographic areas [4], or what the use of particular words could say about an author’s personality [5]. Such questions could not be asked before the advent of distant reading because there are not enough hours in a day, days in a year, or even years in a lifetime for a human to collect, classify, and catalog such information comprehensively. This is especially true for large corpora, which are necessary in order to reach answers that are statistically defensible.
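The question of when a word came into use can be sketched in the same spirit. The example below is a simplified illustration, not the method behind the Google Ngram Viewer: the tiny dated corpus is invented, and the function names are hypothetical. It computes what share of each year's words a given term accounts for, and the earliest year in which the term appears at all.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency_by_year(dated_texts, word):
    """Return {year: share of that year's tokens that equal `word`}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in dated_texts:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year]
            for year in sorted(totals) if totals[year] > 0}

def first_attested(dated_texts, word):
    """Return the earliest year in which `word` occurs, or None if it never does."""
    freqs = relative_frequency_by_year(dated_texts, word)
    years = [year for year, share in freqs.items() if share > 0]
    return min(years) if years else None

# Invented sentences with invented dates, for illustration only.
corpus = [
    (1850, "The telegraph carried the news to the city."),
    (1880, "A telephone was installed beside the telegraph office."),
    (1880, "The telephone rang twice."),
    (1910, "Every office had a telephone and nobody sent a telegraph message."),
]

freqs = relative_frequency_by_year(corpus, "telephone")
earliest = first_attested(corpus, "telephone")
print(freqs, earliest)
```

A corpus this small proves nothing, of course; the earliest appearance in a sample is only as trustworthy as the sample is large and representative.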

The text mining techniques that enable distant reading can also serve close reading: a scholar can search the contents of text databases for keywords in order to find pertinent sources to read closely. This approach holds a strong allure; a scholar can generally find as many sources as desired in an hour or less without ever leaving home. For some, though, there is cause for concern in anyone’s ability to find, so quickly and with so little effort, a large number of sources that agree with an opinion already held.
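A keyword search of this kind can be sketched very simply. The example below is a toy illustration under stated assumptions, not the ranking method of any real database: the three "essays" are invented, and relevance is reduced to a raw count of matching words.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_search(documents, query):
    """Rank titles by how often the query terms occur; omit documents with no match."""
    terms = set(tokenize(query))
    scored = []
    for title, text in documents.items():
        score = sum(1 for token in tokenize(text) if token in terms)
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]

# Invented documents, for illustration only.
documents = {
    "Essay A": "Railway expansion transformed the city; the railway brought trade.",
    "Essay B": "Canal trade declined as the railway grew.",
    "Essay C": "Farmers resisted enclosure of common land.",
}

results = keyword_search(documents, "railway")
print(results)
```

Notice that a document which never uses the query term simply does not appear in the results, however relevant it might be to the broader question the scholar is asking.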

Dr. Ted Underwood, an Associate Professor of English literature at the University of Illinois Urbana-Champaign with a special interest in the Digital Humanities, raises some of these concerns in his paper “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago.” Underwood writes:

It’s true that full-text search can confirm almost any thesis you bring to it, but that may not be its most dangerous feature. The deeper problem is that by sorting sources in order of relevance to your query, it also tends to filter out all the alternative theses you didn’t bring. Search is a form of data mining, but a strangely focused form that only shows you what you already know to expect. This limitation would be a problem in any domain, but it’s particularly acute in historical research, since other periods don’t always organize their knowledge in the ways we find intuitive. Our guesses about search terms may well project contemporary associations and occlude unfamiliar patterns of thought. [6]

Here he makes a strong case, the core of which is that even a humanist who seeks to be impartial and objective when researching with data mining techniques may still be subject to systematic error outside that scholar’s control. I think it is important to note that, though his apprehension is well merited, there have, as he himself says, always been errors. Before data mining techniques, the typical ways of securing research sources were consulting colleagues, the appendices of sources already in hand, and local libraries. All of these paths could strengthen confirmation bias and, more importantly, show you what you are already well acquainted with and thus what you already know to expect.

While distant reading and data mining do have their faults as means of obtaining unbiased source material, a scrupulous scholar should be able to use them with reasonable impartiality. For this to happen, researchers must hold themselves to the same standards they always have. One must not only use an ample number of sources but also pursue texts that conflict with one’s intended thesis, explore the appendices of the sources one finds, and ensure that the documents used are reputable.

While it can be hard to find sources presenting viewpoints opposed to the one a scholar may be predisposed to, it is not impossible. Dr. Stephen Ramsay, an Associate Professor of English at the University of Nebraska–Lincoln, described the importance of browsing to learning in his article “The Hermeneutics of Screwing Around; or What You Do with a Million Books.” In his piece he explained how always needing to have something to search for in order to access the millions of books available to humanity inherently limits the ability of scholars and others to encounter interesting new ideas [7]. Though it can be difficult to locate such perspectives in traditional searchable digital databases, there are other resources. While it is not acceptable as evidence in research, Wikipedia is a fine starting place for exploring a variety of ideas that can lead to new branches of investigation, precisely because its vast interconnected web of hyperlinks fosters browsing.

One of the main systematic sources of error that could cause data mining or distant reading to break down is the data itself. Data is a human construct, so it is prone to error. Trevor Owens, a digital humanist at the Library of Congress, describes this thought succinctly in his piece “Defining Data for Humanists: Text, Artifact, Information or Evidence?” Owens notes that “data itself has an artifactual quality to it. What one researcher considers noise, or something to be discounted in a dataset, may provide essential evidence for another.” [8] An example of this is the innovative work of Dr. Natalie M. Houston, a Professor of English at the University of Houston, who focuses on the oft-ignored visual features of poetry books from the Victorian era. Before Houston’s work, when these books were added to databases, the graphical layout was not retained; only the text was recorded. The study of poetry books’ organization is important because it is relatively unstudied and relates directly to the linguistic features of the poems [9].

Thus distant reading and text mining allow an abstract interpretation of data that opens areas of study that would be impossible without them, while also supporting the traditional practice of close reading by expediting the collection of relevant sources. Both of these functions, like all types of research, can misrepresent the subject being examined if they are applied incorrectly. In conclusion, text mining is too integral to textual analysis to be removed from the humanities, and as long as scholars strive for the utmost accuracy and precision, it does not harm the integrity of the field.


  1. The Google Ngram Viewer Team. “Ngram Viewer.” Google, 1 Jan. 2013. Web. 13 Mar. 2015. <https://books.google.com/ngrams/info>.
    The Google Ngram Viewer Team. “Ngram Viewer.” Google, 1 Jan. 2013. Web. 13 Mar. 2015. <https://books.google.com/ngrams>.
  2. Sinclair, Stéfan, and Geoffrey Rockwell. “Tools Index.” Voyant Tools Documentation. 1 Jan. 2013. Web. 13 Mar. 2015. <http://docs.voyant-tools.org/tools/>.
    Sinclair, Stéfan, and Geoffrey Rockwell. “See Through Your Text.” Voyant. 1 Jan. 2015. Web. 13 Mar. 2015. <http://voyant-tools.org/>.
  3. Zimmer, Ben. “Google’s Ngram Viewer Goes Wild.” The Atlantic. The Atlantic, 17 Oct. 2013. Web. 13 Mar. 2015. <http://www.theatlantic.com/technology/archive/2013/10/googles-ngram-viewer-goes-wild/280601/>.
  4. Anderson, Alyssa. “Using Voyant for Text Analysis.” DIGITAL HISTORY METHODS. 17 Oct. 2013. Web. 13 Mar. 2015. <http://ricedh.github.io/02-voyant.html>.
  5. Cook, Gareth. “The Secret Language Code.” Scientific American. Scientific American, 16 Aug. 2011. Web. 13 Mar. 2015. <http://www.scientificamerican.com/article/the-secret-language-code/>.
  6. Underwood, Ted. “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago.” (2014): 3. Print.
  7. Ramsay, Stephen. “The Hermeneutics of Screwing Around; or What You Do with a Million Books.” Digital Culture Books. Ed. Kevin Kee. Ann Arbor, MI: University of Michigan Press, 1 Jan. 2014. Web. 13 Mar. 2015. <http://quod.lib.umich.edu/d/dh/12544152.0001.001/1:5/--pastplay-teaching-and-learning-history-with-technology?g=dculture;rgn=div1;view=fulltext;xc=1>.
  8. Owens, Trevor. “Defining Data for Humanists: Text, Artifact, Information or Evidence?” Journal of Digital Humanities. Roy Rosenzweig Center for History and New Media, 15 Dec. 2011. Web. 13 Mar. 2015. <http://journalofdigitalhumanities.org/1-1/defining-data-for-humanists-by-trevor-owens/>.
  9. Houston, Natalie M. “Toward a Computational Analysis of Victorian Poetics.” Victorian Studies. Special Issue: Papers and Responses from the Eleventh Annual Conference of the North American Victorian Studies Association 3 (2014): 498-510. Print.