Topic Modeling Thomas Osborne Davis

I intend for my final project to focus on Thomas Osborne Davis, an Irish nationalist who lived from 1814 to 1845 (he died a week after the potato blight that caused the infamous Famine was first reported). During Davis’ lifetime, his poetry and prose appeared in The Nation, a newspaper he cofounded in 1842. After his death, Charles Gavan Duffy—one of The Nation’s other founders—published Davis’ work in book form. All of Davis’ writing fiercely condemns British control of Ireland and celebrates aspects of Ireland’s pre-colonial culture such as mythology, language, and artwork, which the British had long attempted to eradicate. Davis argues that reclaiming Ireland’s traditional culture will empower the Irish people to end their colonization.

Despite Davis’ importance to subsequent Irish nationalists (for instance, the leaders of the 1916 Easter Rising picked up his tactic of glorifying Ireland’s traditional culture), he is not at all well known in the U.S. My attraction to Davis’ writing stems from my interests in how national identities are created and sustained and in the history of Ireland’s colonization. To allow me to pursue my interest in Davis, Ted added two collections of Davis’ writing—The Poems of Thomas Davis (1846) and Literary and Historical Essays (1845)—to our corpus.

I was initially nervous about the idea of bringing digital tools to bear on Davis’ work. I worried that topic modeling would yield only obvious results, such as his frequent use of words like “Ireland,” “Irish,” and “national.” Indeed, both of the texts by Davis in our corpus rely heavily on Topic 69 (ireland irish county sir lord see king note catholic english scotland earl roman parish country said bishop england parliament castle year pope about kingdom died land century lands family per native priest crown fits afterwards royal reign dr church according ancient town property priests oath following queen held period north). This is the first topic listed for Literary and Historical Essays—completely unsurprising given that titles in this collection include “Ancient Ireland,” “Irish Antiquities,” and “Irish Scenery.” Topic 69’s relationship to The Poems of Thomas Davis, on the other hand, surprised me somewhat. I expected Topic 69 to be the most common topic in this collection, as well, because Davis’ anticolonial, nationalistic project informed all of his writing. Instead, Topic 69 is the third topic—still a prominent topic, to be sure, and still much more common in this text than in most other texts in our corpus, but not quite as dominating as I predicted.

So, which topics occur more often in Davis’ poems than Topic 69? Topics 91 and 137. In one sense, this result should not be shocking, since both of these topics are incredibly common in our corpus: Topic 91 is the eighth most common topic and Topic 137 the tenth most common. Both topics include plenty of characteristically poetic words. Topic 91 contains numerous references to facial features and emotions: thy love tis oh thee heart sweet song bright still yet eye fair thou may oft though joy while whose nor dear light smile brow vain hope hour maid thine hath how flowers home beauty gay notes page youth round loved can young tears day ye note grave breast away. The following visualization bears out the association between Topic 91 and poetry. The graph also demonstrates that this topic reached its peak usage in the first half of the nineteenth century, becoming significantly less common after about 1850. Davis, writing in the early 1840s, made use of this topic as it was declining in popularity.

The second most common topic in Davis’ poems, Topic 137, describes features of both nature and human beings (like sea night dark deep air light wind sky wild sun waves beneath round sound green earth voice shore clouds mountain pale land stood heard till bright moon eyes red death along waters wave heart cloud eye blue hear blood came rose storm around winds forth high fire dead white), indicating much use of the pathetic fallacy. As with Topic 91, the texts that most rely on Topic 137 are volumes of poetry. Another similarity between the two topics is that the visualizations show them both declining in the latter part of the nineteenth century (though Topic 137’s decline seems less dramatic than Topic 91’s). Again, Davis was using the words associated with Topic 137 as they were becoming less common.
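For anyone curious how a statement like "Topic 69 is the third topic" gets read off the model, here is a minimal sketch. It assumes each volume comes with a proportion for every topic; the function name and the numbers are invented for illustration, and only the ordering (91, then 137, then 69) reflects what I found for the poems.

```python
def rank_topics(proportions):
    """Return topic ids sorted from most to least prominent in one volume."""
    return sorted(proportions, key=proportions.get, reverse=True)


# Hypothetical topic proportions for The Poems of Thomas Davis.
# The values are made up; only their relative order follows the post.
poems = {69: 0.09, 91: 0.14, 137: 0.11, 143: 0.06}

ranking = rank_topics(poems)
position_of_69 = ranking.index(69) + 1  # 1-based rank of Topic 69

print(ranking)
print("Topic 69 rank:", position_of_69)
```

The same sort, applied to proportions summed over the whole corpus instead of a single volume, is one plausible way to arrive at corpus-wide claims such as "Topic 91 is the eighth most common topic."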

Previously, I said that Davis’ reliance on Topics 91 and 137 is not surprising because of the commonness of these topics in nineteenth-century poetry generally. However, I believe that his use of these words is significant precisely because they are so common. As I mentioned earlier, Davis’ nationalistic rhetoric promotes the notion that Ireland’s pre-colonial culture is reclaimable; it also emphasizes the differences between British and traditional Irish culture. The prevalence of such common language in his poems suggests that colonization had effectively eliminated some of these cultural differences: topic modeling shows that Davis did not write characteristically Irish poetry (at least not at the level of word choice), as the authors most associated with Topics 91 and 137 are mostly English and American. Most likely, the similarities between Davis’ poetry and that of English and American writers stem from their use of a common language (English). Ironically, Davis dealt with the problem of the Irish using English in “Our National Language,” an essay published in 1843. In this essay, Davis writes that the Irish people cannot be truly decolonized until they cease using English, the language of their colonizers, and revive the long-outlawed Irish language.

Several of the other top topics for The Poems of Thomas Davis (Topics 143, 102, and 132, for example) are also very common in English poetry, further demonstrating the success of Britain’s cultural colonization of Ireland.

These are just a few of the interesting results I obtained by topic modeling Davis’ works. I don’t know how much significance these results hold for anyone besides me—and I’m not sure how interested any of the rest of you are in this area—but I have enjoyed looking at Davis’ writing in a new way.

–Rebecca McCloud

Reading Recommendation: The Elyse Edition

I was interested in the extent to which the rise of digital humanities might have a transformative power on the way everyone “does” humanities. I decided to go with this entry from a blog on The Chronicle of Higher Education website. The usual author is apparently an anonymous figure named Prof. Hacker, but this particular entry is by a woman named Adeline Koh. Koh attempts to outline the ways in which a DH scholar might approach current academic conventions surrounding tenure and promotion; her suggestions may also be useful for grad students.

Among these items is the statement “blind peer review is dead,” and the post offers an alternative in “post-publication” peer review. While it makes me feel a bit stodgy, I must admit I’m with the bald heads and white beards on this one, in that I find this concept horrifying. However, I plan to read the article more closely, and may be won over by our class meeting.

Here’s that link again:


Some results of topic modeling.

This is going to be a post with a lot of pictures and not a lot of interpretation. I’m actually using it in lieu of a class handout, because it’s easier than printing 40 pages in full color! The interpretation will be up to us in class.

These images are based on a collection of 1629 volumes from 1751 to 1902, including fiction, drama, poetry, biography, and some essays. The collection was topic-modeled using LDA. The plots you’re looking at here indicate the volumes that prominently featured a given topic, giving some indication of date and genre. BrowseLDA3.R can give you more detailed metadata on the volumes where a given topic appears prominently.

In each case it’s important to note that the gray curve represents an overall frequency that is based on the whole collection, not just the volumes plotted as individual points. (The script only plots the top 1/4 of the volumes, because plotting 1629 points is like plotting fog.) Also, the collection has more drama and poetry in the 18c than it does in the 19c; remedying this is a priority for us going forward.
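To make the difference between the plotted points and the gray curve concrete, here is a rough sketch in Python. It is not the code in BrowseLDA3.R; the record layout, the function names, and the four sample volumes are all assumptions made up for illustration, and the "curve" here is a plain unsmoothed yearly average.

```python
from collections import defaultdict


def top_quarter(volumes, topic):
    """Keep the quarter of volumes in which `topic` is most prominent."""
    ranked = sorted(
        volumes, key=lambda v: v["topics"].get(topic, 0.0), reverse=True
    )
    keep = max(1, len(ranked) // 4)  # always plot at least one volume
    return ranked[:keep]


def yearly_mean(volumes, topic):
    """Average proportion of `topic` per year across ALL volumes."""
    by_year = defaultdict(list)
    for v in volumes:
        by_year[v["year"]].append(v["topics"].get(topic, 0.0))
    return {year: sum(vals) / len(vals) for year, vals in sorted(by_year.items())}


# Invented records: title, year, and topic proportions for each volume.
volumes = [
    {"title": "A", "year": 1800, "topics": {109: 0.30}},
    {"title": "B", "year": 1800, "topics": {109: 0.10}},
    {"title": "C", "year": 1850, "topics": {109: 0.05}},
    {"title": "D", "year": 1850, "topics": {109: 0.01}},
]

points = top_quarter(volumes, 109)   # what would be drawn as individual dots
curve = yearly_mean(volumes, 109)    # what the overall curve summarizes

print([v["title"] for v in points])
print(curve)
```

The point of the sketch is only that the dots come from a filtered subset while the curve is computed from every volume, so a year can have a visible curve value even where no dots appear.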

Topic modeling often introduces you to things you “already know” sort of in the sense that a map of your home town shows you “things you already know.” You may have been on most of those streets. But a picture, from above, showing the spatial relationship of the parts … is not exactly the form in which you experienced them.

We know about the eighteenth-century discourse of sensibility, for instance. But in this modeling run, sensibility divides into two parts. Topic 109 contains a lot of epistolary fiction — and, less prominently, drama — and it expresses tenderness in these terms: dear am happiness myself moment shall most adieu amiable heaven heart happy alas cannot present can passion creature situation charming dearest friendship ever letter confess mine sentiments unhappy truly esteem honour tenderness both distress thousand affection conduct wishes wish possible.

By contrast, Topic 45 contains fiction and poetry, and peaks slightly later.

It expresses sensibility, if possible, even more tenderly: heart mind bosom spirit soul tender every nature generous affection virtue virtues parent tenderness youth whose delight friendship feel affections ardent lovely tears sensibility passion object pure gentle beloved gratitude passions sweet frame form human pride sentiment anguish sacred kind. This discourse may have a slightly less epistolary, or less first-person, emphasis.

I don’t want to overinterpret the next graph, because it’s quite possible that the generic division is an artefact of the collection rather than a feature of underlying print culture. But those black circles are books on peerage, and biographies of various noble characters, especially including Byron.

What I find surprising in the next one: the volumes where this topic is most prominent include novels by William Godwin and Brockden Brown — and also biographies, of William Godwin and Brockden Brown! Whatever affliction is happening here, even talking about Godwin and Brown seems to cause it. The affliction might be Jacobinism, I suppose, because a lot of the volumes further down the list are Jacobin novels.

Here are some of the terms prominent in the topic: myself situation mind present most degree scarcely however period conduct character travels might human intercourse thoughts circumstances conceived means purpose sentiments motives existence produced condition appeared temper scene perhaps person attention nothing own species regard time already confidence circumstance every. The stray novel in the 1840s there is Melville’s Typee.

Finally, here’s something for you later 19c types. I’ve been saying in class that there’s something weird about faces and body parts in later 19c fiction. Here it is again, mixed up with architecture and spatial orientation somehow.

Whatever this topic is, it always involves “looking” through “windows.” Door face back eyes looked room stood window down suddenly turned yes hand moment looking walked sat round voice light across cried white front herself girl look chair floor standing away behind table fire dark wall opened slowly towards between. We could say this is just “concrete,” but it’s concrete in some weirdly specific way. It seems to me that there’s something here — a periodizable discourse analogous to 18c “sensibility” — waiting to be defined by anyone who can figure out what the heck it is. The authors represented here include Thomas Hardy, Olive Schreiner, Mary Augusta (Mrs. Humphry) Ward, and A. Conan Doyle.

Finally, just to prove that we do have some 19c poetry, here’s Topic #2, which interestingly bridges late-Romantic and Victorian poetry, including Byron and Felicia Hemans, but also Elizabeth Barrett Browning, Harriet Beecher Stowe, and the occasional volume of fiction by Bulwer-Lytton or C. R. Maturin.

I actually am not entirely sure what to call this: dark light wild voice night through around dead spirit earth beneath like dim deep darkness stood death dream still sound sleep heard grave pale words sun silent fell waters bright gaze within shadow cloud near burning cold alone calm vision.

Two different ways of thinking about “religious” vocabulary.

I hope a picture is worth a thousand words, because I don’t have time for even a hundred words right now. But in re: Benjamin’s remarks about religion below, here are two different ways of looking at the Google ngrams dataset. On the one hand, drastic decline:

On the other hand, not doing so bad:

These images come out of work Loretta Auvil, Jordan Sellers, Boris Capitanu and I presented at the Chicago Colloquium on Digital Humanities and Computer Science last fall. In a sense the contrast between those two pictures contains a compressed version of a lot of recent discourse rethinking the meaning of “secularization.” I’ll have to leave the meaning of the contrast elliptical for now, though.

Benjamin D. O’Dell: Thoughts on Frederick Gibbs and Daniel Cohen’s “A Conversation with Data,” Victorian Studies (Autumn 2011)

Literary scholars and historians have generally been reluctant to accept the value of text-mining as a tool of analysis.  Given that a word’s meaning is context dependent and often ambiguous, they argue that text-mining deprives historical documents of their context and transforms complex content into decontextualized data.  In contrast to these criticisms, advocates of the digital humanities have been quick to challenge the notion that digital humanities’ research methodology represents a dramatic—and, by extension, misguided—shift in the direction of conventional humanities scholarship.

As an illustration of this point, Frederick Gibbs and Daniel Cohen’s contribution to the Autumn 2011 edition of Victorian Studies offers a brief example of how one might combine methods of distant and close reading to either confirm or correct previous assumptions.  Drawing from the themes and values articulated in Walter E. Houghton’s influential 1957 text The Victorian Frame of Mind, 1830-1870, Gibbs and Cohen compare key themes from Houghton’s study such as “hope,” “faith,” and “heroism” with data from Google’s Ngram Viewer to determine whether or not the values articulated in Houghton’s text hold up to the test of quantitative analysis.

In order to address the issue of context, Gibbs and Cohen restricted their data to titles, based on the assumption that “word choice in a book’s title is far more meaningful than word choice in a common sentence” (71).  Their initial efforts subsequently produced a large collection of graphs “portraying the changing frequency of thematic words in titles, which were arranged in grids for an initial, human assessment” (72).  From there, the pair analyzed the data for trends.
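As a rough illustration of what "the changing frequency of thematic words in titles" could mean in practice, here is a minimal sketch in Python. It is not Gibbs and Cohen's actual procedure, which the article describes only in outline; the function name and the four sample titles are invented, and the measure used (the share of a year's titles that contain the word) is an assumption.

```python
import re
from collections import Counter


def title_share_by_year(records, word):
    """Fraction of titles per year that contain `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    totals, hits = Counter(), Counter()
    for year, title in records:
        totals[year] += 1
        if pattern.search(title):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Invented (year, title) pairs, purely for illustration.
records = [
    (1840, "The Love of God"),
    (1840, "A Tour in Wales"),
    (1880, "Godfrey's Travels"),  # contains "God" only inside another word
    (1880, "Notes on Geology"),
]

shares = title_share_by_year(records, "God")
print(shares)
```

Even this toy version shows where context slips back in: the whole-word match keeps "Godfrey" from counting as "God," but it cannot tell a devotional title from an ironic one.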

While most of the graphs produced failed to provide an easily recognizable pattern, Gibbs and Cohen did notice one interesting trend: a decline in religious titles incorporating words such as “God” and “Christian,” starting in the mid-1840s and proceeding rapidly between 1850 and 1880 (72). At a disciplinary level, these findings complicate conventional historical accounts of the Victorian crisis of faith outlined in texts such as Houghton’s Victorian Frame of Mind in that they place a decline in Victorian faith several decades earlier than has traditionally been assumed. Thus, Gibbs and Cohen conclude that “here, publishing appears to be a leading, rather than a lagging indicator of Victorian culture” and suggest that future scholarship pay greater attention to “the overall landscape of publishing” as opposed to the conventional canon of primary and secondary texts (73).

Overall, while this article is primarily concerned with cultivating a set of “best practices” for digital humanities scholarship, I found Gibbs and Cohen’s investigation of Houghton’s Victorian Frame of Mind to be one of the more compelling projects in the digital humanities that I have encountered thus far. As with other contributions to the Autumn 2011 edition of Victorian Studies (such as that of Maurice Lee, who verifies a New Historicist argument about Moby-Dick’s critique of American slavery and capitalism), Gibbs and Cohen seem particularly enthusiastic about the potential for digital humanities scholarship to act as an important form of “peer review” in determining what counts as valid interpretation in historicist criticism.

This seems to be a new, and necessary, direction for the digital humanities as it gains status within the historical disciplines, although I imagine that it will not resolve objections from scholars in subfields such as queer theory and the history of sexuality, who can offer legitimate complaints about the insufficiency of quantitative data to verify aspects of the past pertaining to complex and intellectually significant topics such as sexuality and desire. A logical “next step” for digital humanities research, then, would appear to be for the field to acknowledge more clearly the existence of fields of inquiry that cannot be verified by its methodology. Gibbs and Cohen provide a starting point for this practice in their study, in which they note that “Flexibility is crucial, as there is no monolithic digital methodology that can be applied to research questions” (76).

Benjamin O’Dell


So I was slightly chagrined to discover that a bunch of material that I was interested in for the purpose of constructing my fields list, which I painstakingly located through library databases and archives, and was about to dig through a bunch of microfilm to find, was already available on the internet, in several forms (including on Wikipedia!). However, a lot of it is available in a form that’s not terribly useful to me–being able to track patterns, identify dominant key terms, etc. would be awesome.

However, a key question that continues to be raised in Digital Humanities (which Moretti poses at the beginning of “Maps”) is: when is all of the time and energy it takes to deploy DH methods worth the utility of the results you might get? Is it worth it for me to put a bunch of time and energy into building my own text archive of obscure popular fiction just so I can MONK it out–for what will probably never amount to more than a bullet point, maybe a paragraph of a paper or dissertation? Is it worth it for Moretti to spend a bunch of time and energy tinkering with maps, in order to establish a dubious pattern in the living-spaces of French heroes and their objects of desire? What if the perfect circle business hadn’t emerged from his charting of the country walks of Our Village? (And even as pleased as he is with the map, he admits he’s not sure what to do with it yet.)

What is going to have to happen for these tools to become useful in the sense that they actually make our research fuller and easier, as opposed to more cumbersome and resulting in unfortunately thin textual analysis because we have to spend so much time analyzing our methods that we inevitably neglect the texts? (Or, am I missing the point entirely?)

Some thoughts on algorithmic criticism

Over the past couple of weeks, one of the central arguments we have encountered in class is the notion that, despite the buzz surrounding its potential, the core issues of the digital humanities are not radically different from those of the humanities in general.  To take a case in point, in his talk for the MLA, Ted noted that scholars have been using “digital tools” for a long time to conduct basic searches through archives and databases.  As a result, he suggests that the primary novelty associated with the growth of digital humanities as a recognized field has not been a shift in practice but rather a shift towards heightened reflexivity about the search process.

Clearly, in thinking about the application of digital tools to literary criticism, it makes sense to reflect on the way searches are conducted.  Yet, in practice, I found it interesting that while the Stanford Literary Lab pamphlet on “Quantitative Formalism” and Tanya Clement’s article on Gertrude Stein’s The Making of Americans were fairly open in discussing the process of data collection, neither seemed particularly transparent about acknowledging the continuities between their interpretative practices and the tradition of twentieth-century literary criticism.

To be specific, as I read through these pieces, I found myself hung up on what may appear to be an obvious question (at least at first glance): Is the distant reading of algorithmic criticism little more than a form of “close reading” for the digital age? Unlike commentators who have expressed anxiety that the kind of distant reading Franco Moretti advocates in Graphs, Maps, Trees might get in the way of close reading, I have a different problem: I fear that it might replicate the worst tendencies of New Criticism, namely ahistoricism. When we think about New Criticism, we probably think of W.K. Wimsatt and Monroe Beardsley’s emphasis on “the text itself” in essays like “The Intentional Fallacy” (1946) or concepts of “unity” found in Cleanth Brooks’ work on the heresy of paraphrase for The Well Wrought Urn: Studies in the Structure of Poetry (1947). Like New Criticism, algorithmic criticism locates meaning in the text itself; however, as we have seen in several examples, it is often far more interested in word choice as a matter of quantification. But as most of us probably know, the problem with formalism is that it has the potential to obscure the subjectivity of interpretation, as well as the historical context of literary production. Consequently, my concern is that if we don’t think carefully about how we use these tools, we run the risk of repeating the missteps of previous critical movements.

I found myself coming back to this problem again and again in the work from the Stanford Literary Lab and Tanya Clement, albeit to varying degrees. Since the Stanford Literary Lab’s work on quantitative formalism is primarily focused on technology and the research process, it is difficult to critique, as its observations are largely cursory and anecdotal. Thus, while the discovery of the fact that Dickens’ “language remains basically the same” as he moves from novel to novel doesn’t advance my knowledge of Dickens or nineteenth-century literature, it does tell me a good bit about some of the problems one encounters when designing and applying tools (15).

Far more troubling than the meta-reflective nature of the Stanford pamphlet is Clement’s work. In short, her reliance on “the data,” which stands in for the text itself, seems to assume the existence of a unity or pattern that may or may not exist. Although I don’t claim to be a specialist on Gertrude Stein, the fact that other modernists such as Virginia Woolf and James Joyce often edited their work to obscure meaning leads me to question the utility of looking for a discernible structural arc in a highly experimental work. Ultimately, I’m suspicious about whether Clement advances our understanding of a work like The Making of Americans. In many ways, she seems to transform a highly affective, rhythmic, poetic work into little more than a bad grammar lesson.

For the sake of clarity, in levying this critique, I’m not suggesting that digital tools don’t have an application. But I do think that it is important for us to consider whether or not essays like Clement’s alienate readers in their desire to replicate scientific realms. Whether or not it’s fair, literary theory and criticism have often been accused of incomprehensibility. As interest in the digital humanities presents an opportunity for the humanities to become more relevant, it seems foolish to squander such cultural capital in a turn to the quantitative.

In my own practice, I have envisioned using digital tools in a supportive mode for arguments that are rooted in more established forms of theoretical discourse. Am I an outlier in this regard, or do others share my interests and concerns? I’d be curious to hear more about the varying assumptions and aspirations that others have brought to this course.

