Topic Modeling and The Ladder

Hi guys.

I’m quite late to this party, but since I didn’t participate in the round-table discussion/blog post week, I thought I should share what I’ve been working on with everybody now. Below is an excerpt from my paper. Thanks everybody, for being awesome; this was a great class. I would love to read some updates!

For this project, and for me as a scholar in general, it was important to figure out some way to balance the specific needs of my archive and still put what was relevant about DH methods into practice. Following Tanya Clement’s example of a close-distant-reading, I attempted a simple “bag of words” type analysis on some works by one of my pet authors, Edna Ferber; because she was so widely read in the 30s and 40s, yet nobody cares about her now, it was fairly easy to find a good number of her texts online in .txt format. The results were profoundly uninteresting. Ferber is a straightforward writer, known for her interesting characters more than complex plots or innovative style. In fact, this experience gave me a new understanding of Tanya Clement’s suggestion in “‘A Thing Not Beginning and Not Ending’: Using Digital Tools to Distant-Read Gertrude Stein’s The Making of Americans” that digital methods proved that Stein was a “genius.” Stein may not have been secretly writing in these patterns on purpose, psychically intuiting that someday, almost a century later, someone with a computer would find them, but it does appear to be true that a text has to be pretty complex in order for a close-distant-reading like that to produce interesting results. I would need something bigger, over a longer period of time, that was still at least somewhat unified by something, if not a specific author.

Enter The Ladder. The Ladder was a monthly periodical that ran from 1956 to 1972. It was written, printed, and distributed by the lesbian organization the Daughters of Bilitis in San Francisco. It was a small magazine, in every sense of the word: small subscribership, only about twenty pages per issue, and a relatively brief 15-year run. But the journal was tremendously important to the disenfranchised community it served: namely, the ever-invisible lesbian population in 1950s America. McCarthy-era anti-gay paranoia meant that publishing and distributing such a text was fairly risky (indeed, the Daughters of Bilitis had an extensive FBI file), but it was vital for establishing a lesbian community and/or support network for those who feared for their jobs and friendships. The Daughters of Bilitis’s expressed purpose in creating the magazine came down to two basic concepts: 1. educating people about what it means to be a gay woman, and 2. archiving, reviewing, and soliciting lesbian-centered literature. If the literary field as a whole functions as a “collective system,” as Moretti insists, The Ladder was explicitly concerned with facilitating its own smaller version of that system for lesbian women in the 1950s United States. As a magazine, it’s barely a dent in the large corpus of twentieth-century magazines. As a medium of communication between individual members of an otherwise-estranged community of readers and writers, it’s a perfect resource, and though it is not large in the distant-reading sense, it is large enough that a traditional close reading of all 180 or so issues is less effective than a comprehensive view of the publication using digital methods.

Given the wide-ranging, system-based goals of The Ladder, topic modeling seemed like it would produce the most interesting results. Obtaining the data would prove to be extremely feasible, because one of the best things about The Ladder is how carefully it has been preserved and archived. Though the regular subscriber list never topped more than a few hundred, issues were shared, read, re-read, and stored much more widely than these numbers suggest. Several archival projects have taken an interest in the Daughters of Bilitis and their landmark publication, and as a result it’s available in several libraries—including the Rare Books and Manuscripts library here at U of I—and it’s even been digitized by the LGBT Life database. I had my entire corpus available to me, already in a digital format.

Due to time constraints, I ended up stopping at 1963, with more than 1000 .txt files for topic modeling. Though the project seemed exceedingly promising, and though I was learning more and more about my corpus firsthand through OCR correction, I started my project far too late to make the topic modeling happen by this point. I strongly wish I had come up with this idea earlier in the semester, so I might have been able to do a more thorough job.

For now, I did a quick “bag of words” scan on about half of the issues from 1957, the first full year of The Ladder’s publication and the first year in which Phyllis Lyon cast aside her pseudonym Ann Ferguson in favor of her real name. The 20th through 80th most commonly used words for the January through July issues, excluding April, are below.

The term “homosexual” is not insignificant here: it was a fairly new term at the time, as the more commonly used word was Havelock Ellis’s “congenital invert” to describe gay citizens. Active verbs like “make” and “can,” in conjunction with the word “public,” are likewise unsurprising, considering that The Ladder placed itself at the frontier of a social movement aimed at cultural acceptance of homosexuality as a minority identity. I’m struck by the lack of the word “lesbian” (or Lesbian, as The Ladder typically capitalized the word), because my understanding of the periodical was that it was the only one addressed specifically to gay women, and that this is what differentiated it from associated homophile publications like The Mattachine Review and ONE. I am equally struck by the appearance of the word “love” on this list, though upon second glance it’s less puzzling when one sees the words “book” and “subject” on the list: much of the lesbian literature being published and reviewed by The Ladder was centered on forbidden love stories. While this approach to text mining has obvious limitations, and is not itself enough to compose a compelling argument, I think enough interesting things emerge, even from a simple text analysis, that it bodes well for a future project involving topic modeling and The Ladder.
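For anyone who wants to try a scan like this, here is a minimal sketch in Python. It is not the tool used to produce the list discussed above: the function names and the toy sample text are invented for illustration, and the real input would be the OCR-corrected .txt files.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ranked_words(texts, start_rank, end_rank):
    """Return (word, count) pairs for ranks start_rank..end_rank (1-based, inclusive)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    ranked = counts.most_common()  # sorted by count, highest first
    return ranked[start_rank - 1:end_rank]


# Toy stand-in for the issue files (hypothetical text, not from The Ladder).
sample_issues = [
    "the public and the homosexual and the law",
    "the book and the subject of love",
]
print(ranked_words(sample_issues, 2, 4))
```

Calling `ranked_words(texts, 20, 80)` on the real files would give the 20th through 80th most common words. Skipping the very top ranks works as a rough stand-in for a stopword list, since those ranks tend to be filled by words like “the” and “and.”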

Posted in Uncategorized | Leave a comment

Variations on the theme Boys & Unicorns & Sparkles: What is a librarian to do?

I want to focus on where librarians have placed themselves in the field of Digital Humanities.

What kind of Help guides and Tutorials are available through libraries?

What kind of resources do libraries have online with regard to text files/ program tutorials?

This is an impossible task: it would require a master list of every library, and it would require every librarian to publish their work regarding their involvement in DH. Every reference question, every referral, must be counted toward an adequate accounting of the librarian’s presence in the world of DH.

I began this process with my own librarian. For about half an hour we talked about all the institutions she had worked with, or knew of, that had Digital Humanities programs. She then directed me to the Association of Research Libraries’ (ARL) SPEC Kit 326, from November 2011, regarding Digital Humanities. I went into the book stacks and pulled a softcover catalog off the shelf to consider a library’s role in DIGITAL Humanities. The SPEC Kit is a survey that was sent out by the ARL, to the ARL libraries, regarding Digital Humanities. “Sixty-four of the 126 ARL members completed this survey for a response rate of 51%” (Bryson, 11). The survey attempts to record trends, practices, and procedures of ARL libraries with regard to Digital Humanities scholarship. I laughed out loud as I flipped to the back of the publication. The last half of this catalog was pages of screenshots of library web pages. To study library influence in the Digital Humanities, I was looking through printed pictures of web pages and typing in the URLs of the institutions one by one.

I have been combing through the web pages of every institution that responded to the ARL survey last year. I collected all the mission statements of these libraries’ Digital Humanities centers. I decided to perform a very flawed text analysis with a free online tool.

Wordle: DH: Missions

Wordle is simply counting the words in my text and then increasing the size of the word as the word gets mentioned more frequently.
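To make that concrete, here is a rough sketch of the count-then-scale idea in Python. I don’t know how Wordle actually computes its sizes, so the linear scaling, the function name, and the size range are all my own assumptions.

```python
import re
from collections import Counter


def word_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its count."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low or 1  # avoid dividing by zero when all counts are equal
    return {
        word: min_size + (max_size - min_size) * (count - low) / span
        for word, count in counts.items()
    }
```

The most frequent word gets `max_size`, the least frequent gets `min_size`, and everything else falls in between. Nothing here looks at meaning or context, which is part of why the analysis is so flawed.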

What I found from this rather silly exercise is that librarians are using mostly vague positive language with regard to their Digital Humanities missions. (Not surprising, but why should it be so?)

A preliminary search of the ARL respondents’ web pages revealed a huge variation in institutional commitment to digital scholarship. Some libraries have one-page explanations of the computers that are available in their computer labs, listed under their DH web presence. Other institutions have blogs and resource guides about text mining practices. The most interesting part of this search was the realization that all the institutions my librarian had mentioned as DH sources were absent from the ARL survey respondents. Libraries seem to be interested in the prospects of DH but at the same time disconnected from the very digital world of DH.

I have found some pretty amazing resource guides buried in the ARL respondents’ web pages. I plan on compiling these helpful resources in an annotated web bibliography, as a way to use the ARL DH survey as a resource list rather than a list of mission statements. I plan on comparing this list of resources to the ones suggested by my librarian as a way of performing a very qualitative comparison of DH information aggregation.

More Soon:

Julia Pollack

Bryson, Tim. Digital Humanities. SPEC Kit 326. Washington, DC: Association of Research Libraries, 2011.


Topic Modeling Personal Stories from September 11, 2001

One of my research interests is the formation of literary responses to trauma, and I have been interested in finding ways digital humanities can help uncover trends and themes in post-traumatic literature. As part of my Master’s thesis I examined the emergence of 9/11 fiction as it evolved from short personal essays and journalistic pieces to short stories and novels. I started by examining personal essays published in collections like 110 Stories: New York Writes After September 11 and Afterwords: Stories and Reports from 9/11 and Beyond and examined how certain formal themes and constructions introduced in these early responses were reiterated in later literature about 9/11. Through close reading, I argued that these themes were central to the continuing post-traumatic response process.

For my Digital Tools project I am interested in continuing the work of examining personal responses to 9/11 to see how one begins to process, understand, and form a linguistic response to a traumatic event. I am specifically interested to see if topic modeling may uncover new significant trends in the early responses to 9/11 that are also apparent in the later literature on the attacks, which in turn may clarify how authors responded fictionally to the trauma. Hopefully, the topic modeling of these personal stories can help shed light on the way we narratively and imaginatively cope with a devastating national tragedy.

I am fortunate enough to have been generously granted access to the data collected by the Center for History and New Media and the American Social History Project/Center for Media and Learning’s September 11 Digital Archive, a project that has sought to collect and preserve the history of the attacks and their aftermath. The archive has collected personal emails and stories from different sources, some individually donated by the creators of the digital material, and some collected by, and acquired from, national institutions like the Library of Congress and the Smithsonian. I will be topic modeling the personal stories from various institutional sources (as they are more standardized and have fewer textual inconsistencies) in order to see what larger trends occur throughout time. So far, it looks like we will be able to search for topics and break down data by time of composition, gender, and (at times) location of authors to see if interesting trends emerge depending on temporal or spatial distance from the epicenter of the attacks. In understanding more thoroughly the ways we respond to events like 9/11, I hope that a project like this can pave the way for further discussion on how we process, imagine, narrate, discuss, and eventually come to terms with trauma.

–Jessica Young


Intentions, Conventions, and Topics

I. “Sonnet Boom: A Thumbnail History”

My project for this course is to use topic-modeling to explore a collection of texts that literary historians associate with the “Elizabethan sonnet boom.” The history, in brief, is as follows: the sonnet form was imported from Italy to England during the reign of Henry VIII. Sir Thomas Wyatt was one of the first practitioners of the form; he composed original sonnets, but most of his poetic work consists of “translations” (really, they’re more like what we would today call adaptations) of Petrarch.

When Elizabeth took the throne in 1558, it became possible for poets to adapt the standard Petrarchan set-up (of a male poet-lover trying to woo a female love-object) for the purposes of flattering a female monarch and getting patronage. This and other historical circumstances led to a huge explosion in the production of sonnets during and just after Elizabeth’s reign. And while there were plenty of one-off sonnets produced in the period and later collected into poetic miscellanies, a large percentage of the sonnet output took the Petrarchan form of the sonnet sequence.

The sonnet fad extended well into the reign of James I, Elizabeth’s successor, but it had largely died out by the 1620s and 30s.

II. “Intentions, Conventions, and Topics”

Sonnet sequences are hyper-artificial, and they consciously engage with a pre-existing set of conventions. Part of what sonnet writers hope to do is to make a verbal object that has been highly labor-intensive while seeming to be dashed off in a moment of passion or inspiration – an ideal familiar to Renaissance scholars as ‘sprezzatura.’ Sonnet-writers aim at total rhetorical control: think of it as proto-Nabokovian.

They’re also highly conscious of the conventions of their genre. Petrarch laid the foundations, and sonnet sequences as a corpus are conspicuous for the way they really work closely on and elaborate a rather limited set of tropes, techniques, and themes.

In a way, these two facts—that sonnets are highly-intentional verbal objects, and that they consciously rely on a set of stock tropes—might be seen as a very good reason not to perform topic modeling on sonnet sequences as a corpus. If topic-modeling locates and defines “discourses,” and if the discourses subtending the sonnet tradition are really, really well known (not only because they have been studied endlessly, but because they have a common root in an identifiable set of master-texts), then it might be that the results of this project are, well, boring. That is, I have no doubt that I’ll turn up some topics or discourses that are recognizably Petrarchan, and some that are the result of the extreme artificiality of sonnet-language.

III. “Three Potential Payoffs”

So what, then, might be the payoff of such a project?

I hope to find, over and above some things that are recognizable, a few topics that are more surprising: non-Petrarchan in origin and not necessarily tied up with the intense rhetorical control endemic to the sonnet form. I think this would be an interesting result in two ways: first, it might tell us something we didn’t yet know about the corpus; second, it might help to clarify some things about topic modeling as a technique.

The first claim is pretty obvious, and it’s one of the big benefits of topic modeling: we become aware of topics that exceed authorial intent and that go beyond standard and well-worn tropes. The second claim is less obvious, and more speculative, and deals with the question: what is a topic, anyway? Hopefully, because I’m dealing with a corpus that’s maximally intentional and conventional, the lines between topics, stock-tropes, and intentional grammatical forms will be “harder,” and less fuzzy.

Finally, and most speculatively, I hope that, since my corpus is actually rather small, I might be able to do some work in thinking about the change of topics over time. When James comes to the throne, is there a shift in topics? That is, does the discourse subtending sonnets change with the gender of the monarch? How about well-known political events: the Spanish Armada (1588) say, or the late-Elizabethan succession crisis? Will these register in terms measurable by topic modeling?

Until the results come in, it’s hard to float guesses. Hopefully, though, this will help to explain the rationale of my project.


A conversation among journal editors on theory, history, and database

So, to start with, I have two projects on the stove at once: one I should have finished and sent out long ago, and the other, which is now riding its own melting. This second project was at first conceived as a sequel to the earlier one, but it is turning into a prequel.

First, the old project, with all but the writing finished before this course started: I am trying to find evidence of the effect of large databases of primary texts (Early English Books Online, American Periodical Series online, the Whitman Archive, Google Book Search, etc) on scholarly practice in the humanities. I’ve gathered four and a half threads of evidence:

* ideological statements about these databases made in public fora (professional journals, listservs, etc);

* an ambitious quantitative analysis of citations of primary texts in two prominent journals, American Literature and English Literary History, showing a dramatic increase in primary sources cited well after the rise of New Historicism and directly matching the rise of database;

* interviews with authors who published in those journals;

* some new history from interviews on why searchable text (not a simple enterprise, and one that was not a natural choice) was added to these collections when they were digitized;

* and my own theoretical argument for why search is a big deal, transformative rather than just a boost in efficiency.

I had some questions left over from this work. I wanted to find evidence of trends in scholarship that I’ve intuited but that I don’t have either the evidence or the stature to assert, specifically, two trends: a rise of a sort of historicism and an eclipsing of theory, or perhaps merely lapsing into a theoretical monoculture. I can put these claims in the mouths of my elders:

* A few years ago, Jane Gallop wrote about a job candidate who claimed it’s impossible to get published without archival work and complained that history has overtaken close reading.

* In a column before MLA ’12, Stanley Fish announced that “the theory days” are over.

I saw both of these trends as entangled with the third trend I could prove, the increasing adoption of database in literary studies, English in particular.

I conducted a series of interviews with prominent journal editors and asked them questions about trends they’ve seen in the submissions received and comments by reviewers. I asked the interviewees to reflect upon trends in the scholarship generally, whether or not they’re related to databases. And I referred to Gallop and Fish as prompts.

And, this is actually working.

I’ve completed interviews with Frances Ferguson of ELH, Cathy Davidson and Priscilla Wald of American Literature, David Raybin of the Chaucer Review, Nancy Armstrong of Novel, and Marianne Hirsch of PMLA. Gordon Hutner is posting a question on my behalf to a listserv for the Council of Editors of Learned Journals.

The questions I asked were motivated by my particular take on database and its effects, but I think they also serve to elicit a good characterization of the current state of literary studies in English. What I have now is a conversation about the canon, what counts as a discovery in 21st-century English studies, the use of archive, and the role of history and theory–all with database serving as the connecting thread.

I am planning to publish this as just that: a conversation of prominent editors about trends in scholarship. To the extent that it has a thesis, it is that database is central to understanding the forces converging in the moment in which literary scholars find themselves.

So, I personally think this is shaping up to be a pretty good project. I need to do some thinking about how to organize their responses.

/Alan Bilansky



Benjamin D. O’Dell: “The Captain is Bunk”

My final project has led me to follow in the footsteps of the Stanford Literary Lab’s first pamphlet and explore what use, if any, digital tools hold for the project of genre classification. To be specific, my objective is to test whether Charles Dickens’s mid-Victorian Bildungsroman David Copperfield (1850) can be appropriately understood “as an attempt by Dickens to recast domestic fiction as a masculine endeavor,” as Emily Rena-Dozier contends in her article for the Autumn 2010 edition of SEL (813).  Conventional wisdom holds that traditional notions of gender relations in England are best understood through the concept of the separation of spheres, whereby male authority is located in the public workplace and female authority is located in the privacy of the home.  In fact, so pervasive is this concept that it is not only the structuring premise for period studies such as Mary Poovey’s Uneven Developments: The Ideological Work of Gender in Mid-Victorian England but also more expansive works of literary scholarship like Michael McKeon’s The Secret History of Domesticity:  Public, Private, and the Division of Knowledge.  Previous commentators have often read nineteenth-century British domestic fiction as an extension of the separation of spheres by assuming that the genre embodies distinctly feminine traits.  In contrast to this view, Rena-Dozier uses information pertaining to David Copperfield’s composition, content, and initial reception to provide a compelling case for the need to break the association between femininity, the domestic novel, and the domestic sphere.  In so doing, she presents a new problem for scholars:  “how to understand domestic fiction and the domestic sphere without the habitual conflation of domesticity with femininity” (825).

In raising this question, Rena-Dozier gestures towards the daunting task of identifying formal features within the genre of domestic fiction that can be understood as either gender-neutral or masculine.  I believe that a close examination of domestic fiction using tools that have been developed within the digital humanities can provide empirical evidence towards quantifying Rena-Dozier’s claims about David Copperfield, as well as a starting point for generating hypotheses about how domestic fiction functions in relation to gender, more generally.  Be that as it may, one of the initial complications I have run into in collecting data for my project is determining what criteria to use as the basis for classifying texts.  For the sake of simplicity, I began by conducting a simple search for “domestic fiction” using the Internet Archive, the largest open-access digital library in the English language.  My search initially returned fifty novels that had been labeled with the term as one of the “keywords” in the metadata for each file.  After eliminating duplicate and foreign-language novels from my collection, I made a surprising discovery:  male authors were responsible for the majority of the texts that had been labeled “domestic fiction” in the Internet Archive’s database.  Recognizing obvious oversights in the collection, I decided to incorporate primary texts from the bibliography of Nancy Armstrong’s landmark text, Desire and Domestic Fiction: A Political History of the Novel, as well as my own knowledge of literary history to augment the representation of female authors.

In its present form, women account for the authorship in thirty of the fifty-seven examples of domestic fiction in my corpus.  Whether this collection can be called “representative” is less important to me than the fact that it is useful to the comparative nature of my project, which seeks to tease out associations between David Copperfield and gendered categories of fiction.  At the same time, I have noticed a few issues with the composition of the collection.  Although most of the titles on this list clearly have a place in this study, as in the case of any attempt to classify texts, there are strong arguments to be made for and against the inclusion of particular works.  Most troubling to me is the inclusion of M.B. Manwell’s 1898 novel The Captain’s Bunk:  A Story for Boys.  This novel first stood out to me for its peculiar title.  A closer examination of the text and its metadata quickly suggested that it is very different from the rest of the texts included in the collection (23).  Whereas most of the novels in the corpus contained “domestic fiction” as the first (and occasionally sole) tag, the keywords for The Captain’s Bunk on the Internet Archive placed “domestic fiction” as the third tag and included the following tags in addition:  “Christian fiction,” “Didactic fiction,” “Fishing villages -- Juvenile fiction,” and “Problem families -- Juvenile fiction.”  As if that weren’t enough, the image of a menacing Captain reprimanding a boy on the cover of the 1898 edition confirmed the notion that the novel is worlds away from the works by Jane Austen, Anthony Trollope, and George Eliot included in the collection.


Given the loose interpretation of the term “domestic fiction” on the part of the file’s creator, I was initially wary of including the novel in my project because it seemed to signal an obvious error in classification.  Yet I ultimately decided to keep it based on the belief that it may reveal things about domestic fiction that I do not already know.  As a matter of methodology, the question of whether or not to include Manwell’s novel seems to touch on a larger concern about the function of classification.  Contemporary skepticism toward the notion that genres represent static categories has largely eroded the significance of genre classification in scholarship.  But as the Stanford Literary Lab has demonstrated, digital tools have the potential to see genres that exist on levels beyond the purview of current literary frameworks.  By removing Manwell’s text, I would have been prematurely closing off the opportunity to identify potentially significant variation within the realm of domestic fiction.  In contrast, my actual list attempts to provide a glimpse at the many manifestations of domestic fiction written by male and female authors and may provide a list with more utility than established conceptions of the genre.

-Ben O’Dell


Rena-Dozier, Emily.  “Re-gendering the Domestic Novel in David Copperfield.”  SEL 50.4 (Autumn 2010):  811-829.  Print.


Project Update: Before you can manipulate the text…

you have to have text to work with. In my case, I have text, but it’s handwritten on parchment. I spent a good amount of last semester transcribing pages of text from University of Illinois MS 80, a small manuscript (about 3.5″ x 5.5″) of 122 folios (244 pages) into a Word document. While my enthusiasm to record as much of the information from the manuscript as possible in my transcription was admirable for my paleography project, the data I have will need some adjusting for my digital humanities project. I plan to use plagiarism software to see if I can better narrow down the original source material for the portions of MS 80’s text where the scribe simply states that some of the writings are from St. Bernard, from St. Anselm, and from St. Augustine. In order to run MS 80 (in Middle English) and the source materials (some Latin, some Middle English, some modern English, depending on the source and where I’m able to obtain a digital copy) through the plagiarism software to (hopefully!) detect overlapping text, I need unmarked, unformatted, digitized text.

As my digitized version of MS 80 stands now, several problems are immediately apparent:

1. I need to remove the table (not a big fix)

2. The punctuation needs to be standardized. This is a little more complicated; the periods sometimes indicate a sense break (comma) and sometimes indicate the end of a sentence (semicolon or period). Sometimes, they simply indicate that the reader should pause briefly before continuing. I’m still not exactly sure whether the slash marks indicate the end of a paragraph, a sentence, or both. It seems to depend on which scribe is copying at the time, and also depends perhaps on the formatting of the source material.

3. All abbreviations need to be spelled out, with parentheses removed. However, if the source material that I’ll be checking this manuscript against is formatted with parentheses, I will either have to remove those as well, or leave them in my text and be prepared for screwier data.

4. If the source material is modernized, I will need to create a modernized version of my manuscript. This is problematic in itself, as my vocabulary selection may not always match that of other editors.

5. The spelling, too, is not standardized. If my sources are in Middle English (my preference, to help decrease the number of variables involved here), their spelling will also not be standardized, nor will it be consistent within itself.

So, my original concern that I wouldn’t be able to find adequate software is turning out to be less of a concern as the issue of data preparation becomes more complex and unwieldy. I’m learning quite a bit about textual editing theory in the process, which I’m finding oddly enjoyable. The more I work on this, the less likely it seems that I will actually come away with new information about MS 80. I may end up simplifying this project by searching for similarities between two *known* borrowings (for example, MS 80 includes a well-known & oft-copied version of a prayer to Jesus), seeing how well the plagiarism software detects what scholars have already verified, and then determining whether or not this is a viable option for detecting textual echoes.
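To give a sense of what the cleanup and comparison might look like, here is a minimal sketch in Python. It is not the plagiarism software itself, just a stand-in under stated assumptions: the function names are invented, abbreviations are assumed to be marked with parentheses as in my transcription, and “overlap” is measured as shared three-word sequences. Spelling variation (problem 5) is not handled at all, so variant spellings of the same word would count as a mismatch.

```python
import re


def normalize(text):
    """Lowercase, drop parentheses around expanded abbreviations, strip punctuation."""
    text = text.lower()
    text = text.replace("(", "").replace(")", "")  # "ih(es)u" -> "ihesu"
    text = re.sub(r"[^\w\s]", " ", text)           # periods, slashes, etc. become spaces
    return text.split()


def shingles(words, n=3):
    """Return the set of n-word sequences in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(manuscript, source, n=3):
    """Share of the manuscript's n-word sequences that also occur in the source."""
    ms = shingles(normalize(manuscript), n)
    src = shingles(normalize(source), n)
    if not ms:
        return 0.0
    return len(ms & src) / len(ms)
```

Running `overlap` on a passage already known to be borrowed, and on one known not to be, would be a quick way to see whether exact matching of this kind picks up anything at all before investing in more preparation.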
