What is text encoding and why do we need it? This workshop will serve as an introduction to the Text Encoding Initiative (TEI), the de facto standard for creating, editing, analyzing and sharing electronic texts in the humanities. The TEI Guidelines define an extensible, XML-based vocabulary for explicitly describing the structural, visual and semantic features of any document so that it can be processed by computers and used in research, teaching and the preparation of digital editions. We will examine the concept of markup and XML encoding; discuss the virtues and complexities of the TEI scheme, its assumptions and its organization; provide a bird's-eye view of TEI elements; learn how to explore the Guidelines on our own; and, finally, assess various scenarios in which TEI may (or may not) come in handy in your own work.
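To give a first impression of what "processed by computers" can mean, here is a minimal sketch in Python. The sample sentence is invented for illustration and is not workshop material; the element names `p`, `persName` and `placeName` and the namespace are taken from the TEI Guidelines. Because names and places are marked up explicitly, a program can list them without guessing.

```python
import xml.etree.ElementTree as ET

# The official TEI namespace; the prefix "tei" is just a local shorthand.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# An invented one-sentence sample, encoded with TEI elements.
sample = """<p xmlns="http://www.tei-c.org/ns/1.0">
  <persName>Ada Lovelace</persName> wrote to
  <persName>Charles Babbage</persName> from
  <placeName>London</placeName>.
</p>"""

paragraph = ET.fromstring(sample)

# Collect everything the encoder marked as a person or a place.
people = [el.text for el in paragraph.findall("tei:persName", TEI_NS)]
places = [el.text for el in paragraph.findall("tei:placeName", TEI_NS)]

# Recover the plain reading text by dropping the tags and tidying whitespace.
plain_text = " ".join("".join(paragraph.itertext()).split())

print(people)
print(places)
print(plain_text)
```

The same markup therefore serves two purposes: it can be stripped away to give the reading text, or queried to build an index of names and places.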

This introductory workshop is targeted at an audience with a background in the humanities. No previous technical knowledge or programming skills will be assumed. Participants will be expected to bring their own laptops. 

Toma Tasovac is the director of the Belgrade Center for Digital Humanities, and chief programmer and editor of Transpoetika: A Digital Platform for Serbian Language and Literature. He has degrees in Slavic Languages and Literatures from Harvard and Comparative Literature from Princeton. Currently, he is pursuing a PhD in Digital Arts and Humanities at Trinity College Dublin. His research interests include complex lexical architectures in eLexicography, retrodigitization of historic dictionaries, and integration of digital libraries and language resources. Toma is equally active in the field of new media education, regularly teaching seminars and workshops in Germany, Eastern Europe, the Caucasus and Central Asia. He also blogs, and he tweets as @ttasovac.


Discovering Deep Semantic Structures in Large Corpora

Broadly speaking, topic modeling is a computational approach to the semantic analysis of large text collections. The technique now favored in the Digital Humanities, Latent Dirichlet Allocation or LDA, attempts to group words into semantically meaningful clusters, or "topics", by inspecting the co-occurrence patterns of these words in large document collections. The major advantages of topic modeling techniques are that they are typically "unsupervised" (they require no time-consuming, interpretive interventions by the user) and that they are to a large extent language independent. Once these clusters have been constructed, it becomes possible to plot, for instance, how the popularity of certain topics evolves over time, by calculating to what degree these topics are present in the individual texts of the historical corpus under investigation. In this tutorial and the subsequent hacking day, this and many other interesting applications of topic modeling in the up-and-coming research field of "Distant Reading" in the Digital Humanities will be explored.
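The idea of tracking topic popularity over time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the years, topic labels and proportions below are invented, and the helper `topic_share_by_year` is a hypothetical name rather than part of any toolkit. In practice the proportions would come from a topic model fitted to the corpus.

```python
from collections import defaultdict

# Invented per-text topic proportions, of the kind a topic model reports.
# Each entry: (year of the text, {topic label: share of the text}).
doc_topics = [
    (1850, {"war": 0.6, "love": 0.4}),
    (1850, {"war": 0.2, "love": 0.8}),
    (1900, {"war": 0.1, "love": 0.9}),
]

def topic_share_by_year(docs):
    """Average each topic's share over all texts from the same year."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for year, shares in docs:
        counts[year] += 1
        for topic, share in shares.items():
            totals[year][topic] += share
    return {
        year: {topic: total / counts[year] for topic, total in topics.items()}
        for year, topics in totals.items()
    }

trend = topic_share_by_year(doc_topics)
for year in sorted(trend):
    print(year, trend[year])
```

Plotting these yearly averages for one topic gives the kind of popularity curve described above. Averaging per year is one simple design choice; weighting texts by length or smoothing over decades are equally plausible alternatives.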

In the first part of this tutorial we will focus on the theoretical foundations of topic modeling (although we will skip most mathematical details). In the second, hands-on part we will walk the participants through an interactive tutorial for MALLET, a popular general-purpose toolkit for topic modeling. Finally, we will survey a number of additional resources and software packages, such as the recently introduced word2vec toolkit. The participants will then have the opportunity to continue working with MALLET on the Ben Yehuda texts on the second day of the THATCamp. The course instructors will also attend this second day to support the participants' efforts.

This introductory workshop is targeted at an audience with a background in the Humanities. No statistical knowledge or programming skills are required. Participants are expected to bring their own laptops so that they can do the exercises individually.

Mike Kestemont is currently a postdoctoral research fellow for the Research Foundation of Flanders at the University of Antwerp (Belgium), where he works in the research group for literary history, as well as in the CLiPS computational linguistics research lab. Mike's research focuses on Digital Humanities, in particular computational text analysis and stylometry for historic texts. His PhD tackled the problem of computational authorship attribution (in medieval Dutch literature), a problem which he continues to study in his current research (also for other medieval languages, like Latin). Mike is actively involved in many DH initiatives in the Low Countries and regularly publishes his research results in journals in the (Digital and Less Digital) Humanities. A recent, accessible documentary on his stylometric research can be viewed online.

Matthew Munson holds bachelor’s degrees in Education from the University of Kansas and in Theology from Loyola College in Baltimore, Maryland, and a master’s degree in Religious Studies from the University of Virginia. While a student at the University of Virginia, Matthew began working in the digital humanities center there, the Scholars’ Lab, and immediately became interested in the fascinating insights digital methods could give into ancient religious texts. He received a Scholars’ Lab Digital Humanities Fellowship in 2009-2010 to explore the use of text-mining strategies to identify relationships between the Greek texts of St. Paul in the New Testament and the Hebrew texts of the Old Testament. At the Göttingen Centre for Digital Humanities (GCDH), Matthew works in the European project DARIAH (Digital Research Infrastructure for the Arts and Humanities), where he coordinates the work package concerning VREs at the German and European level. He is also coordinating the development of the DARIAH international digital humanities summer school, planned for August 2013 and 2014 in Göttingen. His current research interests lie in the area of semantic drift and methods of calculating the change in the meanings of words from the Old Testament to the New Testament.

