Daniel Griffin

Word of the Day: Attar

Mon, 22 Feb 2016 00:00:00 -0500

In working through my delayed NYT crossword puzzle today I came across an interesting new word, attar (always a good day when that happens).

A consultation of the OED tells us that an attar is “a very fragrant, volatile, essential oil obtained from the petals of the rose; fragrant essence (of roses).” The etymology is from Persian through Arabic, ʿiṭr “perfume essence” and ʿiṭr al-gul “essence of roses” being the base root.

Early references include ethnographies such as Thomas Pennant’s The view of Hindoostan (1798–1800). Pennant was a well accomplished Welsh naturalist and writer, publishing several books on flora and fauna. The quintessential tourist.

An interesting usage comes from Thomas Hardy, in his Far from the Madding Crowd: “That buzz of pleasure which is the attar of applause” (I. xxiii. 263). Perfect choice of words in my estimation.

Migrating with Jekyll

Mon, 22 Feb 2016 00:00:00 -0500

So, after many months of experimentation and false starts, I have finally migrated to a Jekyll static site blog, shuttering the WordPress version completely.

I found that WordPress was more than I needed, and the care and upkeep were not worth the cost of keeping the server up. I have not been contributing much, nor have I ever really had much in the way of visitors, so there is no technical necessity for a behemoth of a CMS to power the operation.

So what you may see now is the first deployed version of this site. If it looks like the static template for Jekyll, that is probably because it is. A newer practice for me in my technical projects is to launch first and enhance later: less paralysis by over-design. If something gets really big I can maybe get some funding to refactor in a big way, but for my purposes right now this will do just fine.

The process of migrating to Jekyll was not too difficult; it just took me a bit of time to figure out the quirks:

Retaining the Link Structure: I wanted to retain my link structure from the old site, which required a bit of tinkering with the _config.yml settings. The hard part was figuring out which setting I needed; once I figured out it was permalink (and that I needed to run jekyll build after a settings change), it was pretty easy to set it to permalink: "/:year/:title/".
Figuring out the post frontmatter: The YAML frontmatter took me longer to figure out than I care to admit. Mainly, listing out my categories and tags was not working as I thought it would; instead, I needed to write them as an unordered list (- item) for it to work as I intended.
Making sure that I had the proper “server” configuration: Not too much more to say here; I just had to take a bit of time configuring my CNAME / A records, and the DNS changes took longer than expected.
Converting posts to Markdown: This wasn’t too hard, just tedious. I took advantage of a Markdown conversion service, heckyesmarkdown.com, for some of the labor. Incidentally, there were fewer of these conversion services available than I thought there would be. The code blocks are still broken though (ugh).
Images: I have to go back and manually add images because I couldn’t find a quick and efficient way to import them (ugh again).
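To make the permalink setting a little more concrete, here is a rough Python imitation of how a template like "/:year/:title/" gets filled in from a post's filename. This is only a sketch of the idea, not Jekyll's own code: the function name and the sample filename are made up for illustration, and it ignores things like categories or a slug set in the frontmatter.

```python
import re

# Jekyll posts are conventionally named YYYY-MM-DD-title.ext
POST_FILENAME = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-(?P<title>.+)\.(?:md|markdown|html)$"
)


def expand_permalink(template, post_filename):
    """Fill the :year, :month, :day and :title placeholders from a post filename."""
    match = POST_FILENAME.match(post_filename)
    if match is None:
        raise ValueError(f"not a dated post filename: {post_filename!r}")
    url = template
    for placeholder in ("year", "month", "day", "title"):
        url = url.replace(":" + placeholder, match.group(placeholder))
    return url


# Hypothetical filename, chosen only for illustration.
print(expand_permalink("/:year/:title/", "2016-02-22-word-of-the-day-attar.md"))
```

With the template above, the old WordPress-style address (year, then title) is what comes out, which is why this one setting was enough to keep the link structure.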

I will work on upgrading the site over the next few months, but right now the focus is on writing more. I feel comfortable I can upgrade any functionality I may need (if I feel like I need comments, I might add a connection to Firebase), so let’s get cracking on the thinking work! A new job is on the horizon; I hope I can keep up the momentum.

Finding Weird Stuff in the Access Logs

Mon, 28 Jul 2014 00:00:00 -0400

One of the tasks I have appointed myself for Mondays is looking over the access logs for the various sites that we host. Our vendor doesn’t provide the best tools for doing this, but it has proven a fairly interesting endeavor nonetheless. I am keeping track of some of these things in an informal log. Here are some weird things that I have noticed already.

A Funny Looking Mirror

Today I noticed that for one of our sites something like 75% of our referrals were coming from one site, accounting for about 11% of all traffic. When you visited the site, it was actually a complete mirror of the original–the only thing different was the domain name (ourwebsite.com vs theirwebsite.com).

The first thing I thought was, “oh great, somebody has ripped our entire site and put it up here.” But browsing the site via proxy, I realized that none of the restricted content was available; I would have thought that it would be if a hacker had put it up. So I did a WHOIS lookup for the domain. I was expecting to get very little information, but I got something curious: the name and address of our web vendor. I did a traceroute on the domain and found that it actually pointed to our vendor’s servers, the exact same server that hosts our page. I looked at the access logs again: I could see mention of it all the way back to 2005, shortly after the site went live. So it would appear that it is just a mirror that someone left up (probably as a test for another journal, judging from the name) and forgot about. The only remaining question is why it is popping up now with such high usage; probably a crawler.

A Cow in the Data

A couple of days ago, I ran a usage report on subscribers to one of our sites. Far and away the greatest number of hits were coming from one institution, which had more than five times the number of access events of the next user. That #2 institution was probably as high as it was because it was running a LOCKSS scan of the site, so the top user’s traffic was off the charts compared to even that bot. The strangest thing was the name of the institution: a bookseller out of Europe. What were they doing with our stuff? Had they been hacked, and was someone using their access to hijack content?

So, I did some more digging. The name of the user, it turns out, was misleading: it was actually a large university out of Berlin. So, ok, what does someone want with our information in Berlin? I couldn’t find any thematic connection between our collection and the university with a search of their faculty and research, so I googled something like “web crawler Berlin”. I came across the website of a web-crawler that is building a linguistic corpus of English language usage on the web. I looked again at the access log reports to see what was there (in a different place from the usage statistics), and, sure enough, I found heavy usage from a machine registered to the research group in Berlin. Too bad for them the English that they found was probably not very current, since it dates back to the 19th century.

Mitigation Strategies for a XMLRPC.PHP attack

Thu, 17 Jul 2014 00:00:00 -0400

Because I haven’t been able to get easy access to a terminal in the last month or so, I have been pretty flustered when it comes to the performance of this site. If you haven’t noticed, it is going up and down (more down than up, I suspect, unfortunately, because I don’t check on it every day to reset it if it’s down). But today I was able to get in and poke around a bit in the access logs to see what’s been going on. That’s when I noticed this strange behavior: excessive use of the xmlrpc.php file.

Looking at the Apache log files, I would see entries that looked something like this:

danielgriff.in:80 80.82.78.57 - - [14/Jul/2014:07:02:36 +0000] "POST /xmlrpc.php HTTP/1.0" 500 610 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"

And there were many other similar entries. The IP address here points to a rather suspicious location; you can Google it and see how disreputable it actually is. The interesting thing here is what is being asked: "POST /xmlrpc.php HTTP/1.0" 500 610. In case you are wondering, this is a really odd file for anybody to be querying. You can read what xmlrpc.php is used for in its WordPress Codex entry; basically, it allows clients to make changes to their WordPress sites using a method other than the web interface, say, using your iPhone app or a desktop application.
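If you want to see how many of these requests each address is making, a few lines of Python will do. This is just a sketch under some assumptions: the regular expression is written to fit the log line quoted above (virtual host first, then the client IP), the function name is my own, and the second sample line is invented so there is something to filter out.

```python
import re
from collections import Counter

# Matches the layout of the log line shown above: vhost, client IP, two
# unused fields, timestamp, request line, status code and response size.
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_xmlrpc_posts(lines):
    """Count POST requests to /xmlrpc.php per client IP address."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("method") == "POST" and match.group("path") == "/xmlrpc.php":
            counts[match.group("ip")] += 1
    return counts


sample = [
    # The entry from my log, quoted above.
    'danielgriff.in:80 80.82.78.57 - - [14/Jul/2014:07:02:36 +0000] '
    '"POST /xmlrpc.php HTTP/1.0" 500 610 "-" '
    '"Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"',
    # An invented, harmless request (documentation-range IP) for contrast.
    'danielgriff.in:80 203.0.113.9 - - [14/Jul/2014:07:03:01 +0000] '
    '"GET / HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"',
]

for ip, hits in count_xmlrpc_posts(sample).most_common():
    print(ip, hits)
```

In practice you would pass in the open log file instead of the sample list; the addresses at the top of the count are the candidates for blocking.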

Unfortunately, this file can be abused by nasty folks. There is a good summary of how attackers can exploit the pingback function in the xml-rpc library on this Acunetix page. To summarize, they can use it to (1) guess hosts inside the internal network and (2) subsequently port scan those hosts, (3) carry out a DDoS attack, or (4) attack the login credentials of an internal server. If I had to guess, the third option is likely the one that was affecting my site, but that’s a guess.

So, what can one do? I am going to give you a couple of options. I’ll get back to you on the effectiveness of the solutions at a later point in time.

1. Deny access to the file in the Apache .htaccess file (from Vilpponen)

<Files xmlrpc.php>
Order allow,deny
Deny from all
</Files>

2. Send requests to 0.0.0.0, using the Apache .htaccess file (goldenguineas on the WordPress Support Forums)

RewriteRule ^xmlrpc\.php$ "http://0.0.0.0/" [R=301,L]

3. Block IP addresses, using the following Linux command (Purab Kharat on the WordPress Support Forums)

iptables -A INPUT -s 198.154.62.21 -j DROP

4. Block the agent making the attack (Tigr on the WordPress Support Forums)

# Block attackers by agents

RewriteCond %{HTTP_USER_AGENT} ^.*WinHttp.WinHttpRequest.5.*$
RewriteRule .* http://%{REMOTE_ADDR}/ [R,L]

NOTE: Remember, blocking access to xmlrpc.php will block some WordPress features–the biggest one I can see is Jetpack.

Feel free to make further suggestions. Again, I’ll update on what effect this has for me.

UPDATE 7/21/2014: A whole weekend without any down time! This is probably solving all my issues. Score!

StorymapJS makes telling stories in space easy and fun

Fri, 11 Apr 2014 00:00:00 -0400

One of the problems with using Neatline to tell stories is that it is often more complex than it needs to be. Luckily, the Knight Lab at Northwestern has created StorymapJS to make that process easier. In just a couple of hours you can put together a map that looks professional and is portable to any webpage.

Below you will see a map I made using StorymapJS to show off some of the great coffee options that Durham has to offer. It was really easy. You log into the site with a Google account (the data is stored in your drive) and it introduces you to an interface for making your map. If you have your own map, and the file size is big enough, you can use their Gigapixel option (they have a great example tracking character movement in the HBO Game of Thrones series). You can explore the map sequentially or click items on the map to skip around.

The interface for creating storymaps is found on the StorymapJS website. The entire process is carried out in your browser. After deciding on a title and other initial options, you are brought to the map screen, with slides to the left, description and media boxes below, and publication options up top. It is a fairly intuitive interface that doesn’t take too long to explore and master.

The basic unit of a Storymap is a slide. You start out with an introductory slide and then add more for each location. You can search for a location or place a pin yourself, which is handy if you are dealing with places that may no longer exist. For each slide, you give a title and description, and can easily add media such as a picture or YouTube video to go along with it. When you are done, it spits out two links, one for a full screen view and another for embedding in HTML (using an iframe element).

I really liked how easy it was to use. There aren’t a whole lot of customizable options, but I do like the selection of map tiles they include. Some people might balk at the fact you need to add a Google Drive folder for it to work, but this is probably not a concern for most users. I would like some more customization, but right out of the box it seems to work great. So check out my Durham Coffee Tour below (fullscreen link) and try creating your own map!

Creating beautiful timelines with TimelineJS

Thu, 10 Apr 2014 00:00:00 -0400

For several years I have been using SIMILE timelines in a variety of applications because it is very straightforward and simple to code. However, one thing it is not is sexy: it takes a lot of work to make your timelines look pretty, and adding anything else besides text can be a chore. But I recently had the opportunity to work with a new tool, TimelineJS from the Knight Lab at Northwestern University, that is even easier to use and has a really sleek, intuitive, and (dare I say it) fun interface right out of the box. Here I am going to briefly describe TimelineJS and give you an example of how I am using it on my Digital Faust site.

The Knight Lab has put out a lot of great products recently, and you should definitely check out their tools and see if you can put them to use in your own projects. The mission of the lab is to create digital tools for journalists, but many of them have obvious relevance for many digital humanities projects. In the coming days I will be writing a few more articles on their tools, so look forward to those.

TimelineJS gives you the ability to easily create an attractive timeline with images, links, and videos. What makes it so easy is that all you have to do is put all your information into a Google spreadsheet. Once you have finished filling in the fields (date, title, image link, credits, etc.), you publish the spreadsheet and insert a small bit of code (iframe) into your HTML. And that’s it! The tool also allows for HTML coding, so you can easily write links into the timeline. I also like that the date field is very forgiving and doesn’t require strict coding, which is very handy when you are dealing with whole years as your dates, as is usually the case with books.

Below is a simple implementation that is straight from the Digital Faust Project website. We found that Neatline wasn’t really cutting it for our purposes, as its geographic functions were really just unnecessary, and a SIMILE timeline was getting too bogged down with all the items. What we wanted was a way to view our images and be able to navigate through them with a timeline. TimelineJS is serving this purpose marvelously.

There are just a few downsides. Customizing the interface can be a little difficult, but luckily it is very pretty right out of the box. As we might expect, it is also somewhat difficult to deal with BCE dates, but that is the case with almost all digital timeline tools. Also, it is pretty easy to figure out where your spreadsheet is, so you will have to make sure to lock it down with privacy settings. All in all, though, it’s a pretty cool tool that I can heartily recommend.

Personography, or, Thinking about People with TEI

Tue, 18 Mar 2014 00:00:00 -0400

In creating the Ancient Sports Text Repository and Atlas (ASTRA), I spent a lot of time deciding how I should encode and store my data. The obvious choice for the texts was the TEI (Text Encoding Initiative) XML markup schema. This standard, which has been in use since 1994, is used by a large number of libraries, museums, and digital humanities projects to represent texts in a machine-readable format. TEI is a remarkably versatile format, capable of expressing all sorts of information beyond the text itself. Since TEI is the standard for many of the major text projects in Classical Studies (e.g., Perseus, Papyri.info, etc.), it was a natural choice for encoding the small texts that make up the bulk of the project.

However, because I want to encode things other than texts, I was a little hesitant, especially when thinking about how to encode people. Part of the project includes compiling as comprehensive as possible a collection of people from antiquity who were connected with athletics. This includes persons both literary and mythical—especially since it is often impossible to determine in which of these categories a particular individual belongs. It would perhaps be easier to encode this data in a traditional relational database. This data would include columns with an identifier, a name field, a plausibility field (mythical, established), a brief biography, and perhaps an array to refer to other texts in the database. The problem with using a database of this sort is that it doesn’t allow much flexibility. The data could be encoded in HTML to provide links to other content, but it would be preferable to have the data encoded in a standard XML format so the XSLT form can extract it automatically along with other content. So preferably we would want to encode our data in the same format as everything else.

Luckily, TEI does provide the tools for making this possible. TEI refers to the encoding of people as “personography”. I generally cringe at neologisms, but this term is at least descriptive. A personography is simply a structured way of representing biographical data about a person. Unfortunately, there are not a lot of resources for how to structure a personography with TEI; a further complication is the fact that biography in general is not particularly well theorized. So a large part of the effort has come in deciding how exactly to structure such a biography using TEI. In the process, I believe I have come up with a method that is applicable not only to ASTRA but also to others who wish to use personography in other applications. In what follows, I am going to talk a little bit about how personography in TEI works, then see how it has been used in other contexts, point out some of the shortcomings and pitfalls, and finally see how it can be used effectively, using the ASTRA project as an exemplum. I assume familiarity with XML, and some experience with TEI will be helpful.

TEI Personography

The TEI standard is incredibly complex: a printed version of the code would surely run over a thousand pages long. Also, because the schema is on the whole not strict in its nesting policies, it allows the author a large amount of leeway in how to structure a document. Here, I will describe the primary tools needed to construct a personography. As will become apparent quickly, there is much room for interpretation; I will give examples later on that demonstrate how the personography can be put in practice.

The primary tag for personography is the `<person>` tag, which is a member of the “namesdates” module of the schema for dealing with names, dates, and places. In addition to the standard range of attributes used in TEI, the tag has three unique attributes: role, sex, and age. It is recommended that the age attribute be used to indicate an age group, such as “teenager” or “senior”, as opposed to an integer value. The sex attribute should be encoded using “M” to indicate male, “F” for female, “O” for other, “N” for none or not applicable, or “U” for unknown. One can also use the ISO 5218:2004 standard for the Representation of Human Sexes, in which “0” is used for unknown, “1” for male, “2” for female, and “9” for not applicable; however, this standard is now considered to be outdated, so the former coding should be preferred. The role attribute is a bit trickier, but only because it employs a local vocabulary, which is to say, a set of terms that the encoder determines beforehand. For example, if we were describing an edited volume, we might have roles for “author” and “editor” to describe the authors of individual chapters and the editor(s) of the entire volume; if describing a military text, we might use the role attribute for expressing an individual’s rank. Because of the flexibility of this attribute, it is worthwhile to have a descriptive vocabulary before encoding begins.
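As a small illustration of why a fixed vocabulary is worth settling beforehand, the following Python sketch checks the three attributes on a `<person>` element against the conventions just described. It is only an assumption about how one might police one's own encoding: the function name and the role vocabulary are hypothetical, and real TEI files would also carry the TEI namespace, which is left out here for brevity.

```python
import xml.etree.ElementTree as ET

# Letter codes for the sex attribute, as described above.
SEX_CODES = {"M", "F", "O", "N", "U"}

# Hypothetical local vocabulary for the role attribute (edited-volume example).
ROLE_VOCABULARY = {"author", "editor"}


def check_person_attributes(xml_text):
    """Return a list of problems found in a <person> element's attributes."""
    person = ET.fromstring(xml_text)
    problems = []

    sex = person.get("sex")
    if sex is not None and sex not in SEX_CODES:
        problems.append(f"unexpected sex code: {sex!r}")

    # role may hold several space-separated values.
    for role in (person.get("role") or "").split():
        if role not in ROLE_VOCABULARY:
            problems.append(f"role not in local vocabulary: {role!r}")

    age = person.get("age")
    if age is not None and age.isdigit():
        problems.append(f"age should name a group, not a number: {age!r}")

    return problems


print(check_person_attributes('<person role="editor" sex="F" age="senior"/>'))
print(check_person_attributes('<person role="general" sex="2" age="34"/>'))
```

The first element passes cleanly; the second is flagged three times, once for each attribute that strays from the agreed conventions.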

Such are the main attributes for the `<person>` tag. You probably notice that there is no attribute for some of the more typical descriptions for people such as “name” or “birthdate”. That is because the main work of a personography occurs within the person tag. The tag can include many of the tag modules in the [TEI documentation][1]. The most important of these modules to personography is the “namesdates” module, to which the `<person>` tag itself belongs. The tags in this module are as follows: , , , , , , , , , , , , , , , , , and . Each of these tags has its own set of attributes that are relevant to its usage, so when constructing a personography one will probably have to consult the documentation repeatedly to see what the range of options allows.

The TEI documentation provides the following example of personography, using the Roman poet Ovid as their object of encoding:

```
Ovid Publius Ovidius Naso 20 March 43 BC Sulmona Italy 17 or 18 AD Tomis (Constanta) Romania
```

As you can see, using primarily the tags in the "namesdates" module, one can give a pretty straightforward description of the poet within the `<person>` tag. In addition to the attributes discussed above, you will see that the person tag has the attribute xml:id, which is used to uniquely identify the reference in the project at large. The name is described in two ways between `<persName>` tags: first, the typical English shorthand "Ovid", and second, the longer, full Latin name of "Publius Ovidius Naso". His birth is given using the appropriate tag, which contains an attribute for the date (@when) and the nested `<placeName>` element to describe where. After the `<birth>` element, we see another element representing his death encoded in similar fashion. As you can see, the data expressed is minimal, but straightforward.
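To show what this kind of encoding buys you, here is a minimal Python sketch that reads such an entry back out. Note that the markup in it is my own reconstruction from the description above, not a copy of the TEI documentation's example: the attribute values (the xml:id, the @when date, and so on) are assumptions made for illustration, the TEI namespace is omitted, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so xml:id shows up under this namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Reconstructed entry; all attribute values here are illustrative assumptions.
SAMPLE = """
<person xml:id="ovid" sex="M" role="poet">
  <persName xml:lang="en">Ovid</persName>
  <persName xml:lang="la">Publius Ovidius Naso</persName>
  <birth when="-0043-03-20">20 March 43 BC <placeName>Sulmona</placeName></birth>
  <death notBefore="0017" notAfter="0018">17 or 18 AD <placeName>Tomis (Constanta)</placeName></death>
</person>
"""


def summarize_person(xml_text):
    """Pull the identifier, names, and birth details out of a <person> entry."""
    person = ET.fromstring(xml_text.strip())
    birth = person.find("birth")
    return {
        "id": person.get(XML_NS + "id"),
        "names": [name.text for name in person.findall("persName")],
        "birth_when": birth.get("when"),
        "birth_text": (birth.text or "").strip(),
        "birth_place": birth.findtext("placeName"),
    }


for key, value in summarize_person(SAMPLE).items():
    print(key, "=", value)
```

Because the names, dates, and places each sit in their own element, a script can pick them out without having to interpret any prose.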

Such is how a Personography is constructed in the abstract. Information is given or subtracted based on the needs of the creators. Before going on to how it is put into practice, it is relevant to note how the person element is contained, as that largely dictates how the `<person>` element is used in any particular document. The `<person>` tag can be contained by three elements: `<listPerson>`, `<particDesc>`, and `<org>`. The `<listPerson>` tag is used primarily for nesting a series of `<person>` elements one after another. Participant Description (`<particDesc>`) is an element that describes all the people involved in a particular document; perhaps the most salient example is to mention all the characters in a play, or perhaps all the contributors to a particular document. The `<org>` tag is for organizations: as organizations are composed of people, we would likely then expect a list of people who constitute the organization to appear within the tag.

### Examples of Personography

As I mentioned above, there are not many digital humanities projects that use personography extensively. The typical usage of the person tag seems to be as a short reference inline, which may or may not refer to a list of persons, usually found somewhere in the header element. It is not always apparent, either, when the authors have attempted to create a more structured personography beyond the standard practice of marking persons. It seems the only way to be sure that a project is compiling a personography is if they refer to the document as a personography, biography, biographical gazetteer, or some other such synonym. In what follows, I will discuss two examples of personography where the authors have self-consciously applied the title to their documents. By examining how others have constructed their documents, a set of best practices may become apparent.

The first example is [The William F. Cody Archive][2], run by the Buffalo Bill Center of the West and the University of Nebraska, which catalogues the life, times, and documents of "Buffalo Bill" Cody. The primary goal of the archive is to give access to the multitude of newspaper clippings, correspondence, and video which give the viewer insight into this famous Wild West hero. All documents are encoded in TEI, to varying degrees of specificity.

The authors of the site have compiled all the persons mentioned in the texts and placed them into a single XML document, which they refer to as a personography. Unlike the other documents in the archive, no link is provided for the XML version of the text, so one must manually enter the [XML personography's address][3] to access the file. After providing typical TEI header information (file description, title statement, authorship, publication information), the file begins with `<text>` and `<body>` tags before starting a list of persons with `<listPerson>`. There are three lists among which the 56 separate entries are distributed: empire, personal, and business, referring to the different spheres in which one could place a person in relation to Buffalo Bill.

The entries themselves follow a formula: in fact, the authors have commented out the format which they wish to follow at the bottom of the text. Here is an example, the first entry of a person:

Bailey, James Anthony, 1847-1906 Bailey , James Anthony McGinnes 1847 1906 male James A. Bailey (1847-1906) was born James Anthony McGinnes in 1847. As a teenager McGinnes became an assistant to Frederic Augusta Bailey, a nephew of circus pioneer Hachaliah Bailey. McGinnes eventually changed his name to James Anthony Bailey. [...]

The only attribute in the name tag here is the XML identifier, which allows the application to point resources pertaining to the person to this entry (and vice versa). There are several different options presented for the name element. The first indicates what name should be displayed as the title of the entry when displayed in a list, which is designated by use of the type attribute "display". The second is his full given name encoded with `<surname>` and `<forename>`, where the middle and first forename are distinguished with type attributes. Interestingly, because this particular person changed his name at one point in his life, the authors have included an `<addName>` element to indicate this change from his birth name. The name fields are followed by birth, death, and sex elements, and then a series of closed tags indicates choices for which the authors did not provide information (the majority of the entries leave the same fields blank). The final element is a `<note>` tag. In this element, the authors have provided a brief (one paragraph) biographical sketch of the person. The narrative is similar to what one would find in an encyclopedia entry, with an emphasis on the person's relationship to Buffalo Bill. Following the note, the person element is closed, and another similar entry begins.

What is most striking to me about the choices the authors made in constructing this personography is their strict use of formatting and heavy reliance on the note tag. First, it is clear from the outset that the authors settled on a list of elements they wished to include in each entry and stuck to it, to the point of leaving in empty elements. This practice takes a lot of the cognitive burden off the encoder, who simply has to enter the data, with only a little extra effort when idiosyncrasies arise, as in the case of a person who changed their name. However, the authors relied heavily on a prose exposition for each entry, to the detriment of other encodable data: for example, there are no place elements anywhere, though these are certainly known for the birth and death elements at least. The note itself could also have been encoded; as it stands, if there is some reference in the note to another encoded entity, there will be no way (at least no easy way) to point the reader to that entity within the site.

A second example I will discuss is the [Map of Early Modern London][4] (MoEML) project developed by the University of Victoria in Canada. MoEML provides visitors with an interactive tour of London in and around 1561 CE through use of the Agas map, an early plan of the city. In addition to providing location data, MoEML has compiled an Encyclopedia to go along with the data, broken up into three sections: a placeography, a personography, and an orgography (having to do with organizations). There are actually [three different flavors of personography][5] on the site: the Historical and Literary personographies, which I will briefly mention, and a handful of full-length biographies.

The official personography of the site consists of the Historical and Literary personographies. As expected, the Historical deals with real people, and the Literary with fictional characters. [The entries][6] are ordered in a list, of which the following is an example:

Bacon, Sir Nicholas Sir Nicholas Bacon

Lawyer and administrator.

ODNB

Here, the `<person>` tag contains an XML id and sex attributes. The name is fully expressed in the `<persName>` element, with the special tags `<reg>` to express the accepted name and `<roleName>` to indicate a title. The birth and death tags just give the date, although the attributes are somewhat complicated. The authors of this site have included the @datingMethod attribute to help indicate that this particular entry is to be handled using the Julian calendar. The time frame which the map covers spans the Julian and Gregorian calendars, so the dates differ depending on whether the source uses one calendar or the other. The authors thus establish which calendar their source was using and encode the date to reflect it. Coding is performed on the back end to make sure that the calendars coincide with one another for timeline purposes. Following the date elements, the authors have included a `<note>` element to give further details about the individual. Rather than create their own biography, as in the Cody Archive, MoEML gives just a headline description of the person and links to external content, in this case an Oxford Dictionary of National Biography entry.
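How MoEML reconciles the two calendars on the back end is not something I can show here. As a rough idea of what such a conversion involves, though, the Python sketch below turns a Julian calendar date into its Gregorian equivalent by way of the Julian Day Number. The function name is my own, it handles CE years only, and it ignores complications such as the English civil year beginning on 25 March.

```python
from datetime import date

# Python's date ordinals count days from 0001-01-01 (proleptic Gregorian);
# subtracting this offset turns a Julian Day Number into such an ordinal.
JDN_TO_ORDINAL = 1721425


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date (CE) to the equivalent Gregorian date."""
    # Standard integer formula for the Julian Day Number of a Julian date.
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return date.fromordinal(jdn - JDN_TO_ORDINAL)


# The day after Julian 4 October 1582 was Gregorian 15 October 1582.
print(julian_to_gregorian(1582, 10, 5))
# A date from the year of the Agas map's London, ten days apart.
print(julian_to_gregorian(1561, 1, 1))
```

Once both kinds of date are expressed on the same calendar, entries from Julian and Gregorian sources can be placed on a single timeline.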

This entry is on the whole more compact than that of the Cody Archive, but MoEML also has a series of experimental personographies that have expansive entries on important personages. These personographies are full-fledged biographies, several paragraphs long and with plenty of scholarly references (too long to provide a concise example). This collection is an offshoot from the main content of the site, written by graduate students who participated in the site's development. They are also encoded using TEI, but take a different tack than the two examples given above. Instead of encoding the information using the `<person>` element, they give the biography using  and  tags. In other words, the document is envisioned as a scholarly work in its own right and less as a collection of biographical statistics. There are also separate TEI headers for each entry, complete with file and revision descriptions; bibliographical citations are referenced in a separate site-wide XML bibliography. The text has two major sections: a  section that contains a  element with the name of the person, and the main  section with the content (under a `<div>` element) separated into paragraphs with a basic `<p>` tag. In each paragraph appears the text of the biography. Title, people, and bibliography citations are tagged with references to the appropriate resources.

An interesting omission from the texts is that basic biographical information, such as date of birth, place of birth, death, nationality, etc., is not encoded using XML tags. Some of this information is given in the body of the text through narrative, but it does not receive special encoding. The downside of this approach is that it does not provide a machine-parsable way to extract this information, and should the reader wish to, say, pull out all the entries for people alive in a particular decade, they would have to read each entry and compile a list themselves, rather than having a search application find the list automatically.
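To make the point concrete, here is a short Python sketch of the kind of query that becomes trivial once birth and death are encoded. Everything in it is hypothetical: the three people, their dates, and the function name are invented for illustration, and the TEI namespace is again left out.

```python
import xml.etree.ElementTree as ET

# Invented entries; only the years matter for this illustration.
SAMPLE = """
<listPerson>
  <person xml:id="p1"><persName>Person One</persName><birth when="1510"/><death when="1579"/></person>
  <person xml:id="p2"><persName>Person Two</persName><birth when="1572"/><death when="1631"/></person>
  <person xml:id="p3"><persName>Person Three</persName><birth when="1480"/><death when="1555"/></person>
</listPerson>
"""


def alive_during(xml_text, start_year, end_year):
    """Names of people whose lifespan overlaps the given range of years."""
    names = []
    for person in ET.fromstring(xml_text.strip()).findall("person"):
        birth, death = person.find("birth"), person.find("death")
        if birth is None or death is None:
            continue  # cannot decide without both dates
        born = int(birth.get("when")[:4])
        died = int(death.get("when")[:4])
        if born <= end_year and died >= start_year:
            names.append(person.findtext("persName"))
    return names


print(alive_during(SAMPLE, 1560, 1569))
```

With prose-only biographies, answering the same question means reading every entry by hand.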

With these three examples, we see three different strategies for approaching the coding of persons using TEI XML. The first two use the `<person>` tag, are part of documents containing long lists of people, and contain only a small amount of information about the person. The MoEML personography has extremely short entries, outsourcing a large part of the information as links to resources hosted on external sites. The Cody Archive's personography has more information, included using a note tag, and other information as pertinent to their site's content. The final examples from MoEML, on the other hand, are much longer and encoded using  and  tags. They provide detailed information about the person, but much of the typical biographical information is not encoded for convenient machine reading. All strategies have their plusses and minuses, but are designed as appropriate for their diverse purposes.

### Personography for the ASTRA Project

The examples given above suggest several ways to think about personography, but the greatest lesson to be learned is that a personography must fit the scope of its project. The ASTRA project separates its information into several "buckets": texts, people, events, sports, artifacts, and scholarship. For the project to function effectively, the items in the different buckets need a way to communicate with one another. That means each text will ideally provide links that connect the data in the site, both internally and externally.

By comparing the different examples of personography against our needs, a clearer picture of how our personography should be structured emerges. First, the MoEML and Cody Archive personographies are too short and not specialized enough for our purposes. Neither would be appropriate for the linking of events and sports, though the MoEML personography does hint at an appropriate way of linking to scholarship and texts. Another consideration: given that there will likely be a good number of entries in the database, it seems conceptually more appropriate to give each entry its own XML document. Computationally, this should not be problematic, and it should provide some ease of access when editing particular entries. The longer biographies at MoEML have more of the type of information we would like to include. The tagging of description content also seems appropriate. However, the missing encoding of biographical data would have to be added, and the descriptions are rather too long; the latter can be addressed with strategic referencing of information.

With these considerations in mind, it becomes apparent that we need to find some middle ground between these styles. After some tinkering, here is an example of a TEI personography for the ASTRA project for the first recorded victor of the Olympic Games, Coroebus of Elis:

```
[...] Coroebus Koroibos Κόροιβος Elis Tomb located at the Eastern edge of Elis and was used in defining territorial boundaries. Reputedly the winner of the Stade in the first Olympic Games. Olympia Athenaeus, 9.382B Pausanias, 5.8.5-6 Pausanias, 8.26.3-4 Strabo, 8.30.3 1
```

As you can see, the document begins with the root element as the parent element, carrying the appropriate xml:id. Because of the many xml:ids, these are kept in a separate spreadsheet to ensure there is no overlap. This is followed by a TEI header, which provides file and revision information (omitted here for brevity). The main document information is contained in a person tag.

The person element is broken into four sections: name, dating, sports content, and bibliography. The tag itself provides several pieces of information. First, it specifies that this person is a historical person, as opposed to a mythical person or a scholar. The role is defined as victor, insofar as he was a victor in an Olympic games. This seems an appropriate category, as it is often an important distinction in scholarship. The sex attribute is straightforward. The name section also provides three different ways of representing a person's name: the anglicized version, the bare transliteration or "continental" version, and the Greek original text.

This is followed by dating and location data. The place and date are sourced, which is to say, they refer through XML refs to texts on the site which provide the information given. The information also provides text descriptions where interesting or necessary. The floruit is included as the date he was active. Note that the references for place point to Pleiades locations, a convenient reference from a trusted source.

The sports content is the meat of the article. It begins with a note that gives a brief overview of who the person was and what his importance to the project is (there isn't a lot of information on Coroebus, so this section isn't very long). The next element is an event tag. This is an imaginative use of the tag, slightly different from its intended use, but it corresponds to a particular event in the Olympic Games. It gives all the information about the staging of the event, including the where and when. The type of event is an attribute (the stadion is a sprint), and his place as victor and the source of the information are also given in that form. For other entries, where the athlete has multiple victories, there would be multiple event elements in this section.

The final section gives bibliographic information. The first entries are primary sources, which would be contained as part of the text repository and linked via their xml refs. This is followed by a link to a standard reference work, Moretti's list of Olympic victors.

As much as possible, elements will be linked to other documents on the site. XSLT stylesheets will be used to pull information and display it in a standardized format. The information will also link to external sites where appropriate. The scope is large, but it is hoped this project will not only add to scholarship on ancient sports, but also provide a new perspective on how to structure personographies and biographies in general.

### Resources for Personography

[1]: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-person.html "TEI element person"
[2]: http://codyarchive.org "The William F. Cody Archive"
[3]: codyarchive.org/life/wfc.person.xml "Cody Archive personography"
[4]: http://mapoflondon.uvic.ca "MoEML"
[5]: http://mapoflondon.uvic.ca/mdtEncyclopediaPersonography.htm?listType=subcategory "MoEML Personographies"
[6]: http://mapoflondon.uvic.ca/historical_personography.xml "MoEML Personography"

DH Awards 2013 Posted

Wed, 19 Feb 2014 00:00:00 -0500

Today the DH Awards group announced their winners for best DH projects of 2013. The nominating body is an international group of scholars and the voting is open to the community. It’s an interesting collection of sites–browsing the non-selected entrants will bring much reward. You can find the winners posted here. There were six categories this year, which I list here with the winning projects:

I don’t have a whole lot of comments on the selections; they seem pretty good to me, though the selection of the journal site Eä seems a bit off, as (1) you can read it in English, and (2) it’s a journal more than a DH project. I liked the 2nd runner-up in that category better, a Greek-Dutch wordbook (Woordenboek Grieks/Nederlands), for both obvious and personal reasons. Papyri.info was also in that category.

I was also happy to see that UNC Digital Innovation Lab’s DH Press was the second runner-up in the “Best DH tool or suite of tools” category; go Triangle! The Ancient Lives site is also an interesting project to check out. The selection of the Our Marathon project is a timely one–I’m going to mention it in my Omeka presentation today. Be sure to check out the other entrants as well and feel free to take a look at the statistics.

Add Swap Space to EC2 to Increase Performance and Mitigate Failure

Mon, 17 Feb 2014 00:00:00 -0500

I didn’t realize this until it was too late: AWS EC2 instances do not come with a swap partition. So what does this mean? Well, it explains (1), why my WordPress installation was randomly crashing when I saved a post, and (2), why my instance of Omeka crashed during a tutorial I was giving on the platform–which was really not a lot of fun, let me tell you. The process I describe below explains how the problem was found and how to mitigate this sort of failure by installing a swap file (I’ll be using Ubuntu 12 LTS, but it should work on any Linux machine).

So, I had mentioned in an earlier post that I was having problems occasionally with WordPress giving me a database error, and that I could solve this pretty easily by just restarting the mysql service (sudo service mysql restart). But this problem happened so rarely that I didn’t really expend much effort to figure out what was going on. But when my Omeka install crashed while about 20 people were trying to log in, and it locked up so no one could even view the front end, I knew something was up. After sputtering through the rest of the presentation using the Digital Faust site as emergency backup, I went back to my computer to try to figure out what locked up the site. On a whim, I tried restarting the mysql server, and that fixed everything up. So, it was just a matter of looking into the mysql error logs to find the following Fatal Error:

InnoDB: Fatal error: cannot allocate memory for the buffer pool

So, aha, that is probably what was going on. Some quick googling led me to an answer at Stack Overflow, and then I looked at one or two other sites to make sure that I had a good grasp of the problem: lack of swap space.

Intro to Swap

Basically, Unix systems use the term “swap” to signify the moving of memory from RAM to the hard drive, and back and forth (hence the term; virtual memory is another similar term). The swap space is where on the hard drive this swapping back and forth actually takes place. Usually, when one installs a Linux instance, a hard drive partition dedicated to swap is a default option. However, when using a micro EC2 instance in AWS, there is no swap space by default. I didn’t know this, and I had little reason to guess there wasn’t any, given the statistics that pop up when you log in to the instance:

System load: 0.08              Processes: 74
Usage of /: 18.4% of 7.87GB    Users logged in: 0
Memory usage: 77%              IP address for eth0: xx.xx.xxx.xxx
Swap usage: 0%

So, looking at those statistics, one might be inclined to think as I did, “OK, my swap usage is 0%, which makes sense because I have just booted up and there has been no reason to use it yet, but because I have a percentage there, it must exist.” Well, it turns out that just isn’t true for the micro instance. If I had used the command swapon -s to check the swap statistics, I would have seen that there is nothing going on; free -k is another option, which puts out the following:

             total    used    free   shared   buffers    cached
Mem:        604340  588772   15568        0       720     26196
-/+ buffers/cache:  561856   42484
Swap:            0       0       0

That’s a whole lot of zero!
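The check above can also be scripted, so a login banner showing “Swap usage: 0%” is not mistaken for swap being present. What follows is only a minimal sketch: the function names swap_total_kb and has_swap are made up for illustration, and it assumes a Linux-style /proc/meminfo with a SwapTotal line.

```shell
# swap_total_kb [FILE]: print the SwapTotal value (in kB) from a
# meminfo-style file; defaults to /proc/meminfo (assumed to exist on Linux).
swap_total_kb() {
    awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# has_swap [FILE]: succeed only if the reported swap total is above zero.
# (Hypothetical helper name, not a standard command.)
has_swap() {
    total=$(swap_total_kb "$1")
    [ "${total:-0}" -gt 0 ]
}
```

Running `has_swap || echo "no swap configured"` on a fresh micro instance like the one above should print the warning, since the swap total there is zero.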

Creating a Swap File

Because it’s not a good idea to repartition a drive when it’s already in use, there is another option: designate a file to be used for swap instead of a whole partition. The first step is to create a swap file, which can be done with the following command:

sudo dd if=/dev/zero of=/swapfile bs=1M count=1024

The dd command takes an input file (if=_file_) and copies it to an output file (of=_file_). It is generally used for copying whole partitions, as it goes block by block, which means it’s a bit slower than some other methods, but it is useful for this sort of task. For this command, I am using a blank device (/dev/zero gives nothing but a stream of zeros) and creating the swapfile in return; bs=1M with count=1024 writes 1024 blocks of 1MiB each. A good rule of thumb for swap is twice the size of physical memory–a micro instance at this writing has 0.615GB. I am setting the size at 1GiB: it is a little bit less than twice, but I don’t want to fill up the internal storage and the 1GiB should be more than enough for our purposes. This command can take half a minute or so to run, and puts out the following if all goes well:

1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 36.9033 s, 29.1 MB/s
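The sizing rule of thumb above (twice physical memory, trimmed so the disk doesn’t fill up) is simple arithmetic, and can be sketched as a small helper. The function name swap_size_mib and its optional cap argument are my own illustration, not part of any standard tool; the memory figure is the total from the free -k output earlier.

```shell
# swap_size_mib MEM_KB [CAP_MIB]: print twice the physical memory, in MiB,
# optionally capped at CAP_MIB (0 or omitted means no cap).
# Hypothetical helper for illustration only.
swap_size_mib() {
    mem_kb=$1
    cap_mib=${2:-0}
    size=$(( mem_kb * 2 / 1024 ))
    if [ "$cap_mib" -gt 0 ] && [ "$size" -gt "$cap_mib" ]; then
        size=$cap_mib
    fi
    echo "$size"
}
```

With the 604340 kB reported above, `swap_size_mib 604340` gives 1180 and `swap_size_mib 604340 1024` gives 1024, which is the count passed to dd with bs=1M.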

Enabling Swap

Now that we have created the swap file, we just need to activate it. We start by making a swap space with the following command:

sudo mkswap /swapfile

What the mkswap command did was format our swapfile as swap space by writing a swap signature to it; the system is not using it yet. I got the following output:

Setting up swapspace version 1, size = 1048572 KiB
no label, UUID=c6810168-a5aa-4a74-bcd7-1502942e2667

Now we enable the swapspace with the swapon command:

sudo swapon /swapfile

There was no output for that command, but by typing swapon -s I can see that I now have a swapspace:

Filename    Type    Size       Used  Priority
/swapfile   file    1048572    0     -1

One last thing to do: I have to add the swapfile to the file systems table (fstab) so that it is enabled again after a reboot. That file is at /etc/fstab; you can edit it with the following command:

sudo nano /etc/fstab

Nano is a straightforward editor, so that’s why I use it here. There is probably only one line in this file; on a new line, paste in the following:

/swapfile    none    swap sw 0 0

Save it and exit (ctrl-O and then ctrl-X). The swap is now enabled, and will be after every reboot from now on.
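If you would rather not edit fstab by hand, the same line can be appended from a script. This is only a sketch under assumptions: add_swap_entry is a made-up helper name, and it takes the fstab path as an argument so you can try it on a copy before touching /etc/fstab (which needs root).

```shell
# add_swap_entry FSTAB SWAPFILE: append a swap line for SWAPFILE to FSTAB
# unless a line for that path is already present.
# Hypothetical helper for illustration only.
add_swap_entry() {
    fstab=$1
    swapfile=$2
    if grep -q "^${swapfile}[[:space:]]" "$fstab"; then
        return 0   # already listed; leave the file untouched
    fi
    printf '%s\tnone\tswap\tsw\t0\t0\n' "$swapfile" >> "$fstab"
}
```

For example, `cp /etc/fstab /tmp/fstab.test && add_swap_entry /tmp/fstab.test /swapfile` lets you inspect the result first; running it a second time adds nothing, which avoids duplicate entries.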

Some further options

There are two more steps you should probably take: adjust the “swappiness” and lock down the swap file. I love the term swappiness: I thought it was made up the first time I saw it, but it is in fact an accepted term. Swappiness is a ratio of how often the system will write to the swapfile: if set to zero, the system will only swap to avoid running out of memory (the error above); if set to 100, the system will attempt to swap all the time. The default is set at 60. Since we want to utilize the swap only when necessary, we can set the swappiness to zero with the following commands:

echo 0 | sudo tee /proc/sys/vm/swappiness
echo vm.swappiness = 0 | sudo tee -a /etc/sysctl.conf
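Because tee -a appends, running the second command more than once leaves several vm.swappiness lines in the file. As far as I know the last one read is the one that takes effect, so a quick way to see which value the file will apply is to read back the final entry. This is a rough sketch; configured_swappiness is a hypothetical helper name, and it only looks at the one file you give it.

```shell
# configured_swappiness [FILE]: print the last vm.swappiness value set in a
# sysctl.conf-style file (defaults to /etc/sysctl.conf).
# Hypothetical helper for illustration only.
configured_swappiness() {
    awk -F= '
        /^[ \t]*vm\.swappiness[ \t]*=/ {
            gsub(/[ \t]/, "", $2)   # strip spaces around the value
            value = $2
        }
        END { if (value != "") print value }
    ' "${1:-/etc/sysctl.conf}"
}
```

The live value can be compared against it with `cat /proc/sys/vm/swappiness`.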

The next step is locking down the swapfile so other users won’t play with it. If you are the only user you likely won’t have a problem, but it is a good best practices sort of thing to do. You can do that with the following commands:

sudo chown root:root /swapfile
sudo chmod 0600 /swapfile

And that’s it! You now have your own swap space set up, optimized, and locked down.

Increase your file size limit in Omeka

Mon, 10 Feb 2014 00:00:00 -0500

I’ve seen a few queries in the Omeka documentation asking how to increase the file size limit for uploading files. I found that I needed to do this exact task this week, but I had to poke around a bit in the file system to get the right file. So, since the forum has information that is a bit older, I thought I would briefly write up the steps you need to perform to increase that file size limit. The system I am using has Omeka 2.1.4 installed on an Ubuntu 12 server: the file system will likely be different if your OS is different.

Omeka in and of itself does not preclude large files: the main issue comes from default settings for PHP. Therefore, we need to alter some of the configuration files to up that limit. If you followed the instructions on the Omeka site for installation and thus are using PHP5 & Apache2, the file you need can be found on your server here:

sudo nano /etc/php5/apache2/php.ini

Notice the sudo; you will have to have administrator access to change this file, which is probably the case if you are reading this. The php.ini file is a long one, and the two things we need are about halfway down. The first option to change is under the Data Handling section (line 740 in my file):

; Maximum size of POST data that PHP will accept.
; http://php.net/post-max-size
post_max_size = 2M

The default setting here is likely 2M–change it to the size you want. This number is in bytes, so you can use K (kilo-), M (mega-) and G (giga-), or none at all for a byte-sized file limit. You don’t want it too large though, definitely not over the 32-bit integer limit; I set mine to 20M. POST, one of the HTTP request methods, sends data to the server to be processed or stored; GET, on the other hand, is a request to read from the server. Since an uploaded file travels inside the POST body, keep this value at least as large as the upload limit. The next option we need to change is a little further down, under the File Uploads section (line 891 for me):

; Maximum allowed size for uploaded files.
; http://php.net/upload-max-filesize
upload_max_filesize = 2M

This is pretty self-explanatory: change the 2M to your chosen file size limit. Now we just need to save the file and exit (ctrl-O, Enter, ctrl-X is the sequence in nano). You have now adjusted what you need in PHP, but let’s go a little further and tell our users what their limit is. We can do this by altering our Omeka .htaccess file. We can change this with the following command:

sudo nano /var/www/.htaccess

This will bring us into Omeka’s Apache configuration file. Scroll down to the bottom and add these two lines at the end of the file:

php_value upload_max_filesize 20M
php_value post_max_size 20M

Basically, what you have just done is instruct Omeka and Apache that your file size limit is now different. I put here 20M as my upload file sizes; if you decided on a different number, just change these numbers to whatever size you chose. Once you save the file, all that is left is to restart Apache, which you can do with this command:

sudo service apache2 restart

And you are all set! Now, when you go to the file upload tab, you will see that your file upload max size is reflected above the file chooser. Enjoy!
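As a footnote on the K, M, and G suffixes used in php.ini above: each is a power of 1024, so it is easy to work out what a setting means in bytes and to confirm that the POST limit is not smaller than the upload limit. The helper below is just a sketch; php_size_to_bytes is a name I made up, and it only handles whole numbers with an optional single-letter suffix.

```shell
# php_size_to_bytes VALUE: convert a php.ini shorthand size such as 20M
# into bytes. Hypothetical helper for illustration only.
php_size_to_bytes() {
    value=$1
    number=${value%[KkMmGg]}   # drop a trailing K, M, or G if present
    case $value in
        *[Kk]) echo $(( number * 1024 )) ;;
        *[Mm]) echo $(( number * 1024 * 1024 )) ;;
        *[Gg]) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *)     echo "$number" ;;
    esac
}
```

So `php_size_to_bytes 20M` gives 20971520, and a check like `[ "$(php_size_to_bytes 20M)" -ge "$(php_size_to_bytes 20M)" ]` with your own post_max_size and upload_max_filesize values confirms the two settings are consistent.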