Users of this site may want to bookmark the ARTFL research blog. Recent research and development on ARTFL projects, including Perseus under PhiloLogic, are reported in the blog. A recent addition, reported on in the blog, is a set of simple frequency reports on the Greek and Latin texts.
New in April 2010: More Greek texts have been disambiguated and all texts have been reparsed. Medical texts have been added. LSJ has again seen substantial editing. Collocation overviews by lemma are now available. Morphological searching has been made easier with the 'form:' query. The KWIC views have been improved for Greek. They do take a moment to load, and will also take a moment to catch up when you resize your screen, but we think the wait is worth it! (Thanks Kristin!) We are grateful for all problem reports and user suggestions; keep them (and your donations:-)) coming. To keep abreast of developments, do follow our blog (above).
The texts we make available on this site are practically all used by permission from the Perseus Project at Tufts University, the foremost Digital Library for the classical world, if not for the Humanities in general. In its collection of Greek and Roman materials, readers will find many of the canonical texts read today. The Greek collection approaches 8 million words and the Latin collection currently has 5.5 million. In addition, many English language dictionaries, other reference works, translations, and commentaries are included, so that anyone with an internet connection has access to the equivalent of a respectable College Classics library. The Greek and Latin texts are richly encoded for content rather than form (e.g., not page breaks, initials, and indents, but speaker information, metrical information, and milestones). The Perseus site is further enriched by intricate linking mechanisms among texts (resulting in more than 30 million links). For licensing information, details on editors and translators, etc., click on the XML Header links that show up in the bibliographical details of the texts.
Here you will find just about the same texts as at the Tufts site, but the mechanism for browsing and searching them is different. It is PhiloLogic, a system developed especially for large textual databases by the ARTFL project at the University of Chicago. While the original Perseus site is an excellent tool for linear reading, putting all kinds of resources on the same page while a user reads a passage, we were interested in leveraging the rich encoding for searching the texts, and for other tasks that are less about reading and more about research: corpus linguistics, above all. We are grateful that the Perseus Project makes its texts available to third parties, and continue to live in hope that other not-for-profit institutions devoted to text curation will enhance their search and analysis offerings, or follow the example of Perseus, and decide to make their data available for advanced analysis with systems other than their own. Please get in touch, or download your own copy of PhiloLogic, which is open-source.
It is important to understand that a PhiloLogic search form is not like a Google search box. The main search box is for words that occur in the text, so that by typing 'Gallia est' you will find the opening sentence of the Gallic Wars, but entering 'Julius Caesar' will in the first instance lead you to texts of Catullus and Cicero. Starting from our homepage, if you wish to read a work by a certain author, click on the initial letter of the author's name to see a list of authors and works; type in the abbreviation of the work (e.g., Caes. Gal.) in the citation search box, or click on the link to the full search form, where you can use the Author and Title fields.
PhiloLogic is designed to leverage the rich structural encoding that Perseus texts offer, and therefore to know the difference between types of content: words in the texts, versus the so-called metadata: authors, titles, and much more. It is also designed to allow for precise answers to specific questions, rather than ballpark estimates of the 'are you feeling lucky' type. If you search for the word 'amicitia' in texts, or for the name 'Pseudolus', we don't want you to find instances from titles, or speaker indications -- unless you specify that that's the kind of information: titles that include amicitia, words spoken by Pseudolus, that you want. We believe that both approaches have their advantages but that more precise searching is something that classicists tend to want. In sum, before entering anything in a search field, ask yourself what kind of search this is: a word search or a search for metadata. If your search is for metadata, find the fitting field elsewhere on the search form. Tip: By clicking on the buttons next to the search fields, you will always get a listing of your options.
One type of reaction we heard a lot about the original Perseus under PhiloLogic site was that the search forms were rather intimidating to the novice. While keeping the search forms in place for all the power users who have by now become accustomed to them, we decided to offer a radically simplified front page for all our resources, with a pared-down set of search options. On this new home page, you can now navigate to texts directly by entering a citation, search for a word or phrase, or browse works alphabetically by author. We wanted to make sure that finding texts is as intuitive to a classicist reader as possible, and so you can usually look up a text based on its Oxford Classical Dictionary citation. In addition, the homepage gives direct access to word parses, dictionary entries, and grammar sections.
Texts and their translations live in the same databases. In the new release, we have decided to no longer display these in a single browser window. Many users found this confusing. You can now go from translation to original, or read them side by side, by clicking on links ('English', 'Greek', 'Latin'). If there are multiple translations, you will see 'English' and 'English2'. For a demonstration of a typical visit, check the steps in the earlier part of this recent presentation.
Commentaries and Monographs live in two separate databases. On the home page, you can now enter author or title so that it is easy to find out whether a commentary is available for a particular ancient text. Monographs include various grammars. We have made a quick lookup box for grammar sections, in accordance with how these works usually get cited in commentaries and in classrooms.
Dictionaries are now accessible via the parse window in the Greek and Latin databases. In addition, entries in Liddell & Scott and Lewis & Short can be looked up from the homepage. Full text remains searchable from the search forms for the individual dictionaries.
We know about users with good experiences on Linux, Ubuntu, Windows XP, Mac OS as operating systems; we know that Opera, Firefox, and Safari have been successfully used as browsers. Unfortunately Internet Explorer is not compatible with our click-to-parse mechanism. In all other browsers we have tested, a click on a Greek or Latin word should result in a new window with parse information and links to dictionaries. Subsequent clicks will result in this same parse window being 'refreshed'; if you don't see anything, it may be that this window is hidden behind your other browser window(s). If Greek fails to show up as Greek, make sure that your browser can deal with UTF-8 encoding, and download some Unicode font that has Greek in it. There are plenty of free Greek fonts. Cutting and pasting into word processors should be easy. In most cases, you should be able to type in words you search for without diacritics (this also means: no breathings and no iota subscripts), or in transliteration (see 'Info & Help' for guidance); just be sure to also select the corresponding radio button ('no diacritics', 'transliteration') when you do this.
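For readers who prepare search terms in a script, the 'no diacritics' form of a Greek word can be approximated locally. What follows is a minimal sketch in Python, assuming that removing every Unicode combining mark is an acceptable stand-in for whatever the site does internally; the function name strip_diacritics is our own illustration and not part of PhiloLogic.

```python
import unicodedata

def strip_diacritics(word: str) -> str:
    """Remove accents, breathings, and iota subscripts from a Greek word."""
    # NFD splits each pre-combined character into a base letter
    # followed by its combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Combining marks carry the Unicode category "Mn" (nonspacing mark);
    # dropping them leaves only the base letters.
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose, so that the result is again in the canonical NFC form.
    return unicodedata.normalize("NFC", bare)

if __name__ == "__main__":
    for word in ("ἄνθρωπος", "ᾠδή"):
        print(word, "->", strip_diacritics(word))
```

The result is only useful together with the 'no diacritics' radio button mentioned above.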
Unicode detail that is probably too much information: we try to be consistent in using pre-combined Unicode and avoiding the now-deprecated characters that use 'oxia' rather than the canonical 'tonos' combinations. If you use a Greek input method that produces the 'oxia' variant, consider entering your search without diacritics when there are acute accents in play, or installing an input method that adheres to canonical practice. The Mac OS X system has built-in polytonic Greek input that also complies with these standards.
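If you are unsure which variant your input method produces, Unicode normalization can settle the matter. The sketch below uses Python's standard unicodedata module: NFC normalization replaces the 'oxia' code points by their canonical 'tonos' equivalents. The helper name to_canonical is ours, and we make no claim that the site itself normalizes input this way.

```python
import unicodedata

def to_canonical(text: str) -> str:
    """Return text in NFC, which maps 'oxia' code points to 'tonos' ones."""
    return unicodedata.normalize("NFC", text)

oxia_alpha = "\u1f71"    # GREEK SMALL LETTER ALPHA WITH OXIA
tonos_alpha = "\u03ac"   # GREEK SMALL LETTER ALPHA WITH TONOS

if __name__ == "__main__":
    # The two characters look alike but are different code points ...
    print("raw strings equal:       ", oxia_alpha == tonos_alpha)
    # ... and become identical once the text is normalized.
    print("equal after normalizing: ", to_canonical(oxia_alpha) == tonos_alpha)
```

Running a search term through such a step before pasting it into the search box is one way to avoid the mismatch without giving up the accents.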
In the Spring of 2008 we received an ATI grant to develop morphological analysis for the Greek corpus, and to make it searchable. You can learn more about this project by reading abstracts of our presentations on this topic or taking a look at this big poster on how it was all put together. In a more recent presentation, we present a walk-through of a set of searches. For more details on part-of-speech codes, consult the 'Info & Help' sections on the search forms. It is important to point out that the texts were not parsed by hand, so that there will be many erroneous parses. We hope you will help us correct those!
In a typical parse window, you'll see one parse highlighted in light blue. It indicates that our automatic part-of-speech tagger has selected this parse as the most likely one in the context. You will see a number (say, 0.45678) associated with the parse. This expresses the probability the system associates with that particular parse. When you enter a word by hand in a parse box, the system will simply give all possible parses for that form, with no probability score at all (the number displayed will be 0). If you click on a different parse, that parse will turn yellow, and a vote will be registered in our database of parse votes. It will be quarantined there until approved. Parts of the texts have actually been hand-tagged. If you encounter a hand-tagged form, it will be green in color. Even there, data entry problems may come up, so please be critical and report any errors you find: click on the correct parse, or, if the correct parse is not listed, submit a problem report form via the link in the parse window.
If you wish to search for occurrences of a lemma or part-of-speech code, you use the same search field as for normal words (or 'strings'), but you prefix them with 'lemma:' or 'pos:'. For example, 'lemma:nostos' or 'lemma:sum'.
New: by using 'form:' you can ignore the more complex instructions for part-of-speech codes that follow. Simply write out what you think will sufficiently describe the form you are looking for, in any order, but use hyphens between terms. For instance, 'form:optative-act-singular' for an active optative in the singular, where 'form:sg-opt-act' would do the same thing.
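To illustrate why the order of terms does not matter, here is a small sketch in Python of how such a query could be reduced to an unordered set of features. It is an illustration only, not the site's code: the alias table is hypothetical and contains just the spellings used in the example above, and the function name form_terms is our own.

```python
# Hypothetical alias table: only the spellings from the example above.
ALIASES = {
    "optative": "optative", "opt": "optative",
    "active": "active", "act": "active",
    "singular": "singular", "sg": "singular",
}

def form_terms(query: str) -> frozenset:
    """Turn a 'form:' query into an order-independent set of features."""
    if not query.startswith("form:"):
        raise ValueError("expected a query starting with 'form:'")
    terms = query[len("form:"):].split("-")
    # Terms missing from the alias table are kept as typed, not rejected.
    return frozenset(ALIASES.get(term, term) for term in terms)

if __name__ == "__main__":
    long_form = form_terms("form:optative-act-singular")
    short_form = form_terms("form:sg-opt-act")
    print(long_form == short_form)
```

Because both queries reduce to the same set, they describe the same form.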
The part-of-speech codes are less simple to summarize. The Info & Help section has a quick introduction. It is important to know that while a full analysis constitutes ten slots, many of these will be empty (-), and even more will not be of interest to you at a given time. All of these you can leave unspecified with *, but your formulation must be specific enough that an 'a' does define accusative and not aorist. For this it is helpful to know the ordering of the different slots. They are:
1) major part of speech: Verb, Noun, Adjective, Pronoun, particle (g), aDverb, nuMeral, pReposition, Conjunction, Interjection;
2) minor part of speech: a: Article or determinative (Latin is, idem, ipse), Personal, Demonstrative, x: indefinite, Interrogative, Relative, poSsessive, k: reflexive, reCiprocal, propEr;
3) person: 1, 2, 3;
4) number: singular, plural, dual;
5) tense: Present, Imperfect, Aorist, peRfect, pLuperfect, Future, fuTure perfect;
6) mood: Indicative, Subjunctive, Optative, iMperative, iNfinitive, Participle, Gerundive, gerunD, sUpine;
7) voice: Active, Middle, Passive, middlE-passive;
8) gender: Masculine, Feminine, Neuter, Common;
9) case: Nominative, Genitive, Dative, Accusative, aBlative, Vocative;
10) degree: Comparative, Superlative.
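The ten slots can be read off mechanically. The following Python sketch decodes a ten-character code into named features. It assumes that the codes are lowercase and that each letter is the one capitalized (or given explicitly) in the list above; the table and the function name decode are our illustration, not the site's own code.

```python
# Letter tables per slot, read off the list above (assumed to be lowercase).
SLOTS = [
    ("major", {"v": "verb", "n": "noun", "a": "adjective", "p": "pronoun",
               "g": "particle", "d": "adverb", "m": "numeral",
               "r": "preposition", "c": "conjunction", "i": "interjection"}),
    ("minor", {"a": "article", "p": "personal", "d": "demonstrative",
               "x": "indefinite", "i": "interrogative", "r": "relative",
               "s": "possessive", "k": "reflexive", "c": "reciprocal",
               "e": "proper"}),
    ("person", {"1": "1", "2": "2", "3": "3"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("tense", {"p": "present", "i": "imperfect", "a": "aorist",
               "r": "perfect", "l": "pluperfect", "f": "future",
               "t": "future perfect"}),
    ("mood", {"i": "indicative", "s": "subjunctive", "o": "optative",
              "m": "imperative", "n": "infinitive", "p": "participle",
              "g": "gerundive", "d": "gerund", "u": "supine"}),
    ("voice", {"a": "active", "m": "middle", "p": "passive",
               "e": "middle-passive"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter",
                "c": "common"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "b": "ablative", "v": "vocative"}),
    ("degree", {"c": "comparative", "s": "superlative"}),
]

def decode(code: str) -> dict:
    """Map a ten-slot code to its named features, skipping empty slots."""
    if len(code) != len(SLOTS):
        raise ValueError("a full analysis has exactly ten slots")
    features = {}
    for (name, table), letter in zip(SLOTS, code):
        if letter != "-":          # '-' marks an empty slot
            features[name] = table[letter]   # unknown letters raise KeyError
    return features

if __name__ == "__main__":
    print(decode("v-3spia---"))
    print(decode("n--s---ma-"))
```

The same letter means different things in different slots ('a' is aorist in slot 5, active in slot 7, accusative in slot 9), which is why position matters in the searches described next.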
Regular expressions will work to a certain extent. For instance, one could merely specify 'pos:*a-' to capture accusatives. (All slots from 1 through 8 are here left unspecified. We know this because the search field always requires a complete word, and we have ended our word with '-' and not with a wild card). This initial formulation, however, would miss accusatives that are also comparatives or superlatives. In order to include them, try 'pos:*a[-cs]' instead. [xyz] means 'pick any one of the items xyz between the brackets'. Conversely, if one is looking for personal pronouns, it may make sense to use pos:pp* with no further specification about slots 3-8.
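To see which codes a given pattern would capture, the patterns can be tried out against sample codes. The sketch below is a rough Python emulation, assuming that a bare '*' stands for any run of characters and that a pattern must match the complete code; it is not PhiloLogic's own matcher, and the function names are ours.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pos: pattern into a Python regular expression."""
    # A bare '*' becomes '.*'; a '*' that already follows '.' is left alone.
    translated = re.sub(r"(?<!\.)\*", ".*", pattern)
    return re.compile(translated)

def matches(pattern: str, code: str) -> bool:
    """True if the pattern matches the complete ten-slot code."""
    return pattern_to_regex(pattern).fullmatch(code) is not None

if __name__ == "__main__":
    accusative_noun = "n--s---ma-"     # noun, singular, masculine, accusative
    comparative_adj = "a--s---mac"     # same, but an adjective in the comparative
    active_verb = "v-3spia---"         # 'a' here is voice (slot 7), not case
    for pattern in ("*a-", "*a[-cs]"):
        for code in (accusative_noun, comparative_adj, active_verb):
            print(pattern, code, matches(pattern, code))
```

The active verb is never matched by either pattern, because the 'a' has to sit in the ninth slot, immediately before the final slot.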
Part-of-speech and lemma searches can be combined, by means of a semi-colon, or used separately, with a space, if one is specifying different words: The search 'lemma:dokew;pos:v-3s.* pos:.*d-' searches for forms of δοκέω in the 3rd singular (semicolon), and separately, something in the dative.
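The structure of such a query can be made explicit with a small parser. This Python sketch splits a query on spaces into separate words and on semicolons into the criteria for each word; it only illustrates the syntax described above and does not claim to reproduce how PhiloLogic processes the query. The function name parse_query and the key 'word' for unprefixed terms are our own choices.

```python
KNOWN_PREFIXES = ("lemma", "pos", "form")

def parse_query(query: str) -> list:
    """Split a search into words (by space) and criteria (by semicolon)."""
    words = []
    for token in query.split():
        criteria = {}
        for part in token.split(";"):
            prefix, sep, value = part.partition(":")
            if sep and prefix in KNOWN_PREFIXES:
                criteria[prefix] = value
            else:
                # No recognized prefix: treat the term as a plain word.
                criteria["word"] = part
        words.append(criteria)
    return words

if __name__ == "__main__":
    for word in parse_query("lemma:dokew;pos:v-3s.* pos:.*d-"):
        print(word)
```

The example yields two entries: one word that must satisfy both the lemma and the part-of-speech criterion, and a second word constrained by part of speech alone.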
This is probably as good a moment as any to point out that our parser and our search engine do not know Greek or even Latin syntax! You will have to decide for yourself, in searches of this sort, whether the datives you find are in fact datives that are governed by the verb.
Is all of this rather overwhelming? We do realize that the formulas look rather forbidding! If we can find the time and the funding, we will work on more natural language querying (could I please have some perfect active optatives?) to take the place of 'pos:v*roa*'.
We think that this corpus holds great promise both for research and for teaching. Philologists need to do corpus study beyond the single word; more particularly, classical linguists should work on making more evidence-based and quantitative claims than are found in much of the current literature. Teachers who wish to select what vocabulary or constructions to emphasize should have a notion of frequency of use, and rather than making up examples, they could run a quick search for actual examples of constructions. To give a simple example, three definite articles in sequence is not unusual. Now you can find actual examples in Lysias, a suitable author for introductory and intermediate classes, to demonstrate this. On a practical note for teachers, if you send your class a link of this sort, the phenomenon you wished to highlight is highlighted on the page. If you wish to draw your students' attention to a particular part of a page - search for it, and send them the copied URL of the search result. They will see the same highlighting.
As you can probably imagine, there are many, many wheels within wheels to make this site do what it does, and sometimes things get lost in the shuffle. If you see something awry, please let us know. Here's how you can help us improve this site: If you encounter a problem, please use the "Report a Problem" link that you will find on the Results pages.
In addition, we hope you will select the correct parses when you use the parse window. You will see your selection turn yellow; it will also be stored in the database. In fact, it will be quarantined, with all other user votes, until approval, from which point all users will see the corrected parse and new runs of our part-of-speech tagger will be more accurate thanks to the increased amount of so-called training data. Your corrections, therefore, will have both a local impact in their context, and a global impact on the accuracy of the database as a whole.
The parse window has a separate problem report form (in case none of the parses is satisfactory, or the short definition falls, well, short).
This project would not have been possible without open-source software and data shared under creative-commons licences. If you are a faculty member, staff, student, or administrator at an institution of higher learning, get informed about Open Access, Open Content and the Creative Commons. Support the principles they represent, and work for change where you can in your own institution and professional organizations. Regardless of affiliation, classical enthusiasts can support organizations that work with these principles. You can support open-access and creative-commons oriented projects that you like. For classicists, some sites to visit as good clearing houses for this kind of information are Chuck Jones's Ancient World Online, Neel Smith's Vitruvian Design blog, and stoa.org.
Of course, there is always more to wish for. We'd like to go back and clean up a few things we have not cleaned up yet (we will, eventually). We'd like to get lemmas and morphological characteristics more fully integrated into the system (collocations by lemma, more frequency data for morphology). We'd like to make searching for morphological attributes easier. We'd like to take text mining experiments and sequence alignment on the texts out of our lab and onto this public website. We'd like... more texts. Etcetera. We'll see. Much of the programming on this release has been done by a single Classics BA pursuing a Master's in Computer Science (a good amount of additional unfunded work by determined classicists helps, as well as open-source software and assistance by its developers). We wish to register our gratitude to the Provost's office of the University of Chicago for its ATI grant for 2008-09. And of course, κῦδος to Richard Whaling for pulling it off!
A final line-up, then, of people to thank for their help in the past year. All the programming for this release was done by Richard Whaling. We, Richard and Helma, wish to thank our disambiguators: Kristin Dean, Charlotte Krontiris, and Ursula Poole; Walt Shandruk, for munging through a pile of Latin data on short notice; the Perseus Project, for sharing data and expertise; Martin Mueller, for consultation and making available his Homeric data; and Hugh Cayless, for making our life easier with his Transcoder. We thank the entire staff at ARTFL for welcoming classicists in their midst and generously sharing expertise, caffeine, and mirth.
Chicago, July 2009