The results of a recent post on word frequencies in Latin suggested that Zipf’s law would hold up for this language, and I wanted to test it to be sure. As usually formulated, Zipf’s law states that, in a natural-language corpus, the frequency of a word is inversely proportional to its frequency rank.

The “key word-in-context” (KWIC) index was an innovation of early information retrieval, the basic concepts of which were developed in the late 1950s by H.P. Luhn. The idea is to produce a list of all occurrences of a word, aligned so that the word is printed as a column in the center of the text with the corresponding context printed to the immediate left and right. This allows a user to scan quickly a large number of uses in a given text. For example, David Packard’s 1968 A Concordance to Livy uses an alphabetical KWIC format. Here are the first entries for the preposition e in Packard’s concordance:

Using the Classical Language Toolkit and the Natural Language Toolkit’s Text module, we can easily create KWICs for texts in the Latin Library.

First, we can import a text from the Latin Library (here, Cicero’s De amicitia) as a list of words:

In : from cltk.corpus.latin import latinlibrary
In : amicitia_words = latinlibrary.words('cicero/amic.txt')

We can then convert this list of words to an NLTK Text:

In : import nltk
In : amicitia_text = nltk.Text(amicitia_words)

Now that we have an NLTK text, there are several methods available to us, including “concordance,” which generates a KWIC for us based on keywords that we provide. Here, for example, is the NLTK concordance for ‘amicus’:

In : amicitia_text.concordance('amicus')

scendant. Quonam enim modo quisquam amicus esse poterit ei, cui se putabit in
Optare, ut quam saepissime peccet amicus, quo plures det sibi tamquam ansas
Quamquam Ennius recte : Amicus certus in re incerta cernitur, tam
M in amicitiam transferetur, verus amicus numquam reperietur est enim is qu
Secerni autem blandus amicus a vero et internosci tam potest adh

The KWICs generated by NLTK Text are case insensitive (see amicus and Amicus above) and sorted sequentially by location in the text. There’s not much customization available for the method, so this is pretty much what it does. Admittedly, it is pretty basic: it does not even return an identification or location code to help the user move easily to the wider context, and the only way we know that the fifth match is in Chapter 95 is that the chapter number happens to be included in the context. At the same time, it is another step towards combining existing resources and tools (here, NLTK Text and a CLTK corpus) to explore Latin literature from different angles. In a future post, I will build a KWIC method from scratch that offers more flexibility, especially with respect to context scope and location identification.
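To make the KWIC idea concrete without relying on NLTK, here is a minimal sketch in plain Python. It is not the from-scratch method promised for a later post, only an illustration under stated assumptions: the function names `kwic` and `format_kwic`, the `window` and `width` parameters, and the toy token list are all hypothetical. Each hit carries the position of the keyword in the token list, which is one simple way to give the reader a location to return to.

```python
from typing import List, Tuple

# Hypothetical helper names (kwic, format_kwic); these are not part of NLTK or CLTK.
Hit = Tuple[int, str, str, str]


def kwic(tokens: List[str], keyword: str, window: int = 4) -> List[Hit]:
    """Return (token_index, left_context, match, right_context) for each hit."""
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        # Case-insensitive comparison, so 'amicus' also matches 'Amicus'.
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((i, left, token, right))
    return hits


def format_kwic(hits: List[Hit], width: int = 30) -> List[str]:
    """Right-align the left context so the keyword forms a central column."""
    lines = []
    for index, left, match, right in hits:
        lines.append(f"{index:>6}  {left[-width:]:>{width}}  {match}  {right[:width]}")
    return lines


# Toy token list for illustration only; a real run would pass a full word list.
sample = "verus amicus numquam reperietur ; Amicus certus in re incerta cernitur".split()
for line in format_kwic(kwic(sample, "amicus", window=3)):
    print(line)
```

Because `kwic` takes any list of strings, a word list such as `amicitia_words` could be passed in directly. The index it reports is a token offset, not a chapter or section number; mapping offsets to citations would need additional information about the text's structure.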