Abstract
This chapter presents a model for discovering rhymes in a corpus of poetic texts. The algorithm adapts the standard collocation extraction technique to identify common rhyme pairs in the corpus. The output is then used as a training set for simple machine learning. The method has been tested on corpora of poetry in three languages (Czech, English, and French), with F-scores ranging from 0.9 to 0.95.
Notes
- 1.
- 2.
“Far is that lost dream now, a shadow no more found,/Like visions of white towns, deep in the waters drowned,/The last indignant thoughts of the defeated dead,/Their unremembered names, the clamour of old fights,/The worn-out northern lights after their gleam is fled,/The untuned harp, whose strings distil no more delights.” (translation: Edith Pargeter)
- 3.
Here, we use the experimental values: m = 4, s = 5.
- 4.
In regular collocation extraction, the two most frequently used measures are the T-score and the MI-score. The first derives from statistical hypothesis testing (Student’s t-test) and thus estimates the confidence with which we can assert that the difference from the expected frequency is not random; it gives no information on the strength of the association. The MI-score, on the other hand, directly measures the strength of the association but gives no information on how likely it is to have arisen by chance. In practice, T-scores are sensitive to the co-occurrence of high-frequency grammatical words (the more evidence, the more confidence), while MI-scores tend to overestimate the co-occurrences of low-frequency words. As we are interested in distinguishing significant co-occurrences from random ones, and not in ranking them from strongest to weakest, the T-score seems to be the optimal choice here.
- 5.
Here, we use α = 3.078.
- 6.
See issue #323 at MaryTTS (2017). It has since been fixed.
- 7.
- 8.
As Reddy and Knight (2011b) point out, their algorithm places very high demands on internal memory. Training it on an entire corpus as large as the Czech corpus would require a machine with several terabytes of RAM. Keeping the data on a hard drive instead of in RAM would, on the other hand, lead to several months of computational time per evaluation of each subcorpus.
- 9.
The most appropriate comparison would be one in which the expectation maximization algorithm works with all the possible schemes of stanzas of a given length. Such an approach, as already mentioned, is far beyond the capabilities of contemporary machines in general. We were thus only able to process two small subcorpora this way, both with short stanzas only: cs-1740 and cs-1750, with F-scores of 0.61 and 0.75, respectively.
- 10.
The extremely low recall for French is due to the inaccurate setting of relevant substrings mentioned above (section “Learning”). Notice also that the precision for Czech decreases steadily, starting with authors born at the beginning of the 19th century. This may be attributed to the fact that, after the national-revival period, rhyme pairs in which vowel lengths match went out of fashion (cf. Jakobson 1923/1995, pp. 204–211).
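The contrast drawn in note 4 between the two association measures can be illustrated with a minimal sketch. It assumes the textbook formulations used in collocation extraction, T = (O − E) / √O and MI = log2(O / E), where O is the observed co-occurrence frequency and E = f(x)·f(y) / N is the frequency expected under independence. The function names and all counts are hypothetical and chosen only for illustration; they are not taken from the chapter, and the chapter's own adaptation to rhyme pairs may differ in detail.

```python
import math


def expected_frequency(freq_x: int, freq_y: int, corpus_size: int) -> float:
    """Co-occurrence frequency expected if x and y were independent."""
    return freq_x * freq_y / corpus_size


def t_score(observed: int, freq_x: int, freq_y: int, corpus_size: int) -> float:
    """Textbook T-score: confidence that the co-occurrence is not random."""
    if observed <= 0:
        raise ValueError("observed frequency must be positive")
    expected = expected_frequency(freq_x, freq_y, corpus_size)
    return (observed - expected) / math.sqrt(observed)


def mi_score(observed: int, freq_x: int, freq_y: int, corpus_size: int) -> float:
    """Textbook MI-score: strength of the association, in bits."""
    if observed <= 0:
        raise ValueError("observed frequency must be positive")
    expected = expected_frequency(freq_x, freq_y, corpus_size)
    return math.log2(observed / expected)


# Hypothetical counts in a corpus of one million units.
CORPUS_SIZE = 1_000_000
frequent_pair = dict(observed=3000, freq_x=50_000, freq_y=40_000)
rare_pair = dict(observed=3, freq_x=5, freq_y=4)

for label, pair in (("frequent", frequent_pair), ("rare", rare_pair)):
    t = t_score(corpus_size=CORPUS_SIZE, **pair)
    mi = mi_score(corpus_size=CORPUS_SIZE, **pair)
    print(f"{label:8s}  T = {t:6.2f}  MI = {mi:6.2f}")
```

With these toy counts, the frequent pair receives the higher T-score although it occurs only 1.5 times more often than expected, while the rare pair, attested just three times, receives by far the higher MI-score. This is the behaviour the note describes: the T-score rewards the amount of evidence, the MI-score rewards the ratio of observed to expected frequency.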
References
ARTFL: American and French research on the Treasury of the French Language. (2009). Centre National de la Recherche Scientifique/University of Chicago. http://artfl-project.uchicago.edu/content/artfl-frantext. Accessed 1 Mar 2017.
Crystal, D. (2007). Original pronunciation transcriptions of Shakespeare’s Sonnets. http://www.davidcrystal.com/books-and-articles/shakespeare. Accessed 1 Mar 2017.
Gardner, M. (1978). The bells: Versatile numbers that can count partitions of a set, primes and even rhymes. Scientific American, 238, 24–30.
Jakobson, R. (1923/1995). Základy českého verše [Foundations of Czech Verse]. In M. Červenka (Ed.), Poetická funkce [Poetic Function] (pp. 157–248). Jinočany, Czech Republic: H&H.
MaryTTS: An open source, multilingual text-to-speech synthesis system. (2017). GitHub. http://github.com/marytts/marytts. Accessed 1 Mar 2017.
Plecháč, P. (2016). Czech verse processing system KVĚTA: Phonetic and metrical components. Glottotheory, 7, 159–174. https://doi.org/10.1515/glot-2016-0013
Plecháč, P., & Kolár, R. (2015). The corpus of Czech verse. Studia Metrica et Poetica, 2, 107–118. https://doi.org/10.12697/smp.2015.2.1.05
Reddy, S., & Knight, K. (2011a). Unsupervised discovery of rhyme schemes. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (pp. 77–82). Portland, OR: ACL.
Reddy, S., & Knight, K. (2011b). Unsupervised discovery of rhyme schemes. The code. GitHub. https://github.com/sravanareddy/rhymediscovery. Accessed 1 Mar 2017.
Sonderegger, M. (2011). Applications of graph theory to an English rhyming corpus. Computer Speech and Language, 25, 655–678.
Acknowledgments
Funding: This work was supported by the Czech Science Foundation, project GA17-01723S (“Stylometric Analysis of Poetic Texts”) and the research institution 68378068.
Data and source code: available at http://github.com/versotym/rhymeTagger/.
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Plecháč, P. (2018). A Collocation-Driven Method of Discovering Rhymes (in Czech, English, and French Poetry). In: Fidler, M., Cvrček, V. (eds) Taming the Corpus. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-98017-1_5
DOI: https://doi.org/10.1007/978-3-319-98017-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98016-4
Online ISBN: 978-3-319-98017-1
eBook Packages: Mathematics and Statistics (R0)