
A Collocation-Driven Method of Discovering Rhymes (in Czech, English, and French Poetry)

Abstract

The chapter presents a model for discovering rhymes in a corpus of poetic texts. The algorithm adapts a standard collocation extraction technique to identify common rhyme pairs in a corpus; the output is then used as a training set for a simple machine-learning step. The method has been tested on corpora of poetry in three languages (Czech, English, and French), with F-scores ranging from 0.9 to 0.95.


Notes

  1.

    According to the authors, the English corpus originally comes from the study of Sonderegger (2011) and the French corpus from the ARTFL (2009) project.

  2.

    “Far is that lost dream now, a shadow no more found,/Like visions of white towns, deep in the waters drowned,/The last indignant thoughts of the defeated dead,/Their unremembered names, the clamour of old fights,/The worn-out northern lights after their gleam is fled,/The untuned harp, whose strings distil no more delights.” (translation: Edith Pargeter)

  3.

    Here, we use the experimental values: m = 4, s = 5.

  4.

    In regular collocation extraction, the two most frequently used association measures are the T-score and the MI-score. The former derives from statistical hypothesis testing (Student’s t-test) and thus quantifies the confidence with which we can assert that the deviation from the expected frequency is not due to chance; it gives no information on the strength of the association. The MI-score, on the other hand, directly measures the strength of the association but gives no information on the probability that it arose by chance. In practice, the T-score is sensitive to co-occurrences of high-frequency grammatical words (the more the evidence, the higher the confidence), while the MI-score tends to overestimate co-occurrences of low-frequency words. As we are interested in distinguishing significant co-occurrences from random ones, not in ranking them from strongest to weakest, the T-score seems to be the optimal choice here.
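    In standard collocation work, the T-score takes the form t = (O − E) / √O, where O is the observed co-occurrence count and E = f(a)·f(b)/N the count expected under independence. The sketch below is a minimal illustration of that formula applied to a pair of line-final words, not the authors’ implementation; the word counts are invented.

    ```python
    import math

    def t_score(observed, freq_a, freq_b, n):
        """T-score of a word pair: (observed - expected) / sqrt(observed).

        observed -- co-occurrence count of the pair (here: how often the two
                    words appear as rhyme candidates in line-final position),
        freq_a, freq_b -- corpus frequencies of the two words,
        n -- total number of candidate pairs in the corpus.
        """
        expected = freq_a * freq_b / n
        return (observed - expected) / math.sqrt(observed)

    # Invented counts: the pair co-occurs 12 times where independence
    # predicts only 30 * 20 / 10000 = 0.06 co-occurrences.
    t = t_score(observed=12, freq_a=30, freq_b=20, n=10000)
    print(round(t, 2))  # → 3.45
    ```

    A value above the critical threshold α = 3.078 used in this chapter (see note 5) would mark the pair as a significant collocation.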

  5.

    Here, we use α = 3.078.

  6.

    See issue #323 at MaryTTS (2017). It has since been fixed.

  7.

    We have used the stanza-independent EM model with θ initialized by orthographic similarity (Reddy & Knight, 2011a, p. 79), whose original F-score is reported in Table 5.2, column “ortho. init.” (ibid., p. 81).

  8.

    As Reddy and Knight (2011b) point out, their algorithm is extremely memory-intensive. Training it on a corpus as large as the entire Czech corpus would require a machine with several terabytes of RAM. Keeping the data on a hard drive instead of in RAM would, on the other hand, lead to several months of computation per evaluation of each subcorpus.

  9.

    The most appropriate comparison would be one in which the expectation-maximization algorithm considers all possible rhyme schemes of stanzas of a given length. As already mentioned, such an approach is far beyond the capabilities of contemporary machines in general. We were thus only able to process two small subcorpora containing short stanzas this way, cs-1740 and cs-1750, obtaining F-scores of 0.61 and 0.75, respectively.
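    The combinatorial explosion can be made concrete if one treats a rhyme scheme as a set partition of a stanza’s lines (which lines rhyme with which), in which case the number of candidate schemes for an n-line stanza is the Bell number B(n). The following sketch, under that assumption, only illustrates the growth and is not part of the chapter’s method:

    ```python
    def bell(n):
        """Bell number B(n): the number of set partitions of n items,
        i.e. the number of distinct rhyme schemes for an n-line stanza
        (under the scheme-as-partition assumption)."""
        row = [1]  # first row of the Bell triangle
        for _ in range(n):
            # Each new row starts with the last element of the previous row;
            # each further entry adds the neighbour from the previous row.
            new_row = [row[-1]]
            for value in row:
                new_row.append(new_row[-1] + value)
            row = new_row
        return row[0]

    print(bell(4))   # → 15 schemes for a quatrain
    print(bell(14))  # → 190899322 schemes for a 14-line stanza
    ```

    An EM algorithm that must keep expectations for every scheme of every attested stanza length therefore quickly exhausts memory, which is consistent with the terabyte-scale requirements mentioned in note 8.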

  10.

    The extremely low recall for French is due to the abovementioned inaccurate setting of relevant substrings (section “Learning”). Notice also that the precision for Czech steadily decreases starting with authors born at the beginning of the 19th century. This may be attributed to the fact that, after the national-revival period, rhyme pairs with matching vowel lengths went out of fashion (cf. Jakobson 1923/1995, pp. 204–211).


Acknowledgments

Funding: This work was supported by the Czech Science Foundation, project GA17-01723S (“Stylometric Analysis of Poetic Texts”) and the research institution 68378068.

Data and source code: available at http://github.com/versotym/rhymeTagger/.


Corresponding author

Correspondence to Petr Plecháč.


Copyright information

© 2018 Springer Nature Switzerland AG


Cite this chapter

Plecháč, P. (2018). A Collocation-Driven Method of Discovering Rhymes (in Czech, English, and French Poetry). In: Fidler, M., Cvrček, V. (eds) Taming the Corpus. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-98017-1_5
