
A Collocation-Driven Method of Discovering Rhymes (in Czech, English, and French Poetry)

Abstract

The chapter presents a model for discovering rhymes in a corpus of poetic texts. The algorithm adapts a standard collocation extraction technique to identify common rhyme pairs in a corpus; the output is then used as a training set for a simple machine-learning step. The method has been tested on corpora of poetry in three languages (Czech, English, and French), with F-scores ranging from 0.9 to 0.95.


Notes

  1.

    According to the authors, the English corpus originally comes from the study of Sonderegger (2011) and the French corpus from the ARTFL (2009) project.

  2.

    “Far is that lost dream now, a shadow no more found,/Like visions of white towns, deep in the waters drowned,/The last indignant thoughts of the defeated dead,/Their unremembered names, the clamour of old fights,/The worn-out northern lights after their gleam is fled,/The untuned harp, whose strings distil no more delights.” (translation: Edith Pargeter)

  3.

    Here, we use the experimental values: m = 4, s = 5.

  4.

    In regular collocation extraction, the two most frequently used association measures are the T-score and the MI-score. The former derives from statistical hypothesis testing (Student’s t-test) and thus quantifies the confidence with which we can assert that the deviation from the expected frequency is not due to chance; it gives no information on the strength of the association. The MI-score, on the other hand, directly measures the strength of the association but gives no information on the probability that it arose by chance. In practice, the T-score is sensitive to co-occurrences of high-frequency grammatical words (the more the evidence, the higher the confidence), while the MI-score tends to overestimate co-occurrences of low-frequency words. As we are interested in distinguishing significant co-occurrences from random ones, not in ranking them from strongest to weakest, the T-score seems to be the optimal choice here.
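    In standard collocation work, the T-score takes the form t = (O − E) / √O, where O is the observed co-occurrence count and E = f(a)·f(b)/N the count expected under independence. The sketch below is a minimal illustration of that formula applied to a pair of line-final words, not the authors’ implementation; the word counts are invented.

    ```python
    import math

    def t_score(observed, freq_a, freq_b, n):
        """T-score of a word pair: (observed - expected) / sqrt(observed).

        observed -- co-occurrence count of the pair (here: how often the two
                    words appear as rhyme candidates in line-final position),
        freq_a, freq_b -- corpus frequencies of the two words,
        n -- total number of candidate pairs in the corpus.
        """
        expected = freq_a * freq_b / n
        return (observed - expected) / math.sqrt(observed)

    # Invented counts: the pair co-occurs 12 times where independence
    # predicts only 30 * 20 / 10000 = 0.06 co-occurrences.
    t = t_score(observed=12, freq_a=30, freq_b=20, n=10000)
    print(round(t, 2))  # → 3.45
    ```

    A value above the critical threshold α = 3.078 used in this chapter (see note 5) would mark the pair as a significant collocation.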

  5.

    Here, we use α = 3.078.

  6.

    See issue #323 at MaryTTS (2017). It has since been fixed.

  7.

    We have used the stanza-independent EM model with θ initialized by orthographic similarity (Reddy & Knight, 2011a, p. 79), whose original F-score is reported in Table 5.2, column “ortho. init.” (ibid., p. 81).

  8.

    As Reddy and Knight (2011b) point out, their algorithm is extremely memory-intensive. Training it on a corpus as large as the entire Czech corpus would require a machine with several terabytes of RAM. Keeping the data on a hard drive instead of in RAM would, on the other hand, lead to several months of computation per evaluation of each subcorpus.

  9.

    The most appropriate comparison would be one in which the expectation-maximization algorithm considers all possible rhyme schemes of stanzas of a given length. As already mentioned, such an approach is far beyond the capabilities of contemporary machines in general. We were thus only able to process two small subcorpora containing short stanzas this way, cs-1740 and cs-1750, obtaining F-scores of 0.61 and 0.75, respectively.
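    The combinatorial explosion can be made concrete if one treats a rhyme scheme as a set partition of a stanza’s lines (which lines rhyme with which), in which case the number of candidate schemes for an n-line stanza is the Bell number B(n). The following sketch, under that assumption, only illustrates the growth and is not part of the chapter’s method:

    ```python
    def bell(n):
        """Bell number B(n): the number of set partitions of n items,
        i.e. the number of distinct rhyme schemes for an n-line stanza
        (under the scheme-as-partition assumption)."""
        row = [1]  # first row of the Bell triangle
        for _ in range(n):
            # Each new row starts with the last element of the previous row;
            # each further entry adds the neighbour from the previous row.
            new_row = [row[-1]]
            for value in row:
                new_row.append(new_row[-1] + value)
            row = new_row
        return row[0]

    print(bell(4))   # → 15 schemes for a quatrain
    print(bell(14))  # → 190899322 schemes for a 14-line stanza
    ```

    An EM algorithm that must keep expectations for every scheme of every attested stanza length therefore quickly exhausts memory, which is consistent with the terabyte-scale requirements mentioned in note 8.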

  10.

    The extremely low recall for French is due to the abovementioned inaccurate setting of relevant substrings (section “Learning”). Notice also that the precision for Czech steadily decreases starting with authors born at the beginning of the 19th century. This may be attributed to the fact that, after the national-revival period, rhyme pairs with matching vowel lengths went out of fashion (cf. Jakobson 1923/1995, pp. 204–211).


Acknowledgments

Funding: This work was supported by the Czech Science Foundation, project GA17-01723S (“Stylometric Analysis of Poetic Texts”) and the research institution 68378068.

Data and source code: available at http://github.com/versotym/rhymeTagger/.


Corresponding author

Correspondence to Petr Plecháč.


Copyright information

© 2018 Springer Nature Switzerland AG


Cite this chapter

Plecháč, P. (2018). A Collocation-Driven Method of Discovering Rhymes (in Czech, English, and French Poetry). In: Fidler, M., Cvrček, V. (eds) Taming the Corpus. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-98017-1_5
