Teaching a Computer to Talk (Sort of)
Generating Random Text with Markov Chains
“My friends, when we began this journey, I am so proud. So proud. So proud. So proud. So proud. So proud. So proud. So proud. So proud of” – Doug Schmord
Then we saw all the things in the hat. Then we saw mother’s new gown! Her gown with the fan, and the cake! – Dr. Schmeuss
Introduction
What’s the easiest way to teach a computer to talk?
This post shows a simple (interactive!) way to generate English text, providing an infinitely renewable source of Dr. Seuss stories and Doug Ford campaign speeches. The algorithm is available as an R package on GitHub, and there’s an interactive Shiny web app so you can try it yourself.
Motivation and research question
When it comes to computer-generated text, massive deep-learning systems like GPT-3 generate a lot of splashy headlines. But these systems are also opaque and have ridiculous environmental impacts–although those impacts can be hard to measure precisely. More to the point, they often don’t work, in ways that can be hilarious, or, uh, not hilarious.
Reflecting on how poorly these large, expensive models often work, I wondered how hard it would be to rig up a lo-fi text generator using basic math. More specifically:
- Can we generate semi-plausible English text modeled after an input text using simple and transparent methods?
The rest of this post will try to convince you that the answer is “yes, sort of,” by walking through the R package markovtext: A deeply unserious package for generating random text that mimics a given input text using Markov chains. Obviously this won’t have the generality or flexibility of GPT-3 and its friends, but I believe there’s value in seeing how far you can get with simple surveyable methods.
Skip the math, show me the interactive stuff!
Click here to play with the algorithm using an interactive Shiny web app.
It has a few default settings (Dr. Seuss, Doug Ford, Nietzsche, etc.) and you can also provide your own input text and adjust some of the parameters. You might try feeding it some classic literature, or perhaps something more contemporary.
Okay, now let’s talk math
What if we wanted to write English text probabilistically? We could start by choosing some words–say, 1000 of them–assigning each one a number, and then rolling a 1000-sided die and writing down the corresponding words. This would generate strings of English words, but you’re probably already thinking that it wouldn’t really be English text. We’d get all kinds of nonsense.
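Just to make the thought experiment concrete, here’s roughly what that die-rolling approach looks like in R (the tiny vocabulary below is made up for illustration; imagine 1000 words instead of six):

# A naive "1000-sided die" generator: sample words uniformly at random.
# This tiny vocabulary is a stand-in for a full 1000-word list.
vocabulary <- c("the", "cat", "ran", "green", "eggs", "ham")
paste(sample(vocabulary, size = 20, replace = TRUE), collapse = " ")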
In coherent English, words tend to follow specific other words. So we can’t just look at how often each specific word is used; we need to consider how often each specific word is used after one or more other words. We need a random system with some kind of a “memory.”
This is where Markov chains come in: very roughly, a Markov chain is a discrete stochastic process where the probability of future outcomes depends only on the system’s present state.1 While a coin flip famously “has no memory,” a Markov chain does–just a little bit, at least.
This would let us ask the following question: given the word(s) we’ve just seen, which words are likely to come next? Both the words themselves and their likelihoods will depend on the input text we feed our algorithm.
So we put this all together into a simple model: we’ll generate text one word at a time, using probabilities based on observed frequencies in an input text. We’ll break the input text down into a string of words (treating some punctuation marks as special words), count how often each word comes after each other word (or pair of words), and then use those counts to generate new text conditioned on the words we’ve generated so far.
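Here’s a minimal sketch of that whole pipeline in base R, using a one-word “memory” so the code stays short. This is my own toy illustration, not markovtext’s actual internals:

# Toy Markov-chain text generator with a one-word "memory"
# (an illustration only, not how markovtext is implemented).
input <- "the cat sat on the mat and the cat ran to the mat"
words <- strsplit(tolower(input), "\\s+")[[1]]

# Record, for every word in the input, which word followed it.
pairs <- data.frame(current = head(words, -1),
                    nxt     = tail(words, -1),
                    stringsAsFactors = FALSE)

# Start somewhere, then repeatedly sample the next word
# from the observed followers of the current word.
current <- "the"
output  <- current
for (i in 1:12) {
  followers <- pairs$nxt[pairs$current == current]
  if (length(followers) == 0) break  # dead end: this word was never followed by anything
  current <- sample(followers, size = 1)
  output  <- c(output, current)
}
paste(output, collapse = " ")

markovtext does essentially this, except the “memory” covers the preceding word or two, punctuation marks are treated as words in their own right, and the counts are stored in a reusable frequency table.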
Generating some random text with R
You can install the development version of the R package markovtext from GitHub with:
# install.packages("devtools")
devtools::install_github("chris31415926535/markovtext")
The package has only two functions:
- markovtext::get_word_freqs() takes an input text and generates a word-frequency table.
- markovtext::generate_text() uses a word-frequency table to generate random text.
It also includes a few sample word-frequency tables to get you started. I’ll walk through some extremely legitimate use-cases here.
For the Doug Ford superfans
Perhaps you are a Doug Ford superfan, and your only wish in life is for a never-ending source of wisdom from Ontario’s 26th Premier. Today is your lucky day, for nirvana is only one function call away:
dougford_text <- markovtext::generate_text(markovtext::wordfreqs_dougford_3grams, word_length = 100)
knitr::kable(dougford_text, col.names = "")
I want to thank each of you. Hazel mccallion, I thank you. And we will reduce your gas prices and keep more money in your pocket. A plan to fix our economy and create more good, paying jobs. A legacy of service to the people. We will deliver on our plan for the people of this province around, so our children and their children will always put you first and we will make ontario once again the engine of canada. My friends what a response. This victory belongs to you.
I fed the algorithm Doug Ford’s victory speech, and asked it to calculate probabilities based on the past two words. So for example, “open for” will always be followed by “business,” but “thank you” could be followed by “for” (12.5% chance), “from” (12.5%), “so” (12.5%), a comma (12.5%), or a period (50%).
The results are–to me, at least, and by the standards of Doug Ford speeches–surprisingly coherent.
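If you’re curious where numbers like that 50% come from, here’s the counting logic on a made-up snippet (this is not the actual speech text, just a toy example):

# Conditional next-word frequencies for the two-word window "thank you",
# computed on a made-up toy snippet rather than the real speech.
toy <- c("thank", "you", ".", "thank", "you", "for", "everything", ".",
         "thank", "you", "so", "much", ".", "thank", "you", ".")
window_is_thank_you <- head(toy, -2) == "thank" & head(tail(toy, -1), -1) == "you"
followers <- tail(toy, -2)[window_is_thank_you]
prop.table(table(followers))
# Result: "." follows "thank you" half the time here; "for" and "so" a quarter each.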
For three-year-olds who just won’t go to sleep
Or perhaps you’re a beleaguered parent facing a three-year-old with an insatiable demand for Dr. Seuss bedtime stories. Again, I’ve got you covered:
seuss_text <- markovtext::generate_text(markovtext::wordfreqs_catinthehat_3grams, word_length = 100)
knitr::kable(seuss_text, col.names = "")
I do not wish to go! They should not fly kites, said the cat. In this box are two things, said the fish in the pot. But I like this? Oh, what would she say? Oh, so tame! They should not be about. He picked up the hook. Now look at me! Look at this! Look at this! Look at me now! It is wet and the dish, and the dish, and the sun did not know what to say. You did
If you squint a bit, the output has the general form and sometimes even the cadence of a Dr. Seuss poem. Of course there’s no hope of a narrative, but that was never our intent.
Generating text based on your own inputs
You can also supply an input text for the package to mimic.
Here we’ll generate text based on this famous aphorism from the philosopher James Robert Brown’s 1973 studio album:
I can do wheelin’, I can do dealin’, But I don’t do no damn squealin’. I can dig rappin’, I’m ready! I can dig scrappin’. But I can’t dig that backstabbin’.
Create a word-frequency table by feeding that text to get_word_freqs(), and generate text with a simple call to generate_text().
Here’s a sample:
library(markovtext)
text <- "I can do wheelin', I can do dealin',
But I don't do no damn squealin'.
I can dig rappin', I'm ready! I can dig scrappin'.
But I can't dig that backstabbin'."
wordfreqs <- markovtext::get_word_freqs(text, n_grams = 3)
new_text <- markovtext::generate_text(wordfreqs, word_length = 50)
knitr::kable(new_text, col.names = "")
I can do dealin, but I don’t do no damn squealin. I can dig scrappin. But I don’t do no damn squealin. I can do dealin, but I can’t dig that backstabbin. But I can’t dig that backstabbin. But I don’t do no damn
The output is, again, a (mostly) grammatically correct string of text that replicates the structures found in the input text. Leading and trailing punctuation is trimmed from the input words, so words like “wheelin’” lose their trailing apostrophes in the output.
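I won’t claim this is the exact regex the package uses, but the trimming behaves roughly like this (the pattern below is my own approximation):

# Approximately how leading/trailing punctuation gets trimmed
# (my guess at the behaviour, not the package's actual regex).
gsub("^[[:punct:]]+|[[:punct:]]+$", "", c("wheelin'", "squealin'.", "don't"))
# Returns "wheelin", "squealin", "don't": internal apostrophes survive, trailing ones don't.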
Next steps?
I’m morbidly curious to see what would happen if you extended the “memory” back an arbitrary number of steps. My guess is that the output would become more grammatically correct, but that you’d also need to feed it much more data. In a short text, for example, any given string of three or four words is likely to occur only once, so your “random” output would just reproduce the input perfectly.
But now we’re getting beyond entertainment into more serious research, which is definitely outside the scope of this extremely unserious blog post.
Conclusion
You can generate semi-plausible English text using a surprisingly simple probabilistic model. The code is available in a custom R package on GitHub, and I’ve wrapped it into a nice interactive web app so you can explore it yourself.
Did you try it out? Did it generate anything good?
This description is highly oversimplified!↩︎