Solving Wordle using Regex (and a little bit of Python)¶

For debugging the algorithm, I'm using the wonderful "Wordle Archive" game. If you haven't checked it out yet, you can find it here: https://metzger.media/games/wordle-archive

Give Me the Words¶

The rules of the game is to guess the daily 5-letter English word. To start playing around with grep, we'll need to first obtain a dataset of, well, 5-letter English words.

The first result of my Google query gets me to this list: https://github.com/first20hours/google-10000-english, which gets me the most frequent 10,000 English words with the added bonus of filtering out the swear words (I'm assuming it's a family-friendly game so I welcome the narrowed search-space)

In [1]:
url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears-medium.txt'

!mkdir -p data
!wget -q {url} -P data

Let the Game Begin¶

For those of you who managed to escape the mania, here are the rules from cnet:

Wordle gives players six chances to guess a randomly selected five-letter word. If you have the right letter in the right spot, it shows up green. A correct letter in the wrong spot shows up yellow. A letter that isn't in the word in any spot shows up gray.

Let's run some guesses on the word of 1/23/2022 just to get something to start with.

image.png

Subsequent guesses here are not valid candidates (for example mouse is not a valid guess as we already know from the gray E in snake the E is not in the word), but that should be enough to build some intuition on how to solve it.

Our first guess SNAKE returned all gray, meaning none of S, N, A, K and E are in the word. In regex this is translated into negating a character class. Since there are 5 letters, we'd need to repeat this 5 times.

In [2]:
pattern='[^snake][^snake][^snake][^snake][^snake]'

!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | head
which
would
world
group
could
right
forum
photo
topic
court
In [3]:
!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | wc -l
108

Moving on to the second line we get a few more gray letters and a yellow M, meaning M is in the word but not in the right place. Let's update the possible letters for each spot:

In [4]:
pattern='[^msnakeous][^snakeous][^snakeous][^snakeous][^snakeous]'

!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | head
which
right
light
child
third
civil
limit
birth
fight
width
In [5]:
!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | wc -l
27

Nice. Without doing much we could narrow down to 27 words. Moving on to feedback we got from our guess on the third attempt, we get the following hints:

  1. I is in the word but not in the fourth spot
  2. R is in the word and in the second spot
  3. T is not in the word

Let's update our negated character classes and fix R in the second spot:

In [6]:
pattern='[^msnakeoust]r[^snakeoust][^snakeousti][^snakeoust]'

!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt'
grill
drill

The word 'crimp' (the correct answer) is not there, so we'd need a bigger corpus

cat count_1w.txt | grep -Po "^[a-z]{5,5}(?=\t)" | head -100000 > 100k

Now let's try with incorporating the very revealing information of our lucky fourth guess:

In [7]:
pattern='crim[^snakeoust]'

!grep '^{pattern}$' data/100k
crimp

A More Interesting Example¶

Now let's try today's word (1/25/2022), this time using our simple algorithm to assist us "guessing" it:

image.png

In [8]:
pattern='[^moe][^moe][^moeu][^moes][^moe]'
!grep '^{pattern}$' data/100k | wc -l
2945
In [9]:
!grep '^{pattern}$' data/100k | head
which
click
links
right
using
black
think
shall
still
visit

These are a lot to sift through, but if we look carefully we notice that most of the suggested words do not contain U nor S, although we know for sure they exist in the word. It seems like we're not using all the information we got.

Let's look at U. It has to be at least once in any position that is not the 3rd. That's all we know for now. So the options for that would be:

U _ _ _ _
_ U _ _ _
_ _ _ U _
_ _ _ _ U

The same goes to S:

S _ _ _ _
_ S _ _ _
_ _ S _ _
_ _ _ _ S

To account for all the possible options, we'd need to combine these two hints. One possible combination could be: U S _ _ _. Even though S could have potentially taken the first spot, we can only choose one letter for that spot, so it's not entirely a cartesian product, but a subset of it.

Before we move on to computing the combinations, let's see how we could feed grep with multiple alternative patterns:

In [10]:
!grep '(which|could)' data/100k

Argg that didn't work. This seemingly basic regex syntax feature is not supported by default. Let's RTFM and run man grep. It reads:

grep understands three different versions of regular expression syntax: “basic” (BRE), “extended” (ERE) and “perl” (PCRE).

Reading a little more, we find out that the following features are not guaranteed to be available in the basic (default) form:

  • Character Classes and Bracket Expressions
  • Anchoring
  • The Backslash Character and Special Expressions
  • Repetition
  • Concatenation
  • Alternation
  • Precedence
  • Back-references and Subexpressions
  • Basic vs Extended Regular Expressions

Alternation is what we try to do, so let's try again with the -E flag:

In [11]:
!grep -E '(which|could)' data/100k
which
could

Much better. Now let's see if this could work with more alternative patterns instead of full words:

In [12]:
!grep -P '(su...|us...)' data/100k | head
using
users
super
suite
usage
sugar
susan
usual
sunny
sudan

and a little more like what we need (albeit less readable):

In [13]:
!grep -P '(su[^moe][^moe][^moe]|us[^moe][^moe][^moe])' data/100k | head
using
sugar
susan
usual
sunny
sudan
suits
sucks
supra
sushi

Great! now, that we're assured that regex (and grep) have what we need, we can get back to the task of finding the (somewhat constraint) combinations of U and S:

In [14]:
U = {0, 1, 3, 4}
S = {0, 1, 2, 4}

from itertools import product
combinations = [(u, s) for u, s in product(U, S) if u != s]
combinations
Out[14]:
[(0, 1),
 (0, 2),
 (0, 4),
 (1, 0),
 (1, 2),
 (1, 4),
 (3, 0),
 (3, 1),
 (3, 2),
 (3, 4),
 (4, 0),
 (4, 1),
 (4, 2)]

Let's insert U an S in these positions. Our more general hint is that none of the positions is any of M, O and E so we can start with that, and then override these defaults with U and S in the positions indicated by the combinations:

In [15]:
patterns = []
for u, s in combinations:
    pattern = ['[^moe]'] * 5
    pattern[u] = 'u'
    pattern[s] = 's'
    patterns.append(''.join(pattern))
patterns
Out[15]:
['us[^moe][^moe][^moe]',
 'u[^moe]s[^moe][^moe]',
 'u[^moe][^moe][^moe]s',
 'su[^moe][^moe][^moe]',
 '[^moe]us[^moe][^moe]',
 '[^moe]u[^moe][^moe]s',
 's[^moe][^moe]u[^moe]',
 '[^moe]s[^moe]u[^moe]',
 '[^moe][^moe]su[^moe]',
 '[^moe][^moe][^moe]us',
 's[^moe][^moe][^moe]u',
 '[^moe]s[^moe][^moe]u',
 '[^moe][^moe]s[^moe]u']

and now let's join them using the regex 'or' operator:

In [16]:
'|'.join(patterns)
Out[16]:
'us[^moe][^moe][^moe]|u[^moe]s[^moe][^moe]|u[^moe][^moe][^moe]s|su[^moe][^moe][^moe]|[^moe]us[^moe][^moe]|[^moe]u[^moe][^moe]s|s[^moe][^moe]u[^moe]|[^moe]s[^moe]u[^moe]|[^moe][^moe]su[^moe]|[^moe][^moe][^moe]us|s[^moe][^moe][^moe]u|[^moe]s[^moe][^moe]u|[^moe][^moe]s[^moe]u'

Lookin good. Let's test run it:

In [17]:
!grep -P '({"|".join(patterns)})' data/100k | head
using
pussy
units
funds
virus
sugar
susan
turns
usual
busty
In [18]:
!grep -P '({"|".join(patterns)})' data/100k | wc -l
235

Not bad. We're left with only 235 options after the first guess! Since this list is ordered by frequency, together with the assumption that obscure words are less likely to appear in the game (people will get frustrated if they're tasked with guessing a word they don't know), it's a pretty good bet to take the first word and put it in the game. Let's see what we get:

image.png

Let's summarize what new information we got now:

  1. I and N join the lettera-non-grata club.
  2. G joins U and S to the displaced nomand group
  3. U and S have even less possible locations

Let's update our knowledge of the world and repeat the steps above.

In [19]:
positions = {
    'u': {0, 1, 3, 4} - {0},
    's': {0, 1, 2, 4} - {1},
    'g': {0, 1, 2, 3}
}

combinations = [g for g in product(*positions.values()) if len(set(g)) == len(g)]

We had to change our filter to account for the case of an arbitrary group elements' count. Comparing the number of unique elements to the number of total elements would do the trick.

We can also upgrade our exclude pattern, by adding the yellow letters to their corresponding positions:

In [20]:
exclude = [['m', 'o', 'e', 'i', 'n'] for _ in range(5)]
for letter, potential_positions in positions.items():
    for non_position in {0, 1, 2, 3, 4} - potential_positions:
        exclude[non_position].append(letter)

exclude = [f"[^{''.join(p)}]" for p in exclude]
exclude
Out[20]:
['[^moeinu]', '[^moeins]', '[^moeinu]', '[^moeins]', '[^moeing]']
In [21]:
patterns = []
for u, s, g in combinations:
    pattern = exclude.copy()
    pattern[u] = 'u'
    pattern[s] = 's'
    pattern[g] = 'g'
    patterns.append(''.join(pattern))
patterns
Out[21]:
['sug[^moeins][^moeing]',
 'su[^moeinu]g[^moeing]',
 'gus[^moeins][^moeing]',
 '[^moeinu]usg[^moeing]',
 'gu[^moeinu][^moeins]s',
 '[^moeinu]ug[^moeins]s',
 '[^moeinu]u[^moeinu]gs',
 'sg[^moeinu]u[^moeing]',
 's[^moeins]gu[^moeing]',
 'g[^moeins]su[^moeing]',
 '[^moeinu]gsu[^moeing]',
 'g[^moeins][^moeinu]us',
 '[^moeinu]g[^moeinu]us',
 '[^moeinu][^moeins]gus',
 'sg[^moeinu][^moeins]u',
 's[^moeins]g[^moeins]u',
 's[^moeins][^moeinu]gu',
 'g[^moeins]s[^moeins]u',
 '[^moeinu]gs[^moeins]u',
 '[^moeinu][^moeins]sgu']
In [22]:
!grep -P '({"|".join(patterns)})' data/100k | nl
     1	sugar
     2	argus
     3	gurus
     4	gusts
     5	juggs
     6	gulls
     7	gurps
     8	gyrus
     9	gusta
    10	dysgu
    11	gusty
    12	suggs
    13	vagus

Ooh, only 13 words, most of which I've never heard of, so I'm pretty certain it's sugar. Let's try:

image.png

Sweet 🍯🐁 (<--- mouse using sugar)

Epilogue¶

Now we have all the necessary building blocks to make it fully automatic, but for that we'd need to define a format to communicate the feedback from the game and to create an interactive tool. To quote my favorite professor who knew just the right moment to stop:

this is left as an exercise to the reader....