For debugging the algorithm, I'm using the wonderful "Wordle Archive" game. If you haven't checked it out yet, you can find it here: https://metzger.media/games/wordle-archive
The rules of the game is to guess the daily 5-letter English word. To start playing around with grep
, we'll need to first obtain a dataset of, well, 5-letter English words.
The first result of my Google query gets me to this list: https://github.com/first20hours/google-10000-english, which gets me the most frequent 10,000 English words with the added bonus of filtering out the swear words (I'm assuming it's a family-friendly game so I welcome the narrowed search-space)
url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears-medium.txt'
!mkdir -p data
!wget -q {url} -P data
For those of you who managed to escape the mania, here are the rules from cnet:
Wordle gives players six chances to guess a randomly selected five-letter word. If you have the right letter in the right spot, it shows up green. A correct letter in the wrong spot shows up yellow. A letter that isn't in the word in any spot shows up gray.
Let's run some guesses on the word of 1/23/2022 just to get something to start with.
Subsequent guesses here are not valid candidates (for example mouse is not a valid guess as we already know from the gray E in snake the E is not in the word), but that should be enough to build some intuition on how to solve it.
Our first guess SNAKE returned all gray, meaning none of S, N, A, K and E are in the word. In regex this is translated into negating a character class. Since there are 5 letters, we'd need to repeat this 5 times.
pattern='[^snake][^snake][^snake][^snake][^snake]'
!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | head
which would world group could right forum photo topic court
!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | wc -l
108
Moving on to the second line we get a few more gray letters and a yellow M, meaning M is in the word but not in the right place. Let's update the possible letters for each spot:
pattern='[^msnakeous][^snakeous][^snakeous][^snakeous][^snakeous]'
!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | head
which right light child third civil limit birth fight width
!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt' | wc -l
27
Nice. Without doing much we could narrow down to 27 words. Moving on to feedback we got from our guess on the third attempt, we get the following hints:
Let's update our negated character classes and fix R in the second spot:
pattern='[^msnakeoust]r[^snakeoust][^snakeousti][^snakeoust]'
!grep '^{pattern}$' 'data/google-10000-english-usa-no-swears-medium.txt'
grill drill
The word 'crimp' (the correct answer) is not there, so we'd need a bigger corpus
cat count_1w.txt | grep -Po "^[a-z]{5,5}(?=\t)" | head -100000 > 100k
Now let's try with incorporating the very revealing information of our lucky fourth guess:
pattern='crim[^snakeoust]'
!grep '^{pattern}$' data/100k
crimp
Now let's try today's word (1/25/2022), this time using our simple algorithm to assist us "guessing" it:
pattern='[^moe][^moe][^moeu][^moes][^moe]'
!grep '^{pattern}$' data/100k | wc -l
2945
!grep '^{pattern}$' data/100k | head
which click links right using black think shall still visit
These are a lot to sift through, but if we look carefully we notice that most of the suggested words do not contain U nor S, although we know for sure they exist in the word. It seems like we're not using all the information we got.
Let's look at U. It has to be at least once in any position that is not the 3rd. That's all we know for now. So the options for that would be:
U _ _ _ _
_ U _ _ _
_ _ _ U _
_ _ _ _ U
The same goes to S:
S _ _ _ _
_ S _ _ _
_ _ S _ _
_ _ _ _ S
To account for all the possible options, we'd need to combine these two hints.
One possible combination could be: U S _ _ _
. Even though S could have potentially taken the first spot, we can only choose one letter for that spot, so it's not entirely a cartesian product, but a subset of it.
Before we move on to computing the combinations, let's see how we could feed grep
with multiple alternative patterns:
!grep '(which|could)' data/100k
Argg that didn't work. This seemingly basic regex syntax feature is not supported by default. Let's RTFM and run man grep
. It reads:
grep understands three different versions of regular expression syntax: “basic” (BRE), “extended” (ERE) and “perl” (PCRE).
Reading a little more, we find out that the following features are not guaranteed to be available in the basic (default) form:
Alternation is what we try to do, so let's try again with the -E
flag:
!grep -E '(which|could)' data/100k
which could
Much better. Now let's see if this could work with more alternative patterns instead of full words:
!grep -P '(su...|us...)' data/100k | head
using users super suite usage sugar susan usual sunny sudan
and a little more like what we need (albeit less readable):
!grep -P '(su[^moe][^moe][^moe]|us[^moe][^moe][^moe])' data/100k | head
using sugar susan usual sunny sudan suits sucks supra sushi
Great! now, that we're assured that regex
(and grep
) have what we need, we can get back to the task of finding the (somewhat constraint) combinations of U and S:
U = {0, 1, 3, 4}
S = {0, 1, 2, 4}
from itertools import product
combinations = [(u, s) for u, s in product(U, S) if u != s]
combinations
[(0, 1), (0, 2), (0, 4), (1, 0), (1, 2), (1, 4), (3, 0), (3, 1), (3, 2), (3, 4), (4, 0), (4, 1), (4, 2)]
Let's insert U an S in these positions. Our more general hint is that none of the positions is any of M, O and E so we can start with that, and then override these defaults with U and S in the positions indicated by the combinations:
patterns = []
for u, s in combinations:
pattern = ['[^moe]'] * 5
pattern[u] = 'u'
pattern[s] = 's'
patterns.append(''.join(pattern))
patterns
['us[^moe][^moe][^moe]', 'u[^moe]s[^moe][^moe]', 'u[^moe][^moe][^moe]s', 'su[^moe][^moe][^moe]', '[^moe]us[^moe][^moe]', '[^moe]u[^moe][^moe]s', 's[^moe][^moe]u[^moe]', '[^moe]s[^moe]u[^moe]', '[^moe][^moe]su[^moe]', '[^moe][^moe][^moe]us', 's[^moe][^moe][^moe]u', '[^moe]s[^moe][^moe]u', '[^moe][^moe]s[^moe]u']
and now let's join them using the regex 'or' operator:
'|'.join(patterns)
'us[^moe][^moe][^moe]|u[^moe]s[^moe][^moe]|u[^moe][^moe][^moe]s|su[^moe][^moe][^moe]|[^moe]us[^moe][^moe]|[^moe]u[^moe][^moe]s|s[^moe][^moe]u[^moe]|[^moe]s[^moe]u[^moe]|[^moe][^moe]su[^moe]|[^moe][^moe][^moe]us|s[^moe][^moe][^moe]u|[^moe]s[^moe][^moe]u|[^moe][^moe]s[^moe]u'
Lookin good. Let's test run it:
!grep -P '({"|".join(patterns)})' data/100k | head
using pussy units funds virus sugar susan turns usual busty
!grep -P '({"|".join(patterns)})' data/100k | wc -l
235
Not bad. We're left with only 235 options after the first guess! Since this list is ordered by frequency, together with the assumption that obscure words are less likely to appear in the game (people will get frustrated if they're tasked with guessing a word they don't know), it's a pretty good bet to take the first word and put it in the game. Let's see what we get:
Let's summarize what new information we got now:
Let's update our knowledge of the world and repeat the steps above.
positions = {
'u': {0, 1, 3, 4} - {0},
's': {0, 1, 2, 4} - {1},
'g': {0, 1, 2, 3}
}
combinations = [g for g in product(*positions.values()) if len(set(g)) == len(g)]
We had to change our filter to account for the case of an arbitrary group elements' count. Comparing the number of unique elements to the number of total elements would do the trick.
We can also upgrade our exclude pattern, by adding the yellow letters to their corresponding positions:
exclude = [['m', 'o', 'e', 'i', 'n'] for _ in range(5)]
for letter, potential_positions in positions.items():
for non_position in {0, 1, 2, 3, 4} - potential_positions:
exclude[non_position].append(letter)
exclude = [f"[^{''.join(p)}]" for p in exclude]
exclude
['[^moeinu]', '[^moeins]', '[^moeinu]', '[^moeins]', '[^moeing]']
patterns = []
for u, s, g in combinations:
pattern = exclude.copy()
pattern[u] = 'u'
pattern[s] = 's'
pattern[g] = 'g'
patterns.append(''.join(pattern))
patterns
['sug[^moeins][^moeing]', 'su[^moeinu]g[^moeing]', 'gus[^moeins][^moeing]', '[^moeinu]usg[^moeing]', 'gu[^moeinu][^moeins]s', '[^moeinu]ug[^moeins]s', '[^moeinu]u[^moeinu]gs', 'sg[^moeinu]u[^moeing]', 's[^moeins]gu[^moeing]', 'g[^moeins]su[^moeing]', '[^moeinu]gsu[^moeing]', 'g[^moeins][^moeinu]us', '[^moeinu]g[^moeinu]us', '[^moeinu][^moeins]gus', 'sg[^moeinu][^moeins]u', 's[^moeins]g[^moeins]u', 's[^moeins][^moeinu]gu', 'g[^moeins]s[^moeins]u', '[^moeinu]gs[^moeins]u', '[^moeinu][^moeins]sgu']
!grep -P '({"|".join(patterns)})' data/100k | nl
1 sugar 2 argus 3 gurus 4 gusts 5 juggs 6 gulls 7 gurps 8 gyrus 9 gusta 10 dysgu 11 gusty 12 suggs 13 vagus
Ooh, only 13 words, most of which I've never heard of, so I'm pretty certain it's sugar. Let's try:
Sweet 🍯🐁 (<--- mouse using sugar)
Now we have all the necessary building blocks to make it fully automatic, but for that we'd need to define a format to communicate the feedback from the game and to create an interactive tool. To quote my favorite professor who knew just the right moment to stop:
this is left as an exercise to the reader....