Word Ladder Solver

The actual solver: Word Ladder Solver

Word ladders are a kind of puzzle.  You change one word into another by altering one letter at a time; all the intermediate words between the start word and target word must be valid dictionary words.  For example, you might change CAT into DOG using the steps: cat→cot→dot→dog.

The solver linked above and described here is a little different to some others:

  • You don’t have to download or install anything on your computer or device.
  • It finds not only the shortest solution but also a solution using common words.
  • It displays dictionary definitions for the words in the solutions it finds.

The solver is written in PHP – not the fastest language ever – and several people could be using it via the web at the same time, so it needs to have a pretty efficient algorithm.  The trick is to use a dictionary where all the links from each word have been worked out in advance.  For example, consider the word SOLVER: it’s stored in the dictionary along with links to the eight other words it can be changed into in one step:

SOLVER: salver, silver, soever, solder, soller, solved, solves, wolver.
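As a rough sketch of how links like these could be worked out in advance, the snippet below groups words by wildcard patterns (for example "s_lver") so that words differing by one letter end up in the same bucket. The function name findLinks and the tiny word list are illustrative assumptions; this is not the build program actually used for the solver's dictionary.

```php
<?php
// Hypothetical helper: map each word to the words that differ from it
// by exactly one letter. Assumes the word list has no duplicates.
function findLinks(array $words): array
{
    // Bucket words by pattern, e.g. "solver" goes into "_olver", "s_lver", ...
    $buckets = [];
    foreach ($words as $word) {
        for ($p = 0; $p < strlen($word); $p++) {
            $pattern = substr_replace($word, '_', $p, 1);
            $buckets[$pattern][] = $word;
        }
    }
    // Words sharing a bucket are one step apart.
    $links = array_fill_keys($words, []);
    foreach ($buckets as $bucket) {
        foreach ($bucket as $a) {
            foreach ($bucket as $b) {
                if ($a !== $b) {
                    $links[$a][] = $b;
                }
            }
        }
    }
    foreach ($links as &$list) {
        sort($list);
    }
    unset($list);
    return $links;
}

$links = findLinks(['solver', 'salver', 'silver', 'solder', 'solved', 'cat', 'cot']);
echo implode(', ', $links['solver']), "\n"; // salver, silver, solder, solved
```

Two words that differ in exactly one position share exactly one pattern, so no link is recorded twice.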

The dictionary

We can see that the dictionary contains some pretty obscure words, and we’d prefer to avoid using those words in our solutions unless they’re absolutely necessary; so each word has a ‘rareness’ score stored along with it.  ‘Silver’, ‘solved’ and ‘solves’ have a rareness of 1, ‘solder’ has a rareness of 4, ‘salver’ 200, ‘soever’ 20,000, ‘wolver’ 200,000 and ‘soller’ 2,000,001.  When searching for a common-word solution, the solver looks for a ladder of words with the lowest sum of rareness.  So four common words like ‘silver’ cost no more than the slightly less common ‘solder’, ten words as rare as ‘soever’ cost no more than one as rare as ‘wolver’, and so on.
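To make the arithmetic concrete, here is a small sketch that adds up the rareness scores quoted above. The helper name totalRareness is an assumption made for illustration, and the word lists are not real ladders; they only show how the sums compare.

```php
<?php
// Rareness scores quoted in the text above.
$rareness = [
    'silver' => 1, 'solved' => 1, 'solves' => 1, 'solder' => 4,
    'salver' => 200, 'soever' => 20000, 'wolver' => 200000,
];

// Hypothetical helper: total rareness of a list of words.
function totalRareness(array $words, array $rareness): int
{
    $sum = 0;
    foreach ($words as $word) {
        $sum += $rareness[$word];
    }
    return $sum;
}

$threeCommon = totalRareness(['silver', 'solved', 'solves'], $rareness);
$oneSolder   = totalRareness(['solder'], $rareness);
echo $threeCommon, ' vs ', $oneSolder, "\n"; // 3 vs 4
```

Three words with a rareness of 1 come to less than a single ‘solder’, so a search that minimises the sum prefers the longer route through common words.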

The other useful information to store in the dictionary for each word is a group number.  When the linked words in the dictionary were being found, the computer worked out how the words fell into different groups.  It’s obvious that a six-letter word can never be transformed into a seven-letter one using this simple kind of word ladder; less obvious is the fact that there are several groups of words for each word length.  All the words within one group can be changed (by a ladder of words) into any other word in that same group – but not to a word in a different group.  For example, ELBOWED can be transformed into UMPIRES or LOYALTY into MIRACLE, but none of those words can be changed into any word in the main group of seven-letter words, which includes ACCUSED and ZIPPING.

In case you’re interested, the CSW dictionary, which my program uses, has:

  • ‘Elbowed’ and ‘umpires’ in group 4786 along with sixty other words.
  • ‘Loyalty’ and ‘miracle’ in group 4787 along with sixty-three others.
  • The big group of connected seven-letter words is group 4788 and contains 12,512 words.

There’s a summary table of how the words are distributed in groups in this text file.
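The grouping idea can be sketched with a simple flood fill over the precomputed links. This is only an illustration under assumed names (assignGroups, a toy $links array); the original build program used its own, slower method, and its group numbers will differ from the ones produced here.

```php
<?php
// Hypothetical helper: give every word a group number so that two words
// share a number exactly when some ladder connects them.
// $links maps each word to the list of words one step away.
function assignGroups(array $links): array
{
    $group = [];
    $next = 1;
    foreach (array_keys($links) as $start) {
        if (isset($group[$start])) {
            continue; // already labelled by an earlier flood fill
        }
        $group[$start] = $next;
        $stack = [$start];
        while ($stack) {
            $word = array_pop($stack);
            foreach ($links[$word] as $other) {
                if (!isset($group[$other])) {
                    $group[$other] = $next;
                    $stack[] = $other;
                }
            }
        }
        $next++;
    }
    return $group;
}

// Toy data: four connected words and one isolated word.
$links = [
    'cat' => ['cot'],
    'cot' => ['cat', 'dot'],
    'dot' => ['cot', 'dog'],
    'dog' => ['dot'],
    'emu' => [],
];
$group = assignGroups($links);
var_dump($group['cat'] === $group['dog']); // bool(true): a ladder exists
var_dump($group['cat'] === $group['emu']); // bool(false): no ladder
```

With group numbers stored per word, checking whether a ladder exists at all becomes a single comparison before any search is run.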

The word in the dictionary with the most links is SAY, which can be transformed into forty-two different words in a single step.  For longer words, the champions are:

Length | Word | Definition | Number of linked words
4 | TATS | Someone who tats is making lace by hand | 40
5 | CARE | | 38
6 | BARKED | | 26
7 | BARKING | | 25
8 | SLATTERS | Someone who slatters is being careless or negligent | 15
9 | BATTERING | | 13
10 | SLATTERING | See Slatters above | 9
11 | MUSTINESSES | Varieties of stale odors | 10

The dictionary contains words up to fifteen letters long, but the longer words become increasingly uninteresting from a word ladder point of view.

The default dictionary used is the Collins Scrabble Words (CSW or SOWPODS) which is the dictionary used in tournament Scrabble in most countries except the USA, Thailand and Canada.  The USA tournament Scrabble dictionary is the Official Tournament and Club Word List (OWL) but this turns out to be merely a subset of the CSW, so the program is able to cater for that dictionary too by knowing which words are present in OWL and only using those words when the OWL dictionary is selected.

The program

The program checks that both the start word and the target word exist in its dictionary, and that both words are in the same group; if not, it displays an appropriate error message.  If both words are in the same group, the program then loads the necessary data for all the words in that group from its MySQL database.  It doesn’t have to load the actual words or their definitions at this stage; it only needs:

  • an index number (id) for each of the words in the group.
  • for each word id,  the ‘rareness’ score.
  • for each word id, the list of ids of the other words that it can be transformed into.

The first pass finds one of the shortest solutions. There may be many equally short solutions, but the search tree is built in a random order so that different solutions may be found for the same start and target words.  All the nodes at a particular depth in the search tree are examined before moving on to the next depth: if the shortest solution possible is, say, six steps then the program is bound to find and display a six-step solution.  If there are several different ways of solving the ladder in six steps, the program will use a random choice to find and display one of those solutions each time it is run.

The program builds the first depth of the search tree by visiting each node linked from the start node; it then visits all nodes linked from the first depth except those that have already been visited: these new nodes are ‘depth two’.  It then proceeds to visit all nodes (not already visited) that are linked from the depth two nodes… and so on until one of the nodes visited is the target word and a solution can be displayed.  As each node is visited a pointer is set in the node that points to its parent node.  The pointer is used both to allow the solution to be displayed and to keep track of which nodes have been visited before:  initially all the nodes’ pointers are initialized to -1, so if the program looks at a node and sees a pointer >= 0 it knows that node has been visited already.

The program uses a three element array to store each node (parent, rareness, links): the links are stored as a string with comma separated ids of linked nodes.  For example ABDUCT (id 19633) links to two other words ABDUCE (id 19632) and ADDUCT (id 19772) so the links element in the ABDUCT array would be “19632,19772”.  The nodes are stored in a node array, n[]  where the keys are the ids of the nodes and the values are instances of the three-element array described above (PHP uses associative arrays which can be confusing if you’re not used to them, but work well for this kind of task), so for ABDUCT (which has a rareness of 4), the information is stored in memory as:

n[19633] => (parent, 4, “19632,19772”)

Note that the n array is sparse: only the information belonging to the words in the same group as the start and target words is loaded from the database to memory.  With PHP’s associative arrays there might only be, say, a hundred elements in the n array, even though the keys are integers ranging up to 270132 (the highest id in the dictionary).
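As an illustration of that layout, the sketch below builds a small n array from rows shaped like a database result. The column names (id, rareness, links) are assumptions, since the real schema isn't shown; the rareness values for ABDUCE and ADDUCT are placeholders, and their links are cut down to ABDUCT only.

```php
<?php
// Hypothetical rows, shaped like what a database query might return.
// Only the ABDUCT row (id, rareness 4, links) comes from the text above.
$rows = [
    ['id' => 19632, 'rareness' => 1, 'links' => '19633'],        // ABDUCE (placeholder values)
    ['id' => 19633, 'rareness' => 4, 'links' => '19632,19772'],  // ABDUCT
    ['id' => 19772, 'rareness' => 1, 'links' => '19633'],        // ADDUCT (placeholder values)
];

$n = [];
foreach ($rows as $row) {
    // element 0: parent (-1 means not visited yet)
    // element 1: rareness score
    // element 2: comma-separated ids of linked words
    $n[(int) $row['id']] = [-1, (int) $row['rareness'], $row['links']];
}

print_r($n[19633]);
echo count($n), "\n"; // 3: only three elements exist despite the large keys
```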

Another array, toVisit[], holds the keys of the n array to be visited.  This is initialised with the key (id) of the start word.  We’re not interested in the keys of the toVisit array, only the values (which store keys to the n array).  As the search proceeds, the ids of nodes to visit are added to the toVisit array, and as each node is visited, it is removed from the array.  Here is the code for the first pass solver:

$toVisit = array($startWordId); 
$soln = false;
while (!$soln && count($toVisit)) {
    foreach ($toVisit as $k => $i) {
        $links = explode(',', $n[$i][2]);
        shuffle($links);
        foreach ($links as $link) {
            $j = intval($link);
            if (isset($n[$j]) && $n[$j][0] < 0) { // node exists and hasn't been seen
                $n[$j][0] = $i; // assign current node as parent of linked node
                $toVisit[] = $j; // add the linked node to the list of nodes to visit
            }
            if ($j == $targetWordId) { // we have a solution
                $soln = true;
                break;
            }				
        }
        unset($toVisit[$k]);		
    }	
}
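Once the target has been reached, the ladder can be read back by following the parent pointers. The sketch below assumes a toy n array as it might look after a search from CAT to DOG; the function name traceLadder and the ids are made up for illustration. In the listing above the start node still has a parent of -1 when its neighbours are examined, so it can be handed a parent itself; for that reason this sketch stops when it reaches the start id instead of looking for -1.

```php
<?php
// Toy node array after a search from CAT (id 1) to DOG (id 4).
// Each node is (parent, rareness, links), as described above.
$n = [
    1 => [2, 1, '2'],    // cat: start word, given a parent by its neighbour
    2 => [1, 1, '1,3'],  // cot: reached from cat
    3 => [2, 1, '2,4'],  // dot: reached from cot
    4 => [3, 1, '3'],    // dog: reached from dot
];
$words = [1 => 'cat', 2 => 'cot', 3 => 'dot', 4 => 'dog'];

// Hypothetical helper: follow parent pointers from the target back to
// the start, then reverse the list to get the ladder in reading order.
function traceLadder(array $n, int $startId, int $targetId): array
{
    $path = [$targetId];
    $id = $targetId;
    while ($id !== $startId) {
        $id = $n[$id][0];
        if ($id < 0 || count($path) > count($n)) {
            return []; // broken chain: no ladder was recorded
        }
        $path[] = $id;
    }
    return array_reverse($path);
}

$ids = traceLadder($n, 1, 4);
$ladder = array_map(function ($id) use ($words) {
    return $words[$id];
}, $ids);
echo implode('→', $ladder), "\n"; // cat→cot→dot→dog
```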

Having displayed the solution the program then does a second pass, this time taking into account the rareness of the words so that it finds a solution with the lowest possible sum of the rareness values of the words used.  Again there may be several alternative solutions with the same total rareness, so the program builds the search tree in a random order.  This time, each depth of the tree holds all the words that can be reached from the start word with the same total rareness cost.

Before the second pass, the parent element of all the n nodes is reset to -1, effectively cleaning the previous search result from the n array.  For the second pass, toVisit is an array of arrays: the keys in toVisit are the cost of reaching a particular set of words; the value associated with each key is an array containing the ids of that set of words.  Initially the only node to visit is the start node, with a cost of zero so that is stored as:

toVisit[0] => array(19633)

There will never be any other nodes with a cost of 0, but we might eventually find, say, five nodes with a cost of 87, which would be stored as:

toVisit[87] => array(21030, 20043, 29754, 19444, 19443)

Here’s the code for the second pass of the solver. As you can see, it’s just a slight variation on the code used for the first pass:

$toVisit = array(0 => array($startWordId));
$soln = false;
while (!$soln && count($toVisit)) {
    ksort($toVisit, SORT_NUMERIC); // sort array into key (costs) order
    $cost = key($toVisit); // cost of the lowest nodes to visit
    foreach ($toVisit[$cost] as $i) {
        $links = explode(',', $n[$i][2]);
        shuffle($links);
        foreach ($links as $link) {
            $j = intval($link);
            if (isset($n[$j]) && $n[$j][0] < 0) { // linked node hasn't been seen
                $newCost = $cost + $n[$j][1];
                $n[$j][0] = $i; // assign current node as parent of linked node
                if (isset($toVisit[$newCost]))  // already nodes with this cost
                    $toVisit[$newCost][] = $j; // add the new node to the array
                else
                    $toVisit[$newCost] = array($j); // new array for the new cost
            }
            if ($j == $targetWordId) { // we have a solution
                $soln = true;
                break;
            }				
        }
    }
    unset ($toVisit[$cost]); // all nodes of this cost have now been visited	
}
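To see the second pass at work, here is a self-contained sketch that wraps the same idea in a function and runs it on a toy group of six words. The function name commonWordLadder, the ids and the rareness of 200 given to ‘wet’ are all assumptions chosen so that the shortest ladder and the common-word ladder differ. It also departs from the listing above in two small ways: it marks the start node as visited, and it picks the lowest cost with min() instead of sorting.

```php
<?php
// Sketch of the second pass as a function (names are illustrative).
// $n maps id => [parent, rareness, links]; returns [total rareness, ids].
// Assumes every rareness is at least 1.
function commonWordLadder(array $n, int $startId, int $targetId): array
{
    $n[$startId][0] = $startId;             // deviation: mark start as visited
    $toVisit = [0 => [$startId]];
    while ($toVisit) {
        $cost = min(array_keys($toVisit));  // cheapest bucket first
        foreach ($toVisit[$cost] as $i) {
            $links = explode(',', $n[$i][2]);
            shuffle($links);
            foreach ($links as $link) {
                $j = intval($link);
                if (!isset($n[$j]) || $n[$j][0] >= 0) {
                    continue;               // unknown id or already seen
                }
                $n[$j][0] = $i;             // current node becomes the parent
                $newCost = $cost + $n[$j][1];
                $toVisit[$newCost][] = $j;
                if ($j === $targetId) {     // follow parents back to the start
                    $path = [$j];
                    while ($j !== $startId) {
                        $j = $n[$j][0];
                        $path[] = $j;
                    }
                    return [$newCost, array_reverse($path)];
                }
            }
        }
        unset($toVisit[$cost]);             // this bucket is finished
    }
    return [0, []]; // no ladder found (or start equals target)
}

$words = [1 => 'wit', 2 => 'wet', 3 => 'bit', 4 => 'bid', 5 => 'bed', 6 => 'wed'];
$n = [
    1 => [-1, 1, '2,3'],
    2 => [-1, 200, '1,6'],  // 'wet' is given a made-up rareness of 200
    3 => [-1, 1, '1,4'],
    4 => [-1, 1, '3,5'],
    5 => [-1, 1, '4,6'],
    6 => [-1, 1, '2,5'],
];
[$cost, $path] = commonWordLadder($n, 1, 6);
$ladder = array_map(function ($id) use ($words) {
    return $words[$id];
}, $path);
echo $cost, ': ', implode('→', $ladder), "\n"; // 4: wit→bit→bid→bed→wed
```

The shortest ladder here is wit→wet→wed, but it costs 201, so the search returns the longer ladder with a total of 4. Stopping as soon as the target is first reached is safe because the cost of stepping onto a word is that word's own rareness, whichever neighbour you come from, and the buckets are processed in increasing cost order; the first neighbour to reach a word is therefore also the cheapest.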

 

Comments

11 responses to “Word Ladder Solver”

  1. Edgar Coudal

    Most informative comments . . . thanks
    Edgar Coudal, Sarasota, FL

  2. edward

    what algorithm did u use

    1. ceptimus

      It’s really just a breadth first search with a randomizing element. It always searches breadth first, but within each ‘depth’ of the search it visits the nodes in a random order. This means that if there is more than one (say) five-link solution the algorithm returns any one of those solutions – and is likely to find different solutions when restarted with the same starting words.

  3. Roderick

    I love this! I’m adding this tool to my ‘Puzzle Makers Toolkit’ at http://www.enigami.fun Thanks! -Roderick Kimball

  4. Forest Kunecke

    I’m developing a mobile word ladder game and found your solver, which is way more efficient than my current algorithm. The grouping method in your database is very clever. Right now I’m generating ladders from a parsed out OED, and I have been struggling to tackle the rarity problem. Some ladders require very uncommon words, and I haven’t found a great datasource for rarity. Could you shed some light on how you constructed your database? Alternatively I’d be interested in simply licensing it, if that’s an option.

    1. ceptimus

      Thanks for the kind comments. My starting point for constructing the ‘rarity’ part of the database was SCOWL (Spell Checker Oriented Words List) http://wordlist.aspell.net/ Looking at my backups I wrote my code in 2013 (maybe earlier, but those are the timestamps on my backup files). At that time, SCOWL revision 7.1 was the most recent version, but I see it’s been updated several times since then, so it’s probably a good idea to start over with the current up-to-date version. I wrote a console program in C# (C Sharp) that parsed the OWL2 (Official Tournament and Club Word List) and CSW12 (Collins Scrabble Words) dictionaries into a large temporary text file, and then compared the entries against the SCOWL lists to assign rarity scores. Then it used a brute-force, fairly slow, algorithm (several hours of running time on my old Core Duo PC) to find the groups and populate the MySQL database. I forget the details, but I found this in the source code:

      1. Build stage1.txt from OWL2.txt
      2. Build stage2.txt by merging CSW12.txt into dictionary table
      3. Resolve cross-linked definitions
      4. Create copies of scowl-7.1 word lists with ‘S and duplicates removed
      5. Compare scowl lists with stage2 into stage3, assigning USA,BRIT,rareness
      6. Build links and groups
      7. Transfer results into MySQL dictionary

      That was the program’s menu – so I could run just one of those operations, examine the intermediate files produced, fix any bugs in my code and then re-run the bug-fixed stage, before moving to the next stage.

      I can’t run it right now because I’m using Linux, and the code was written for Microsoft Visual Studio under Windows. I don’t currently have a C# compiler set up.

  5. Faya

    I have tried this with some words like “greed” and “money” and it gives an output, but it also includes some words that aren’t actually words, like “Toney” and more. Is that how it is intended to work? I thought it would give an error if the word can’t be formed from real words.

    1. ceptimus

      They are all real words, though they may be rare words. Toney is a valid Scrabble word. That’s why it also generates a (usually longer) solution that tries to use only common words.

  6. Manish

    This is a great approach. I had studied Donald Knuth’s work on this and your handling of unconnected words is a great one.

    I had written a similar program but instead of taking your rareness approach, I experimented with smart breadth first search techniques which gave some good results. I too used PHP and the algorithm largely followed your logic. Just that I did not use any database whatsoever and used ‘awk’ to find the family to which a word belonged and then loaded the necessary files.

    Have wrapped it into an API and it can be tested/used from https://rapidapi.com/contactousapp-oO8YC-PlBj/api/word-ladder-builder

  7. Icculus

    I found a better solution to HIGH -> JUMP.
    Your algorithm:
    HIGH – HISH – WISH – WISP – WIMP – JIMP – JUMP
    My solution:
    HIGH – HUGH – HUGE – HUME – HUMP – JUMP

    Just wanted to let ya know.

    1. ceptimus

      Hume isn’t in either of the dictionaries that my Word Ladder Solver uses (the Scrabble dictionaries). HUME is not a valid Scrabble word.
