3

Say I have a list of 100,000 words. I want to find out if a given string matches any words in that list, and I want to do it in the fastest way possible. Also I want to know if any other words, that are formed by starting with the first character, in that string appear in the list.

For example:

Say you have the string "icedtgg"

"i" "ic" "ice" "iced" "icedt" "icedtg" "icedtgg"

I am trying to come up with an optimal compare algorithm that tells me if the following words are in my list.

What I have so far is my list of 100,000 words are stored in a

Dicitonary<char, List<string>> WordList;

where char is the first character of the word, and the List<string> is all of the words that start with that character.

So WordList['a'] has a list of all words that start with 'a' ("ape", "apple", "art" etc.) and 'b' has a list of all words that start with b etc.

Since I know that all of my words start with 'i', I can first narrow my solution down from 100,000 words to just the words that start with 'i'.

List<string> CurrentWordList = WordList['i'];

Now I check

if( CurrentWordList[0].Length == 1 )

Then I know my first string is a match "i" because "i" will be the first word im the list. These lists are sorted alphabetically beforehand, so as not to slow down the matching.

Any ideas?

*No this is not a HW assigment, I am a profesionall Software Architect trying to find an optimal match algorithm for fun/hobby/game development.

20
  • List of string objects and then use LINQ and parallel LINQ functions Commented Jun 1, 2015 at 16:24
  • 1
    If your dictionary is in the format {key: {word_list} with the value for each key being a set of words you would have O(1) lookup time for each lookup providing there aren't too many collisions. Commented Jun 1, 2015 at 16:25
  • @Pseudonym could you be more specific, as I am familiar with LINQ, but I am trying to find the optimal solution, also I am doing this with Xamarin so I want the sln to be easily translated by the Xamarin compiler into Java (Android) and ObjectiveC (iPhone). Commented Jun 1, 2015 at 16:32
  • 4
    You may want to look into using specialized data structures and algorithms for this task: either prefix tries (Patricia Tries specifically) or the Aho–Corasick string matching algorithm refer to e.g. this question stackoverflow.com/questions/29778776/… Commented Jun 1, 2015 at 16:33
  • 1
    @MattyMerrix glad I was mildly helpful! Commented Jun 2, 2015 at 16:08

5 Answers 5

6

I decided to add this answer not because it is the optimal solution to your problem, but to illustrate two possible solutions that are relatively simple and that are somewhat in line with the approach you seem to be following yourself.

The (non-optimized) sample below provides an extremely simple prefix trie implementation, that uses a node per consumed character.

public class SimplePrefixTrie
{
    private readonly Node _root = new Node(); // root represents empty string.

    private class Node
    {
        public Dictionary<char, Node> Children;
        public bool IsTerminal; // whether a full word ends here.

        public Node Find(string word, int index)
        {
            var child = default(Node);
            if (index < word.Length && Children != null)
                Children.TryGetValue(word[index], out child);
            return child;
        }

        public Node Add(string word, int toConsume)
        {
            var child = default(Node);
            if (toConsume == word.Length)
                this.IsTerminal = true;
            else if (Children == null || !Children.TryGetValue(word[toConsume], out child))
            {
                if (Children == null)
                    Children = new Dictionary<char, Node>();
                Children[word[toConsume]] = child = new Node();
            }
            return child;
        }
    }

    public void AddWord(string word)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
            cur = cur.Add(word, ndx++);
    }

    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
        {
            if (cur.IsTerminal)
                yield return searchWord.Substring(0, ndx);
            cur = cur.Find(searchWord, ndx++);
        }
    }
}

A simple implementation of a compressed prefix trie is also added below. It follows an almost identical approach to the sample above, but stores shared prefix parts, instead of single characters. It splits nodes when an existing stored prefix becomes shared and needs to be split into two parts.

public class SimpleCompressedPrefixTrie
{
    private readonly Node _root = new Node();

    private class Node
    {
        private Dictionary<char, Node> _children;
        public string PrefixValue = string.Empty;
        public bool IsTerminal;

        public Node Add(string word, ref int startIndex)
        {
            var n = FindSharedPrefix(word, startIndex);
            startIndex += n;
            if (n == PrefixValue.Length) // full prefix match
            {
                if (startIndex == word.Length) // full match
                    IsTerminal = true;
                else
                    return AddToChild(word, ref startIndex);
            }
            else // partial match, need to split this node's prefix.
                SplittingAdd(word, n, ref startIndex);
            return null;
        }

        public Node Find(string word, ref int startIndex, out int matchLen)
        {
            var n = FindSharedPrefix(word, startIndex);
            startIndex += n;
            matchLen = -1;
            if (n == PrefixValue.Length)
            {
                if (IsTerminal)
                    matchLen = startIndex;
                var child = default(Node);
                if (_children != null && startIndex < word.Length && _children.TryGetValue(word[startIndex], out child))
                {
                    startIndex++; // consumed map key character.
                    return child;
                }
            }
            return null;
        }

        private Node AddToChild(string word, ref int startIndex)
        {
            var key = word[startIndex++]; // consume the mapping character
            var nextNode = default(Node);
            if (_children == null)
                _children = new Dictionary<char, Node>();
            else if (_children.TryGetValue(key, out nextNode))
                return nextNode;
            var remainder = word.Substring(startIndex);
            _children[key] = new Node() { PrefixValue = remainder, IsTerminal = true };
            return null; // consumed.
        }

        private void SplittingAdd(string word, int n, ref int startIndex)
        {
            var curChildren = _children;
            _children = new Dictionary<char, Node>();
            _children[PrefixValue[n]] = new Node()
            {
                PrefixValue = this.PrefixValue.Substring(n + 1),
                IsTerminal = this.IsTerminal,
                _children = curChildren
            };
            PrefixValue = PrefixValue.Substring(0, n);
            IsTerminal = startIndex == word.Length;
            if (!IsTerminal)
            {
                var prefix = word.Length > startIndex + 1 ? word.Substring(startIndex + 1) : string.Empty;
                _children[word[startIndex]] = new Node() { PrefixValue = prefix, IsTerminal = true };
                startIndex++;
            }
        }

        private int FindSharedPrefix(string word, int startIndex)
        {
            var n = Math.Min(PrefixValue.Length, word.Length - startIndex);
            var len = 0;
            while (len < n && PrefixValue[len] == word[len + startIndex])
                len++;
            return len;
        }
    }

    public void AddWord(string word)
    {
        var ndx = 0;
        var cur = _root;
        while (cur != null)
            cur = cur.Add(word, ref ndx);
    }

    public IEnumerable<string> FindWordsMatchingPrefixesOf(string searchWord)
    {
        var startNdx = 0;
        var cur = _root;
        while (cur != null)
        {
            var matchLen = 0;
            cur = cur.Find(searchWord, ref startNdx, out matchLen);
            if (matchLen > 0)
                yield return searchWord.Substring(0, matchLen);
        };
    }
}

Usage examples:

var trie = new SimplePrefixTrie(); // or new SimpleCompressedPrefixTrie();
trie.AddWord("hello");
trie.AddWord("iced");
trie.AddWord("i");
trie.AddWord("ice");
trie.AddWord("icecone");
trie.AddWord("dtgg");
trie.AddWord("hicet");
foreach (var w in trie.FindWordsMatchingPrefixesOf("icedtgg"))
    Console.WriteLine(w);

With output:

i
ice
iced

UPDATE: Selecting the right data structure matters

I think an update could provide some value to illustrate how selecting a data structure that fits the problem well is important and what kinds of trade-offs are involved. Therefore I created a small benchmark application that tests the strategies in the answers provided to this question so far, versus a baseline reference implementation.

  • Naive: Is the simplest possible naive solution.
  • JimMischel: Is based on the approach from this answer.
  • MattyMerrix: Is based on your own answer here.
  • JimMattyDSL: Combines the 'JimMischel' and 'MattyMerrix' approaches and uses a more optimal binary string search in the sorted list.
  • SimpleTrie and CompessedTrie are based on the two implementations described in this answer.

The full benchmark code can be found in this gist. The results of running it with dictionaries of 10,000, 100,000, and 1,000,000 (randomly generated character sequence) words and searching for all prefix matches of 5,000 terms are:

Matching 5000 words to dictionary of 10000 terms of max length 25

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
        Naive          0.64-0.64, 0.64     0.001-0.002, 0.001     6.136-6.312, 6.210
   JimMischel          0.84-0.84, 0.84     0.013-0.018, 0.016     0.083-0.113, 0.102
  JimMattyDSL          0.80-0.81, 0.80     0.013-0.018, 0.016     0.008-0.011, 0.010
   SimpleTrie       24.55-24.56, 24.56     0.042-0.056, 0.051     0.002-0.002, 0.002
CompessedTrie          1.84-1.84, 1.84     0.003-0.003, 0.003     0.002-0.002, 0.002
  MattyMerrix          0.83-0.83, 0.83     0.017-0.017, 0.017     0.034-0.034, 0.034

Matching 5000 words to dictionary of 100000 terms of max length 25

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
        Naive          6.01-6.01, 6.01     0.024-0.026, 0.025  65.651-65.758, 65.715
   JimMischel          6.32-6.32, 6.32     0.232-0.236, 0.233     1.208-1.254, 1.235
  JimMattyDSL          5.95-5.96, 5.96     0.264-0.269, 0.266     0.050-0.052, 0.051
   SimpleTrie    226.49-226.49, 226.49     0.932-0.962, 0.951     0.004-0.004, 0.004
CompessedTrie       16.10-16.10, 16.10     0.101-0.126, 0.111     0.003-0.003, 0.003
  MattyMerrix          6.15-6.15, 6.15     0.254-0.269, 0.259     0.414-0.418, 0.416

Matching 5000 words to dictionary of 1000000 terms of max length 25

       Method              Memory (MB)         Build Time (s)        Lookup Time (s)
   JimMischel       57.69-57.69, 57.69     3.027-3.086, 3.052  16.341-16.415, 16.373
  JimMattyDSL       60.88-60.88, 60.88     3.396-3.484, 3.453     0.399-0.400, 0.399
   SimpleTrie 2124.57-2124.57, 2124.57  11.622-11.989, 11.860     0.006-0.006, 0.006
CompessedTrie    166.59-166.59, 166.59     2.813-2.832, 2.823     0.005-0.005, 0.005
  MattyMerrix       62.71-62.73, 62.72     3.230-3.270, 3.251     6.996-7.015, 7.008

As you can see, memory required for the (non-space optimized) tries is substantially higher. It increases by the size of the dictionary, O(N) for all of the tested implementations.

As expected, lookup time for the tries is more or less constant: O(k), dependent on the length of the search terms only. For the other implementations, time will increase based on the size of the dictionary to be searched.

Note that far more optimal implementations for this problem can be constructed, that will be close to O(k) for search time and allow a more compact storage and reduced memory footprint. If you map to a reduced alphabet (e.g. 'A'-'Z' only), this is also something that can be taken advantage of.

Sign up to request clarification or add additional context in comments.

9 Comments

Alex Thank You for the answer, this is a different approach then mine, I see you are using Nodes and what looks to be a type of Linked List implementation. I added a new, much shorter answer below, utilizing Linq, I would be curious as to which Answer would be quicker in a speed test. Take a look and let me know what you think, I can provide a word set, if you want to run speed tests!!! :)
@MattyMerrix I updated the answer with a benchmark reference, its results and some explanations.
Nice summary. Build time is expensive for mine because of that sort. Lookup time gets expensive because of all the Substring calls and the resulting garbage collection, and because I'm doing full string comparisons rather than just checking the last character. Can't do anything about the sort, but I could speed the lookup considerably.
Wow! Thanks Alex, this is really cool. As we can see the trie methods can be unbelievably fast, but can use a lot of memory for some of the larger data sets. Both Jim and I beat the simplest Naive approach by far. And ours are pretty similar for memory and startup time. I know LINQ is pretty awesome, and has come cool features like late time binding and others.It would be interesting to really put LINQ to the test and test against a fully optimized char comparison algorithm, although Im sure someone has already done that. This was cool. For now, Im satisfied with our results. Cheers!
@JimMischel, I just checked and using your approach combined with Matty's (a dictionary keyed by first character, with a sorted list of postfixes as the value) and a string compare that allows comparing substrings; string.Compare(strA, indexA, strB, indexB, length) to prevent extra allocations is much faster (0.075 seconds for the 100,000 words). I imagine a trie with "burst nodes" as leaf nodes that contain a sorted postfix list would do quite well in terms of both speed and memory usage.
|
2

So you just want to find the words in the dictionary that are prefixes of the input string? You can do this much more efficiently than any of the methods proposed. It's really just a modified merge.

If your word list consists of a dictionary keyed by first letter, with each entry containing a sorted list of words that begin with that letter, then this will do it. Worst case is O(n + m), where n is the number of words that start with the letter, and m is the length of the input string.

var inputString = "icegdt";
// get list of words that start with the first character
var wordsList = MyDictionary[input_string[0]];

// find all words that are prefixes of the input string
var iInput = 0;
var iWords = 0;
var prefix = inputString.Substring(0, iInput+1);
while (iInput < inputString.Length && iWords < wordsList.Count)
{
    if (wordsList[iWords] == prefix)
    {
        // wordsList[iWords] is found!
        ++iWords;
    }
    else if (wordsList[iWords] > prefix)
    {
        // The current word is alphabetically after the prefix.
        // So we need the next character.
        ++iInput;
        if (iInput < inputString.Length)
        {
            prefix = inputString.Substring(0, iInput+1);
        }
    }
    else
    {
        // The prefix is alphabetically after the current word.
        // Advance the current word.
        ++iWord;
    }
}

If this is all you want to do (find dictionary words that are prefixes of the input string), then there's no particular reason for your dictionary indexed by first character. Given a sorted list of words, you could do a binary search on the first letter to find the starting point. That would take slightly more time than the dictionary lookup, but the time difference would be very small compared to the time spent searching the word list for matches. In addition, the sorted word list would take less memory than the dictionary approach.

If you want to do case-insensitive comparisons, change the comparison code to:

    var result = String.Compare(wordsList[iWords], prefix, true);
    if (result == 0)
    {
        // wordsList[iWords] is found!
        ++iWords;
    }
    else if (result > 0)
    {

That also reduces the number of string comparisons per iteration to exactly one per iteration.

2 Comments

Jim at first glance this seems pretty good! You make a lot of good points about just checking if a word is == to the current word or comes after it since the list is sorted alphabetically. I think I was a little afraid to use the > operator for comparing strings because I wasnt sure how many steps this was compiled to in Assembly Language. I want to compare the two methods with speed tests when I get a free second, possibly tomorrow. Thanks for your input! You are brilliant.
Calling an optimized string comparison function isn't going to take any longer than writing a bunch of code that compares the strings character-by-character, which is essentially what your LINQ code does.
0
while (x < str.Length-1)
{
    if (ChrW(10) == GetChar(str, x) && ChrW(13) == GetChar(str, x+1))
     {
       // x+2 - This new line
     }
   x++;
}

Comments

0

Here is my first go at it, wanted to get this out there in case I cant finish it today.

 public class CompareHelper
 {
    //Should always be sorted in alphabetical order.
    public static Dictionary<char, List<string>> MyDictionary;
    public static List<string> CurrentWordList;
    public static List<string> MatchedWordList;

    //The word we are trying to find matches for.
    public static char InitChar;
    public static StringBuilder ThisWord;

    /// <summary>
    /// Initialize the Compare.  Set the first character.  See if there are any 1 letter words
    /// for that character.
    /// </summary>
    /// <param name="firstChar">The first character in the word string.</param>
    /// <returns>True if a word was found.</returns>
    public static bool InitCompare(char firstChar)
    {
        InitChar = firstChar;
        //Get all words that start with the firstChar.
        CurrentWordList = MyDictionary[InitChar];
        ThisWord = new StringBuilder();
        ThisWord.Append(firstChar);

        if (CurrentWordList[0].Length == 1)
        {
            //Match.
            return true;
        }
        //No matches.
        return false;
    }

    /// <summary>
    /// Append this letter to our ThisWord.  See if there are any matching words.
    /// </summary>
    /// <param name="nextChar">The next character in the word string.</param>
    /// <returns>True if a word was found.</returns>
    public static bool NextCompare(char nextChar)
    {
        ThisWord.Append(nextChar);
        int currentIndex = ThisWord.Length - 1;
        if (FindRemainingWords(nextChar, currentIndex))
        {
            if (CurrentWordList[0].Length == currentIndex)
            {
                //Match.
                return true;
            }
        }
        //No matches.
        return false;
    }

    /// <summary>
    /// Trim down our CurrentWordList until it only contains words
    /// that at currIndex start with the currChar.
    /// </summary>
    /// <param name="currChar">The next letter in our ThisWord.</param>
    /// <param name="currIndex">The index of the letter.</param>
    /// <returns>True if there are words remaining in CurrentWordList.</returns>
    private static bool FindRemainingWords(char currChar, int currIndex)
    {
        //Null check.
        if (CurrentWordList == null || CurrentWordList.Count < 1)
        {
            return false;
        }

        bool doneSearching = false;
        while(!doneSearching)
        {
            int middleIndex = CurrentWordList.Count / 2;

            //TODO: test for CurrentWordList.count 2 or 1 ...

            //TODO: test for wordToCheck.length < curr index

            char middleLetter = CurrentWordList[middleIndex][currIndex];


            LetterPositionEnum returnEnum = GetLetterPosition(currChar, middleLetter);
            switch(returnEnum)
            {
                case LetterPositionEnum.Before:
                    CurrentWordList = CurrentWordList.GetRange(middleIndex, (CurrentWordList.Count - middleIndex));
                    break;
                case LetterPositionEnum.PREV:
                    CurrentWordList = CurrentWordList.GetRange(middleIndex, (CurrentWordList.Count - middleIndex));

                    break;
                case LetterPositionEnum.MATCH:
                    CurrentWordList = CurrentWordList.GetRange(middleIndex, (CurrentWordList.Count - middleIndex));

                    break;
                case LetterPositionEnum.NEXT:
                    CurrentWordList = CurrentWordList.GetRange(0, middleIndex);

                    break;
                case LetterPositionEnum.After:
                    CurrentWordList = CurrentWordList.GetRange(0, middleIndex);

                    break;
                default:
                    break;
            }
        }

        TrimWords(currChar, currIndex);

        //Null check.
        if (CurrentWordList == null || CurrentWordList.Count < 1)
        {
            return false;
        }

        //There are still words left in CurrentWordList.
        return true;
    }

    //Trim all words in CurrentWordList 
    //that are LetterPositionEnum.PREV and LetterPositionEnum.NEXT
    private static void TrimWords(char currChar, int currIndex)
    {
        int startIndex = 0;
        int endIndex = CurrentWordList.Count;
        bool startIndexFound = false;

        //Loop through all of the words.
        for ( int i = startIndex; i < endIndex; i++)
        {
            //If we havent found the start index then the first match of currChar
            //will be the start index.
             if( !startIndexFound &&  currChar == CurrentWordList[i][currIndex] )
            {
                startIndex = i;
                startIndexFound = true;
            }

             //If we have found the start index then the next letter that isnt 
             //currChar will be the end index.
             if( startIndexFound && currChar != CurrentWordList[i][currIndex])
            {
                endIndex = i;
                break;
            }
        }

        //Trim the words that dont start with currChar.
        CurrentWordList = CurrentWordList.GetRange(startIndex, endIndex);
    }


    //In order to find all words that begin with a given character, we should search
    //for the last word that begins with the previous character (PREV) and the 
    //first word that begins with the next character (NEXT).
    //Anything else Before or After that is trash and we will throw out.
    public enum LetterPositionEnum
    {
        Before,
        PREV,
        MATCH,
        NEXT,
        After
    };

    //We want to ignore all letters that come before this one.
    public static LetterPositionEnum GetLetterPosition(char currChar, char compareLetter)
    {
        switch (currChar)
        {
            case 'A':
                switch (compareLetter)
                {
                    case 'A': return LetterPositionEnum.MATCH;
                    case 'B': return LetterPositionEnum.NEXT;
                    case 'C': return LetterPositionEnum.After;
                    case 'D': return LetterPositionEnum.After;
                    case 'E': return LetterPositionEnum.After;
                    case 'F': return LetterPositionEnum.After;
                    case 'G': return LetterPositionEnum.After;
                    case 'H': return LetterPositionEnum.After;
                    case 'I': return LetterPositionEnum.After;
                    case 'J': return LetterPositionEnum.After;
                    case 'K': return LetterPositionEnum.After;
                    case 'L': return LetterPositionEnum.After;
                    case 'M': return LetterPositionEnum.After;
                    case 'N': return LetterPositionEnum.After;
                    case 'O': return LetterPositionEnum.After;
                    case 'P': return LetterPositionEnum.After;
                    case 'Q': return LetterPositionEnum.After;
                    case 'R': return LetterPositionEnum.After;
                    case 'S': return LetterPositionEnum.After;
                    case 'T': return LetterPositionEnum.After;
                    case 'U': return LetterPositionEnum.After;
                    case 'V': return LetterPositionEnum.After;
                    case 'W': return LetterPositionEnum.After;
                    case 'X': return LetterPositionEnum.After;
                    case 'Y': return LetterPositionEnum.After;
                    case 'Z': return LetterPositionEnum.After;
                    default: return LetterPositionEnum.After;
                }
            case 'B':
                switch (compareLetter)
                {
                    case 'A': return LetterPositionEnum.PREV;
                    case 'B': return LetterPositionEnum.MATCH;
                    case 'C': return LetterPositionEnum.NEXT;
                    case 'D': return LetterPositionEnum.After;
                    case 'E': return LetterPositionEnum.After;
                    case 'F': return LetterPositionEnum.After;
                    case 'G': return LetterPositionEnum.After;
                    case 'H': return LetterPositionEnum.After;
                    case 'I': return LetterPositionEnum.After;
                    case 'J': return LetterPositionEnum.After;
                    case 'K': return LetterPositionEnum.After;
                    case 'L': return LetterPositionEnum.After;
                    case 'M': return LetterPositionEnum.After;
                    case 'N': return LetterPositionEnum.After;
                    case 'O': return LetterPositionEnum.After;
                    case 'P': return LetterPositionEnum.After;
                    case 'Q': return LetterPositionEnum.After;
                    case 'R': return LetterPositionEnum.After;
                    case 'S': return LetterPositionEnum.After;
                    case 'T': return LetterPositionEnum.After;
                    case 'U': return LetterPositionEnum.After;
                    case 'V': return LetterPositionEnum.After;
                    case 'W': return LetterPositionEnum.After;
                    case 'X': return LetterPositionEnum.After;
                    case 'Y': return LetterPositionEnum.After;
                    case 'Z': return LetterPositionEnum.After;
                    default: return LetterPositionEnum.After;
                }
            case 'C':
                switch (compareLetter)
                {
                    case 'A': return LetterPositionEnum.Before;
                    case 'B': return LetterPositionEnum.PREV;
                    case 'C': return LetterPositionEnum.MATCH;
                    case 'D': return LetterPositionEnum.NEXT;
                    case 'E': return LetterPositionEnum.After;
                    case 'F': return LetterPositionEnum.After;
                    case 'G': return LetterPositionEnum.After;
                    case 'H': return LetterPositionEnum.After;
                    case 'I': return LetterPositionEnum.After;
                    case 'J': return LetterPositionEnum.After;
                    case 'K': return LetterPositionEnum.After;
                    case 'L': return LetterPositionEnum.After;
                    case 'M': return LetterPositionEnum.After;
                    case 'N': return LetterPositionEnum.After;
                    case 'O': return LetterPositionEnum.After;
                    case 'P': return LetterPositionEnum.After;
                    case 'Q': return LetterPositionEnum.After;
                    case 'R': return LetterPositionEnum.After;
                    case 'S': return LetterPositionEnum.After;
                    case 'T': return LetterPositionEnum.After;
                    case 'U': return LetterPositionEnum.After;
                    case 'V': return LetterPositionEnum.After;
                    case 'W': return LetterPositionEnum.After;
                    case 'X': return LetterPositionEnum.After;
                    case 'Y': return LetterPositionEnum.After;
                    case 'Z': return LetterPositionEnum.After;
                    default: return LetterPositionEnum.After;
                }
//etc.  Stack Overflow limits characters to 30,000 contact me for full switch case.

   default: return LetterPositionEnum.After;
        }
    }
}

2 Comments

Why would you want to invent something that will not work as well as an existing solution? You've already been given multiple approaches that are known to work. It's unclear to me if your code will work. In addition, your huge nested switch statement can be be rolled up into some simple code: int diff = currChar - compareLetter; followed by if (diff == 1) return LetterPositionEnum.PREV; if (diff == -1) return LetterPositionEnum.NEXT; return LetterPositionEnum.After;
Matty, I did not fully inspect the code in your answer, but as @JimMischel already mentioned, some of the standard algorithms and data structures may be better suited to this task. To provide you with some more concrete information, I just added an answer that illustrates solving your question with two different prefix trie implementations that somewhat resemble the approach you seem to be following, and thus may be easier for you to explore and get started with.
0

Ok here is the final solution I came up with, I am not sure if it is Optimal Optimal, but seems to be pretty darn fast and I like the logic and love the brevity of code.

Basically on App start up you pass in a List of words of any length to InitWords. This will sort the words and place them into a Dicitonary that has 26 keys, one for each Letter in the alphabet.

Then during play, you will iterate through the character set, always starting with the first letter and then the first and second letter and so on. The whole time you are trimming down the number of words in your CurrentWordList.

So if you have the string 'icedgt'. You would call InitCompare with 'i', this would grab the KeyValuePair with Key 'I' from MyDictionary, then you will see if the first word is of length 1 since they are already in alphabetic order, the word 'I' would be the first word. Then on your next iteration you pass in 'c' to NextCompare, this again reduces the List size by using Linq to only return words that have a second char of 'c'. Then next you would do another NextCompare and pass in 'e', again reducing the number of words in CurrentWordList using Linq.

So after the first iteration your CurrentWordList has every word that starts with 'i', on the NextCompare you will have every word that starts with 'ic' and on the NextCompare you will have a subset of that where every word starts with 'ice' and so on.

I am not sure if Linq would have beat my manual gigantic Switch Case in terms of speed, but it is simple and elegant. And for that I am happy.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Xuzzle.Code
{
public class CompareHelper
{
    //Should always be sorted in alphabetical order.
    public static Dictionary<char, List<string>> MyDictionary;
    public static List<string> CurrentWordList;

    //The word we are trying to find matches for.
    public static char InitChar;
    public static StringBuilder ThisWord;

    /// <summary>
    /// Init MyDictionary with the list of words passed in.  Make a new
    /// key value pair with each Letter.
    /// </summary>
    /// <param name="listOfWords"></param>
    public static void InitWords(List<string> listOfWords)
    {
        MyDictionary = new Dictionary<char, List<string>>();
        foreach (char currChar in LetterHelper.Alphabet)
        {
            var wordsParsed = listOfWords.Where(currWord => char.ToUpper(currWord[0]) == currChar).ToArray();
            Array.Sort(wordsParsed);
            MyDictionary.Add(currChar, wordsParsed.ToList());
        }
    }

    /// <summary>
    /// Initialize the Compare.  Set the first character.  See if there are any 1 letter words
    /// for that character.
    /// </summary>
    /// <param name="firstChar">The first character in the word string.</param>
    /// <returns>True if a word was found.</returns>
    public static bool InitCompare(char firstChar)
    {
        InitChar = firstChar;
        //Get all words that start with the firstChar.
        CurrentWordList = MyDictionary[InitChar];
        ThisWord = new StringBuilder();
        ThisWord.Append(firstChar);

        if (CurrentWordList[0].Length == 1)
        {
            //Match.
            return true;
        }
        //No matches.
        return false;
    }

    /// <summary>
    /// Append this letter to our ThisWord.  See if there are any matching words.
    /// </summary>
    /// <param name="nextChar">The next character in the word string.</param>
    /// <returns>True if a word was found.</returns>
    public static bool NextCompare(char nextChar)
    {
        ThisWord.Append(nextChar);
        int currentIndex = ThisWord.Length - 1;
        if (CurrentWordList != null && CurrentWordList.Count > 0)
        {
            CurrentWordList = CurrentWordList.Where(word => (word.Length > currentIndex && word[currentIndex] == nextChar)).ToList();
            if (CurrentWordList != null && CurrentWordList.Count > 0)
            {
                if (CurrentWordList[0].Length == ThisWord.Length)
                {
                    //Match.
                    return true;
                }
            }
        }
        //No matches.
        return false;
    }
}
}

7 Comments

But what happens if the string you're given is, for example, "dice"? Would your algorithm find the words "dice", "i", and "ice"? I guess the question is whether you're looking for all words in the given string, or are you just looking for words that are at the front of the given string?
As written, it appears that, given the string "icedgt", your algorithm will iterate over the "i" list once for each character in the given string (6 times in this case) to find the possible matching words. That's the selection of your CurrentWordList. This makes the algorithm O(n * m), where n is the length of the given string, and m is the number of words that start with "i". The Aho-Corasick algorithm, on the other hand, is O(n + z), where n is the length of the given string, and z is the number of matches found.
@JimMischel dice would not be i or ice. You must start with the first character.
@JimMischel I am not sure your second comment is correct. Each time you iterate over the list the list shrinks in size. For example, I have a list of 10,000 words of those 1,000 start with the letter i. So for the first word'i' I am not using hardly any cpu cycles since I am just grabbing the 'i' key from my dicitonary. Next I grab out of those 1,000 words that begin with i the words that begin with ic, so now my list of words to check is down to 100.
@JimMischel next i check how many words begin with ice, knowing that if a word does not begin with ic then it cannot begin with ice, so i only have to check the remaining 100 words for words beginning with ice. so out of the 10,000 words, 1,000 begin with i, of those 100 begin with ic, and i then find out that only 10 begin with ice. My next iteration is incredibly quick because i am only checking 10 words. Out of these 1 word iced begins with iced. Now all of the next iterations will be almost instantaneous since I have no words left to check.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.