4

I have a list of about 50 keywords and about 50000 strings. I check every string if it contains at least one of the keywords. I'm not interested in the matched keyword or the number of matched keywords. I only want a "true" or "false" back, as fast as possible.

So, I bet there's an algorithm out there that outperforms my current LINQ version by far:

class MyEnumerableExtension
{
    public static bool ContainsAny(this string searchString, IEnumerable<string> keywords)
    {
        return keywords.Any(keyword => searchString.Contains(keyword))
    }
}

bool foundAny = "abcdef".ContainsAny(new string[] { "ac", "bd", "cd" } );

4 Answers 4

1

isn't this in essence the same as your other question of today Efficient algorithm for finding all keywords in a text except modified to return once a match has been found?

Sign up to request clarification or add additional context in comments.

3 Comments

No it isn't. I have two separate concerns. One is finding all string containing any keyword in a list of given keywords. The other is tokenizing these found strings by using another keyword list. These lists are different and have different purposes.
OK, but the solution is the same algorthm in both places (altered in this case to return once one match is found)?
Whoops, I should have read until the end. I think you're right, I could modify the algorithm to return after a single keyword is found. Since I would have to build the keyword tree only once this should be a very fast solution.
0

There are multiple algorithms to search a text for a set of substrings.

Comments

0

Yo can implement Knuth-Morris-Pratt algorithm.

1 Comment

That searches for one word. There are better algorithms for searching for any of a set. See Wikipedia en.wikipedia.org/wiki/…
0

A quick analysis shows that you are iteratively searching for the keywords. If you could search in one pass for all fo the keywords, you should have an overall improvement in your algorithm. A Regex expression can do that and couple it with the "Compiled" option and you should begin to see a performance increase (because it will single pass the string for all keywords). But, it would only benefit you if you have several keywords. Here's a quick idea to help you along, but note, I have not actually tested the performance against your algorithm.

        string[] keywords = { "ac", "bd", "cd" };
        string[] tosearch = { "abcdef" };
        string pattern = String.Join("|", keywords);
        Regex regex = new Regex(pattern, RegexOptions.Compiled);
        foundAny = regex.IsMatch(String.Join("|", tosearch));

Also note, this works as long as your keywords do not contain any Regex special characters (and your search strings do not contain the pipe symbol. However, the special characters could be overcome with escape sequences, and the search strings do not have to be joined as I've done.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.