I'm trying to check the approximate similarity of strings.

Here are the criteria I use:

1) The order of the words is important.
2) Words only need to be 80% similar to count as a match.

Example:

$string1 = "How much will it cost to me"; // string in the vocabulary (all the "right" words are here)
$string2 = "How much does costs it ";     // "costs" instead of "cost" is a deliberate mistake (user input)

Algorithm:

1) Check the similarity of the words and build a clean string of the "right" words, in the order they appear in the vocabulary. OUTPUT: "how much it cost"
2) Build a clean string of the "right" words in the order they appear in the user input. OUTPUT: "how much cost it"
3) Compare the two outputs: if they differ, return no; if they are the same, return yes.
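
The word-level check behind step 1 might look roughly like the sketch below. It assumes PHP's built-in similar_text() as the similarity measure and the 80% threshold from the criteria above; the helper name wordsMatch() is hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: true when two words are at least $threshold percent
// similar according to PHP's built-in similar_text().
function wordsMatch(string $a, string $b, float $threshold = 80.0): bool
{
    // similar_text() writes the similarity percentage into $percent.
    similar_text(strtolower($a), strtolower($b), $percent);
    return $percent >= $threshold;
}

var_dump(wordsMatch("cost", "costs")); // 4 matching chars * 2 / 9 total chars = 88.9% -> true
var_dump(wordsMatch("will", "does"));  // no characters in common -> false
```

Comparing lowercased words is a design choice here, so that "How" in the input still matches "how" in the vocabulary.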

Any suggestions?

I started to write the code, but I'm not familiar with the tools available in PHP, so I don't know how to do this cleanly and efficiently.

So far it looks more like a JavaScript/PHP mix:

$string1 = "how much will it cost for me";
$string2 = "how much does costs it";

function compareStrings($s1, $s2) {
    // Collapse runs of whitespace and trim the ends.
    $s1 = trim(preg_replace('/\s+/', ' ', $s1));
    $s2 = trim(preg_replace('/\s+/', ' ', $s2));

    if (strlen($s1) == 0 || strlen($s2) == 0) {
        return false;
    }

    $ar1 = explode(" ", $s1); // vocabulary words
    $ar2 = explode(" ", $s2); // user-input words
    $l1 = count($ar1);
    $l2 = count($ar2);

    $meaning = array();                   // first output: matches in vocabulary order
    $rightorder = array_fill(0, $l2, ""); // second output: one slot per user-input word

    for ($i = 0; $i < $l1; $i++) {
        for ($j = 0; $j < $l2; $j++) {
            // similar_text() stores the similarity percentage in $perc
            similar_text($ar1[$i], $ar2[$j], $perc);
            if ($perc >= 85) {
                $meaning[] = $ar1[$i];
                $rightorder[$j] = $ar1[$i];
                break; // stop at the first user word that matches
            }
        }
    }

    // Drop the empty slots, then compare the two word sequences.
    return implode(" ", $meaning) == implode(" ", array_filter($rightorder, 'strlen'));
}

The idea is that $meaning will get "how much it cost" and $rightorder will get

$rightorder[0]='how'
$rightorder[1]='much'
$rightorder[2]=''
$rightorder[3]='cost'
$rightorder[4]='it'

After that I will somehow convert it back to the string "how much cost it"

and compare those two.

if ("how much cost it"=="how much it cost") return true; else return false.
  • Check out the levenshtein() and similar_text() functions offered by PHP; they might fit the bill. Commented May 14, 2013 at 13:20
  • Also soundex. Commented May 14, 2013 at 13:23
  • You should read the question. I know this function, but the question is more complicated. Commented May 14, 2013 at 13:28

1 Answer

Your problem belongs to the field of NLP (Natural Language Processing).

Each issue mentioned in the question has a field of study of its own:

  • Splitting a string into words is called tokenization. It seems trivial in English, but it is not in other languages, such as German. There is also the problem of how to handle punctuation marks.

  • Creating "right words" is called stemming. There are a number of tools for that. If your words are in English, you may try the Porter stemming algorithm. Other languages may have their own stemming techniques; usually a dictionary-based algorithm exists.

  • Calculating the similarity of strings based on individual word occurrences is called "cosine similarity". There are a number of other techniques. There is also the problem of synonymy and polysemy.
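
To make the first and last points concrete, here is a minimal sketch of naive tokenization followed by cosine similarity over word counts. It assumes PHP 7.4 or later and splits on anything that is not a letter or digit; the helper names tokenize() and cosineSimilarity() are hypothetical, and no stemming is applied.

```php
<?php
// Hypothetical helper: split a string into lowercase word tokens.
function tokenize(string $text): array
{
    // Split on anything that is not a letter or digit; drop empty pieces.
    return preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: cosine similarity of the word-count vectors of two strings.
function cosineSimilarity(string $a, string $b): float
{
    $countsA = array_count_values(tokenize($a));
    $countsB = array_count_values(tokenize($b));

    // Dot product over the words the two strings share.
    $dot = 0;
    foreach ($countsA as $word => $count) {
        $dot += $count * ($countsB[$word] ?? 0);
    }

    // Euclidean norms of the two count vectors.
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $countsA)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $countsB)));

    if ($normA == 0 || $normB == 0) {
        return 0.0; // a string without words shares nothing with anything
    }
    return $dot / ($normA * $normB);
}

// Three shared words (how, much, it) out of 7 and 5 distinct words:
// 3 / sqrt(7 * 5), roughly 0.507.
echo cosineSimilarity("how much will it cost for me", "how much does costs it");
```

Note that this measure ignores word order, so on its own it does not cover the first criterion in the question, and without stemming "cost" and "costs" count as different words.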

I hope this helps, as your problem is a mixture of the problems mentioned above.

1 Comment

Yes, I know what NLP is, but I don't want to dive into it. Here is my simplified solution for that (good for Latin-script languages)
