
php – Removing similar elements from array

Posted by: admin July 12, 2020

Questions:
Array
(
    [0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
    [1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
    [2] => The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.
    [3] => Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
    [4] => The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.
    [5] => For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:
    [6] => The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.
    [7] => The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.
    [8] => For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:
    [9] => The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.
)

The elements are not exactly identical (those could be removed with array_unique). Rather, some elements are rendered obsolete by another element that contains exactly the same data and more, and sometimes just a few words are different.

How do I filter these?

Answers:

First of all, the problem is not that simple and not formulated precisely enough: you don’t want to remove identical elements, you want to remove similar elements, so your first problem becomes determining which elements are similar.

Given that similarities can happen at any point in the string, it’s not enough to require them to start with the same set of characters. For example, take these two sentences (adapted from your question):

Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.
The rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.

They are very similar without starting with the same string. One way of determining a similarity measure is the Smith–Waterman algorithm; there’s a PHP implementation available here.
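To make the point concrete, here is a minimal sketch (the two strings are shortened versions of the sentences above, used only as sample data). It shows that a prefix check finds no relation between them, while PHP’s built-in similar_text() still reports a high similarity percentage:

```php
<?php
// Shortened versions of the two sentences above (sample data).
$a = 'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo';
$b = 'The rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo';

// Prefix check: neither string starts with the other.
$isPrefix = strpos($b, $a) === 0 || strpos($a, $b) === 0;

// similar_text() writes the similarity percentage into its third argument.
similar_text($a, $b, $percent);

var_dump($isPrefix);            // bool(false)
printf("%.1f%%\n", $percent);   // well above 90 for these two strings
```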

— Later edit —

Here’s the implementation using PHP’s built-in similar_text():

/**
 * @param mixed $array          input array
 * @param int $minSimilarity    minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $result = [];

    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            $similarity = null;
            similar_text($innerValue, $outerValue, $similarity);
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }

        if ($append) {
            $result[] = $outerValue;
        }
    }

    return $result;
}

$test = [
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
    'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
    'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
    'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
    'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
    'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
    'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
    'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];

var_dump(applyFilter($test));

— EOF later edit —

Here is the full working code with the Smith–Waterman algorithm:

class SmithWatermanGotoh
{
    private $gapValue;
    private $substitution;

    /**
     * Constructs a new Smith Waterman metric.
     *
     * @param float $gapValue
     *            a non-positive gap penalty
     * @param SmithWatermanMatchMismatch|null $substitution
     *            a substitution function (defaults to match 1.0, mismatch -2.0)
     */
    public function __construct($gapValue=-0.5,
                $substitution=null)
    {
        if($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
        //if(empty($substitution)) throw new Exception("substitution is required");
        if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
        else $this->substitution = $substitution;
        $this->gapValue = $gapValue;
    }

    public function compare($a, $b)
    {
        if (empty($a) && empty($b)) {
            return 1.0;
        }

        if (empty($a) || empty($b)) {
            return 0.0;
        }

        $maxDistance = min(mb_strlen($a), mb_strlen($b))
                * max($this->substitution->max(), $this->gapValue);
        return $this->smithWatermanGotoh($a, $b) / $maxDistance;
    }

    private function smithWatermanGotoh($s, $t)
    {
        $v0 = [];
        $v1 = [];
        $t_len = mb_strlen($t);
        $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));

        for ($j = 1; $j < $t_len; $j++) {
            $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                    $this->substitution->compare($s, 0, $t, $j));

            $max = max($max, $v0[$j]);
        }

        // Find max
        for ($i = 1; $i < mb_strlen($s); $i++) {
            $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));

            $max = max($max, $v1[0]);

            for ($j = 1; $j < $t_len; $j++) {
                $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                        $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));

                $max = max($max, $v1[$j]);
            }

            for ($j = 0; $j < $t_len; $j++) {
                $v0[$j] = $v1[$j];
            }
        }

        return $max;
    }
}

class SmithWatermanMatchMismatch
{
    private $matchValue;
    private $mismatchValue;

    /**
     * Constructs a new match-mismatch substitution function. When two
     * characters are equal a score of <code>matchValue</code> is assigned. In
     * case of a mismatch a score of <code>mismatchValue</code> is assigned. The
     * <code>matchValue</code> must be strictly greater than
     * <code>mismatchValue</code>.
     *
     * @param float $matchValue
     *            value when characters are equal
     * @param float $mismatchValue
     *            value when characters are not equal
     */
    public function __construct($matchValue, $mismatchValue) {
        if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > mismatchValue");

        $this->matchValue = $matchValue;
        $this->mismatchValue = $mismatchValue;
    }

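    public function compare($a, $aIndex, $b, $bIndex) {
        // Compare whole characters (not bytes) so that the indexes agree with
        // the mb_strlen() lengths used by SmithWatermanGotoh for strings
        // containing multibyte characters such as ® or ™.
        return (mb_substr($a, $aIndex, 1) === mb_substr($b, $bIndex, 1) ? $this->matchValue
                : $this->mismatchValue);
    }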

    public function max() {
        return $this->matchValue;
    }

    public function min() {
        return $this->mismatchValue;
    }
}

/**
 * @param mixed $array          input array
 * @param int $minSimilarity    minimum similarity for an item to be removed (percentage)
 * @return array
 */
function applyFilter ($array, $minSimilarity = 90) {
    $swg = new SmithWatermanGotoh();

    $result = [];

    foreach ($array as $outerValue) {
        $append = true;
        foreach ($result as $key => $innerValue) {
            $similarity = $swg->compare($innerValue, $outerValue) * 100;
            if ($similarity >= $minSimilarity) {
                if (strlen($outerValue) > strlen($innerValue)) {
                    // always keep the longer one
                    $result[$key] = $outerValue;
                }
                $append = false;
                break;
            }
        }

        if ($append) {
            $result[] = $outerValue;
        }
    }

    return $result;
}


$test = [
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.',
    'The N2225 and N2226 SAS/SATA HBAs support SAS data transfer rates of 3, 6, and 12 Gbps per lane and SATA transfer rates of 3 and 6 Gbps per lane, and they enable maximum connectivity and performance in a low-profile (N2225) or full-height (N2226) form factor.',
    'Rigorous testing of the N2225 and N2226 SAS/SATA HBAs by Lenovo through the ServerProven® program ensures a high degree of confidence in storage subsystem compatibility and reliability. Providing an additional peace of mind, these controllers are covered under Lenovo warranty.',
    'The following tables list the compatibility information for the N2225 and N2226 SAS/SATA HBAs and System x®, iDataPlex®, and NeXtScale™ servers.',
    'For more information about the System x servers, including older servers that support the N2225 and N2226 adapters, see the following ServerProven® website:',
    'The following table lists the external storage systems that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in storage solutions.',
    'The following table lists the external tape backup units that are currently offered by Lenovo that can be used with the N2225 and N2226 SAS/SATA HBAs in tape backup solutions.',
    'For more information about the specific versions and service levels that are supported and any other prerequisites, see the ServerProven website:',
    'The N2225 and N2226 SAS/SATA HBAs carry a one-year limited warranty. When installed in a supported System x server, the adapters assume your system’s base warranty and any Lenovo warranty upgrade.',
];

var_dump(applyFilter($test));

Now you just need to tweak the $minSimilarity parameter according to your needs. For example, in your case, keeping the default 90% removes the 1st element (it is 99.86% similar to the 2nd). Setting a lower value (80%) also removes the 8th element.
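If you are unsure which threshold to pick, it may help to look at the pairwise scores first. The sketch below is only an illustration: pairwiseSimilarity() is a hypothetical helper, and it uses similar_text() for brevity (the same loop should work with $swg->compare() * 100 from the code above).

```php
<?php
/**
 * Hypothetical helper: similarity percentage of every pair, highest first.
 *
 * @param array $items strings to compare
 * @return array list of ['a' => index, 'b' => index, 'percent' => float]
 */
function pairwiseSimilarity(array $items): array
{
    $items = array_values($items);
    $pairs = [];
    for ($i = 0; $i < count($items); $i++) {
        for ($j = $i + 1; $j < count($items); $j++) {
            similar_text($items[$i], $items[$j], $percent);
            $pairs[] = ['a' => $i, 'b' => $j, 'percent' => $percent];
        }
    }
    // Sort so the most similar pairs come first.
    usort($pairs, function ($x, $y) {
        return $y['percent'] <=> $x['percent'];
    });
    return $pairs;
}

// Sample data, not taken from the question.
foreach (pairwiseSimilarity(['tape drives', 'tapes drives', 'warranty']) as $pair) {
    printf("%d vs %d: %.2f%%\n", $pair['a'], $pair['b'], $pair['percent']);
}
```

Pairs that score just above or below your candidate threshold are the ones worth checking by eye.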

Hope it helps!

Answer:

Assuming that the value always appears at the very beginning, you could do something like this:

$arr = ["Some Text.", "Some Text. And more details."];

foreach($arr as $key => $value) {

    // Look for the value in every element
    foreach($arr as $key2 => $value2) {

        // Remove element if its value appears at the beginning of another element
        if ($key !== $key2 && strpos($value2, $value) === 0) {
            unset($arr[$key]);
            continue 2;
        }
    }
}

// Re-index array 
$arr = array_values($arr);

This works as well if the element order is the other way around.
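As a quick check of the order-independence claim, here is the same loop wrapped in a function (removePrefixElements() is just a name chosen for this sketch) and run on both orderings:

```php
<?php
/**
 * Hypothetical wrapper around the loop above: drops every element that
 * appears at the beginning of another element.
 */
function removePrefixElements(array $arr): array
{
    foreach ($arr as $key => $value) {
        foreach ($arr as $key2 => $value2) {
            if ($key !== $key2 && strpos($value2, $value) === 0) {
                unset($arr[$key]);
                continue 2;
            }
        }
    }
    // Re-index so the result has consecutive keys
    return array_values($arr);
}

// Both calls leave only "Some Text. And more details."
print_r(removePrefixElements(["Some Text.", "Some Text. And more details."]));
print_r(removePrefixElements(["Some Text. And more details.", "Some Text."]));
```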

Answer:

You can still use array_filter with a custom callback: join the unique values into one string and use substr_count to check whether a value occurs more than once in it, i.e. whether it is also contained in another element.

$input = array("a","b","c","d","ax","cz");

$str = implode("|",array_unique($input));

$output = array_filter($input, function($var) use ($str){
                        return substr_count($str, $var) == 1;
                    });

print_r($output);
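Applied to sentence-like data, the same idea could look like the sketch below. removeContained() is a hypothetical name, and the code assumes that no value contains the "|" separator. Unlike the snippet above it filters the de-duplicated list, so exact duplicates collapse to one entry as well.

```php
<?php
/**
 * Hypothetical helper: removes every value that is contained in another value.
 * Assumes no value contains the "|" separator.
 */
function removeContained(array $input): array
{
    $unique = array_unique($input);
    $joined = implode("|", $unique);

    // A value that occurs exactly once in the joined string only matches itself.
    $kept = array_filter($unique, function ($value) use ($joined) {
        return substr_count($joined, $value) === 1;
    });

    return array_values($kept);
}

print_r(removeContained(["Some Text.", "Some Text. And more details.", "Other"]));
```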

Answer:

sometimes just a few words are different.

As you stated, a few words can differ from one text to another. But in programming you need an exact condition to filter on.

You can use a match-percentage threshold to decide what to filter out.

Here is a basic example to give you the idea.

<?php
    $data = ["this is test","this is another test","one test","two test","this is two test"];
    $percentageMatched = 100;//Here you can put your percentage matched to delete
    for($i=0;$i<count($data)-1;$i++){
      $value = explode(" ",$data[$i]);
      /* check each word in another text */
      for($k=$i+1;$k<count($data);$k++){
        $nextArray = explode(" ",$data[$k]);
        $foundCount = 0;
        for($j=0;$j<count($value);$j++){  
          if(in_array($value[$j],$nextArray)){
            $foundCount++;    
          }
        }
        $fromLine = $i;
        $toLine = $k;
        $percentage = $foundCount/count($value)*100;
        echo "EN $fromLine matched $percentage % with EN $toLine  \n";  
        if($percentage >= $percentageMatched){  
          $data[$i] = "";
          break;
          //array_values($data);
        }  
      }

      echo ".............\n";
    }
    print_r(array_filter($data));
?>

live demo : https://eval.in/706478

If input data is :

Array
(
    [0] => this is test
    [1] => this is another test
    [2] => one test
    [3] => two test
    [4] => this is two test
)

With a match threshold of 100%, it gives the following output; indexes 0 and 3 match 100% and are filtered out:

EN 0 matched 100 % with EN 1  
.............
EN 1 matched 25 % with EN 2  
EN 1 matched 25 % with EN 3  
EN 1 matched 75 % with EN 4  
.............
EN 2 matched 50 % with EN 3  
EN 2 matched 50 % with EN 4  
.............
EN 3 matched 100 % with EN 4  
.............
Array
(
    [1] => this is another test
    [2] => one test
    [4] => this is two test
)
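The per-pair check inside the loop above can also be pulled out into a small function, which makes it easier to test with different thresholds. This is only a sketch; wordOverlapPercent() is a name made up for the illustration.

```php
<?php
/**
 * Hypothetical helper: percentage of the words in $from that also occur in $to.
 */
function wordOverlapPercent(string $from, string $to): float
{
    $words = explode(' ', $from);
    $otherWords = explode(' ', $to);

    $found = 0;
    foreach ($words as $word) {
        if (in_array($word, $otherWords, true)) {
            $found++;
        }
    }

    // explode() always returns at least one element, so count($words) >= 1.
    return $found / count($words) * 100;
}

printf("%d%%\n", wordOverlapPercent('this is test', 'this is another test')); // 100%
printf("%d%%\n", wordOverlapPercent('this is another test', 'one test'));     // 25%
```

Note that the measure is not symmetric: the percentage is relative to the word count of the first argument.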

Answer:

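Using array_filter is a good option. The callback below compares each element with the one before it and drops it when one of the two is a prefix of the other:

$array1 = ["Some Text. And more details.", "Some Text.", "Other text."]; // sample input (assumed)
$temp = null; // previously visited element

function prefixmatch($x){
  global $temp;
  $previous = $temp;
  $temp = $x;
  if ($previous === null) {
    return true; // the first element has nothing to compare against
  }
  // do an optimistic linear search to determine if there's a prefix match
  $bool = true;
  for($i=0; $i < min(strlen($x), strlen($previous)); $i++){
    $bool = $bool && ($x[$i] === $previous[$i]);
  }
  // negate the result just because of array_filter
  return(!$bool);
}

print_r(array_filter($array1, "prefixmatch"));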

Answer:

I think stemming and lemmatisation can be helpful in this scenario. If we take the first two elements in the array, the only difference is the singular ‘tape’ versus the plural ‘tapes’.
Array
(
[0] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tape drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
[1] => The N2225 and N2226 SAS/SATA HBAs are low-cost, high-performance host bus adapters for high-performance connectivity between System x® servers and tapes drives and RAID storage systems. The N2225 provides two x4 external mini-SAS HD connectors with eight lanes of 12 Gbps SAS. The N2226 provides four x4 external mini-SAS HD connectors with 16 lanes of 12 Gbps SAS.
)

If you tokenize your string and pass it through a stemmer like Php Stemmer, both the words ‘tape’ and ‘tapes’ will be reduced to their stem, i.e. ‘tape’. After stemming, you can compare your array elements. I am sure it will remove many redundant elements.
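To illustrate the idea without pulling in a library, here is a deliberately crude sketch. naiveStem() is not a real stemmer: it only strips a trailing "s", and both function names are invented for this example. A proper stemmer applies far more rules, but the workflow (tokenize, stem, compare) is the same.

```php
<?php
/**
 * Toy stand-in for a stemmer (assumption: plural "s" is the only variation).
 */
function naiveStem(string $word): string
{
    // Strip one trailing "s" from words longer than three letters.
    if (strlen($word) > 3 && substr($word, -1) === 's') {
        return substr($word, 0, -1);
    }
    return $word;
}

/**
 * Lowercases the text, splits it on whitespace and stems every token.
 */
function normalizeText(string $text): string
{
    $tokens = preg_split('/\s+/', strtolower(trim($text)));
    return implode(' ', array_map('naiveStem', $tokens));
}

$a = normalizeText('servers and tape drives');
$b = normalizeText('servers and tapes drives');
var_dump($a === $b); // bool(true)
```

Once the strings are normalized like this, even array_unique on the normalized values would catch the ‘tape’ / ‘tapes’ case.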

You can also go one step further and perform lemmatisation on the strings. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word (from Wikipedia).

I personally used the Stanford NLP Java library. There is a PHP implementation as well: PHP-Stanford-NLP

Answer:

The solution will depend on your definition of “similarity” and the data set. It can be really different from one context to the other.

One solution that may answer your need is the cosine similarity. Here is a sample of code: Cosine similarity vs Hamming distance
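For reference, a minimal word-based cosine similarity could look like the sketch below. It is not the code from the linked answer: cosineSimilarity() is a name chosen here, and splitting on whitespace after lowercasing is an assumption about what counts as a "word".

```php
<?php
/**
 * Cosine similarity of two texts, using word counts as vectors.
 * Returns a value between 0.0 (no shared words) and 1.0 (same word distribution).
 */
function cosineSimilarity(string $a, string $b): float
{
    $countsA = array_count_values(preg_split('/\s+/', strtolower(trim($a))));
    $countsB = array_count_values(preg_split('/\s+/', strtolower(trim($b))));

    // Dot product over the words of $a (words missing from $b contribute 0).
    $dot = 0;
    foreach ($countsA as $word => $count) {
        $dot += $count * ($countsB[$word] ?? 0);
    }

    // Squared vector lengths.
    $sumSquaresA = 0;
    foreach ($countsA as $count) {
        $sumSquaresA += $count * $count;
    }
    $sumSquaresB = 0;
    foreach ($countsB as $count) {
        $sumSquaresB += $count * $count;
    }

    return $dot / (sqrt($sumSquaresA) * sqrt($sumSquaresB));
}

printf("%.3f\n", cosineSimilarity('this is test', 'this is another test'));
```

You would then use it like the percentage thresholds in the other answers: treat two elements as similar when the result is above a cut-off you choose.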

Answer:

In PHP, you can use the array_unique function to remove exact duplicates from an array.

Example from php.net:

<?php
   $input = array("a" => "green", "red", "b" => "green", "blue", "red");
   $result = array_unique($input);
   print_r($result);
?>

The output is:

Array
( 
   [a] => green
   [0] => red
   [1] => blue
)

Hope it was what you were looking for