Home » excel » excel – Finding Patterns: Identify the string portion that is common to a group of cells

excel – Finding Patterns: Identify the string portion that is common to a group of cells

Posted by: admin May 14, 2020 Leave a comment

Questions:

I don’t know how to search for this or how to explain without an example.

I’m looking for an excel function that compares cell strings and identifies the portion they have in common.

Conditions

  • Compares 2 or more cells.
  • The common string is identified as long as 2 cells share it. *(ie: if comparing more than 2, it’s enough to have 2 cells with that string. Not all the compared cells need to have it.) *
  • The string has at least 3 or more chars to avoid single characters and pairs being flagged.

Example

----------------------------------------------------------------------
|  Pattern  | Page URL 1           | Page URL 2     | Page URL 3     |
----------------------------------------------------------------------
|    test   | example.net/test/    | www.test.com   | www.notest.com |  
----------------------------------------------------------------------
|   q=age   | another.com?q=age    | test.com/q=age | test.com/q=lol |
----------------------------------------------------------------------

Probably obvious by now, but what I’m trying to achieve/analyze is if there are string patterns that are common to large sets of URLs.

(forgive my poor attempt trying to draw a table)

How to&Answers:

This doesn’t fully answer the question but I think it will give you what you need to get it. Give it a try. Place the following code in a new moudule:

Public Sub FindStrings()
    Dim rng1 As Excel.Range
    Dim rng2 As Excel.Range

    Set rng1 = ActiveSheet.Range("A1")
    Set rng2 = ActiveSheet.Range("A2")

    Dim i As Integer
    Dim j As Integer
    Dim searchVal As String
    For i = 3 To Len(rng2)
        For j = 1 To Len(rng1)
            searchVal = Mid(rng1, j, i)
            If Len(searchVal) < i Then Exit For
            If InStr(1, rng2, searchVal) Then Debug.Print searchVal
        Next j
    Next i
End Sub

In cell A1 put example.net/test
In cell A2 put www.test.com

Result

tes
est
test

UPDATE

I updated the code to search for a minimum of 4 characters instead of 3 (as you mentioned above). Furthermore, I guessed you wouldn’t want strings such as www. and .com returned, nor strings with the / or . character. So the code pulls those out as well. Also, it compares every column combination.

Option Explicit
Public Sub CompareStrings()
    Dim Arr As Variant
    Dim i As Integer
    Dim j As Integer
    Dim StartRange As Excel.Range
    Dim SearchRange As Excel.Range
    Dim Counter As Integer
    Dim ComparableRange As Variant
    Dim Comparable As Integer
    Dim Compared As Integer
    Dim SearchVal As String

    Set StartRange = ActiveSheet.Range("A1")

    Counter = 0
    For Each ComparableRange In ActiveSheet.Range("A1:A2")
    Set SearchRange = Range(StartRange.Offset(Counter), Cells(StartRange.Offset(Counter).Row, Columns.Count).End(xlToLeft))
    Arr = Application.Transpose(Application.Transpose(SearchRange.Value))
    Debug.Print "Row " & SearchRange.Row & ":"
        For j = LBound(Arr) To UBound(Arr)
            For i = j + 1 To UBound(Arr)
                For Comparable = 4 To Len(Arr(j))
                    For Compared = 1 To Len(Arr(i))
                        SearchVal = Mid(Arr(j), Compared, Comparable)
                        If InStr(1, SearchVal, ".") = 0 Then
                            If InStr(1, SearchVal, "/") = 0 Then
                                If Len(SearchVal) < Comparable Then Exit For
                                If InStr(1, Arr(i), SearchVal) > 0 Then Debug.Print vbTab & SearchVal
                            End If
                        End If
                    Next Compared
                Next Comparable
            Next i
        Next j
        Counter = Counter + 1
    Next ComparableRange    
End Sub

When comparing test.com/q=age with another.com?q=age You will still get results such as:

q=ag
=age 
q=age 

…though I suspect you only want the third one. The longer the matching strings are the more results you will get. The last results are the ones you will probably want.

Answer:

Copy the following code into a module. Read the comments at the top of CommonString for usage.

Option Explicit

Public Function CommonString(rng As Range, iMinLen As Integer, Optional strDelimiter As String = ",") As String
    'Finds the maximum number of cells (iMax) in "rng" that have a common substring of length at least "iMinLen".
    'The function returns a string with the format "iMax: substring1,substring2,substring3..."
    ' where substring1, substring2, etc. are unique substrings found in exactly iMax cells.
    'The output does not include any substrings of the unique substrings.
    'The delimter between substrings can be specified by the optional parameter "strDelimiter".
    'If no common substrings of length at least "iMinLen" are found, "CommonString" will return an empty string.
    Dim blnRemove() As Boolean
    Dim dicSubStrings As Object 'records the number of times substrings are found in pairwise string comparisons
    Dim iCandidates As Integer
    Dim iCol As Integer
    Dim iCurrCommon As Integer
    Dim iCurrLen As Integer
    Dim iMax As Integer
    Dim iMaxCommon As Integer
    Dim iNumStrings As Integer
    Dim iOutCount As Integer
    Dim iRow As Integer
    Dim iString1 As Integer
    Dim iString2 As Integer
    Dim iSubStr1 As Integer
    Dim iSubStr2 As Integer
    Dim lngSumLen As Long
    Dim str1D() As String
    Dim strCandidates() As String
    Dim strOut() As String
    Dim strSim() As String
    Dim strSub As String
    Dim vKey As Variant
    Dim vStringsIn() As Variant

    Set dicSubStrings = CreateObject("Scripting.Dictionary")
    vStringsIn = rng.Value
    iNumStrings = Application.CountA(rng)
    ReDim str1D(1 To iNumStrings)
    ' pull the strings into a 1-D array
    For iRow = 1 To UBound(vStringsIn, 1)
        For iCol = 1 To UBound(vStringsIn, 2)
            iCurrLen = Len(vStringsIn(iRow, iCol))

            If iCurrLen > 0 Then
                iString1 = iString1 + 1
                str1D(iString1) = vStringsIn(iRow, iCol)
                lngSumLen = lngSumLen + iCurrLen
            End If
        Next iCol
    Next iRow
    'initialize the array that will hold the substrings to output
    ReDim strOut(1 To lngSumLen - iNumStrings * (iMinLen - 1))
    'find common substrings from all pairwise combination of strings
    For iString1 = 1 To iNumStrings - 1
        For iString2 = iString1 + 1 To iNumStrings
            strSim = Sim2Strings(str1D(iString1), str1D(iString2), iMinLen)
            'loop through all common substrings
            For iSubStr1 = 1 To UBound(strSim)
                If dicSubStrings.Exists(strSim(iSubStr1)) Then
                    iCurrCommon = dicSubStrings(strSim(iSubStr1)) + 1
                    dicSubStrings(strSim(iSubStr1)) = iCurrCommon
                    If iCurrCommon > iMaxCommon Then iMaxCommon = iCurrCommon
                Else    'add common substrings to the "dicSubStrings" dictionary
                    dicSubStrings.Add strSim(iSubStr1), 1
                    If iMaxCommon = 0 Then iMaxCommon = 1
                End If
            Next iSubStr1
        Next iString2
    Next iString1

    If dicSubStrings.Count = 0 Then Exit Function
    ReDim strCandidates(1 To dicSubStrings.Count)
    'add the candidate substrings to the "strCandidates" array
    'candidate substrings are those found in exactly "iMaxCommon" pairwise comparisons
    For Each vKey In dicSubStrings.keys
        If dicSubStrings(vKey) = iMaxCommon Then
            iCandidates = iCandidates + 1
            strCandidates(iCandidates) = CStr(vKey)
        End If
    Next vKey

    ReDim blnRemove(1 To iCandidates)
    iOutCount = iCandidates
    'keep only the candidate substrings that are not a substring within another candidate substring
    For iSubStr1 = 1 To iCandidates - 1
        If Not blnRemove(iSubStr1) Then
            For iSubStr2 = 1 To iCandidates - 1
                If Not blnRemove(iSubStr2) Then
                    If Len(strCandidates(iSubStr1)) <> Len(strCandidates(iSubStr2)) Then
                        If Len(strCandidates(iSubStr1)) > Len(strCandidates(iSubStr2)) Then
                            If InStr(strCandidates(iSubStr1), strCandidates(iSubStr2)) > 0 Then
                                blnRemove(iSubStr2) = True
                                iOutCount = iOutCount - 1
                            End If
                        Else
                            If InStr(strCandidates(iSubStr2), strCandidates(iSubStr1)) > 0 Then
                                blnRemove(iSubStr1) = True
                                iOutCount = iOutCount - 1
                            End If
                        End If
                    End If
                End If
            Next iSubStr2
        End If
    Next iSubStr1

    ReDim strOut(1 To iOutCount)
    iOutCount = 0
    'add the successful candidates to "strOut"
    For iSubStr1 = 1 To iCandidates
        If Not blnRemove(iSubStr1) Then
            iOutCount = iOutCount + 1
            strOut(iOutCount) = strCandidates(iSubStr1)
        End If
    Next iSubStr1
    'convert "iMaxCommon" (pairwise counts) to number of cells (iMax) by solving the formula:
    '(iMax ^ 2 - iMax) / 2 = iMaxCommon
    iMax = ((8 * iMaxCommon + 1) ^ 0.5 + 1) / 2
    CommonString = iMax & ": " & Join(strOut, strDelimiter)
End Function

Private Function Sim2Strings(str1 As String, str2 As String, iMinLen As Integer) As String()
    'Returns a list of unique substrings common to both "str1" and "str2" that
    ' have a length of at least "iMinLen".
    Dim dicInList As Object
    Dim iCharFrom As Integer
    Dim iLen1 As Integer
    Dim iSearchLen As Integer
    Dim iSubStr As Integer
    Dim strCurr As String
    Dim strList() As String
    Dim vKey As Variant

    iLen1 = Len(str1)
    Set dicInList = CreateObject("Scripting.Dictionary")
    'add common substrings to the "dicInList" dictionary
    For iCharFrom = 1 To iLen1 - iMinLen + 1
        For iSearchLen = iMinLen To iLen1 - iCharFrom + 1
            strCurr = Mid(str1, iCharFrom, iSearchLen)

            If InStr(str2, strCurr) = 0 Then
                Exit For
            Else
                If Not dicInList.Exists(strCurr) Then
                    dicInList.Add strCurr, 0
                End If
            End If
        Next iSearchLen
    Next iCharFrom

    If dicInList.Count = 0 Then
        ReDim strList(0)
    Else
        ReDim Preserve strList(1 To dicInList.Count)
        'output the keys in the "dicInList" dictionary to the "strList" array
        For Each vKey In dicInList.keys
            iSubStr = iSubStr + 1
            strList(iSubStr) = vKey
        Next vKey
    End If

    Sim2Strings = strList
End Function