excel – VBScript Regex Fill Submatches even when not Required for the Match

Posted by: admin, April 23, 2020


I’m trying to replicate Google Calendar’s method of creating an appointment from a narrative. I want to enter “5pm Happy Hour for 1 hour” and parse it into, ultimately, an Outlook AppointmentItem.

My problem, I think, is I have a large chunk of optional text at the end. And because it’s optional, the regex passes but the submatch doesn’t get populated because it isn’t required for the match. I want it to populate because I want to use the submatches as my parsing engine.

I have a bunch of test cases in column A (working in Excel, then will move to Outlook), and my code lists out the submatches to the right. This is a representative sample of potential input:

1. 5pmCST Happy Hour for 1 hour
2. 5pm CST Happy Hour for 1 hour
3. 5pm Happy Hour for 1 hour
4. 5 pm Happy Hour for 1 hour
5. 5 pm CST Happy Hour for 1 hour
6. 5 Happy Hour for 1 hour
7. 5 Happy Hour
8. 5pmCST Happy Hour
9. 5pm CST Happy Hour
10. 5pm Happy Hour
11. 5:00CST Happy Hour for 1 hour
12. 5:00 CST Happy Hour for 1 hour

Here’s the code that runs the tests

Sub testest()

    Dim RegEx As VBScript_RegExp_55.RegExp
    Dim Matches As VBScript_RegExp_55.MatchCollection
    Dim Match As VBScript_RegExp_55.Match
    Dim rCell As Range
    Dim SubMatch As Variant
    Dim lCnt As Long
    Dim aPattern(1 To 8) As String

    Set RegEx = New VBScript_RegExp_55.RegExp
    aPattern(1) = "(1?[0-9](:[0-5][0-9])?)" 'time
    aPattern(2) = "( ?)" 'optional space
    aPattern(3) = "([ap]m)?" 'optional ampm
    aPattern(4) = "( ?)" 'optional space
    aPattern(5) = "([ECMP][DS]T)?" 'optional time zone
    aPattern(6) = "( ?)" 'optional space
    aPattern(7) = "(.+?)" 'event description
    aPattern(8) = "(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?" 'optional duration

    RegEx.Pattern = Join(aPattern, vbNullString)
    Debug.Print RegEx.Pattern

    Sheet1.Range("C1").Resize(1000, 100).ClearContents

    For Each rCell In Sheet1.Range("A1").CurrentRegion.Columns(1).Cells
        lCnt = 0
        rCell.Offset(0, 2).Value = RegEx.test(rCell.Text)
        If RegEx.test(rCell.Text) Then
            Set Matches = RegEx.Execute(rCell.Text)

            For Each Match In Matches
                For Each SubMatch In Match.SubMatches
                    lCnt = lCnt + 1
                    rCell.Offset(0, 2 + lCnt).Value = SubMatch
                Next SubMatch
            Next Match
        End If
    Next rCell

End Sub

The pattern is

(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?

The submatches for #1 are

1        2          3        4      5       6       7
5                   pm              CST             H

It stops matching at the “H” in Happy Hour because everything starting with the ” for ” is optional. If I remove the optional part, my pattern becomes

(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?)

But #7-#10 don’t pass because they don’t have a duration. The submatches for #1 give me what I want, though:

1     2     3     4     5     6     7             8     9     10     11
5           pm          CST         Happy Hour     for  1            hour

I want every possible submatch to fill even if VBScript doesn’t need it to make the regex pass. I fear this is just how it works and that I’m trying to get regex to do my parsing work for me. I considered running it through increasingly restrictive patterns until one fails, then using the last passing pattern, but that seems kludgy.

Is it possible to get regex to fill those submatches?

Answers:

I have assumed that each line is the entire contents of a single cell, so I am able to use the end-of-string anchor ($).
I also don’t think you need as many capturing groups as you have. I set up the regex with:

Group 1        Time
Group 2        am/pm
Group 3        Time Zone
Group 4        Description
Group 5        Hours (and fractions of hours)

With your data in A2:An, the following routine parses the data into the adjacent columns. It doesn’t matter if a Submatch is “not filled”. You could also fill elements in an array, or whatever else you want to do. If you want more submatches, you can always either add capturing groups for the optional spaces, or change the relevant non-capturing groups to capturing groups.

Also, since the “for” is optional, I chose to use a lookahead to determine the end of “description”. Description will end with either a \s+for\s+ sequence; or with the “end of line”. Since I have assumed there is only one entry, and one line, per cell, the multiline and global properties are irrelevant.

One has to include spaces before and after “for” so as to avoid problems if that sequence is included in Description.
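
To see the effect of the lookahead in isolation, here is a minimal sketch that reduces the problem to just the description and the duration. The names FirstGroup and DemoLazyDescription are made up for this illustration, and the two short patterns are simplified stand-ins, not the full pattern from the question or from this answer. It uses late binding, so it should run without setting a reference.

```vba
Option Explicit

' Hypothetical helper: returns the text captured by the first group of
' sPattern, or "" if the pattern does not match at all.
' Late binding (CreateObject) is used so no library reference is required.
Function FirstGroup(ByVal sPattern As String, ByVal sInput As String) As String
    Dim oRe As Object, oMatches As Object
    Set oRe = CreateObject("VBScript.RegExp")
    oRe.Pattern = sPattern
    oRe.IgnoreCase = True
    Set oMatches = oRe.Execute(sInput)
    If oMatches.Count > 0 Then
        ' CStr turns an unfilled submatch into an empty string
        FirstGroup = CStr(oMatches(0).SubMatches(0))
    End If
End Function

Sub DemoLazyDescription()
    Const SAMPLE As String = "Happy Hour for 1 hour"

    ' Lazy group followed only by optional text: a single character
    ' is enough for the match to succeed, so the group stops at "H".
    Debug.Print FirstGroup("(.+?)( for \d+ hours?)?", SAMPLE)        ' H

    ' Lazy group bounded by a lookahead: it has to keep expanding
    ' until " for " or the end of the string comes next.
    Debug.Print FirstGroup("(.+?)(?=\s+for\s+|$)", SAMPLE)           ' Happy Hour
    Debug.Print FirstGroup("(.+?)(?=\s+for\s+|$)", "Happy Hour")     ' Happy Hour
End Sub
```

The lookahead consumes no characters, so the optional duration group that follows it in the full pattern still gets a chance to match the " for 1 hour" part.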

Option Explicit
'set Reference to Microsoft VBScript Regular Expressions 5.5
Sub ParseAppt()
    Dim R As Range, C As Range
    Dim RE As RegExp, MC As MatchCollection
    Dim I As Long

    ' Input runs from A2 down to the last used cell in column A
    Set R = Range("A2", Cells(Rows.Count, "A").End(xlUp))
    Set RE = New RegExp
    With RE
        ' Groups: 1 time, 2 am/pm, 3 time zone, 4 description, 5 hours
        .Pattern = "((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*([ap]m)?\s*([ECMP][DS]T)?\s*(.*?(?=\s+for\s+|$))(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?"
        .IgnoreCase = True
        For Each C In R
            If .Test(C.Text) Then
                Set MC = .Execute(C.Text)
                ' Write the five submatches to the five columns right of the input
                For I = 0 To 4
                    C.Offset(0, I + 1).Value = MC(0).SubMatches(I)
                Next I
            End If
        Next C
    End With
End Sub
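
Since the question mentions moving this to Outlook later, it may help to separate the parsing from the worksheet. The following is only a sketch of one way to do that: TryParseAppt and DemoTryParseAppt are hypothetical names, the function is late bound so it needs no reference, and the time zone class is written as [ECMP] on the assumption that the four prefixes from the question (Eastern, Central, Mountain, Pacific) are what is wanted.

```vba
Option Explicit

' Hypothetical helper: parses one appointment string into five fields
' (0 time, 1 am/pm, 2 time zone, 3 description, 4 hours).
' Optional groups that did not take part in the match come back as "".
' Returns False, leaving vParts untouched, when the text does not match.
Function TryParseAppt(ByVal sText As String, ByRef vParts As Variant) As Boolean
    Dim oRe As Object, oMatches As Object
    Dim aOut(0 To 4) As String
    Dim i As Long

    Set oRe = CreateObject("VBScript.RegExp")   ' late binding
    oRe.IgnoreCase = True
    oRe.Pattern = "((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*([ap]m)?\s*([ECMP][DS]T)?\s*" & _
                  "(.*?(?=\s+for\s+|$))(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?"

    Set oMatches = oRe.Execute(sText)
    If oMatches.Count = 0 Then Exit Function

    For i = 0 To 4
        aOut(i) = CStr(oMatches(0).SubMatches(i))
    Next i
    vParts = aOut
    TryParseAppt = True
End Function

Sub DemoTryParseAppt()
    Dim vParts As Variant

    If TryParseAppt("5:00 CST Happy Hour for 1.5 hours", vParts) Then
        Debug.Print Join(vParts, "|")   ' 5:00||CST|Happy Hour|1.5
    End If
    If TryParseAppt("5 Happy Hour", vParts) Then
        Debug.Print Join(vParts, "|")   ' 5|||Happy Hour|
    End If
End Sub
```

The caller can then decide what an empty field means (for example, a default duration) without touching the pattern.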