I’m trying to replicate Google Calendar’s method of creating an appointment from a narrative. I want to enter
“5pm Happy Hour for 1 hour” and parse it into, ultimately, an Outlook AppointmentItem.
My problem, I think, is that I have a large chunk of optional text at the end. Because it’s optional, the regex passes without it, so its submatches don’t get populated. I want them populated because I want to use the submatches as my parsing engine.
I have a bunch of test cases in column A (working in Excel, then will move to Outlook), and my code lists out the submatches to the right. This is a representative sample of potential input:
1. 5pmCST Happy Hour for 1 hour
2. 5pm CST Happy Hour for 1 hour
3. 5pm Happy Hour for 1 hour
4. 5 pm Happy Hour for 1 hour
5. 5 pm CST Happy Hour for 1 hour
6. 5 Happy Hour for 1 hour
7. 5 Happy Hour
8. 5pmCST Happy Hour
9. 5pm CST Happy Hour
10. 5pm Happy Hour
11. 5:00CST Happy Hour for 1 hour
12. 5:00 CST Happy Hour for 1 hour
Here’s the code that runs the tests
    Sub testest()

        Dim RegEx As VBScript_RegExp_55.RegExp
        Dim Matches As VBScript_RegExp_55.MatchCollection
        Dim Match As VBScript_RegExp_55.Match
        Dim rCell As Range
        Dim SubMatch As Variant
        Dim lCnt As Long
        Dim aPattern(1 To 8) As String

        Set RegEx = New VBScript_RegExp_55.RegExp

        aPattern(1) = "(1?[0-9](:[0-5][0-9])?)" 'time
        aPattern(2) = "( ?)" 'optional space
        aPattern(3) = "([ap]m)?" 'optional ampm
        aPattern(4) = "( ?)" 'optional space
        aPattern(5) = "([ECMP][DS]T)?" 'optional time zone
        aPattern(6) = "( ?)" 'optional space
        aPattern(7) = "(.+?)" 'event description
        aPattern(8) = "(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?" 'optional duration

        RegEx.Pattern = Join(aPattern, vbNullString)
        Debug.Print RegEx.Pattern

        Sheet1.Range("C1").Resize(1000, 100).ClearContents

        For Each rCell In Sheet1.Range("A1").CurrentRegion.Columns(1).Cells
            lCnt = 0
            rCell.Offset(0, 2).Value = RegEx.test(rCell.Text)
            If RegEx.test(rCell.Text) Then
                Set Matches = RegEx.Execute(rCell.Text)
                For Each Match In Matches
                    For Each SubMatch In Match.SubMatches
                        lCnt = lCnt + 1
                        rCell.Offset(0, 2 + lCnt).Value = SubMatch
                    Next SubMatch
                Next Match
            End If
        Next rCell

    End Sub
The pattern is
(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)(( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?))?
The submatches for #1 are
1 2 3 4 5 6 7
5 pm CST H
It stops matching at the “H” in Happy Hour: the lazy (.+?) needs only one character, and everything starting with “ for ” is optional, so the match succeeds without it. If I remove the optional part, my pattern becomes
(1?[0-9](:[0-5][0-9])?)( ?)([ap]m)?( ?)([ECMP][DS]T)?( ?)(.+?)( for )([1-2]?[0-9](.[0-9]?[0-9])?)( hours?)
But #7-#10 don’t pass because they don’t have a duration. The submatches for #1 give me what I want though
1 2 3 4 5 6 7 8 9 10 11
5 pm CST Happy Hour for 1 hour
I want every possible submatch to fill even if VBScript doesn’t need them in order to make the regex pass. I fear this is just how it works and that I’m trying to get regex to do my parsing work for me. I considered running the input through increasingly restrictive patterns until one fails, then using the last passing pattern, but that seems kludgy.
Is it possible to get regex to fill those submatches?
I have assumed each line is the entire contents of a single cell, so I am able to use anchors.
I also don’t think you need as many capturing groups as you have. I set up the regex with:
Group 1: Time
Group 2: am/pm
Group 3: Time Zone
Group 4: Description
Group 5: Hours (and fractions of hours)
With your data in A2:An, the following routine parses the data into the adjacent columns. It doesn’t matter if a Submatch is “not filled”. You could also fill elements in an array, or whatever else you want to do. If you want more submatches, you can always either add capturing groups for the optional spaces, or change the relevant non-capturing groups to capturing groups.
Also, since the “for” is optional, I chose to use a lookahead to determine the end of “description”. Description will end with either a \s+for\s+ sequence; or with the “end of line”. Since I have assumed there is only one entry, and one line, per cell, the multiline and global properties are irrelevant.
One has to include spaces before and after “for” so as to avoid problems if that sequence is included in Description.
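To see the lookahead idea in isolation, separate from the full routine, here is a minimal sketch. `FirstSubMatch` and `DemoLookahead` are hypothetical names made up for this illustration, the two patterns are stripped-down stand-ins for the description and duration parts only, and late binding via `CreateObject` is an assumption chosen so that no library reference is needed.

```vba
' Hypothetical helper (not from the original post): returns the first
' submatch of sPattern applied to sText, or "" if there is no match.
Function FirstSubMatch(ByVal sPattern As String, ByVal sText As String) As String
    Dim RE As Object, MC As Object

    Set RE = CreateObject("VBScript.RegExp")   ' late binding
    RE.Pattern = sPattern
    RE.IgnoreCase = True

    Set MC = RE.Execute(sText)
    ' & "" converts an unfilled submatch into a zero-length string
    If MC.Count > 0 Then FirstSubMatch = MC(0).SubMatches(0) & ""
End Function

Sub DemoLookahead()
    ' Lazy description followed only by an optional duration
    Const LAZY_ONLY As String = "^(.+?)(?:\s+for\s+\d+\s*hours?)?"
    ' Same, but the description must end just before " for " or at end of text
    Const WITH_LOOKAHEAD As String = "^(.+?(?=\s+for\s+|$))(?:\s+for\s+\d+\s*hours?)?"

    ' Nothing forces the lazy group to grow, so it stops after one character
    Debug.Print FirstSubMatch(LAZY_ONLY, "Happy Hour for 1 hour")        ' H
    ' The lookahead makes the group run up to " for " ...
    Debug.Print FirstSubMatch(WITH_LOOKAHEAD, "Happy Hour for 1 hour")   ' Happy Hour
    ' ... or to the end of the text when there is no duration
    Debug.Print FirstSubMatch(WITH_LOOKAHEAD, "Happy Hour")              ' Happy Hour
    ' "for" without whitespace on both sides is not treated as a delimiter
    Debug.Print FirstSubMatch(WITH_LOOKAHEAD, "Forum meetup")            ' Forum meetup
End Sub
```

Run `DemoLookahead` from the Immediate window to compare the two patterns. The first line of output shows the same one-character description reported in the question.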
    Option Explicit
    'Set a reference to Microsoft VBScript Regular Expressions 5.5

    Sub ParseAppt()
        Dim R As Range, C As Range
        Dim RE As RegExp, MC As MatchCollection
        Dim I As Long

        'Data starts in A2 and runs to the last used cell in column A
        Set R = Range("a2", Cells(Rows.Count, "A").End(xlUp))

        Set RE = New RegExp
        With RE
            'Groups: 1 time, 2 am/pm, 3 time zone, 4 description, 5 hours
            .Pattern = "((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*([ap]m)?\s*([ECMP][DS]T)?\s*(.*?(?=\s+for\s+|$))(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?"
            .IgnoreCase = True

            For Each C In R
                If .Test(C.Text) = True Then
                    Set MC = .Execute(C.Text)
                    'Write the five submatches to the five columns to the right
                    For I = 0 To 4
                        C.Offset(0, I + 1) = MC(0).SubMatches(I)
                    Next I
                End If
            Next C
        End With
    End Sub
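As mentioned, the submatches could go into an array instead of cells. The following is only a sketch of that variant: `ParseApptText`, `DemoParseApptText` and the `sDefaultHours` fallback are hypothetical and not part of the routine above. It uses late binding, and it writes the time zone class as `[ECMP]` (Eastern, Central, Mountain, Pacific), following the pattern in the question.

```vba
' Hypothetical helper: parses one line of text and returns a 0-based array
' holding time, am/pm, time zone, description, hours.
' Unfilled optional submatches come back as "", except hours, which falls
' back to sDefaultHours (an assumed default, not something the post specifies).
Function ParseApptText(ByVal sText As String, _
                       Optional ByVal sDefaultHours As String = "1") As Variant
    Dim RE As Object, MC As Object
    Dim aParts(0 To 4) As String
    Dim I As Long

    Set RE = CreateObject("VBScript.RegExp")   ' late binding: no reference needed
    RE.IgnoreCase = True
    RE.Pattern = "((?:1[0-2]|0?[1-9])(?::[0-5]\d)?)\s*([ap]m)?\s*([ECMP][DS]T)?\s*" & _
                 "(.*?(?=\s+for\s+|$))(?:\s+for\s+(\d+(?:\.\d+)?)\s*hour)?"

    Set MC = RE.Execute(sText)
    If MC.Count = 0 Then Exit Function         ' result stays Empty when nothing matches

    For I = 0 To 4
        ' & "" turns an unfilled submatch into a zero-length string
        aParts(I) = MC(0).SubMatches(I) & ""
    Next I
    If Len(aParts(4)) = 0 Then aParts(4) = sDefaultHours

    ParseApptText = aParts
End Function

Sub DemoParseApptText()
    Dim v As Variant

    v = ParseApptText("5pm CST Happy Hour")
    If Not IsEmpty(v) Then Debug.Print Join(v, "|")   ' 5|pm|CST|Happy Hour|1
End Sub
```

Because every element is a string after the `& ""` step, the calling code can test `Len(...) = 0` to see which optional parts were absent before building the appointment.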