Home » excel » Regex group string where the delimiter can be used twice (.net regex)

Regex group string where the delimiter can be used twice (.net regex)

Posted by: admin April 23, 2020 Leave a comment

Questions:

I am writing a parser for excel that can update values in the document. I am currently parsing the header/footer part of the spreadsheet document format. The format of a header/footer in excel is stored as plain text, delimited by:

  • &L
  • &C
  • &R

So your header / footer might look like this in the xml:

&LTodaysDate&CDocumentTitle&RAuthors Name

If you have only a left and right header your xml string would look like:

&LTodaysDate&RAuthors Name

I have tried to create a pattern that can detect each of thee groups and parse out the component (ie &L,&C,&R) along with any text that occurs after that tag.

The regex string is this: (&.{1})([A-Za-z\d_ ]*) (Link to example)

However I have a fringe case problem that will mean I can’t properly parse excel header that contain ampersands.

In an excel header for your document to have an ampersand in the title (this is in plain text) you must type &&. So the xml of a header with an ampersands might look like:

&RPork && Beans (which would display “Pork & Beans” in the spreadsheet).

My regex is not able to cope with the premature ampersand. In the first group ((&.{1})) I ask for anything that has an ampersand and the character that follows it (ie an L/C/R). How can I tell this group to not include when there are 2 ampersands. My regex skills are pretty novice, I can describe what I want at a higher level:

I want to split the string wherever I see &L/&C/&R and capture all text after this, up to another &L/&C/&R delimiter (excluding new line spaces etc). I can best describe this in C# linq below.

(&.{1}.Where(c => c != '&'))([A-Za-z\d_ ]*)

For the string “&RPork && Beans”

my regex captures 2 matches each with 2 groups:

match 1
group 1: “&R”
group 2: “Pork “

match 2
group 1: “&&”
group 2: ” Beans”

and I’d want it to match once:
group 1: “&R”
group 2: “Pork && Beans”

Thanks for the help

How to&Answers:

You may use

var result = Regex.Split(s, "(&[LRC])").Where(x => !string.IsNullOrWhiteSpace(x));

See the regex demo. The (&[LRC]) will match & and a L, R or C letter after it, this value will be extracted into the resulting array due to the capturing parentheses.

enter image description here

Another usage example:

var s = "&RPork && Beans&CDocument Title";
var result = Regex.Split(s, "(&[LRC])")
        .Where(x => !string.IsNullOrWhiteSpace(x))
        .ToList();
var data = result.Where((c,i) => i % 2 == 0).Zip(result.Where((c,i) => i % 2 != 0),
        (delimiter, value) => new KeyValuePair<string, string>(delimiter, value));
foreach (var kvp in data)
    Console.WriteLine("Delimiter: {0}\nValue: {1}", kvp.Key, kvp.Value);

Output:

Delimiter: &R
Value: Pork && Beans
Delimiter: &C
Value: Document Title

Answer:

upon my understanding to the case i have written regex that matches what you need (Link to example
)

and this is the expression :

(&(?=[RCL])[RCL]{1})([A-Za-z\d_ ](&(?![RCL]))[A-Za-z\d_ ])