Home » Php » php – Regex – get elements to render if statement

php – Regex – get elements to render if statement

Posted by: admin July 12, 2020 Leave a comment

Questions:

I’m designing a script and trying to get to the if construct without eval in php.

Still incomplete but blasting through, it’s to do a templating engine, the “if” part of the engine. no Assignment operators allowed, but I need to test values without allowing php code injections, precisely not using eval It’ll need to do individual operations between variables preventing injection attacks.

Regex must capture

[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output
[elseif:('b'+'atman'='batman')]
    output2
[elseif:('b'+'atman'='batman')]
    output3
[elseif:('b'+'atman'='batman')]
    output4
[else]
    output5
[endif]

[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output6
[else]
    output7
[endif]

The following works to get the if, elseif, else and endif blocks along with the condition statements:

$regex = '^\h*\[if:(.*)\]\R(?<if>(?:(?!\[elseif)[\s\S])+)\R^\h*\[elseif:(.*)\]\R(?<elseif>(?:(?!\[else)[\s\S])+)\R^\h*\[else.*\]\R(?<else>(?:(?!\[endif)[\s\S])+)\R^\[endif\]~xm';

Please help in having optional elseif and else.

then with the condition statement, I can get the operations with:

$regex = '~([^\^=<>+\-%/!&|()*]+)([\^+\-%/!|&*])([^\^=<>+\-%/!&|()*]*)~';

however, it’ll only pair them, missing each 3rd operator…

Thanks for your help.

How to&Answers:

(edit Added a simple if/elseif body parsing regex at the bottom)

Using PCRE, I think this regex recursion should handle nested
if/elseif/else/endif constructs.

In it’s current form it is a loose parse in that it doesn’t define
very well the form of the [if/elseif: body ].
For instance, is [if: the beginning delimiter construct and ] the end?
And should an error occur, etc.. It could be done this way if needing a strict parse.
Right now it basically is using [if: body ] as a the beginning delimiter
and [endif] as the end delimiter to find nesting constructs.

Also, it loosely defines body as [^\]]* which, in a serious parsing
situation, would have to be fleshed out to account for quotes and stuff.
Like I said, breaking it apart like that is doable, but is much more
involved. I’ve done this on a language level, and it’s not trivial.

There is a host language usage pseudocode sample on the bottom.
The language recursion demonstrates how to extract nested content
correctly.

The regex matches the current outter shell of a core. Where the core
is the inner nested content.

Each call to ParseCore() is initiated inside ParseCore() itself
(except for the initial call from main().

Since scoping seems unspecified, I’ve made assumptions that can be seen
littered in comments.

There is a placeholder for the if/elseif body that is captured that
can then be parsed for the (operations) portion which is really part 2
of this exercise I haven’t gotten around to doing yet.
Note – I will try to do this, but I don’t have the time today.

Let me know if you have any questions..

(?s)(?:(?<Content>(?&_content))|\[elseif:(?<ElseIf_Body>(?&_ifbody)?)\]|(?<Else>(?&_else))|(?<Begin>\[if:(?<If_Body>(?&_ifbody)?)\])(?<Core>(?&_core)|)(?<End>\[endif\])|(?<Error>(?&_keyword)))(?(DEFINE)(?<_ifbody>(?>[^\]])+)(?<_core>(?>(?<_content>(?>(?!(?&_keyword)).)+)|(?(<_else>)(?!))(?<_else>(?>\[else\]))|(?(<_else>)(?!))(?>\[elseif:(?&_ifbody)?\])|(?>\[if:(?&_ifbody)?\])(?:(?=.)(?&_core)|)\[endif\])+)(?<_keyword>(?>\[(?:(?:if|elseif):(?&_ifbody)?|endif|else)\])))

Formatted and tested:

 (?s)                               # Dot-all modifier

 # =====================
 # Outter Scope
 # ---------------

 (?:
      (?<Content>                        # (1), Non-keyword CONTENT
           (?&_content) 
      )
   |                                   # OR,
      # --------------
      \[ elseif:                         # ELSE IF
      (?<ElseIf_Body>                    # (2), else if body
           (?&_ifbody)? 
      )
      \]
   |                                   # OR
      # --------------
      (?<Else>                           # (3), ELSE
           (?&_else) 
      )
   |                                   # OR
      # --------------
      (?<Begin>                          # (4), IF
           \[ if: 
           (?<If_Body>                        # (5), if body
                (?&_ifbody)? 
           )
           \]
      )
      (?<Core>                           # (6), The CORE
           (?&_core) 
        |  
      )
      (?<End>                            # (7)
           \[ endif \]                        # END IF
      )
   |                                   # OR
      # --------------
      (?<Error>                          # (8), Unbalanced If, ElseIf, Else, or End
           (?&_keyword) 
      )
 )

 # =====================
 #  Subroutines
 # ---------------

 (?(DEFINE)

      # __ If Body ----------------------
      (?<_ifbody>                        # (9)
           (?> [^\]] )+
      )

      # __ Core -------------------------
      (?<_core>                          # (10)
           (?>
                #
                # __ Content ( non-keywords )
                (?<_content>                       # (11)
                     (?>
                          (?! (?&_keyword) )
                          . 
                     )+
                )
             |  
                #
                # __ Else
                # Guard:  Only 1 'else'
                # allowed in this core !!

                (?(<_else>)
                     (?!)
                )
                (?<_else>                          # (12)
                     (?> \[ else \] )
                )
             |  
                #
                # __ ElseIf
                # Guard:  Not Else before ElseIf
                # allowed in this core !!

                (?(<_else>)
                     (?!)
                )
                (?>
                     \[ elseif:
                     (?&_ifbody)? 
                     \]
                )
             |  
                #
                # IF  (block start)
                (?>
                     \[ if: 
                     (?&_ifbody)? 
                     \]
                )
                # Recurse core
                (?:
                     (?= . )
                     (?&_core) 
                  |  
                )
                # END IF  (block end)
                \[ endif \] 
           )+
      )

      # __ Keyword ----------------------
      (?<_keyword>                       # (13)
           (?>
                \[ 
                (?:
                     (?: if | elseif )
                     : (?&_ifbody)? 
                  |  endif
                  |  else
                )
                \]
           )
      )
 )

Host language pseudo-code

 bool bStopOnError = false;
 regex RxCore("....."); // Above regex ..

 bool ParseCore( string sCore, int nLevel )
 {
     // Locals
     bool bFoundError = false;
     bool bBeforeElse = true;
     match _matcher;

     while ( search ( core, RxCore, _matcher ) )
     {
       // Content
         if ( _matcher["Content"].matched == true )
           // Print non-keyword content
           print ( _matcher["Content"].str() );

           // OR, Analyze content.
           // If this 'content' has error's and wish to return.
           // if ( bStopOnError )
           //   bFoundError = true;

         else

       // ElseIf
         if ( _matcher["ElseIf_Body"].matched == true )
         {
             // Check if we are not in a recursion
             if ( nLevel <= 0 )
             {
                // Report error, this 'elseif' is outside an 'if/endif' block
                // ( note - will only occur when nLevel == 0 )
                print ("\n>> Error, 'elseif' not in block, body = " + _matcher["ElseIf_Body"].str() + "\n";

                // If this 'else' error will stop the process.
                if ( bStopOnError == true )
                   bFoundError = true;
             }
             else
             {
                 // Here, we are inside a core recursion.
                 // That means we have not hit an 'else' yet
                 // because all elseif's precede it.
                 // Print 'elseif'.
                 print ( "ElseIf: " );

                 // TBD - Body regex below
                 // Analyze the 'elseif' body.
                 // This is where it's body is parsed.
                 // Use body parsing (operations) regex on it.
                 string sElIfBody = _matcher["ElseIf_Body"].str() );

                // If this 'elseif' body error will stop the process.
                if ( bStopOnError == true )
                   bFoundError = true;
             }
         }


       // Else
         if ( _matcher["Else"].matched == true )
         {
             // Check if we are not in a recursion
             if ( nLevel <= 0 )
             {
                // Report error, this 'else' is outside an 'if/endif' block
                // ( note - will only occur when nLevel == 0 )
                print ("\n>> Error, 'else' not in block\n";

                // If this 'else' error will stop the process.
                if ( bStopOnError == true )
                   bFoundError = true;
             }
             else
             {
                 // Here, we are inside a core recursion.
                 // That means there can only be 1 'else' within
                 // the relative scope of a single core.
                 // Print 'else'.
                 print ( _matcher["Else"].str() );

                 // Set the state of 'else'.
                 bBeforeElse == false;
             }
         }

         else

       // Error ( will only occur when nLevel == 0 )
         if ( _matcher["Error"].matched == true )
         {
             // Report error
             print ("\n>> Error, unbalanced " + _matcher["Error"].str() + "\n";
             // // If this unbalanced 'if/endif' error will stop the process.
             if ( bStopOnError == true )
                 bFoundError = true;
         }

         else

       // If/EndIf block
         if ( _matcher["Begin"].matched == true )
         {
             // Print 'If'
             print ( "If:" );

             // Analyze 'if body' for error and wish to return.

             // TBD - Body regex below.
             // Analyze the 'if' body.
             // This is where it's body is parsed.
             // Use body parsing (operations) regex on it.
             string sIfBody = _matcher["If_Body"].str() );

             // If this 'if' body error will stop the process.
              if ( bStopOnError == true )
                  bFoundError = true;
              else
              {

                  //////////////////////////////
                  // Recurse a new 'core'
                  bool bResult = ParseCore( _matcher["Core"].str(), nLevel+1 );
                  //////////////////////////////

                  // Check recursion result. See if we should unwind.
                  if ( bResult == false && bStopOnError == true )
                      bFoundError = true;
                  else
                      // Print 'end'
                      print ( "EndIf" );
              }
         }

         else
         {
            // Reserved placeholder, won't get here at this time.
         }

       // Error-Return Check
         if ( bFoundError == true && bStopOnError == true )
              return false;
     }

     // Finished this core!! Return true.
     return true;
 }

 ///////////////////////////////
 // Main

 string strInitial = "...";

 bool bResult = ParseCore( strInitial, 0 );
 if ( bResult == false )
    print ( "Parse terminated abnormally, check messages!\n" );

Output sample of the outter core matches
Note there will be many more matches when the inner core’s are matched.

 **  Grp 0               -  ( pos 0 , len 211 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output
[elseif:('b'+'atman'='batman')]
    output2
[elseif:('b'+'atman'='batman')]
    output3
[elseif:('b'+'atman'='batman')]
    output4
[else]
    output5
[endif]  
 **  Grp 1 [Content]     -  NULL 
 **  Grp 2 [ElseIf_Body] -  NULL 
 **  Grp 3 [Else]        -  NULL 
 **  Grp 4 [Begin]       -  ( pos 0 , len 31 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]  
 **  Grp 5 [If_Body]     -  ( pos 4 , len 26 ) 
(a+b-c/d*e)|(x-y)&!(z%3=0)  
 **  Grp 6 [Core]        -  ( pos 31 , len 173 ) 

    output
[elseif:('b'+'atman'='batman')]
    output2
[elseif:('b'+'atman'='batman')]
    output3
[elseif:('b'+'atman'='batman')]
    output4
[else]
    output5

 **  Grp 7 [End]         -  ( pos 204 , len 7 ) 
[endif]  
 **  Grp 8 [Error]       -  NULL 
 **  Grp 9 [_ifbody]     -  NULL 
 **  Grp 10 [_core]       -  NULL 
 **  Grp 11 [_content]    -  NULL 
 **  Grp 12 [_else]       -  NULL 
 **  Grp 13 [_keyword]    -  NULL 

-----------------------------

 **  Grp 0               -  ( pos 211 , len 4 ) 



 **  Grp 1 [Content]     -  ( pos 211 , len 4 ) 



 **  Grp 2 [ElseIf_Body] -  NULL 
 **  Grp 3 [Else]        -  NULL 
 **  Grp 4 [Begin]       -  NULL 
 **  Grp 5 [If_Body]     -  NULL 
 **  Grp 6 [Core]        -  NULL 
 **  Grp 7 [End]         -  NULL 
 **  Grp 8 [Error]       -  NULL 
 **  Grp 9 [_ifbody]     -  NULL 
 **  Grp 10 [_core]       -  NULL 
 **  Grp 11 [_content]    -  NULL 
 **  Grp 12 [_else]       -  NULL 
 **  Grp 13 [_keyword]    -  NULL 

-----------------------------

 **  Grp 0               -  ( pos 215 , len 74 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
    output6
[else]
    output7
[endif]  
 **  Grp 1 [Content]     -  NULL 
 **  Grp 2 [ElseIf_Body] -  NULL 
 **  Grp 3 [Else]        -  NULL 
 **  Grp 4 [Begin]       -  ( pos 215 , len 31 ) 
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]  
 **  Grp 5 [If_Body]     -  ( pos 219 , len 26 ) 
(a+b-c/d*e)|(x-y)&!(z%3=0)  
 **  Grp 6 [Core]        -  ( pos 246 , len 36 ) 

    output6
[else]
    output7

 **  Grp 7 [End]         -  ( pos 282 , len 7 ) 
[endif]  
 **  Grp 8 [Error]       -  NULL 
 **  Grp 9 [_ifbody]     -  NULL 
 **  Grp 10 [_core]       -  NULL 
 **  Grp 11 [_content]    -  NULL 
 **  Grp 12 [_else]       -  NULL 
 **  Grp 13 [_keyword]    -  NULL 

This is the If/ElseIf Body regex

Raw

(?|((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)([\^+\-%/*=]+)(?=\s*[^\^=<>+\-%/!&|()\[\]*\s])|\G(?!^)(?<=[\^+\-%/*=])((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)())

Stringed

'~(?|((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)([\^+\-%/*=]+)(?=\s*[^\^=<>+\-%/!&|()\[\]*\s])|\G(?!^)(?<=[\^+\-%/*=])((?:\s*[^\^=<>+\-%/!&|()\[\]*\s]\s*)+)())~'

Expanded

 (?|                                           # Branch Reset
      (                                             # (1 start), Operand
           (?: \s* [^\^=<>+\-%/!&|()\[\]*\s] \s* )+
      )                                             # (1 end)
      ( [\^+\-%/*=]+ )                              # (2), Forward Operator
      (?= \s* [^\^=<>+\-%/!&|()\[\]*\s] )
   |  
      \G 
      (?! ^ )
      (?<= [\^+\-%/*=] )
      (                                             # (1 start), Last Operand
           (?: \s* [^\^=<>+\-%/!&|()\[\]*\s] \s* )+
      )                                             # (1 end)
      ( )                                           # (2), Last-Empty Forward Operator
 )

Here is how this operates:
Assumes very simple constructs.
This will just parse the math operand/operator stuff.
It won’t parse any enclosing parenthesis blocks, nor any logic or math
operators in between.

If needed, parse any parenthesis blocks ahead of time, i.e. \( [^)* \) or
similar. Or split on the logic operators like |.

The body regex uses a branch reset to get the operand/operator sequence.
It always matches two things.
Group 1 contains the operand, group 2 the operator.

If group 2 is empty, group 1 is the last operand in the sequence.

Valid operators are ^ + - % / * =.
The equals = is included because it separates cluster of operations
and can just be noted as a separation.

The conclusion about this body regex is that it is very simple and
only suited for simple usage. Anything more of a complexity is involved
this won’t be the way to go.

Input/Output Sample 1:

(a+b-c/d*e)

 **  Grp 1 -  ( pos 1 , len 1 ) 
a  
 **  Grp 2 -  ( pos 2 , len 1 ) 
+  
------------
 **  Grp 1 -  ( pos 3 , len 1 ) 
b  
 **  Grp 2 -  ( pos 4 , len 1 ) 
-  
------------
 **  Grp 1 -  ( pos 5 , len 1 ) 
c  
 **  Grp 2 -  ( pos 6 , len 1 ) 
/  
------------
 **  Grp 1 -  ( pos 7 , len 1 ) 
d  
 **  Grp 2 -  ( pos 8 , len 1 ) 
*  
------------
 **  Grp 1 -  ( pos 9 , len 1 ) 
e  
 **  Grp 2 -  ( pos 10 , len 0 )  EMPTY 

Input/Output Sample 2:

('b'+'atman'='batman')

 **  Grp 1 -  ( pos 1 , len 3 ) 
'b'  
 **  Grp 2 -  ( pos 4 , len 1 ) 
+  
------------
 **  Grp 1 -  ( pos 5 , len 7 ) 
'atman'  
 **  Grp 2 -  ( pos 12 , len 1 ) 
=  
------------
**  Grp 1 -  ( pos 13 , len 8 ) 
'batman'  
 **  Grp 2 -  ( pos 21 , len 0 )  EMPTY 

Answer:

You have different possibilities here.

The regex version

^\h*\[if.*\]\R                        # if in the first line
(?<if>(?:(?!\[elseif)[\s\S])+)\R      # output
^\h*\[elseif.*\]\R                    # elseif
(?<elseif>(?:(?!\[else)[\s\S])+)\R    # output
^\h*\[else.*\]\R                      # elseif
(?<else>(?:(?!\[endif)[\s\S])+)\R     # output
^\[endif\]

Afterwards, you have three named captured groups (if, elseif and else).
See a demo for this one on regex101.com.

In PHP, this would be:

<?php
$code = <<<EOF
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
output
[elseif:('b'+'atman'='batman')]
output2
out as well
[else]
output3
some other output here
[endif]
EOF;

$regex = '~
            ^\h*\[if.*\]\R                        # if in the first line
            (?<if>(?:(?!\[elseif)[\s\S])+)\R      # output
            ^\h*\[elseif.*\]\R                    # elseif
            (?<elseif>(?:(?!\[else)[\s\S])+)\R    # output
            ^\h*\[else.*\]\R                      # elseif
            (?<else>(?:(?!\[endif)[\s\S])+)\R     # output
            ^\[endif\]
          ~xm';

preg_match_all($regex, $code, $parts);
print_r($parts);
?>

Programming logic

Perhaps it would be better to skim the lines and look for [if...], capture anything up to [elseif...] in a string and glue them together afterwards.

<?php

$code = <<<EOF
[if:(a+b-c/d*e)|(x-y)&!(z%3=0)]
output
[elseif:('b'+'atman'='batman')]
output2
out as well
[else]
output3
some other output here
[endif]
EOF;

// functions, shamelessly copied from http://stackoverflow.com/questions/834303/startswith-and-endswith-functions-in-php
function startsWith($haystack, $needle) {
    // search backwards starting from haystack length characters from the end
    return $needle === "" || strrpos($haystack, $needle, -strlen($haystack)) !== false;
}

function endsWith($haystack, $needle) {
    // search forward starting from end minus needle length characters
    return $needle === "" || (($temp = strlen($haystack) - strlen($needle)) >= 0 && strpos($haystack, $needle, $temp) !== false);
}

$code = explode("\n", $code);
$buffer = array("if" => null, "elseif" => null, "else" => null);

$pointer = false;
for ($i=0;$i<count($code);$i++) {
	$save = true;
	if (startsWith($code[$i], "[if")) {$pointer = "if"; $save = false;}
	elseif (startsWith($code[$i], "[elseif")) {$pointer = "elseif"; $save = false; }
	elseif (startsWith($code[$i], "[else")) {$pointer = "else"; $save = false; }
	elseif (startsWith($code[$i], "[endif")) {$pointer = false; $save = false; }

	if ($pointer && $save) $buffer[$pointer] .= $code[$i] . "\n";

}
print_r($buffer);

?>