
Search HTML for 2 phrases (ignoring all tags) and strip everything else

Posted by: admin November 29, 2017

Questions:

I have HTML code stored in a string, for example:

$html = '
        <html>
        <body>
        <p>Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.</p>
        </body>
        </html>
        ';

Then I have two sentences stored in variables:

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

I want to search $html for these two sentences, and strip everything before and after them. So $html will become:

$html = 'Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.';

How can I achieve this? Note that the $begin and $end variables do not have html tags but the sentences in $html very likely do have tags as shown above.

Maybe a regex approach?

What I’ve tried so far

  • A strpos() approach. The problem is that $html contains tags in the sentences, making the $begin and $end sentences not match. I can strip_tags($html) before running strpos(), but then I will obviously end up with $html without the tags.

  • Searching for only part of the variable, such as Hello, but that is never safe and will give many matches.
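The problem with the first attempt can be reproduced in a few lines. This is only an illustrative sketch: the shortened $html and $begin values are hypothetical stand-ins for the data above.

```php
<?php
// Hypothetical sample data, modelled on the question.
$html  = '<p>Hello <em>World</em>!</p>';
$begin = 'Hello World!';

// The tags interrupt the sentence, so a direct search fails.
var_dump(strpos($html, $begin));   // bool(false)

// After stripping the tags the sentence is found, but the markup is gone.
$plain = strip_tags($html);
var_dump($plain === 'Hello World!');   // bool(true)
var_dump(strpos($plain, $begin));      // int(0)
```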

Answers:

Here is a short and, I believe, working solution based on lazy dot matching. It could be improved by creating a longer, unrolled regex, but it should be enough unless you have really large chunks of text.

$html = "<html>\n<body>\n<p><p>H<div>ello</div><script></script> <em>進&nbsp;&nbsp;&nbsp;撃の巨人</em>!</p>\nrandom code\nrandom code\n<p>Lorem <span>ipsum<span>.</p>\n</body>\n </html>";
$begin = 'Hello     進撃の巨人!';
$end = 'Lorem ipsum.';
$begin = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $begin);
$end = preg_replace_callback('~\s++(?!\z)|(\s++\z)~u', function ($m) { return !empty($m[1]) ? '' : ' '; }, $end);
$begin_arr = preg_split('~(?=\X)~u', $begin, -1, PREG_SPLIT_NO_EMPTY);
$end_arr = preg_split('~(?=\X)~u', $end, -1, PREG_SPLIT_NO_EMPTY);
$reg = "(?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*" .  implode("", array_map(function($x, $k) use ($begin_arr) { return ($k < count($begin_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\w+;))*" : preg_quote($x, "~"));}, $begin_arr, array_keys($begin_arr)))
        . "(.*?)" . 
        implode("", array_map(function($x, $k) use ($end_arr) { return ($k < count($end_arr) - 1 ? preg_quote($x, "~") . "(?:\s*(?:<[^<>]+>|&#?\w+;))*" : preg_quote($x, "~"));}, $end_arr, array_keys($end_arr))); 
echo $reg .PHP_EOL;
preg_match('~' . $reg . '~u', $html, $m);
print_r($m[0]);

See the IDEONE demo

Algorithm:

  • Build the regex pattern dynamically: split each delimiter string into single graphemes (since these can be Unicode characters, I suggest using preg_split('~(?<!^)(?=\X)~u', $end)), then implode them back with an optional tag-matching pattern such as (?:<[^<>]+>)? between them.
  • Then, (?s) enables DOTALL mode, in which . matches any character including a newline, and .*? matches 0+ characters between the leading and trailing delimiters.
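The two steps can be sketched in a reduced form. This is not the full code above: the helper name buildTolerantPattern() is hypothetical, the sample HTML is shortened, and entities and whitespace variations are deliberately ignored.

```php
<?php
// Hypothetical helper: turn a plain sentence into a tag-tolerant pattern.
function buildTolerantPattern(string $text): string
{
    // Split into graphemes (\X) so that multibyte characters stay intact.
    $graphemes = preg_split('~(?<!^)(?=\X)~u', $text);
    // Escape each grapheme for use inside a ~...~ delimited regex.
    $quoted = array_map(function ($g) { return preg_quote($g, '~'); }, $graphemes);
    // Allow any number of tags between consecutive graphemes.
    return implode('(?:<[^<>]+>)*', $quoted);
}

$html  = "<p>Hello <em>進撃の巨人</em>!</p>\nrandom code\n<p>Lorem <span>ipsum</span>.</p>";
$regex = '~' . buildTolerantPattern('Hello 進撃の巨人!')
       . '.*?'
       . buildTolerantPattern('Lorem ipsum.') . '~su';

if (preg_match($regex, $html, $m)) {
    echo $m[0], PHP_EOL;
}
```

The s flag lets .*? cross the newlines between the two sentences, and the u flag makes the pattern and subject be treated as UTF-8.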

Regex details:

  • ~(?<!^)(?=\X)~u matches every location before a grapheme, except at the start of the string
  • (sample final regex) (?s)(?:<[^<>]+>)?(?:&#?\w+;)*\s*H(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*l(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*進(?:\s*(?:<[^<>]+>|&#?\w+;))*撃(?:\s*(?:<[^<>]+>|&#?\w+;))*の(?:\s*(?:<[^<>]+>|&#?\w+;))*巨(?:\s*(?:<[^<>]+>|&#?\w+;))*人(?:\s*(?:<[^<>]+>|&#?\w+;))*\!(?:\s*(?:<[^<>]+>|&#?\w+;))* + (.*?) + L(?:\s*(?:<[^<>]+>|&#?\w+;))*o(?:\s*(?:<[^<>]+>|&#?\w+;))*r(?:\s*(?:<[^<>]+>|&#?\w+;))*e(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))* (?:\s*(?:<[^<>]+>|&#?\w+;))*i(?:\s*(?:<[^<>]+>|&#?\w+;))*p(?:\s*(?:<[^<>]+>|&#?\w+;))*s(?:\s*(?:<[^<>]+>|&#?\w+;))*u(?:\s*(?:<[^<>]+>|&#?\w+;))*m(?:\s*(?:<[^<>]+>|&#?\w+;))*\. – the leading and trailing delimiters with optional subpatterns for tag matching and a (.*?) (capturing might not be necessary) inside.
  • ~u modifier is necessary since Unicode strings are to be processed.
  • UPDATE: To account for 1+ spaces, any whitespace in the begin and end patterns can be replaced with \s+ subpattern to match any kind of 1+ whitespace characters in the input string.
  • UPDATE 2: The auxiliary $begin = preg_replace('~\s+~u', ' ', $begin); and $end = preg_replace('~\s+~u', ' ', $end); are necessary to account for 1+ whitespace in the input string.
  • To account for HTML entities, add another subpattern to the optional parts: &#?\\w+;. It will also match entities like &nbsp; and &#123;. It is also prepended with \s* to match optional whitespace, and quantified with * (zero or more occurrences).
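The whitespace normalisation and the entity subpattern mentioned in the last bullets can be checked in isolation. A small sketch, with made-up input values:

```php
<?php
// Normalise the search phrase: collapse whitespace runs, then drop leading/trailing whitespace.
$begin      = "Hello   \t 進撃の巨人!  ";
$normalized = trim(preg_replace('~\s+~u', ' ', $begin));
var_dump($normalized === 'Hello 進撃の巨人!');   // bool(true)

// The entity subpattern accepts both named and numeric entities.
$entity = '~^(?:&#?\w+;)+$~';
var_dump(preg_match($entity, '&nbsp;&#123;'));   // int(1)
```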
Answers:

I really wanted to write a regex solution, but others have beaten me to it with some nice and complex ones. So, here is a non-regex solution.

Short explanation: The major problem is keeping HTML tags. We could easily search text, if HTML tags were stripped. So: strip these! We can easily search in the stripped content, and produce a substring we want to cut. Then, try to cut this substring from the HTML while keeping the tags.

Advantages:

  • Searching is easy and independent from HTML, you can search with regex too if you need
  • Requirements are scalable: you can easily add full multibyte support, support for entities and white-space collapse, and so on
  • Relatively fast (although a direct regex may be faster)
  • Does not touch original HTML, and adaptable to other markup languages

A static utility class for this scenario:

class HtmlExtractUtil
{

    const FAKE_MARKUP = '<>';
    const MARKUP_PATTERN = '#<[^>]+>#u';

    static public function extractBetween($html, $startTextToFind, $endTextToFind)
    {
        $strippedHtml = preg_replace(self::MARKUP_PATTERN, '', $html);
        $startPos = strpos($strippedHtml, $startTextToFind);
        $lastPos = strrpos($strippedHtml, $endTextToFind);

        if ($startPos === false || $lastPos === false) {
            return "";
        }

        $endPos = $lastPos + strlen($endTextToFind);
        if ($endPos <= $startPos) {
            return "";
        }

        return self::extractSubstring($html, $startPos, $endPos);
    }

    static public function extractSubstring($html, $startPos, $endPos)
    {
        preg_match_all(self::MARKUP_PATTERN, $html, $matches, PREG_OFFSET_CAPTURE);
        $start = -1;
        $end = -1;
        $previousEnd = 0;
        $stripPos = 0;
        $matchArray = $matches[0];
        $matchArray[] = [self::FAKE_MARKUP, strlen($html)];
        foreach ($matchArray as $match) {
            $diff = $previousEnd - $stripPos;
            $textLength = $match[1] - $previousEnd;
            if ($start == (-1)) {
                if ($startPos >= $stripPos && $startPos < $stripPos + $textLength) {
                    $start = $startPos + $diff;
                }
            }
            if ($end == (-1)) {
                if ($endPos > $stripPos && $endPos <= $stripPos + $textLength) {
                    $end = $endPos + $diff;
                    break;
                }
            }
            $tagLength = strlen($match[0]);
            $previousEnd = $match[1] + $tagLength;
            $stripPos += $textLength;
        }

        if ($start == (-1)) {
            return "";
        } elseif ($end == (-1)) {
            return substr($html, $start);
        } else {
            return substr($html, $start, $end - $start);
        }
    }

}

Usage:

$html = '
<html>
<body>
<p>Any string before</p>
<p>Hello <em>進撃の巨人</em>!</p>
random code
random code
<p>Lorem <span>ipsum<span>.</p>
<p>Any string after</p>
</body>
</html>
';
$startTextToFind = 'Hello 進撃の巨人!';
$endTextToFind = 'Lorem ipsum.';

$extractedText = HtmlExtractUtil::extractBetween($html, $startTextToFind, $endTextToFind);

header("Content-type: text/plain; charset=utf-8");
echo $extractedText . "\n";

Answers:

Regular expressions have their limitations when it comes to parsing HTML. As many have done before me, I will refer to this famous answer.

Potential Problems when relying on Regular Expressions

For instance, imagine this tag appears in the HTML before the part that must be extracted:

<p attr="Hello 進撃の巨人!">This comes before the match</p>

Many regexp solutions will stumble over this, and return a string that starts in the middle of this opening p tag.

Or consider a comment inside the HTML section that has to be matched:

<!-- Next paragraph will display "Lorem ipsum." -->

Or, some loose less-than and greater-than signs appear (let’s say in a comment, or attribute value):

<!-- Next paragraph will display >-> << Lorem ipsum. >> -->
<p data-attr="->->->" class="myclass">

What will those regexes do with that?

These are just examples… there are countless other situations that pose problems to regular expression based solutions.
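To make the first pitfall concrete, here is a small sketch. The pattern is a simplified, hypothetical stand-in for a tag-tolerant regex, not any particular answer's exact expression, and the input is invented for the demonstration.

```php
<?php
// Hypothetical input: the begin sentence also occurs inside an attribute value.
$html = '<p attr="Hello World!">This comes before the match</p>'
      . '<p>Hello <em>World</em>!</p><p>Lorem ipsum.</p>';

// Simplified tag-tolerant pattern (an assumption for illustration only).
$tag   = '(?:<[^<>]+>)*';
$regex = '~Hello ' . $tag . 'World' . $tag . '!.*?Lorem ipsum\.~su';

preg_match($regex, $html, $m);
echo $m[0], PHP_EOL;
// The match starts inside the attribute value of the first <p> tag.
```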

There are more reliable ways to parse HTML.

Load the HTML into a DOM

I will suggest here a solution based on the DOMDocument interface, using this algorithm:

  1. Get the text content of the HTML document and identify the two offsets where both sub strings (begin/end) are located.

  2. Then go through the DOM text nodes, keeping track of the offsets where these nodes fit in. In the nodes where either of the two bounding offsets is crossed, a predefined delimiter (|) is inserted. That delimiter must not already be present in the HTML string, so it is doubled (||, ||||, …) until that condition is met.

  3. Finally split the HTML representation by this delimiter and extract the middle part as the result.

Here is the code:

function extractBetween($html, $begin, $end) {
    $dom = new DOMDocument();
    // Load HTML in DOM, making sure it supports UTF-8; double HTML tags are no problem
    $dom->loadHTML('<html><head>
            <meta http-equiv="content-type" content="text/html; charset=utf-8">
        </head></html>' . $html);
    // Get complete text content
    $text = $dom->textContent;
    // Get positions of the beginning/ending text; exit if not found.
    if (($from = strpos($text, $begin)) === false) return false;
    if (($to = strpos($text, $end, $from + strlen($begin))) === false) return false;
    $to += strlen($end);
    // Define a non-occurring delimiter by repeating `|` enough times:
    for ($delim = '|'; strpos($html, $delim) !== false; $delim .= $delim);
    // Use XPath to traverse the DOM
    $xpath = new DOMXPath($dom);
    // Go through the text nodes keeping track of total text length.
    // When exceeding one of the two offsets, inject a delimiter at that position.
    $pos = 0;
    foreach($xpath->evaluate("//text()") as $node) {
        // Add length of node's text content to total length
        $newpos = $pos + strlen($node->nodeValue);
        while ($newpos > $from || ($from === $to && $newpos === $from)) {
            // The beginning/ending text starts/ends somewhere in this text node.
            // Inject the delimiter at that position:
            $node->nodeValue = substr_replace($node->nodeValue, $delim, $from - $pos, 0);
            // If a delimiter was inserted at both beginning and ending texts,
            // then get the HTML and return the part between the delimiters
            if ($from === $to) return explode($delim, $dom->saveHTML())[1];
            // Delimiter was inserted at beginning text. Now search for ending text
            $from = $to;
        }
        $pos = $newpos;
    }
}

You would call it like this:

// Sample input data
$html = '
        <html>
        <body>
        <p>This comes before the match</p>
        <p>Hey! Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>. la la la</p>
        <p>This comes after the match</p>
        </body>
        </html>
        ';

$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

// Call
$html = extractBetween($html, $begin, $end);

// Output result
echo $html;

Output:

Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.

You’ll find this code is also easier to maintain than regex alternatives.

See it run on eval.in.

Answers:

This is probably far from the optimal solution, but I love racking my brain over such “riddles”, so here’s my approach.

<?php
$subject = ' <html> 
<body> 
<p>He<i>l</i>lo <em>Lydia</em>!</p> 
random code 
random code 
<p>Lorem <span>ipsum</span>.</p> 
</body> 
</html>';

$begin = 'Hello Lydia!';
$end = 'Lorem ipsum.';

$begin_chars = str_split($begin);
$end_chars = str_split($end);

$begin_re = '';
$end_re = '';

foreach ($begin_chars as $c) {
    if ($c == ' ') {
        $begin_re .= '(\s|(<[a-z/]+>))+';
    }
    else {
        $begin_re .= $c . '(<[a-z/]+>)?';
    }
}
foreach ($end_chars as $c) {
    if ($c == ' ') {
        $end_re .= '(\s|(<[a-z/]+>))+';
    }
    else {
        $end_re .= $c . '(<[a-z/]+>)?';
    }
}

$re = '~(.*)((' . $begin_re . ')(.*)(' . $end_re . '))(.*)~ms';

$result = preg_match( $re, $subject , $matches );
$start_tag = preg_match( '~(<[a-z/]+>)$~', $matches[1] , $stmatches );

echo $stmatches[1] . $matches[2];

This outputs:

<p>He<i>l</i>lo <em>Lydia</em>!</p> 
random code 
random code 
<p>Lorem <span>ipsum</span>.</p>

This matches the case above, but I think it would need some more logic to escape regex special characters such as periods.

In general, what this snippet does:

  • It splits the strings into arrays, with each array value representing a single character. This needs to be done because Hello needs to match He<i>l</i>lo as well.
  • To do that, for the regex part an additional (<[a-z/]+>)? is inserted after each character with a special case for the space character.
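The missing escaping could be added with preg_quote(). The following is a hedged sketch of that idea, not a drop-in replacement for the snippet above; it also splits on UTF-8 characters instead of bytes, which is an extra assumption on my part.

```php
<?php
$end = 'Lorem ipsum.';

// Without escaping, the final "." is a wildcard and matches any character.
var_dump(preg_match('~' . $end . '~', 'Lorem ipsumX'));                    // int(1)

// preg_quote() escapes regex metacharacters (and the delimiter given as 2nd argument).
var_dump(preg_match('~' . preg_quote($end, '~') . '~', 'Lorem ipsumX'));   // int(0)

// Per-character variant: split on UTF-8 characters, escape each one,
// then allow one optional tag after it (spaces get the special case).
$chars  = preg_split('//u', $end, -1, PREG_SPLIT_NO_EMPTY);
$end_re = '';
foreach ($chars as $c) {
    $end_re .= ($c === ' ')
        ? '(?:\s|<[a-z/]+>)+'
        : preg_quote($c, '~') . '(?:<[a-z/]+>)?';
}
var_dump(preg_match('~' . $end_re . '~', '<p>Lorem <span>ipsum</span>.</p>')); // int(1)
```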
Answers:

You could try this RegEx:

(.*?)  # Data before sentences (to be removed)
(      # Capture Both sentences and text in between
  H.*?e.*?l.*?l.*?o.*?\s    # Hello[space]
  (<.*?>)*                  # Optional Opening Tag(s)
  進.*?撃.*?の.*?巨.*?人.*?   # 進撃の巨人
  (<\/.*?>)*                # Optional Closing Tag(s)
  (.*?)                     # Optional Data in between sentences
  (<.*?>)*                  # Optional Opening Tag(s)
  L.*?o.*?r.*?e.*?m.*?\s    # Lorem[space]
  (<.*?>)*                  # Optional Opening Tag(s)
  i.*?p.*?s.*?u.*?m.*?      # ipsum
)
(.*)   # Data after sentences (to be removed)

Substitute the match with the 2nd capture group.

Live Demo on Regex101

The Regex can be shortened to:

(.*?)(H.*?e.*?l.*?l.*?o.*?\s(<.*?>)*進.*?撃.*?の.*?巨.*?人.*?(<\/.*?>)*(.*?)(<.*?>)*L.*?o.*?r.*?e.*?m.*?\s(<.*?>)*i.*?p.*?s.*?u.*?m.*?)(.*)
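One caveat worth checking before relying on this kind of pattern: a lazy .*? between every letter accepts arbitrary text, not just tags. A minimal sketch with an invented input string:

```php
<?php
// The letters of "Hello" with lazy wildcards in between, as in the pattern above.
$regex = '~H.*?e.*?l.*?l.*?o~s';

// Unrelated (hypothetical) text that merely contains those letters in order also matches.
preg_match($regex, 'Here we will go', $m);
var_dump($m[0]);   // string(15) "Here we will go"
```

Restricting the filler to tags, for example (?:<[^<>]+>)* instead of .*?, avoids this particular false positive.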

Answers:

Just for fun

<?php
$begin = 'Hello Moto!';
$end = 'Lorem ipsum.';
//https://regex101.com/r/mC8aO6/1
$re = "/[\w\W]/"; 
$str = $begin.$end; 
$subst = "$0.*?"; 

$result = preg_replace($re, $subst, $str);
//Hello Moto! 
//to
//H.*?e.*?l.*?l.*?o.*? .*?M.*?o.*?t.*?o.*?!.*?

//https://regex101.com/r/fS6zG2/1
$re = "/(\!|\.\.)/"; 
$str = $result; 
$subst = "\\$1";

$result = preg_replace($re, $subst, $str);

$re = "/.*(<p.*?$result.*?p>).*/s"; 
$str = "        <html>\n        <body>\n        <p>He<i>l</i>lo <em>Moto</em>!\n        random code\n        random code\n        <p>Lorem <span>ipsum<span>.<p>\n        </body>\n        </html>\n        "; 
$subst = "$1"; 

$result = preg_replace($re, $subst, $str);
echo $result."\n";
?>

Input

$begin = 'Hello Moto!';
$end = 'Lorem ipsum.';

    <html>
    <body>
    <p>He<i>l</i>lo <em>Moto</em>!
    random code
    random code
    <p>Lorem <span>ipsum<span>.<p>
    </body>
    </html>

Output

<p>He<i>l</i>lo <em>Moto</em>!
        random code
        random code
        <p>Lorem <span>ipsum<span>.<p>

Answers:

There are several different approaches to searching for content in HTML source, and each has advantages and disadvantages. If unknown or unpredictable markup structure is a concern, the safest way is to use an XML parser; however, parsers are complex and therefore rather slow.

Regular expressions are designed for text processing. Although they are not the quickest tool because of their overhead, the preg_ functions are a reasonable compromise: they keep code small and concise without costing too much performance, provided that you prevent the patterns from becoming too complex.

Analysis of HTML structures is doable with recursive regular expressions. Since they slow down processing and are hard to debug, I prefer to code the base logic in PHP and use the preg_ functions for smaller, quick tasks.

Here is an OOP solution: a tiny class intended to process many searches on the same HTML source. It already goes some way towards handling related problems, such as adding preceding and succeeding content up to the next tag boundary. It does not claim to be a perfect solution yet, but it is easily extendable.

The logic is:
Pay some runtime at initialization to store the tag positions relative to the plain text: strip the tags, and store the strings between <...> together with the running sums of their lengths.
Then, on each content search, match the needles against the plain content and locate the start/end positions in the HTML source by binary search.

Binary search works like this: a sorted list is required. You store the index of the first element and the index of the last element + 1. Calculate their average by an addition and an integer division by 2; division and floor can be done efficiently with a right bit shift. If the value found is too low, set the lower index variable to the current index, otherwise the upper one. Stop when the index difference is 1. If you search for an exact value, break early when the element is found.
0,(14+1) => 7 ; 7,15 => 11 ; 7,11 => 9 ; 7,9 => 8 ; 8-7 = diff.1
Instead of 15 iterations only 4 are done. The longer the list, the more time is saved, because the number of iterations grows only logarithmically with the list length.
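The description above can be sketched as a standalone function. The function name and the sample positions are hypothetical; the class below inlines the same idea in translate_pos_plain2html().

```php
<?php
// Hypothetical helper: index of the last element that is <= $target in a sorted list.
// Assumes the list is non-empty and that $sorted[0] <= $target.
function lastIndexNotGreater(array $sorted, int $target): int
{
    $low  = 0;                // index known to hold a value <= $target
    $high = count($sorted);   // one past the last element
    while ($high - $low > 1) {
        $mid = ($low + $high) >> 1;   // integer division by 2 via right shift
        if ($sorted[$mid] <= $target) {
            $low = $mid;
        } else {
            $high = $mid;
        }
    }
    return $low;
}

// Hypothetical tag positions, measured in the stripped text.
$positions = [0, 4, 9, 15, 23, 42];
var_dump(lastIndexNotGreater($positions, 16));   // int(3)
var_dump(lastIndexNotGreater($positions, 0));    // int(0)
var_dump(lastIndexNotGreater($positions, 42));   // int(5)
```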

PHP class:

<?php
class HtmlTextSearch
{
  protected 
    $html            = '',
    $heystack        = '',
    $tags            = [],
    $current_tag_idx = null
  ;

  const
    RESULT_NO_MODIFICATION      = 0,
    RESULT_PREPEND_TAG          = 1,
    RESULT_PREPEND_TAG_CONTENT  = 2,
    RESULT_APPEND_TAG           = 4,
    RESULT_APPEND_TAG_CONTENT   = 8,
    MATCH_CASE_INSENSITIVE      =16,
    MATCH_BLANK_AS_WHITESPACE   =32,
    MATCH_BLANK_MULTIPLE        =64
  ;

  public function __construct($html)
  {
    $this->set_html($html);
  }

  public function set_html($html)
  {
    $this->html = $html;
    $regexp = '~<.*?>~su';
    preg_match_all($regexp, $html, $this->tags, PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE);
    $this->tags = $this->tags[0];
    # we use exact the same algorithm to strip html
    $this->heystack = preg_replace($regexp, '', $html);

    # convert positions to plain content
    $sum_length = 0;
    foreach($this->tags as &$tag)
    { $tag['pos_in_content'] = $tag[1] - $sum_length;
      $tag['sum_length'    ] = $sum_length += strlen($tag[0]);
    }

    # zero length dummy tags to mark start/end position of strings not beginning/ending with a tag
    array_unshift($this->tags , [0 => '', 1 => 0, 'pos_in_content' => 0, 'sum_length' => 0 ]); 
    array_push   ($this->tags , [0 => '', 1 => strlen($html)-1]); 
  }

  public function translate_pos_plain2html($content_position)
  {
    # binary search
    $idx = [true => 0, false => count($this->tags)-1];
    while(1 < $idx[false] - $idx[true])
    { $i = ($idx[true] + $idx[false]) >>1;                               // integer half of both array indexes
      $idx[$this->tags[$i]['pos_in_content'] <= $content_position] = $i; // hold one index less and the other greater
    }

    $this->current_tag_idx = $idx[true];
    return $this->tags[$this->current_tag_idx]['sum_length'] + $content_position;
  }

  public function &find_content($needle_start, $needle_end = '', $result_modifiers = self::RESULT_NO_MODIFICATION)
  {
    $needle_start = preg_quote($needle_start, '~');
    $needle_end   = '' == $needle_end ? '' : preg_quote($needle_end  , '~');
    if((self::MATCH_BLANK_MULTIPLE | self::MATCH_BLANK_AS_WHITESPACE) & $result_modifiers)
    { 
      $replacement  = self::MATCH_BLANK_AS_WHITESPACE & $result_modifiers ? '\s' : ' ';
      if(self::MATCH_BLANK_MULTIPLE & $result_modifiers)
      { $replacement .= '+';
        $multiplier = '+';
      }
      else
        $multiplier = '';
      $repl_pattern = "~ $multiplier~";
      $needle_start = preg_replace($repl_pattern, $replacement, $needle_start);
      $needle_end   = preg_replace($repl_pattern, $replacement, $needle_end);
    }

    $icase = self::MATCH_CASE_INSENSITIVE & $result_modifiers ? 'i' : '';
    $search_pattern = "~{$needle_start}.*?{$needle_end}~su$icase";
    preg_match_all($search_pattern, $this->heystack, $matches, PREG_PATTERN_ORDER | PREG_OFFSET_CAPTURE);

    foreach($matches[0] as &$match)
    { $pre = $post = '';

      $pos_start = $this->translate_pos_plain2html($match[1]);
      if(self::RESULT_PREPEND_TAG_CONTENT & $result_modifiers)
        $pos_start = $this->tags[$this->current_tag_idx][1]
          +( self::RESULT_PREPEND_TAG & $result_modifiers ? 0 : strlen ($this->tags[$this->current_tag_idx][0]) );
      elseif(self::RESULT_PREPEND_TAG     & $result_modifiers)
        $pre = $this->tags[$this->current_tag_idx][0];

      $pos_end   = $this->translate_pos_plain2html($match[1] + strlen($match[0]));
      if(self::RESULT_APPEND_TAG_CONTENT & $result_modifiers)
      { $next_tag = $this->tags[$this->current_tag_idx+1];
        $pos_end = $next_tag[1]
          +( self::RESULT_APPEND_TAG  & $result_modifiers ? strlen ($next_tag[0]) : 0);
      }
      elseif(self::RESULT_APPEND_TAG     & $result_modifiers)
        $post = $this->tags[$this->current_tag_idx+1][0];

      $match = $pre . substr($this->html, $pos_start, $pos_end - $pos_start) . $post;
    };
    return $matches[0];
  }
}

Some test case:

$html_source = get($_POST['html'], <<< ___
<html>
  <body>
    <p>He said: "Hello <em>進撃の巨人</em>!"</p>
    random code
    random code
    <p>Lorem <span>ipsum</span>. foo bar</p>
  </body>
</html>
___
);


  function get(&$ref, $default=null) { return isset($ref) ? $ref : $default; }

  function attr_checked($name, $method = "post")
  { $req = ['post' => '_POST', 'get' => '_GET'];
    return isset($GLOBALS[$req[$method]][$name]) ? ' checked="checked"' : '';
  }

  $begin = get($_POST['begin'], '"Hello 進撃の巨人!"');
  $end   = get($_POST['end'  ], 'Lorem ipsum.'   );
?>

<form action="" method="post">
  <textarea name="html" cols="80" rows="10"><?php
echo $html_source;
?></textarea>

  <br><input type="text"  name="begin" value="<?php echo $begin;?>">
  <br><input type="text"  name="end"   value="<?php echo $end  ;?>">

  <br><input type="checkbox" name="tag-pre" id="tag-pre"<?php echo attr_checked('tag-pre');?>>
      <label for="tag-pre">prefix tag</label>
      <br><input type="checkbox" name="txt-pre" id="txt-pre"<?php echo attr_checked('txt-pre');?>>
      <label for="txt-pre">prefix content</label>
  <br><input type="checkbox" name="txt-suf" id="txt-suf"<?php echo attr_checked('txt-suf');?>>
      <label for="txt-suf">suffix content</label>
  <br><input type="checkbox" name="tag-suf" id="tag-suf"<?php echo attr_checked('tag-suf');?>>
      <label for="tag-suf">suffix tag</label>
  <br>
  <br><input type="checkbox" name="wspace" id="wspace"<?php echo attr_checked('wspace');?>>
      <label for="wspace">blank (#32) matches any whitespace character</label>
  <br><input type="checkbox" name="multiple" id="multiple"<?php echo attr_checked('multiple');?>>
      <label for="multiple">one or more blanks match any number of blanks/whitespaces</label>
  <br><input type="checkbox" name="icase"    id="icase"<?php echo attr_checked('icase');?>>
      <label for="icase">case insensitive</label>

  <br><button type="submit">submit</button>
</form>

<?php
  $html = new HtmlTextSearch($html_source);

  $opts=
  [ 'tag-pre' => HtmlTextSearch::RESULT_PREPEND_TAG,
    'txt-pre' => HtmlTextSearch::RESULT_PREPEND_TAG_CONTENT,
    'txt-suf' => HtmlTextSearch::RESULT_APPEND_TAG_CONTENT,
    'tag-suf' => HtmlTextSearch::RESULT_APPEND_TAG,
    'wspace'  => HtmlTextSearch::MATCH_BLANK_AS_WHITESPACE,
    'multiple'=> HtmlTextSearch::MATCH_BLANK_MULTIPLE,
    'icase'   => HtmlTextSearch::MATCH_CASE_INSENSITIVE
  ];
  $options = 0;
  foreach($opts as $k => $v)
    if(isset($_POST[$k]))
      $options |= $v;
  $results = $html->find_content($begin, $end, $options);
  var_dump($results);
?>

Answers:

How about this?

$escape=array('\\'=>1,'^'=>1,'?'=>1,'+'=>1,'*'=>1,'{'=>1,'}'=>1,'('=>1,')'=>1,'['=>1,']'=>1,'|'=>1,'.'=>1,'$'=>1,'/'=>1);
$pattern='/';
for($i=0;isset($begin[$i]);$i++){
if(ord($c=$begin[$i])<0x80||ord($c)>0xbf){
    if(isset($escape[$c]))
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*\\$c";
    else
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*$c";
    }
    else
        $pattern.=$c;
}
$pattern.="(.|\n|\r)*";
for($i=0;isset($end[$i]);$i++){
if(ord($c=$end[$i])<0x80||ord($c)>0xbf){
    if(isset($escape[$c]))
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*\\$c";
    else
        $pattern.="([ \t\r\n\v\f]*<\/?[a-zA-Z]+>[ \t\r\n\v\f]*)*$c";
    }
    else
        $pattern.=$c;
}
$pattern[17]='?';
$pattern.='(<\/?[a-zA-Z]+>)?/';
preg_match($pattern,$html,$a);
$match=$a[0];

Answers:

PHP solution:

PHPFiddle Demo

$html = '
        <html>
        <body>
        <p>Hello <em>進撃の巨人</em>!</p>
        random code
        random code
        <p>Lorem <span>ipsum<span>.</p>
        </body>
        </html>
        ';
$begin = 'Hello 進撃の巨人!';
$end = 'Lorem ipsum.';

$matchHtmlTag = '(?:<.*?>)?';
$matchAllNonGreedy = '(?:.|\r?\n)*?';
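// One unescaped character, or one backslash-escaped character, that is not at the end of the string
$matchUnescapedCharNotAtEnd = '([^\\\\](?!$)|\\\\.(?!$))u';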
$matchBeginWithTags = preg_replace(
    $matchUnescapedCharNotAtEnd, '$0' . $matchHtmlTag, preg_quote($begin));
$matchEndWithTags = preg_replace(
    $matchUnescapedCharNotAtEnd, '$0' . $matchHtmlTag, preg_quote($end));
$pattern = '/' . $matchBeginWithTags . $matchAllNonGreedy . $matchEndWithTags . '/';

preg_match($pattern, $html, $matches);
$html = $matches[0];

Generated regex ($pattern):

Regex101 Demo

H(?:<.*?>)?e(?:<.*?>)?l(?:<.*?>)?l(?:<.*?>)?o(?:<.*?>)? (?:<.*?>)?進(?:<.*?>)?撃(?:<.*?>)?の(?:<.*?>)?巨(?:<.*?>)?人(?:<.*?>)?!(?:.|\r?\n)*?L(?:<.*?>)?o(?:<.*?>)?r(?:<.*?>)?e(?:<.*?>)?m(?:<.*?>)? (?:<.*?>)?i(?:<.*?>)?p(?:<.*?>)?s(?:<.*?>)?u(?:<.*?>)?m(?:<.*?>)?\.

Answers:

Assuming that the random code in your example is inside <p></p>, I propose using DOMDocument and XPath rather than a regular expression for what you are trying to do.

$html = '
        <html>
        <body>
        <div>nada blahhh <p>test paragraph</p> <em>blahh</em></div>
        <p>test</p>
        <span>this is test</span>
        <p>Hello <em>進撃の巨人</em>!</p>
        <p>random code</p>
        <p>random code</p>
        <p>Lorem <span>ipsum<span>.</p>
        <div>nada blahhh <p>test paragraph</p> <em>blahh</em></div>
        <p>test</p>
        <span>this is test</span>
        </body>
        </html>
        ';
$begin = 'Hello 進撃の巨人!';
$begin = iconv ( 'iso-8859-1','utf-8' , $begin ); // had to use iconv it won't be needed in your case
$end = 'Lorem ipsum.';       
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXpath($doc);
// example 3: same as above with wildcard
$elements = $xpath->query("*/p");

if (!is_null($elements)) {
    $flag = 'no_output';
  foreach ($elements as $element) {
      if($flag=='prepare_for_output'){$flag='output';}
      if($element->nodeValue==$begin){
      $flag='prepare_for_output';
      }
      if($element->nodeValue==$end){
      $flag='no_output';
      }
      if($flag=='output') {
      echo $element->nodeValue."\n";
      }
  }
}

http://sandbox.onlinephpfunctions.com/code/fa1095d98c6ef5c600f7b06366b4e0c4798a112f

Answers:

You can use this concept; the code is given below:

        <html lang="en-US">
        <head>

        <title>HTML Unicode UTF-8</title>

        <meta charset="utf-8">
        </head>

        <body>
        <?php
        $html = '
            <html>
            <body>
            <p>Hello <em>進撃の巨人</em>!</p>
            random code
            random code
            <p>Lorem <span>ipsum<span>.</p>

            </body>
            </html>
            ';

        $begin = 'Hello 進撃の巨人!';
        $end = 'Lorem ipsum.';

        $stripped =strip_tags($html);

        if (strpos($stripped, $end) !== false) {

            $final =str_replace($begin,"",$stripped);

           echo str_replace($end,"",$final);
        }
        ?>
        </body>  
        </html>

Answers:

Don’t break your mind trying to use regexp.

Use the DOM library of PHP: http://php.net/manual/en/book.dom.php

<?php

    header('Content-Type: text/html; charset=UTF-8');

    $html = '
            <html>
            <body>
            <p>Hello <em>進撃の巨人</em>!</p>
            random code
            random code
            <p>Lorem <span>ipsum<span>.</p>
            </body>
            </html>
            ';

    $doc = new DOMDocument();
    $doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

    $body_elements = $doc->getElementsByTagName("body"); 

    $code = '';

    foreach ($body_elements as $element) { 

        $children  = $element->childNodes;

        foreach ($children as $child) 
        { 
            $code.= $element->ownerDocument->saveHTML($child);
        }

    }

    echo $code;
?>

If you run that code in a sample PHP file, you should check the source of the web page using “View Source” in your browser to see the HTML tags. The <p> or <em> should be there 😉
