xml parsing – Decode multiple xml tags inside using PHP

Posted by: admin July 12, 2020

Questions:

I’m looking for a ‘smart way’ of decoding multiple XML tags inside a string. I have the following function:

function b($params) {
    $xmldata = '<?xml version="1.0" encoding="UTF-8" ?><root>' . html_entity_decode($params['data']) . '</root>';
    $lang = ucfirst(strtolower($params['lang']));
    if (simplexml_load_string($xmldata) === FALSE) {
        return $params['data'];
    } else {
        $langxmlobj = new SimpleXMLElement($xmldata);

        if ($langxmlobj -> $lang) {
            return $langxmlobj -> $lang;
        } else {
            return $params['data'];
        }
    }
}

And I am trying it out with:

$params['data'] = '<French>Service DNS</French><English>DNS Service</English> - <French>DNS Gratuit</French><English>Free DNS</English>';
$params['lang'] = 'French';
$a = b($params);
print_r($a);

But it outputs:

Service DNS

And I want it to output the content of every matching tag, so the result should be:

Service DNS - DNS Gratuit

Pulling my hair out. Any quick help or directions would be appreciated.


Edit: Refined requirements.

It seems that I wasn’t clear enough, so let me show another example.

If I have the following string as input:

The <French>Chat</French><English>Cat</English> is very happy to stay on stackoverflow 
because it makes him <French>Heureux</French><English>Happy</English> to know that it 
is the best <French>Endroit</French><English>Place</English> to find good people with
good <French>Réponses</French><English>Answers</English>.

So if I run the function with ‘French’ it will return:

The Chat is very happy to stay on stackoverflow 
because it makes him Heureux to know that it 
is the best Endroit to find good people with
good Réponses.

And with ‘English’:

The Cat is very happy to stay on stackoverflow 
because it makes him Happy to know that it 
is the best Place to find good people with
good Answers.

Hope it’s clearer now.

How to & Answers:

Basically, I would first match each run of adjacent language sections, like:

<French>Chat</French><English>Cat</English>

with this pattern:

"@(<($defLangs)>.*?</\\2>)+@i"

Then pick the string for the requested language out of each match in a callback.
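
To see what that first stage captures, here is a minimal sketch. It assumes the pattern ends in +@i (one or more adjacent language sections, case-insensitive) and that the backreference is written \\2 inside the double-quoted PHP string. The sample $str is made up for illustration.

```php
<?php
// Assumed pattern: one or more adjacent <Lang>...</Lang> sections.
// "\\2" keeps a literal \2 backreference; a bare "\2" inside double quotes
// would be read by PHP as an octal escape (byte 0x02).
$defLangs = 'French|English';
$pattern  = "@(<($defLangs)>.*?</\\2>)+@i";

// Illustrative input, not taken from the question.
$str = 'The <French>Chat</French><English>Cat</English> is '
     . '<French>Heureux</French><English>Happy</English>.';

preg_match_all($pattern, $str, $matches);

// $matches[0] holds each complete run of language sections.
print_r($matches[0]);
```

Each entry of $matches[0] is what the callback later receives as $matches[0], so it only has to pick one language out of a short run of tags.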

If you have PHP 5.3+, then:

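function transLang($str, $lang, $defLangs = 'French|English')
{
    // Match each run of adjacent language sections, e.g. <French>..</French><English>..</English>
    return preg_replace_callback ( "@(<($defLangs)>.*?</\\2>)+@i", 

            function ($matches) use($lang)
            {
                // Pick the requested language out of the matched run
                preg_match ( "/<$lang>(.*?)<\/$lang>/i", $matches [0], $longSec );

                // Drop the run if the requested language is missing from it
                return isset ( $longSec [1] ) ? $longSec [1] : '';
            }, $str );
}

// Sample input for illustration
$str = 'The <French>Chat</French><English>Cat</English> is happy.';

echo transLang ( $str, 'French' ), "\n", transLang ( $str, 'English' );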

If not, it is a little more complicated:

class LangHelper
{

    private $lang;

    function __construct($lang)
    {
        $this->lang = $lang;
    }

    public function callback($matches)
    {
        $lang = $this->lang;

        preg_match ( "/<$lang>(.*?)<\/$lang>/i", $matches [0], $subMatches );

        return $subMatches [1];
    }

}

function transLang($str, $lang, $defLangs = 'French|English')
{
    $langHelper = new LangHelper ( $lang );

    return preg_replace_callback ( "@(<($defLangs)>.*?</\\2>)+@i", 
            array (
                    $langHelper,
                    'callback' 
            ), $str );
}

echo transLang ( $str, 'French' ), "\n", transLang ( $str, 'English' );

Answer:

If I understand you correctly, you would like to remove all “language” tags, but keep the contents of the provided language.

The DOM is a tree of nodes. Tags are element nodes, and the text is stored in text nodes. XPath lets you select nodes using expressions. So take all the child nodes of the language elements you want to keep and copy them to just before the language node. Then remove all language nodes. This will work even if the language elements contain other element nodes, like an <em>.

function replaceLanguageTags($fragment, $language) {
  $dom = new DOMDocument();
  $dom->loadXml(
    '<?xml version="1.0" encoding="UTF-8" ?><content>'.$fragment.'</content>'
  );
  // get an xpath object
  $xpath = new DOMXpath($dom);

  // fetch all nodes of the language you would like to keep
  $nodes = $xpath->evaluate('//'.$language);
  foreach ($nodes as $node) {
    // copy all the child nodes to just before the found node
    foreach ($node->childNodes as $childNode) {
      $node->parentNode->insertBefore($childNode->cloneNode(TRUE), $node);
    }
    // remove the found node
    $node->parentNode->removeChild($node);
  }

  // select all language nodes
  $tags = array('English', 'French');
  $nodes = $xpath->evaluate('//'.implode('|//', $tags));
  foreach ($nodes as $node) {
    // remove them
    $node->parentNode->removeChild($node);
  }

  $result = '';
  // we do not need the root node, so save all its children
  foreach ($dom->documentElement->childNodes as $node) {
    $result .= $dom->saveXml($node);
  }
  return $result;
}

$xml = <<<'XML'
The <French>Chat</French><English>Cat</English> is very happy to stay on stackoverflow
because it makes him <French>Heureux</French><English>Happy</English> to know that it
is the best <French>Endroit</French><English>Place</English> to find good people with
good <French>Réponses</French><English>Answers</English>.
XML;

var_dump(replaceLanguageTags($xml, 'English'));
var_dump(replaceLanguageTags($xml, 'French'));

Output:

string(146) "The Cat is very happy to stay on stackoverflow
because it makes him Happy to know that it
is the best Place to find good people with
good Answers."
string(153) "The Chat is very happy to stay on stackoverflow
because it makes him Heureux to know that it
is the best Endroit to find good people with
good Réponses."
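
As a hedged illustration of the point about nested elements such as <em>, here is a compact sketch of the same idea. keepLanguage is a hypothetical name, and it moves the child nodes instead of cloning them, which is a variation on the code above and not the original author's exact method. It assumes the fragment is well-formed XML once wrapped in a root element.

```php
<?php
// Hypothetical compact variant: keep the children of $language elements,
// drop every other language element together with its content.
function keepLanguage($fragment, $language, array $tags = array('English', 'French')) {
    $dom = new DOMDocument();
    $dom->loadXML('<?xml version="1.0" encoding="UTF-8"?><content>' . $fragment . '</content>');
    $xpath = new DOMXPath($dom);

    // Copy the node list into an array so removals cannot disturb the iteration.
    $nodes = iterator_to_array($xpath->query('//' . implode('|//', $tags)));
    foreach ($nodes as $node) {
        if ($node->nodeName === $language) {
            // Move each child (text or element) to just before the language node.
            while ($node->firstChild) {
                $node->parentNode->insertBefore($node->firstChild, $node);
            }
        }
        // The language element itself is always removed.
        $node->parentNode->removeChild($node);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $result = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $result .= $dom->saveXML($child);
    }
    return $result;
}

$input = 'The <French><em>Chat</em> noir</French><English><em>Cat</em></English>!';
echo keepLanguage($input, 'English'), "\n";
echo keepLanguage($input, 'French'), "\n";
```

The <em> elements survive because whole nodes are moved, not just their text.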

Answer:

What version of PHP are you on? I don’t know what else could be different, but I copied & pasted your code and got the following output:

SimpleXMLElement Object
(
    [0] => Service DNS
    [1] => DNS Gratuit
)

Just to be sure, this is the code I copied from above:

<?php

function b($params) {
    $xmldata = '<?xml version="1.0" encoding="UTF-8" ?><root>' . html_entity_decode($params['data']) . '</root>';
    $lang = ucfirst(strtolower($params['lang']));
    if (simplexml_load_string($xmldata) === FALSE) {
        return $params['data'];
    } else {
        $langxmlobj = new SimpleXMLElement($xmldata);

        if ($langxmlobj -> $lang) {
            return $langxmlobj -> $lang;
        } else {
            return $params['data'];
        }
    }
}

$params['data'] = '<French>Service DNS</French><English>DNS Service</English> - <French>DNS Gratuit</French><English>Free DNS</English>';
$params['lang'] = 'French';
$a = b($params);
print_r($a);
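
The print_r output above suggests that $langxmlobj->$lang holds every matching element, not only the first one. A minimal sketch of collecting them all follows. The variable names are illustrative, and the " - " separator is hard-coded as an assumption, because this approach does not read the text that sits between the tags.

```php
<?php
$xmldata = '<?xml version="1.0" encoding="UTF-8" ?><root>'
    . '<French>Service DNS</French><English>DNS Service</English> - '
    . '<French>DNS Gratuit</French><English>Free DNS</English></root>';

$root = new SimpleXMLElement($xmldata);

// Iterating $root->French visits every <French> child of the root.
$parts = array();
foreach ($root->French as $element) {
    $parts[] = (string) $element;
}

// The separator is supplied by hand; the original " - " text node is ignored.
echo implode(' - ', $parts), "\n";
```

This is fine for the first example, but it would lose the surrounding sentence in the second one, which is why the other answers work on the whole string.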

Answer:

Here’s my suggestion. It should be fast and it is simple. You just need to strip the tags of the desired language and then remove any other tags along with their content.

The downside is that if you want to use any tags other than the language ones, you have to make sure that the opening tag differs from the closing one (e.g. <p >Lorem</p> instead of <p>Lorem</p>), so that the final regex does not strip them. On the other hand, this lets you add as many languages as you want without keeping a list of them. You only need to know the default one (or just throw and catch an exception) for when the requested language is missing.

function only_lang($lang, $text) {
    static $infinite_loop;

    $result = str_replace("<$lang>", '', $text, $num_matches_open);
    $result = str_replace("</$lang>", '', $result, $num_matches_close);

    // Check if the text is malformed. Good place to throw an error
    if($num_matches_open != $num_matches_close) {
        //throw new Exception('Opening and closing tags do not match', 1);

        return $text;
    }

    // Check if this language is present at all.
    // Otherwise fallback to default language or throw an error
    if( ! $num_matches_open) {
        //throw new Exception('No such language', 2);

        // Prevent infinite recursion if even the default language is missing
        if($infinite_loop) return $text;
        $infinite_loop = true;
        $fallback = only_lang('English', $text);
        // Reset the guard so later calls can fall back again
        $infinite_loop = false;
        return $fallback;
    }

    // Strip any other language and return the result
    // (non-greedy, so each <tag>...</tag> pair is removed on its own)
    return preg_replace('!<([^>]+)>.*?</\1>!s', '', $result);
}
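
To illustrate the caveat about other tags, here is a small sketch of the final stripping step on its own. The pattern is written non-greedy with the s modifier, which is my assumption about the intended behaviour, and the sample text is made up.

```php
<?php
// Removes any <tag>...</tag> pair whose closing tag repeats the opening text exactly.
$pattern = '!<([^>]+)>.*?</\1>!s';

// Illustrative input: the French tags are assumed to be stripped already.
$text = 'Chat<English>Cat</English> <p >Lorem</p> <b>bold</b>';

$result = preg_replace($pattern, '', $text);

// <English> and <b> pairs are removed; <p >...</p> survives because
// "p " (with the space) has no matching "</p >".
var_dump($result);
```

So ordinary markup like <b> is lost unless its opening tag is written differently from its closing tag.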

Answer:

I have a simple one using regex. It is useful if the input contains only <lang>...</lang> tags.

function to_lang($lang="", $str="") {
  return strip_tags(preg_replace('~<(\w+(?<!'.$lang.'))>.*</\1>~Us',"",$str));
}

echo to_lang("English","The happy <French>Chat</French><English>Cat</English>");

This removes each <tag>...</tag> that is not the one specified in $lang. If there could be spaces or special characters inside the <tag-name>, e.g. <French-1>, replace \w with [^/>].


Search pattern explained a bit

1.) <(\w+(?<!'.$lang.'))

< followed by one or more Word characters,
not matching $lang (using a negative lookbehind)
and capturing the <tag_name>

2.) .* followed by anything (ungreedy: modifier U, dot matches newlines: modifier s)

3.) </\1> until the captured tag is closed
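
One thing to watch with the lookbehind in step 1: it rejects any tag name that merely ends in $lang, so with $lang set to English a tag such as <OldEnglish> would be kept as well. A hedged sketch of a stricter variant follows. to_lang_strict is a hypothetical name; it swaps the lookbehind for a negative lookahead and escapes the name with preg_quote. It assumes tag names consist of word characters only.

```php
<?php
// Hypothetical stricter variant of to_lang(): a tag is kept only when its
// name is exactly $lang.
function to_lang_strict($lang, $str) {
    $name = preg_quote($lang, '~');
    // (?!NAME>) refuses to start a match at the wanted tag;
    // every other <tag>...</tag> pair is removed, non-greedily.
    $pattern = '~<(?!' . $name . '>)(\w+)>.*?</\1>~s';
    // strip_tags() then removes the remaining tags of the wanted language.
    return strip_tags(preg_replace($pattern, '', $str));
}

echo to_lang_strict('English', 'The happy <French>Chat</French><English>Cat</English>'), "\n";
echo to_lang_strict('English', 'a<OldEnglish>x</OldEnglish><English>b</English>'), "\n";
```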