
PHP SimpleXML doesn't preserve line breaks in XML attributes

Posted by: admin July 12, 2020

Questions:

I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another Stack Overflow question, line breaks in attribute values are valid XML (even though far less than ideal!).

Why are they lost, and how can I preserve them?

Here is a demo script (note that when the line breaks are not in an attribute, they are preserved).

PHP File with embedded XML

$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
    <data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
    <data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;

$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';

Output from print_r

SimpleXMLElement Object
(
    [data] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [Title] => Data Title
                            [Remarks] => First line of the row. Followed by the second line. Even a third!
                        )

                )

            [1] => First line of the row.
Followed by the second line.
Even a third!
        )

)
Answers:

The character reference for a new line is &#10;. I played with your code until I found something that did the trick. It’s not very elegant, I warn you:

// First, remove any indentation:
$xml = str_replace("     ", "", $xml);
$xml = str_replace("\t", "", $xml);

// Next, unify all new lines into Unix LF:
$xml = str_replace("\r", "\n", $xml);
$xml = str_replace("\n\n", "\n", $xml);

// Next, replace all new lines with the character reference:
$xml = str_replace("\n", "&#10;", $xml);

// Finally, turn any character reference between > and < back into a new line:
$xml = str_replace(">&#10;<", ">\n<", $xml);

The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.

This would of course fail if the next line started with text wrapped in an inline element, because that line would then begin with a <.

Answer:

Using SimpleXML, the line breaks seem to be lost.

Yes, that is expected. In fact, any conformant XML parser is required to turn raw newlines in attribute values into plain spaces. See attribute value normalisation in the XML spec.

If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
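A minimal sketch of that difference, assuming the SimpleXML extension is available (the variable names are illustrative and not from the question). The same attribute is parsed twice, once with a raw newline and once with the &#10; character reference:

```php
<?php
// Same attribute value, once with a raw newline and once with a character reference.
$raw     = "<Rows><data Remarks='First line.\nSecond line.' /></Rows>";
$escaped = "<Rows><data Remarks='First line.&#10;Second line.' /></Rows>";

$rawValue     = (string) (new SimpleXMLElement($raw))->data['Remarks'];
$escapedValue = (string) (new SimpleXMLElement($escaped))->data['Remarks'];

// Attribute value normalisation turns the raw newline into a space...
var_dump($rawValue === "First line. Second line.");
// ...while the character reference survives as a real line feed.
var_dump($escapedValue === "First line.\nSecond line.");
```

So if the line breaks have to survive, they need to be character references by the time the string reaches the parser.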

Answer:

Assuming $xmlData is your XML string before it is sent to the parser, this should replace all newlines in attributes with the correct entity. I had the issue with XML coming from SQL Server.

$parts = explode("<", $xmlData); //split over <
array_shift($parts); //remove the blank array element
$newParts = array(); //create array for storing new parts
foreach($parts as $p)
{
    list($attr,$other) = explode(">", $p, 2); //get attribute data into $attr
    $attr = str_replace("\r\n", "&#10;", $attr); //do the replacement
    $newParts[] = $attr.">".$other; // put parts back together
}
$xmlData = "<".implode("<", $newParts); // put parts back together prefixing with <

Probably can be done more simply with a regex, but that’s not a strong point for me.
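One possible regex version, as a sketch only. escapeAttributeNewlines is a hypothetical helper name, and the sketch assumes element names start with an ASCII letter or underscore, attribute values never contain a raw >, and the document has no comments or CDATA sections with tag-like text in them. It only rewrites text inside quoted attribute values, so whitespace between attributes is left alone:

```php
<?php
// Hypothetical helper: escape line breaks inside quoted attribute values
// so that the XML parser keeps them as real line feeds.
function escapeAttributeNewlines(string $xml): string
{
    // Outer pass: every start tag or empty-element tag
    // (closing tags, comments and processing instructions are skipped).
    $result = preg_replace_callback(
        '/<[A-Za-z_][^<>]*>/',
        function (array $tag) {
            // Inner pass: every quoted attribute value inside that tag.
            $escaped = preg_replace_callback(
                '/"[^"]*"|\'[^\']*\'/',
                function (array $value) {
                    // CRLF first, so that it becomes one reference, not two.
                    return str_replace(["\r\n", "\r", "\n"], '&#10;', $value[0]);
                },
                $tag[0]
            );
            return $escaped ?? $tag[0]; // keep the tag unchanged on a regex error
        },
        $xml
    );
    return $result ?? $xml; // fall back to the input on a regex error
}

// Usage: run the helper before handing the string to SimpleXML.
$xmlData = "<Rows><data Title='Data Title' Remarks='First line.\r\nSecond line.' /></Rows>";
$rows    = new SimpleXMLElement(escapeAttributeNewlines($xmlData));
var_dump((string) $rows->data['Remarks'] === "First line.\nSecond line.");
```

Every line ending is mapped to a single &#10; here because XML parsers normalise CRLF to LF anyway.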

Answer:

Here is code to replace the new lines with the appropriate character reference in that particular XML fragment. Run this code prior to parsing.

$replaceFunction = function ($matches) {
    return str_replace("\n", "&#10;", $matches[0]);
};
$xml = preg_replace_callback(
    "/<data Title='[^']+' Remarks='[^']+'/i",
    $replaceFunction, $xml);

Answer:

This is what worked for me:

First, get the xml as a string:

    $xml = file_get_contents($urlXml);

Then do the replacement:

    $xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);

The “.” and “<as:eol/>” are in the search string because those were the places where I needed to add breaks. \xe2\x80\xa9 is the UTF-8 encoding of U+2029 (PARAGRAPH SEPARATOR), and the new lines “\n” can be replaced with whatever you like.

After replacing, just load the xml-string as a SimpleXMLElement object:

    $xmlo = new SimpleXMLElement( $xml );

Et Voilà

Answer:

Well, this question is old, but someone else might end up on this page eventually, as I did.
I took a slightly different approach, which I think is the most elegant of those mentioned.

Inside the XML, put a unique marker that stands for a new line.

Change the XML to:

<data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />

Then, once you have read the desired node from SimpleXML into a string ($output here), write something like this:

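$findme = '\n'; // the literal two-character marker (backslash, n), not a real newline
if (strpos($output, $findme) !== false)
{
    // replace the marker itself; "\n" in double quotes would search for a real newline
    $output = str_replace($findme, "<br/>", $output);
}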

It doesn’t have to be ‘\n’; any unique marker will do.
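An end-to-end sketch of the marker idea, assuming whoever produces the XML agrees to write the two-character marker \n (backslash, n) at each intended break; the variable names are illustrative. The parser still turns the raw line break after each marker into a space, so the sketch removes that space together with the marker:

```php
<?php
// Nowdoc, so that \n stays a literal backslash followed by n.
$xml = <<<'XML'
<Rows>
    <data Title='Data Title' Remarks='First line of the row.\n
Followed by the second line.\n
Even a third!' />
</Rows>
XML;

$remarks = (string) (new SimpleXMLElement($xml))->data['Remarks'];

// Each marker is followed by the space that replaced the raw line break.
$withBreaks = str_replace('\n ', "\n", $remarks);

print $withBreaks;
```

For HTML output, the replacement string could be "<br/>" instead of a real newline.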