
php – Get HTML with current styles (maybe inlined) of a page that finished rendering and finished running scripts

Posted by: admin July 12, 2020

Questions:

I need to get the HTML with current styles (maybe inlined) of a page that has finished rendering and running its scripts, using a server-side application that is given just a URL (no extra information such as cookies, no POSTs, no forms standing in the way, etc.).

A bridge/proxy to a temporarily running browser, or a standalone utility using a browser library, is an acceptable solution (however, the chosen browser or browser library must be available on all major platforms and must be able to run without an OS GUI being present or installed).

An optional requirement is to remove all scripts afterwards (there are already standalone solutions for this; I am adding it here because the given answer may be able to remove scripts while rendering, or something like that).

How do I get an HTML+CSS snapshot, in a single .html file, of the current HTML document with the current styles (maybe inlined) and current images (using data URIs)?

If it can be done using pure PHP it would be a plus (although I doubt it, I haven’t found anything interesting).

Edit: I know how to load HTTP resources and get the HTML for a URL, that’s not what I’m looking for 😉

Edit 2: Example input HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <title></title>

        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

        <link rel="stylesheet" type="text/css" href="/css/example.css">
        <script type="text/javascript" src="/javascript/example.js"></script>

        <script type="text/javascript">
            window.addEventListener("load",
                function(event){
                    document.title="New title";

                    document.getElementById("pic_0").style.border="0px";
                }
            );
        </script>
        <style type="text/css">
            p{
                color: blue;
            }
        </style>
    </head>
    <body>
        <p>Hello world!</p>
        <p>
            <img 
                alt="" 
                style="border: 1px" 
                id="pic_0" 
                src="http://linuxgazette.net/144/misc/john/helloworld.png"
            >
        </p>
    </body>
</html>

Example output:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <title>New title</title>

        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

        <style type="text/css">
            b{font-weight: bold}
        </style>

        <style type="text/css">
            p{
                color: blue;
            }
        </style>
    </head>
    <body>
        <p>Hello world!</p>
        <p>
            <img 
                alt="" 
                style="border: 0px" 
                id="pic_0" 
                src=""
            >
        </p>
    </body>
</html>

Notice how the <title> tag changed, how border: 1px became border: 0px, and how the image URL was transformed into a data URI.

Some of these transformations (the inline CSS and the <title> tag) can be observed by inspecting the document in the Google Chrome inspector.

Edit 3: Replacing external resources (styles and images) with on-page ones and removing JavaScript is the easy part. The hard part is computing the CSS styles after the JavaScript has run.

Edit 4: Maybe this could be done using injected JavaScript (it would still need browser control, though)?
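
One way to picture the injected-JavaScript idea from Edit 4 is a function that runs inside the rendered page and copies each element's computed style into its style attribute. This is only a minimal sketch, not the script from the answer: the name inlineComputedStyles and the getStyle parameter are made up for illustration, and copying every computed property makes the output very verbose.

```javascript
// Hypothetical helper: freeze the computed style of every element under `root`
// into its inline style attribute.
// `getStyle` is assumed to behave like window.getComputedStyle: it returns an
// object with `length`, indexed property names, and getPropertyValue(name).
function inlineComputedStyles(root, getStyle) {
    var nodes = [root].concat(
        Array.prototype.slice.call(root.querySelectorAll("*"))
    );

    // First pass: read all computed styles before anything is modified.
    var styleTexts = nodes.map(function (el) {
        var computed = getStyle(el);
        var declarations = [];
        for (var i = 0; i < computed.length; i++) {
            var name = computed[i];
            declarations.push(name + ": " + computed.getPropertyValue(name));
        }
        return declarations.join("; ");
    });

    // Second pass: write the collected declarations back as inline styles.
    nodes.forEach(function (el, index) {
        el.setAttribute("style", styleTexts[index]);
    });
}
```

Inside a browser it could be called as inlineComputedStyles(document.documentElement, function (el) { return window.getComputedStyle(el); }) once the page's scripts have finished.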

How to & Answers:

PhantomJS is a headless (GUI-less) WebKit browser with a JavaScript API.
It runs on all major platforms, as I requested in my question.

It can run JavaScript scripts that control the GUI-less web browser. It has a powerful API and plenty of examples.

In my spare time over the last 2-3 days I wrote the solution to my question, and it covers all requirements beautifully. I haven’t found a webpage for which it wouldn’t work.


Usage, command line:

phantomjs save_as_html.js http://stackoverflow.com/q/12215844/584490 saved.html


JavaScript is allowed to run for n seconds after everything else loads, so it should work even for web pages generated entirely by JavaScript.
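
The "let scripts run for n seconds after load" idea can be sketched as a small helper. It assumes an object that behaves like a PhantomJS WebPage (an open(url, callback) method that reports "success" or "fail", and a content property holding the current HTML). The helper name snapshotAfterDelay and the delay value are hypothetical, and this is not the actual save_as_html.js.

```javascript
// Hypothetical helper: load `url`, give the page's scripts `delayMs`
// milliseconds to run after loading, then hand the current HTML to `done`.
// `page` is assumed to behave like a PhantomJS WebPage object.
function snapshotAfterDelay(page, url, delayMs, done) {
    page.open(url, function (status) {
        if (status !== "success") {
            done(new Error("Failed to load " + url));
            return;
        }
        // Wait before reading the DOM so late-running scripts can finish.
        setTimeout(function () {
            done(null, page.content);
        }, delayMs);
    });
}

// Possible use inside a PhantomJS script (5000 ms is an arbitrary choice):
//
//   var page = require("webpage").create();
//   var args = require("system").args;
//   snapshotAfterDelay(page, args[1], 5000, function (err, html) {
//       if (!err) require("fs").write(args[2], html, "w");
//       phantom.exit(err ? 1 : 0);
//   });
```

A fixed delay is the simplest policy; it cannot tell whether a page has really finished, so n has to be chosen generously for script-heavy pages.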


Notes:

  • Where possible, resources are loaded via XHR rather than re-rendered through an HTML5 canvas, because this keeps files smaller and avoids quality loss (reusing the original files beats any re-encoding).

  • <link> and <img> tags are kept in place, and data: URIs are used inside the href and src attributes respectively, instead of URLs. The same is true for background-image, which is read using getComputedStyle() on all DOM nodes.

  • <script> tags and event handler attributes are removed.

  • <link> tags with rel="alternative" are also removed (maybe they shouldn’t be; instead, a relative URL could be rewritten as an absolute one).

  • <iframe> is currently not handled, and its src attribute is set to about:blank.
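
The data: URI substitution mentioned in the notes boils down to base64-encoding the original bytes and prefixing them with the MIME type. A minimal sketch, assuming a Node.js-style Buffer is available (inside a page, btoa would be the usual stand-in); toDataUri is an illustrative name, not a function from the script.

```javascript
// Hypothetical helper: build a data: URI from a MIME type and raw bytes.
// `bytes` may be a Buffer, a Uint8Array, or a string (encoded as UTF-8).
// Assumes a Node.js-style Buffer global.
function toDataUri(mimeType, bytes) {
    var base64 = Buffer.from(bytes).toString("base64");
    return "data:" + mimeType + ";base64," + base64;
}

// Example: a tiny stylesheet embedded as the href of a <link> tag.
var href = toDataUri("text/css", "p{}");
// href is "data:text/css;base64,cHt9"
```

Reusing the original bytes this way is what keeps the embedded resources identical to the files the page loaded.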


Beware: all cross-site scripting security restrictions are lifted so that all resources can be loaded. Make sure you don’t try to save malicious web pages while using secret credentials, such as those of your Facebook account :).


save_as_html.js contents: