php – How to speed up a complex image processing?

Posted by: admin July 12, 2020


Every user will be able to upload 100 TIFF (black and white) images.

The process requires:

  1. Convert tif to jpg.

  2. Resize image to xx.

  3. Crop image to 200px.

  4. Add a text watermark.

Here is my PHP code:

$image_name_only = strtolower($image_info["filename"]);

$exec = '"C:\Program Files\ImageMagick-6.9.0-Q16\convert.exe" '.$destination_folder.$image_name. ' '.$name.' 2>&1';
exec($exec, $exec_output, $exec_retval);                

$exec = '"C:\Program Files\ImageMagick-6.9.0-Q16\convert.exe" '.$name. ' -resize 1024x  '.$name;
exec($exec, $exec_output, $exec_retval);

$exec = '"C:\Program Files\ImageMagick-6.9.0-Q16\convert.exe" '.$name. ' -thumbnail 200x200!  '.$thumb;
exec($exec, $exec_output, $exec_retval);

$exec = '"C:\Program Files\ImageMagick-6.9.0-Q16\convert.exe" '.$name. "  -background White  label:ش.پ12355  -append  ".$name;
exec($exec, $exec_output, $exec_retval);

This code works. But the average processing time for every image is 1 second.
So for 100 images it will probably take around 100 seconds.

How can I speed up this whole process (convert, resize, crop, watermark)?


I have a Server G8:Ram:32G,CPU:Intel Xeon E5-2650(4 Process)

version:ImageMagick 6.9.0-3 Q16 x64


convert logo: -resize 500% -bench 10 1.png

 Performance[1]: 10i 0.770ips 1.000e 28.735u 0:12.992
 Performance[2]: 10i 0.893ips 0.537e 26.848u 0:11.198
 Performance[3]: 10i 0.851ips 0.525e 27.285u 0:11.756
 Performance[4]: 10i 0.914ips 0.543e 26.489u 0:10.941
 Performance[5]: 10i 0.967ips 0.557e 25.803u 0:10.341
 Performance[6]: 10i 0.797ips 0.509e 27.737u 0:12.554
 Performance[7]: 10i 0.963ips 0.556e 25.912u 0:10.389
 Performance[8]: 10i 0.863ips 0.529e 26.707u 0:11.586

Resource limits:

Width: 100MP;Height: 100MP;Area: 17.16GP;Memory: 7.9908GiB;Map: 15.982GiB;Disk: unlimited;File: 1536;Thread: 8;Throttle: 0;Time: unlimited

Answers:

0. Two approaches

Basically, this challenge can be tackled in two different ways, or a combination of the two:

  1. Construct your commands as cleverly as possible.
  2. Trade speed-up gains for quality losses.

The next few sections discuss both approaches.

1. Check which ImageMagick you’ve got: ‘Q8’, ‘Q16’, ‘Q32’ or ‘Q64’?

First, check for your exact ImageMagick version and run:

convert -version

In case your ImageMagick has a Q16 (or even Q32 or Q64, which is possible, but overkill!) in its version string:
This means, all of ImageMagick’s internal functions treat all images as having 16 bit (or 32 or 64 bit) channel depths.
This gives you a better quality in image processing.
But it also requires double memory as compared to Q8.
So at the same time it means a performance degradation.

Hence: you could test what performance benefits you’ll achieve by switching to a Q8-build.
(The Q stands for the ‘quantum depth’ supported by an ImageMagick build.)

You’ll pay your possible Q8-performance gains with quality loss, though.
Just check what speed up you achieve with Q8 over Q16, and what quality losses you suffer.
Then decide whether you can live with the drawbacks or not…

In any case Q16 will use twice as much RAM per image to process, and Q32 will again use twice the amount of Q16.
This is independent from the actual bits-per-pixels seen in the input files.
16-bit image files, when saved, will also consume more disk space than 8-bit ones.

With Q16 or Q32 requiring more memory, you always have to ensure that you have enough of it.
Because exceeding your physical memory would be very bad news.
If a larger Q makes a process swap to disk, performance will plummet.
A 1024 x 768 pixel image (width x height) will require the following amounts of virtual memory, depending on the quantum depth:

Quantum                   Virtual Memory
  Depth    (consumed by 1 image 1024x768)
-------    ------------------------------  
      8         3,840 KiB  (≈  3.75 MiB)
     16         7,680 KiB  (≈  7.50 MiB)
     32        15,360 KiB  (≈ 15.00 MiB)
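
The figures in the table are consistent with roughly five samples per pixel, each (quantum depth / 8) bytes wide. That factor of five is inferred from the table itself, not taken from ImageMagick's documentation, so the following Python sketch is only a back-of-the-envelope estimate; the function name is made up for illustration.

```python
def pixel_cache_kib(width, height, quantum_depth, samples_per_pixel=5):
    """Estimate the memory needed for one image, in KiB.

    samples_per_pixel=5 is an assumption inferred from the table above,
    not a documented ImageMagick constant.
    """
    bytes_per_sample = quantum_depth // 8
    return width * height * samples_per_pixel * bytes_per_sample / 1024


# Print the estimate for the three quantum depths discussed in the text.
for depth in (8, 16, 32):
    kib = pixel_cache_kib(1024, 768, depth)
    print(f"Q{depth}: {kib:,.0f} KiB = {kib / 1024:.2f} MiB")
```

Multiplying the result by the number of images a pipeline keeps in memory at once gives a rough idea of when the machine would start swapping.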

Also keep in mind, that some ‘optimized’ processing pipelines (see below) will need to keep several copies of an image in virtual memory!
Once virtual memory cannot be satisfied by available RAM, the system will start to swap and claim “memory” from the disk.
In that case, any gain from clever command pipeline optimization is lost, and performance may well end up worse than before.

ImageMagick was born in an era when CPUs processed far fewer bits at a time than they do today.
That was decades ago.
Since then CPU architecture has changed a lot.
16-bit operations used to take twice as long as 8-bit operations, or even longer.
Then 16-bit processors arrived.
16-bit ops became standard.
CPUs were optimised for 16-bit:
Suddenly some 8-bit operations could take even longer than 16-bit equivalents.

Nowadays, 64bit CPUs are common.
So the Q8 vs. Q16 vs. Q32 argument in real terms may even be void.
Who knows?
I’m not aware of any serious benchmarking about this.
It would be interesting if someone (with really deep know-how about CPUs and about benchmarking real-world programs) would take on such a project one day.

Yes, I see you are using Q16 on Windows.
But I still wanted to mention it, for completeness’ sake…
In the future there will be other users reading this question and the answers given.

Very likely, since your input TIFFs are black+white only, the image quality output of a Q8 build will be good enough for your workflow.
(I just don’t know if it would also be significantly faster:
this largely also depends on the hardware resources you are running this on…)

In addition, if your installation supports HDRI (high dynamic range imaging), this may also cause some speed penalty.
Who knows?
So building IM with the configure options --disable-hdri --with-quantum-depth=8 may or may not lead to speed improvements.
Nobody has ever tested this in a serious way…
The only thing we know about this:
these options will decrease image quality.
However most people will not even notice this, unless they take really close looks and make direct image-by-image comparisons…


2. Check your ImageMagick’s capabilities

Next, check if your ImageMagick installation comes with OpenCL and/or OpenMP support:

convert -list configure | grep FEATURES

If it does (like mine), you should see something like this:

FEATURES      DPC HDRI OpenCL OpenMP Modules

OpenCL (Open Computing Language) support lets ImageMagick use its parallel computing features (if compiled in).
This will make use of your computer’s GPU in addition to the CPU for image processing operations.

OpenMP (Open Multi-Processing) does something similar:
it allows ImageMagick to execute in parallel on all the cores of your system.
So if you have a quad-core system, and resize an image, the resizing happens on 4 cores (or even 8 if you have hyperthreading).

The command

convert -version

prints some basic info about supported features.
If OpenCL/OpenMP are available, you will see one of them (or both) in the output.
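
If you want to run this check from a script, the feature line can be parsed with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and it assumes the output contains either a `FEATURES ...` line (as shown above for `-list configure`) or a `Features: ...` line (the form I would expect from `convert -version`).

```python
def parse_features(text):
    """Return the set of feature names found in ImageMagick's output.

    Assumes a line whose first word is 'FEATURES' or 'Features:';
    returns an empty set if no such line exists.
    """
    for line in text.splitlines():
        key, _, rest = line.strip().partition(" ")
        if key.rstrip(":").upper() == "FEATURES":
            return set(rest.split())
    return set()


# The sample line is the one quoted earlier in this answer.
sample = "FEATURES      DPC HDRI OpenCL OpenMP Modules"
features = parse_features(sample)
print("OpenMP available:", "OpenMP" in features)
print("OpenCL available:", "OpenCL" in features)
```

In practice the text would come from something like `subprocess.run(["convert", "-list", "configure"], capture_output=True, text=True).stdout`.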

If neither of the two shows up:
look into getting the most recent version of ImageMagick that has OpenCL and/or OpenMP support compiled in.

If you build the package yourself from the sources, make sure OpenCL/OpenMP are used.
Do this by including the appropriate parameters into your ‘configure’ step:

./configure  [...other options...]  --enable-openmp  --enable-opencl

ImageMagick’s documentation about OpenMP and OpenCL is here:

  • Parallel Execution With OpenMP.
    Read it carefully.
    Because OpenMP is not a silver bullet, and it does not work under all circumstances…
  • Parallel Execution With OpenCL.
    The same as above applies here.
    Additionally, not all ImageMagick operations are OpenCL-enabled.
    The link here has a list of those which are.
    -resize is one of them.

Hints and instructions for building ImageMagick from sources and configuring the build, explaining the various options, are in ImageMagick’s documentation on installing from source.

That page also includes a short discussion of the --with-quantum-depth configure option.

3. Benchmark your ImageMagick

You can now also use the builtin -bench option to make ImageMagick run a benchmark for your command.
For example:

convert logo: -resize 500% -bench 10 logo.png

  Performance[4]: 10i 1.489ips 1.000e 6.420u 0:06.510

The above command with -resize 500% tells ImageMagick to run the convert command and scale the built-in IM logo: image by 500% in each direction.
The -bench 10 part tells it to run that same command 10 times in a loop and then print the performance results:

  • Since I have OpenMP enabled, I have 4 threads (Performance[4]:).
  • It reports that it ran 10 iterations (10i).
  • The speed was nearly 1.5 iterations per second (1.489ips).
  • Total user-allotted time was 6.420 seconds (6.420u), and the elapsed wall-clock time was 6.51 seconds (0:06.510).

If your result includes Performance[1]:, and only one line, then your system does not have OpenMP enabled.
(You may be able to switch it on, if your build does support it: run convert -limit thread 2.)
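
To compare several such runs (for example the eight Performance[...] lines in the question), it helps to parse them. Here is a rough Python sketch; the regular expression and the function name are my own, and it assumes the line layout shown above with the last field in minutes:seconds form.

```python
import re

# Matches e.g. "Performance[4]: 10i 1.489ips 1.000e 6.420u 0:06.510"
BENCH_RE = re.compile(
    r"Performance\[(?P<threads>\d+)\]:\s+"
    r"(?P<iterations>\d+)i\s+"
    r"(?P<ips>[\d.]+)ips\s+"
    r"(?P<e>[\d.]+)e\s+"
    r"(?P<user>[\d.]+)u\s+"
    r"(?P<minutes>\d+):(?P<seconds>[\d.]+)"
)


def parse_bench_line(line):
    """Turn one -bench result line into a dict of numbers."""
    m = BENCH_RE.search(line)
    if m is None:
        raise ValueError(f"not a -bench result line: {line!r}")
    return {
        "threads": int(m["threads"]),
        "iterations": int(m["iterations"]),
        "ips": float(m["ips"]),
        "user_seconds": float(m["user"]),
        "elapsed_seconds": int(m["minutes"]) * 60 + float(m["seconds"]),
    }


result = parse_bench_line("  Performance[4]: 10i 1.489ips 1.000e 6.420u 0:06.510")
print(result)
```

Sorting the parsed results by "ips" shows quickly which thread count works best on a given machine.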

4. Tweak your ImageMagick’s resource limits

Find out how your system’s ImageMagick is set up regarding resource limits.
Use this command:

identify -list resource
  File       Area     Memory     Map       Disk    Thread         Time
   384    8.590GB       4GiB    8GiB  unlimited         4    unlimited

Above shows my current system’s settings (not the defaults — I did tweak them in the past).
The numbers are the maximum amount of each resource ImageMagick will use.
You can use each of the keywords in the column headers to tune your system.
For this, use convert -limit <resource> <number> to set it to a new limit.

Maybe your result looks more like this:

identify -list resource
  File       Area     Memory     Map       Disk    Thread         Time
   192    4.295GB       2GiB    4GiB  unlimited         1    unlimited

  • The File limit defines the maximum number of concurrently open files which ImageMagick can use.
  • The Memory, Map, Area and Disk resource limits are defined in bytes.
    To set them to different values you can use SI prefixes (e.g. 500MB).
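
If you want to log or compare these limits across machines, the two-line table can be turned into a dictionary. This is only a sketch for the tabular layout shown above (newer ImageMagick versions print one "Name: value" pair per resource, which this does not handle); the function name is hypothetical.

```python
def parse_resource_table(text):
    """Parse the header/value table printed by `identify -list resource`.

    Assumes a header line starting with 'File' followed by one line of values.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for idx, ln in enumerate(lines[:-1]):
        if ln.split()[0] == "File":
            header = ln.split()
            values = lines[idx + 1].split()
            return dict(zip(header, values))
    raise ValueError("no resource table found")


sample = """identify -list resource
  File       Area     Memory     Map       Disk    Thread         Time
   192    4.295GB       2GiB    4GiB  unlimited         1    unlimited
"""
limits = parse_resource_table(sample)
print(limits["Thread"], limits["Memory"])
```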

When you do have OpenMP for ImageMagick on your system, you can run:

convert -limit thread 2

This enables 2 parallel threads as a first step.
Then re-run the benchmark and see if it really makes a difference, and if so how much.
After that you could set the limit to 4 or even 8 and repeat the exercise…

5. Use Magick Pixel Cache (MPC) and/or Magick Persistent Registry (MPR)

Finally, you can experiment with a special internal format of ImageMagick’s pixel cache.
This format is called MPC (Magick Pixel Cache).
It is not a conventional image file format.

When an MPC is created, the processed input image is stored as an uncompressed raster, exactly as ImageMagick holds it in RAM.
So basically, MPC is the native in-memory uncompressed format of ImageMagick, simply dumped directly to disk.
A read is a fast memory map from disk to memory as needed (similar to memory page swapping), so no image decoding is needed.

(More technical details: MPC as a format is not portable.
It also isn’t suitable as a long-term archive format.
Its only suitability is as an intermediate format for high-performance image processing.
It requires two files to support one image.)

When you save an image in this format, be aware that two files are written:

  • Image attributes are written to a file with the extension .mpc.
  • Image pixels are written to a file with the extension .cache.

Its main advantage is experienced when…

  1. …processing very large images, or when
  2. …applying several operations on one and the same image in “operation pipelines”.

MPC was designed especially for workflow patterns which match the criteria “read many times, write once”.

Some people say that for such operations the performance improves here, but I have no personal experience with it.

Convert your base picture to MPC first:

convert input.jpeg input.mpc

and only then run:

convert input.mpc [...your long-long-long list of crops and operations...]

Then see if this saves you significantly on time.

Most likely you can use this MPC format even “inline” (using the special mpc: notation, see below).

The MPR format (Magick Persistent Registry) does something similar.
It reads the image into a named memory register.
Your process pipeline can also read the image again from that register, should it need to access it multiple times.
The image persists in the register until the current command pipeline exits.

But I’ve never applied this technique to a real world problem, so I can’t say how it works out in real life.

6. Construct a suitable IM processing pipeline to do all tasks in one go

As you describe your process, it is composed of 4 distinct steps:

  1. Convert a TIFF to a JPEG.
  2. Resize the JPEG image to xx (?? what value ??)
  3. Crop the JPEG to 200px.
  4. Add a text watermark.

Please tell me if I understand your intentions correctly from reading your code snippets:

  • You have 1 input file, a TIFF.
  • You want 2 final output files:
    1. 1 thumbnail JPEG, sized 200×200 pixels;
    2. 1 labelled JPEG, with a width of 1024 pixels (height keeping aspect ratio of input TIFF).
  • 1 (unlabelled) JPEG is only an intermediate file which you do not really want to keep.

Basically, each step uses its own command — 4 different commands in total.
This can be sped up considerably by using a single command pipeline which performs all the steps on its own.

Moreover, you do not seem to need the unlabelled JPEG as an end result, yet the command that generates it as an intermediate temporary file saves it to disk. We can try to skip this step altogether and achieve the final result without this extra write to disk.

There are different approaches possible to this change.
I’ll show you (and other readers) only one for now — and only for the CLI, not for PHP.
I’m not a PHP guy — it’s your own job to ‘translate’ my CLI method into appropriate PHP calls.

(But by all means: please test with my commands first, really using the CLI, to see if the effort is worth while translating the approach to PHP!)

But please first make sure that you really understand the architecture and structure of more complex ImageMagick command lines!
For this goal, please refer to this other answer of mine:

Your 4 steps translate into the following individual ImageMagick commands:

convert image.tiff image.jpg

convert image.jpg -resize 1024x image-1024.jpg

convert image-1024.jpg -thumbnail 200x200 image-thumb.jpg

convert -background white image-1024.jpg label:12345 -append image-labelled.jpg

Now to transform this workflow into one single pipeline command…
The following command does this.
It should execute faster (regardless of what your results are when following my above steps 0.–4.):

convert image.tiff                                                             \
 -respect-parentheses                                                          \
 +write mpr:XY                                                                 \
  \( mpr:XY -resize 1024x                         +write image-1024.jpg \)     \
  \( mpr:XY -thumbnail 200x200                    +write image-thumb.jpg \)    \
  \( mpr:XY -resize 1024x -background white label:12345 -append +write image-labelled.jpg \) \
  null:

  • -respect-parentheses :
    required to make the sub-commands executed inside the \( .... \) parentheses truly independent of each other.
  • +write mpr:XY :
    used to write the input file to an MPR memory register.
    XY is just a label (you can use anything), needed to later re-call the same image.
  • +write image-1024.jpg :
    writes result of subcommand executed inside the first parentheses pair to disk.
  • +write image-thumb.jpg :
    writes result of subcommand executed inside the second parentheses pair to disk.
  • +write image-labelled.jpg :
    writes result of subcommand executed inside the third parentheses pair to disk.
  • null: :
    terminates the command pipeline.
    Required because we otherwise would end with the last subcommand’s closing parenthesis.
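
For readers who call ImageMagick from a script rather than a shell, the same pipeline can be assembled as an argument list. Below is a minimal Python sketch (the question uses PHP, where the idea is the same); the function name and parameters are hypothetical, and the -resize 1024x step is included as in the benchmark of section 7.

```python
def build_mpr_pipeline(src, out_1024, out_thumb, out_labelled, label,
                       convert="convert"):
    """Return the argv list for the single-process mpr: pipeline.

    `convert` may be a full path, e.g. to convert.exe on Windows.
    """
    return [
        convert, src,
        "-respect-parentheses",
        "+write", "mpr:XY",                      # keep the input in a memory register
        "(", "mpr:XY", "-resize", "1024x", "+write", out_1024, ")",
        "(", "mpr:XY", "-thumbnail", "200x200", "+write", out_thumb, ")",
        "(", "mpr:XY", "-resize", "1024x", "-background", "white",
        f"label:{label}", "-append", "+write", out_labelled, ")",
        "null:",                                 # discard what is left on the image stack
    ]


cmd = build_mpr_pipeline("image.tiff", "image-1024.jpg",
                         "image-thumb.jpg", "image-labelled.jpg", "12345")
print(" ".join(cmd))
```

When such a list is handed to `subprocess.run(cmd, check=True)` no shell is involved, so the parentheses need no backslash escaping and file names with spaces are passed through safely.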

7. Benchmarking 4 individual commands vs. the single pipeline

To get a rough feeling for how much my suggestion helps, I ran the commands below.

The first one runs the sequence of the 4 individual commands 100 times (and saves all resulting images under different file names).

time for i in $(seq -w 1 100); do
   convert image.tiff                                                          \
           image-indiv-run-${i}.jpg
   convert image-indiv-run-${i}.jpg -sample 1024x                              \
           image-1024-indiv-run-${i}.jpg
   convert image-1024-indiv-run-${i}.jpg -thumbnail 200x200                    \
           image-thumb-indiv-run-${i}.jpg
   convert -background white image-1024-indiv-run-${i}.jpg label:12345 -append \
           image-labelled-indiv-run-${i}.jpg
   echo "DONE: run indiv $i ..."
done

My result for 4 individual commands (repeated 100 times!) is this:

real  0m49.165s
user  0m39.004s
sys   0m6.661s

The second command times the single pipeline:

time for i in $(seq -w 1 100); do
    convert image.tiff                                        \
     -respect-parentheses                                     \
     +write mpr:XY                                            \
      \( mpr:XY -resize 1024x                                 \
                +write image-1024-pipel-run-${i}.jpg     \)   \
      \( mpr:XY -thumbnail 200x200                            \
                +write image-thumb-pipel-run-${i}.jpg    \)   \
      \( mpr:XY -resize 1024x                                 \
                -background white label:12345 -append         \
                +write image-labelled-pipel-run-${i}.jpg \)   \
       null:
   echo "DONE: run pipeline $i ..."
done

The result for single pipeline (repeated 100 times!) is this:

real   0m29.128s
user   0m28.450s
sys    0m2.897s

As you can see, the single pipeline is about 40% faster than the 4 individual commands!
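
The arithmetic behind that figure, using the two "real" times reported above, can be checked with a few lines of Python (the helper names are just for illustration):

```python
def time_saved_percent(baseline_s, improved_s):
    """Percentage of wall-clock time saved relative to the baseline."""
    return (baseline_s - improved_s) / baseline_s * 100


def speedup_factor(baseline_s, improved_s):
    """How many times faster the improved run is."""
    return baseline_s / improved_s


individual = 49.165   # 4 separate commands, 100 runs ("real" time in seconds)
pipeline = 29.128     # single pipeline, 100 runs ("real" time in seconds)

print(f"{time_saved_percent(individual, pipeline):.1f}% less wall-clock time")
print(f"{speedup_factor(individual, pipeline):.2f}x speed-up")
```

The same two helpers can be reused for the timings of the later variants.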

Now you can also invest in multi-CPU, much RAM, fast SSD hardware to speed things up even more 🙂

But first translate this CLI approach into PHP code…

There are a few more things to be said about this topic.
But my time runs out for now.
I’ll probably return to this answer in a few days and update it some more…

Update: I had to update this answer with new numbers for the benchmarking:
initially I had forgotten to include the -resize 1024x operation (stupid me!) into the pipelined version.
Having included it, the performance gain is still there, but not as big any more.

8. Use -clone 0 to copy image within memory

Here is another alternative to try instead of the mpr: approach with a named memory register as suggested above.

It uses (again within ‘side processing inside parentheses’) the -clone 0 operation.
The way this works is this:

  1. convert reads the input TIFF from disk once and loads it into memory.
  2. Each -clone 0 operator makes a copy of the first loaded image (because it has index 0 in the current image stack).
  3. Each “within-parenthesis” sub-pipeline of the total command pipeline performs some operation on the clone.
  4. Each +write operation saves the respective result to disk.

So here is the command to benchmark this:

time for i in $(seq -w 1 100); do
    convert image.tiff                                         \
     -respect-parentheses                                      \
      \( -clone 0 -thumbnail 200x200                           \
                  +write image-thumb-pipel-run-${i}.jpg    \)  \
      \( -clone 0 -resize 1024x                                \
                  -background white label:12345 -append        \
                  +write image-labelled-pipel-run-${i}.jpg \)  \
       null:
   echo "DONE: run pipeline $i ..."
done

My result:

real   0m19.432s
user   0m18.214s
sys    0m1.897s

To my surprise, this is faster than the version which used mpr: !

9. Use -scale or -sample instead of -resize

This alternative will most likely speed up your resizing sub-operation.
But it will likely lead to a somewhat worse image quality (you’ll have to verify, if this difference is noticeable).

For some background info about the difference between -resize, -sample and -scale see the following answer:

I tried it too:

time for i in $(seq -w 1 100); do
    convert image.tiff                                         \
     -respect-parentheses                                      \
      \( -clone 0 -thumbnail 200x200                           \
                  +write image-thumb-pipel-run-${i}.jpg    \)  \
      \( -clone 0 -scale 1024x                                 \
                  -background white label:12345 -append        \
                  +write image-labelled-pipel-run-${i}.jpg \)  \
       null:
   echo "DONE: run pipeline $i ..."
done

My result:

real   0m16.551s
user   0m16.124s
sys    0m1.567s

This is the fastest result so far (I combined it with the -clone variant).

Of course, this modification can also be applied to your initial method running 4 different commands.

10. Emulate the Q8 build by adding -depth 8 to the commands

I did not actually run and measure this, but the complete command would be:

time for i in $(seq -w 1 100); do
    convert image.tiff                                            \
     -respect-parentheses                                         \
      \( -clone 0 -thumbnail 200x200 -depth 8                     \
                  +write d08-image-thumb-pipel-run-${i}.jpg    \) \
      \( -clone 0 -scale 1024x       -depth 8                     \
                  -background white label:12345 -append           \
                  +write d08-image-labelled-pipel-run-${i}.jpg \) \
       null:
   echo "DONE: run pipeline $i ..."
done

This modification is also applicable to your initial “I run 4 different commands”-method.
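
The -clone 0 variants from sections 8 to 10 differ only in the resize operator and the optional -depth 8, so a script can generate all of them from one function. A small Python sketch follows; the function and parameter names are mine and not part of ImageMagick.

```python
def build_clone_pipeline(src, out_thumb, out_labelled, label,
                         resize_op="-resize", depth=None, convert="convert"):
    """Return the argv list for the -clone 0 pipeline.

    resize_op: one of -resize, -scale, -sample (speed vs. quality trade-off).
    depth:     e.g. 8 to add "-depth 8" to both sub-pipelines, or None.
    """
    if resize_op not in ("-resize", "-scale", "-sample"):
        raise ValueError(f"unsupported resize operator: {resize_op}")
    depth_args = ["-depth", str(depth)] if depth is not None else []
    return [
        convert, src,
        "-respect-parentheses",
        "(", "-clone", "0", "-thumbnail", "200x200", *depth_args,
        "+write", out_thumb, ")",
        "(", "-clone", "0", resize_op, "1024x", *depth_args,
        "-background", "white", f"label:{label}", "-append",
        "+write", out_labelled, ")",
        "null:",
    ]


# The section 10 variant: -scale plus -depth 8.
print(" ".join(build_clone_pipeline("image.tiff", "thumb.jpg", "labelled.jpg",
                                    "12345", resize_op="-scale", depth=8)))
```

Timing each variant on your own images is the only reliable way to see which trade-off between speed and quality is acceptable.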

11. Combine it with GNU parallel, as suggested by Mark Setchell

This of course is only applicable and reasonable for you, if your overall work process allows for such parallelization.

For my little benchmark testing it is applicable.
For your web service, it may be that you know of only one job at a time…

time for i in $(seq -w 1 100); do                                 \
    cat <<EOF
    convert image.tiff                                            \
      \( -clone 0 -scale  1024x         -depth 8                  \
                  -background white label:12345 -append           \
                  +write d08-image-labelled-pipel-run-${i}.jpg \) \
      \( -clone 0 -thumbnail 200x200  -depth 8                    \
                  +write d08-image-thumb-pipel-run-${i}.jpg   \)  \
       null:
    echo "DONE: run pipeline $i ..."
EOF
done | parallel --will-cite


real  0m6.806s
user  0m37.582s
sys   0m6.642s

The apparent contradiction between user and real time can be explained:
the user time represents the sum of all time ticks which were clocked on 8 different CPU cores.

From the point of view of the user looking at their watch, it was much faster: less than 10 seconds.
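
Dividing the CPU time by the wall-clock time gives a rough idea of how many cores were busy on average. A quick Python check with the numbers above (the helper name is made up):

```python
def average_busy_cores(real_s, user_s, sys_s):
    """CPU seconds consumed per second of wall-clock time."""
    return (user_s + sys_s) / real_s


real, user, sys_time = 6.806, 37.582, 6.642   # values reported by `time` above
cores = average_busy_cores(real, user, sys_time)
print(f"CPU time: {user + sys_time:.3f} s, wall-clock: {real:.3f} s")
print(f"Average cores busy: {cores:.1f}")
```

A value close to the number of available cores suggests the machine was well utilised; a much lower value would point to a bottleneck elsewhere, such as disk I/O.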

12. Summary

Pick your own preferences — combine different methods:

  1. Some speedup can be gained (with identical image quality as currently) by constructing a more clever command pipeline.
    Avoid running various commands (where each convert leads to a new process, and has to read its input from disk).
    Pack all image manipulations into one single process.
    Make use of the “parenthesized side processing”.
    Make use of -clone or mpr: or mpc: or even combine each of these.

  2. Some speedup can additionally be gained by trading image quality for performance.
    Some of your choices are:

    1. -depth 8 (has to be declared on the OP’s system) vs. -depth 16 (the default on the OP’s system)
    2. -resize 1024x vs. -sample 1024x vs. -scale 1024x

  3. Make use of GNU parallel if your workflow permits this.


As always, @KurtPfeifle has provided an excellently reasoned and explained answer, and everything he says is solid advice which you would do well to listen to and follow carefully.

There is a bit more that can be done though but it is more than I can add as a comment, so I am putting it as another answer, though it is only an enhancement on Kurt’s…

I do not know what size of input image Kurt used, so I made one of 3000×2000 and compared my run times with his to see if they were comparable since we have different hardware. The individual commands ran in 42 seconds on my machine and the pipelined ones ran in 36 seconds so I guess my image size and hardware are broadly similar.

I then used GNU Parallel to run the jobs in parallel – I think you will get a lot of benefit from that on a Xeon. Here is what I did…

time for i in $(seq -w 1 100); do
    cat <<EOF
    convert image.tiff                                        \
     -respect-parentheses                                     \
     +write mpr:XY                                            \
      \( mpr:XY -resize 1024x                                 \
                +write image-1024-pipel-run-${i}.jpg     \)   \
      \( mpr:XY -thumbnail 200x200                            \
                +write image-thumb-pipel-run-${i}.jpg    \)   \
      \( mpr:XY -background white label:12345 -append         \
                +write image-labelled-pipel-run-${i}.jpg \)   \
       null:
   echo "DONE: run pipeline $i ..."
EOF
done | parallel

As you can see, all I did was echo the commands that need running onto stdout and piped them into GNU Parallel. Run that way, it takes just 10 seconds on my machine.
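
If GNU Parallel is not available (for example on the Windows server from the question), a similar effect can be had from a script that starts several processes at once. The following Python sketch only illustrates the idea; `run_all` is a hypothetical helper, and the demo jobs are trivial Python processes standing in for the real convert commands.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=None):
    """Run each argv list as its own process, several at a time.

    Returns the exit codes in the same order as `commands`.
    """
    def run_one(argv):
        done = subprocess.run(argv, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        return done.returncode

    # Threads are enough here: each one just waits for its child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Stand-in jobs; in real use each entry would be one convert argv list.
demo_jobs = [[sys.executable, "-c", "pass"] for _ in range(4)]
print(run_all(demo_jobs, max_workers=2))
```

Setting max_workers to roughly the number of CPU cores is a reasonable starting point, but it is worth measuring, since ImageMagick's own OpenMP threads compete for the same cores.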

I also had a try at imitating the functionality using ffmpeg, and came up with this, which seems pretty similar on my test images – your mileage may vary.

for i in $(seq -w 1 100); do
    echo ffmpeg -y -loglevel panic -i image.tif ff-$i.jpg 
    echo ffmpeg -y -loglevel panic -i image.tif -vf scale=1024:682 ff-$i-1024.jpg
    echo ffmpeg -y -loglevel panic -i image.tif -vf scale=200:200 ff-$i-200.jpg
done | parallel

That runs in 7 seconds on my iMac with a 3000×2000 image.tif input file.

I failed miserably to get libjpeg-turbo installed with ImageMagick under homebrew.


I keep hearing from some people that GraphicsMagick (a fork from quite a few years back, that branched off from ImageMagick) is significantly faster than ImageMagick.

So I took this opportunity to give it a spin. Hence my second answer.

I ran the following loop of 4 separate gm commands. This makes the results comparable to the 4 separate convert commands documented in my other answer. It happened on the same machine:

time for i in $(seq -w 1 100); do 
 gm convert         image.tiff                         gm-${i}-image.jpg
 gm convert gm-${i}-image.jpg      -resize 1024x       gm-${i}-image-1024.jpg
 gm convert gm-${i}-image-1024.jpg -thumbnail 200x200  gm-${i}-image-thumb.jpg
 gm convert -background white    \
            gm-${i}-image-1024.jpg label:12345 -append gm-${i}-image-labelled.jpg
 echo "GraphicsMagick run no. $i ..."
done

Resulting times:

real   1m4.225s
user   0m51.577s
sys    0m8.247s

This means: for this particular job, and on this machine, my Q8 GraphicsMagick (version is 1.3.20 2014-08-16 Q8) is slower with 64 seconds needed than my Q16 ImageMagick (version is 6.9.0-0 Q16 x86_64 2014-12-06), which needed 50 seconds for 100 runs each.

Of course this short test and its results are by no means to be taken as a bullet-proof statement.

You may ask: What else were this machine and its OS doing while conducting each test? Which other apps were loaded into memory at the same time? And so on, and right you are. But you are now free to run your own tests. (One thing you can do to provide almost identical conditions for both tests: run them at the same time in 2 different terminal windows!)


I couldn’t resist trying this benchmark with vips. I used these two scripts:


for file in $*; do
        convert $file \
                -respect-parentheses \
                \( -clone 0 -resize 200x200 \
                        +write $file-thumb.jpg \)  \
                \( -clone 0 -resize 1024x \
                        -background white label:12345 -append \
                        +write $file-labelled.jpg \) \
                null:
done

and vips using the Python interface:


import sys
from gi.repository import Vips

for filename in sys.argv[1:]:
    im = Vips.Image.new_from_file(filename, access = Vips.Access.SEQUENTIAL)

    im = im.resize(1024.0 / im.width)
    mem = Vips.Image.new_memory()
    # render the resized image into the memory buffer so it can be reused twice
    im.write(mem)

    thumb = mem.resize(200.0 / mem.width)
    thumb.write_to_file(filename + "-thumb.jpg")

    txt = Vips.Image.text("12345", dpi = 72)
    footer = txt.embed(10, 10, mem.width, txt.height + 20)
    mem = mem.join(footer, "vertical")

    mem.write_to_file(filename + "-labelled.jpg")

Then on 100 3000 x 2000 RGB tiff images with IM 6.8.9-9 and vips 8.0, both with libjpeg-turbo, I see:

$ time ../im-bench.sh * 
real    0m32.033s
user    1m40.416s
sys 0m3.316s
$ time ../vips-bench.py *
real    0m22.559s
user    1m8.128s
sys 0m1.304s