I am building a website in CakePHP that processes files uploaded through an XML-RPC API and through a web frontend. Files need to be scanned by ClamAV, thumbnails need to be generated, et cetera. All of this is resource-intensive work that takes some time, and the user should not have to wait for it. So, I am looking into asynchronous processing with PHP in general and CakePHP in particular.
I came across the MultiTask plugin for CakePHP that looks promising. I also came across various message queue implementations such as dropr and beanstalkd. Of course, I will also need some kind of background process, probably implemented using a Cake Shell of some kind. I saw MultiTask using PHP_Fork to implement a multithreaded PHP daemon.
I need some advice on how to fit all these pieces together in the best way.
- Is it a good idea to have a long-running daemon written in PHP? What should I watch out for?
- What are the advantages of external message queue implementations? The MultiTask plugin does not use an external message queue. It rolls its own, using a MySQL table to store tasks.
- What message queue should I use? dropr? beanstalkd? Something else?
- How should I implement the backend processor? Is a forking PHP daemon a good idea or just asking for trouble?
My current plan is either to use the MultiTask plugin as it is or to edit it to use beanstalkd instead of its own MySQL table implementation. Jobs in the queue can simply consist of a task name and an array of parameters. The PHP daemon would watch for incoming jobs and pass them out to one of its child threads. These would simply execute the CakePHP Task with the given parameters.
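To make the "task name plus array of parameters" job format concrete, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and the names `TASKS`, `encode_job`, `run_job` and `make_thumbnail` are hypothetical stand-ins, not part of CakePHP, MultiTask or beanstalkd.

```python
import json


def make_thumbnail(path, width):
    """Hypothetical task; a real one would resize the image."""
    return f"thumbnail of {path} at {width}px"


# Hypothetical registry mapping task names to callables
# (stand-ins for CakePHP Tasks).
TASKS = {"make_thumbnail": make_thumbnail}


def encode_job(task, params):
    """Serialize a job as a task name plus a parameter mapping."""
    return json.dumps({"task": task, "params": params})


def run_job(payload):
    """Decode a job payload and dispatch it to the registered task."""
    job = json.loads(payload)
    handler = TASKS.get(job["task"])
    if handler is None:
        raise ValueError(f"unknown task: {job['task']}")
    return handler(**job["params"])


payload = encode_job("make_thumbnail", {"path": "upload.png", "width": 120})
result = run_job(payload)
print(result)
```

A plain JSON string like this can be stored as the body of a job in most queues, which keeps the producer and the worker decoupled.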
Any opinion, advice, comments, gotchas or flames on this?
I’ve had excellent results with BeanstalkD and a back-end written in PHP to retrieve jobs and then act on them. I wrapped the actual job-running in a bash script that keeps restarting the PHP script even if it exits (unless I do an ‘exit(UNIQNUM);’, which the wrapper checks for and then actually exits). That way, the restarted PHP script clears down any memory that may have been used, and can start afresh every 25/50/100 jobs it runs.
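The restart-wrapper idea can be sketched roughly as follows. This is a Python stand-in for the kind of bash wrapper described above, not the original script; `STOP_CODE` plays the role of UNIQNUM and its value, like the `supervise` function itself, is an assumption made for the example.

```python
import subprocess
import sys

STOP_CODE = 97  # hypothetical sentinel, the "UNIQNUM" of the text


def supervise(command, stop_code=STOP_CODE, max_restarts=None, run=subprocess.call):
    """Re-run `command` until it exits with `stop_code`.

    `run` is injectable so the loop can be exercised without real processes.
    Returns the number of times the command was started.
    """
    starts = 0
    while max_restarts is None or starts < max_restarts:
        starts += 1
        exit_code = run(command)
        if exit_code == stop_code:
            break  # the worker asked to stop for good
        # Any other exit (batch finished, crash) leads to a fresh process,
        # which starts with a clean memory footprint.
    return starts


if __name__ == "__main__":
    # Demo worker: a child Python process that exits with the sentinel at once.
    worker = [sys.executable, "-c", f"import sys; sys.exit({STOP_CODE})"]
    print(supervise(worker))
```

In a real setup the worker would process a fixed batch of jobs, exit normally to be restarted, and use the sentinel code only when it should stay down.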
A couple of the advantages of using it are that you can set priorities and delays on a BeanstalkD job: “run this at a lower priority, but don’t start for 10 seconds”. I’ve also queued a number of jobs up at the same time (run this now, in 5 seconds and again after 30 secs).
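How priority and delay interact can be shown with a small in-memory model. This is a Python sketch and not the beanstalkd client API: the class `DelayedPriorityQueue` and the default priority of 1024 are assumptions for the example. It does follow the beanstalkd convention that a smaller priority number is more urgent.

```python
import heapq
import itertools


class DelayedPriorityQueue:
    """In-memory model of a queue with per-job priority and delay."""

    def __init__(self):
        self._delayed = []  # heap of (ready_at, seq, priority, body)
        self._ready = []    # heap of (priority, seq, body)
        self._seq = itertools.count()  # tie-breaker, keeps FIFO order

    def put(self, body, priority=1024, delay=0, now=0.0):
        """Queue `body`; it becomes available `delay` seconds after `now`."""
        entry = (now + delay, next(self._seq), priority, body)
        heapq.heappush(self._delayed, entry)

    def reserve(self, now):
        """Return the most urgent job that is ready at `now`, or None."""
        # Move every job whose delay has elapsed into the ready heap.
        while self._delayed and self._delayed[0][0] <= now:
            _, seq, priority, body = heapq.heappop(self._delayed)
            heapq.heappush(self._ready, (priority, seq, body))
        if not self._ready:
            return None
        return heapq.heappop(self._ready)[2]


queue = DelayedPriorityQueue()
# "Run this now, in 5 seconds and again after 30 secs".
for delay in (0, 5, 30):
    queue.put(f"poll after {delay}s", delay=delay, now=0)
print(queue.reserve(now=0))
print(queue.reserve(now=1))
print(queue.reserve(now=6))
```

The clock is passed in explicitly so the behaviour is easy to follow; a real server tracks time itself.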
With the appropriate network configuration (and running it on an IP address that is accessible to the rest of your network), you can also run a beanstalkd daemon on one server and have it polled from a number of other machines, so if a large number of tasks are being generated, the work can be split between servers. If a particular set of tasks needs to be run on a particular machine, I’ve created a ‘tube’ named after that machine’s hostname, which should be unique within our cluster, if not globally (useful for file uploads). I found it worked perfectly for image resizing, often writing the finished smaller images to the file system before the web page that refers to them had even requested their URLs.
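The hostname-as-tube routing can be modelled like this. It is an in-memory Python sketch, not a beanstalkd client: `tubes`, `put` and `reserve` are hypothetical helpers, and checking the watched tubes in list order is a simplification of how a real server picks among them.

```python
import socket
from collections import defaultdict, deque

tubes = defaultdict(deque)  # in-memory stand-in: tube name -> FIFO of jobs


def put(tube, job):
    """Queue a job on the named tube."""
    tubes[tube].append(job)


def reserve(watched):
    """Return the next job from the first watched tube that has one."""
    for tube in watched:
        if tubes[tube]:
            return tubes[tube].popleft()
    return None


this_host = socket.gethostname()
put(this_host, "resize the upload stored on this machine")  # host-specific
put("default", "send welcome email")                        # any machine

# A worker on this machine watches its own tube plus the shared one.
print(reserve([this_host, "default"]))
# A worker elsewhere only ever sees the shared job.
print(reserve(["some-other-host", "default"]))
```

Because each worker only watches its own hostname tube and the shared one, a job that depends on a locally stored upload never lands on a machine that cannot see the file.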
I’m actually about to start writing a series of articles on this very subject for my blog (including some techniques for code that I’ve already pushed several million live requests through). My URL is linked from my user profile here on Stack Overflow.
(I’ve written a series of articles on the subject of Beanstalkd and queuing of jobs)
If you use a message queue like beanstalkd, you can start as many processes as you’d like (even on the same server). Each worker process will take one job from the queue and process it. You can add more workers and more servers if you need more capacity.
The nice thing about using single-threaded workers is that you don’t have to deal with synchronization inside a process. The job queue makes sure that a job is handed to only one worker at a time, so no job is handled twice.
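The reserve-then-delete pattern behind this can be illustrated with a deterministic model. This is a Python sketch with an in-memory `JobQueue` standing in for the queue server; the class and the round-robin `run_workers` loop are assumptions for the example, and real workers would be separate processes.

```python
from collections import deque


class JobQueue:
    """In-memory stand-in for a queue server with reserve/delete semantics."""

    def __init__(self, jobs):
        self._ready = deque(enumerate(jobs))  # (job_id, body) pairs
        self._reserved = {}                   # job_id -> worker holding it

    def reserve(self, worker):
        """Hand the next ready job to `worker`, or return None if empty."""
        if not self._ready:
            return None
        job_id, body = self._ready.popleft()
        self._reserved[job_id] = worker  # no other worker can get it now
        return job_id, body

    def delete(self, job_id):
        """Mark a reserved job as finished."""
        del self._reserved[job_id]


def run_workers(queue, workers):
    """Let workers take turns pulling jobs until the queue is empty."""
    handled = []
    active = True
    while active:
        active = False
        for worker in workers:
            job = queue.reserve(worker)
            if job is None:
                continue
            active = True
            job_id, body = job
            handled.append((worker, body))  # the "work" of this sketch
            queue.delete(job_id)
    return handled


handled = run_workers(JobQueue(["a", "b", "c", "d", "e"]), ["w1", "w2"])
print(handled)
```

Adding capacity in this model just means adding names to the worker list; the workers never coordinate with each other, only with the queue.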
Might also be worth checking out Amazon SQS to be used in conjunction with EC2?
What about Gearman? It has good support and integration in PHP, and features like parallel tasks, scaling, monitoring and so on…