I have a command that scrapes roughly 300K webpages, and it takes forever to run since there are a lot of pages and the target site throttles requests from the server I'm running on. The process of the web scraper is
POST Website > Scrape > Collect into Array > Write to DB
so all the steps after the POST are delayed, because even the first step alone takes forever. I'm looking to run multiple workers at once. The options I'm looking at are AsyncOperation and Queue Workers from Laravel, but I'm not exactly sure how I would implement either of those.
You likely want to use the queue/worker system, which is explained in detail in the Laravel queue documentation.
One possible setup uses Supervisor (a Linux process monitor), which makes sure that the
php artisan queue:work command keeps running in the background and gets restarted if an error occurs.
Within the Supervisor configuration you can then specify that you want 4 instances of it running by setting
numprocs=4 in the program section.
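A minimal Supervisor program section might look like the following sketch; the program name, paths, and worker options are placeholders you would adapt to your own deployment:

```ini
[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
; Keep "php artisan queue:work" running; adjust the path to your app
command=php /var/www/artisan queue:work redis --sleep=3 --tries=3
autostart=true
autorestart=true
; Run 4 worker processes in parallel
numprocs=4
redirect_stderr=true
stdout_logfile=/var/www/storage/logs/worker.log
```

After editing the configuration, reload Supervisor (e.g. `supervisorctl reread` and `supervisorctl update`) so the new program section takes effect.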
Basic queue explanation
So basically this all depends on a queue backend, which for Laravel could be
Redis (which I can recommend),
Beanstalkd, a regular
database table called “jobs” (the last one might not be the best choice for production environments), or any other driver you choose.
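Which backend is used is selected via the queue connection setting in your environment file; for the database driver, Laravel ships with artisan commands that generate the jobs table. A sketch, assuming a standard Laravel setup (the exact env key may be QUEUE_DRIVER on older Laravel versions):

```
# .env — pick the queue driver: redis, beanstalkd, database, ...
QUEUE_CONNECTION=redis

# Only for the database driver — create the "jobs" table:
#   php artisan queue:table
#   php artisan migrate
```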
Say you are running 4 workers: one of the running
queue:work processes will pick up and reserve a job as soon as one becomes available in your queue. Multiple jobs in the queue can thus be reserved by different workers at the same time.
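For the scraping pipeline in the question, the idea would be to dispatch one job per page instead of scraping everything in a single command, so the workers can reserve pages in parallel. A rough sketch, assuming PHP 8 and a standard Laravel app; the class name, constructor, and handle body are illustrative, not from the original post:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ScrapePage implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(private string $url) {}

    public function handle(): void
    {
        // POST Website > Scrape > Write to DB — but for one page,
        // so 4 workers can each process a different page at once.
        // ... fetch $this->url, parse the response, persist the result ...
    }
}
```

Your command would then loop over the 300K URLs and call something like `ScrapePage::dispatch($url);` for each one, pushing the jobs onto the queue for the workers to drain.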
Note that the processes run in parallel, which means that if you push 3 jobs to the queue, you can't assume they will be handled in the order 1-2-3. They are started in that order, but they might not finish in it. Keep that in mind for any read or write operations such as database queries. Depending on your needs, you can set the number of processes to
1 to guarantee the order of execution, but this may limit your throughput considerably.