We’re using a curl HEAD request in a PHP application to verify the validity of generic links. We check the status code just to make sure that the link the user has entered is valid. Links to all websites have succeeded, except LinkedIn.
While it seems to work locally (Mac), when we attempt the request from any of our Ubuntu servers, LinkedIn returns a 999 status code. Not an API request, just a simple curl like we do for every other link. We’ve tried on a few different machines and tried altering the user agent, but no dice. How do I modify our curl so that working links return a 200?
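For reference, the check itself is simple. Here is a minimal sketch of that kind of validity check in Python (our app is PHP, but the logic is the same); the function names `is_link_valid` and `status_ok` are mine, and the User-Agent is a placeholder:

```python
import urllib.error
import urllib.request

def status_ok(code: int) -> bool:
    # Treat any 2xx/3xx status as a valid link.
    # LinkedIn's non-standard 999 falls outside every standard range and fails this.
    return 200 <= code < 400

def is_link_valid(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and decide validity from the status code alone."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "Mozilla/5.0"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return status_ok(resp.status)
    except urllib.error.HTTPError as e:
        return status_ok(e.code)   # non-2xx raises, but the code is still there
    except urllib.error.URLError:
        return False               # DNS failure, refused connection, timeout, ...
```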
A sample HEAD request:
curl -I --url https://www.linkedin.com/company/linkedin
Sample Response on Ubuntu machine:
HTTP/1.1 999 Request denied
Date: Tue, 18 Nov 2014 23:20:48 GMT
Server: ATS
X-Li-Pop: prod-lva1
Content-Length: 956
Content-Type: text/html
To respond to @alexandru-guzinschi in a bit more detail: we have tried masking the user agent. To sum up our trials:
- Mac machine + Mac UA => works
- Mac machine + Windows UA => works
- Ubuntu remote machine + (no UA change) => fails
- Ubuntu remote machine + Mac UA => fails
- Ubuntu remote machine + Windows UA => fails
- Ubuntu local virtual machine (on Mac) + (no UA change) => fails
- Ubuntu local virtual machine (on Mac) + Windows UA => works
- Ubuntu local virtual machine (on Mac) + Mac UA => works
So now I’m thinking they block any curl request that doesn’t provide an alternate UA, and that they also block hosting providers?
Is there any other way I can check whether a link to LinkedIn is valid, or whether it will lead to their 404 page, from an Ubuntu machine using PHP?
It looks like they filter requests based on the user-agent:
$ curl -I --url https://www.linkedin.com/company/linkedin | grep HTTP
HTTP/1.1 999 Request denied

$ curl -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -I --url https://www.linkedin.com/company/linkedin | grep HTTP
HTTP/1.1 200 OK
I found a workaround. It’s important to set the accept-encoding header:
curl --url "https://www.linkedin.com/in/izman" \
  --header "user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36" \
  --header "accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" \
  --header "accept-encoding:gzip, deflate, sdch, br" \
  | gunzip
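The same workaround can be sketched programmatically. This Python illustration mirrors the curl command above: spoofed browser headers plus an explicit accept-encoding, then a manual gunzip of the body (the `fetch` and `decode_body` helpers are hypothetical names; I restrict accept-encoding to gzip/deflate since the stdlib cannot decode br):

```python
import gzip
import urllib.request

# Browser-like headers mirroring the curl workaround above.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
}

def decode_body(raw: bytes, content_encoding: str) -> str:
    """Undo the transfer encoding, like piping curl's output to gunzip."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")

def fetch(url: str) -> str:
    req = urllib.request.Request(url, headers=BROWSER_HEADERS)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding", ""))
```

(With plain curl you can also pass `--compressed` instead of piping to gunzip, and curl will decode the body itself.)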
It seems LinkedIn filters on both user agent AND IP address. I tried this both at home and from a DigitalOcean node:
curl -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -I --url https://www.linkedin.com/company/linkedin
From home I got a 200 OK; from DigitalOcean I got 999 Request denied.
Or you could set up a proxy on your home network, for example using a Raspberry Pi to proxy your requests. Here is a guide on that.
A proxy would work, but I think there’s another way around it. I can see that requests from AWS and other clouds are blocked by IP, while issuing the same request from my own machine works just fine.
I did notice that the response to the cloud service contains some JavaScript that the browser has to execute before it takes you to a login page. Once there, you can log in and access the page. That login wall only appears when you come from a blocked IP.
If you use a headless client that executes JavaScript, or perhaps go straight to the subsequent link and supply the credentials of a LinkedIn user, you may be able to bypass it.
LinkedIn does not allow direct access. They have blacklisted Heroku/AWS IP addresses, and the only supported way to access the data is through their APIs. If you want to scrape LinkedIn, it can still be done from a local machine or a headless browser, or through a proxy, because LinkedIn has blocked many server IPs.