I am trying to scrape an ‘.xlsx’ file from the Tax Foundation website. Sadly I keep receiving an error message that reads:
Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
I did some research, and the suggested fix is to change the file extension to ‘.xls’ instead of ‘.xlsx’. Can anyone help?
from bs4 import BeautifulSoup
import urllib.request
import os

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")
soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))

FHFA = os.chdir('C:/US_Census/Directory')

seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue

    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]  # file name without the extension
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
        print(filename)
        print(' ')

print("All files successfully downloaded.")
P.S. I know you can download the file, but I am web scraping it to automate a specific process.
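The problem is this line:
url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
If you go to the website and hover over the Excel download button, you'll see that the link is already a complete, much longer URL:
https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx
(notice the files. subdomain and the 20170710170238 segment). Prepending 'https://taxfoundation.org/' to that href produces an address that does not point to the spreadsheet, so you were never actually downloading the Excel file, and whatever was saved under the .xlsx name is not a valid workbook. Renaming it to .xls would not help. Here's the correct line: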
url = urllib.request.urlretrieve(href, file)
Everything else was working correctly.
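If you want the script to keep working whether a page uses absolute or relative links, one option is to resolve each href against the page URL with urllib.parse.urljoin instead of concatenating strings. The sketch below also checks that downloaded bytes are a ZIP archive (every .xlsx file is one) before you trust the file. The helper names resolve_link and looks_like_xlsx and the relative path are made up for illustration; they are not part of the original script.

```python
import io
import zipfile
from urllib.parse import urljoin

PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_link(page_url, href):
    """Return an absolute URL for href, whether it is relative or already absolute."""
    return urljoin(page_url, href)


def looks_like_xlsx(data):
    """Return True if the bytes form a ZIP archive, which every .xlsx file is."""
    return zipfile.is_zipfile(io.BytesIO(data))


# An absolute href (like the one on the page) is returned unchanged.
absolute = resolve_link(
    PAGE_URL,
    "https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx",
)

# A hypothetical relative href is resolved against the page's host.
relative = resolve_link(PAGE_URL, "/files/example.xlsx")

print(absolute)
print(relative)

# An HTML page saved under an .xlsx name fails the ZIP check.
print(looks_like_xlsx(b"<html>Not found</html>"))
```

In the original loop you would pass resolve_link(PAGE_URL, href) to urlretrieve, and you could read the saved file back in binary mode and run looks_like_xlsx on its contents before opening it in Excel.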