
python – How to change file extension?

Posted by: admin, April 23, 2020

Questions:

I am trying to scrape an ‘.xlsx’ file from the Tax Foundation website, but Excel keeps showing this error message: "Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file." From what I have read, the suggested fix is to change the file extension from ‘.xlsx’ to ‘.xls’. Can anyone help?

from bs4 import BeautifulSoup
import urllib.request
import os

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")

soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))

FHFA = os.chdir('C:/US_Census/Directory')

seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue

    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
    print(filename)

print(' ')
print("All files successfully downloaded.")

P.S. I know you can download the file, but I am web scraping it to automate a specific process.

Answers:

The problem is the line url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file). If you go to the website and hover over the Excel download button, you’ll see that the link is already a full URL on a different host: https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx (note the files. subdomain and the 20170710170238 path segment). Because href is already absolute, prepending 'https://taxfoundation.org/' produces an address that does not point at the workbook, so you were never actually downloading the Excel file. Pass href to urlretrieve unchanged:

url = urllib.request.urlretrieve(href, file)

Everything else was working correctly, so the file extension itself does not need to change.
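
If the page might mix absolute and relative links, a slightly more defensive variant could look like the sketch below. It relies on two standard-library facts: urllib.parse.urljoin returns an absolute href unchanged and resolves a relative one against the page URL, and an .xlsx workbook is a ZIP archive, so zipfile.is_zipfile can flag a download that is really an HTML error page. The names PAGE_URL, resolve_download_url and looks_like_xlsx are illustrative and do not come from the original post.

```python
import zipfile
from urllib.parse import urljoin

# Page being scraped (taken from the question).
PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_download_url(href, base=PAGE_URL):
    """Return an absolute URL for href.

    Absolute hrefs pass through unchanged; relative hrefs are
    resolved against the page URL.
    """
    return urljoin(base, href)


def looks_like_xlsx(path):
    """Return True if the file at path is a ZIP archive.

    This is only a rough sanity check: every .xlsx file is a ZIP
    archive, but not every ZIP archive is a workbook.
    """
    return zipfile.is_zipfile(path)
```

In the loop from the question, this would mean calling urllib.request.urlretrieve(resolve_download_url(href), file) and then checking looks_like_xlsx(file) before reporting the download as successful.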