python – I'm trying to scrape college football team rosters into an excel file and need help organizing the data

Posted by: admin May 14, 2020

Questions:

I’m trying to build a program using Python to scrape NCAA football rosters into an Excel file, but I can’t figure out how to organize the data in the way I want.

Currently I’m able to scrape all the text for all the players I want (names, height & weight, hometown, etc.), but it all comes out in one big clump. I’d like the names to be in one column, heights & weights in another, and so on. I just can’t find any information on how to do this when the data is not in a table.


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.select import Select
from tkinter import *

window = Tk()
window.title("Roster Scraper v1.0")
window.configure(background="light grey")
window.geometry('300x250')

TeamRoster = Label(window, text="Roster URL: ", font=("Arial"), fg="gray17")
TeamRoster.grid(column=0, row=0, sticky='e')
TeamRoster.configure(background="light grey")
URLEntry = Entry(window, width=20)
URLEntry.configure(background="light grey")
URLEntry.grid(column=1, row=0)

def ScrapeScript():

    DesiredRoster = (URLEntry.get())

    driver = webdriver.Firefox()

    driver.get(DesiredRoster)

    PlayerCard = driver.find_element_by_class_name('sidearm-roster-players').text
    print(PlayerCard)


SearchButton = Button(window, text="Scrape", command=ScrapeScript)
SearchButton.grid(column=1, row=3)
SearchButton.configure(background = "light grey")

window.mainloop()

The site I’m trying to scrape is Alabama’s team website: https://rolltide.com/roster.aspx?roster=226&path=football

A lot of college teams use this exact style of website so it’d be really helpful to not have to input all this data manually. Any help would be greatly appreciated.

Answers:

You need more specific rules so that each field in a row is scraped separately, instead of grabbing the whole roster as one block of text.

First, you could use find_elements_by_class_name (note the s in elements) to get all elements with class sidearm-roster-player-name, and separately all elements with class sidearm-roster-player-position, sidearm-roster-player-class-hometown, etc.

all_names = driver.find_elements_by_class_name('sidearm-roster-player-name')
all_positions = driver.find_elements_by_class_name('sidearm-roster-player-position')
all_hometowns = driver.find_elements_by_class_name('sidearm-roster-player-class-hometown')

Then you can use zip() to group them into tuples (name, position, hometown, etc.):

for name, position, hometown in zip(all_names, all_positions, all_hometowns):
    print(name.text, "|", position.text, "|", hometown.text)

Minimal working example:

from selenium import webdriver

url = 'https://rolltide.com/roster.aspx?roster=226&path=football'

driver = webdriver.Firefox()
driver.get(url)

all_names = driver.find_elements_by_class_name('sidearm-roster-player-name')
all_positions = driver.find_elements_by_class_name('sidearm-roster-player-position')
all_hometowns = driver.find_elements_by_class_name('sidearm-roster-player-class-hometown')

for name, position, hometown in zip(all_names, all_positions, all_hometowns):
    print(name.text, "|", position.text, "|", hometown.text)
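
One caveat with the zip() approach: zip() stops at the shortest input, so if any player is missing a field, later rows shift out of alignment. Here is a minimal sketch of that effect, using plain lists in place of the scraped elements. The names and values are made up for illustration and are not taken from the page.

```python
from itertools import zip_longest

# Hypothetical stand-ins for the .text of the scraped elements.
names = ["Player One", "Player Two", "Player Three"]
positions = ["QB", "WR", "LB"]
hometowns = ["Town A", "Town C"]  # assume Player Two has no hometown on the page

# zip() silently drops the last player and pairs "Town C" with the wrong name.
zipped = list(zip(names, positions, hometowns))

# zip_longest() keeps every player, but the misalignment is still there.
padded = list(zip_longest(names, positions, hometowns, fillvalue=""))

# A cheap sanity check before trusting the zipped rows.
lengths_match = len(names) == len(positions) == len(hometowns)

for row in padded:
    print(" | ".join(row))
print("lists aligned:", lengths_match)
```

Comparing the list lengths before zipping is a quick way to spot this problem. Scraping row by row avoids it entirely, because each field is looked up inside its own player's element.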

For more detailed scraping you can write more specific rules using XPath (find_elements_by_xpath).

You can also scrape all rows first and then use a for-loop to scrape the elements in each row separately.
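
The row-first pattern can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, on a small hypothetical fragment. The markup is a simplified assumption and not a copy of the real page, which is more deeply nested.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that reuses the class names from above.
SAMPLE = """
<div>
  <ul class="sidearm-roster-players">
    <li>
      <div class="sidearm-roster-player-name"><span>1</span><p>Player One</p></div>
      <div class="sidearm-roster-player-position"><span>QB</span></div>
    </li>
    <li>
      <div class="sidearm-roster-player-name"><span>2</span><p>Player Two</p></div>
      <div class="sidearm-roster-player-position"><span>WR</span></div>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# Step 1: select every row (one <li> per player).
rows = root.findall(".//ul[@class='sidearm-roster-players']/li")

# Step 2: inside each row, use paths relative to that row (leading ".").
records = []
for row in rows:
    number = row.find(".//div[@class='sidearm-roster-player-name']/span").text
    name = row.find(".//div[@class='sidearm-roster-player-name']/p").text
    position = row.find(".//div[@class='sidearm-roster-player-position']/span").text
    records.append((number, name, position))

for record in records:
    print(record)
```

The leading dot in each inner path is what keeps the search inside the current row. Without it, an XPath that starts with // searches the whole document again.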


from selenium import webdriver
import csv

url = 'https://rolltide.com/roster.aspx?roster=226&path=football'

driver = webdriver.Firefox()
driver.get(url)

all_rows = driver.find_elements_by_xpath('//ul[@class="sidearm-roster-players"]//li')

# newline='' stops the csv module from writing blank lines between rows on Windows
fh = open('output.csv', 'w', newline='')
csvwriter = csv.writer(fh)
# write headers
csvwriter.writerow(["Number", "Name", "Position", "Height", "Weight", "Hometown", "Highschool", "Academic year"])

for row in all_rows:  # use all_rows[:10] to test on the first 10 players only
    number = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-name"]//span').text
    print('number:', number)

    name = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-name"]//p').text
    #print('name:', name)

    position = row.find_element_by_xpath('.//div[@class="sidearm-roster-player-position"]/span').text
    #print('position:', position)

    height = row.find_element_by_class_name('sidearm-roster-player-height').text
    #print('height:', height)

    weight = row.find_element_by_class_name('sidearm-roster-player-weight').text
    #print('weight:', weight)

    # some classes seem to match two elements per row (the first is probably always empty), so join the text of all matches

    hometown = row.find_elements_by_class_name('sidearm-roster-player-hometown')
    hometown = ''.join(x.text for x in hometown)
    #print('hometown:', hometown)

    highschool = row.find_elements_by_class_name('sidearm-roster-player-highschool')
    highschool = ''.join(x.text for x in highschool)
    #print('highschool:', highschool)

    academic_year = row.find_elements_by_class_name('sidearm-roster-player-academic-year')
    academic_year = ''.join(x.text for x in academic_year)
    #print('academic_year:', academic_year)

    #print('---')
    csvwriter.writerow([number, name, position, height, weight, hometown, highschool, academic_year])

fh.close()
driver.quit()  # close the browser window when done
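
The CSV-writing step can also be checked on its own, without Selenium. This is a minimal sketch that assumes each player has already been collected into a dict. The records are hypothetical. csv.DictWriter is swapped in for the csv.writer used above so that missing fields become empty cells, and the output goes to an in-memory buffer instead of output.csv.

```python
import csv
import io

# Same column headers as in the scraping script above.
FIELDS = ["Number", "Name", "Position", "Height", "Weight",
          "Hometown", "Highschool", "Academic year"]

# Hypothetical records standing in for scraped rows.
players = [
    {"Number": "1", "Name": "Player One", "Position": "QB",
     "Height": "6' 2\"", "Weight": "210 lbs", "Hometown": "Town A, Ala.",
     "Highschool": "School A", "Academic year": "Jr."},
    {"Number": "2", "Name": "Player Two", "Position": "WR"},  # other fields missing
]

# In-memory file; for a real file use open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
writer.writerows(players)

csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm each value landed in the right column.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

The csv module quotes values that contain commas or double quotes, such as the hometown and height here, so each field stays in its own column when the file is opened in Excel.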