Home » Python » python – Identifying the most common words from a specific columns but only top 10 songs (another column)-Exceptionshub

python – Identifying the most common words from a specific columns but only top 10 songs (another column)-Exceptionshub

Posted by: admin February 24, 2020 Leave a comment

Questions:

I am having some trouble with this code. I am supposed to retrieve the 20 most common words from the top 10 songs of each year (1965-2015) there is a rank so I feel like I can identify the top 10 with rank <= 10. But I am just so lost on how to even begin. This is what I have so far. I have not included the top 10 ranked songs yet. Also, the 20 most common words are coming from the lyrics column (which is 4)

import collections
import csv
import re

words = re.findall(r'\w+', open('billboard_songs.csv').read().lower())
reader = csv.reader(words, delimiter=',')
csvrow = [row[4] for row in reader]
most_common = collections.Counter(words[4]).most_common(20)
print(most_common)

the first lines from my file are as follow:

"Rank","Song","Artist","Year","Lyrics","Source"   
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs .....,3   

when it gets to 100 (rank) it starts again at 1 for the following year and etc.

How to&Answers:

You can use csv.DictReader to parse the file and get a usable Python list of dictionaries out of it. Then, you can use for-comprehensions and itertools.groupby() to extract the song information you need. Finally, you can use collections.Counter to find the most common words in the songs.

#!/usr/bin/env python

import collections
import csv
import itertools


def analyze_songs(songs):
    # Grouping songs by year (groupby can only be used with a sorted list)
    sorted_songs = sorted(songs, key=lambda s: s["year"])
    for year, songs_iter in itertools.groupby(sorted_songs, key=lambda s: s["year"]):
        # Extract lyrics of top 10 songs
        top_10_songs_lyrics = [
            song["lyrics"] for song in songs_iter if song["rank"] <= 10
        ]

        # Join all lyrics together from all songs, and then split them into
        # a big list of words.
        top_10_songs_words = (" ".join(top_10_songs_lyrics)).split()

        # Using Counter to find the top 20 words
        most_common_words = collections.Counter(top_10_songs_words).most_common(20)

        print(f"Year {year}, most common words: {most_common_words}")


with open("billboard_songs.csv") as songs_file:
    reader = csv.DictReader(songs_file)

    # Transform the entries to a simpler format with appropriate types
    songs = [
        {
            "rank": int(row["Rank"]),
            "song": row["Song"],
            "artist": row["Artist"],
            "year": int(row["Year"]),
            "lyrics": row["Lyrics"],
            "source": row["Source"],
        }
        for row in reader
    ]

analyze_songs(songs)

In this answer, I assumed the following format for billboard_songs.csv:

"Rank","Song","Artist","Year","Lyrics","Source"
1,"wooly bully","sam the sham and the pharaohs",1965,"sam the sham miscellaneous wooly bully wooly bully sam the sham the pharaohs","Source Evian"

I’m assuming the dataset is from 1965 to 2015 as explained in the question. If not, the list of songs should first be filtered accordingly.