python – How do I transform a TF-IDF matrix into an overall dictionary of the top 10 words?

Posted by: admin, February 24, 2020

Questions:

I am trying to get the overall tf-idf score of words over a few texts. I am following the manual method of calculating tf-idf seen here: https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76

I am using these sentences: ['the man went out for a walk', 'the children sat around the fire']

The results can be seen in this pandas dataframe table:

[Image not preserved: pandas DataFrame of the TF-IDF results]

The dictionaries that are used to show the tf-idf result can be seen here:

[{'a': 0.09902102579427789, 'for': 0.09902102579427789, 'man': 0.09902102579427789, 'out': 0.09902102579427789, 'the': 0.0, 'walk': 0.09902102579427789, 'went': 0.09902102579427789}, 

{'around': 0.11552453009332421, 'children': 0.11552453009332421, 'fire': 0.11552453009332421, 'sat': 0.11552453009332421, 'the': 0.0}]
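
For reference, these values are consistent with a term frequency of count / sentence length multiplied by a natural-log inverse document frequency, log(N / df). The following is a minimal sketch under that assumption; the function name compute_tfidf and the whitespace tokenization are illustrative choices, not taken from the linked article.

```python
import math


def compute_tfidf(sentences):
    """Return one {word: tf-idf} dictionary per sentence.

    Assumes tf = count / sentence length and idf = ln(N / df),
    which matches the numbers shown above.
    """
    docs = [sentence.split() for sentence in sentences]
    n_docs = len(docs)
    vocab = {word for doc in docs for word in doc}

    # Document frequency: number of sentences that contain each word.
    df = {word: sum(1 for doc in docs if word in doc) for word in vocab}
    idf = {word: math.log(n_docs / df[word]) for word in vocab}

    results = []
    for doc in docs:
        scores = {
            word: (doc.count(word) / len(doc)) * idf[word]
            for word in sorted(set(doc))
        }
        results.append(scores)
    return results


sentences = ['the man went out for a walk', 'the children sat around the fire']
tfidf_results = compute_tfidf(sentences)
```

A word such as 'the', which occurs in every sentence, gets idf = log(2 / 2) = 0 and therefore a score of 0.0 regardless of how often it appears.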

How can I transform this list of TF-IDF result dictionaries into one dictionary of the top tf-idf results overall, in order?

Answers:

Since we are working with just a few sentences here, and given the nature of TF-IDF (i.e. a word's frequency within a document weighed against how common it is across the corpus), we can simply order your results from largest to smallest. To do that, we can use a function that sorts a dictionary like the ones shown in your question.

def sort_dictionary(my_dict):
    # Sort by score, highest first; dicts keep insertion order in Python 3.7+.
    return {k: v for k, v in sorted(my_dict.items(), key=lambda item: item[1], reverse=True)}

Applying it to the first dictionary gives:

{'a': 0.09902102579427789, 'for': 0.09902102579427789, 'man': 0.09902102579427789, 'out': 0.09902102579427789, 'walk': 0.09902102579427789, 'went': 0.09902102579427789, 'the': 0.0}

This matches the first input sentence, which contains 7 unique words. Taken together, the two sentences have thirteen words, eleven of them unique. Had there been hundreds, we could keep only the first ten entries of the sorted dictionary, and that would give us a top ten.
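
To get a single overall ranking from the whole list, one option is to merge the per-sentence dictionaries first and then keep the first n entries. The sketch below is one possible approach rather than a definitive one: the function name top_tfidf is hypothetical, and it assumes that a word appearing in several sentences should keep its highest score (summing or averaging would be equally defensible). Ties are broken alphabetically so that the output is deterministic.

```python
def top_tfidf(tfidf_dicts, n=10):
    """Merge per-document tf-idf dictionaries and return the top n words."""
    merged = {}
    for scores in tfidf_dicts:
        for word, score in scores.items():
            # Assumption: keep the highest score seen for each word.
            merged[word] = max(score, merged.get(word, 0.0))

    # Highest score first; equal scores are ordered alphabetically.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])


tfidf_results = [
    {'a': 0.09902102579427789, 'for': 0.09902102579427789,
     'man': 0.09902102579427789, 'out': 0.09902102579427789,
     'the': 0.0, 'walk': 0.09902102579427789, 'went': 0.09902102579427789},
    {'around': 0.11552453009332421, 'children': 0.11552453009332421,
     'fire': 0.11552453009332421, 'sat': 0.11552453009332421, 'the': 0.0},
]

top10 = top_tfidf(tfidf_results, n=10)
```

With the two example sentences there are eleven unique words, so the top ten contains everything except 'the', which scores 0.0.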