python – How do you return all the words in a text file (once per word) in alphabetical order?

Posted by: admin, February 24, 2020

Questions:

I have to return each word once, meaning that if a word is repeated in the file, it only gets printed one time, not twice (hence the unique part). I need help figuring out how to do that. I have it to where it is in alphabetical order, but I can’t figure out how to have the words print only once and not in a list.

Here is my code:

file = input("Enter the input file name:")
f = open(file, 'r')
words = f.read()
unique_words = sorted(words.split(' '))
for word in words:
    if word == word:
        value = word
        unique_words.remove(value)
    else:
        print(word)

Answers:

Lots of errors in such a small program. Let’s list them, shall we?

...
unique_words = sorted(words.split(' '))

Looking good so far. But then you run into several logical problems:

for word in words:

If you print out what word is, you will see it is a single character. That is because you are iterating over the original words string. You meant to use unique_words here, the list that you prepared just a line earlier.
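A quick way to see the difference for yourself. This is only a small sketch; the sample string is made up and stands in for the contents of your file:

```python
# Hypothetical sample text standing in for the contents of the file.
words = "the cat saw the dog"

# Iterating over a string yields one character at a time.
chars = [word for word in words]

# Iterating over the result of split() yields whole words.
tokens = [word for word in words.split()]

print(chars[:5])
print(tokens)
```

The first list holds single characters (including the spaces), the second holds the words.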

if word == word:

This cannot fail. A string is always equal to itself. (It’s hard to imagine otherwise, but do note that this is not guaranteed for every type of object; some objects can compare unequal to themselves.) You probably meant if word in words, and with the earlier correction, more probably if word in unique_words. That is a superfluous test: you are already looping over unique_words, so each word is already in unique_words. So even more probably you meant something like “does this word occur more than once in my list?”
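As an aside, here is a small illustration of that curious property. The example value, a floating-point NaN, is my own choice and has nothing to do with your program:

```python
# A string always compares equal to itself.
word = "hello"
string_equals_itself = (word == word)

# A floating-point NaN is the classic exception: it is not equal to itself.
nan = float("nan")
nan_equals_itself = (nan == nan)

print(string_equals_itself, nan_equals_itself)
```

For strings, then, the test word == word can never take the else branch.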

You may have written it this way because in other languages a double for loop is needed for such a check. Python lists have a count method; you could have used if unique_words.count(word) > 1 here. But you should not. Let’s see why.

    value = word
    unique_words.remove(value)

This is a huge issue. It changes the list unique_words while you are looping over it. That is a big no-no, because the loop keeps track of its position with an internal index, and once items have been removed that index no longer lines up with the list, so some words get skipped.
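A minimal sketch of what goes wrong, using a made-up list of four distinct words so the skipping is easy to spot:

```python
# Hypothetical word list; every item is distinct on purpose.
unique_words = ["apple", "banana", "cherry", "date"]

visited = []
for word in unique_words:
    visited.append(word)
    # Removing an item shifts the rest of the list one position to the left,
    # so the loop's next step jumps over the item that moved into this slot.
    unique_words.remove(word)

print(visited)       # only every other word was visited
print(unique_words)  # the skipped words are still in the list
```

Half of the words are never looked at, and no error is raised to warn you about it.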

That looping problem is introduced by my earlier fix for your code, so let’s assume you intended this instead:

for word in words.split():
    if word in unique_words:
        value = word
        unique_words.remove(value)
    else:
        print(word)

Notice the small adjustment to the first line. There is no error now, but it still does not work as intended, because it prints nothing. That is because you are now removing every word from unique_words.

You can solve that as per above, by using count again:

if unique_words.count(word) > 1:

and then finally you get only one occurrence printed per word.
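Put together, the loop could look roughly like this. It is a sketch only: the sample string replaces your file, and the printed list is an extra I added so the result can be inspected afterwards:

```python
# Hypothetical sample text standing in for the contents of the file.
words = "the cat saw the dog and the cat ran"
unique_words = sorted(words.split())

printed = []  # extra bookkeeping, not part of the original code
for word in words.split():
    if unique_words.count(word) > 1:
        # More copies remain: drop one and wait for a later occurrence.
        unique_words.remove(word)
    else:
        # Last remaining copy: print it.
        printed.append(word)
        print(word)
```

Each word is printed once, but at the position of its last occurrence in the text, so the output is not in alphabetical order. The list unique_words, on the other hand, ends up both sorted and free of duplicates.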

Since the only problem is getting a list of unique words, there is a simple and very Pythonic solution: use a set. The one unique thing about a set, compared to a list, is that every item is allowed to occur only once. If you convert a list to a set, all duplicates magically disappear.

print(set(unique_words))
# prints something like: {'alphabetical', 'printed', 'I', 'out', 'list.', 'that.', ...

But what happened to the sort order? A set is unordered: its contents are not stored as an ordered list, and the order you see depends on how the strings hash. So, the trick is to first eliminate duplicates and then sort:

unique_words = sorted(set(words.split(' ')))
print (unique_words)

where no further loops are necessary and words is just your original input string.
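One caveat, shown here as a small sketch with a made-up two-line sample: split(' ') splits on single spaces only, so newlines and doubled spaces end up in the result, whereas split() with no argument splits on any run of whitespace.

```python
# Hypothetical file contents: two lines, with a doubled space in the first.
words = "the cat  saw\nthe dog"

# Splitting on a single space keeps the newline and yields an empty string.
by_space = sorted(set(words.split(' ')))

# Splitting on any whitespace gives clean words.
by_whitespace = sorted(set(words.split()))

print(by_space)
print(by_whitespace)
```

If your file has more than one line, words.split() is probably the safer choice.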

Answer:

You can use a set to remove duplicates, and then pass that to the builtin sorted() function.

file = input("Enter the input file name:")
with open(file) as f:
    for word in sorted(set(f.read().split())):
        print(word)

Here “word” means groups of characters separated by whitespace. Depending on your file, this might be good enough. If you need to filter punctuation, you can use a regex instead of .split(). You could also coerce to lowercase if you don’t want an uppercase version counting as a different word. Depends on your file and what exactly you are trying to do.

import re
file = input("Enter the input file name:")
with open(file) as f:
    for word in sorted(set(re.findall(r'\w+', f.read().lower()))):
        print(word)

The \w+ will match runs of “word characters” (letters, digits and underscores), while .lower() converts the whole string read from the file into lowercase.
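To compare the two approaches side by side, here is a small sketch on a made-up sentence instead of a file:

```python
import re

# Hypothetical sample text with punctuation and mixed case.
text = "The cat saw the dog. The dog, in turn, saw the cat!"

# Plain whitespace splitting keeps punctuation and treats "The"/"the" as different.
split_words = sorted(set(text.split()))

# Lowercasing first and matching \w+ strips punctuation and merges the cases.
regex_words = sorted(set(re.findall(r'\w+', text.lower())))

print(split_words)
print(regex_words)
```

Note that in the first result 'The' sorts before all the lowercase words, because uppercase letters compare lower than lowercase ones. Also keep in mind that \w does not match apostrophes, so a word like "don't" would be split in two.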