Home » Python » How can I search sub-folders using glob.glob module in Python?

How can I search sub-folders using glob.glob module in Python?

Posted by: admin November 30, 2017 Leave a comment

Questions:

I want to open a series of subfolders in a folder and find some text files and print some lines of the text files. I am using this:

configfiles = glob.glob('C:/Users/sam/Desktop/file1/*.txt')

But this cannot access the subfolders as well. Does anyone know how I can use the same command to access subfolders as well?

Answers:

In Python 3.5 and newer use the new recursive **/ functionality:

configfiles = glob.glob('C:/Users/sam/Desktop/file1/**/*.txt', recursive=True)

When recursive is set, ** followed by a path separator matches 0 or more subdirectories.

In earlier Python versions, glob.glob() cannot list files in subdirectories recursively.

In that case I’d use os.walk() combined with fnmatch.filter() instead:

import os
import fnmatch

path = 'C:/Users/sam/Desktop/file1'

configfiles = [os.path.join(dirpath, f)
    for dirpath, dirnames, files in os.walk(path)
    for f in fnmatch.filter(files, '*.txt')]

This’ll walk your directories recursively and return all absolute pathnames to matching .txt files. In this specific case the fnmatch.filter() may be overkill, you could also use a .endswith() test:

import os

path = 'C:/Users/sam/Desktop/file1'

configfiles = [os.path.join(dirpath, f)
    for dirpath, dirnames, files in os.walk(path)
    for f in files if f.endswith('.txt')]

Questions:
Answers:

The glob2 package supports wild cards and is reasonably fast

code = '''
import glob2
glob2.glob("files/*/**")
'''
timeit.timeit(code, number=1)

On my laptop it takes approximately 2 seconds to match >60,000 file paths.

Questions:
Answers:

To find files in immediate subdirectories:

configfiles = glob.glob(r'C:\Users\sam\Desktop\*\*.txt')

For a recursive version that traverse all subdirectories, you could use ** and pass recursive=True since Python 3.5:

configfiles = glob.glob(r'C:\Users\sam\Desktop\**\*.txt', recursive=True)

Both function calls return lists. You could use glob.iglob() to return paths one by one. Or use pathlib:

from pathlib import Path

path = Path(r'C:\Users\sam\Desktop')
txt_files_only_subdirs = path.glob('*/*.txt')
txt_files_all_recursively = path.rglob('*.txt') # including the current dir

Both methods return iterators (you can get paths one by one).

Questions:
Answers:

You can use Formic with Python 2.6

import formic
fileset = formic.FileSet(include="**/*.txt", directory="C:/Users/sam/Desktop/")

Disclosure – I am the author of this package.

Questions:
Answers:

Here is a adapted version that enables glob.glob like functionality without using glob2.

def find_files(directory, pattern='*'):
    if not os.path.exists(directory):
        raise ValueError("Directory not found {}".format(directory))

    matches = []
    for root, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            full_path = os.path.join(root, filename)
            if fnmatch.filter([full_path], pattern):
                matches.append(os.path.join(root, filename))
    return matches

So if you have the following dir structure

tests/files
├── a0
│   ├── a0.txt
│   ├── a0.yaml
│   └── b0
│       ├── b0.yaml
│       └── b00.yaml
└── a1

You can do something like this

files = utils.find_files('tests/files','**/b0/b*.yaml')
> ['tests/files/a0/b0/b0.yaml', 'tests/files/a0/b0/b00.yaml']

Pretty much fnmatch pattern match on the whole filename itself, rather than the filename only.

Questions:
Answers:

configfiles = glob.glob('C:/Users/sam/Desktop/**/*.txt")

Doesn’t works for all cases, instead use glob2

configfiles = glob2.glob('C:/Users/sam/Desktop/**/*.txt")

Questions:
Answers:

If you can install glob2 package…

import glob2
filenames = glob2.glob("C:\top_directory\**\*.ext")  # Where ext is a specific file extension
folders = glob2.glob("C:\top_directory\**\")

All filenames and folders:

all_ff = glob2.glob("C:\top_directory\**\**")  

Questions:
Answers:

As pointed out by Martijn, glob can only do this through the **operator introduced in Python 3.5. Since the OP explicitly asked for the glob module, the following will return a lazy evaluation iterator that behaves similarly

import os, glob, itertools

configfiles = itertools.chain.from_iterable(glob.iglob(os.path.join(root,'*.txt'))
                         for root, dirs, files in os.walk('C:/Users/sam/Desktop/file1/'))

Note that you can only iterate once over configfiles in this approach though. If you require a real list of configfiles that can be used in multiple operations you would have to create this explicitly by using list(configfiles).

Questions:
Answers:

If you’re running Python 3.4+, you can use the pathlib module. The Path.glob() method supports the ** pattern, which means “this directory and all subdirectories, recursively”. It returns a generator yielding Path objects for all matching files.

from pathlib import Path
configfiles = Path("C:/Users/sam/Desktop/file1/").glob("**/*.txt")