Python is not only a clever language: its clear syntax, wealth of modules and intuitiveness have all contributed to its success and adoption in a wide variety of domains.
It is often useful to traverse a directory and operate on the files within all of its subdirectories. Many utilities do exactly this, such as ls, find and tar, and most of us use them daily. Sooner or later, most script writers and programmers need to handle files throughout a directory structure programmatically. The following program shows a reusable function (largely inspired by the Python documentation itself) that traverses a directory tree and calls a callback function for each regular file. I hope you find it useful:
#!/usr/bin/env python
import os
import sys
from stat import S_ISDIR, S_ISREG, ST_MODE


def walktree(top, callback):
    '''recursively descend the directory tree rooted at top,
    calling the callback function for each regular file'''
    for f in lsdir(top):
        pathname = os.path.join(top, f)
        # lstat does not follow symbolic links, so links are reported as skipped
        mode = os.lstat(pathname)[ST_MODE]
        if S_ISDIR(mode):
            # It's a directory, recurse into it
            print('Entering %s' % pathname)
            walktree(pathname, callback)
            print('Leaving %s' % pathname)
        elif S_ISREG(mode):
            # It's a file, call the callback function
            callback(pathname)
        else:
            # Unknown file type, print a message
            print('Skipping %s' % pathname)


def lsdir(dir):
    '''list the contents of a directory ordering them
    files first and dirs last in the returned list'''
    files = []
    dirs = []
    try:
        for file in os.listdir(dir):
            if S_ISDIR(os.lstat(os.path.join(dir, file))[ST_MODE]):
                dirs.append(file)
            else:
                files.append(file)
        return files + dirs
    except OSError:
        # Unreadable directory (or an entry that vanished): treat it as empty
        return []


def visitfile(file):
    print('visiting %s' % file)


if __name__ == '__main__':
    walktree(sys.argv[1], visitfile)
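The callback does not have to be a plain function; any callable that accepts a pathname will do. As a rough sketch of that idea, the block below passes an object that keeps a running total. Note the assumptions: SizeCounter is a hypothetical name invented for this illustration, the walktree here is a condensed stand-in for the one above (no Entering/Leaving messages, no files-first ordering), and the directory tree is a throwaway one built with tempfile so the example does not depend on any real path.

```python
import os
import tempfile


def walktree(top, callback):
    """Condensed stand-in for the walktree above: call callback(pathname)
    for every regular file under top (symbolic links are skipped)."""
    for name in sorted(os.listdir(top)):
        pathname = os.path.join(top, name)
        if os.path.islink(pathname):
            continue
        if os.path.isdir(pathname):
            walktree(pathname, callback)
        elif os.path.isfile(pathname):
            callback(pathname)


class SizeCounter(object):
    """Hypothetical callback object: counts files and sums their sizes."""

    def __init__(self):
        self.count = 0
        self.total_bytes = 0

    def __call__(self, pathname):
        self.count += 1
        self.total_bytes += os.path.getsize(pathname)


# Build a small throwaway tree: one file at the top, one in a subdirectory.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, 'sub'))
for relpath, data in (('a.txt', b'12345'), (os.path.join('sub', 'b.txt'), b'123')):
    with open(os.path.join(demo_root, relpath), 'wb') as handle:
        handle.write(data)

counter = SizeCounter()
walktree(demo_root, counter)
print('%d files, %d bytes' % (counter.count, counter.total_bytes))
```

Keeping the state on an object avoids global variables, which is one of the usual ways to get results back out of a callback-driven traversal.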
As my friend David Wragg pointed out, Python is moving toward preferring generators to callbacks. Thanks! A few modifications:
#!/usr/bin/env python
import os
import sys
from stat import S_ISDIR, S_ISREG, ST_MODE


def walktree(top):
    '''recursively descend the directory tree rooted at top,
    yielding each regular file'''
    for f in lsdir(top):
        pathname = os.path.join(top, f)
        # lstat does not follow symbolic links, so links are reported as skipped
        mode = os.lstat(pathname)[ST_MODE]
        if S_ISDIR(mode):
            # It's a directory, recurse into it
            print('Entering %s' % pathname)
            # Re-yield everything the recursive call produces
            for child in walktree(pathname):
                yield child
            print('Leaving %s' % pathname)
        elif S_ISREG(mode):
            # It's a file, yield filename
            yield pathname
        else:
            # Unknown file type, print a message
            print('Skipping %s' % pathname)


def lsdir(dir):
    '''list the contents of a directory ordering them
    files first and dirs last in the returned list'''
    files = []
    dirs = []
    try:
        for file in os.listdir(dir):
            if S_ISDIR(os.lstat(os.path.join(dir, file))[ST_MODE]):
                dirs.append(file)
            else:
                files.append(file)
        return files + dirs
    except OSError:
        # Unreadable directory (or an entry that vanished): treat it as empty
        return []


def visitfile(file):
    print('visiting %s' % file)


if __name__ == '__main__':
    for file in walktree(sys.argv[1]):
        visitfile(file)
Which method you prefer is, of course, your call, and both have distinct advantages. However, lengthy processing of each file, or a heavily populated directory, leaves the traversal exposed to changes made to the directory structure while it runs and, for large directories, to excessive memory usage. In those cases, the generator method is much preferable.
jc@ldesktop:~$ time ./walkdir_callback.py ~ | wc -l
52387

real    0m1.349s
user    0m0.780s
sys     0m0.560s
jc@ldesktop:~$ time ./walkdir_generator.py ~ | wc -l
52387

real    0m1.404s
user    0m0.756s
sys     0m0.624s
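One practical consequence of the generator style is that the caller decides how far the traversal goes: files are produced one at a time, so the loop can stop early without visiting the rest of the tree. The sketch below illustrates this under some assumptions: walk_files is a hypothetical helper built on the standard library's os.walk rather than on the walktree shown earlier, and the ten files live in a temporary directory created just for the demonstration.

```python
import itertools
import os
import tempfile


def walk_files(top):
    """Hypothetical generator: lazily yield each regular file under top.
    Built on os.walk, not on the walktree from this article."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            pathname = os.path.join(dirpath, name)
            # Mirror the article's behaviour of ignoring symbolic links
            if os.path.isfile(pathname) and not os.path.islink(pathname):
                yield pathname


# Throwaway directory holding ten small files.
demo_root = tempfile.mkdtemp()
for index in range(10):
    with open(os.path.join(demo_root, 'file%d.txt' % index), 'w') as handle:
        handle.write('x')

# islice pulls only three items from the generator, then stops asking for more.
first_three = list(itertools.islice(walk_files(demo_root), 3))
print(len(first_three))
```

With a callback, stopping early usually means raising an exception or threading a flag through the traversal; with a generator, the consumer simply stops iterating.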