PREAMBLE. Most UNIX programs do not understand directory structures. They rely on the shell to provide ready-made file names. But when shell globbing is used extensively, its advantages are rapidly negated. find provides a coherent approach for complex file searches and is therefore a key UNIX utility. However, its command-line interface differs markedly from that of most other UNIX commands.

Many command-line stubs and fragmented examples float around the Web. This survey therefore intends to give a systematic overview plus concrete approaches for some common applications of find. We begin with a sketch of files and ASCII text (the fundamental carriers of information in UNIX) and proceed with an insight into the find mini-language used to query the UNIX file system.

We go on to present the four central ideas behind find (find by file attribute/location, recursion and pruned directories, find by content/command execution, find in scripts).

Further topics are UNIX processes and pipes, motivated by the questions of how find behaves when a pipe breaks and how find bypasses shell quoting rules.

We close with a treatment of find versus ls. The accompanying code examples are intended to help you get started quickly, even if you have never programmed UNIX before.

Keywords: UNIX, find.

Venue

If you are a UNIX developer and you want to know more about the options that are available to you when you develop software for UNIX, or if you work regularly with UNIX using the shell, this article is written for you.

Files

Any UNIX system is a time-sharing system which concurrently runs processes and attends to multiple users. What all of this revolves around is files ([Ritchie84]). An urban legend says that after MULTICS was abandoned "they needed to rewrite an operating system (OS) in order to play space war on another smaller machine, a DEC PDP-7". In an interview Ken Thompson explained that he actually implemented UNIX as a test environment for a hierarchical file-system he had designed earlier ([Seibel09]). What is true? In fact many resources under UNIX are mapped into the file system: devices such as ttys, and concepts such as FIFOs and processes.

Secondly, ever since it began to escape from AT&T’s Bell Laboratories in the early 1970s, UNIX has followed a "component philosophy": in UNIX, programs enhance the functionality of programs. Under the UNIX shell, programs are like commands in a high-level language to handle files and file contents. Where other languages are built upon scalars and a few keywords, UNIX shell programs are built upon programs and a few keywords.

But the going is tough. Most UNIX tools are not good at finding files, with no or only simple abilities for directory traversal. Only a small set of UNIX programs actively find files:

  • ls, grep, chmod, chgrp, rsync, rm and cp, for example, provide -r and -R options;

  • tar, find and du process sub-directories by design.

UNIX programs with -r are problematic: usually they

  • find all file types,

  • run into infinite loops caused by hard and soft links, and

  • descend into mounted file-systems.

Note
Even more natural than files under UNIX are lines of ASCII text. For the UNIX kernel input and output goes through a teletype-printer (tty).

Shell globbing, POSIX find and GNU find

The reason why UNIX programs treat the file-system lightly is the efficient shell quoting and globbing rules. The first UNIX shell was the Thompson Shell (1971), and it already treated the characters * ? [ ] specially
[The Thompson Shell was introduced in the first version of UNIX 1971, and was written by Ken Thompson. It was a simple command interpreter, not designed for scripting, but it introduced pipes, quoting and glob-patterns.]
. Words that use these characters are called wildcards, glob-patterns or globs
[From the glob and fnmatch system functions.]
. To get some C source files, for example, we can use

$ ls -a *.c */*.c libext/*/*.c libext/*/*.c

Multiple * characters may occur in the same word, and the shell will know where to expand directory names, and where to expand filenames.

                         file-names generated by the shell
                          |
                          |
                +-----+------------+------------+
                |     |            |            |
                v     v            v            v
        $ ls -a *.c */*.c libext/*/*.c libext/*/*.c
                    ^            ^            ^
                    |            |            |
                    +------------+------------+
                                 |
                                 |
                                directory-names generated by the shell

This is a deep concept. Without such a higher level of interpretation each program would believe in its worm’s-eye view. The UNIX philosophy ("programs enhance programs") wouldn’t work
[Users of MSDOS command.com could write a book about it. But UNIX missed something too: coherent command-line-parsing. Arguments are not separated into filename and option-name arguments. So some programs accept long option-names, some let short and long names both begin with "-", or use "+"; some won’t accept "--" as the end of the argument list, and some allow "-" as filename (which is a bad idea, because it names stdin).]
. But wildcards are not a general method to find files. The above command shows a few problems:

  • It will list some .c-files and misses others, e.g. those in ./libext/*/*/*.c.

  • It is tedious to type.

  • The shell will not match files in directories starting with a period.

  • The argument-list may become too long.

find solves all these problems. The two major standards are POSIX find and GNU find, part of the GNU findutils. The latter has useful additional capabilities, such as -regex, -delete and -print0.
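As a rough sketch of the difference, the glob command above can be compared with a single find call. The directory layout below is invented purely for illustration and is created in a scratch directory.

```shell
# Hypothetical layout, created in a scratch directory for illustration.
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir -p libext/a/b .hidden
touch main.c libext/a/x.c libext/a/b/deep.c .hidden/secret.c

# One pattern, any depth, dot-directories included. The quotes keep
# the shell from expanding *.c before find sees it.
c_files=$(find . -type f -name '*.c' | LC_ALL=C sort)
printf '%s\n' "$c_files"
```

The deeply nested file and the one in the dot-directory are both reported, which the four glob words would have missed.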

What find finds

The find program has become the tool of choice to examine the file system and file contents. For example

grep -r "search string" .

is roughly equivalent to the basic

find . -type f -exec grep -H "search string" '{}' \;

but find can select files more subtly than that, by searching the file’s

  • name and type,

  • size,

  • user/group ids,

  • access permissions,

  • access times,

  • special mode bits,

  • the directory under which files are stored,

  • the disk partition and the file-system type.
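A minimal sketch of how a few of these attributes translate into tests follows. The file names are made up, and the k suffix of -size is an extension (available in GNU find); POSIX only specifies 512-byte blocks and the c suffix.

```shell
# Scratch directory with one small and one larger file (hypothetical names).
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir docs
printf 'short\n' > docs/note.txt
dd if=/dev/zero of=docs/big.bin bs=1024 count=4 2>/dev/null

# Select by type and size: regular files larger than 1 KiB.
large=$(find . -type f -size +1k)
# Select by type and name: directories called docs.
dirs=$(find . -type d -name 'docs')
printf '%s\n%s\n' "$large" "$dirs"
```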

The find command line

        find [options]... [<search-root>] <search-expression>
              ^             ^              ^
              |             |              |
              |             |              |
             -H, -L, -P    directories    mini-language
                                          narrowing down files/directories

The find utility recursively descends the directory hierarchy of each search-root, evaluating the Boolean expression search-expression.

find only has three real command-line options: -H, -L and -P. All three control the treatment of symbolic links.

  • The default behavior is to never follow symbolic links (-P).

  • The -L flag causes find to follow symbolic links.

  • The -H flag follows symbolic links only while processing the command-line arguments.

These options are mutually exclusive, but using them together is not considered an error. The last option specified determines the behavior.

By default, when no search-roots are defined, find returns a list of all files below the current working directory.

The first argument with a -, ! or ( as the first character after search-root is interpreted as start of search-expression. The search expression contains elements of a mini-language.

Elements are interpreted by find and must be protected from the shell wherever they use the characters ( ) ;. find handles wildcards internally, so arguments to -name, -path, -wholename and -regex that use wildcard characters must also be quoted. See pattern arguments for more information.

Elements are whitespace-separated, i.e. pass \( -name foo \), not \(-name foo\) or '( -name foo )'.

Each element in the search-expression yields true or false. The whole expression is evaluated from left to right. Standard Boolean logic applies, and so not all elements might be evaluated.

The exit code is 0 for ok or >0 when an error occurred. Note that the exit code is 0 too when no files were found.
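The exit-code rule can be checked with a short sketch. The directory and file names are placeholders; the second call deliberately points at a search-root that does not exist.

```shell
demo=$(mktemp -d)

# Nothing matches, yet find still reports success.
find "$demo" -name 'no-such-name'
status_nomatch=$?

# A missing search-root is an error and yields a non-zero status.
find "$demo/missing" -name 'x' 2>/dev/null
status_error=$?

echo "no match: $status_nomatch, error: $status_error"
```

A script therefore cannot use the exit code alone to tell whether anything was found; it has to inspect the output.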

Pattern arguments

Several expressions require pattern arguments (globs): -name, -path and -wholename; -regex takes a regular expression instead. All of these arguments are interpreted by find internally and must be protected from the shell.

*

Matches any zero or more characters.

?

Matches any one character.

[string]

Matches exactly one character that is a member of string (the character class). string may contain ranges. For example, [a-z0-9_] matches a lowercase letter, a number, or an underscore. Negate a class by placing a ! or ^ immediately after the opening bracket.

\

Removes the special meaning of the character that follows it. Works also in character classes.

find expands patterns differently from the shell:

  • The * and ? wildcards will match a dot at the beginning of a filename.

  • Slash characters have no special significance, unlike in the shell, in which wildcards do not match them.

  • Where the shell can distinguish between wildcards for directory names and file names, find uses different search-expressions.

    • The -path pattern applies to the whole path-name.

    • The -name pattern applies only to the basename.

For example, -path '*foo*bar' matches files named ./foobar, ./foo/.bar and ./foo/foo/bar. The leading * is needed because the pattern must match the whole path-name, including the ./ prefix.

The same pattern, when passed as -name '*foo*bar', is only applied to the filename (basename); it matches ./foobar and ./bluh/foo.bar, but not ./foo/.bar.
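The difference between the two tests can be tried out with the sketch below. It assumes the pattern is written as '*foo*bar' and uses the file names from the text in a scratch directory.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir -p foo/foo bluh
touch foobar foo/.bar foo/foo/bar bluh/foo.bar

# -path matches against the whole path-name; * also crosses slashes
# and matches a leading dot.
by_path=$(find . -type f -path '*foo*bar' | LC_ALL=C sort)

# -name matches against the basename only.
by_name=$(find . -type f -name '*foo*bar' | LC_ALL=C sort)

printf 'by path:\n%s\nby name:\n%s\n' "$by_path" "$by_name"
```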

Default evaluation

When the search-expression contains no actions (-exec, -execdir, -ok or -print) then -print is appended:

find ( <search-expression> ) -and -print

Therefore:

$ touch foo bar # create two files
$ find -name foo -o -name bar
./bar
./foo

The above example finds both files, and is not equivalent to

$ find -name foo -o -name bar -print
./bar
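The reason is that the implicit -a binds more tightly than -o, so the explicit -print belongs only to -name bar. A small sketch, run in a scratch directory, shows that grouping the two tests restores the intended result:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
touch foo bar

# No action given: find wraps the whole expression and appends -print.
implicit=$(find . -name foo -o -name bar | LC_ALL=C sort)

# Explicit -print binds only to the right-hand side of -o.
explicit=$(find . -name foo -o -name bar -print)

# Parentheses make -print apply to both alternatives again.
grouped=$(find . \( -name foo -o -name bar \) -print | LC_ALL=C sort)

printf '%s\n---\n%s\n---\n%s\n' "$implicit" "$explicit" "$grouped"
```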

Directory traversal

For brevity each directory-entry is called a file here. All search-roots are processed sequentially from left to right. find processes all files in the root directory, then in sub-directories.

  1. When the search-root does not exist print an error to stderr.

  2. When the search-root is a file find this file.

  3. When the search-root is a directory find all files in the directory:

    • regular files, symbolic links, named pipes, sockets, doors (Solaris only) and buffers

    • the search-directory name is always prepended to the filename, by default ./

    • the .. directory is never found

  4. Print exactly one filename per line.

  5. An error message is printed on stderr for each directory on which you don’t have read permission.

  6. No output is printed to stdout if no files are found.

  7. With -depth process each directory’s contents before the directory itself.

  8. Process all sub-directories recursively, or get the next search-root (file or directory).

  9. In case of an infinite loop, write a diagnostic message to stderr and recover the position in the hierarchy or terminate.

The complete directories on the disk are always scanned, and every file is fed into the search-expression. The search-expression consists of tests and actions (see the command-line). Tests return true or false, depending on the file qualities. Actions return true as a rule. However:

  • All subsequent -prune expressions after -depth return false.

  • -exec and -execdir are true if the executed command returns zero, otherwise they’re false.

Deep directories and infinite loops

find can descend into directories of arbitrary depths and will handle any valid path length.

In reality file-systems often contain loops created through the use of hard- or symbolic links. The POSIX standard therefore requires that: "The find utility shall detect infinite loops; that is, entering a previously visited directory that is an ancestor of the last file encountered. When it detects an infinite loop, find shall write a diagnostic message to standard error and shall either recover its position in the hierarchy or terminate." (POSIX find).

Logical Operator Precedence

In Boolean logic, expressions are evaluated from left to right. When more than one logical operator is used in a statement, NOT is evaluated first, then AND, and finally OR. Arithmetic (and bit-wise) operators are handled before logical operators. find provides only logical operators.

A simple shell function example:

function have_file() {
        [[ -f "$1" ]] && return 0 || return 1
}

In the following example the file program is run on all files of the working directory, but not files under ./tmp. It prunes tmp:

find -type d -path './tmp' -prune -o -exec file {} \;

The use of parentheses, even when not required, can improve the readability of expressions and reduce the chance of making a subtle mistake because of operator precedence. There is no performance penalty in using parentheses. This example is more readable than the previous one, although they are logically the same:

find \( -type d -and -path './tmp' -and -prune \) -or \( -exec file {} \; \)

Next we additionally prune all .svn sub-directories:

find -type d -path './tmp' -prune -o -type d -name '.svn' -prune -o -exec file {} \;

To change the meaning of expressions, add parentheses to force evaluation of OR parts. This example uses parentheses, but is otherwise similar to the previous one:

find \( -type d -path './tmp' -prune \) -o \( -type d -name '.svn' -prune \) -o -exec file {} \;

Parentheses make the expression more readable. The example can evidently be shortened:

find -type d \( -path './tmp' -o -name '.svn' \) -prune -o -exec file {} \;

As evidenced by:

$ find \( -type d -and \( -path './tmp' -or -name '.svn' \) -and -prune \) -or -exec file {} \;
.: directory
./FindPruneMiniHowto.text: HTML document text
./NUL: empty
./test.sh: Bourne-Again shell script text executable

Note that -and, -or and -not are GNU extensions. POSIX-compliant are -a, -o and !.
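The pruning expression can be tried out safely with -print in place of -exec. The sketch below uses the POSIX operator spellings and an invented directory layout:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir -p tmp src/.svn
touch tmp/scratch src/.svn/entries src/main.c

# Prune ./tmp and every .svn directory; print everything else.
kept=$(find . -type d \( -path './tmp' -o -name '.svn' \) -prune -o -print \
        | LC_ALL=C sort)
printf '%s\n' "$kept"
```

Neither the pruned directories nor anything below them shows up in the output.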

Pruned Directories

find traverses all directory entries (see the find-algorithm). The -prune action causes find not to descend into the current path-name if it is a directory. In find-slang “the directory is pruned”. Pruned directories are simply considered empty by find. For example, all of the following find invocations print all files and directories that do not begin with a dot:

$ find . -path '*/.*'                   -o -print # (1)
$ find . -path '*/.*'    -prune -o -print # (2)
$ find . -path '*/.*' -a -prune -o -print # (3)

Variants 2 and 3 additionally do not descend into directories starting with a dot. Both variants are technically identical, because find combines expressions with -a (and) if no explicit -o (or) is used.

Since find simply applies the search-expression to each entry of the directory tree, the -prune action is only executed when the elements to its left are true; evaluation stops at the first element that is false. Therefore -path is often used to filter which entries are actually pruned.

Example 1: A typical application of -prune

The simplest example is

$ find

which prints every directory entry of the directory and all sub-directories. Normally only specific files or directories are of interest. The following (dangerous!) example removes all files (not directories) of a user:

$ find / -user <UID> -type f -exec rm -vf {} ';'

To prevent find from searching certain directories use -prune. Here is an example found on the find man-page: to skip the directory src/emacs and all files under it, and print the names of the other files found, do something like this:

find . -path ./src/emacs -prune -o -print

Here is a typical example you’ll often find on the Internet:

find <input-dir> -path <spec> -prune -o <commands>
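
  1. -prune is just another action of find, and is tied to -path by a logical AND.

  2. commands will not be undertaken when the found directory matches spec. Reason: -path and -prune are both true, so the left-hand side of -o is true and find does not evaluate the right-hand side.

  3. When the found directory does not match spec then commands are undertaken. Reason: -path is false, so -prune is skipped, the left-hand side of -o is false and find evaluates the right-hand side.

Since -prune is always true, it does not block expressions that are ANDed to its right. Additionally, -prune has the side effect that find considers the directory it currently iterates empty. This happens to be the directory that matches spec.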

Note that -prune is not always true. It is false in conjunction with -depth and -delete (GNU find). The -depth option lets find ascend from the bottom directories, so -prune would be pointless.

-depth is stronger than -prune.

Example 2: Find files only in the current directory

find by default descends into all sub-directories. -prune avoids certain directories; -depth forces a bottom-up search. But how to process only files in the current directory? A portable solution is

find . \( -type d ! -name . -prune \) -o <commands>

This prunes all directories except the search-root. The grouped expression is false for all non-directories and for the search-root itself (the directory named .). All other directories are pruned, in which case the grouped expression is true. A more compact form of the above is

find . ! -name . -prune -print

which is executed on all items except the search-root. -print stands in for any further search-expression here. The example prunes every file, but find will ignore the faux-pas. The only exception is the search-root, which is not pruned. So the contents of . are printed, but not . itself. To print . plus all files in it use the form
[Pointed out by Stéphane Chazelas on 3/10/2009 in comp.unix.shell]

find . \( -name . -o -prune \) -print

Here the grouped expression is always true and -print is called on all items. But all directories except . are pruned; only . and items in . are found. Again, find gracefully ignores the attempt to prune non-directories.

Exactly the opposite behavior is triggered by omitting the -name part. For example

find . -prune -print

finds only the search-root, as evidenced by

$ find /etc /bin ~ -prune -print
/etc
/bin
/home/andreas

This can be useful to produce some output in the proper find format.

GNU find introduced new expressions to control descending:

  • -maxdepth n descends at most n levels of directories below the search-root.

  • -mindepth n does not apply the search-expression at levels less than n levels below the search-root.

Use -maxdepth 1 to let find find each search-root plus files in it, but don’t descend deeper. To not include the search-root additionally specify -mindepth 1.
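The portable form and the GNU form can be compared side by side. This sketch assumes GNU find for the second command and uses a made-up directory layout:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir -p sub/deeper
touch top.txt sub/inner.txt

# Portable: prune everything except the search-root itself.
portable=$(find . ! -name . -prune -print | LC_ALL=C sort)

# GNU extension: restrict the depth instead of pruning.
gnu_form=$(find . -mindepth 1 -maxdepth 1 -print | LC_ALL=C sort)

printf '%s\n---\n%s\n' "$portable" "$gnu_form"
```

Both commands list the entries of the current directory only, without . itself and without descending into sub.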

Example 3: Find the first file in a directory-tree (stop on first match)

Recurse the search-root and stop after printing the first file found. Useful to quickly test whether a directory somewhere contains a certain file or file type. If so don’t waste time to find additional files.

POSIX find has no option to do this (GNU find offers the -quit action).

The solution is find|head -n 1 to get the first line of find-output (original solution was posted here). As explained in How UNIX processes pipes below, each program in the pipe is forked by the same sub-shell. head exits when the second line occurs. Then find receives a SIGPIPE signal to indicate that the pipe to which it writes is broken. The reaction of find is to terminate itself.

The solution is unsatisfying because find receives the signal only when it prints the second file found. So find will search for a second file, and if there is no second file it will search the whole directory. A better solution is find|sed 1q. sed reads the first line (the first pathname) and exits in response to q. find then receives SIGPIPE and exits too.

The following example makes use of find|sed 1q by searching a directory of RAW files (digital negatives of photos).

for raw in dng orf crw cr2 nef; do
        first=$(find -type f -iname "*.$raw" | sed 1q)
        if [ -n "$first" ]; then
                echo found $raw file
                # exiftool -ext $raw ...
        fi
done
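Where GNU find is available, the -quit action is an alternative that does not depend on SIGPIPE at all: find stops by itself right after the first match. The sketch below assumes GNU find and uses invented file names.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir -p 2009/a 2009/b
touch 2009/a/p1.dng 2009/a/p2.DNG 2009/b/p3.dng

# -print reports the match, -quit ends the traversal immediately.
first=$(find . -type f -iname '*.dng' -print -quit)

if [ -n "$first" ]; then
        echo "found dng file: $first"
fi
```

Which of the three files is reported depends on the directory order, so scripts should only rely on the fact that at most one name is printed.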

Example 4: Find the nth to the mth file

Again, find has no option to do this. We have to force a SIGPIPE signal to terminate find after enough files have been found. See example 3.

The general task here is to print the nth to the mth line of standard output.

Print the 2nd up to the 10th file, then quit so that find is terminated:

$ find | sed -n '2,10p;10q'

Additionally print the 17th to the 20th line:

$ find | sed -n '2,10p;17,20p;20q'
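The line selection itself can be checked without find. In the sketch below, printf stands in for find and produces eight made-up names; the trailing q command makes sed exit after the last wanted line, which is what closes the pipe.

```shell
# Select lines 2-3 and 5-6, then quit at line 6 instead of reading on.
selected=$(printf '%s\n' f1 f2 f3 f4 f5 f6 f7 f8 | sed -n '2,3p;5,6p;6q')
printf '%s\n' "$selected"
```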

Example 5: Process only specific directories

The following command prunes all directories named name. Use this form to execute a command only on directories named name, but on none of their entries.

find <input-dir> -type d -name <name> -prune <commands>

find will combine all expressions with -and. Because of the left-to-right evaluation order commands occur last. When the current path names a directory, and the name matches name, then find also evaluates commands. -prune is true, but has the side effect of not descending into the directory on which commands are undertaken.
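A small sketch of this form, with -print standing in for commands and build as a hypothetical directory name:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir -p a/build/build b/build a/src
touch a/build/x.o

# Act on each directory named build, but never look inside it.
hits=$(find . -type d -name build -prune -print | LC_ALL=C sort)
printf '%s\n' "$hits"
```

The nested a/build/build is not reported, because find never descends into the pruned a/build.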

Example 6: Remove all sub-directories

Removing sub-directories is a special case where -prune is necessary. Consider:

$ find -type d ! -path '.' ! -path '*/.*' -print -exec rm -rf {} \;

The command removes all sub-directories of the current working directory unless their name begins with a dot. But during this operation most versions of find will complain:

find: '<directory-name>': Not a directory

The message occurs after find executed rm on directory-name. The message reveals that find has seen the directories before it evaluated -exec. I leave it open whether the above message makes any sense. The only way to prevent it is -prune:

$ find -type d ! -path '.' ! -path '*/.*' -prune -print -exec rm -rf {} \;

Alias:

$ find -type d \
          -and -not -path '.' \
          -and -not -path '*/.*' \
          -and -prune \
          -and -print \
          -and -exec rm -rf {} \;

Example 7: Carefully cleanup sub-directories

A more complex example, which illustrates that find can make extra shell scripts unnecessary, and is also useful in makefile rules.

Let’s say in a software project we have a sub-folder libs holding internal and third-party libraries. The third-party libraries are located under libs/third-party. Now we want to clean libs from compiler-intermediate files, unnecessary files, empty directories etc.

The constraints:

  1. Exclude the top-level directory libs itself from being cleaned.

  2. Exclude libs/third-party since it may hold prebuilt files, closed-source DLLs or even nested Subversion working copies.

  3. All directories beginning with a dot are taboo too, wherever they occur. We will protect .svn directories maintained by Subversion by all means.

Although not too simple, the expression implementing this behavior is still compact:

$ find libs \( -type d ! -path 'libs' ! -path 'libs/third-party' ! -path '*/.*' -prune \) \
           -exec cleanup -af {} \;

Alias:

$ find libs \( -type d \
         -and -not -path 'libs' \
         -and -not -path 'libs/third-party' \
         -and -not -path '*/.*' \
         -and      -prune \) \
         -and -exec cleanup -af {} \;

Basically this prunes all directories under libs and executes cleanup on them ([TheCleanupTool]). Directories not pruned, and hence not cleaned, are named by the -path expressions. We prune directories because the cleanup script descends into the directory that was passed to it.

Note that find adds neither a superfluous ./ at the beginning nor a / at the end of directory path-names. In the above example, where libs is passed as the search-root, it finds libs/third-party and not ./libs/third-party or libs/third-party/.

find will always put the search-root in front of found path-names.

Example 8: Prune more than one directory

$ find -type d \( -path '*/lib' -o -path '*/etc' -o -path '*/.svn' \) -prune -o \
           -print

Depth of field

Normally find traverses the directory tree top-down: it processes each directory before the directory’s contents. With -depth it processes each directory’s contents before the directory itself (bottom-up).

If the -depth option is in effect, the subdirectories will have already been visited. Hence -prune has no effect and returns false.

Example 9: Remove all empty directories

The basic command to delete any empty directory under from-dir is ([DelEmptyDir]):

$ find <from-dir> -depth -type d -empty -print                          # list empty dirs
$ find <from-dir> -depth -type d -empty -print -exec rmdir {} \;

When deleting empty directories we must ascend from the bottom directories, or empty directories are not considered empty because they contain empty directories. Hence -depth is mandatory.

There are pitfalls! With GNU find the + form of -exec defeats the purpose of -depth. The reason is that empty directories are tested while find traverses the tree, but the collected rmdir command is executed only later. Consequently, a directory containing nothing but empty sub-directories is not considered empty:

$ rm -rf test
$ mkdir -p test/a/b test/c/d
$ find test -depth -type d -empty -print
test/a/b
test/c/d
$ find test -depth -type d -empty -exec rmdir -v {} \+
removing directory 'test/a/b'
removing directory 'test/c/d'
$ find test -depth -type d -empty -exec rmdir -v {} \+
removing directory 'test/a'
removing directory 'test/c'
$ find test -depth -type d -empty -exec rmdir -v {} \+
removing directory 'test'

find removed the empty directories test/a/b and test/c/d, but we have to run find three times to also delete test/a, test/c and test. With ; the command works as expected:

$ rm -rf test
$ mkdir -p test/a/b test/c/d
$ find test -depth -type d -empty -exec rmdir -v {} \;
removing directory 'test/a/b'
removing directory 'test/a'
removing directory 'test/c/d'
removing directory 'test/c'
removing directory 'test'
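The two transcripts can be condensed into a repeatable sketch. It runs in a scratch directory and only records what is left behind after each variant; the variable names are illustrative.

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1

# Variant 1: batched rmdir (+). Parents are tested before any child is removed.
mkdir -p test/a/b test/c/d
find test -depth -type d -empty -exec rmdir {} +
if [ -d test/a ]; then plus_left_parents=yes; else plus_left_parents=no; fi

# Variant 2: one rmdir per directory (;). Parents are empty when tested.
rm -rf test
mkdir -p test/a/b test/c/d
find test -depth -type d -empty -exec rmdir {} \;
if [ -e test ]; then semicolon_left_tree=yes; else semicolon_left_tree=no; fi

echo "+ left parents: $plus_left_parents, ; left tree: $semicolon_left_tree"
```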

Example 10: Remove empty directories that do not begin with a dot

To omit directory names that begin with a dot simply negate the -path.

$ find . -depth -type d ! -path '*/.*' -empty -exec rmdir -v {} \;

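This operation is implemented in my shell script to remove unneeded files ([TheCleanupTool]). Note that ; must be escaped to protect it from interpretation by the shell. In shell scripts, however, where the expression is held in a variable, no escaping is needed:

searchroot="$1"
# The variable already contains -exec; the unescaped ; is safe here because
# the shell does not re-parse the result of a variable expansion.
searchexpr="-exec rmdir -v {} ;"
# $searchexpr is deliberately unquoted so that it splits into separate words.
find "$searchroot" -depth -type d ! -path '*/.*' -empty $searchexpr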

Example 11: Unwanted side-effects of -prune

This command from the GNU find manual is likely to do something unintended:

find <dirname> -path "<dirname>/foo" -prune -o -delete

Because -delete turns on -depth, the -prune action has no effect and files in dirname/foo will be deleted too.
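A harmless way to see the effect is a dry run that replaces -delete by -depth and -print, since -delete implies -depth. The sketch below uses a made-up directory d and deletes nothing:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
mkdir -p d/foo d/bar
touch d/foo/keep d/bar/junk

# With -depth the contents of d/foo are visited before d/foo itself,
# so pruning d/foo comes too late to protect them.
would_delete=$(find d -depth -path 'd/foo' -prune -o -print | LC_ALL=C sort)
printf '%s\n' "$would_delete"
```

The file d/foo/keep appears in the list, which means the original command would have removed it.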

find by content

Example 12: Running the type command (or other Shell built-ins) over find

From the perspective of a shell a name either names an alias, keyword, function, builtin, or file. The type command prints how the shell would interpret a name if used as command-name.

$ find -type f -exec bash -c "type {}" \;

Note that the semicolon must be placed outside of the quotes. Because type does not work for directory names, the -type f expression is necessary.
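Embedding {} inside the command string means that the file name is parsed as shell code, which breaks on names containing spaces or quotes. A commonly used alternative, sketched here with an invented file name, passes the name as a positional parameter instead:

```shell
demo=$(mktemp -d) && cd "$demo" || exit 1
printf '#!/bin/sh\necho hi\n' > 'my script.sh'
chmod +x 'my script.sh'

# The word after the command string becomes $0, the file name becomes $1.
report=$(find . -type f -exec bash -c 'type "$1"' bash {} \; 2>&1)
printf '%s\n' "$report"
```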

Example 13: Find files by content type (find and file)

Goal: to execute dos2unix on all text files (notes, source codes, shell scripts, HTML, makefiles etc.).

The UNIX file utility determines the file type by peeking into the file data. The type printed will contain the word "text" when the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal.

$ find -type f -exec file {} \; 2>/dev/null | grep -sw 'text' | cut -f1 -d: | xargs dos2unix

grep -w forces the pattern "text" to match as a whole word. So far the output is like:

./layout.css: ASCII C program text, with CRLF line terminators
./links.htt: HTML document text
./Makefile: ISO-8859 make commands text
./MANIFEST: ASCII text
./MOTIVATION: ISO-8859 English text
./wintee.pod: a /usr/bin/perl script text executable
./dist/src/cleanup: Bourne-Again shell script text executable

Next cut -f1 -d: splits each line at ":" and prints the first field (the filename).

./layout.css
./links.htt
./Makefile
        .
        .

find in Shell scripts

For repetitive tasks such as finding files and directories by name, find’s command-line is too complex.

Example 14: Find files and directories by glob-pattern

function ff
function ff {
        if [ "${1:--h}" = "-h" ]; then cat <<EOF; return; fi
Usage:
         $FUNCNAME <path-pattern>

Find files in current directory and sub-directories, according to
case-insensitive glob \$1. Prefixes \$1 with '*' to parse the directory part.

Example:
         "ff '*.txt'" alias "ff .txt" alias "ff txt"

EOF
        find . -type f -iname "*${1}" -ls
}
function fd
function fd {
        if [ "${1:--h}" = "-h" ]; then cat <<EOF; return; fi
Usage:
         $FUNCNAME <path-pattern>

Like ff but find directories.

EOF
        find . -type d -iname "*${1}" -ls
}

Example 15: Execute commands on files

function fx
function fx {
        if [ "${1:--h}" = "-h" ]; then cat <<EOF; return; fi
Usage:
        $FUNCNAME <path-pattern> [<exec-args>]...

Like ff but execute \$2 to \$9 on files. By default execute the "file" program.

Examples:
        $ $FUNCNAME .d
        $ $FUNCNAME .o rm -v
EOF
        find . -type f -iname "*${1}" -exec ${2:-file} $3 $4 $5 $6 $7 $8 $9 {} ';'
}

Example 16: Find files modified lately

function ft24: Find files modified in the last 24 hours
function ft24 {
        if [ "$1" = "-h" ]; then cat <<EOF; return; fi
Usage:
        $FUNCNAME [<directory-name>]

Find files modified in the last 24 hours.

EOF
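        local dir=.
        if [ -d "$1" ]; then dir=$1; shift; fi  # optional leading directory
        find "$dir" -mtime -1 "$@" -ls          # -1: less than 24 hours ago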
}
function ft5: Find files modified in the last 5 days
function ft5 {
        if [ "$1" = "-h" ]; then cat <<EOF; return; fi
Usage:
        $FUNCNAME [<directory-name>]

Find files modified in the last 5 days.

Examples:
        $ $FUNCNAME                               # files
        $ $FUNCNAME -size +10M  # large files
        $ $FUNCNAME -type d               # directories
EOF
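        local dir=.
        if [ -d "$1" ]; then dir=$1; shift; fi  # optional leading directory
        find "$dir" -mtime -5 "$@" -ls          # -5: less than 5 days ago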
}
function f10m: List files greater than 10 MB (f10m)
function f10m {
        if [ "$1" = "-h" ]; then cat <<EOF; return; fi
Usage:
        $FUNCNAME [<directory-name>]

List files greater than 10 MB.

EOF
        find "${1:-.}" -size +10M -ls
}

Example 17: Find files by content

find is not able to peek into files. To do this the following function combines find and grep:

function fstr
function fstr {
        local case= word=0 usage="Usage:
        $FUNCNAME [-i] [-w] [-h] <grep-pattern> [<path-pattern>]
"
        OPTIND=1
        while getopts wihH opt; do
                case "$opt" in
                        i) case='-i';;
                        w) word=1;;
                        [hH]) cat <<EOF; return;;
$usage
Find a pattern in a set of files under the current working directory, and
highlight them (requires a recent version of grep and GNU find/xargs). When
<path-pattern> is omitted grep all files. "-i" enables case-insensitive grep,
"-w" finds at word boundaries only.

Example:
        $ $FUNCNAME -w foo '*.cpp'

EOF
                esac
        done
        shift $(( $OPTIND - 1 ))
        if [ "$#" -lt 1 ]; then
                echo "$usage"; return
        fi
        if (($word==1)); then pat="\\b${1}\\b"; else pat="$1"; fi
        find . -type f -name "${2:-*}" -print0 | \
                xargs -0 grep --color=auto -sn ${case} "$pat" 2>&- | less
}
function fstrc: Find regexp-pattern in C/C++ source files
function fstrc {
        if [ "${1:--h}" = "-h" ]; then cat <<EOF; return; fi
Usage:
        $FUNCNAME <grep-pattern>

Find string case-insensitively in C/C++ source files.

EOF
        fstr -i "${1}" '*.c[px][px]'
        fstr -i "${1}" '*.h'
}
function fid: Find regexp-pattern in C/C++ source files at word boundary
function fid {
        if [ "${1:--h}" = "-h" ]; then cat <<EOF; return; fi
Usage:
        $FUNCNAME <grep-pattern>

Like fstrc but find on word boundaries (find identifier).

EOF
        fstr -i -w "${1}" '*.c[px][px]'
        fstr -i -w "${1}" '*.h'
}

Example 18: Running find under Shell quoting rules (from scripts)

Calling find from Shell scripts requires special care, because find accepts file-glob-patterns and grep-patterns as arguments. Most UNIX programs leave filename generation to the shell, but since find expands wildcards internally, arguments for find have to be quoted.

Suppose that the current directory contains the file a.log. Then the find command produces:

$ find -name '*.log'
./a.log

But the following shell function will not find ./a.log:

runfind() {
        # First positional argument is the find search expression.
        find . $1
}

Within a real world script the runfind function possibly has the general form:

runfind() {
        # First positional argument is the find search expression, second
        # the execute expression in xargs-style.
        find . \( $1 \) -and \( -exec ${2:-echo} {} + \)
}

With both variants all of the following attempts to call runfind will fail:

runfind "-name '*.log'" # evaluates to: find -name "'*.log'"
runfind '-name "*.log"' # evaluates to: find -name '"*.log"'
runfind '-name  *.log'  # evaluates to: find -name a.log

The first two will find nothing; the third finds only a.log in the current working directory, because the shell has already expanded the pattern before find gets to see it. When searching big directory trees it is not obvious that find has seen just some of the files! And if the working directory contains two files, a.log and b.log, the script breaks with a syntax error. The third runfind call will then be equivalent to:

$ find -name a.log b.log
find: paths must precede expression: b.log

Most confusingly, when the working directory contains no .log-files the third runfind call will be equivalent to what was actually intended:

$ find . -type f -name '*.log'

It will then, and only then, work as expected. When file-name generation finds no match, the shell keeps the pattern literally instead of replacing it with the empty string.
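
This fallback can be observed directly. The following is a minimal sketch, assuming a scratch directory created with mktemp and default shell options; the helper name show_args is hypothetical and only makes the argument boundaries visible.

```shell
# show_args (hypothetical helper) prints every argument it receives in brackets.
show_args() {
        for arg in "$@"; do printf '[%s]' "$arg"; done
        printf '\n'
}

demo=$(mktemp -d)               # empty scratch directory
(
        cd "$demo" || exit 1
        show_args -name *.log   # no match: the pattern is passed on literally
        touch a.log b.log
        show_args -name *.log   # two matches: the pattern is replaced by both names
)
rm -rf "$demo"
```

The first call shows the two words find would need, the second shows the three words that trigger the error message above.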

What about:

runfind() {
        find . -type f "-name $1"
}

runfind "*.log"

Now the shell passes the whole string as one word, which find takes for an option name:

find: unknown predicate `-name *.log'

Slowly but surely we are running out of options. The reason lies in the quoting rules for shell scripts ([ShellQuotingRules]). runfind did not work because

  • the shell removes only quotes it has seen during tokenizing;

  • in particular the shell won’t remove quotes that occurred during parameter/command expansion;

  • the wildcards that occurred during parameter/command expansion are expanded.

Quoting occurs in most languages, including the natural ones we speak. Quoting introduces different levels of interpretation. In the shell language quoting is implemented by the characters " ' \ (double quote, single quote and backslash). To solve the original problem it is necessary to quote all arguments and to force one more round through the shell quoting steps by using the shell built-in eval.

runfind() {
        eval find . "$1"
}

runfind "-name '*.log'"

Finally this works. When the argument to runfind is itself a variable use:

x="-name '*.log'"
runfind "$x"

With Bourne shell it is advisable to additionally disable file-name generation:

runfind() {
        set -f
        eval find . "$1"
        set +f
}

runfind "-name *.log"
runfind "-name '*.log'"

Both calls to runfind will yield the same set of .log-files. When file-name generation is disabled and the pattern is passed without embedded quotes (as in the first call), the call to eval becomes unnecessary:

runfind() {
        set -f
        find . $1
        set +f
}
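
Where the caller is able to pass the search expression as separate words, neither eval nor set -f is required. The following is a sketch under that assumption; the function name runfind_args is hypothetical and not part of the original script. The quoted "$@" forwards every argument to find unchanged, so the glob is quoted exactly once, on the calling line.

```shell
# runfind_args (hypothetical name): the search expression arrives as
# separate words and is forwarded verbatim via "$@".
runfind_args() {
        find . "$@"
}

demo=$(mktemp -d)
(
        cd "$demo" || exit 1
        touch a.log b.log c.txt
        # The caller quotes the glob as on the command line.
        runfind_args -type f -name '*.log'
)
rm -rf "$demo"
```

The price of this variant is that the expression can no longer be kept in a single string variable such as x above; it has to travel as a list of words.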

Example 19: Copy find output into a bash array

The goal is to read all direct sub-directories of $PWD into Bash-array a. Let’s jump into the cold water again.

Why isn’t this working?

a=() i=0
find -type d -maxdepth 1 -mindepth 1 -print0 | \
        while read -d $'\0' file
        do
                a[$i]="$file"
                let ++i
        done

The faux pas here is that the while-loop is executed in a sub-shell. The elements of a are lost when the pipeline finishes, and afterwards ${#a[@]} and i will both be 0. A less important problem is that read interprets backslash escapes, but under UNIX backslashes are legal path-name characters. Finally read trims leading and trailing newlines and white-space, but these are legal path-name characters too.

But this code goes in the right direction. Most importantly, find and read use the null character \0 as the path-name delimiter, so path-names may contain newlines and white-space. These abilities are GNU/BSD extensions. $'…' is Bash’s syntax for ANSI-C quoting; since a Bash string cannot contain a null character, $'\0' is just a showy empty string.

Rather than sending find’s output down a pipe we have to put the cart before the horse:

a=() i=0
while IFS= read -r -d '' a[$i]; do
        let ++i
done < <(find -type d -maxdepth 1 -mindepth 1 -print0)

IFS= (or IFS="") empties IFS for the duration of the read command, and -r disables backslash escaping. The default value for IFS is $' \t\n'. Note that IFS is not emptied here to disable word- or line-splitting, since read assigns to a single variable anyway. Rather, IFS= prevents read from trimming leading and trailing newlines and white-space in path-names. Incidentally, will IFS= clobber the value of IFS for the entire script? No.

The environment for any simple command or function may be augmented temporarily by prefixing it with parameter assignments, as described above in PARAMETERS. These assignment statements affect only the environment seen by that command.

ENVIRONMENT
— Bash man-page

Another Bash extension used here is the process-substitution operator <(…). It forks a new concurrent process and connects it to a FIFO. <(…) allows you to spawn pipes in each direction (like "multidimensional pipes"). Note, however, that <(…) is just syntax. It is easy to fake the operator in the traditional Bourne shell by calling mkfifo explicitly.
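
What such an emulation could look like is sketched below. It is an illustration only: the function name count_subdirs and the cs_ variable prefix are hypothetical, the list is newline-delimited (so path-names containing newlines would be miscounted), and -mindepth/-maxdepth are GNU/BSD extensions.

```shell
# count_subdirs (hypothetical name): count the direct sub-directories of a
# directory. The loop reads from a FIFO via redirection, not from a pipe,
# so it runs in the current shell and its counter survives.
count_subdirs() {
        cs_tmp=$(mktemp -d) || return 1
        mkfifo "$cs_tmp/fifo" || return 1
        # Writer: runs in the background, blocks until the reader opens the FIFO.
        find "${1:-.}" -mindepth 1 -maxdepth 1 -type d >"$cs_tmp/fifo" &
        cs_pid=$!
        cs_n=0
        # Reader: plays the role of "done < <(find ...)".
        while IFS= read -r cs_dir; do
                cs_n=$((cs_n + 1))
        done <"$cs_tmp/fifo"
        wait "$cs_pid"
        rm -rf "$cs_tmp"
        echo "$cs_n"
}
```

Calling count_subdirs /usr would print the number of directories directly below /usr.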

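Another form, suitable when the found files are processed inside the loop itself. In Bash's default configuration the loop again runs in a sub-shell, so a is only available inside the loop: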

a=() i=0
find -type f \( -iname '*.jpeg' -o -iname '*.jpg' \) -print0 | \
        while IFS= read -r -d '' a[$i]; do
                let ++i
        done

For more information see [CaptureFind] and [Wheeler].

Example 20: find now, process later

The previous example used process-substitution, which is related to command-substitution. The traditional Bourne shell form of command-substitution uses backquotes (`…`). The newer form, which permits nesting, is $(…). Both forms fork sub-processes, as each external command does, whereas a Shell builtin like for or echo runs within the shell process itself. In some cases therefore Shell wildcard expansion might be faster than using find.

To delay the processing of found files use a temporary file. The following line slurps the list of files printed by find into a shell variable:

find -type f >found
# ...
found=`<found`

The downside is that found may contain any number of blanks and control characters from filenames. For filenames UNIX prohibits only the null character, since it’s the C string terminator.

Maybe you’re concerned about find’s linewise output. This is not a problem. The variable keeps the newlines, but whenever $found is expanded without quotes, word splitting treats newlines just like blanks.

Note that found=`<found` is equivalent to found=`cat found`, but is faster since no external cat process has to be executed (see [ShellCmdExecution]).
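
One way around the problem of blanks and control characters is to store the list null-delimited and to hand it to xargs -0 later. The following sketch assumes GNU or BSD find and xargs (-print0 and -0 are extensions); the function names save_found and process_found are hypothetical.

```shell
# Step 1 (now): store the list, one null-terminated path-name per entry.
# The list file should live outside the searched tree.
save_found() {
        find "${1:-.}" -type f -print0 >"$2"
}

# Step 2 (later): run a command on the stored names. xargs -0 splits on
# the null character only, so blanks and newlines in names are preserved.
process_found() {
        pf_list=$1; shift
        xargs -0 "$@" <"$pf_list"
}

# Usage sketch (paths are placeholders):
#   save_found ./src /tmp/found0
#   process_found /tmp/found0 ls -ld
```

Unlike the shell variable, the list file cannot be expanded as a word list by the shell; it is only useful to tools that understand the null delimiter.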

find in Perl

Example 21: Use Shell-pipelines and not File::Find

CPAN provides the File::Find and File::Finder (sic) packages. File::Find is just a layer that introduces a new syntax. In particular the wanted function of File::Find is not practical. File::Finder (by Randal R. Schwartz) is the better choice, since it emulates find syntax in Perl syntax.

However, there is evidence (posted here) that running find in a Shell pipeline is faster, making it unnecessary to learn yet another software-layer’s syntax.

#!/usr/bin/perl
#
# Example wrapper around the find utility.
#

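use strict;
use warnings;

# The list form of open starts find directly, without a shell in between.
my $root = @ARGV ? $ARGV[0] : '.';
open(my $find, '-|', 'find', $root, '-type', 'f')
        or die "Cannot run find: $!";
print while <$find>;
close $find;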

Appendix

How UNIX executes commands

The way in which commands are executed by the shell can be summarized as follows [Ritchie84]:

  1. The shell reads a command line from the terminal.

  2. It creates a child process by fork.

  3. The child process uses exec to call in the command from a file.

  4. Meanwhile, the parent shell uses wait to wait for the child (command) process to terminate by calling exit.

  5. The parent shell goes back to step 1.
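
Steps 2 to 4 can be imitated from within the shell language itself. This is only an analogy, since a real shell issues the system calls directly; the function name run_like_shell is hypothetical.

```shell
# run_like_shell (hypothetical name) mimics steps 2-4 for one command.
run_like_shell() {
        ( exec "$@" ) &         # fork a child; the child replaces itself via exec
        wait "$!"               # the parent waits for the child to terminate
        echo "child exited with status $?"
}

run_like_shell true
run_like_shell sh -c 'exit 3'
```

The two calls report the statuses 0 and 3, which the parent obtains through wait.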

How UNIX processes pipes

Every program you run from the shell opens three files: standard input (0), standard output (1), and standard error (2). The files provide the primary means of communications between the programs, and exist for as long as the process runs.

  • The standard input provides a way to send data to a process.

  • The standard output provides a means for the program to output data.

  • The standard error is where the program reports any errors encountered during execution.

By default, standard input is connected to the terminal keyboard, and standard output and standard error to the terminal display screen.

UNIX allows you to connect processes, by letting the standard output of one process feed into the standard input of another process. That mechanism is called a pipe. Example:

find -type f | grep MANIFEST

All processes in the pipe are started and their standard inputs and outputs are chained together. In the above example the shell forks find and grep within a sub-shell.

When a program exits, or closes its standard input for any other reason, the processes that still write to its standard input are sent a SIGPIPE signal. Processes that read from its standard output simply see an end-of-file.

Note that some programs may complain when receiving SIGPIPE. The ls and find programs will not complain.
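
A typical broken pipe arises when the reader stops early. The sketch below uses the hypothetical function name first_file; whether find actually receives SIGPIPE depends on timing, because on a small tree find may have finished writing before head exits.

```shell
# first_file (hypothetical name) prints only the first file find reports.
# head exits after one line; if find is still writing at that point, its
# next write hits a pipe without a reader and find is terminated.
first_file() {
        find "${1:-.}" -type f | head -n 1
}
```

Called as first_file /usr, the pipeline returns as soon as head has its line instead of traversing the whole tree.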

How UNIX processes input (ttys)

In 1963 7-bit ASCII became the successor to the 5-bit ITA2 code (International Telegraph Alphabet No. 2) used by teletype-printers since the 1930s. In the late 60s ASCII teletype-printers were still the most common terminal type. In the new video terminals (nicknamed "glass teletypes") a scrolling window replaced the paper roll. They provided random access to character cells, and characters no longer had to be over-typed by a # or @ to erase them.

To this day the tty-functionality of the UNIX kernel canonicalizes input as lines of ASCII-text. The kernel echoes characters to the terminal as they’re typed, but it sends only complete lines from input buffers to programs. The line terminator is a linefeed (ASCII 10). Full-screen editors like vi and tools like tput actually have to tweak the tty: mainly disable cooked mode and stop the tty from echoing characters. Even the X Window terminal xterm has to provide a pseudo-terminal (pty) to simulate a tty, which in turn mainly pretends to be a massive ASCII-printer.

Shell quoting rules

Shell quoting rules have changed only marginally since the Bourne shell was published with Version 7 UNIX (1979). An exceptional guide is provided by [Waldmann]. To summarize, before the shell executes a command it performs the following steps:

  1. Tokenizing and syntax analysis, comment removal.

  2. Parameter expansion ($VAR), command-substitution (`cmd`) and process-substitution (>(cmd) and <(cmd)).

  3. White-space removal and word splitting (according to the IFS variable).

  4. Pathname expansion (Globbing).

  5. Removal of all quoting characters " ' \ detected in step 1.

These steps are repeated for every command line, in scripts and in functions alike.

Process-substitution is only available on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files.

Some operators are special. The right-hand side of a variable assignment, for example, is not subject to word splitting. Consider

> a='x y z'
> b=$a
> echo $b
x y z
> b=x y z
bash: y: command not found
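
The difference between assignment and ordinary expansion can be made visible by counting words. This is a small sketch, assuming the default value of IFS; the helper name count_args is hypothetical.

```shell
# count_args (hypothetical helper) prints the number of arguments it received.
count_args() { echo "$#"; }

a='x y z'
b=$a                    # assignment: no word splitting, b holds 'x y z'
count_args $b           # unquoted expansion: split into 3 words
count_args "$b"         # quoted expansion: passed on as 1 word
```

The unquoted call goes through step 3 of the list above, the quoted call skips it.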

find: {} and ;

The regular syntax

find ... -exec <command> {} ;

runs a command once for each file. The command is executed in the directory where find was started. (Use -execdir to run the command in the sub-directory containing the file.) -exec and -execdir are true if the executed command returns zero.

Why the peculiar syntax? Because quoting the whole -exec argument doesn’t scale — the command-line immediately becomes unreadable.

  1. By reusing Shell characters find will not come into conflict with its own syntax, because find has no syntax of its own.

  2. In the end only {} and ; have to be quoted. The command string passed to -exec need not be quoted.

  3. Finally all words are already separated and quoted by the shell. find only has to iterate over the argument-strings after -exec, and concatenate the command while testing for ;.

{} is replaced by the full path-name relative to (and including) the search-root. GNU find will replace {} everywhere in a word, not only in words equal to {}, as in some versions of find. Note also that some versions of find only replace the first occurrence of {}. GNU find will replace {} in all words.

; merely marks the end of the command and is not passed on to it.

Generally it is necessary to quote ; but not {}. The shell uses { and } to group commands and for function bodies, but it will not accept an empty group, which makes it hard to declare empty functions:

$ empty() {}
bash: syntax error near unexpected token `}'
$ empty() { ; }
bash: syntax error near unexpected token `;'
$ empty() {;;} # ok

The shell may even interpret {} as command-name:

$ echo|{}
bash: {}: command not found
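
The common ways of protecting ; (and, if desired, {}) from the shell all hand the same words to find. The function names in this sketch are hypothetical; each function prints one line per regular file below the given directory.

```shell
# Backslash-escaped ; and bare {} (the usual spelling).
exec_backslash() {
        find "$1" -type f -exec echo found {} \;
}

# Single-quoted; quoting {} as well is harmless.
exec_single() {
        find "$1" -type f -exec echo found '{}' ';'
}

# Double-quoted.
exec_double() {
        find "$1" -type f -exec echo found "{}" ";"
}
```

After quote removal find sees a plain {} and a plain ; in all three cases, so the functions produce identical output.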

xargs versus -exec {} +

When + instead of the ; punctuator is used to terminate -exec, the command-line is built by appending each selected file name at the end; the total number of command invocations will be much less than the number of matched files. This syntax is specified by POSIX and supported by GNU find. + builds the command-line in much the same way that xargs builds its command-lines. In fact find will use the maximum number of file-names it can pass in one command-line on this system.

Example: Compile some text files using the asciidoctool script
$ cd articles
$ find -name '*.txt' -exec asciidoctool {} +
$ find -name '*.txt' | xargs asciidoctool                 # ditto

The -exec expression is suitable for executing relatively simple commands. For complex commands it is safer to open an additional context. Do this by feeding find's output to xargs. xargs reads file-names from the standard input, delimited by blanks or newlines, and executes the command one or more times (default is /bin/echo). File-names can be protected with double or single quotes or a backslash.

If any invocation of the command exits with a status of 255, xargs will stop immediately without reading any further input.

xargs is not generally applicable. By default xargs has no placeholder syntax, so it is only useful when the command expects the file-list at the end of the command-line. Otherwise you must use xargs with its -I option (one invocation per input line) or a form of -exec.

Note that the above example is not quite exact. To achieve the same behavior as with -exec {} + we must specify the --no-run-if-empty (alias -r) option to xargs — this is a GNU extension.
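
A closer pair of equivalents might look as follows. This is a sketch with hypothetical function names; it assumes find and xargs versions that support -print0, -0 and -r, and it takes the command to run as additional arguments instead of hard-coding asciidoctool.

```shell
# exec_plus and via_xargs (hypothetical names) run the given command on all
# *.txt files below a directory, passing as many names per call as possible.
exec_plus() {
        ep_dir=$1; shift
        find "$ep_dir" -name '*.txt' -exec "$@" {} +
}

via_xargs() {
        vx_dir=$1; shift
        # -print0/-0 protect blanks in names; -r skips the run on empty input.
        find "$vx_dir" -name '*.txt' -print0 | xargs -0 -r "$@"
}

# Usage sketch:
#   exec_plus articles asciidoctool
#   via_xargs articles asciidoctool
```

Neither variant runs the command when no file matches, which is the behavior the -r option restores for xargs.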

A unique feature of GNU xargs is the execution of parallel commands. See the -P switch on the man-page, and my post on the ExifTool forum.

Is echo $(<$0) a Quine?

This question arose while writing example 17.

In computing a self-reproducing program is called a Quine. Quines are named after the philosopher and logician Willard Van Orman Quine. A Quine is the sophisticated variant of a "Hello, World!" program: an exercise divorced from reality.

A Quine will reproduce its source, perpetuating knowledge about concepts and an exact implementation. A Quine that locates its source-code on disk, and prints it, is considered a cheating Quine. So echo $(<$0) or cat $0 scripts are not Quines. Less apparently, #!/bin/cat is cheating too, without even running a Shell.

find and ls

find is a tool for tools that do not provide tree traversal, or cannot filter input files by file patterns. These are 99% of all UNIX tools. find's capabilities are superior to simple tree traversal. ls is easier to use on the command-line, but has no query language and does not provide access to all aspects of the file-system.

find outputs all hidden files (dot-files) by default, whereas ls does not print these files unless you specify -a or -A.

The output of find can be used as arguments to ls:

$ ls -ld `find -type f`

Note also the -ls option of find. The following two command-lines are roughly equivalent:

$ find -ls
$ find | xargs ls -dils

Lists of independent operations with ,

GNU find additionally defines the list operator , (comma). All expressions in the list will always be executed, and the result is that of the last expression. This allows multiple independent operations on one traversal. The operations are not truly independent because one operation could have side-effects, like deleting files, that affect other operations in the list.

Likewise, if an operation is -prune, subsequent operations are "pruned".
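
As an illustration, the following sketch writes two listings during a single traversal. It assumes GNU find, since both the comma operator and -fprint are GNU extensions; the function name split_listing and its argument order are hypothetical.

```shell
# split_listing (hypothetical name): one traversal of $1, two listings.
# *.log names are written to the file $2, *.txt names to the file $3.
split_listing() {
        find "$1" \( -name '*.log' -fprint "$2" \) , \
                  \( -name '*.txt' -fprint "$3" \)
}

# Usage sketch (file names are placeholders):
#   split_listing /var/tmp logs.lst texts.lst
```

Each parenthesized expression is evaluated for every file, regardless of whether the other one matched.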

Prevent find error messages

To prevent error messages redirect stderr to /dev/null.
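
A minimal sketch of such a redirection, wrapped in a function with the hypothetical name quiet_find:

```shell
# quiet_find (hypothetical name): behaves like find, but diagnostics such
# as "Permission denied" are discarded; matches still go to standard output.
quiet_find() {
        find "$@" 2>/dev/null
}

# Usage sketch:
#   quiet_find / -name '*.conf'
```

Note that only the messages disappear. The exit status of find still reports that something went wrong.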

Further reading

This Mini-How-to is a fall-out from the cleanup script ([TheCleanupTool]) by the same author.

Some useful find examples are provided by [Zimmerly].

Bibliography