File search without programming

References for this week

There are lots of references for regular expressions.

If you have taken CSCI 434, Automata Theory and Formal Languages, you may be an expert with regular expressions.

The find command is not as widely used. You can start with the manual page. For information overload, you can try the huge GNU Finding Files reference.

Regular expressions

The regular expression was created by mathematicians and linguists to describe words. The logician Stephen Cole Kleene is considered the creator of the regular expression.

Regular expressions are used in many Unix utilities. They are also used to generate the lexical analyzers used in modern compilers. Finally, they have found their way into many programming languages.

Unix regular expressions

This is just the beginning. See the grep man page for the complete list.

Matching specific characters or locations

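Most single characters are matched by the character itself: a matches a. Some characters, such as +, have a special meaning and must be escaped to match literally: in the extended syntax (grep -E), \+ matches a literal +. Do not escape ordinary characters, though: POSIX leaves sequences such as \a undefined, and in Perl-type regular expressions \a is the alarm (bell) character.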

Bracket expressions also match single characters. For example, [AaEeIiOoUu] matches a single vowel, [0-9] matches a single decimal digit, [0-9A-Fa-f] matches a single hexadecimal digit, and [^a] matches any character except a. See the man page for more bracket expressions.
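To try these out, grep -o prints each match on its own line. The following is a small sketch with made-up sample strings. Note that -o is a common extension (GNU and BSD grep) rather than part of POSIX, and LC_ALL=C is set on the assumption that you want ranges such as A-F to follow plain ASCII order.

```shell
# Assumption: force the C locale so that ranges like A-F mean ASCII order.
export LC_ALL=C

# grep -o prints every match on its own line; tr joins the pieces back up.
vowels=$(printf 'Education\n' | grep -o '[AaEeIiOoUu]' | tr -d '\n')
hex=$(printf 'x=3F, y=zz\n' | grep -o '[0-9A-Fa-f]' | tr -d '\n')
not_a=$(printf 'banana\n' | grep -o '[^a]' | tr -d '\n')

echo "vowels: $vowels"
echo "hex digits: $hex"
echo "not a: $not_a"
```

Each bracket expression consumes exactly one character, which is why the matches come back one character at a time.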

It is also possible to use anchors to match positions within a line: ^ matches the beginning and $ matches the end.
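Anchors are easy to check with grep -c, which counts matching lines. In this sketch (the sample lines are invented for illustration) ^ pins the match to the start of the line and $ pins it to the end.

```shell
# Three sample lines; "cat" appears at different positions in each.
sample=$(printf 'cat nap\nbobcat\ncat\n')

starts=$(printf '%s\n' "$sample" | grep -c '^cat')   # lines beginning with cat
ends=$(printf '%s\n' "$sample" | grep -c 'cat$')     # lines ending with cat
whole=$(printf '%s\n' "$sample" | grep -c '^cat$')   # lines that are exactly cat

echo "start: $starts  end: $ends  whole line: $whole"
```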

Operators

Finally, we get to the stuff that Kleene did. When two regular expressions are concatenated, they are matched in order. For example, b[aeiou]t will match five different English words.

The Kleene star matches zero or more repetitions. This is written with the * postfix operator. For example, bo*t will match bt and boot.

The postfix + matches one or more repetitions. The postfix ? matches zero or one repetition; more simply, the postfix ? marks an optional match.
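The three repetition operators can be compared side by side. This sketch assumes grep -E (the extended syntax, where + and ? are operators) and uses -x so that the pattern must match the whole line. The word list is made up.

```shell
# Candidate words, one per line.
words=$(printf 'bt\nbot\nboot\nboat\n')

# -E: extended syntax; -x: the pattern must match the entire line.
star=$(printf '%s\n' "$words" | grep -E -x 'bo*t' | paste -s -d ' ' -)
plus=$(printf '%s\n' "$words" | grep -E -x 'bo+t' | paste -s -d ' ' -)
opt=$(printf '%s\n' "$words" | grep -E -x 'bo?t' | paste -s -d ' ' -)

echo "bo*t: $star"   # zero or more o
echo "bo+t: $plus"   # one or more o
echo "bo?t: $opt"    # zero or one o
```

None of the three patterns matches boat, because the repetition applies only to the o.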

Finally, there is the alternation infix operator |. For example CSCI|MATH matches a couple of UNCA majors.

The precedence order of operators is:

repetition
concatenation
alternation

Just like in all programming languages, the precedence order can be overridden with parentheses.

Examples

X|YZ
(X|Y)Z
CA*T|DO?G
CA*(T|(DO)?)D
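One way to see the precedence rules at work is to test a few of these patterns against candidate strings. The sketch below assumes grep -E with -x (whole-line match); the candidate lists are invented for the example.

```shell
candidates=$(printf 'X\nYZ\nXZ\n')

# Concatenation binds tighter than alternation: X|YZ means X or YZ.
loose=$(printf '%s\n' "$candidates" | grep -E -x 'X|YZ' | paste -s -d ' ' -)

# Parentheses override that: (X|Y)Z means XZ or YZ.
grouped=$(printf '%s\n' "$candidates" | grep -E -x '(X|Y)Z' | paste -s -d ' ' -)

# Repetition binds tightest: * applies to A only, ? applies to O only.
pets=$(printf 'CT\nCAAT\nDG\nDOG\nDOOG\n' | grep -E -x 'CA*T|DO?G' | paste -s -d ' ' -)

echo "X|YZ     -> $loose"
echo "(X|Y)Z   -> $grouped"
echo "CA*T|DO?G -> $pets"
```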

Lex and sed were two early programs that used regular expressions. flex, a successor of lex, is still used by many compilers.

Regular expressions in programming languages

The Perl programming language was the first to make serious use of built-in regular expressions for string manipulation. JavaScript has followed this practice. Python and Java have libraries to support regular expressions. Regular expressions are frequently used in JavaScript, Perl and Python for string parsing. No one worries about the performance implications.

All of these programming languages use Perl-type regular expressions, which extend the grep syntax in some useful directions. (Many versions of grep, including GNU grep, support Perl regular expressions if the -P option is used.)

The Perl escape sequence

Perl escape sequences can be used to match useful classes of characters. For example, \s matches a whitespace character and \S matches a character that is not whitespace; likewise, \w matches a word character and \W matches a non-word character.
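If your grep supports -P, these escapes can be tried from the shell. This is only a sketch: the sample line is made up, and the script first checks for -P support because not every grep has it.

```shell
line='hello  big_world 42'

# Only run the examples if this grep accepts -P (an assumption we test for).
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    # \w+ : maximal runs of word characters (letters, digits, underscore)
    words=$(printf '%s\n' "$line" | grep -oP '\w+' | paste -s -d ',' -)
    # \S+ : maximal runs of non-whitespace; count how many there are
    chunks=$(printf '%s\n' "$line" | grep -oP '\S+' | wc -l | tr -d ' ')
    echo "words: $words"
    echo "non-whitespace chunks: $chunks"
else
    echo "this grep has no -P support"
fi
```

Notice that the underscore counts as a word character, so big_world comes back as a single word.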

Perl also pays attention to your LANG setting. This means you can control whether ñ matches \w or \W.

Getting the match

grep will tell you if there is a match, but Perl, Python, and JavaScript programmers really want to know where the match occurs. For example, a JavaScript program processing the address field “Asheville, NC 28804” may want to pull out the three fields with code similar to the following Perl:

# Capture the city, the two-letter state, and the five-digit ZIP code.
if (($city, $state, $zip) = $field =~ /^\s*(\w*)\s*,\s*(\w\w)\s+(\d{5})\s*$/) {
    &processAddress($city, $state, $zip) ;
} else {
    print "BAD ADDRESSFIELD: $field\n" ;
}

I’m certain there’s a syntax error in there somewhere. You can also think about fixing it to match “Winston-Salem, NC 27101” or “Elizabeth City, NC 27906”.
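For comparison, here is a shell-only sketch of the same extraction using sed -E and backreferences. It is one possible approach, not the only one: parse_address is a hypothetical helper name, POSIX classes such as [[:space:]] stand in for the Perl escapes, and the city pattern is an assumption that allows words joined by a single hyphen or space.

```shell
# Hypothetical helper: prints city|state|zip, or nothing if the field is malformed.
# Groups: \1 = city, \2 = last extra city word (unused), \3 = state, \4 = ZIP.
parse_address() {
    printf '%s\n' "$1" | sed -n -E \
        's/^[[:space:]]*([[:alpha:]]+([ -][[:alpha:]]+)*)[[:space:]]*,[[:space:]]*([[:upper:]]{2})[[:space:]]+([[:digit:]]{5})[[:space:]]*$/\1|\3|\4/p'
}

for field in 'Asheville, NC 28804' 'Winston-Salem, NC 27101' 'Elizabeth City, NC 27906' 'no zip here'; do
    parsed=$(parse_address "$field")
    if [ -n "$parsed" ]; then
        echo "$parsed"
    else
        echo "BAD ADDRESSFIELD: $field"
    fi
done
```

The -n option together with the p flag makes sed print only lines where the substitution succeeded, so an empty result signals a bad field.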

Regular expressions in more conventional programming languages

find

The basic idea is simple: Do a recursive traversal of the file hierarchy starting with a list of directories. Whenever a file or directory is visited, perform a specified action on the node you are visiting. The action is specified with a boolean expression.
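A minimal sketch of that traversal, run against a scratch directory so nothing real is touched. The file and directory names are invented for the example.

```shell
# Build a small scratch tree to search.
top=$(mktemp -d)
mkdir -p "$top/src/lib" "$top/docs"
touch "$top/src/main.c" "$top/src/lib/util.c" "$top/docs/notes.txt"

# find visits every node under the starting directory. The expression is
# true for regular files named *.c, and -print is the action taken on them.
c_files=$(cd "$top" && find . -type f -name '*.c' -print | LC_ALL=C sort | paste -s -d ' ' -)
echo "$c_files"

# Clean up the scratch tree.
rm -rf "$top"
```

The output is sorted here only to make it predictable; find itself reports nodes in whatever order the traversal reaches them.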

The problem is doing this from a shell command. These boolean expressions will be ugly.

The boolean operators

You noticed in CSCI 181 or 182 that there are three common boolean operators: AND (&&), OR (||) and NOT (!).

These three are all present in find but are expressed clumsily.

OR: -o
AND: -a, or simply concatenation (two terms written side by side)
NOT: !

You can also use parentheses to override the usual precedence rules. This can get ugly when you do this from the shell because the shell has its own special interpretation of (, ) and !. This means that when you want to say
find ( α -o ! β ) γ
you must type
find \( α -o \! β \) γ
Be sure to surround all of those operators with spaces.
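Here is a concrete instance of that shape, as a sketch on a scratch directory with invented file names: α is -name '*.c', β is -name '*.*', and γ is -type f. In words, it selects regular files that either end in .c or have no dot in their name.

```shell
# Scratch directory with a few made-up files.
top=$(mktemp -d)
touch "$top/a.c" "$top/b.h" "$top/notes.txt" "$top/Makefile"

# ( alpha -o ! beta ) gamma, with the parentheses and ! escaped for the shell.
picked=$(cd "$top" && find . \( -name '*.c' -o \! -name '*.*' \) -type f | LC_ALL=C sort | paste -s -d ' ' -)
echo "$picked"

# Plain OR of two name tests, ANDed (by concatenation) with -type f.
sources=$(cd "$top" && find . -type f \( -name '*.c' -o -name '*.h' \) | LC_ALL=C sort | paste -s -d ' ' -)
echo "$sources"

rm -rf "$top"
```

The name patterns are quoted for the same reason the parentheses are escaped: otherwise the shell would expand * before find ever saw it.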

Examples

Unfortunately, the terms of the find command are many and difficult. So let’s just go to a collection of Unix/Linux find command examples and try some of them out.