References for this week
There are lots of references for regular expressions.
- Chapter 18 of Advanced Bash-Scripting Guide
- Syntax section of the Wikipedia regular expression page
- grep man page
- Python re module
- Mozilla Developer Network tutorial on JavaScript regular expressions
The find command is not as widely used. You can start with the manual page. For information overload, you can try the huge GNU Finding Files reference.
Readings for 27 January
By next week, you need to be a semi-experienced user of the shell. To get there, read (quickly) Part 1, “Learning the Shell”, of The Linux Command Line: A Complete Introduction by William E. Shotts, Jr. The book can be read online through the library or downloaded from the author’s web site, LinuxCommand.org. Part 1 is about 120 pages; however, you should be able to keep the book open next to a terminal and go through the commands pretty quickly. I would estimate less than one minute per page.
Regular expressions
The regular expression was created by mathematicians and linguists to describe words. The logician Stephen Cole Kleene is considered the creator of the regular expression.
Regular expressions are used in many Unix utilities. They are also used to generate the lexical analyzers used in modern compilers. Finally, they have found their way into many programming languages. The Mozilla Developer Network page for JavaScript Regular Expressions is a great tutorial on how regular expressions are used in recent programming languages.
Unix regular expressions
This is just the beginning. See the grep man page for the complete list.
Matching specific characters or locations
Most single characters are matched by the character itself:
a
matches a.
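Some characters, such as +, have a special meaning in the extended syntax (grep -E) and must be escaped there to make the match:
\+
matches +.
(Do not rely on \a matching a: POSIX leaves a backslash before an ordinary character undefined, and in Perl \a is the alarm character.)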
Bracket expressions also match single characters.
For example, [AaEeIiOoUu]
matches a single
vowel, [0-9]
matches a single decimal digit,
[0-9A-Fa-f]
matches a single hexadecimal digit,
and [^a]
matches anything but an a.
See the man page for more bracket expressions.
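The single-character patterns above can be tried out with a short script. This is only a sketch: the sample strings and the helper name whole_line_matches are made up for illustration, and it assumes a grep that accepts -E and -x.

```shell
# Keep ranges such as A-F predictable regardless of the user's locale.
LC_ALL=C; export LC_ALL

# Hypothetical test input: one candidate string per line.
samples='a
e
7
F
g
+'

# Print the samples that match pattern $1 in full, joined by spaces.
# -E selects the extended syntax; -x requires the whole line to match.
whole_line_matches() {
    printf '%s\n' "$samples" | grep -E -x -e "$1" | paste -s -d ' ' -
}

echo "vowels:     $(whole_line_matches '[AaEeIiOoUu]')"
echo "hex digits: $(whole_line_matches '[0-9A-Fa-f]')"
echo "not an a:   $(whole_line_matches '[^a]')"
echo "plus sign:  $(whole_line_matches '\+')"
```

The patterns are single-quoted so that the shell passes the brackets and the backslash to grep unchanged.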
It is also possible to use anchors to match positions within a line:
^
matches the beginning and
$
matches the end.
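A small experiment shows how anchors narrow a match: ^ pins the pattern to the start of the line and $ pins it to the end. The word list and the helper name matching are invented for this sketch.

```shell
# Hypothetical input: four lines that all contain "cat" somewhere.
lines='cat
concat
catalog
bobcats'

# Print the lines that contain a match for pattern $1, joined by spaces.
matching() {
    printf '%s\n' "$lines" | grep -E -e "$1" | paste -s -d ' ' -
}

echo "anywhere:   $(matching 'cat')"
echo "at start:   $(matching '^cat')"
echo "at end:     $(matching 'cat$')"
echo "whole line: $(matching '^cat$')"
```

Using both anchors at once, as in the last line, is how you ask for a line that consists of the pattern and nothing else.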
Operators
Finally, we get to the stuff that
Kleene did.
When two regular expressions are concatenated, they are matched in
order. For example, b[aeiou]t
will match five different
English words.
The Kleene star matches
zero or more repetitions.
This is written with the *
postfix operator.
For example, bo*t
will match
bt, bot, and boot.
The postfix +
matches one or more repetitions.
The postfix ?
matches zero or one repetition, or more
simply, the postfix ?
is an optional match.
Finally, there is the infix alternation operator |. For example,
CSCI|MATH
matches a couple of UNCA majors.
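The operators can be compared side by side by running each pattern over the same word list. The list and the helper name whole_word_matches are assumptions made for this sketch; STAT is included only as a non-matching line.

```shell
# Hypothetical word list, one word per line.
words='bt
bot
boot
bat
CSCI
MATH
STAT'

# Print the words that match pattern $1 in full, joined by spaces.
whole_word_matches() {
    printf '%s\n' "$words" | grep -E -x -e "$1" | paste -s -d ' ' -
}

echo "b[aeiou]t : $(whole_word_matches 'b[aeiou]t')"   # concatenation
echo "bo*t      : $(whole_word_matches 'bo*t')"        # zero or more o
echo "bo+t      : $(whole_word_matches 'bo+t')"        # one or more o
echo "bo?t      : $(whole_word_matches 'bo?t')"        # zero or one o
echo "CSCI|MATH : $(whole_word_matches 'CSCI|MATH')"   # alternation
```

Only the word list contains bot and bat out of the five possible b[aeiou]t words, so those are the two that show up for the first pattern.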
The precedence order of operators is:
- repetition
- concatenation
- alternation
Just like in all programming languages, the precedence order can be overridden with parentheses.
Examples
X|YZ
(X|Y)Z
CA*T|DO?G
CA*(T|(DO)?)D
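One way to check your reading of the precedence rules is to feed a few candidate strings to grep -E and see which ones each pattern accepts in full. The candidate strings and the helper name accepted are made up for this sketch.

```shell
# Hypothetical candidate strings, one per line.
candidates='X
Y
XZ
YZ
CT
CAAT
DG
DOG
DOOG'

# Print the candidates that match pattern $1 in full, joined by spaces.
accepted() {
    printf '%s\n' "$candidates" | grep -E -x -e "$1" | paste -s -d ' ' -
}

# Concatenation binds tighter than alternation: X, or else YZ.
echo "X|YZ      : $(accepted 'X|YZ')"
# Parentheses force the alternation first: X or Y, then Z.
echo "(X|Y)Z    : $(accepted '(X|Y)Z')"
# Repetition binds tightest: * applies only to A, ? only to O.
echo "CA*T|DO?G : $(accepted 'CA*T|DO?G')"
```

DOOG is rejected by the last pattern because ? allows at most one O.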
Lex and sed were two early programs that used regular expressions. flex, a successor of lex, is still used by many compilers.
Regular expressions in programming languages
The Perl programming language was the first to make serious use of built-in regular expressions for string manipulation. JavaScript has followed this practice. Python and Java have libraries to support regular expressions. Regular expressions are frequently used in JavaScript, Perl and Python for string parsing. No one worries about the performance implications.
All of these programming languages use Perl-type regular expressions, which extend the grep syntax in some useful directions.
(GNU grep, the version found on most Linux systems, will support Perl regular expressions if the -P option is used.)
The Perl escape sequence
Perl escape sequences can be used to match useful classes of characters. For example, \s will match a whitespace character, \S will match a character that is not whitespace, \w will match a word character (a letter, digit, or underscore), and \W will match a non-word character.
Perl also pays attention to your LANG setting. This means you can control whether ñ matches \w or \W.
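These four escapes can be tried from the shell without writing any Perl. The sketch below assumes GNU grep, which accepts \s, \S, \w, and \W as extensions even without -P; other versions of grep may not. The sample lines and the helper name lines_matching are invented, and the locale is pinned to C so that only ASCII letters, digits, and the underscore count as word characters.

```shell
# Pin the locale so the meaning of \w does not depend on LANG.
LC_ALL=C; export LC_ALL

# Hypothetical sample lines.
sample() {
    printf '%s\n' 'one' 'two words' 'under_score' 'semi;colon'
}

# Print the sample lines that contain a match for $1, joined by commas.
lines_matching() {
    sample | grep -E -e "$1" | paste -s -d ',' -
}

echo "contains whitespace:      $(lines_matching '\s')"
echo "no whitespace at all:     $(lines_matching '^\S+$')"
echo "word characters only:     $(lines_matching '^\w+$')"
echo "contains a non-word char: $(lines_matching '\W')"
```

Note that the underscore counts as a word character while the semicolon and the space do not.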
Getting the match
grep will tell you if there is a match, but Perl, Python, and JavaScript programmers really want to know where the match occurs. For example, when JavaScript is processing the address field “Asheville, NC 28804”, it may want to pull out the three fields with code similar to the following Perl:
if (($city, $state, $zip) = $field =~ /^\s*(\w*)\s*,\s*(\w\w)\s+(\d{5})\s*$/) {
    # The three parenthesized groups capture the city, state, and ZIP code.
    &processAddress($city, $state, $zip) ;
} else {
    print "BAD ADDRESSFIELD: $field\n" ;
}
I’m certain there’s a syntax error in there somewhere. You can also think about fixing it to match “Winston-Salem, NC 27101” or “Elizabeth City, NC 27906”.
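The same idea of pulling out parenthesized groups can be sketched at the shell with sed. This is not the Perl approach itself but a rough equivalent: it assumes a sed that accepts -E, replaces \s, \w, and \d with POSIX bracket expressions, and uses a made-up function name, parse_address. It deliberately keeps the original pattern's limitation on city names.

```shell
# Three groups: city, two-character state, five-digit ZIP code.
pattern='^[[:space:]]*([[:alnum:]_]*)[[:space:]]*,[[:space:]]*([[:alnum:]_]{2})[[:space:]]+([0-9]{5})[[:space:]]*$'

# Print "city|state|zip" for a well-formed field, or nothing on a mismatch.
# -n suppresses normal output; the p flag prints only when the s command matched.
parse_address() {
    printf '%s\n' "$1" | sed -n -E "s/$pattern/\1|\2|\3/p"
}

field='Asheville, NC 28804'
parsed=$(parse_address "$field")
if [ -n "$parsed" ]; then
    # Split the three captured groups on the | separator.
    IFS='|' read -r city state zip <<EOF
$parsed
EOF
    echo "city=$city state=$state zip=$zip"
else
    echo "BAD ADDRESSFIELD: $field"
fi
```

As written, parse_address still prints nothing for “Winston-Salem, NC 27101” and “Elizabeth City, NC 27906”, so the exercise of widening the city group remains open.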
find
The basic idea is simple: Do a recursive traversal of the file hierarchy starting with a list of directories. Whenever a file or directory is visited, perform a specified action on the node you are visiting. The action is specified with a boolean expression.
The problem is doing this from a shell command. These boolean expressions will be ugly.
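Before the expressions get ugly, it may help to see the traversal on a tiny tree. The sketch below builds a throwaway directory (the file names are invented, and it assumes mktemp -d is available) and runs two simple expressions over it. When no action is given, find prints each node for which the expression is true.

```shell
# Sort in byte order so the listing is the same everywhere.
LC_ALL=C; export LC_ALL

# Build a small hypothetical hierarchy in a scratch directory.
demo=$(mktemp -d)
mkdir -p "$demo/src/lib"
touch "$demo/notes.txt" "$demo/src/main.c" "$demo/src/lib/util.c"

# Visit every node below the starting directory; keep regular files named *.c.
c_files=$(cd "$demo" && find . -type f -name '*.c' | sort | paste -s -d ' ' -)
echo "C files:     $c_files"

# Same traversal, but keep only the directories that were visited.
dirs=$(cd "$demo" && find . -type d | sort | paste -s -d ' ' -)
echo "directories: $dirs"

rm -rf "$demo"
```

The pattern '*.c' is quoted so that find, not the shell, expands the wildcard.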
The boolean operators
You probably noticed in CSCI 181 or 182 or 255 that there are three common boolean operators: AND (&&), OR (||), and NOT (!).
These three are all present in find but are expressed clumsily.
- OR: -o
- AND: -a, or simply concatenation
- NOT: !
You can also use parentheses to override the usual precedence rules.
This can get ugly when you do this from the shell because the shell has its own special interpretation of (, ), and !.
This means that when you want to say
find ( α -o ! β ) γ
you must type
find \( α -o \! β \) γ
Be sure to surround all of those operators with spaces.
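Here is a concrete instance of the escaped form, run against a scratch directory. The file names are made up for this sketch, and it assumes mktemp -d is available. It also shows why the parentheses matter: AND binds tighter than OR, so leaving them out changes which nodes are selected.

```shell
# Sort in byte order so the listing is the same everywhere.
LC_ALL=C; export LC_ALL

# Hypothetical files, plus one directory whose name ends in .c.
demo=$(mktemp -d)
touch "$demo/a.c" "$demo/b.h" "$demo/c.txt" "$demo/d.c.bak"
mkdir "$demo/e.c"
cd "$demo" || exit 1

# ( *.c OR *.h ) AND regular file: the directory e.c is excluded.
parenthesized=$(find . \( -name '*.c' -o -name '*.h' \) -type f | sort | paste -s -d ' ' -)

# Without parentheses this reads as *.c OR ( *.h AND regular file ),
# so the directory e.c slips through.
unparenthesized=$(find . -name '*.c' -o -name '*.h' -type f | sort | paste -s -d ' ' -)

# Regular file AND NOT ( *.c OR *.h ).
negated=$(find . -type f \! \( -name '*.c' -o -name '*.h' \) | sort | paste -s -d ' ' -)

echo "with parentheses:    $parenthesized"
echo "without parentheses: $unparenthesized"
echo "negated:             $negated"

cd / && rm -rf "$demo"
```

Every \(, \), and \! is a separate word with spaces on both sides, which is what find requires.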
Examples
Unfortunately, the terms of the find command are many and difficult. So let’s just go to a collection of Unix/Linux find command examples and try some of them out.