The previous two sections demonstrated the power of pattern matching. It is now time to discuss another very powerful tool, the awk programming language, which also makes use of pattern-matching techniques. The Linux operating system provides the GNU version of awk, known as gawk, which is developed based on the POSIX standards. In the Linux world, awk and gawk are used synonymously, and awk is typically a symbolic link to the gawk program in the /bin directory. Traditional UNIX professionals may regard them as different, because gawk was developed after awk and on the principles laid out by awk. One may wonder how the name awk was selected for this utility: it combines the first letters of the last names of the three people who developed it, Alfred Aho, Peter Weinberger, and Brian Kernighan.
While grep is a command-line utility and the sed editor provides a scriptable interface in addition to command-line features, the gawk tool goes one step further, as a programming/scripting language. Although sed reads scripted instructions, sed scripts do not provide many complex programmable constructs. In this respect, the gawk utility may be considered a programming language with conditional constructs, iterative loops, predefined functions, and so on. Therefore, gawk may be viewed as somewhat similar to the C programming language. However, gawk is a single program that interprets scripts written in the syntax of the gawk language. There is no separate compile/link process to create executables out of gawk scripts.
Like sed, the gawk utility works on an input stream, either a file or the standard input. The input stream comprises one or more lines, which are processed one at a time; the instructions from the script, or those provided at the command line, are executed on each line. From the point of view of gawk, each input line is viewed as a record, and the words separated by delimiters are viewed as fields (similar to the concept of a relational database system). The input lines (or records) need not be of the same length and need not have the same number of words (or fields). The standard syntax of using gawk is shown here in a few examples. It is very similar to that of sed.
$ gawk '{print "Hello", $0}' awk_input_sample.txt
$ gawk '{print "Hello"}' awk_input_sample.txt
The awk_input_sample.txt file contains one line of text, Satya Sai Kolachina. The output of the above commands is shown below in the same order.
Hello Satya Sai Kolachina
Hello
The first command includes a $0 after a comma delimiter. In awk terminology, $0 indicates the complete input line. Therefore, the instruction tells gawk to print the literal ‘Hello’ followed by the content of the input line. In the second command, the $0 is removed along with the preceding comma delimiter. The corresponding output prints only the word ‘Hello.’ It should also be noted from the second command that an input file (or stream) is necessary for awk to process the instructions (or the script) even if the contents of the stream are not used by the instructions; the only exception is a program that consists solely of a BEGIN procedure, which runs without reading any input. If the input stream or file is omitted at the command line, then the command waits for us to enter input through the standard input, which is the keyboard. It is also important to note that awk executes the instructions once for every input line. The output of gawk is written to the standard output by default. It can be redirected to a file as shown here.
If the instruction {print "Hello", $0} is saved in a script file, say, awk_script.txt, then the command-line syntax should be changed as demonstrated here. The -f option indicates that the instructions are saved in the specified script file. The single quotes surrounding the instruction are removed when it is saved in the script file.
$ gawk -f awk_script.txt awk_input_sample.txt > awk_output1.txt
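The invocation styles described above can be sketched as a short shell session. This is only an illustration: it works in a scratch directory, uses the file names from the text, and calls the program as awk on the assumption that this name resolves to gawk (or another POSIX awk) on the system at hand.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Recreate the one-line sample input used in the text.
printf 'Satya Sai Kolachina\n' > awk_input_sample.txt

# Save the instruction in a script file (no surrounding quotes inside the file).
printf '{print "Hello", $0}\n' > awk_script.txt

# Inline instruction, with the output captured in a shell variable.
inline_out=$(awk '{print "Hello", $0}' awk_input_sample.txt)

# Same instruction read from the script file, with the output redirected to a file.
awk -f awk_script.txt awk_input_sample.txt > awk_output1.txt

# Input supplied through a pipe (standard input) instead of a file name.
piped_out=$(printf 'world\n' | awk '{print "Hello", $0}')

echo "$inline_out"
cat awk_output1.txt
echo "$piped_out"
```

All three forms run the same instruction; only the source of the instruction or of the input differs.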
The following subsections will provide more details on the features of gawk and the syntax of the language.
The gawk scripts (also called programs) consist of a series of instructions, which are simply pattern and action combinations. For every input line, gawk attempts to match it with the pattern in the script file in order to perform the associated action. If the input line fails to match the pattern in the specific instruction, then the associated action is not performed on the input line, and gawk continues to the next instruction. After all the instructions in the script file are processed on the current line, gawk restarts the process with the next input line from the first instruction onwards. The procedure associated with the BEGIN instruction call, if specified in the script, enables the performance of some tasks (such as setting some system variables) before gawk reads the first input line. Similarly, the procedure associated with the END instruction call may also be specified to perform tasks after all the input lines are processed.
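As a minimal sketch of this processing order, the following program combines a BEGIN procedure, one pattern and action pair, and an END procedure. The three input words and the variable name count are made up for the illustration.

```shell
# Three input lines are piped in. BEGIN runs before the first line is read,
# the pattern-action pair runs only for matching lines, and END runs last.
summary=$(printf 'apple\nbanana\navocado\n' | awk '
BEGIN { print "start" }                      # executed once, before any input
/^a/  { count++; print "match:", $0 }        # only lines that begin with "a"
END   { print "matched", count, "of", NR, "lines" }
')
echo "$summary"
```

The line banana fails the /^a/ pattern, so its action is skipped and gawk moves on to the next input line.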
Gawk identifies individual words in a line as fields, with system variables of the form $n. Thus $1 identifies the first field, $2 the second field, $3 the third field and so on, whereas $0 stands for the entire input line. By default, the fields are separated by spaces and tabs. One or more consecutive spaces and tabs are considered as a single field separator. However, we may tell gawk to consider any specific character as a field separator with the command-line option -F<fs>. The field separator should be specified in quotes. Often, data from a relational database may be exported into an ASCII text file, with the fields being separated by delimiters. If we want to analyze data from such a file, gawk is an ideal tool. It is common practice to use characters such as tilde ~, vertical bar |, and comma , as delimiters while separating field values. The command syntax shown below separates the fields by treating the tilde ~ character as the field separator in the input stream records.
$ gawk -F "~" -f awk_script2.txt awk_input2.txt
In this example, the awk_script2.txt file contains the following simple instructions.
{print $1, $2, $3}
The awk_input2.txt file contains sample records as shown below. The first field represents the first name, the second field represents the last name, and the third field represents the age.
James~Smith~35
Jennifer~Brown~29
Kelly~White~34
Execution of the above command with this input file produces output with the fields separated by the standard output field separator, which is the space character.
James Smith 35
Jennifer Brown 29
Kelly White 34
The field separator can also be set in the script instead of at the command line. Along with the field separator (identified by the FS system variable), any other system variables may also be set in the script, through the BEGIN procedure call. For example, the output field separator can be changed by setting the OFS system variable in the BEGIN procedure call. The following example illustrates these settings in the script file.
BEGIN {
    FS="~"
    OFS="|"
}
{print $1, $2, $3}
The script file contains the FS and OFS settings in the BEGIN procedure call, which is executed before the first input line is read. Multiple settings in the BEGIN procedure call may be specified on separate lines as shown here, or may be specified on the same line, separated by a semicolon. While processing this script, the command line does not have to specify the field separator. Even if it is specified, the FS setting in the script file takes precedence over the command-line specifier, as it is the latest setting before the input line is processed. Both FS and OFS can be specified to contain a string value instead of a single character.
There is also a record separator variable RS, which is the newline \n character by default. However, we may need to process files containing records that span multiple lines. This means that lines separated by the newline character may belong to the same record, in contrast to the default behavior, where the newline character separates the records. In such cases, we can define the variable RS to be something other than the newline character, such as an empty line. An empty line is specified with a pair of double quotes with nothing in between, as RS="". Similar to OFS, the ORS variable may be used to set the output record separator.
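The multi-line record case can be sketched as follows. The address data is invented for the illustration, and setting FS to the newline character is one possible choice (an assumption, not something the text prescribes) that turns each line of a record into one field.

```shell
# Two address records separated by a blank line; each field is on its own line.
records=$(printf 'James Smith\n12 Oak St\nDallas\n\nKelly White\n9 Elm Ave\nAustin\n' | awk '
BEGIN {
    RS = ""       # an empty RS makes blank lines separate the records
    FS = "\n"     # each line within a record becomes one field
    OFS = " | "   # a multi-character output field separator
}
{ print $1, $3 }  # first line (name) and third line (city) of each record
')
echo "$records"
```

With these settings, the three lines of each address are read as a single record of three fields.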
Predefined system variables are available for awk programmers. The system variables are of two types: those that can be reset by the programmers, such as FS, OFS, RS, and ORS (which are discussed in the previous section), and those that are of the read-only type, which provide information about the current record, file, command-line arguments, and so on. Table 3.10 describes some of the most commonly used awk system variables.
Awk System Variables | Description of the Variable |
---|---|
$n | The nth field in the current record. For example, $1 represents the first field, $2 represents the second field, and so on. |
$0 | The entire current record. |
ARGC | The number of arguments specified at the command line, counting the program name (awk or gawk) itself. While computing this value, the name of the script and the awk command-line options are ignored. The individual arguments are obtained from the ARGV array. |
ARGV | An array of the command-line arguments. The program name (awk or gawk) is always the first argument and is identified as ARGV[0]. Other arguments are identified as ARGV[I], where I is the index of the argument. The maximum value of I is ARGC-1, as the index starts at zero. For example, in the command line $ gawk -F "~" -f awk_script2.txt awk_input2.txt the ARGC value is 2, the ARGV[0] value is "gawk", and the ARGV[1] value is "awk_input2.txt". Note that the command-line options with the corresponding values and the script name are ignored. |
FILENAME | The name of the current input file being processed by awk. |
FIELDWIDTHS | A list of field widths separated by whitespace. When it is set, gawk splits each record into fixed-width fields instead of using FS. |
NF | The number of fields in the current record. Because it is a number, it can also be used to access the last field in the current record by using the $NF notation. |
NR | The current record number, that is, the number of input records read so far. The count continues across multiple input files. |
ENVIRON | An array of the environment variables available in the current session. For example, by iterating through the array elements in a for loop, as in for (env in ENVIRON) print ENVIRON[env], we can access these variables in the awk program. We can also access a specific environment variable by name: ENVIRON["PATH"] retrieves the PATH variable, which is accessed as $PATH in a typical shell script. |
ERRNO | A description of the last system error. |
OFS | Defines the output field separator used while writing output to the standard output. |
FS | Defines the field separator of the input records. |
RS | Defines the input record separator. |
ORS | Defines the output record separator. |
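A few of these variables can be seen together in a small sketch. It recreates the awk_input2.txt sample from the earlier section in a scratch directory; the environment variable GREETING is a made-up name used only for this illustration.

```shell
# Work in a scratch directory and recreate the sample records from the text.
cd "$(mktemp -d)"
printf 'James~Smith~35\nJennifer~Brown~29\nKelly~White~34\n' > awk_input2.txt

# GREETING is a hypothetical environment variable set just for this command.
report=$(GREETING=hello awk -F '~' '
{ print FILENAME ": record " NR " has " NF " fields, last field is " $NF }
END {
    print "arguments:", ARGC, ARGV[1]   # the -F option is not counted
    print "env:", ENVIRON["GREETING"]
}
' awk_input2.txt)
echo "$report"
```

The END procedure reports an ARGC of 2 because only the program name and the input file name are counted.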
It has been noted that an awk script consists of a set of instructions to be executed on the input stream, and an instruction is a combination of a pattern and action, as shown below.
pattern1 { action1 }
{ action2 }
pattern3 { action3 }
The pattern is described first, and then the corresponding action enclosed within a pair of curly braces. The action should be performed only on those input lines that match the specified pattern. If a pattern does not precede the action as shown in the second line in the example, then the action is performed on all the input lines.
We have also noted that the print function is used to send output to the standard output. Additionally, the printf function is available to format the output. The syntax of the printf function is very similar to the one in the C programming language. The comparison and matching operators used in awk are <, <=, >, >=, ==, !=, ~ and !~, which stand for less than, less than or equal to, greater than, greater than or equal to, equal to, not equal to, matches, and does not match, respectively. These operators are useful for deciding whether a particular action should be performed on the line or not, as shown in the following example.
$2 == "Smith" { print $1, $2 }
In this example, we are instructing awk to compare the value of the second field in the current input line to the string “Smith,” and if it matches, to print the first field and second field, which could be the first name followed by the last name, separated by the standard delimiter. Instead of attempting an exact string match, we can use the expression-matching principles by using the match ~ operator with a regular expression on the right-hand side of the operator, as shown here.
$3 ~ /^[0-9]+$/ { print $1, $2 }
In this example, we are instructing awk to print the first and second fields if the third field is a number. The anchors ^ and $ require the entire field to consist of digits; without them, the pattern would match any field that merely contains a digit. Logical expressions may be combined using the Boolean operators ||, &&, and ! for logical or, logical and, and logical negation, respectively. The arithmetic operators +, -, *, /, %, and ^ are used on numerical operands for addition, subtraction, multiplication, division, modulo operation, and exponentiation, respectively. The simple assignment operator = is used to assign values to variables. The arithmetic operators may be used in conjunction with the assignment operator to combine the two operations into one; for example, the += operator adds the operand on the right side to the operand on the left side and assigns the result to the operand on the left side.
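The operators discussed above can be combined in a short sketch that reuses the sample records from the earlier section. The variable names sum and n and the age threshold of 30 are arbitrary choices for the illustration.

```shell
totals=$(printf 'James~Smith~35\nJennifer~Brown~29\nKelly~White~34\n' | awk -F '~' '
$3 ~ /^[0-9]+$/ && $3 >= 30 {        # third field is all digits and at least 30
    sum += $3                        # same as sum = sum + $3
    n++
    printf "%s is %d\n", $2, $3      # C-style format string
}
END { printf "average: %.1f\n", sum / n }
')
echo "$totals"
```

Unlike print, printf does not append a newline automatically, so the format string ends with \n.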
The logical constructs such as if and if … else provided by awk are as powerful as they are in any other programming language. These are used in the action part of the instruction to enable us to write complex programming logic. Both the system variables and program variables can be used to take part in the expressions used in these constructs. The general syntax of these constructs is provided below. The first type checks for one condition, and if it is evaluated to be true, then the action is performed on the current line. If the condition evaluates to false, the action is not performed, and the control is passed to the instruction that follows the if construct. The second type of the construct is an extension of the simple if construct. In this construct, if the logical expression evaluates to true, then the action1 is performed, and if the condition evaluates to false, then action2 is performed.
if (logical expression)
    action

if (logical expression1)
    action1
else
    action2
In both these constructs, if the action consists of a single statement to be performed when the condition is evaluated to true (or false as in the second type), then the statement does not have to be enclosed within a pair of curly braces. However, if more than one statement should be grouped as a single action when the condition is evaluated to true (or false as the case may be), then these multiple statements are enclosed within curly braces, as shown below.
if (logical expression) {
    statement1
    statement2
}

if (logical expression1) {
    statement1
    statement2
}
else {
    statement3
    statement4
}
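A small sketch of the braced if…else form, again on the sample records from the earlier section, is given here. The variable names label, older, and younger are invented for the illustration.

```shell
groups=$(printf 'James~Smith~35\nJennifer~Brown~29\nKelly~White~34\n' | awk -F '~' '
{
    if ($3 >= 30) {          # two statements grouped as one action
        label = "30 or over"
        older++
    }
    else {
        label = "under 30"
        younger++
    }
    print $1, "is", label
}
END { print older, "older,", younger, "younger" }
')
echo "$groups"
```

Each branch contains two statements, so the curly braces are required in both.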
Another type of conditional construct is a simplified version of the ‘if…else’ construct as provided in the C programming language and is shown below.
(logical expression1) ? expression1 : expression2
In this construct, if the logical expression evaluates to true, the result is the value of expression1, and if it evaluates to false, the result is the value of expression2. As in C, the construct selects between two values rather than between two statements, so it is typically used on the right-hand side of an assignment or as an argument to print. For performing repeated execution of the same set of instructions within a loop, there are several looping constructs provided by awk, such as the for loop, the while loop, and the do…while loop. The general syntax of these constructs is provided here.
for ( initialize loop counter; test the loop counter; increment loop counter)
    action

while (logical condition)
    action

do
    action
while (logical condition)
A for loop executes its action for a specific number of iterations, which are tracked by a counter that is tested against a value on each pass. The header of the for loop consists of three parts: the first part initializes the counter to a value, the second part tests the counter against a test value, and the third part increments the counter as desired. A typical example of the for loop is given here. In this example, the for loop is executed eight times, with the counter value starting from 2 and ending at 9. Therefore, the for loop is very useful for executing an action a specific number of times.
for ( counter=2; counter < 10; counter++ ) {
    print counter
}
The action specified within a while loop is executed repeatedly as long as the logical condition evaluates to ‘true.’ The loop exits only when the logical condition evaluates to ‘false.’ Also, the loop does not execute even a single time if the logical condition is false before the first iteration. However, the do…while loop is guaranteed to execute at least once, even if the logical condition evaluates to false, because the condition is evaluated at the end of the loop after completing the first iteration. Both the while and do…while loops have to be programmed cautiously, and an appropriate exit criterion should be built into the loop; otherwise, it is likely that we will write infinite loops, and the program will never terminate. The continue and break statements can be used in all the looping constructs: continue skips the rest of the current iteration, and break exits the loop prematurely.
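The behavior of the while and do…while loops, together with continue and break, can be sketched as follows. The program consists only of a BEGIN procedure, which runs without reading any input, and the counter names i and j are arbitrary.

```shell
loops=$(awk '
BEGIN {
    # while: the condition is tested before each pass.
    i = 1
    while (i <= 5) {
        if (i == 2) { i++; continue }   # skip the rest of this pass
        if (i == 4) break               # leave the loop early
        print "while", i
        i++
    }

    # do...while: the body runs once even though the condition is false.
    j = 10
    do {
        print "do", j
        j++
    } while (j < 5)
}
')
echo "$loops"
```

The counter is incremented before continue; otherwise the condition i == 2 would stay true and the loop would never end.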
In all the looping constructs, if the action consists of multiple statements, they should be grouped within a pair of curly braces, as we have done in the case of the if and if…else constructs. If the curly braces do not appear in these constructs at the appropriate locations, then awk assumes that the action consists of only one statement: only the first statement after the loop header is repeated, and any statements that follow it are executed once, after the loop has finished.
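The next statement stops processing the current input line, retrieves the next input line from the current input stream, and restarts processing from the first instruction. The exit statement should be used to terminate the awk program without processing the remaining instructions or input lines; if an END procedure is specified, it is still performed before the program ends.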
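A brief sketch of next and exit is shown here. The input lines, including the STOP marker, are invented for the illustration.

```shell
filtered=$(printf '# header\nalpha\nbeta\nSTOP\ngamma\n' | awk '
/^#/         { next }               # skip comment lines; later rules are not tried
$0 == "STOP" { exit }               # stop reading input; END is still performed
             { print NR ": " $0 }
END          { print "done" }
')
echo "$filtered"
```

The line gamma is never read, because exit ends the input processing as soon as the STOP line is reached.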
An awk program may be modularized by the use of functions. If you are a developer experienced with any high-level programming language, you should be familiar with functions. Functions provide a way to group sets of statements performing specific repeated tasks. Instead of writing the same set of instructions again and again, these instructions are grouped together with a given function name. Wherever these instructions have to be executed repeatedly, the whole set of instructions can be replaced by a simple call to the function. Functions can also take arguments so that they provide customized execution scenarios in different situations. The typical syntax of a function is given here.
function <function name> (comma separated argument list) {
statements
}
The word ‘function’ should precede the name of the function. You can give a name of your choice to the function, but you must give one; it is not optional. If the function should return a value (or expression), you should use the return statement in the form return [expression/value]. Returning a value is not always required. Functions may simply return control of execution to the point from where they are called. If there is no value/expression to return, the return statement is optional. However, a return statement can also be used to leave the function early, before its last statement is reached, and return control to the calling point.
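The function syntax can be sketched with two small user-defined functions. The names max and report are hypothetical helpers written for this illustration and are not part of awk.

```shell
result=$(printf '3 4\n5 12\n' | awk '
# Hypothetical helper: returns the larger of two numbers.
function max(a, b) {
    if (a > b)
        return a    # early return
    return b
}

# Hypothetical helper with no return value: prints a labelled line.
function report(label, value) {
    print label ": " value
}

# Adding 0 makes sure the fields are compared as numbers, not as strings.
{ report("largest", max($1 + 0, $2 + 0)) }
')
echo "$result"
```

When calling a user-defined function, the opening parenthesis should follow the function name directly, without a space.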
Before finishing this topic, we will discuss a few of the most commonly used predefined system functions. These functions are known to awk (because they belong to awk) and are available for ready use. Table 3.11 displays some of the most commonly used awk functions.
Awk System Functions | Description of the Function |
---|---|
system(<command>) | Executes a shell command (which we normally execute at the shell prompt) from within the awk program. For example, frequently we need to check if a specific file exists in the file system, check/change the file permissions, create a subdirectory within the current directory, and so on. In all these cases, the commands should be wrapped in the system() function call. We can also check the return code returned by the executed command to make sure that the command executed properly. |
getline [< <file name>] | Reads the next input line from the current input stream, or from another file/pipe without leaving the current input stream. The <file name> represents the file to read, or may be omitted (along with the input redirection symbol <) to just read the next line from the current input file. |
gsub(regex, subst, inpstr) | Makes a global substitution within the input string inpstr, replacing every occurrence of the pattern matched by the regular expression regex with the subst string. |
sub(regex, subst, inpstr) | Works like the gsub() function except that it substitutes only the first occurrence of the matching pattern. |
close(filename) | Closes open files and pipes. With each call to the function, one file/pipe may be closed by specifying its name as the function argument. |
match(inpstr, regex) | Checks whether the pattern matched by the regular expression regex exists in the input string inpstr; returns the starting position of the matching substring if a match exists and returns 0 if a matching substring does not exist in the input string. |
tolower(string) | Converts the input string to all lowercase and returns the result. |
toupper(string) | Converts the input string to all uppercase and returns the result. |
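Several of these functions can be tried in one short sketch. The sample sentence and the shell commands passed to getline and system() are invented for the illustration; the substr() function and the RSTART and RLENGTH variables set by match() are standard awk features that are not covered in the table.

```shell
funcs=$(printf 'the cat sat on the mat\n' | awk '
{
    line = $0
    n = gsub(/the/, "a", line)          # replace every match; returns the count
    print n, line

    first = $0
    sub(/the/, "one", first)            # replace only the first match
    print first

    pos = match($0, /[a-z]at/)          # position of the first match, 0 if none
    print pos, substr($0, RSTART, RLENGTH)

    print toupper($1), tolower("MAT")

    cmd = "echo from-shell"             # hypothetical command, read through a pipe
    if ((cmd | getline reply) > 0)
        print reply
    close(cmd)                          # close the pipe when finished

    status = system("true")             # return code of the shell command
    print "status", status
}
')
echo "$funcs"
```

Both gsub() and sub() modify the string passed as the third argument in place, which is why the sketch works on copies of $0.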
Awk provides many more functions, variables, and program constructs not discussed in this chapter, and the reader is encouraged to explore further, as the chapter is intended mainly to introduce the concepts of using the powerful awk language.