5.2. Objective 2: Process Text Streams Using Filters
Many Linux commands are intended to be used as filters, which transform text in useful ways. Text fed into a command's standard input or read from files is modified and sent to standard output or to a new file, leaving the original source file unmodified. Multiple commands can be combined in a pipeline, with the text stream modified at each step. This section describes basic use and syntax for the filtering commands important for Exam 101. Refer to a Linux command reference for full details on each command and the many other available commands.
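As a rough illustration of commands combined in a pipeline, the following sketch passes a few made-up, colon-delimited records through cut, sort, and uniq, all of which are described later in this section. The sample data and the variable name shells are assumptions chosen only for this example.

```shell
# Sample records stand in for a real file; the contents are invented for illustration.
shells=$(printf 'alice:/bin/bash\nbob:/bin/sh\ncarol:/bin/bash\n' |
  cut -d: -f2 |   # keep only the second colon-delimited field
  sort |          # bring identical lines next to each other
  uniq)           # drop adjacent duplicate lines
printf '%s\n' "$shells"
```

Each command reads the previous command's standard output, so no intermediate files are needed and the input itself is never modified.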
Syntax
cut options [files]
Description
Cut out (that is, print) selected columns or fields from one or more files. The source file is not changed. This is useful if you need quick access to a vertical slice of a file. By default, the slices are delimited by a tab character.
Frequently used options
-blist
Print bytes in list positions.
-clist
Print characters in list columns.
-ddelim
Set field delimiter for -f.
-flist
Print list fields.
Example
Show usernames (in the first colon-delimited field) from /etc/passwd:
$ cut -d: -f1 /etc/passwd
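The list given to -f or -c can name several positions or a range. The following sketch, which uses a made-up record in the style of /etc/passwd and hypothetical variable names, selects two fields and then a range of character columns:

```shell
# A sample record in /etc/passwd style (assumed data, not read from the system).
line='root:x:0:0:root:/root:/bin/bash'

# Fields 1 and 7; the output keeps the input delimiter between them.
fields=$(printf '%s\n' "$line" | cut -d: -f1,7)

# Character columns 1 through 4.
chars=$(printf '%s\n' "$line" | cut -c1-4)

printf '%s\n%s\n' "$fields" "$chars"
```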
Syntax
expand [options] [files]
Description
Convert tabs to spaces. Sometimes the use of tab characters can make output that is attractive on one output device look bad on another. This command eliminates tabs and replaces them with the equivalent number of spaces. By default, tabs are assumed to be eight spaces apart.
Frequently used options
-tnumber
Specify tab stops, in place of default 8.
-i
Initial; convert only at start of lines.
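A minimal sketch of the two options, using printf to generate input that contains tab characters. The variable names are hypothetical and exist only so the results can be compared:

```shell
# Default tab stops every 8 columns: "a", then spaces up to column 8, then "b".
default_stops=$(printf 'a\tb\n' | expand)

# Tab stops every 4 columns instead.
four_stops=$(printf 'a\tb\n' | expand -t 4)

# With -i, only the tab at the start of the line is converted;
# the tab after "a" is left in place.
leading_only=$(printf '\ta\tb\n' | expand -i -t 4)

printf '%s\n%s\n%s\n' "$default_stops" "$four_stops" "$leading_only"
```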
Syntax
fmt [options] [files]
Description
Format text to a specified width by filling lines and removing newline characters. Multiple files from the command line are concatenated.
Frequently used options
-u
Use uniform spacing: one space between words and two spaces between sentences.
-w width
Set line width to width. The default is 75 characters.
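The following sketch refills three short input lines to a width of 20 characters. The input text and the variable name wrapped are assumptions for illustration, and the exact line breaks chosen may vary between fmt implementations:

```shell
# Three short lines are joined and re-wrapped so that no line exceeds 20 characters.
wrapped=$(printf 'one two\nthree four\nfive six\n' | fmt -w 20)
printf '%s\n' "$wrapped"
```

The words and their order are unchanged; only the positions of the newline characters differ.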
Syntax
head [options] [files]
Description
Print the first few lines of one or more files (the "head" of the file or files). When more than one file is specified, a header is printed at the beginning of each file, and each is listed in succession.
Frequently used options
-c n
Print the first n bytes, or if n is followed by k or m, print the first n kilobytes or megabytes, respectively.
-n n
Print the first n lines. The default is 10.
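A short sketch of both options, with printf supplying five numbered lines on standard input. The variable names are hypothetical:

```shell
# First two lines of a five-line stream.
first_lines=$(printf '1\n2\n3\n4\n5\n' | head -n 2)

# First three bytes: "1", a newline, and "2".
first_bytes=$(printf '1\n2\n3\n4\n5\n' | head -c 3)

printf '%s\n' "$first_lines"
```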
Syntax
join [options] file1 file2
Description
Print a line for each pair of input lines, one each from file1 and file2, that have identical join fields. This function could be thought of as a very simple database table join, where the two files share a common index just as two tables in a database would.
Frequently used options
-j1field
Join on field of file1.
-j2field
Join on field of file2.
-jfield
Join on field of both file1 and file2.
Example
Suppose file1 contains the following:
1 one
2 two
3 three
and file2 contains:
1 11
2 22
3 33
Issuing the command:
$ join -j 1 file1 file2
yields the following output:
1 one 11
2 two 22
3 three 33
Syntax
nl [options] [files]
Description
Number the lines of files, which are concatenated in the output. This command numbers the lines in the body of the text and can optionally number header and footer lines, which are normally excluded from line numbering. The numbering is done for each logical page, which is defined as having a header, a body, and a footer. These are delimited by the special strings \:\:\:, \:\:, and \:, respectively.
Frequently used options
-b style
Set body numbering style to style, t by default.
-f style
Set footer number style to style, n by default.
-h style
Set header numbering style to style, n by default.
Styles can be in these forms:
a
Number all lines.
t
Only number non-empty lines.
n
Do not number lines.
pREGEXP
Only number lines that contain a match for regular expression REGEXP.
Example
Suppose file file1 contains the following text:
\:\:\:
header
\:\:
line1
line2
line3
\:
footer
\:\:\:
header
\:\:
line1
line2
line3
\:
footer
If the following command is given:
$ nl -h a file1
the output would yield numbered headers and body lines but no numbering on footer lines. Each new header represents the beginning of a new logical page and thus a restart of the numbering sequence:
1 header
2 line1
3 line2
4 line3
footer
1 header
2 line1
3 line2
4 line3
footer
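The body numbering styles can be compared on input that has no page delimiters at all, in which case every line is treated as body text. This sketch assumes a three-line input with an empty middle line; the variable names are hypothetical:

```shell
# Style a numbers every body line, including the empty one.
all=$(printf 'a\n\nb\n' | nl -b a)

# Style t (the default) skips the empty line, so "b" is numbered 2.
nonempty=$(printf 'a\n\nb\n' | nl -b t)

printf '%s\n---\n%s\n' "$all" "$nonempty"
```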
Syntax
od [options] [files]
Description
Dump files in octal and other formats. This program prints a listing of a file's contents in a variety of formats. It is often used to examine the byte codes of binary files but can be used on any file or input stream. Each line of output consists of an octal byte offset from the start of the file followed by a series of tokens indicating the contents of the file. Depending on the options specified, these tokens can be ASCII, decimal, hexadecimal, or octal representations of the contents.
Frequently used options
-t type
Specify the type of output. Typical types include:
a
Named character
c
ASCII character or backslash escape
o
Octal (the default)
x
Hexadecimal
Example
If file1 contains:
a1\n
A1\n
where \n stands for the newline character, the od command specifying named characters yields the following output:
$ od -t a file1
00000000 a 1 nl A 1 nl
00000006
A slight nuance is the ASCII character mode. This od command specifying ASCII characters yields the following output with backslash-escaped characters rather than named characters:
$ od -t c file1
00000000 a 1 \n A 1 \n
00000006
With numeric output formats, you can instruct od on how many bytes to use in interpreting each number in the data. To do this, follow the type specification by a decimal integer. This od command specifying single-byte hex results yields the following output:
$ od -t x1 file1
00000000 61 31 0a 41 31 0a
00000006
Doing the same thing in octal notation yields:
$ od -t o1 file1
00000000 141 061 012 101 061 012
00000006
If you examine an ASCII chart with hex and octal representations, you'll see that these results match those tables.
Syntax
paste [options] [files]
Description
Paste together corresponding lines of one or more files into vertical columns.
Frequently used options
-dn
Separate columns with character n in place of the default tab.
-s
Merge lines from one file into a single line. When multiple files are specified, their contents are placed on individual lines of output, one per file.
For the following three examples, file1 contains:
1
2
3
and file2 contains:
A
B
C
Example 1
A simple paste creates columns from each file in standard output:
$ paste file1 file2
1 A
2 B
3 C
Example 2
The column separator option yields columns separated by the specified character:
$ paste -d'@' file1 file2
1@A
2@B
3@C
Example 3
The single-line option (-s) yields a line for each file:
$ paste -s file1 file2
1 2 3
A B C
Syntax
pr [options] [file]
Description
Convert a text file into a paginated, columnar version, with headers and page fills. This command is convenient for producing presentable output, such as for a line printer, from raw text files. The header will consist of the date and time, the filename, and a page number.
Frequently used options
-d
Double space.
-hheader
Use header in place of the filename in the header.
-llines
Set page length to lines. The default is 66.
-o width
Set the left margin to width.
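A minimal sketch of the header and page length options, reading two lines from standard input. The header text Inventory and the variable name page are made up for this example, and the date shown in the header depends on when the command is run:

```shell
# Paginate two lines onto a short 20-line page with a custom header string.
page=$(printf 'apples\npears\n' | pr -h 'Inventory' -l 20)
printf '%s\n' "$page"
```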
Syntax
sort [options] [files]
Description
Write input to stdout, sorted alphabetically.
Frequently used options
-f
Case-insensitive sort.
-kPOS1[,POS2]
Sort on the key starting at POS1 and (optionally) ending at POS2.
-n
Sort numerically.
-r
Sort in reverse order.
-tSEP
Use SEP as the key separator. The default is to use whitespace as the key separator.
Example
Sort all processes on the system by resident size (RSS in ps):
$ ps aux | sort -k 6 -n
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? SW Feb08 0:00 [keventd]
root 3 0.0 0.0 0 0 ? SWN Feb08 0:00 [ksoftirqd_CPU0]
root 4 0.0 0.0 0 0 ? SW Feb08 0:01 [kswapd]
root 5 0.0 0.0 0 0 ? SW Feb08 0:00 [bdflush]
root 6 0.0 0.0 0 0 ? SW Feb08 0:00 [kupdated]
root 7 0.0 0.0 0 0 ? SW Feb08 0:00 [kjournald]
root 520 0.0 0.3 1340 392 tty0 S Feb08 0:00 /sbin/mingetty tt
root 335 0.0 0.3 1360 436 ? S Feb08 0:00 klogd -x
root 1 0.0 0.3 1372 480 ? S Feb08 0:18 init
daemon 468 0.0 0.3 1404 492 ? S Feb08 0:00 /usr/sbin/atd
root 330 0.0 0.4 1424 560 ? S Feb08 0:01 syslogd -m 0
root 454 0.0 0.4 1540 600 ? S Feb08 0:01 crond
root 3130 0.0 0.5 2584 664 pts/0 R 13:24 0:00 ps aux
root 402 0.0 0.6 2096 856 ? S Feb08 0:00 xinetd -stayalive
root 385 0.0 0.9 2624 1244 ? S Feb08 0:00 /usr/sbin/sshd
root 530 0.0 0.9 2248 1244 pts/0 S Feb08 0:01 -bash
root 3131 0.0 0.9 2248 1244 pts/0 R 13:24 0:00 -bash
root 420 0.0 1.3 4620 1648 ? S Feb08 0:51 sendmail: accepti
root 529 0.0 1.5 3624 1976 ? S Feb08 0:06 /usr/sbin/sshd
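The -t and -k options are often used together on delimited data. The following sketch sorts made-up name:number records, first numerically on the second colon-delimited field and then in reverse alphabetical order. The data and variable names are assumptions for illustration:

```shell
# Sample records (invented data).
data='carol:1002
alice:1000
bob:1001'

# Numeric sort on the key that starts at field 2, with ":" as the separator.
by_number=$(printf '%s\n' "$data" | sort -t: -k2 -n)

# Reverse alphabetical sort on the whole line.
reversed=$(printf '%s\n' "$data" | sort -r)

printf '%s\n---\n%s\n' "$by_number" "$reversed"
```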
Syntax
split [option] [infile] [outfile]
Description
Split infile into a specified number of line groups, with output going into a succession of files, outfileaa, outfileab, and so on (the default is xaa, xab, etc.). The infile remains unchanged. This command is handy if you have a very long text file that needs to be reduced to a succession of smaller files. This was often done to email large files in smaller chunks, because at one time it was considered bad practice to send a single large email message.
Frequently used option
-n
Split the infile into n-line segments. The default is 1,000.
Example
Suppose file1 contains:
1 one
2 two
3 three
4 four
5 five
6 six
Then the command:
$ split -2 file1 splitout_
yields as output three new files, splitout_aa, splitout_ab, and splitout_ac. The file splitout_aa contains:
1 one
2 two
splitout_ab contains:
3 three
4 four
and splitout_ac contains:
5 five
6 six
Syntax
tac [file]
Description
This command is named as an opposite for the cat command, which simply prints text files to standard output. In this case, tac prints the text files to standard output with lines in reverse order.
Example
Suppose file1 contains:
1 one
2 two
3 three
Then the command:
$ tac file1
yields as output:
3 three
2 two
1 one
Syntax
tail [options] [files]
Description
Print the last few lines of one or more files (the "tail" of the file or files). When more than one file is specified, a header is printed at the beginning of each file, and each is listed in succession.
Frequently used options
-c n
This option prints the last n bytes, or if n is followed by k or m, the last n kilobytes or megabytes, respectively.
-n m
Prints the last m lines. The default is 10.
-f
Continuously display a file as it is actively written by another process. This is useful for watching log files as the system runs.
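A short sketch of the line and byte options, with printf supplying five numbered lines on standard input. The variable names are hypothetical. The -f option is left out because it keeps running until interrupted:

```shell
# Last two lines of a five-line stream.
last_lines=$(printf '1\n2\n3\n4\n5\n' | tail -n 2)

# Last two bytes: "5" and the final newline
# (the shell strips the trailing newline when capturing the output).
last_bytes=$(printf '1\n2\n3\n4\n5\n' | tail -c 2)

printf '%s\n' "$last_lines"
```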
Syntax
tr [options] [string1 [string2]]
Description
Translate characters from string1 to the corresponding characters in string2. tr does not have file arguments and therefore must use standard input and output.
Note that string1 and string2 should contain the same number of characters since the first character in string1 will be replaced with the first character in string2 and so on.
Either string1 or string2 can contain several types of special characters. Some examples follow, although a full list can be found in the tr manpage.
a-z
All characters from a to z.
\\
A backslash (\) character.
\nnn
The ASCII character with the octal value nnn.
\x
Various control characters:
\a bell
\b backspace
\f form feed
\n newline
\r carriage return
\t horizontal tab
\v vertical tab
[:class:]
A POSIX character class:
[:alnum:] alphanumeric characters (letters and digits)
[:alpha:] alpha (letter) characters
[:blank:] horizontal whitespace (space or tab)
[:cntrl:] control characters
[:digit:] numeric (digit) characters
[:graph:] printable characters, not including space
[:lower:] lower case alpha characters
[:print:] all printable characters
[:punct:] punctuation characters
[:space:] all whitespace, horizontal or vertical (space, tab, newline, etc.)
[:upper:] upper case alpha characters
[:xdigit:] hexadecimal digits
Tip: The actual contents of the POSIX character classes vary based on locale.
Frequently used options
-c
Use the complement of (or all characters not in) string1.
-d
Delete characters in string1 from the output.
-s
Squeeze out repeated output characters in string1.
Example 1
To change all lowercase characters in file1 to uppercase, use:
$ cat file1 | tr a-z A-Z
or:
$ cat file1 | tr '[:lower:]' '[:upper:]'
Example 2
To suppress repeated whitespace characters from file1, use:
$ cat file1 | tr -s '[:blank:]'
Example 3
To remove all non-printable characters from file1 (except the newline character):
$ cat file1 | tr -dc '[:print:]\n'
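The position-by-position mapping between string1 and string2 described earlier can be seen with a two-character translation. This sketch uses made-up input strings and hypothetical variable names, and also shows -d with a character class:

```shell
# "e" maps to "i" and "l" maps to "p"; all other characters pass through unchanged.
mapped=$(printf 'hello\n' | tr 'el' 'ip')

# Delete every digit from the input.
deleted=$(printf 'a1b2c3\n' | tr -d '[:digit:]')

printf '%s\n%s\n' "$mapped" "$deleted"
```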
Syntax
unexpand [options] [files]
Description
Convert spaces to tabs. This command performs the opposite action of expand. By default, tab stops are assumed to be every eight spaces.
Frequently used options
-a
Convert all spaces, not just leading spaces. Normally unexpand will only work on spaces at the beginning of each line of input. Using the -a option causes it to replace spaces anywhere in the input.
Tip: This behavior of unexpand differs from expand. By default, expand converts all tabs to spaces; it requires the -i option to convert only leading tabs.
-t number
Specify tab stops, in place of default 8.
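The difference between the default behavior and -a can be sketched with a line that has eight leading spaces and a second run of spaces that ends on a tab stop. The input is generated with printf so that the spacing is exact; the variable names are hypothetical:

```shell
# Eight spaces, "a", seven spaces, "b": both runs of spaces end on a tab stop.
input=$(printf '%8sa%7sb' '' '')

# Default: only the leading spaces become a tab.
leading=$(printf '%s\n' "$input" | unexpand)

# With -a, the spaces between "a" and "b" become a tab as well.
all=$(printf '%s\n' "$input" | unexpand -a)

# od makes the tab characters visible as \t.
printf '%s\n' "$all" | od -t c
```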
Syntax
uniq [options] [input [output]]
Description
Writes input (or stdin) to output (or stdout), eliminating adjacent duplicate lines.
Since uniq works only on adjacent lines of its input, it is most often used in conjunction with sort.
Frequently used options
-d
Print only non-unique (repeating) lines.
-u
Print only unique (non-repeating) lines.
Example
Suppose file contains the following:
b
b
a
a
c
d
c
Issuing the command uniq with no options:
$ uniq file
yields the following output:
b
a
c
d
c
Notice that the line with c is repeated, since the duplicate lines were not adjacent in the input file. To eliminate duplicate lines regardless of where they appear in the input, use sort on the input first:
$ sort file | uniq
a
b
c
d
To print only lines that never repeat in the input, use the -u option:
$ sort file | uniq -u
d
To print only lines that do repeat in the input, use the -d option:
$ sort file | uniq -d
a
b
c
Syntax
wc [options] [files]
Description
Print counts of characters, words, and lines for files. When multiple files are listed, statistics for each file are output on a separate line, with a cumulative total output last.
Frequently used options
-c
Print the byte count only (the same as the character count for single-byte encodings).
-l
Print the line count only.
-w
Print the word count only.
Example 1
Show all counts and totals for file1, file2, and file3:
$ wc file[123]
Example 2
Count the number of lines in file1:
$ wc -l file1
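wc also reads standard input when no file is named, in which case no filename appears in the output. A small sketch with made-up two-line input and hypothetical variable names:

```shell
# Two lines, three words, 14 bytes including both newline characters.
sample='one two
three'

lines=$(printf '%s\n' "$sample" | wc -l)
words=$(printf '%s\n' "$sample" | wc -w)
bytes=$(printf '%s\n' "$sample" | wc -c)   # -c counts bytes

echo "lines=$lines words=$words bytes=$bytes"
```

Some implementations pad the counts with leading spaces, so scripts that compare the result should allow for that.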
Syntax
xargs [options] [command] [initial-arguments]
Description
Execute command followed by its optional initial-arguments and append additional arguments found on standard input. Typically, the additional arguments are filenames in quantities too large for a single command line. xargs runs command multiple times to exhaust all arguments on standard input.
Frequently used options
-n maxargs
Limit the number of additional arguments to maxargs for each invocation of command.
-p
Interactive mode. Prompt the user for each execution of command.
Example
Use grep to search a long list of files, one by one, for the word "linux":
$ find / -type f | xargs -n 1 grep -H linux
find searches for normal files (-type f) starting at the root directory. xargs executes grep once for each of them due to the -n 1 option. grep will print the matching line preceded by the filename where the match occurred (due to the -H option).
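The batching effect of -n is easier to see with a harmless command such as echo. In this sketch, five made-up arguments arrive on standard input and xargs runs echo once for every two of them; the variable name batched is hypothetical:

```shell
# Five arguments, at most two per invocation: echo runs three times.
batched=$(printf 'a b c d e\n' | xargs -n 2 echo)
printf '%s\n' "$batched"
```

Each output line corresponds to one execution of echo, with the last execution receiving the single leftover argument.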