1.7. How the C Compiler Works

Once you have written a source file using a text editor, you can invoke a C compiler to translate it into machine code. The compiler operates on a translation unit consisting of a source file and all the header files referenced by #include directives. If the compiler finds no errors in the translation unit, it generates an object file containing the corresponding machine code. Object files are usually identified by the filename suffix .o or .obj. In addition, the compiler may also generate an assembler listing (see Part III).

Object files are also called modules. A library, such as the C standard library, contains compiled, rapidly accessible modules of the standard functions.

The compiler translates each translation unit of a C program (that is, each source file with any header files it includes) into a separate object file. The compiler then invokes the linker, which combines the object files, and any library functions used, into an executable file. Figure 1-1 illustrates the process of compiling and linking a program from several source files and libraries. The executable file also contains any information that the target operating system needs to load and start it.

Figure 1-1. From source code to executable file

1.7.1. The C Compiler's Translation Phases

The compiling process takes place in eight logical steps. A given compiler may combine several of these steps, as long as the results are not affected. The steps are:
For most compilers, either the preprocessor is a separate program, or the compiler provides options to perform only the preprocessing (steps 1 through 4 in the preceding list). This setup allows you to verify that your preprocessor directives have the intended effects. For a more practically oriented look at the compiling process, see Chapter 18.

1.7.2. Tokens

A token is either a keyword, an identifier, a constant, a string literal, or a symbol. Symbols in C consist of one or more punctuation characters, and function as operators or digraphs, or have syntactic importance, like the semicolon that terminates a simple statement, or the braces { } that enclose a block statement. For example, the following C statement consists of five tokens:

    printf("Hello, world.\n");

The individual tokens are:

    printf
    (
    "Hello, world.\n"
    )
    ;

The tokens interpreted by the preprocessor are parsed in the third translation phase. These are only slightly different from the tokens that the compiler interprets in the seventh phase of translation:
In parsing the source file into tokens, the compiler (or preprocessor) always applies the following principle: each successive non-whitespace character must be appended to the token being read, unless appending it would make a valid token invalid. This rule resolves any ambiguity in the following expression, for example:

    a+++b

Because the first + cannot be part of an identifier or keyword starting with a, it begins a new token. The second + appended to the first forms a valid token, the increment operator, but a third + does not. Hence the expression must be parsed as:

    a ++ + b

See Chapter 18 for more information on compiling C programs.
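The tokenizing rule above can be observed by evaluating the expression directly. The following is a minimal sketch, not part of the original text: the function names eval_unspaced and eval_spaced and their output parameters are hypothetical helpers introduced only for illustration. The first function uses a+++b as written; the second inserts whitespace to force the other grouping, so the two results can be compared.

```c
/* Hypothetical helper: evaluates a+++b exactly as written.
 * The compiler reads the longest valid token each time, so the
 * expression is tokenized as  a ++ + b  (post-increment a, add b). */
int eval_unspaced(int a, int b, int *a_after)
{
    int result = a+++b;
    *a_after = a;          /* a has been incremented by the ++ operator */
    return result;
}

/* Hypothetical helper: whitespace ends the first + token early,
 * so this is tokenized as  a + ++ b  (pre-increment b, then add). */
int eval_spaced(int a, int b, int *b_after)
{
    int result = a + ++b;
    *b_after = b;          /* b has been incremented by the ++ operator */
    return result;
}
```

Called with a equal to 1 and b equal to 2, eval_unspaced returns 3 and leaves a at 2, whereas eval_spaced returns 4 and leaves b at 3. The difference comes only from where the token boundaries fall.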