Word Size and Data Types
A word is the amount of data that a machine can process at one time. This fits into the document analogy that includes characters (usually eight bits) and pages (many words, often 4KB or 8KB worth) as other measurements of data. A word is an integer number of bytes: for example, one, two, four, or eight. When someone talks about the "n bits" of a machine, they are generally referring to the machine's word size. For example, when people say the Pentium is a 32-bit chip, they mean that its word size is 32 bits, or four bytes.
The size of a processor's general-purpose registers (GPRs) is equal to its word size. The widths of the components in a given architecture (for example, the memory bus) are usually at least as wide as the word size. Typically, at least in the architectures that Linux supports, the memory address space is equal to the word size. Consequently, the size of a pointer is equal to the word size. Additionally, the size of the C long type is equal to the word size, whereas the size of the int type is sometimes less than the word size. For example, the Alpha has a 64-bit word size. Consequently, registers, pointers, and the long type are 64 bits in length. The int type, however, is 32 bits long. The Alpha can access and manipulate 64 bits (one word) at a time.
Each supported architecture under Linux defines BITS_PER_LONG in <asm/types.h> to the length of the C long type, which is the system word size. A full listing of all supported architectures and their word size is in Table 19.1.
The C standard explicitly leaves the size of the standard types up to implementations, although it does dictate a minimum size. The uncertainty in the standard C types across architectures is both a pro and a con. On the plus side, the standard types can take advantage of the word size of various architectures, and types need not explicitly specify a size. The size of the C long type is guaranteed to be the machine's word size. On the downside, however, code cannot assume that the standard C types have any specific size. Furthermore, there is no guarantee that an int is the same size as a long.
The situation grows even more confusing because there need not be any relation between the types in user-space and kernel-space. The sparc64 architecture provides a 32-bit user-space; therefore, pointers and both the int and long types are 32 bits. In kernel-space, however, sparc64 has a 32-bit int type and 64-bit pointers and long types. This is not the norm, though.
Opaque Types
Opaque data types do not reveal their internal format or structure. They are about as "black box" as you can get in C. There is not a lot of language support for them. Instead, developers declare a typedef, call it an opaque type, and hope no one typecasts it back to a standard C type. All use is generally through a special set of interfaces that the developer creates. An example is the pid_t type, which stores a process identification number. The actual size of this type is not revealed, although anyone can cheat, take a peek, and see that it is an int. If no code makes explicit use of this type's size, it can be changed without too much hassle. Indeed, this was once the case: In older Unix systems, pid_t was declared as a short.
Another example of an opaque type is atomic_t. As discussed in Chapter 9, "Kernel Synchronization Methods," this type holds an integer value that can be manipulated atomically. Although this type is an int, using the opaque type helps ensure that the data is used only in the special atomic operation functions. The opaque type also helps hide the size of the type, which was not always the full 32 bits because of architectural limitations on 32-bit SPARC.
Some data in the kernel, despite not being represented by an opaque type, requires a specific data type. Two examples are jiffy counts and the flags parameter used in interrupt control, both of which should always be stored in an unsigned long.
When storing and manipulating specific data, always pay careful attention to the data type that represents the value, and use that type. It is a common mistake to store one of these values in another type, such as unsigned int. Although this causes no problem on 32-bit architectures, 64-bit machines will have trouble.
Explicitly Sized Types
Often, as a programmer, you need explicitly sized data in your code. This is usually to match an external requirement, such as with hardware, networking, or binary files. For example, a sound card might have a 32-bit register, a networking packet might have a 16-bit field, or an executable file might have an 8-bit cookie. In these cases, the data type that represents the data needs to be exactly the right size.
The kernel defines these explicitly sized data types in <asm/types.h>, which is included by <linux/types.h>. Table 19.2 is a complete listing.
The signed variants are rarely used. These explicitly sized types are simply typedefs of the standard C types; on a 64-bit machine, they might be defined as follows:
typedef signed char s8;
typedef unsigned char u8;
typedef signed short s16;
typedef unsigned short u16;
typedef signed int s32;
typedef unsigned int u32;
typedef signed long s64;
typedef unsigned long u64;
On a 32-bit machine, however, they are probably defined as follows:
typedef signed char s8;
typedef unsigned char u8;
typedef signed short s16;
typedef unsigned short u16;
typedef signed int s32;
typedef unsigned int u32;
typedef signed long long s64;
typedef unsigned long long u64;
These types can be used only inside the kernel, in code that is never revealed to user-space; for example, they must not appear in a user-visible structure in a header file. This is for reasons of namespace. The kernel also defines user-visible variants of these types, which are simply the same types prefixed by two underscores. For example, the unsigned 32-bit integer type that is safe to export to user-space is __u32. This type is the same as u32; the only difference is the name. You can use either name inside the kernel, but if the type is user-visible you must use the underscore-prefixed version to prevent polluting user-space's namespace.
Signedness of Chars
The C standard says that the char data type can be either signed or unsigned. It is the responsibility of the compiler, the processor, or both to decide what the suitable default for the char type is.
On most architectures, char is signed by default and thus has a range from -128 to 127. On a few other architectures, such as ARM, char is unsigned by default and has a range from 0 to 255.
For example, on systems where a char is by default unsigned, this code ends up storing 255 instead of -1 in i:
char i = -1;
On other machines, where char is by default signed, this code correctly stores -1 in i. If the programmer's intention is to store -1, the previous code should be
signed char i = -1;
And if the programmer really intends to store 255, then the code should read
unsigned char i = 255;