Integral data type

The integral data types are computer data types capable of storing integers. Integral data types generally consist of a certain fixed number of bits, which is called the data type size and is usually a power of two. This implementation of an integral data type is sometimes called a fixed-precision integer. Arbitrary-precision integers, whose size is limited only by available memory, are found in languages such as Lisp. Integral data types are treated as a unit of storage and manipulation.

The table below lists data types recognized by common processors. Additional data types found in high-level programming languages, such as bit-fields and extended-precision integers, are not discussed here. Following the table are additional usage notes, then details on number representation. For information about representation of real numbers, see real data type.

Table of contents

1 Common sizes of integral data types

1.1 Integers
1.2 Pointers
1.3 Words
1.4 Bytes

2 Representing integers

2.1 Complement
2.2 Sign-and-magnitude
2.3 Ones' complement
2.4 Two's complement

3 Endianness

Common sizes of integral data types

bits	name	comments
1	bit	status, Boolean flag, has only 2 possible states
4	nibble, nybble	half a byte (the name is a humorous derivation from "byte"); can contain a single BCD digit
8	byte, octet	integers, ASCII characters
16	word	integers, pointers, UCS-2 characters
32	doubleword/longword	usually shortened to long; integers, pointers
64	quadword, long long	integers, pointers
128	octword	integers, pointers

The terms in the table are typically used only when the content is to be interpreted numerically (and not as some other kind of data structure).

Integers

An integral data type of size n bits can be used to represent numbers in the range 0 to 2ⁿ - 1. Thus, a 16-bit integral data type can be used to represent numbers from 0 to 2¹⁶ - 1 = 65,535. Negative numbers can be represented as well, at the expense of decreasing the maximum positive number that can be represented (see below).
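As a small illustration of the range formula above, the following Python sketch computes the largest unsigned value for a given size. The helper name `unsigned_max` is hypothetical and introduced only for this example.

```python
def unsigned_max(n_bits: int) -> int:
    """Largest value an n-bit unsigned integer can hold: 2**n - 1."""
    return (1 << n_bits) - 1


for n in (8, 16, 32):
    # For n = 16 this prints 65535, matching the figure in the text.
    print(n, unsigned_max(n))
```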

Pointers

Pointer is a generic term for an integral value (or a structure thereof) that specifies ("points to") a location (address) in memory. The pointer size limits how much memory a computer can address: an n-bit pointer can distinguish at most 2ⁿ locations. Since modern applications demand simultaneous access to hundreds of megabytes of information, pointers of 32 bits and above are widely used.
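The relationship between pointer size and addressable memory can be sketched as follows. This assumes a flat, byte-addressable memory in which every pointer value names exactly one byte; real systems may differ (segmentation, word addressing, reserved ranges). The function name `addressable_bytes` is hypothetical.

```python
def addressable_bytes(pointer_bits: int) -> int:
    """Number of distinct byte addresses an n-bit pointer can express,
    assuming a flat, byte-addressable memory (an assumption of this sketch)."""
    return 1 << pointer_bits


# A 16-bit pointer reaches 64 KiB; a 32-bit pointer reaches 4 GiB.
print(addressable_bytes(16) // 1024, "KiB")
print(addressable_bytes(32) // 1024**3, "GiB")
```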

Words

The term "word" is highly CPU- and OS-specific. It initially meant "the size of an address in the system memory". Thus, one could say that the IBM 370 had 32-bit words. However, since the introduction of the x86 architecture, "word" has become virtually synonymous with "16 bits" (even though the Intel 80386 and later processors use 32-bit addresses). Different types of CPU have used other sizes as well: 8-, 12-, 16-, 32-, 36-, 60- and 64-bit words have all been used. The meaning of "doubleword" and "quadword" changes with the definition of "word".

Bytes

"Byte" sometimes means a quantity of bits other than 8; 36-bit word architectures commonly had 9-bit bytes. The term octet always refers to eight bits and can be used for more clarity. It is mostly used in the field of computer networking. Since bytes are the most basic unit of information on most computer architectures, they are used to express the size or amount of computer memory or storage, regardless of the type of data represented. For example, the sizes of a 50-byte text string, a 100 KB (kilobyte) file, 128 MB (megabytes) of RAM, or 30 GB (gigabytes) of disk storage are all expressed in bytes.

Representing integers

Complement

Complementing a binary number means changing all the 0s to 1s and all the 1s to 0s; a Boolean NOT on each bit. A byte, holding 8 bits, can represent the values 00000000 (0) to 11111111 (255₁₀), if all bits are used to represent the magnitude of the number. This is called an unsigned integer.
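A minimal Python sketch of complementing an 8-bit value follows. Python integers are unbounded, so the result is masked to 8 bits; the helper name `complement8` is hypothetical.

```python
def complement8(x: int) -> int:
    """Flip every bit of an 8-bit value (a Boolean NOT on each bit).

    Python's ~ operator works on unbounded integers, so the result is
    masked with 0xFF to keep only the low 8 bits.
    """
    return ~x & 0xFF


print(format(complement8(0b00101011), "08b"))  # 11010100
```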

To represent both positive and negative (signed) integers, the convention is that the most significant bit (MSB) of the binary representation indicates the sign of the number rather than contributing to its magnitude. Three formats have been used for representing such numbers: sign-and-magnitude, ones' complement, and two's complement, the last being by far the most common nowadays.

Sign-and-magnitude

Sign-and-magnitude is the simplest format and the closest to how humans write numbers. The MSB is set to 0 for a positive number or zero, and to 1 for a negative number. The remaining bits indicate the (positive) magnitude. Hence, in a byte, which has only seven bits apart from the sign bit, the magnitude can range from 0000000 (0) to 1111111 (127), so the numbers from -127₁₀ to +127₁₀ can be represented. For example, -43 encoded in a byte this way is 10101011.
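The encoding described above might be sketched in Python as follows. The function names are hypothetical, and the sketch handles only the 8-bit case.

```python
def to_sign_magnitude8(value: int) -> int:
    """Encode an integer in -127..127 as an 8-bit sign-and-magnitude pattern."""
    if not -127 <= value <= 127:
        raise ValueError("value does not fit in 8-bit sign-and-magnitude")
    sign = 0x80 if value < 0 else 0x00  # MSB carries the sign
    return sign | abs(value)            # low 7 bits carry the magnitude


def from_sign_magnitude8(bits: int) -> int:
    """Decode an 8-bit sign-and-magnitude pattern back to an integer."""
    magnitude = bits & 0x7F
    return -magnitude if bits & 0x80 else magnitude


print(format(to_sign_magnitude8(-43), "08b"))  # 10101011
```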

Ones' complement

The ones'-complement representation of a negative number is created by taking the complement of its positive counterpart. For example, negating 00101011 (43) gives 11010100 (-43). (Notice how this differs from the sign-and-magnitude convention, where the same bit pattern would be -84.) The PDP-1 and the UNIVAC 1100/2200 series use ones'-complement arithmetic. The range of signed numbers using ones' complement in a byte is -127₁₀ to +127₁₀.
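A short Python sketch of 8-bit ones'-complement encoding and decoding, under the same conventions as the paragraph above. The function names are hypothetical.

```python
def to_ones_complement8(value: int) -> int:
    """Encode an integer in -127..127 as an 8-bit ones'-complement pattern."""
    if not -127 <= value <= 127:
        raise ValueError("value does not fit in 8-bit ones' complement")
    if value >= 0:
        return value
    # A negative number is the bitwise complement of its positive counterpart.
    return ~(-value) & 0xFF


def from_ones_complement8(bits: int) -> int:
    """Decode an 8-bit ones'-complement pattern back to an integer."""
    if bits & 0x80:
        return -(~bits & 0xFF)
    return bits


print(format(to_ones_complement8(-43), "08b"))  # 11010100
```

Note that the pattern 11111111 decodes to zero in this scheme, which is the "negative zero" of ones' complement.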

Two's complement

Both ones' complement and sign-and-magnitude have two ways to represent zero: 00000000 (+0) and 11111111 (-0) in ones' complement, and 00000000 (+0) and 10000000 (-0) in sign-and-magnitude. This is sometimes problematic, since hardware for adding and subtracting becomes more complicated, as does testing for 0.

To avoid this, and to also make integer addition simpler, the two's-complement representation is the one generally used. The two's-complement representation is created by first complementing the positive number, then adding 1 to it. Thus 00101011 (43) becomes 11010101 (-43).

In two's-complement, there is only one zero (00000000). Negating a negative number involves the same operation: complementing, then adding 1. The pattern 11111111 now represents -1₁₀ and 10000000 represents -128₁₀; that is, the range of two's-complement integers is -128₁₀ to +127₁₀.
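The complement-then-add-1 rule can be sketched in Python for the 8-bit case. All function names here are hypothetical, and masking with 0xFF stands in for the fixed register width.

```python
def to_twos_complement8(value: int) -> int:
    """Encode an integer in -128..127 as an 8-bit two's-complement pattern."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in 8-bit two's complement")
    return value & 0xFF


def from_twos_complement8(bits: int) -> int:
    """Decode an 8-bit two's-complement pattern back to an integer."""
    return bits - 0x100 if bits & 0x80 else bits


def negate8(bits: int) -> int:
    """Negate a pattern: complement every bit, add 1, keep the low 8 bits."""
    return (~bits + 1) & 0xFF


print(format(negate8(0b00101011), "08b"))  # 11010101
```

One consequence of the asymmetric range is that `negate8(0b10000000)` returns 10000000 again: -128 has no positive counterpart in 8 bits.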

To add two two's-complement integers, treat them as unsigned numbers, add them, and ignore any carry out of the most significant bit (this is essentially the great advantage that two's complement has over the other conventions). The result will be the correct two's-complement number, unless both summands were positive and the result is negative, or both summands were negative and the result is non-negative. These exceptional cases are referred to as "overflow" or "wrap around"; the addition cannot be carried out in 8-bit two's complement in these cases. For example:

      00101011 (+43)     11010101 (-43)     00101011 (+43)     10011010 (-102)
    + 11010101 (-43)   + 11100011 (-29)   + 11100011 (-29)   + 10110001 (- 79)
    - - - - - - - -    - - - - - - - -    - - - - - - - -    - - - - - - - - -
      00000000 (  0)     10111000 (-72)     00001110 (+14)     01001011 (overflow)
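The addition rule and the overflow test described above can be sketched in Python as follows. The helper `add8` is hypothetical; it takes and returns raw 8-bit patterns and reports overflow by comparing sign bits, as the text describes.

```python
def add8(a: int, b: int):
    """Add two 8-bit two's-complement patterns.

    Returns (result_pattern, overflow). The carry out of bit 7 is
    discarded by masking; overflow is flagged when both operands share
    a sign bit and the result's sign bit differs from it.
    """
    result = (a + b) & 0xFF
    sign_a, sign_b, sign_r = a & 0x80, b & 0x80, result & 0x80
    overflow = sign_a == sign_b and sign_r != sign_a
    return result, overflow


# The four bit-pattern sums from the worked example above.
for a, b in [(0b00101011, 0b11010101), (0b11010101, 0b11100011),
             (0b00101011, 0b11100011), (0b10011010, 0b10110001)]:
    result, overflow = add8(a, b)
    print(format(result, "08b"), "overflow" if overflow else "ok")
```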

Endianness

When an integer is represented with multiple bytes, the actual ordering of those bytes in memory, or the sequence in which they are transmitted over some medium, is subject to convention. This is similar to the situation in written languages, where some are written left-to-right, while others are written right-to-left.

Using a 4-byte integer, written as "ABCD", where A is the most significant byte and D is least significant byte, big-endian convention would store the number in successive memory locations as A (lowest address), then B, then C, finally D, while little-endian convention would store the bytes in D-C-B-A order.
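The two byte orders can be seen directly with Python's built-in `int.to_bytes`. The value 0x0A0B0C0D is chosen only for illustration, so that bytes A, B, C and D are 0x0A, 0x0B, 0x0C and 0x0D.

```python
value = 0x0A0B0C0D  # A = 0x0A (most significant) ... D = 0x0D (least significant)

# Index 0 of each bytes object corresponds to the lowest memory address.
big = value.to_bytes(4, byteorder="big")        # A, B, C, D
little = value.to_bytes(4, byteorder="little")  # D, C, B, A

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a

# Reading the bytes back with the matching order recovers the value.
restored = int.from_bytes(little, byteorder="little")
```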

Network byte order is, by convention, big-endian: the bytes are sent in the order A, then B, and so on, onto the medium. It is the responsibility of the transmitting and receiving systems to convert, if necessary, to and from their internal endian format.
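As one possible way to perform this conversion, Python's standard `struct` module accepts a `!` prefix meaning network byte order. The wrapper names `to_network` and `from_network` are hypothetical, and the sketch covers only unsigned 32-bit values.

```python
import struct


def to_network(value: int) -> bytes:
    """Serialize an unsigned 32-bit integer in network byte order."""
    return struct.pack("!I", value)


def from_network(data: bytes) -> int:
    """Parse 4 bytes received in network byte order into an integer."""
    return struct.unpack("!I", data)[0]


wire = to_network(0x0A0B0C0D)
print(wire.hex())  # 0a0b0c0d, regardless of the host's own endianness
```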

Big-endian numbers are easier to read when debugging a program but can seem less intuitive, because the high byte is at the smaller address; conversely, little-endian numbers are more intuitive in that respect but harder to read when debugging. The choice between big-endian and little-endian CPU designs has sparked many flame wars. Emphasizing the futility of this argument, the very term big-endian was taken from the Big-Endians of Jonathan Swift's Gulliver's Travels. See the Endian FAQ (http://rdrop.com/~cary/html/endian_faq), which includes the significant essay "On holy wars and a plea for peace" (Danny Cohen, 1980).

Processor families that use big-endian storage: SPARC, Motorola 68000, IBM 370
Processor families that use little-endian format: x86, VAX
Processor families that use either (determined by software): MIPS, DEC Alpha, PowerPC
The PDP-11 stored 32-bit values in the unusual order B-A-D-C (that is, the bytes are swapped within each 16-bit word, while the more significant word comes first).
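The B-A-D-C pattern mentioned above can be sketched by starting from the big-endian layout and swapping the two bytes inside each 16-bit half. The function name `to_badc` is hypothetical.

```python
def to_badc(value: int) -> bytes:
    """Lay out a 32-bit value in B-A-D-C order (bytes swapped within
    each 16-bit word, more significant word first)."""
    a, b, c, d = value.to_bytes(4, byteorder="big")
    return bytes([b, a, d, c])


print(to_badc(0x0A0B0C0D).hex())  # 0b0a0d0c
```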

All Wikipedia text is available under the terms of the GNU Free Documentation License
