Discover millions of ebooks, audiobooks, and so much more with a free trial

From $11.99/month after trial. Cancel anytime.

An Introduction to Information Theory
An Introduction to Information Theory
An Introduction to Information Theory
Ebook1,096 pages8 hours

An Introduction to Information Theory

Rating: 0 out of 5 stars

()

Read preview

About this ebook

Written for an engineering audience, this book has a threefold purpose: (1) to present elements of modern probability theory — discrete, continuous, and stochastic; (2) to present elements of information theory with emphasis on its basic roots in probability theory; and (3) to present elements of coding theory.
The emphasis throughout the book is on such basic concepts as sets, the probability measure associated with sets, sample space, random variables, information measure, and capacity. These concepts proceed from set theory to probability theory and then to information and coding theories. No formal prerequisites are required other than the usual undergraduate mathematics included in an engineering or science program. However, since these programs may not include a course in probability, the author presents an introductory treatment of probability for those who wish to pursue the general study of statistical theory of communications.
The book is divided into four parts: memoryless discrete themes, memoryless continuum, schemes with memory, and an outline of some recent developments. An appendix contains notes to help familiarize the reader with the literature in the field, while the inclusion of many reference tables and an extensive bibliography with some 200 entries makes this an excellent resource for any student in the field.

LanguageEnglish
Release dateJul 13, 2012
ISBN9780486158440
An Introduction to Information Theory

Related to An Introduction to Information Theory

Related ebooks

Mathematics For You

View More

Related articles

Reviews for An Introduction to Information Theory

Rating: 0 out of 5 stars
0 ratings

0 ratings1 review

What did you think?

Tap to rate

Review must be at least 10 words

  • Rating: 5 out of 5 stars
    5/5
    This is a good book, but it was published in 1961, not 2012.

Book preview

An Introduction to Information Theory - Fazlollah M. Reza

MATHEMATICS

CHAPTER 1

INTRODUCTION

Information theory is a new branch of probability theory with extensive potential applications to communication systems. Like several other branches of mathematics, information theory has a physical origin. It was initiated by communication scientists who were studying the statistical structure of electrical communication equipment.

Our subject is about a decade old. It was principally originated by Claude Shannon through two outstanding contributions to the mathematical theory of communications in 1948 and 1949. These were followed by a flood of research papers speculating upon the possible applications of the newly born theory to a broad spectrum of research areas, such as pure mathematics, radio, television, radar, psychology, semantics, economics, and biology. The immediate application of this new discipline to the fringe areas was rather premature. In fact, research in the past 5 or 6 years has indicated the necessity for deeper investigations into the foundations of the discipline itself.

Despite this hasty generalization which produced several hundred research papers (with frequently unwarranted conclusions), one thing became evident. The new scientific discovery has stimulated the interest of thousands of scientists and engineers around the world.

Our first task is to present a bird’s–eye view of the subject and to specify its place in the engineering curriculum. In this chapter a heuristic exposition of the topic is given. No effort is made to define the technical vocabulary. Such an undertaking requires a detailed logical presentation and is out of place in this informal introduction. However, the reader will find such material presented in a pedagogically prepared sequence beginning with Chap. 2. This introductory chapter discusses generalities, leaving a more detailed and precise treatment to subsequent chapters.¹ The specialist interested in more concrete statements may wish to forgo this introduction and begin with the body of the book.¹

1-1. Communication Processes.

Communication processes are concerned with the flow of some sort of information–carrying commodity in some network. The commodity need not be tangible; for example, the process by which one mind affects another mind is a communication procedure. This may be the sending of a message by telegraph, visual communication from artist to viewer, or any other means by which information is conveyed from a transmitter to a receiver. The subject matter deals with the gross aspects of communication models rather than with their minute structure. That is, we concentrate on the over–all performance of such systems without being restrained to any particular equipment or organ. Common to all communication processes is the flow of some commodity in some network. While the nature of the commodity can be as varied as electricity, words, pictures, music, and art, one could suggest at least three essential parts of a communication system (Fig. 1-1):

1. Transmitter or source

2. Receiver or sink

3. Channel or transmission network which conveys the communiqué from the transmitter to the receiver

FIG. 1-1. The model of a communication system.

This is the simplest communication system that one can visualize. Practical cases generally consist of a number of sources and receivers and a complex network. A familiar analogous example is an electric power system using several interconnected power plants to supply several towns.

In such problems one is concerned with a study of the distribution of the commodity in the network, defining some sort of efficiency of transmission and hence devising schemes leading to the most efficient transmission.

When the communiqué is tangible or readily measurable, the problems encountered in the study of the communication system are of the types somewhat familiar to engineers and operational analysts (for instance, the study of an electric circuit or the production schedule of a manufacturing plant). When the communiqué is intelligence or information, this general familiarity cannot be assumed. How does one define a measure for the amount of information? And having defined a suitable measure, how does one apply it to the betterment of the communication of information?

To mention an analog, consider the case of an electric power network transmitting electric energy from a source to a receiver (Fig. 1-2). At the source the electric energy is produced with voltage Vs. The receiver requires the electric energy at some prescribed voltage Vr. One of the problems involved is the heat loss in the channel (transmission line). In other words, the impedance of the wires acts as a parasitic receiver. One of the many tasks of the designer is to minimize the loss in the transmission lines. This can be accomplished partly by improving the quality of the transmission lines. A parallel method of transmission improvement is to increase the voltage at the input terminals of the line. As is well known, this improves the efficiency of transmission by reducing energy losses in the line. A step–up voltage transformer installed at the input terminals of the line is appropriate. At the output terminals another transformer (step–down) can provide the specified voltage to the receiver.

Without being concerned about mathematical discipline in this introductory chapter, let us ask if similar procedures could be applied to the transmission of information. If the channel of transmission of information is a lossy one, can one still improve the efficiency of the transmission by procedures similar to those in the above case? This of course depends, in the first place, on whether a measure for the efficiency of transmission of information can be defined.

FIG. 1-2. An example of a communication system.

1-2. A Model for a Communication System.

The communication systems considered here are of a statistical nature. That is, the performance of the system can never be described in a deterministic sense; rather, it is always given in statistical terms. A source is a device that selects and transmits sequences of symbols from a given alphabet. Each selection is made at random, although this selection may be based on some statistical rule. The channel transmits the incoming symbols to the receiver. The performance of the channel is also based on laws of chance. If the source transmits a symbol, say A, with a probability of P{A} and the channel lets through the letter A with a probability denoted by P{A|A}, then the probability of transmitting A and receiving A is

P{A} · P{A|A}

The communication channel is generally lossy; i.e., a part of the transmitted commodity does not reach its destination or it reaches the destination in a distorted form. There are often unwanted sources in a communication channel, such as noise in radio and television or passage of a vehicle in the opposite direction in a one–way street. These sources of disturbance are generally referred to as noise sources or simply noise. An important task of the designer is the minimization of the loss and the optimum recovery of the original commodity when it is corrupted by the effect of noise.

In the deterministic electrical model of Fig. 1-2, it was pointed out that one device which may be used to improve the efficiency of the system is called a transformer. In the vocabulary of information theory a device that is used to improve the efficiency of the channel may be called an encoder. An encoded message is less susceptible to channel noise. At the receiver’s terminal a decoder is employed to transform the encoded messages into the original form which is acceptable to the receiver. It could be said that, in a certain sense, for more efficient communication, the encoder performs a one-to-one mathematical mapping or an operation F on the input commodity I, F(I), while the decoder performs the inverse of that operation, F−1.

(1–1)

FIG. 1-3. General structure of a communication system used in information theory.

This perfect procedure is, of course, hypothetical; one has to face the ultimate effect of noise which in physical systems will prevent perfect communication. This is clearly seen in the case of the transmission of electrical energy where the transformer decreases the heat loss but an efficiency of 100 per cent cannot be expected. The step-up transformer acts as a sort of encoder and the step-down transformer as a decoding apparatus.

Thus, in any practical situation, we have to add at least three more basic parts to our mathematical model: source of noise, encoder, and decoder (Fig. 1-3). The model of Fig. 1-3 is of a general nature; it may be applied to a variety of circumstances.

A novel application of such a model was made by Wiener and Shannon in their discussions of the statistical nature of the communication of messages. It was pointed out that a radio, television, teletype, or speech transmitter selects sequences of messages from a known transmitter vocabulary at random but with specified probabilities. Therefore, in such communication models, the source, channel, encoder, decoder, noise source, and receiver must be statistically defined. This point of view in itself constitutes a significant contribution to the communication sciences. In light of this view, one comes to realize that a basic study of communication systems requires some knowledge of probability theory. Communication theories cannot be adequately studied without having a good background of probability. Conversely, readers acquainted with the fundamentals of probability theory can proceed most efficiently with research in the field of communication.

In the macroscopic study of communication systems, some of the basic questions facing us are these:

1. How does one measure information and define a suitable unit for such measurements?

2. Having defined such a unit, how does one define an information source, or how does one measure the rate at which an information source supplies information?

3. What is the concept of channel? How does one define the rate at which a channel transmits information?

4. Given a source and a channel, how does one study the joint rate of transmission of information and how does one go about improving that rate? How far can the rate be improved?

5. To what extent does the presence of noise limit the rate of transmission of information without limiting the communication reliability?

To present systematic answers to these questions is our principal task. This is undertaken in the following chapters. However, for the benefit of those who wish to acquire a heuristic introduction to the subject, we include a brief discussion of it here.

1-3. A Quantitative Measure of Information.

In our study we deal with ideal mathematical models of communication. We confine ourselves to models that are statistically defined. That is, the most significant feature of our model is its unpredictability. The source, for instance, transmits at random any one of a set of prespecified messages. We have no specific knowledge as to which message will be transmitted next. But we know the probability of transmitting each message directly, or something to that effect. If the behavior of the model were predictable (deterministic), then recourse to measuring an amount of information would hardly be necessary.

When the model is statistically defined, while we have no concrete assurance of its detailed performance, we are able to describe, in a sense, its over-all or average performance in the light of its statistical description. In short, our search for an amount of information is virtually a search for a statistical parameter associated with a probability scheme. The parameter should indicate a relative measure of uncertainty relevant to the occurrence of each particular message in the message ensemble.

We shall illustrate how one goes about defining the amount of information by a well-known rudimentary example. Suppose that you are faced with the selection of equipment from a catalog which indicates n distinct models:

[x1,x2, . . .,xn]

The desired amount of information I(xk ) associated with the selection of a particular model xk must be a function of the probability of choosing xk:

(1-2)

If, for simplicity, we assume that each one of these models is selected with an equal probability, then the desired amount of information is only a function of n.

(1-2a)

Next assume that each piece of equipment listed in the catalog can be ordered in one of m distinct colors. If for simplicity we assume that the selection of colors is also equiprobable, then the amount of information associated with the selection of a color cj among all equiprobable colors [c1,c2, . . . ,cm] is

(1-2b)

where the function f(x) must be the same unknown function used in Eq. (1-2a).

Finally, assume that the selection is done in two ways:

1. Select the equipment and then select the color, the two selections being independent of each other.

2. Select the equipment and its color at the same time as one selection from mn possible equiprobable choices.

The search for the function f(x) is based on the intuitive choice which requires the equality of the amount of information associated with the selection of the model xk with color cj in both schemes (1-2c) and (1-2d).

(1-2c)

(1-2d)

Thus

(1-3)

This functional equation has several solutions, the most important of which, for our purpose, is

(1-4)²

To give a numerical example, let n = 18 and m = 8.

Thus, when a statistical experiment has n equiprobable outcomes, the average amount of information associated with an outcome is log n. The logarithmic information measure has the desirable property of additivity for independent statistical experiments. These ideas will be elaborated upon in Chap. 3.

1-4. A Binary Unit of Information.

The simplest case to consider is a selection between two equiprobable events E1 and E2. E1 and E2 may be, say, head or tail in a throwing of an honest coin. Following Eq. (1-4), the amount of information associated with the selection of one out of two equiprobable events is

− log ½ = log 2

An arbitrary but convenient choice of the base of the logarithm is 2. In that case, − log2 ½ = 1 provides a unit of information. This unit is commonly known as a bit.³

FIG. 1-4. A probability space with two equiprobable events.

Next consider the selection of one out of 2², 2³, 2⁴, . . . , 2N equally likely choices. By successively partitioning a selection into two equally likely selections, we come to the conclusion that the amounts of information associated with the previous selection schemes are, respectively, 2, 3, 4, . . . , N bits.

FIG. 1-5. Successive partitioning of the probability space.

In a slightly more general case, consider a source with a finite number of messages and their corresponding transmission probabilities.

The source selects at random each one of these messages. Successive selections are assumed to be statistically independent. The probability associated with the selection of message xk is P{xk}. The amount of information associated with the transmission of message xk is defined as

Ik = − log P{xk}

Ik is also called the amount of self–information of the message xk. The average information per message for the source is

(1-5)

For instance, the amount of information associated with a source of the above type, transmitting two symbols 0 and 1 with equal probability, is

If the two symbols were transmitted with probabilities α and 1 − α, then the average amount of information per symbol becomes

(1-6)

The average information per message I is also referred to as the entropy (or the communication entropy) of the source and is usually denoted by the letter H. For instance, the entropy of a simple source of the above type is

H(p1,p2, . . . ,pn) = − (p1 log p1 + p2 log p+ pn log pn)

where (p1,p2, . . . ,pn) refers to a discrete complete probability scheme. Figure 1-6 shows the entropy of a simple binary source for different message probabilities.

Next, consider a second similar source having m symbols, and designate the amount of information per symbol of the two sources by H(n) and H(m), respectively. If the two sources transmit their symbols independently, their joint output might be considered as a source having mn distinct pairs of symbols. It can be shown that for two such independent sources the average information per joint symbol is

H(mn) = H(m) + H(n)

FIG.1-6. The entropy of an independent discrete memoryless binary source

The formal derivation of this relation is given in Chap. 3.

1-5. Sketch of the Plan.

From a mathematical point of view, the heuristic exposition of the previous two sections is somewhat incomplete.

We still need to formalize our understanding of the basic concepts involved and to develop techniques for studying more complex physical models. It was suggested that, given an independent source S which transmits messages xk from a finite set

there is an average amount of information I(x) associated with the independent source S.

I (x) = expected value or average of I(xk) for all messages

Our next step is to generalize this to the case of random variables with two or more not necessarily statistically independent dimensions, for instance, to define the amount of information per symbol of a scheme having pairs of statistically related symbols (xk,yk). This investigation in turn will lead to the study of a channel driven by the source supplying information to that channel. It will be shown that the average information for such a system is

(1-7)

From a physical point of view, the above model may be viewed in a simpler fashion. Consider a source transmitting any one of the two messages x1 and x2 with respective probabilities of α and 1 − α. The output of this source is communicated to a receiver via a noisy binary channel. The channel is described by a stochastic matrix:

When x1 is transmitted, the probability of a correct reception is a and otherwise 1 − a. Similarly, when x2 is transmitted, the probability of correct and incorrect receptions are b and 1 − b, respectively.

It will be shown (Chap. 3) that there is an average amount of information I(X ;Y) associated with this model which exhibits the rate of the information transmitted over the channel. This, in turn, raises a basic question. Given such a channel, what is the highest possible rate of transmission of information over this channel for a specified class of sources? In this manner, one arrives in a natural way at the concept of channel capacity and efficiency of a statistical communication model.

In the above example, the capacity of the channel may be computed by maximizing the information measure I(X;Y) over all permissible values of α.

In short, with each probability scheme we associate an entropy which represents, in a way, the average amount of information for the outcomes of the scheme. When a source and a receiver are connected via a channel, several probability schemes such as conditional and joint probabilities have special significance. An important task is to investigate the physical significance and the interrelationships between different entropies in a communication system. The formal treatment of these relations and the concept of channel capacity is presented in several chapters of the text.

The reader acquainted with probability theory may regard information theory as a new branch of that discipline. He can grasp it at a fair speed. The reader without such a background has to move much more slowly. However he will find the introductory material of Chap. 2 of substantial assistance in the study of Chaps. 3 and 4. An introductory treatment of a random variable assuming a continuum of values is given in Chap. 5. Chapter 6 presents a general study of averaging and moments. The reader with such a background will readily recognize the entropy functions that form the nucleus of information theory as moments of an associated logarithmic random variable: − log P {X}. Thus the entropy appears to be a new and useful form of moment associated with a probability scheme. This idea will serve as an important link in the integration of information and probability theories. Chapter 7 gives a concise introduction to multinormal distributions, laws of large numbers, and central-limit theorems. These are essential tools for the proof of the main theorems of information theory.

Chapters 8 and 9 extend the information-theory concept to random variables assuming a continuum of values (also continuous signals). The probability background of Chaps. 2, 5, 6, 7, and 10 is in most part indispensable for the study of information theory. However, a few additional topics are included for the sake of completeness, although they may not be directed toward an immediate applicaton.

Chapter 10 presents a bird’s-eye view of stochastic theory, followed by Chap. 11, which studies the information theory of stochastic models. A slightly more advanced consideration (but perhaps the heart of the subject) appears in Chap. 12.

A main application of the theory thus far seems to be in the devising of an efficient matching of the information source and the channel, the so-called coding theory. The elements of this theory appear in Chaps. 4 and 13. The Appendix is designed to introduce the reader to a few of the many topics available for further reading in this field.

1-6. Main Contributors to Information Theory.

The historical background of information theory cannot be covered in a few pages. Fortunately there are several sources where the reader can find a historical review of this subject, e.g., The Communication of Information, by E. C. Cherry (Am. Scientist, October, 1952), and On Human Communication, by the same author (John Wiley & Sons, Inc., 1957). (In Chap. 2 of the latter book, Cherry gives a very interesting historical account of developments leading to the discovery of information theory, particularly the impact of the invention of telecommunication.)

As far as the communication engineering profession is concerned, it seems that the first attempt to define a measure for the amount of information was made by R. V. L. Hartley⁵ in a paper called Transmission of Information (Bell System Tech. J., vol. 7, pp. 535–564, 1928).

Hartley suggested that information arises from the successive selection of symbols or words from a given vocabulary. From an alphabet of D distinct symbols we can select DN different words, each word containing N symbols. If these words were all equiprobable and we had to select one of them at random, there would be a quantity of information I associated with such a selection. Furthermore, Hartley suggested the logarithm to the base 10 of the number of possible different words DN as the quantity of information I = N log D.

The main contributions, which really gave birth to the so-called information theory, came shortly after the Second World War from the mathematicians C. E. Shannon and N. Wiener. Wiener’s mathematical contributions to the field of Fourier series and later to time series, plus his genuine interest in the field of communication, led to the foundation of communication theories in general. His two books, Cybernetics and Extrapolation, Interpolation, and Smoothing of Stationary Time Series (1948 and 1949), paved the way for the arrival of new statistical theories of communication. In a paper entitled The Mathematical Theory of Communication (Bell System Tech. J., vol. 27, 1948), Shannon made the first integrated mathematical attempt to deal with the new concept of the amount of information and its main consequences. Shannon’s first paper, along with a second paper, laid the foundation for the new science to be named information theory. Shannon’s earlier contribution may be summarized as follows:

1. Definition of the amount of information from a semiaxiomatic point of view.

2. Study of the flow of information for discrete messages in channels with and without noise (models of Figs. 1-1 and 1-3).

3. Defining the capacity of a channel, that is, the highest rate of transmission of information for a channel with or without noise.

4. In the light of 1, 2, and 3, Shannon gave some fundamental encoding theorems. These theorems state roughly that for a given source and a given channel one can always devise an encoding procedure leading to the highest possible rate of transmission of information.

5. Study of the flow of information for continuous signals in the presence of noise, as a logical extension of the discrete case.

Subsequent to his earlier work, Shannon has made several additional contributions. These have considerably strengthened the position of the original theory.

Following Wiener’s and Shannon’s works an unusually large number of scientific papers appeared in the literature in a relatively short time. A bibliography of information theory and allied topics might now, 13 years after the publication of Shannon’s and Wiener’s works, contain close to 1,000 papers. This indicates the great interest and enthusiasm (perhaps overenthusiasm) of scientists toward this fascinating new discipline. Here it would be impossible to give a detailed account of the contributions in this field. The reader may refer to A Bibliography of Information Theory, by F. L. Stumpers, and also to IRE Transactions on Information Theory (vol. IT-1, no. 3, pp. 31–47, September, 1955).

Even though a historical account has not been attempted here, the names of some of the contributors should be mentioned in passing. Bell Telephone Laboratories appears to be the birthplace of information and coding theory. Among the contributors from Bell Labs are E. N. Gilbert, R. W. Hamming, J. L. Kelley, Jr., B. McMillan, S. 0. Rice, C. E. Shannon, and D. Slepian. P. Elias, R. M. Fano, A. Feinstein, D. Huffman, C. E. Shannon, N. Wiener, and J. A. Wozencraft of the Massachusetts Institute of Technology have greatly contributed to the advancement of information and coding theory. Information theory has received significant stimuli from the works of several Russian mathematicians. A. 1. Khinchin, by employing the results of McMillan and Feinstein, produced one of the first mathematically exact presentations of the theory. Academician A. N. Kolmogorov, a leading man in the field of probability, and his colleagues have made several important contributions. A few of the other Russian contributors are R. L. Dobrushin, D. A. Fadiev, M. A. Gavrilov, I. M. Gel’fand, A. A. Kharkevich, V. A. Kotelnikov,⁶ M. Rozenblat-Rot, V. I. Siforov, and I. M. and A. M. Iaglom.

The afore-mentioned names are only a few of a long list of mathematicians and communication scientists who have contributed to information theory. Some other familiar names are D. A. Bell, A. Blanc-Lapierre, L. Brillouin, N. Abramson, D. Gabor, S. Goldman, I. J. Good, N. K. Ignatyev, J. Loeb, B. Mandelbrot, K. A. Meshkovski, W. Meyer-Eppler, F. L. Stumpers, M. P. Schutzenberger, A. Perez, W. Peterson, A. Thomasian, R. R. Varsamov, J. A. Ville, P. M. Woodward.

A list of those actively engaged in the field would be too long to be included here. Reference to some of the current work will be found in the text and in the bibliography at the end of the book.

For a comprehensive list, the reader is referred to existing bibliographies such as those by Stumpers, Green, and Cherry. Recent contributions to information theory have been aimed at providing more exact proofs for the basic theorems stated by earlier contributors. A state of steady improvement has been prevailing in the literature.

McMillan, Feinstein, and Khinchin have greatly enhanced the elegance of the theory by putting it on a more elaborate mathematical basis and providing proofs for the central theorems as earlier stated by C. E. Shannon. These contributors have confirmed that under very general circumstances, it is possible to transmit information with a high degree of reliability over a noisy channel at a rate as close to the channel capacity as desired.

J. Wolfowitz derived a strong converse of the fundamental theorem of information theory. Among other important theorems, he proved that reliable transmission at a rate higher than the channel capacity is not possible. In the past 2 or 3 years a large number of scientists have become interested in integrating some of the work on encoding theory within the framework of classical mathematics. Reference will be made to their work in Chaps. 13 and 14.

S. Kullback has described the growth of information theory from its statistical roots and emphasized the interrelation between information theory and statistics (Kullback).

The study of time-varying channels has also received considerable attention. Among those who have contributed are C. E. Shannon, R. A. Silverman and S. H. Chang, and V. I. Siforov and his colleagues.

To sum up, the present trend in information theory seems to be as follows: From an engineering point of view, a search for applications of the theory (radar detection, speech, telephone and radio communication, game and decision theory, and particularly implementation of codes) is evident, while the mathematician is still seeking for more rigor in the foundation of the theory and elegance par excellence.

1-7. An Outline of Information Theory.

If we were to make a two–page résumé of information theory for those scientists with a broad background of probability theory, the following could be suggested.

1. The average amount of information conveyed by a discrete random variable Y about another discrete random variable X is suggested by C. E. Shannon.

(1-8)

This definition can be generalized to cover not only the case of two or more random variables assuming a continuum of values but also the more general case of random vectors, generalized functions, and stochastic processes (Gel’fand and Iaglom⁷).

2. The channel is specified by P{Y = yk|X = x;} for all encountered integers i and k. The largest value of the transinformation I(X;Y) obtained over all possible source distribution P{X = xi} is called the capacity of the channel [Shannon (I)].

The definition of the channel capacity can be subjected to generalizations similar to those suggested in 1.

3. Let X and Y be two finite sets of alphabets with x X and y Y. The simplest channel is specified by P{ |x X}. Now consider words of n symbols selected from the X alphabet. These words will be denoted by u U and their corresponding received pairs by v V. This is an nth–order extension of the channel.

4. Given a source P{X = xi}, a channel P{ |X = xi}, and their respective nth–order extensions, then to a specified message ensemble U, we may associate a partitioning of the V space such that

This is a decision scheme which in turn specifies a code (N,n,λ) [A. Feinstein (I)].

5. The central theme of information theory is the following so-called fundamental coding theorem. Given a source, a channel with capacity C, and two constants

0≤ H C 0 < λ < 1

it can be shown that there are an integer n = g(λ,H) and a code (N, n, λ) with (N = function of λ and n) ≥ 2nH. This is the coding theorem stating the possibility of transmitting information at a rate H C over a noisy channel under specified circumstances.

6. Further elaborate mathematical treatment of the concepts of information theory was presented by B. McMillan, who extended the definition of source and channels from a Markov chain to stationary processes. A proof of the fundamental theorem of 5 as well as a clear understanding of the concepts involved in 5 is due to Feinstein. Khinchin considerably improved the status of the art in general and gave a proof of the fundamental theorem of 5 for the case of stationary processes. A converse of the fundamental theorem of 5 is due to J. Wolfowitz, who also gave sharper estimates than those given in 5. Remaining questions include the search for more general encoding theorems along the lines suggested in 1. A recent step in this direction was taken by C. E. Shannon.⁸ The search for engineering applications, particularly low-error probability codes, is ever increasing.

PROBLEMS

1-1. An alphabet consists of four letters A, B, C, D . Find the average amount of information associated with the transmission of a letter.

1-2. An independent, discrete source transmits letters selected from an alphabet consisting of three letters A, B, and C, with respective probabilities

pA pB pC = 0.1

(a) Find the entropy per letter.

(b) If consecutive letters are statistically independent and two–symbol words are transmitted, find all the pertinent probabilities for all two–letter words and the entropy of the system of such words.

1-3. Plot the curve y = −x log2 x for

0 ≤ x ≤ 1

1-4. A pair of dice are thrown. We are told that the sum of the faces is 7. What is the average amount of information contained in this message (that is, the entropy associated with the probability scheme of having the sum of the faces equal to 7)?

1-5. An alphabet consists of six symbols A, B, C, D, E, and F which are transmitted with the probabilities indicated below:

(a) Find the average information content per letter.

(b) If the letters are encoded in a binary system as shown above, find P{1} and P{0} and the entropy of the binary source.

1-6. A bag contains 100 white balls, 50 black balls, and 50 blue balls. Another bag contains 80 white balls, 80 black balls, and 40 blue balls. Determine the average amount of information associated with the experiment of drawing a ball from each bag and predicting its color. The result of which experiment is, on the average, harder to predict?

1-7. There are 12 coins, all of equal weight except one, which may be lighter or heavier. Using information-theory concepts, show that it is possible to determine which coin is the odd one and indicate whether it is lighter or heavier in not more than three weighings with an ordinary balance.

1-8. Solve Prob. 1-7 when the number of coins is N. What is the minimum number of weighings?

1-9. There are seven coins, five of equal weight and the remaining two also of equal weight but lighter than the first five coins. Find the minimum number of weighings necessary to locate these two coins.

PART 1

DISCRETE SCHEMES WITHOUT MEMORY

. . . choose a set of symbols, endow them with certain properties and postulate certain relationships between them. Next, . . . deduce further relationships between them. . . . We can apply this theory if we know the exact physical significance of the symbols. That is, if we can find objects in nature which possess exactly those properties and inter-relations with which we endowed the symbols. . . . The pure mathematician is interested only in the inter-relations between the symbols. . . . The applied mathematician always has the problem of deciding what is the exact physical significance of the symbols. If this is known, then at any stage in the theory we know the physical significance of our theorems. But the strength of the chain depends on the strength of the weakest link, and on occasion the link of physical significance is exceedingly fragile.

J. E. Kerrich, An Experimental Introduction to the Theory of Probability

Belgisk Import Co., Copenhagen

CHAPTER 2

BASIC CONCEPTS OF DISCRETE PROBABILITY

2-1. Intuitive Background.

Most of us have some elementary intuitive notions about the laws of probability, and we may set up a game or an experiment to test the validity of these notions. This procedure is much like the so-called classical approach to the theory of probability, which was commonly used by mathematicians up to the 1930s. However, this approach has been subjected to considerable criticism; indeed, the literature on the subject contains many contradictions and controversies in the writings of the major authors. These arise from the intuitive background used and the lack of well-defined formalism and rigor. Thus, the experiment or game is usually defined by assuming certain symmetries and by accepting certain results a priori, such as the idea that certain possible outcomes are equally likely to occur. For example, consider the following problem: Two persons, A and B, play a game of tossing a coin. The coin is thrown twice. If a head appears in at least one of the two throws, A wins. Otherwise, B wins. Intuitively, it seems that the four following possible outcomes are equally probable:

(HH), (HT), (TH), (TT)

where H denotes head and T denotes tail. A may assume that his chances of winning the game are ¾, since a head occurs in three out of four cases (to his advantage). On the other hand, the following reasoning may also seem logical. If the outcome of the first throw is H, A wins; there is no need to continue the game. Accordingly, only three possibilities need be considered, namely:

(H), (TH), and (TT)

where the first two cases are favorable to A and the last one to B. In other words, the probability that A instead of ¾. The intuitive approach in this problem thus seems to lead to two different estimates of probability.¹⁰

The twentieth century has witnessed enormous advances in the rigorous axiomatic treatment of many branches of mathematics. It is true that the axiomatic approach is essentially present in the familiar euclidean geometry and is, in a way, a very old principle. But it was not until the early twentieth century, when the formal and logical structure of mathematics was given serious, systematic study, that its fundamental and profound implications were recognized. Actually, however, the groundwork for the axiomatic treatment was laid by mathematicians such as Peano, Cantor, and Boole during the middle of the nineteenth century. The later efforts of Hilbert, Russell, Whitehead, and others led to a complete reorientation of the basic formulations, bringing mathematics to its present level.

Although consideration of the axiomatic treatment is not our subject here, it may be interesting to point out its general nature. First, a necessary set of symbols is introduced. Then certain inference or operation rules are given for the desired formal manipulation of the symbols, and a proper set of axioms is determined. The formal system thus created must be consistent; that is, the axioms must be independent and noncontradictory. Strictly speaking, the derivation of the theorems is manipulation of symbols without content, using axioms as a starting point and applying the rules of operation. The fundamental nature of a formal system is by no means obvious, and the limitations are even today under very careful study.

A rather new branch of mathematics exists which deals in an axiomatic manner with properties of various abstract spaces and functions defined over these spaces. This is the so-called measure theory. In the late 1930s and early 1940s attempts were made to put the probability calculus on an axiomatic basis. The work of Kolmogorov, Doob, and many others has contributed greatly toward this aim. Today formal probability theory is an important branch of measure theory (in a strictly formal sense), although the epistemological meaning of probability itself is subject to philosophical discussion. This latter aspect has been studied by several profound thinkers (von Neumann, Carnap, Russell, Fisher, Neyman, and many others).

Today engineers and research scientists recognize that they must have a working knowledge of the powerful tools of twentieth-century mathematics. Although completely axiomatic and rigorous treatment of this subject is far beyond the scope of this discussion, a classical presentation would be out of date, as it would completely forgo the important modern contributions to the theory. Under these circumstances, it seems that a survey of the modern theory of probability at a nonprofessional level will be a reasonable compromise. Most engineering students are not very familiar with concepts of probability, and it is important that they gain some appreciation of them.

In what follows, some elementary concepts of the theory of sets or so–called set algebra must first be introduced. Then these concepts are used to introduce the fundamental definitions of the theory of probability. Such a presentation allows a much wider application of the probability theory than does the older approach, which is inadequate for attacking a large class of modern problems.

2-2. Sets.

The word set, in mathematics, is used to denote any collection of objects specified according to a well-defined rule. Each object in a set is called an element, a member, or a point of the set. If x is an element of the set X, this relationship is expressed by

(2-1)

When x is not a member of the set X, this fact is shown by

(2-2)

For example, if X is the set of all positive integers, then

A set can be specified by either giving all its elements or stating the requirements for the elements belonging to the set. If a, b, c, and d are the only

Enjoying the preview?
Page 1 of 1