{"title": "Bayesian Unsupervised Learning of Higher Order Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 535, "abstract": null, "full_text": "Bayesian Unsupervised Learning of \n\nHigher Order Structure \n\nMichael S. Lewicki Terrence J. Sejnowski \nlevicki~salk.edu \n\nterry~salk.edu \n\nThe Salk Institute \n\nHoward Hughes Medical Institute \nComputational Neurobiology Lab \n\n10010 N. Torrey Pines Rd. \n\nLa Jolla, CA 92037 \n\nAbstract \n\nMultilayer architectures such as those used in Bayesian belief net(cid:173)\nworks and Helmholtz machines provide a powerful framework for \nrepresenting and learning higher order statistical relations among \ninputs. Because exact probability calculations with these mod(cid:173)\nels are often intractable, there is much interest in finding approxi(cid:173)\nmate algorithms. We present an algorithm that efficiently discovers \nhigher order structure using EM and Gibbs sampling. The model \ncan be interpreted as a stochastic recurrent network in which ambi(cid:173)\nguity in lower-level states is resolved through feedback from higher \nlevels. We demonstrate the performance of the algorithm on bench(cid:173)\nmark problems. \n\n1 \n\nIntroduction \n\nDiscovering high order structure in patterns is one of the keys to performing complex \nrecognition and discrimination tasks. Many real world patterns have a hierarchical \nunderlying structure in which simple features have a higher order structure among \nthemselves. Because these relationships are often statistical in nature, it is natural \nto view the process of discovering such structures as a statistical inference problem \nin which a hierarchical model is fit to data. \n\nHierarchical statistical structure can be conveniently represented with Bayesian \nbelief networks (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Neal, 1992). These \n\n\f530 \n\nM. S. Lewicki and T. 1. 
Sejnowski \n\nmodels are powerful, because they can capture complex statistical relationships \namong the data variables, and also mathematically convenient, because they allow \nefficient computation of the joint probability for any given set of model parameters. \nThe joint probability density of a network of binary states is given by a product of \nconditional probabilities \n\n(1) \n\nwhere W is the weight matrix that parameterizes the model. Note that the prob(cid:173)\nability of an individual state Si depends only on its parents. This probability is \ngiven by \n\nP(Si = 1lpa[Si], W) = h(L SjWji) \n\n(2) \n\nj \n\nwhere Wji is the weight from Sj to Si (Wji = 0 for j < i). \nThe weights specify a hierarchical prior on the input states, which are the fixed \nsubset of states at the lowest layer of units. The active parents of state Si represent \nthe underlying causes of that state. The function h specifies how these causes are \ncombined to give the probability of Si. We assume h to be the \"noisy OR\" function, \nh(u) = 1 - exp( -u), u >= O. \n\n2 Learning Objective \n\nThe learning objective is to adapt W to find the most probable explanation of the \ninput patterns. The probability of the input data is \n\nP(D n IW) is computed by marginalizing over all states of the network \n\nn \n\nP(DnIW) = L P(DnISk, W)P(SkIW) \n\nk \n\n(3) \n\n(4) \n\nBecause the number of different states, Sk, is exponential in the number of units, \ncomputing the sum exactly is intractable and must be approximated. The nature \nof the learning tasks discussed here, however, allow us to make accurate approxi(cid:173)\nmations. A desirable property for representations is that most patterns have just \none or a few possible explanations. In this case, all but a few terms P(D n ISk, W) \nwill be zero, and, as described below, it becomes feasible to use sampling based \nmethods which select Sk according to P(SkIDn, W). 
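As a concrete illustration of equations (1)-(4), the joint probability and marginal likelihood of a small noisy-OR network can be computed by brute-force enumeration. This is a minimal sketch, not the paper's implementation; the clamped bias unit, the toy weight matrix, and all names are illustrative assumptions:

```python
import itertools
import math

def h(u):
    # noisy-OR combination function: h(u) = 1 - exp(-u), u >= 0
    return 1.0 - math.exp(-u)

def p_on(s, W, i):
    # equation (2): P(Si = 1 | pa[Si], W) = h(sum_j Sj * Wji)
    return h(sum(s[j] * W[j][i] for j in range(len(s))))

def joint_prob(s, W):
    # equation (1): product of conditionals over all units.
    # Unit 0 is a clamped bias unit (always 1), so it is skipped.
    p = 1.0
    for i in range(1, len(s)):
        p1 = p_on(s, W, i)
        p *= p1 if s[i] == 1 else 1.0 - p1
    return p

def marginal_likelihood(d, W, n_hidden):
    # equation (4): P(Dn | W) = sum_k P(Dn | Sk, W) P(Sk | W),
    # evaluated by enumerating every hidden state (exponential cost,
    # which is why the paper approximates this sum by sampling).
    total = 0.0
    for hidden in itertools.product((0, 1), repeat=n_hidden):
        s = (1,) + hidden + tuple(d)
        total += joint_prob(s, W)
    return total

# Toy chain: bias -> hidden cause -> visible unit, with h(ln 2) = 0.5.
ln2 = math.log(2.0)
W = [[0.0, ln2, 0.0],   # bias weight into the hidden cause
     [0.0, 0.0, ln2],   # hidden cause into the visible unit
     [0.0, 0.0, 0.0]]
print(marginal_likelihood((1,), W, n_hidden=1))  # ~0.25 = P(cause=1) * P(v=1|cause=1)
```

The explicit enumeration over the 2^n hidden states makes concrete why equation (4) is intractable for realistic networks and must be replaced by sampling.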
\n\n3 Inferring the Internal Representation \n\nGiven the input data, finding its most likely explanation is an inference process. Although it is simple to calculate the probability of any particular network state, there is no simple way to determine the most probable state given input D. A general approach to this problem is Gibbs sampling (Pearl, 1988; Neal, 1992). \n\nIn Gibbs sampling, each state Si of the network is updated iteratively according to the probability of Si given the remaining states in the network. This conditional probability can be computed using \n\nP(Si | Sj : j \u2260 i, W) \u221d P(Si | pa[Si], W) \u220f_{j \u2208 ch[Si]} P(Sj | pa[Sj], Si, W) \n\n(5) \n\nwhere ch[Si] indicates the children of state Si. In the limit, the ensemble of states obtained by this procedure will be typical samples from P(S | D, W). More generally, any subset of states can be fixed and the rest sampled. \n\nThe Gibbs equations have an interpretation in terms of a stochastic recurrent neural network in which feedback from higher levels influences the states at lower levels. For the models defined here, the probability of Si changing state given the remaining states is \n\nP(Si \u2192 1 - Si | Sj : j \u2260 i, W) = 1 / (1 + exp(-\u0394xi)) \n\n(6) \n\nThe variable \u0394xi indicates how much changing the state Si changes the log probability of the network state \n\n\u0394xi = log h(ui; 1 - Si) - log h(ui; Si) + \u2211_{j \u2208 ch[Si]} [log h(uj + bij; Sj) - log h(uj; Sj)] \n\n(7) \n\nwhere h(u; a) = h(u) if a = 1 and 1 - h(u) if a = 0. The variable ui is the causal input to Si, given by \u2211_k Sk Wki. The variable bij specifies the change in uj for a change in Si: bij = +Wij if Si = 0 and -Wij if Si = 1. \n\nThe first two terms in (7) can be interpreted as the feedback from higher levels. The sum can be interpreted as the feedforward input from the children of Si. Feedback allows the lower level units to use information only computable at higher levels.
 The feedforward terms typically dominate the expression, but the feedback becomes the determining factor when the feedforward input is ambiguous. \n\nFor general distributions, Gibbs sampling can require many samples to achieve a representative sample. But if there is little ambiguity in the internal representation, as is the goal, Gibbs sampling can be as efficient as a single feedforward pass. One potential problem is that Gibbs sampling will not work before the weights have adapted, when the representations are highly ambiguous. We show below, however, that it is not necessary to sample for long periods in order for good representations to be learned. As learning proceeds, the internal representations obtained with limited Gibbs sampling become increasingly accurate. \n\n4 Adapting the Weights \n\nThe complexity of the model is controlled by placing a prior on the weights. For the form of the noisy OR function in which all weights are constrained to be positive, we assume the prior to be the product of independent gamma distributions parameterized by \u03b1 and \u03b2. The objective function becomes \n\nC = P(D_1:N | W) P(W | \u03b1, \u03b2) \n\n(8) \n\nA simple and efficient EM-type formula for adapting the weights can be derived by setting \u2202C/\u2202wij to zero and solving for wij. Using the transformations fij = 1 - exp(-wij) and gi = 1 - exp(-ui) we obtain \n\n(9) \n\nwhere s(n) is the state obtained with Gibbs sampling for the nth input pattern. The variable fij can be interpreted as the frequency of state Sj given cause Si. The sum in the above expression is a weighted average of the number of times Sj was active when Si was active. The ratio fij/gj weights each term in the sum inversely according to the number of different causes for Sj. If Si is the unique cause of Sj then fij = gj and the term would have full weight.
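The sampled internal state s(n) used in the reestimation step is obtained by Gibbs sampling as in equation (5). A minimal sketch of one Gibbs sweep follows, computing the conditional by brute force on the joint probability rather than via the incremental \u0394xi update of equations (6)-(7); the clamped bias unit and the toy network are illustrative assumptions:

```python
import math
import random

def h(u):
    # noisy-OR: h(u) = 1 - exp(-u)
    return 1.0 - math.exp(-u)

def joint_prob(s, W):
    # equations (1)-(2); unit 0 is a clamped bias unit (always 1)
    p = 1.0
    for i in range(1, len(s)):
        p1 = h(sum(s[j] * W[j][i] for j in range(len(s))))
        p *= p1 if s[i] == 1 else 1.0 - p1
    return p

def gibbs_sweep(s, W, free, rng):
    # Resample each unclamped unit from P(Si | remaining states), equation (5).
    # Evaluating the full joint for both settings of Si and normalizing is
    # equivalent to using only Si's own conditional and those of its children,
    # since every other factor cancels in the normalization.
    for i in free:
        s_on, s_off = list(s), list(s)
        s_on[i], s_off[i] = 1, 0
        p1, p0 = joint_prob(s_on, W), joint_prob(s_off, W)
        s[i] = 1 if rng.random() < p1 / (p1 + p0) else 0
    return s

# Toy chain: bias -> hidden cause -> visible unit. With the visible unit
# clamped on, the hidden cause is its only possible explanation, so
# inference turns it on with probability 1.
ln2 = math.log(2.0)
W = [[0.0, ln2, 0.0],
     [0.0, 0.0, ln2],
     [0.0, 0.0, 0.0]]
state = gibbs_sweep([1, 0, 1], W, free=[1], rng=random.Random(0))
print(state)  # [1, 1, 1]
```

Clamping the visible units and resampling only the hidden ones corresponds to the observation above that any subset of states can be fixed and the rest sampled.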
\n\nA straightforward application of the learning algorithm would adapt all the weights at the same time. This does not produce good results, however, because there is nothing to prevent the model from learning overly strong priors. This can be prevented by adapting the weights in the upper levels only after the weights in the lower levels have stabilized, which allows the higher levels to adapt to structure that is actually present in the data. We have obtained good results both from the naive method of adapting the lowest layers first and from more sophisticated methods where stability was based on how often a unit changed state during Gibbs sampling. \n\n5 Examples \n\nIn the following examples, the weight prior was specified with \u03b1 = 1.0 and \u03b2 = 1.5. Weights were initialized to random values between 0.05 and 0.15. Gibbs sampling was stopped if the maximum state change probability was less than 0.05 or after 15 sweeps through the units. Weights were reestimated after blocks of 200 patterns. Each layer of weights was adapted for 10 epochs before adapting the next layer. \n\nA High Order Lines Problem. The first example illustrates that the algorithm can discover the underlying features in complicated patterns and that the higher layers can capture interesting higher order structure. The first dataset is a variant of the lines problem proposed by F\u00f6ldi\u00e1k (1989). The patterns in the dataset are composed of horizontal and vertical lines as illustrated in figure 1. \n\nFigure 1: Dataset for the high order lines problem. (A) Patterns are generated by selecting one of the pattern types according to the probabilities next to the arrows. Top patterns are copied to the input. The horizontal and vertical lines on the left are selected with probability 0.3. (B) Typical input patterns. \n\nNote that, although
 the datasets are displayed on a 2-D grid, the network makes no assumptions about topography. Because the network is fully connected, all spatial arrangements of the inputs are identical. The weights learned by the network are shown in figure 2. \n\nFigure 2: The weights in a 25-10-5 network after training. Blocks indicate the weights to each unit. Square size is proportional to the weight values. Second layer units capture the structure of the horizontal and vertical lines. Third layer units capture the correlations among the lines. The first unit in the third layer is active when the 'II' is present. The second, fourth, and fifth units have learned to represent the '+', '=', and '0' respectively, with the remaining unit acting as a bias. \n\nThe Shifter Problem. The shifter problem (Hinton and Sejnowski, 1986), explained in figure 3, is important because the structure that must be discovered lies in the higher order input correlations. This example also illustrates the importance of allowing high level states to influence low level states to determine the most probable internal representation. The units in the second layer can only capture second order statistics and cannot determine the direction of the shift. The only way these units can be disambiguated is to use the feedback from the units in the third layer, which detect the direction of the shift by integrating the output of the units in the second layer. This allows the representation in the second layer to be \"cleaned up\" and makes it easier to discover the higher order structure of the global shift. The speed and reliability of the learning were tested by starting from random initial conditions. The results are shown in figure 4.
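The shifter patterns described in figure 3 (a random bottom row, a wrapped shifted copy in the top row, both rows duplicated) can be generated as in the following sketch; the row ordering, the width of 8 (giving the 32 inputs of the 32-20-2 network), and the sign convention for the shift direction are assumptions for illustration:

```python
import random

def shifter_pattern(width=8, rng=None):
    # Figure 3A: draw a random binary bottom row, shift it left or right
    # (with wraparound) with equal probability, and copy it to the top row.
    # Both rows are duplicated to add redundancy (as in Dayan et al., 1995),
    # giving 4 * width input units.
    rng = rng or random.Random()
    bottom = [rng.randint(0, 1) for _ in range(width)]
    shift = rng.choice([-1, 1])          # which sign means "left" is an assumption
    top = [bottom[(i + shift) % width] for i in range(width)]
    return top + top + bottom + bottom, shift

pattern, shift = shifter_pattern(width=8, rng=random.Random(0))
print(len(pattern))  # 32 inputs, matching the 32-20-2 network
```

Because the shift relates each top unit to two possible bottom units, no single second-layer unit seeing local pairs can recover the shift direction, which is what makes this a higher order problem.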
 Note that the best solutions have a cost about one bit higher than the optimal cost of less than 9 bits, because the top units cannot capture the fact that they are mutually exclusive. \n\nFigure 3: The shifter problem. (A) Input patterns are generated by drawing a random binary vector for the bottom row. This pattern is shifted either left or right (with wraparound) with equal probability and copied to the top row. The input rows are duplicated to add redundancy as in Dayan et al. (1995). (B) The weights of a 32-20-2 network after learning. The second layer of units learns to detect either local left shifts or right shifts in the data. These units cannot determine the shift direction alone, however, and require feedback from third layer units which integrate the outputs of all the units that represent a common shift (note that there is no overlap in the weights for the two third-layer units). This feedback turns off units that are inconsistent with the direction of shift. The weights that are close to zero for both third layer units effectively remove redundant second layer units that are not required to represent the input patterns. \n\nFigure 4: The graph shows 10 runs on the shifter problem from random initial conditions, plotting average bits per pattern against the number of epochs. The average bits per pattern is computed as -log(C)/(N log 2). Each epoch used 200 randomly generated input patterns. Two additional epochs were performed with 1000 random patterns to obtain accurate estimates of the average bits per pattern. The network converges rapidly and reliably. The best solutions, like the one shown in figure 3b, were found in 4/10 runs and had costs of approximately 10 bits at epoch 30. In this example, the network can get caught in local minima if too many units learn to represent the same local shifts. \n\n6 Discussion \n\nThe methods we have described work well on these simple benchmark problems and scale well to larger problems such as the handwritten digits example used in Hinton et al. (1995). We believe there are two main reasons why the algorithm described here runs considerably faster than other Gibbs sampling based methods. The first is that there is no need to collect state statistics for each pattern: the weight values are reestimated using just one sampled internal state per pattern. The second is that weights that are not connected to informative units are not updated. This prevents the model from learning what are effectively overly strong priors and allows the weights in upper layers to adapt to structure actually in the data. \n\nGibbs sampling allows internal representations to be selected according to their true posterior probability. This was shown to be effective in cases where the resulting representation has little ambiguity, i.e. each pattern has only a small number of probable explanations. If the causal structure to be learned is inherently ambiguous, e.g. in modeling the causal structure of medical symptoms, Gibbs sampling will be slow and better performance can be obtained with wake-sleep learning (Hinton et al., 1995; Frey et al., 1995) or mean field approximations (Saul et al., 1996). \n\nThere are many natural situations in which there is ambiguity in low level features. This ambiguity can only be resolved by integrating contextual information which is itself derived from the ambiguous simple features.
 This problem is common in the case of noisy input patterns and in feature grouping problems such as figure-ground separation. Feedback is crucial for ensuring that low-level representations are consistent within the larger context. \n\nSome systems, such as the Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995), arrive at the internal state through a feedforward process. It is possible that this ambiguity in lower-level representations could be resolved by circuitry in the higher-level representations, but if multiple higher-level modules make use of the same low-level representations, the additional circuitry would have to be duplicated in each module. It seems more parsimonious to use feedback to influence the formation of the lower-level representations. \n\nReferences \n\nDayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7:889-904. \n\nF\u00f6ldi\u00e1k, P. (1989). Adaptive network for optimal linear feature extraction. In Proceedings of the International Joint Conference on Neural Networks, volume I, pages 401-405, Washington, D.C. \n\nFrey, B. J., Hinton, G. E., and Dayan, P. (1995). Does the wake-sleep algorithm produce good density estimators? In Touretzky, D. S., Mozer, M., and Hasselmo, M., editors, Advances in Neural Information Processing Systems, volume 8, pages 661-667, San Mateo. Morgan Kaufmann. \n\nHinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158-1161. \n\nHinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing, volume 1, chapter 7, pages 282-317. MIT Press, Cambridge. \n\nLauritzen, S. L. and Spiegelhalter, D. J. (1988).
 Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Soc. Series B (Methodological), 50(2):157-224. \n\nNeal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1):71-113. \n\nPearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo. \n\nSaul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. J. Artificial Intelligence Research, 4:61-76. \n"}, "award": [], "sourceid": 1293, "authors": [{"given_name": "Michael", "family_name": "Lewicki", "institution": null}, {"given_name": "Terrence", "family_name": "Sejnowski", "institution": null}]}