Subject: review massaro 1998

massaro , d . w . ( 1998 ) . perceiving talking faces : from speech perception to a behavioral principle ( mit press , cambridge , mass . ) . 552 pp . , 212 illus . , 1 cd-rom . $ 55.00 .

reviewed by noel nguyen , laboratory for psycholinguistics , fpse , university of geneva , switzerland ( nnguyen@fapse.unige.ch ) .

1 synopsis

1.1 general outline

this book is concerned with how multiple sources of information are processed in speech perception and , more generally , in pattern recognition . it is based upon an important research programme conducted by massaro and his colleagues over the last two decades . the book focuses on the perception of so-called bimodal speech , addressing a wide range of issues about the way in which visual information ( as provided by the speaker 's face ) and auditory information are combined with each other by the perceptual system . the scope of the book is much larger , however , as massaro 's purpose here is to describe and defend a new psychological law relevant to a wide variety of domains . in contrast to already well-established laws of the same kind ( e . g . weber 's law of perception ) , which are all unidimensional , the new principle is multidimensional , in that it describes how several factors impact behaviour . this principle is embodied in a computational model of pattern recognition , the fuzzy logical model of perception ( flmp ) , whose latest version is presented and discussed in detail . the flmp is systematically contrasted with alternative computational models , using a broad perceptual database as benchmark throughout the book . in a separate part , the book also deals with methods for synthesizing talking faces in experiments on bimodal speech perception , and introduces baldi , the talking face developed by massaro and his coworkers . the book is accompanied by a cd-rom which contains a series of demonstrations relating to many of the topics dealt with . the book is divided into four main sections .
section 1 , " perceiving talking faces " , focuses on the perception of speech by ear and eye . massaro reviews the most significant empirical findings in that domain , discusses the main methodological issues , and presents a general classification of the existing computational models of bimodal speech perception . central to this section is the idea that speech perception obeys a general behavioural principle of integration between different sources of information . section 2 , " broadening the domain " , aims at assessing how well this principle holds up across broad individual and situational variability . the author demonstrates that inter-individual variations in how bimodal speech is perceived , depending on the listener 's age or native language for instance , can be accounted for within the flmp framework . using examples taken from different perceptual and cognitive situations , massaro also defends the idea that the flmp adequately describes information processing irrespective of these situational differences . section 3 , " broadening the framework " , opens with a presentation of an extended and more explicit version of the flmp , designed in particular to account for the dynamics of speech processing . the section also includes a detailed analysis of the methodological issues involved in assessing quantitative predictions in psychology , along with a discussion of the critiques expressed by other investigators about the flmp over the years . finally , section 4 , " creating talking faces " , is specifically dedicated to the synthesis of visual speech .

1.2 the new behavioural principle

although many readers may already be familiar with massaro 's fuzzy logical model of perception , i shall here assume the contrary , and proceed to present a brief outline of the model . a central assumption of the flmp is that pattern recognition involves a common set of processes regardless of the specific nature of the patterns .
speech is not seen as being associated with a dedicated processing module , as in the motor theory of speech perception ( liberman , 1996 ) for instance . on the contrary , the sensory information is assumed to be processed in the same way whether our brain is busy recognizing speech sounds , letters , or manual gestures , to take but a few examples . in any of these cases , the flmp postulates that mapping a stimulus into a unique perceptual category entails going through three main stages of processing , the feature evaluation stage , the feature integration stage , and the decision stage . the evaluation stage consists of converting the available sources of information into a set of properties referred to as features . each feature is given a continuous ( fuzzy truth ) value , and represents the degree to which the stimulus corresponds to each of a set of internal prototypical patterns , along a particular perceptual dimension . thus , one important visual feature in the perception of cv syllables is the degree of opening of the lips . the model therefore assumes that the internal prototypes available to the perceptual system will specify that the lips are open at the onset of the syllable for / da / , closed for / ba / , etc . in a second stage , the features are integrated with each other , so as to determine the overall degree of match of the sensory input with each of the prototypes ( e . g . each of the syllables known to the receiver ) . in the third and final stage , a decision is taken , on the basis of the relative goodness of match of the input with each prototype . the flmp makes a number of specific assumptions at each stage in this process . first , it hypothesizes that all of the available sources of information are simultaneously brought into play in pattern recognition . thus , visible speech and auditory speech are both assumed to have an influence on how bimodal speech is perceived . 
second , different sources of information are assumed to be evaluated independently of each other . this means for example that visible speech does not have any effect on how auditory speech is converted into a set of features , the two sources of information being combined at a later stage of processing only . the model also makes specific assumptions about how sources of information are integrated with each other ( multiplicative rule ) , and about how decisions are taken ( relative goodness rule ) . a major prediction of the model is that " the influence of one source of information is greatest when the other source is neutral or ambiguous " ( 19 ) . this prediction is best illustrated by an experiment whose results served as a database for testing models of pattern recognition on several occasions in the book ( chapters 2 and 11 ) . in this experiment , synthetic auditory stimuli ranging on a continuum between / ba / and / da / were crossed with visual stimuli also varying between / ba / and / da / . the bimodal stimuli were presented to subjects in a forced-choice identification task , along with each of the unimodal stimuli . ( this expanded factorial design is shown by massaro to be the most appropriate experimental design for determining how two sources of information are combined with each other in pattern recognition . ) for the bimodal stimuli , the main results are typically depicted as a two-factor plot , with the proportion of / da / responses on the ordinate , the levels of the auditory source of information on the abscissa , and a different curve for each of the levels of the visual source of information . when represented in that way , the results clearly show a statistical interaction between the two sources of information . specifically , the influence of one source of information proves to be larger in the middle , ambiguous range of the other source . 
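the multiplicative rule and the relative goodness rule can be illustrated with a minimal python sketch for the two-alternative / ba / - / da / case . this is not code from the book or from massaro 's laboratory : the function names and the feature values are hypothetical , and the sketch assumes that , for each source , the support for / ba / is one minus the support for / da / .

```python
def flmp_p_da(aud, vis):
    """Predicted probability of a /da/ response for one bimodal stimulus.

    aud, vis: fuzzy truth values in (0, 1) giving the support each source
    lends to /da/ (hypothetical values; the model estimates them from data).
    """
    support_da = aud * vis              # multiplicative integration rule
    support_ba = (1 - aud) * (1 - vis)  # assumption: /ba/ support is the complement
    return support_da / (support_da + support_ba)  # relative goodness rule


def visual_effect(aud, vis_da=0.9, vis_ba=0.1):
    """Change in predicted P(/da/) when the face switches from /ba/-like to /da/-like."""
    return flmp_p_da(aud, vis_da) - flmp_p_da(aud, vis_ba)


# Compare the size of the visual effect at three made-up auditory levels.
for aud in (0.1, 0.5, 0.9):
    print(aud, round(visual_effect(aud), 3))
```

with these made-up values , the visual source shifts the predicted proportion of / da / responses by about 0.80 when the auditory value is ambiguous ( 0.5 ) , against roughly 0.49 at either end of the auditory range , which is the pattern behind the prediction quoted above .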
this interaction between the two sources graphically takes the shape of an american football , which is for this reason presented throughout the book as the hallmark of the fuzzy logical model of perception . in summary , massaro proposes a universal principle of perceptual cognitive performance to explain pattern recognition . according to this principle , " people are influenced by multiple sources of information in a diverse set of situations . in many cases , these sources of information are ambiguous and any particular source alone does not usually specify the appropriate interpretation . the perceiver appears to evaluate the multiple sources of information in parallel for the degree to which each supports various interpretations , integrate them together to derive the overall support for each interpretation , assess the support of each alternative based on all of the alternatives , and select the most appropriate response . " ( p . 291 ) .

2 critical evaluation

2.1 general evaluation

this book is clearly a major contribution to the study of speech perception and , more generally , to cognitive psychology . it is admirably clear and is written in quite an elegant manner . i do not doubt that the book will be read with great interest by research scientists from many different fields . this work is the result of an ambitious intellectual endeavour aimed at introducing a new behavioural law , which is placed by massaro on an equal footing with weber 's law of perception , or the power law of learning . speech scientists are presented with an extensive series of experiments on the perception of bimodal speech . whatever stance they take in that domain , they should find quite challenging massaro 's view that speech perception constitutes but one aspect of a much more general form of cognitive processing , namely pattern recognition .
computer scientists working in the field of speech technology should be particularly interested in the book 's final section about the synthesis of visual speech . regardless of their background , readers should also find the book worth using as a tutorial on the experimental methods available for investigating speech perception . a great variety of experimental paradigms and tasks are discussed at length by massaro , who also extensively discusses the methods for assessing computational models of pattern recognition and , in particular , for fitting these models to observed results . in that respect , using the results of the experiment described above as a reference database was quite a good initiative in my view , as this allows the reader more easily to understand massaro 's point as new issues are raised , without having again to go through the details of the experimental design each time . the book should also prove an invaluable resource for teaching . care was taken to select prototypical results , as well as to set this work in its historical context . a number of rather fascinating anecdotes and historical references are given , going from mcgurk 's personal account of the discovery of the mcgurk effect , to an audio-visual rendition of the introduction to george miller 's seminal article on the ubiquitousness of the number 7 plus or minus 2 , with miller 's face texture-mapped onto baldi 's wire-frame head . the cd-rom that accompanies the book enables the reader directly to experience the psychological illusions associated with the perception of bimodal speech , and constitutes as such a most useful research and teaching tool . on the negative side , massaro 's use of the / ba / - / da / experiment as a leading strand throughout obviously results in the book being focused on the perception of non-sense syllables . although the interaction of visible speech and audible speech in word recognition is mentioned on a number of occasions ( e . g . 
pp . 21-23 and pp . 181-182 ) , the book contains few suggestions as to how we perceive isolated words , let alone connected speech . i was also surprised by the fact that little space was devoted to presenting other current theories and models of speech perception . although models such as trace are mentioned on several occasions in the book , i think it is fair to say that the flmp is still given the lion 's share . the book also has some minor defects such as the absence of a list of figures , and the fact that some of the cd-rom bands ( 1.4 , 1.5 and 1.6 ) are referred to incorrectly in the text . the list of the cd-rom selections should have pointed to the pages where each band is referred to . in another domain , it would have been quite interesting to have the perceptual database used in the book made available on the cd-rom . although this would have probably required a substantial amount of additional work , i should also have found it useful to be provided with an interactive version of the main computational models discussed in the book ( flmp , the race model , the single channel model , etc . ) . the flmp model can be downloaded from massaro 's laboratory web site at santa cruz ( http://mambo.ucsc.edu ) , but it is currently distributed in fortran code which has to be modified and recompiled for each new set of data , an operation which is probably out of reach of many students in psychology or linguistics .

2.2 specific comments

i am not familiar with all of the areas dealt with in this book , and will not hide the fact that this review is biased towards my own interests , namely the production and perception of auditory speech . the following comments more specifically concentrate on two issues relating to this area of research , the role of features in speech processing and the time course of speech processing .

2.2.1 features

most useful are the extensive comments made by massaro about the status of features in his model ( see in particular chapter 2 and chapter 10 ) . i have long found it difficult to determine how close these features were to classical phonetic features . the book makes it clear to me that there is no direct relation between the former and the latter . as indicated above , the flmp postulates that there are three main stages of processing in pattern recognition : the feature evaluation stage , the feature integration stage , and the decision stage . specific assumptions are made in the model about how features are integrated with each other , and how a decision is taken depending on the outcome of this integration . from a set of feature values , therefore , the model will predict the probability of occurrence of each possible response ( e . g . " ba " and " da " ) . however , attention should be paid to the fact that these feature values are in no way derived from the stimulus . they are actually determined in an a posteriori manner , from the subjects ' observed responses , using an algorithm ( stepit ) which minimizes the deviation between these responses and the predicted ones . features are seen in the model as * free parameters * , whose values are set on the basis of the actual performance of the subject in the pattern recognition task , so as to make the model perform at its best , i . e . to maximize its goodness of fit . according to massaro , " [ the model is ] * predicting * the exact * form * of the results , but * postdicting * the actual quantitative * values * that make up the overall predictions " ( p . 294 , his emphasis ) . in other words , the stimulus is on no occasion explicitly mapped onto the internal features of the flmp model . in that respect , features as defined in the flmp look markedly different from phonetic features .
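the a posteriori estimation of feature values can be sketched in python as follows . the code treats the feature values as free parameters and adjusts them so as to minimize the deviation between predicted and observed response proportions in an expanded factorial design ( unimodal plus bimodal conditions ) . it is only an illustration under stated assumptions : a crude coordinate search stands in for stepit and is not the stepit algorithm , the deviation measure is assumed to be the root mean squared deviation , the support for / ba / is assumed to be one minus the support for / da / , and the " observed " proportions are synthetic values generated from hypothetical parameters , not data from the book .

```python
import itertools
import math


def flmp(a, v):
    """Two-alternative flmp: multiplicative integration, relative goodness decision."""
    support_da = a * v
    support_ba = (1 - a) * (1 - v)  # assumption: /ba/ support is the complement
    return support_da / (support_da + support_ba)


def predict(params, n_aud):
    """Predicted P(/da/) for unimodal auditory, unimodal visual, then bimodal cells."""
    aud, vis = params[:n_aud], params[n_aud:]
    unimodal = list(aud) + list(vis)  # one source only: prediction is its feature value
    bimodal = [flmp(a, v) for a, v in itertools.product(aud, vis)]
    return unimodal + bimodal


def rmsd(params, observed, n_aud):
    """Root mean squared deviation between predicted and observed proportions."""
    predicted = predict(params, n_aud)
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def fit(observed, n_aud, n_vis, step=0.25, tol=1e-6):
    """Crude coordinate search (a stand-in for stepit, not stepit itself)."""
    params = [0.5] * (n_aud + n_vis)  # start every feature value at 'ambiguous'
    best = rmsd(params, observed, n_aud)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                # keep fuzzy truth values strictly inside (0, 1)
                trial[i] = min(0.999, max(0.001, trial[i] + delta))
                error = rmsd(trial, observed, n_aud)
                if error < best:
                    params, best, improved = trial, error, True
        if not improved:
            step /= 2  # no move helped at this step size: refine the search
    return params, best


# Synthetic 'observed' proportions from hypothetical parameters (not real data).
true_params = [0.1, 0.5, 0.9, 0.2, 0.8]  # three auditory levels, two visual levels
observed = predict(true_params, n_aud=3)
fitted, error = fit(observed, n_aud=3, n_vis=2)
print([round(p, 2) for p in fitted], round(error, 4))
```

note that nothing in this procedure looks at the acoustic or visual structure of the stimuli : the only inputs are the response proportions , which is the sense in which the feature values are postdicted rather than derived from the stimulus .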
let us take for example the opposition between / ba / and / da / , on which much emphasis is put in the book . acoustically , / b / and / d / are said to differ from each other according to the feature grave-acute , / b / being classified as grave and / d / as acute . as is the case with flmp features , grave and acute can be viewed as target values referring to prototypical stops . however , the grave-acute feature is explicitly defined in acoustical terms ( e . g . slope of the short-term spectrum at the release of the stop , see stevens & blumstein , 1978 ) . on the contrary , the exact nature of the flmp features remains undetermined , their values being subject to one main constraint which is to make the model account for the subjects ' responses as accurately as possible . thus , the acoustic structure of the stimulus is not directly taken into consideration in the estimation of the feature values . in the experiments using audible speech , flmp features do lend themselves to an acoustic interpretation . in the / ba / - / da / experiment for example , the prototypes for / ba / and / da / are assumed to include one auditory feature , namely the variations in frequency of the second ( f2 ) and third ( f3 ) formants at the onset of the vowel ( slightly falling f2-f3 for / da / , rising f2-f3 for / ba / ) . however , this interpretation stems from the fact that f2 and f3 onset frequencies were precisely the acoustic parameters manipulated by the experimenters to synthesize the auditory continuum between / ba / and / da / . in other words , the acoustic significance of the flmp features is derived from the way in which the experiment has been designed . the model does rely on a particular system of acoustic features ( see for example stevens & blumstein , 1978 , for an alternative system ) , but this system is embodied in the experimental design , and is as such external to the model itself . 
in practice , therefore , the issue of how speech sounds are mapped onto features is not addressed in the model . why this is so is not clear to me . on several occasions , massaro suggests that determining in advance how a given individual will convert a given stimulus into a set of feature values is simply out of our reach . this stimulus-to-feature mapping shows a variability which is said to be analogous to the variability of the weather : there are just too many previous contributions and influences to allow quantitative prediction ( 135 ) . a fundamental distinction is in fact established in the flmp between the intake of * information * , i . e . the stimulus-to-feature mapping , and * information processing * , i . e . how features are combined with each other and mapped into a response ( cf . p . 135 ) . while the flmp predicts that the information will be processed in the same way from one individual to the other , regardless of whether it relates to speech sounds , facial movements , manual gestures , etc . , it is assumed that the way in which this information is extracted from the stimulus is on the contrary subject to too many sources of variations to be accurately characterized ahead of time . in my understanding , this means that the so-called evaluation stage cannot be accounted for by the model , or at least not with much accuracy . however , at least on one occasion massaro does suggest that this limitation is not consubstantial with every model of perception and pattern recognition , and could be circumvented in some way . according to him , one could indeed " easily hypothesize functions relating the feature values to the stimulus levels , [ although ] that would represent a * model of information * in addition to one of information processing " ( 294 , my emphasis ) . this suggests that building such a model of information is feasible . whether there is a possibility of the flmp being completed with a model of this kind , i . e .
an explicit stimulus-to-feature mapping stage , is an issue which remains to be addressed .

2.2.2 the time course of speech processing

time plays quite a central role in different ways in the book . first , massaro shows how the flmp can be explicitly formalized to account for the dynamics of perceptual processing ( chap . 9 ) . this formalization is presented in reply to criticisms expressed by a number of investigators ( e . g . mcclelland , 1991 ) , who have pointed out that the flmp accurately characterizes the asymptotic outcome of the perceptual system ( e . g . the probability for a particular response to occur ) , but has little to say about the time course of processing . the dynamic version of the flmp is intended to address these criticisms . in this version , the stimulus-to-feature mapping is assumed to take a certain amount of time . during this interval , the information about the stimulus gradually accumulates , and becomes increasingly accurate . it is assumed that accuracy increases as a negatively accelerated function of processing time , so that more information is gleaned early than late in the processing of the stimulus . one further assumption is that " integration of the separate features [ is ] updated continuously as the featural information is being evaluated . similarly , decision [ can ] occur at any time after the stimulus presentation " ( 259 ) . thus , there is a partial temporal overlap between the different stages of processing , in the sense that one process can begin before a previous process is finished ( see also figure 2.1 , p . 41 ) . these assumptions about the time course of information processing are supported by a number of experiments concerned with the effect of backward masking in the recognition of pure tones , and in the recognition of letters . speech obviously raises a number of specific issues in that domain , however . unlike written words , speech is a temporal phenomenon , it is continuous ( i . e .
there are no systematic acoustic boundaries between phonemes , syllables , or words ) and , furthermore , time per se serves as a source of information in speech , as pointed out by massaro ( e . g . vowel duration is a major cue to the voicing of the following obstruent , to take but one example ) . somewhat regrettably , few indications are given about how the model could be assessed in the speech domain ( see remarks p . 194 and p . 263 ) . in addition to discussing the dynamics of processing , massaro examines how the temporal relations between sources of information are dealt with in pattern recognition . chapter 3 focuses on our sensitivity to temporal asynchronies between visible and audible speech . in the experiments reported in this chapter , bimodal cv syllables with various degrees of onset asynchrony between the auditory synthetic speech and the visible synthetic speech were presented to subjects in a forced-choice identification task . the results show that integration between the two sources of information still occurs when these sources of information are made asynchronous , provided that the time shift does not exceed a certain duration . one major challenge for phoneticians and psycholinguists alike is to characterize the relationship between what could be called the * external * dynamics of speech , i . e . the temporal organization of the speech signal , and the * internal * time course of speech processing . both play a role in the perception of speech , and it is most difficult to tell apart their respective influences on the listener 's behaviour ( samuel , 1996 ) . for example , in a gating study investigating the role of vowel duration as a cue to the voicing of the post-vocalic stop in cvc syllables , warren and marslen-wilson ( 1988 ) found that the proportion of voiced-coda responses increased as the listeners were presented with increasingly long portions of the initial cv sequence .
one obvious interpretation is that longer vowels were perceived as being associated with a voiced coda rather than a voiceless one . in keeping with massaro 's dynamical flmp , however , it may also be assumed that evaluating the information provided by the vowel takes time , and that the evidence pointing to a voiced coda gradually accumulates as more processing time is made available to the listener , all other things being equal . thus , the above finding raises the issue of how to differentiate the effect of vowel duration per se on the listener 's response , from that of the internal dynamics of processing . although this issue is not directly addressed in the book , there is no doubt that the flmp would constitute a most appropriate framework for further investigations in this domain .

2.3 general conclusion

this book provides us with quite an extensive review of the work carried out by the author and others on the use of multiple cues in speech perception and , more generally , pattern recognition . it is aimed at a very large audience , and constitutes a most useful tool both for teaching and research purposes . i do not doubt that it will soon become a major reference for researchers in phonetics , psycholinguistics , and cognitive psychology .

4 bibliography

liberman , a . m . ( 1996 ) . speech : a special code ( mit press , cambridge , mass . ) .
mcclelland , j . l . ( 1991 ) . " stochastic interactive processes and the effect of context on perception " , cognitive psychology 23 , 1-44 .
samuel , a . g . ( 1996 ) . " the role of time during lexical access " , journal of the acoustical society of america 100 , 4 / 2 , 2572 .
stevens , k . n . , and blumstein , s . e . ( 1978 ) . " invariant cues for place of articulation in stop consonants " , journal of the acoustical society of america 64 , 1358-1368 .
warren , p . , and marslen-wilson , w . ( 1988 ) . " cues to lexical choice - discriminating place and voice " , perception and psychophysics 43 , 21-30 .

5 biographical details

the reviewer is a lecturer in the laboratory for psycholinguistics , fpse , university of geneva , switzerland . his current research covers a variety of topics ranging from the dynamics of articulatory movements in speech production to the phonetic bases of word recognition . thanks are due to uli frauenfelder for helpful comments . a latex version of this document is available upon request ( nnguyen@fapse.unige.ch ) .
