Presentations from:

Jont B. Allen     Timothy R. Anderson    Albert S. Bregman     Guy J. Brown    
Douglas Brungart        Dan Ellis        Larry Feth        William M. Hartmann     
Mari R. Jones      Hideki Kawahara      Willard Larkin      Roy D. Patterson
Jose C. Principe      Shihab Shamma      Richard Stern      DeLiang Wang


From Lord Rayleigh to Shannon: How Do We Decode Speech? (zipped ps)

Jont B. Allen

AT&T Labs – Research
180 Park Ave.
Florham Park, NJ 07932-0971
jba@auditorymodels.org


    In 1908 Lord Rayleigh reported on his speech perception studies using the "acousticon" (a commercial sound system produced in 1905), demonstrating that he was well aware of the importance of bandwidth in speech perception. It was the development of the telephone that both allowed and pushed mathematicians and physicists to develop the science of speech perception.

    From 1910 to 1950 speech perception was extensively studied by telephone research departments throughout the world. However, it was the work of AT&T's Harvey Fletcher in 1921 that made the first major breakthroughs. During WWII the Harvard Acoustics Lab took on this problem, and further breakthroughs were provided by George Miller and his colleagues. Miller used concepts from information theory, developed at Bell Labs by Claude Shannon, to quantify speech entropy. I will attempt to pass along some of what I have learned over the years about human speech recognition (HSR).

    My talk will be in four parts. In part one I briefly summarize key results from the 30 years of work by Fletcher and his colleagues, which resulted in the "articulation index." In part two I summarize the work of George Miller. Miller studied the importance of varying the source entropy (randomness) in speech perception. In part three I describe some work in progress where I partially repeated Miller and Nicely's experiment. In part four I describe recent experimental work in building more robust automatic speech recognition (ASR). One goal is to make a system that works as well as human listeners in decoding degraded (filtered plus noise) nonsense speech sounds.



Computational Audition at AFRL/HE: Past, Present and Future (ppt)

Timothy R. Anderson

Air Force Research Laboratory
Human Effectiveness Directorate
Wright-Patterson Air Force Base, OH 45433
Tim.Anderson@wpafb.af.mil


    Research in the use of auditory models for speech recognition has been conducted at the Air Force Research Laboratory for over a decade. This talk will present a snapshot of that research. Results of monaural and binaural phoneme recognition experiments in noise will be examined. Continuous speech recognition experiments comparing traditional and auditory model features will be discussed. Work on the development of a hardware auditory model for real-time feature generation will also be presented. In conclusion, possible areas of future research and development will be discussed.



Auditory Scene Analysis in Humans:
Implications for Computational Implementations
 (pdf)

Albert S. Bregman

Department of Psychology
McGill University
1205 Docteur Penfield Avenue
Montreal, QC, Canada H3A 1B1
al.bregman@mcgill.ca


    The main phenomena of human auditory scene analysis (ASA) will be introduced with auditory illustrations. ASA is the human ability in multi-sound environments to parse the acoustic input and form separate auditory representations of the individual sound sources. Suggestions will be made concerning issues in the implementation of a system for computational auditory scene analysis (CASA). These will be based on a number of points:

1. The default status of an input auditory array is the integration of all of it into a single sound.

2. Models should be designed so that there is a parallel activity of the systems that deal with different cues. In natural environments, sometimes a particular cue is useful and sometimes not. The high-quality cues in any environment should be able to take over the burden of segregation and grouping in a seamless way.

3. In a computational model, each cue-using mechanism should be able to be shut off without affecting the activity of the remaining ones. The system should degrade gracefully.

4. Typically a CASA model attempts to test out the use of some particular cue for grouping. The modeler should be able to compare the success rates with and without the mechanism that uses this principle, so as to determine its “incremental” utility. This should be compared under different types of degradation of the signal.

5. Perhaps the most powerful cue to segregation is the “old-plus-new heuristic”. Models should make more use of it.

6. The phenomenon of duplex perception suggests that the bottom-up ASA principles should not partition the signal outright but should negotiate with top-down processes to converge on a description of multiple sources that maximally satisfies both.    


Linking Computational Auditory Scene Analysis
with "Missing Data" Recognition of Speech
 (ppt)

Guy J. Brown

Department of Computer Science
University of Sheffield
Sheffield, S1 4DP, U.K.
g.brown@dcs.shef.ac.uk


    We describe a binaural auditory model for speech recognition, which is robust in the presence of reverberation and spatially separated noise intrusions. The principle underlying the model is to identify time-frequency regions which constitute reliable evidence of the speech signal. This is achieved both by determining the spatial location of the speech source using interaural time difference (ITD) and interaural intensity difference (IID) cues, and by applying a model of reverberation masking. Reliable time-frequency regions are passed to a 'missing data' speech recogniser. We show, firstly, that the auditory model improves recognition performance in various reverberation conditions when no noise intrusion is present. Secondly, we demonstrate that the model improves performance when the speech signal is contaminated by noise, both for an anechoic environment and in the presence of room reverberation. Links between the 'missing data' approach to automatic speech recognition and neural oscillator models of auditory function are also discussed.    


Adaptation to Target Transitions in the Cocktail Party Problem (ppt)

Douglas Brungart

Air Force Research Laboratory
Human Effectiveness Directorate
2610 Seventh St. WPAFB, OH 45433-7901
Douglas.Brungart@wpafb.af.mil


    To this point, most research on the cocktail party problem has focused on one of two different multitalker listening configurations: the selective attention configuration, where the listener knows the location of the target talker in advance; and the divided attention configuration, where the listener has no information about the location of the target talker prior to listening to the stimulus. Most real-world multitalker listening tasks fall somewhere between these two extremes: listeners have to dynamically allocate their attention according to the likelihood that the target information will originate from each individual talker in the next stimulus interval. This experiment measured performance in a three-talker cocktail party task where the probability of a change in the location of the target talker across trials was varied from 0% to 100%. The listeners were given no information about the transition probability used in each 60-trial block. The results show that listeners adapt relatively slowly to unexpected changes in the location of the target talker. The results also suggest that listeners adopt different strategies for listening situations with different target transition probabilities. These results have important implications in the development of comprehensive models of auditory attention.     


Sound, Mixtures, and Learning
(pdf)

Dan Ellis

Laboratory for Recognition and Organization of Speech and Audio
Department of Electrical Engineering
Columbia University
New York, NY 10027
dpwe@ee.columbia.edu


     Human listeners, like other animals, gain a great deal of information from their acoustic environment. Sound provides a useful and complementary information channel for situations in which vision is inadequate, e.g., for detecting events regardless of their direction, and for operating in darkness. As with visual scenes, however, the presence of multiple, interfering sources makes the extraction of reliable, high-level information from real-world sounds extremely challenging.

     The most successful application of acoustic analysis is automatic speech recognition. However, well-performing speech recognizers require highly controlled acoustic environments and rely on the assumption that the signal is dominated by a single voice. System performance degrades rapidly in noise, and no coherent approach has been developed for the problem of recognizing an acoustic mixture of several equally prominent voices, the kind of situation people face every day in settings such as meetings.

     Computational auditory scene analysis (CASA) treats the separation of different sound sources as its primary goal, yet the promise of the original psychologists' descriptions has not been fulfilled in practice. Systems that rely entirely on simple, local signal features such as periodicity and onset appear to have very limited application. Instead, the use of signal-knowledge constraints, learned from prolonged exposure to real-world sound, seems inescapable.

     While current CASA systems typically address individual aspects of human performance, they would clearly benefit from the techniques that have contributed to the success of speech recognition, particularly the use of machine learning to build statistical models of large training corpora. Models that adequately capture the constraints implicit in the concept of a 'natural sound source' can form the basis of a general-purpose sound scene analysis system, able to handle the wide spectrum of complex, natural environments encountered in everyday life.

     In this talk, I will review human sound organization and various computational modeling approaches, arguing for the importance of top-down, knowledge-based constraints. By drawing on examples of how knowledge is employed in speech recognizers, and on recent work on recognizing partial speech information in mixtures, I will describe the general framework for sound mixture organization being pursued within LabROSA, including inference techniques drawn from machine learning. Examples of general sound mixture recognition, and detectors for specific nonspeech sound classes, will illustrate these ideas.
    


Psychoacoustics of Dynamic ‘Center-of-Gravity’ Signals
 (ppt)

Larry Feth

Department of Speech and Hearing Sciences
The Ohio State University
Columbus, OH 43210
feth.1@osu.edu


         


Binaural Coherence: Mathematics,
Sound Localization, and Room Acoustics


William M. Hartmann

Department of Physics and Astronomy
Michigan State University
East Lansing, MI 48824
hartmann@pa.msu.edu


    Mathematically, binaural coherence is measured by the binaural cross-correlation function. The cross-correlation function, in turn, is easily calculated from either a temporal representation or a spectral representation in the ideal case. It is possible to derive theorems for the behavior of cross-correlation in acoustically dispersive environments that serve as guides for cross-correlational models of binaural function. However, the application to human perception experiments often leads to complications arising from small bandwidth, limited stimulus duration, differences between amplitude incoherence and phase incoherence, and a perceptual system that may be sensitive to interaural time differences only in the envelope. Ultimately it becomes necessary to tailor the mathematics to the experimental circumstances and the human listener.
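
    As a rough illustration of the temporal calculation mentioned above (not the author's own code), the following Python sketch takes binaural coherence to be the peak of the normalized interaural cross-correlation function over a limited range of lags. The function names, the restriction to integer-sample lags, and the synthetic noise signals are all assumptions made for this example.

```python
import math
import random

def cross_correlation(left, right, lag):
    """Normalized cross-correlation of two equal-length signals at one integer lag.

    A positive lag pairs left[i] with right[i + lag], so it matches a
    right-ear signal that is a delayed copy of the left-ear signal.
    """
    n = len(left)
    if lag >= 0:
        a, b = left[:n - lag], right[lag:]
    else:
        a, b = left[-lag:], right[:n + lag]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0

def binaural_coherence(left, right, max_lag):
    """Return (peak value, lag at peak) over lags in [-max_lag, max_lag]."""
    lags = range(-max_lag, max_lag + 1)
    best_lag = max(lags, key=lambda k: cross_correlation(left, right, k))
    return cross_correlation(left, right, best_lag), best_lag

# Synthetic signals (assumed for illustration): Gaussian noise at the left
# ear, a copy delayed by 5 samples, and an independent noise.
rng = random.Random(0)
left = [rng.gauss(0.0, 1.0) for _ in range(2000)]
delayed = [0.0] * 5 + left[:-5]
independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]

coherent_peak, coherent_lag = binaural_coherence(left, delayed, 20)
incoherent_peak, _ = binaural_coherence(left, independent, 20)
```

    In this sketch a pure interaural delay leaves the peak at 1 and only moves the lag at which it occurs, while independent noise at the two ears gives a peak near 0. Reflections in a room would be expected to fall between these two extremes.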

    Binaural coherence is necessary for a listener to make use of interaural time differences in sound localization. Without coherence there is nothing to time. By contrast, recent experiments show that binaural coherence is almost completely unimportant for sound localization by interaural level differences. The requirements on binaural coherence for the use of interaural time differences depend sensitively on the stimulus frequency range. Tasks that can be done with a coherence of 0.05 when the stimulus contains frequencies well below 1000 Hz require a coherence more than ten times greater when the spectrum is above 2000 Hz. This experimental result has been successfully reproduced by an auditory model originally developed to explain masking level differences.

    Binaural coherence can be measured in a room by means of an artificial head or head-worn microphones. The reflections and standing waves in a room inevitably reduce the coherence, especially in the critical frequency region between 500 and 1000 Hz. In highly reverberant environments, the binaural coherence is so low that listeners subconsciously change their localization strategies, from an emphasis on interaural time differences to an emphasis on interaural level differences. Acoustical computations show that binaural coherence increases with increasing frequency, but the increase may not be fast enough to meet the relatively enormous coherence requirements that the binaural system appears to make at high frequency.
    


The Dynamics of Attending and Scene Analysis
 (ppt)

Mari R. Jones

Psychology Department
The Ohio State University
Columbus, OH 43210 USA
jones.80@osu.edu
         


Fixed Point Representations for Very High Quality Speech and Sound Modification Systems
 (pdf)

Hideki Kawahara

Design Information Sciences Department
Faculty of Systems Engineering
Wakayama University
930 Sakaedani, Wakayama 640-8510, JAPAN
kawahara@sys.wakayama-u.ac.jp


    STRAIGHT, a speech analysis, modification, and re-synthesis system, was developed to enable high-quality manipulation of speech features based on perceptually relevant parameters. It destroys waveform structure while preserving perceptual similarity. In other words, it provides clues in the search for invariant structure in our auditory space. A speculative discussion will be presented suggesting that fixed points in the time domain, the frequency domain, and the lag domain, together with their associated attributes, may provide a basis for representing auditorily relevant information.


Welcome Remarks
 (ppt)

Willard Larkin

Program Manager for Life Sciences
Directorate of Chemistry & Life Sciences
The Air Force Office of Scientific Research
willard.larkin@afosr.af.mil
         


Melodic Pitch and Foveal Audition
 (link to ppt)

Roy D. Patterson

Centre for the Neural Basis of Hearing
Department of Physiology
University of Cambridge
Downing Street Cambridge, CB2 3EG, U.K.
rdp1@cam.ac.uk


    The range of notes normally used to make melodies extends from about two octaves below to about two octaves above middle C on the keyboard. The frequency of middle C is a little over 256 Hz, so the range of notes in melodies is from about 64 Hz to 1024 Hz. The range of hearing for young adults is from about 32 Hz to 12,000 Hz. So why do we not use more of the range available? Part of the reason is that the notes produced by traditional instruments have harmonics that give the instruments their distinctive timbres, and some frequency range is needed to accommodate the harmonics. But recent research shows that this does not really explain either the upper or the lower limit of pitch. In this talk, I will describe some recent research we have performed to define what melodic pitch is and to measure the domain available to composers.
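
    The octave arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch: the helper names are invented for this example, and 256 Hz is the abstract's round figure for middle C rather than an exact tuning.

```python
import math

def octave_shift(freq_hz, octaves):
    """Frequency reached by moving `octaves` octaves up (+) or down (-)."""
    return freq_hz * 2.0 ** octaves

def octaves_between(low_hz, high_hz):
    """Width of a frequency range in octaves."""
    return math.log2(high_hz / low_hz)

MIDDLE_C_HZ = 256.0  # round figure from the abstract, not an exact tuning

melody_low = octave_shift(MIDDLE_C_HZ, -2)               # 64.0 Hz
melody_high = octave_shift(MIDDLE_C_HZ, 2)               # 1024.0 Hz
melody_span = octaves_between(melody_low, melody_high)   # 4.0 octaves
hearing_span = octaves_between(32.0, 12000.0)            # about 8.5 octaves
```

    On these figures, melodies occupy four octaves out of the roughly eight and a half octaves quoted for the hearing range of young adults.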

    The results may have wider importance for our understanding of hearing; together with recent time-interval masking experiments, they suggest that, within the space of frequencies and time-intervals to which we are sensitive, there is a ‘foveal’ region where the system has much greater resolution than in the surrounding ‘periphery’. Specifically, in channels with frequencies from 40 to 4000 Hz, the auditory system can process time intervals from 0.3 to 33 ms with much greater accuracy than outside this frequency/time-interval range. Monaural masking experiments will be described which indicate that, within this range, we can detect monaural time differences of a few tens of microseconds in continuous sounds even when they have matched excitation patterns. Outside this region, we can still hear sound and make distinctions about the repetition rate of the sound or the form of the source, but the resolution is an order of magnitude less precise, and this is why it is not used for melody. The talk will argue 1) that the foveal/peripheral distinction of vision can be a useful explanatory tool in hearing, and 2) that there may be a physiological basis for the analogy.
    


Innovating Signal Processing Methods for Computational Audition (ppt)

Jose C. Principe

Computational NeuroEngineering Laboratory
University of Florida
Gainesville, FL 32611-6200
principe@cnel.ufl.edu


    Signal processing has played a key role in our understanding of the auditory system, as well as in the design and implementation of devices for the hearing impaired. However, we believe that the rate of progress could be increased if new signal processing methodologies were derived that take into consideration the structure of biosignals. Essentially, the present theory of optimal signal processing assumes linear systems and is based on assumptions of Gaussianity and stationarity. These conditions are not met in practice.

    We have been working on lifting the restrictions of linearity and second order statistics in optimal signal processing algorithms. In this presentation we will briefly review a new optimization criterion based on Renyi’s entropy and show how it can be utilized to adapt linear and nonlinear filters. We will present results for blind source separation.
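
    The abstract does not say which order of Renyi's entropy or which estimator is used. As a minimal sketch under stated assumptions, the Python below estimates the quadratic (order 2) entropy of a set of scalar samples with a Gaussian Parzen window, which reduces to a sum of pairwise kernel evaluations. The function names and the kernel width `sigma` are hypothetical choices for this example.

```python
import math

def gaussian_kernel(u, variance):
    """Value at u of a zero-mean Gaussian density with the given variance."""
    return math.exp(-u * u / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def information_potential(samples, sigma):
    """Mean pairwise kernel value: the integral of the squared Parzen estimate.

    Two Gaussian windows of variance sigma**2 convolve to a Gaussian of
    variance 2 * sigma**2, hence the doubled variance below.
    """
    n = len(samples)
    variance = 2.0 * sigma * sigma
    total = sum(gaussian_kernel(xi - xj, variance)
                for xi in samples for xj in samples)
    return total / (n * n)

def renyi_quadratic_entropy(samples, sigma):
    """Estimate of Renyi's order-2 entropy: minus the log of the potential."""
    return -math.log(information_potential(samples, sigma))

# Assumed toy data: the same pattern of samples at two different spreads.
narrow = [-0.3, -0.1, 0.1, 0.3]
wide = [-3.0, -1.0, 1.0, 3.0]
entropy_narrow = renyi_quadratic_entropy(narrow, 0.5)
entropy_wide = renyi_quadratic_entropy(wide, 0.5)
```

    Because the estimate is a smooth function of the samples, it can in principle be differentiated with respect to the parameters of a filter that produces those samples, which is one way such a criterion could drive adaptation. The sketch stops short of that step.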
    


The Cortical Representation of Spectral Dynamics
 (ppt)

Shihab Shamma

Department of Electrical Engineering
University of Maryland
College Park, MD 20742
sas@eng.umd.edu


    To understand the representation of broadband, dynamic sounds in Primary Auditory Cortex (A1), we characterize its responses by the Spectro-Temporal Response Field (STRF). The STRF describes and predicts the linear response of neurons to sounds rich with spectro-temporal envelopes. It is calculated from responses to broadband sounds with rippled spectral envelopes that drift up and down the frequency axis at various speeds. These stimuli also allow us to compute a "ripple transfer function" which summarizes the way a cell responds to all ripples. In this talk, we shall first summarize how the transfer function relates to the STRF, how it can be used to investigate the spectral and temporal response properties of the cell, and what implications these properties have for the connectivity of the cell within the cortex and to the thalamus. We shall also address the functional implications of these results for the processing of complex sounds such as speech and music, and the relationship between auditory and other sensory processing in the cortex.


Using Computational Models of Binaural Hearing to
Improve Automatic Speech Recognition Accuracy:
Promise, Progress, and Problems
 (ppt)

Richard Stern

Department of Electrical and Computer Engineering
and School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
rms@cs.cmu.edu


    For many years the human binaural system has been an inspiration for developers of robust automatic speech recognition systems because of its ability to separate competing sources of sound arriving from different azimuths, and because of its ability to suppress the effects of some kinds of room reverberation. Nevertheless, while several groups of researchers have incorporated computational models of binaural interaction into the feature extraction stage of automatic speech recognition systems, the impact of binaural processing on recognition accuracy has been more limited than expected, and has been achieved only at substantial computational cost. This talk will review the motivation for and structure of current models of binaural hearing that have been applied to speech recognition. We will compare the improvements in speech recognition accuracy obtained through the use of binaural processing. Finally, we will discuss some of the reasons why we believe that progress to date has been limited and speculate on especially promising arenas for future research.    


Monaural Speech Segregation:
Representation, Harmonicity, and Amplitude Modulation
 (ppt)

DeLiang Wang

Department of Computer and Information Science
and Center for Cognitive Science
The Ohio State University
Columbus, OH 43210-1277
dwang@cis.ohio-state.edu


    Speech segregation in the monaural condition is a primary task of computational auditory scene analysis, and has proven to be very challenging. We present a multi-stage model for the task. The model starts with a simulated auditory periphery. A subsequent stage computes mid-level auditory representations, including correlograms and cross-channel correlations. The core of the system performs segmentation and grouping in a two-dimensional time-frequency representation that encodes proximity in frequency and time, periodicity, and amplitude modulation (AM). Motivated by psychoacoustic observations, our system employs different mechanisms for handling resolved and unresolved harmonics. For resolved harmonics, the system generates segments based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the system generates segments based on AM in addition to temporal continuity and groups them according to AM repetition rates derived from sinusoidal modeling. Underlying the segregation process is a pitch contour that is first estimated from speech segregated according to global pitch and then adjusted according to psychoacoustic constraints. The model has been systematically evaluated using a common corpus of speech mixed with a variety of interfering sounds, and it yields substantially better performance than previous systems.
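
    The mid-level representations named above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the system described in the abstract: the peripheral model is skipped, the "channel responses" are plain sinusoids, one time frame is analysed, and cross-channel correlation is taken to be the Pearson correlation between the autocorrelation functions of two channels. All function names are invented for the example.

```python
import math

def autocorrelation(channel, max_lag):
    """Autocorrelation of one channel's response for lags 0..max_lag.

    Every lag sums over the same number of samples, so values at
    different lags are directly comparable.
    """
    window = len(channel) - max_lag
    return [sum(channel[n] * channel[n + lag] for n in range(window))
            for lag in range(max_lag + 1)]

def dominant_period(acf, min_lag):
    """Lag (in samples) of the largest autocorrelation value at or above min_lag."""
    return max(range(min_lag, len(acf)), key=lambda lag: acf[lag])

def cross_channel_correlation(acf_a, acf_b):
    """Pearson correlation between two channels' autocorrelation functions."""
    n = len(acf_a)
    mean_a, mean_b = sum(acf_a) / n, sum(acf_b) / n
    da = [v - mean_a for v in acf_a]
    db = [v - mean_b for v in acf_b]
    den = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    return sum(x * y for x, y in zip(da, db)) / den if den > 0.0 else 0.0

# Stand-in channel responses (assumed): two channels sharing a 40-sample
# period with different phases, and one channel with a 25-sample period.
frame = range(460)
chan_a = [math.sin(2.0 * math.pi * n / 40.0) for n in frame]
chan_b = [math.sin(2.0 * math.pi * n / 40.0 + 1.0) for n in frame]
chan_c = [math.sin(2.0 * math.pi * n / 25.0) for n in frame]

acf_a, acf_b, acf_c = (autocorrelation(ch, 60) for ch in (chan_a, chan_b, chan_c))
period_a = dominant_period(acf_a, 20)
same_source = cross_channel_correlation(acf_a, acf_b)
other_source = cross_channel_correlation(acf_a, acf_c)
```

    Stacking the autocorrelation functions of all channels for one frame gives one slice of a correlogram. In this toy case, channels that share a periodicity have nearly identical autocorrelation functions regardless of phase, so their cross-channel correlation is close to 1, which is the kind of evidence a segmentation stage could use to merge neighbouring channels.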

Last modified Feb 14th, 2003 by shaoy@cis.ohio-state.edu