Presentations from:
Jont B. Allen
Timothy R. Anderson
Albert S. Bregman
Guy J. Brown
Douglas Brungart
Dan Ellis
Larry Feth
William M. Hartmann
Mari R. Jones
Hideki Kawahara
Willard Larkin
Roy D. Patterson
Jose C. Principe
Shihab Shamma
Richard Stern
DeLiang Wang
|
|
|
|
|
In 1908 Lord Rayleigh reported on his speech perception studies
using the "acousticon" (a commercial sound system produced
in 1905), demonstrating that he was well aware of the importance
of the bandwidth in speech perception. It was the development
of the telephone that both allowed and pushed mathematicians
and physicists to develop the science of speech perception.
From 1910 to 1950 speech perception
was extensively studied by telephone research departments
throughout the world. However, it was the work of AT&T's Harvey
Fletcher in 1921 that made the first major breakthroughs.
During WWII the Harvard Acoustics Lab took on this problem,
and further breakthroughs were provided by George Miller and his
colleagues. Miller used concepts from information theory developed
at Bell Labs by Claude Shannon to quantify speech entropy.
I will attempt to pass along some wisdom I have learned over
the years on what we now know about human speech recognition
(HSR).
My talk will be in four parts. In
part one I briefly summarize key results from the 30 years
of work by Fletcher and his colleagues, which resulted in
the "articulation index." In part two I summarize the
work of George Miller. Miller studied the importance of varying
the source entropy (randomness) in speech perception. In part
three I describe some work in progress where I partially repeated
Miller and Nicely's experiment. In part four I describe recent
experimental work on building more robust automatic speech recognition (ASR). One goal is
to make a system that works as well as human listeners in
decoding degraded (filtered plus noise) nonsense speech sounds.
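
As a rough illustration of what "source entropy" means here, the sketch below computes the Shannon entropy of a discrete source. It assumes a closed response set of equally likely items; the set sizes are hypothetical and are not taken from Miller's experiments.

```python
import math

def source_entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete source with the given probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform_source(n_items):
    """Probabilities for a closed set of n_items equally likely alternatives."""
    return [1.0 / n_items] * n_items

# Hypothetical vocabulary sizes: a larger set of alternatives means more entropy.
for n_items in (2, 16, 256):
    print(n_items, "items:", source_entropy_bits(uniform_source(n_items)), "bits")
```

For equally likely alternatives the entropy is simply log2 of the set size, so enlarging the set of candidate words is one direct way to raise the randomness of the source.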
|
|
|
|
|
Research in the use of auditory models for speech recognition has been
conducted at the Air Force Research Laboratory for over a
decade. This talk will present a snapshot of that research.
Results of monaural and binaural phoneme recognition experiments
in noise will be examined. Continuous speech recognition experiments
comparing traditional and auditory model features will be
discussed. Work on the development of a hardware auditory
model for real-time feature generation will also be presented.
In conclusion, possible areas of future research and development
will be discussed.
|
|
|
Auditory Scene Analysis in Humans: Implications for Computational Implementations (pdf)
Albert S. Bregman
Department of Psychology
McGill University
1205 Docteur Penfield Avenue
Montreal, QC, Canada H3A 1B1
al.bregman@mcgill.ca
|
The main phenomena of human auditory scene analysis (ASA) will be
introduced with auditory illustrations. ASA is the human ability
in multi-sound environments to parse the acoustic input and
form separate auditory representations of the individual sound
sources. Suggestions will be made concerning issues in the implementation
of a system for computational auditory scene analysis (CASA).
These will be based on a number of points:
1. The default status of an input auditory array is the integration
of all of it into a single sound.
2. Models should be designed so that there is a parallel activity
of the systems that deal with different cues. In natural environments,
sometimes a particular cue is useful and sometimes not. The
high-quality cues in any environment should be able to take
over the burden of segregation and grouping in a seamless way.
3. In a computational model, each cue-using mechanism should
be able to be shut off without affecting the activity of the
remaining ones. The system should degrade gracefully.
4. Typically a CASA model attempts to test out the use of some
particular cue for grouping. The modeler should be able to compare
the success rates with and without the mechanism that uses this
principle, so as to determine its “incremental” utility. This
should be compared under different types of degradation of the
signal.
5. Perhaps the most powerful cue to segregation is the “old-plus-new
heuristic”. Models should make more use of it.
6. The phenomenon of duplex perception suggests that the bottom-up
ASA principles should not partition the signal outright but
should negotiate with top-down processes to converge on a description
of multiple sources that maximally satisfies both.
 |
|
|
|
We describe a binaural auditory model for speech recognition, which
is robust in the presence of reverberation and spatially separated
noise intrusions. The principle underlying the model is to identify
time-frequency regions which constitute reliable evidence of
the speech signal. This is achieved both by determining the
spatial location of the speech source using interaural time
difference (ITD) and interaural intensity difference (IID) cues,
and by applying a model of reverberation masking. Reliable time-frequency
regions are passed to a 'missing data' speech recogniser. We
show, firstly, that the auditory model improves recognition
performance in various reverberation conditions when no noise
intrusion is present. Secondly, we demonstrate that the model
improves performance when the speech signal is contaminated
by noise, both for an anechoic environment and in the presence
of room reverberation. Links between the 'missing data' approach
to automatic speech recognition and neural oscillator models
of auditory function are also discussed.
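
The idea of selecting reliable time-frequency regions by spatial location can be sketched very simply. The example below is an illustrative assumption, not the model described in the abstract: it marks a time-frequency cell as reliable when its ITD estimate lies within a tolerance of the target's ITD. The function name, the tolerance, and the ITD values (in microseconds) are all hypothetical, and the IID cue and the reverberation masking model are left out.

```python
def reliability_mask(itd_estimates, target_itd, tolerance):
    """Mark each time-frequency cell as reliable (True) or unreliable (False).

    itd_estimates[channel][frame] holds one ITD estimate per cell. A cell is
    treated as reliable when its ITD is within `tolerance` of the target's ITD.
    """
    return [
        [abs(itd - target_itd) <= tolerance for itd in channel]
        for channel in itd_estimates
    ]

# Hypothetical ITD estimates in microseconds: target at 0, interferer near 400.
itd_estimates = [
    [0, 10, 400],
    [-20, 390, 5],
]
mask = reliability_mask(itd_estimates, target_itd=0, tolerance=50)
for row in mask:
    print(row)
```

A missing-data recogniser would then treat only the cells marked True as evidence for the speech source.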
|
|
|
To this point, most research on the cocktail party problem has
focused on one of two different multitalker listening configurations:
the selective attention configuration, where the listener knows
the location of the target talker in advance; and the divided
attention configuration, where the listener has no information
about the location of the target talker prior to listening to
the stimulus. Most real-world multitalker listening tasks fall
somewhere between these two extremes: listeners have to dynamically
allocate their attention according to the likelihood that the
target information will originate from each individual talker
in the next stimulus interval. This experiment measured performance
in a three-talker cocktail party task where the probability
of a change in the location of the target talker across trials
was varied from 0% to 100%. The listeners were given no information
about the transition probability used in each 60-trial block.
The results show that listeners adapt relatively slowly to unexpected
changes in the location of the target talker. The results also
suggest that listeners adopt different strategies for listening
situations with different target transition probabilities. These
results have important implications for the development of comprehensive
models of auditory attention.
|
Sound, Mixtures, and Learning (pdf)
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Department of Electrical Engineering
Columbia University
New York, NY 10027
dpwe@ee.columbia.edu
|
Human listeners, like other animals, gain a great deal of information
from their acoustic environment. Sound provides a useful and
complementary information channel for situations in which vision
is inadequate, e.g., for detecting events regardless of their
direction, and for operating in darkness. As with visual scenes,
however, the presence of multiple, interfering sources makes
the extraction of reliable, high-level information from real-world
sounds extremely challenging.
The most successful application of
acoustic analysis is automatic speech recognition; however,
well-performing speech recognizers require highly controlled
acoustic environments and rely on the assumption that the signal
is dominated by a single voice. System performance degrades
rapidly in noise, and no coherent approach has been developed
for the problem of recognizing an acoustic mixture of several
equally prominent voices, the kind of situation people face
every day in settings such as meetings.
Computational auditory scene analysis
(CASA) treats the separation of different sound sources as its
primary goal, yet the promise of the original psychologists'
descriptions has not been fulfilled in practice. Systems that
rely entirely on simple, local signal features such as periodicity
and onset appear to have very limited application. Instead,
the use of signal-knowledge constraints, learned from prolonged
exposure to real-world sound, seems inescapable.
While current CASA systems typically
address individual aspects of human performance, they would
clearly benefit from the techniques that have contributed to
the success of speech recognition, particularly the use of machine
learning to build statistical models of large training corpora.
Models that adequately capture the constraints implicit in the
concept of a 'natural sound source' can form the basis of a general-purpose
sound scene analysis system, able to handle the wide spectrum
of complex, natural environments encountered in everyday life.
In this talk, I will review human sound
organization and various computational modeling approaches,
arguing for the importance of top-down, knowledge-based constraints.
By drawing on examples of how knowledge is employed in speech
recognizers, and on recent work on recognizing partial speech
information in mixtures, I will describe the general framework
for sound mixture organization being pursued within LabROSA,
including inference techniques drawn from machine learning.
Examples of general sound mixture recognition, and detectors
for specific nonspeech sound classes, will illustrate these
ideas.
|
|
|
|
|
|
Binaural Coherence: Mathematics, Sound Localization, and Room Acoustics
William M. Hartmann
Department of Physics and Astronomy
Michigan State University
East Lansing, MI 48824
hartmann@pa.msu.edu
|
Mathematically, binaural coherence is measured by the binaural cross-correlation
function. The cross-correlation function, in turn, is easily
calculated from either a temporal representation or a spectral
representation in the ideal case. It is possible to derive theorems
for the behavior of cross-correlation in acoustically dispersive
environments that serve as guides for cross-correlational models
of binaural function. However, the application to human perception
experiments often leads to complications arising from small
bandwidth, limited stimulus duration, differences between amplitude
incoherence and phase incoherence, and a perceptual system that
may be sensitive to interaural time differences only in the
envelope. Ultimately it becomes necessary to tailor the mathematics
to the experimental circumstances and the human listener.
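
To make the temporal-representation calculation concrete, the sketch below takes coherence to be the peak of the normalized interaural cross-correlation within a window of lags. That is one common convention and is assumed here rather than taken from the abstract. The signals, the lag window, and the function names are hypothetical.

```python
import math
import random

def cross_correlation(left, right, lag):
    """Sum of left[n] * right[n + lag] over the samples where both exist."""
    n_samples = len(left)
    total = 0.0
    for i in range(n_samples):
        j = i + lag
        if 0 <= j < n_samples:
            total += left[i] * right[j]
    return total

def binaural_coherence(left, right, max_lag):
    """Return (coherence, lag): the peak normalized cross-correlation and its lag."""
    energy = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    if energy == 0.0:
        return 0.0, 0
    best_lag = max(
        range(-max_lag, max_lag + 1),
        key=lambda lag: cross_correlation(left, right, lag),
    )
    return cross_correlation(left, right, best_lag) / energy, best_lag

rng = random.Random(0)
left = [rng.gauss(0.0, 1.0) for _ in range(2000)]
delayed = [0.0, 0.0, 0.0] + left[:-3]                   # same noise, 3 samples later
unrelated = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # independent noise

print(binaural_coherence(left, delayed, max_lag=10))    # high coherence at lag 3
print(binaural_coherence(left, unrelated, max_lag=10))  # low coherence
```

The lag of the peak plays the role of the interaural time difference, which is why a signal with very low coherence leaves the listener with "nothing to time".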
Binaural coherence is necessary for
a listener to make use of interaural time differences in sound
localization. Without coherence there is nothing to time. By
contrast, recent experiments show that binaural coherence is
almost completely unimportant for sound localization by interaural
level differences. The requirements on binaural coherence for
the use of interaural time differences depend sensitively on
the stimulus frequency range. Tasks that can be done with a
coherence of 0.05 when the stimulus contains frequencies well
below 1000 Hz require a coherence more than ten times greater
when the spectrum is above 2000 Hz. This experimental result
has been successfully reproduced by an auditory model originally
developed to explain masking level differences.
Binaural coherence can be measured in
a room by means of an artificial head or head-worn microphones.
The reflections and standing waves in a room inevitably reduce
the coherence, especially in the critical frequency region between
500 and 1000 Hz. In highly reverberant environments, the binaural
coherence is so low that listeners subconsciously change their
localization strategies, from an emphasis on interaural time
differences to an emphasis on interaural level differences.
Acoustical computations show that binaural coherence increases
with increasing frequency, but the increase may not be fast
enough to meet the relatively enormous coherence requirements
that the binaural system appears to make at high frequency.
|
|
|
|
|
|
|
|
STRAIGHT, a system for speech analysis, modification and re-synthesis,
was developed to enable high-quality speech feature manipulation
based on perceptually relevant parameters. It destroys waveform
structure while preserving perceptual similarity. In other words,
it provides clues for seeking invariant structure in our auditory
space. A speculative discussion will be presented suggesting
that fixed points in the time domain, the frequency domain and
the lag domain and their associated attributes may provide a
basis for representing auditorily relevant information.
 |
|
Welcome Remarks (ppt)
Willard Larkin
Program Manager for Life Sciences
Directorate of Chemistry & Life Sciences
The Air Force Office of Scientific Research
willard.larkin@afosr.af.mil
|
|
|
|
Melodic Pitch and Foveal Audition (link to ppt)
Roy D. Patterson
Centre for the Neural Basis of Hearing
Department of Physiology
University of Cambridge
Downing Street Cambridge, CB2 3EG, U.K.
rdp1@cam.ac.uk
|
The range of notes normally used to make melodies extends from about
two octaves below to about two octaves above middle C on the
keyboard. The frequency of middle C is a little over 256 Hz,
so the range of notes in melodies is from about 64 Hz to 1024
Hz. The range of hearing for young adults is from about 32 Hz
to 12,000 Hz. So why do we not use more of the range available?
Part of the reason is that the notes produced by traditional
instruments have harmonics that give the instruments their distinctive
timbres, and some frequency range is needed to accommodate the
harmonics. But recent research shows that this does not really
explain either the upper or the lower limit of pitch. In this
talk, I will describe some recent research we have performed
to define what melodic pitch is and to measure the domain available
to composers.
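
The octave arithmetic behind these figures can be checked in a few lines. The sketch uses the rounded middle C value and the range limits quoted in the abstract; the function names are illustrative.

```python
import math

MIDDLE_C_HZ = 256.0  # rounded value quoted in the abstract

def shift_octaves(frequency_hz, octaves):
    """Frequency reached by moving `octaves` octaves up (positive) or down (negative)."""
    return frequency_hz * 2.0 ** octaves

def octaves_between(low_hz, high_hz):
    """Number of octaves spanned by a frequency range."""
    return math.log2(high_hz / low_hz)

melody_low = shift_octaves(MIDDLE_C_HZ, -2)   # two octaves below middle C
melody_high = shift_octaves(MIDDLE_C_HZ, 2)   # two octaves above middle C
print("melodic range:", melody_low, "to", melody_high, "Hz")
print("octaves used for melody:", octaves_between(melody_low, melody_high))
print("octaves in the quoted hearing range:", octaves_between(32.0, 12000.0))
```

Each octave is a doubling of frequency, so the melodic range covers four octaves while the quoted hearing range covers roughly twice as many.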
The results may have wider importance
for our understanding of hearing; together with recent time-interval
masking experiments, they suggest that, within the space of
frequencies and time-intervals to which we are sensitive, there
is a ‘foveal’ region where the system has much greater resolution
than in the surrounding ‘periphery’. Specifically, in channels
with frequencies from 40 to 4000 Hz, the auditory system can
process time intervals from 0.3 to 33 ms with much greater accuracy
than outside this frequency/time-interval range. Monaural masking
experiments will be described which indicate that, within this
range, we can detect monaural time differences of a few tens
of microseconds in continuous sounds even when they have matched
excitation patterns. Outside this region, we can still hear
sound and make distinctions about the repetition rate of the
sound or the form of the source, but the resolution is an order
of magnitude less precise, and this is why it is not used for
melody. The talk will argue 1) that the foveal/peripheral distinction
of vision can be a useful explanatory tool in hearing, and 2)
that there may be a physiological basis for the analogy.
|
|
|
Signal processing has played a key role in our understanding of the
auditory system, as well as in the design and implementation
of devices for the hearing impaired. However, we believe that
the rate of progress could be increased if new signal processing
methodologies are derived taking into consideration the structure
of biosignals. Essentially, the present theory of optimal signal
processing assumes linear systems, and it is based on the Gaussianity
and stationarity assumptions. These conditions are not met in
practice.
We have been working on lifting the
restrictions of linearity and second order statistics in optimal
signal processing algorithms. In this presentation we will briefly
review a new optimization criterion based on Rényi’s entropy
and show how it can be utilized to adapt linear and nonlinear
filters. We will present results for blind source separation.
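
The abstract does not say how the entropy is estimated. A minimal sketch, assuming Rényi's quadratic entropy estimated from samples with a Gaussian Parzen window, is given below. The kernel width and the sample values are hypothetical.

```python
import math

def gaussian(x, variance):
    """Zero-mean Gaussian density with the given variance, evaluated at x."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def information_potential(samples, kernel_sigma):
    """Parzen estimate of the integral of the squared density.

    Convolving two Gaussian kernels of width kernel_sigma gives a Gaussian
    of variance 2 * kernel_sigma**2, evaluated at every pairwise difference.
    """
    n_samples = len(samples)
    variance = 2.0 * kernel_sigma ** 2
    total = sum(gaussian(a - b, variance) for a in samples for b in samples)
    return total / (n_samples * n_samples)

def renyi_quadratic_entropy(samples, kernel_sigma):
    """Renyi's quadratic entropy estimate: minus the log of the information potential."""
    return -math.log(information_potential(samples, kernel_sigma))

narrow = [0.0, 0.1, -0.1, 0.05]   # tightly clustered samples
wide = [0.0, 2.0, -2.0, 1.0]      # widely spread samples
print(renyi_quadratic_entropy(narrow, kernel_sigma=0.5))
print(renyi_quadratic_entropy(wide, kernel_sigma=0.5))
```

Because the estimate is a smooth function of the samples themselves, it can be differentiated with respect to filter parameters, and it does not rely on the Gaussianity assumption mentioned above.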
|
|
|
|
To understand the representation of broadband, dynamic sounds in
Primary Auditory Cortex (A1), we characterize its responses
by the Spectro-Temporal Response Field (STRF). The STRF describes
and predicts the linear response of neurons to sounds rich with
spectro-temporal envelopes. It is calculated from responses
to broadband sounds with rippled spectral envelopes that drift
up and down the frequency axis at various speeds. These stimuli
allow us also to compute a "ripple transfer function" which
summarizes the way a cell responds to all ripples. In this talk,
we shall first summarize how the transfer function relates to
the STRF, how it can be used to investigate the spectral and
temporal response properties of the cell, and what implications
these properties have for the connectivity of the cell within
the cortex, and to the thalamus. We shall also address the functional
implications of these results for the processing of complex sounds
such as speech and music, and the relationship between auditory
and other sensory processing in the cortex.
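
How an STRF predicts a linear response can be sketched as a weighted sum over frequency channels and time lags of the stimulus spectrogram. The array layout, the toy STRF, and the toy spectrogram below are assumptions made for illustration and are not data from the talk.

```python
def predict_response(strf, spectrogram):
    """Linear response prediction: r[t] = sum over f, tau of strf[f][tau] * s[f][t - tau].

    strf[f][tau] weights frequency channel f at a delay of tau frames;
    spectrogram[f][t] is the stimulus envelope in channel f at frame t.
    """
    n_channels = len(strf)
    n_lags = len(strf[0])
    n_frames = len(spectrogram[0])
    response = []
    for t in range(n_frames):
        total = 0.0
        for f in range(n_channels):
            for tau in range(n_lags):
                if t - tau >= 0:
                    total += strf[f][tau] * spectrogram[f][t - tau]
        response.append(total)
    return response

# Toy cell: excited by channel 0 and inhibited by channel 1, both one frame back.
strf = [
    [0.0, 1.0],
    [0.0, -0.5],
]
spectrogram = [
    [1.0, 0.0, 0.0, 0.0],   # energy in channel 0 at frame 0
    [0.0, 0.0, 1.0, 0.0],   # energy in channel 1 at frame 2
]
print(predict_response(strf, spectrogram))
```

In this toy case the predicted response rises one frame after the channel 0 event and dips one frame after the channel 1 event, mirroring the excitatory and inhibitory regions of the STRF.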
|
|
|
For many years the human binaural system has been an inspiration
for developers of robust automatic speech recognition systems
because of its ability to separate competing sources of sound
arriving from different azimuths, and because of its ability
to suppress the effects of some kinds of room reverberation.
Nevertheless, while several groups of researchers have incorporated
computational models of binaural interaction into the feature
extraction stage of automatic speech recognition systems, the
impact of binaural processing on recognition accuracy has been
more limited than expected, and has been achieved only at substantial
computational cost. This talk will review the motivation for
and structure of current models of binaural hearing that have
been applied to speech recognition. We will compare the improvements
in speech recognition accuracy obtained through the use of binaural
processing. Finally, we will discuss some of the reasons why
we believe that progress to date has been limited and speculate
on especially promising arenas for future research.
|
|
|
Speech segregation in the monaural condition is a primary task of computational
auditory scene analysis, and has proven to be very challenging.
We present a multi-stage model for the task. The model starts
with a simulated auditory periphery. A subsequent stage computes
mid-level auditory representations, including correlograms and
cross-channel correlations. The core of the system performs
segmentation and grouping in a two-dimensional time-frequency
representation that encodes proximity in frequency and time,
periodicity, and amplitude modulation (AM). Motivated by psychoacoustic
observations, our system employs different mechanisms for handling
resolved and unresolved harmonics. For resolved harmonics, the
system generates segments based on temporal continuity and cross-channel
correlation, and groups them according to periodicity. For unresolved
harmonics, the system generates segments based on AM in addition
to temporal continuity and groups them according to AM repetition
rates derived from sinusoidal modeling. Underlying the segregation
process is a pitch contour that is first estimated from speech
segregated according to global pitch and then adjusted according
to psychoacoustic constraints. The model has been systematically
evaluated using a common corpus of speech mixed with a variety
of interfering sounds, and it yields substantially better performance
than previous systems.
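
A correlogram is built from the autocorrelation of each peripheral channel. The sketch below covers a single channel only and picks the lag with the strongest autocorrelation as that channel's period. The sample rate, the test tone, and the lag search range are hypothetical, and the filterbank and cross-channel stages of the model are omitted.

```python
import math

def autocorrelation(signal, lag):
    """Unnormalized autocorrelation of the signal at the given lag (in samples)."""
    return sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))

def dominant_period(signal, min_lag, max_lag):
    """Lag within [min_lag, max_lag] at which the autocorrelation is largest."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(signal, lag))

SAMPLE_RATE_HZ = 8000   # hypothetical sample rate
TONE_HZ = 200           # hypothetical periodic component in one channel
tone = [math.sin(2.0 * math.pi * TONE_HZ * n / SAMPLE_RATE_HZ) for n in range(800)]

period = dominant_period(tone, min_lag=20, max_lag=60)
print("period:", period, "samples; frequency:", SAMPLE_RATE_HZ / period, "Hz")
```

Grouping by periodicity then amounts to collecting the channels whose autocorrelation peaks agree with the estimated pitch period.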
|
|
|
|