Gerasimos Potamianos: Research Work
Prof. Potamianos' core research work focuses on
the general area of multimodal and
multisensory perception technologies, primarily speech,
with emphasis on smart space / ambient intelligence environments,
where multiple far-field sensors provide the signals to be processed.
His research activities in this area are currently partially funded
by the EU-FP7 project DIRHA (2012-2014),
performed in collaboration with the project partners;
in the past, this work was supported as part of a Programme on
Intelligence during an earlier appointment,
by EU-funded projects while at IIT,
and during his employment at the IBM
T.J. Watson Research Center.
A related focus of Prof. Potamianos' work since 1996 has been
the problem of audio-visual speech processing,
a research area in which he currently holds an EU Marie Curie grant
(2009-2013), pursued in collaboration with external partners.
Several highlights of work in these areas are given next,
together with additional recent and past research activities.
Research work on the CHIL, NETCARITY, and DICIT EU Projects:
Work under these three projects has broadly aimed
at the development of unobtrusive far-field perception and interaction
technologies using multiple audio and/or visual sensors.
More details on the projects and Prof. Potamianos' work in them follow:
CHIL, standing for "Computers in the Human Interaction Loop",
has been an Integrated Project within the
European Union framework programme,
with the participation of 15 partner sites from nine countries,
under the joint coordination of the
Fraunhofer Institute für Informations- und Datenverarbeitung
(IITB) and the Integrated Systems Lab
of the University of Karlsruhe.
CHIL focused on analyzing, understanding, and facilitating human
interaction during lectures and meetings inside
smart rooms equipped with multiple far-field
audio and visual sensors. Based on these, the CHIL consortium effort
concentrated on detecting, classifying, and understanding human activity
in the space, addressing the basic questions of the "who", "where",
"what", "when", and "how" of the interaction.
In the CHIL vision, computers fade into the background, reduced
to discreet observers of human activity,
ready to proactively and implicitly provide services
supporting the meeting participants.
An overview of the CHIL activities of Dr. Potamianos' team,
while at IBM, can be found in his
relevant keynote talk.
Briefly, these efforts focused on:
Speech technologies using far-field sensors in the CHIL scenarios
of interest, i.e., seminars (lectures/meetings) inside smart rooms.
In particular, work concentrated on
automatic speech recognition (ASR),
speech activity detection (SAD), and
speaker diarization (SPKR) technologies,
focusing on acoustic and language modeling,
as well as multi-channel processing.
The developed technologies were evaluated in the relevant
evaluation campaigns, overseen by the
National Institute of Standards and Technology (NIST),
in 2006 and 2007.
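Multi-channel processing of far-field microphone signals typically begins with some form of beamforming. As a rough, generic illustration only (not the specific front end used in these systems), a minimal delay-and-sum beamformer might look like:

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its (integer-sample) arrival
    delay and average the aligned samples. The steering delays are
    assumed known here; in practice they must be estimated."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc, cnt = 0.0, 0
        for ch, d in zip(channels, delays):
            idx = t + d  # advance the channel to undo its arrival delay
            if 0 <= idx < n:
                acc += ch[idx]
                cnt += 1
        out.append(acc / cnt if cnt else 0.0)
    return out
```

With two channels where the second lags the first by one sample, delays of [0, 1] realign them, so averaging reinforces the source while uncorrelated noise tends to cancel.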
Computer vision technology for tracking
the lecture / meeting participants using multiple
fixed cameras with overlapping views. Relevant work focused on both
3D head- and 2D-face tracking
in the 3D space and available 2D camera views, respectively.
Multiple initialization and tracking algorithms were developed,
including a novel extension of
mean shift tracking to 3D,
adaptive subspace tracking with a "forgetting mechanism",
and a variant of the IBM smart surveillance engine.
The developed technologies were benchmarked in the evaluation campaign
for the Classification of Events, Activities and Relationships (CLEAR).
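The mean-shift trackers mentioned above rest on a simple mode-seeking iteration: repeatedly move the current estimate to the kernel-weighted mean of nearby observations. The function below is a minimal 2D sketch of that core step (the CHIL work extended the idea to 3D; this is an illustrative simplification, not the project code):

```python
import math

def mean_shift(points, start, bandwidth=1.0, iters=50, tol=1e-6):
    """One mean-shift trajectory with a Gaussian kernel: move to the
    kernel-weighted mean of the data until the step size is tiny."""
    x, y = start
    for _ in range(iters):
        wsum = mx = my = 0.0
        for px, py in points:
            d2 = (px - x) ** 2 + (py - y) ** 2
            w = math.exp(-d2 / (2 * bandwidth ** 2))
            wsum += w
            mx += w * px
            my += w * py
        nx, ny = mx / wsum, my / wsum
        if (nx - x) ** 2 + (ny - y) ** 2 < tol ** 2:
            break
        x, y = nx, ny
    return x, y
```

Started near a cluster, the iteration converges to that cluster's density mode; distant points contribute negligibly through the kernel.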
Development of a smart room for data collection.
This was a regular conference room retrofitted with numerous audio-visual
sensors connected to multiple computers.
In particular, the smart room infrastructure contained 9 cameras,
152 microphones, and 7 computers running Linux,
inter-connected via 1 Gb ethernet,
as well as dedicated audio and video data links to the sensors.
Ten interactive seminars (meetings)
were recorded in this smart room in support of the 2006 and 2007
evaluation campaigns, constituting part of the corresponding
evaluation corpora.
NETCARITY has been an Integrated Project within the
European Union framework programme,
with the participation of 15 partners.
The project focused on researching and testing technologies
that help older people improve
their well-being, independence, safety, and health at home.
In particular, Dr. Potamianos' efforts at IBM focused on
the development of acoustic scene analysis
algorithms that can detect dangerous events (for example, falls),
in order to protect elderly people living alone at home.
Such analysis can complement information from
additional sensors, for example a 3D range camera
and a wearable accelerometer,
in order to improve fall detection rate
and reduce false alarms.
Additional efforts of Dr. Potamianos' team, while at IBM, focused on
multi-channel acoustic scene analysis in smart homes
for the classification of activities of daily living (ADLs).
DICIT, standing for "Distant-Talking Interfaces for Control of Interactive TV",
has been a STREP
coordinated by the SHINE
group at FBK,
with the participation of 7 partners.
DICIT has focused on the development of advanced technologies for far-field
speech processing and acoustic scene analysis,
as well as multimodal dialogue design and
natural language interpretation
in multiple languages, with a specific application in mind:
the multimodal control (primarily by voice) of interactive TV systems
from the comfort of one's couch, using microphone arrays
for speech sensory input.
For this project, while at IBM, Dr. Potamianos supervised team work
in the areas of natural language modeling and understanding,
dialogue design and implementation, as well as
development of far-field automatic speech recognition systems for
English and Italian.
Research Work on Audio-Visual Speech Processing:
Audio-visual speech processing has constituted the main emphasis of
Dr. Potamianos' research work since 1996.
The work has been motivated by human speech
production and perception, and the fact that visual information plays an
important role in the latter. Not surprisingly,
visual information turns out to be beneficial
to a number of speech technologies.
For example, it can dramatically improve automatic speech recognition
accuracy in noise (similarly to human lipreading),
it can help disambiguate the "who" and "when" of the active speaker in
multi-party interaction (speech activity detection,
speaker localization), help with person recognition
(identification, verification), as well as improve speech "delivery"
through the use of visual speech synthesis (avatars,
photo-realistic talking faces).
Dr. Potamianos has been actively investigating many of the above aspects
of audio-visual speech processing over the past several years,
with his main concentration being the topic of
audio-visual automatic speech recognition (AVASR).
In particular, he has been working on algorithms for face detection and tracking,
extraction of speech-informative visual features,
integration of audio and visual features into the speech recognition process,
as well as development of AVASR prototypes.
He has published over 60 articles in this general area
that have received over 700 citations in the literature.
Some highlights of his recognized contributions to the field include
his participation in the summer 2000 Workshop
at the Johns Hopkins University
teaching at the 2001 ELSNET Summer School,
a Tutorial at ICIP 2003, Plenary Talks
at AVSP 2003 and VisHCI 2006,
a Panel participation at MMSP 2006,
an IBM Research Accomplishment award in 2002,
and a special mention of the work as part of the 2006
North American Frost and Sullivan Award
for Excellence in Research in the speech recognition field
awarded to the IBM Corporation.
Furthermore, Dr. Potamianos has been a Guest Co-Editor
of two special journal issues on this general topic,
including one in the
IEEE TASLP journal in 2009.
He has also received the Best Paper Award
at ICME 2005 and was co-author of the Best Student Paper
at Interspeech 2007.
Some highlights of Prof. Potamianos' research work
on audio-visual speech technologies include:
Visual front end: Work has focused on face and
facial feature detection and tracking,
as well as the extraction of visual features relevant to speech.
For the former,
AdaBoost and GMM based classification schemes have been used.
For visual feature extraction, the focus has been on appearance-based
methods, employing pattern recognition and image compression
techniques. For example, among others,
linear discriminant analysis, the
discrete cosine transform, and
mutual information approaches have been used.
In addition, comparisons with alternative visual features,
such as active-appearance-model-based ones, have been conducted.
The visual front end has been appropriately extended to handle
profile-view data, as well as mouth-only region data
provided by a specially designed audio-visual headset.
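As a generic illustration of appearance-based visual features, the sketch below computes a 2D DCT of a (hypothetical) mouth-region grayscale block and keeps the lowest-frequency coefficients as the feature vector; this is a simplification of such front ends, not the exact pipeline described above:

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a square grayscale block (list of rows)."""
    n = len(block)
    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for i in range(n):
                for j in range(n):
                    s += (block[i][j]
                          * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                          * math.cos(math.pi * (2 * j + 1) * v / (2 * n)))
            out[u][v] = c(u) * c(v) * s
    return out

def dct_features(block, k):
    """Keep the k lowest-frequency DCT coefficients, ordered by u+v
    (a simple stand-in for zig-zag scanning)."""
    coeffs = dct2(block)
    n = len(block)
    order = sorted((u + v, u, v) for u in range(n) for v in range(n))
    return [coeffs[u][v] for _, u, v in order[:k]]
```

Low-frequency DCT coefficients compactly capture the coarse mouth appearance; a constant block, for instance, yields only a DC term and vanishing AC terms.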
Fusion for speech recognition: Work has focused on
HMM-based integration approaches for optimal gains in recognition performance.
Techniques for feature, decision, and hybrid
fusion have been developed, including
asynchronous integration of the audio and visual streams.
Of particular interest is ongoing research on
stream reliability modeling and integration,
including algorithms for training global
or locally adaptive stream weights.
As a result of this work, large improvements
in speech recognition performance have been demonstrated,
as compared to traditional audio-only ASR systems.
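The stream-weight idea can be illustrated by combining per-class audio and visual log-likelihoods with an exponent weight; the class scores below are purely hypothetical toy values, not results from the systems described:

```python
def fused_log_likelihood(log_p_audio, log_p_video, audio_weight):
    """Exponent-weighted stream combination:
    log p = w * log p_audio + (1 - w) * log p_video."""
    return audio_weight * log_p_audio + (1 - audio_weight) * log_p_video

def classify(audio_scores, video_scores, w):
    """Pick the class with the highest fused score; the arguments are
    dicts mapping class labels to per-stream log-likelihoods."""
    return max(audio_scores, key=lambda c: fused_log_likelihood(
        audio_scores[c], video_scores[c], w))
```

Lowering the audio weight lets the visual stream dominate, which is the desired behavior when the acoustic channel is noisy.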
Prototype systems: Efficient
AVASR algorithms have been implemented
based on the IBM ViaVoice engine platform.
The resulting AVASR prototype operates in real-time,
and it can accept input from various visual sensors,
including a web-cam, a specially designed audio-visual headset,
and a camera for the automobile environment.
Databases: Dr. Potamianos has overseen the collection
of state-of-the-art corpora while at both AT&T and IBM.
In particular, datasets collected during his
tenure at IBM Research allow large-scale AVASR experiments in a
variety of environments,
ranging from controlled settings ("studio"-like)
to significantly more challenging domains,
such as offices, automobiles,
broadcast news, and smart rooms.
In addition to these sets that contain full-face data,
sets containing mouth-only frontal data
using a specially designed
audio-visual headset, as well as
multi-view (frontal / profile) data
using multiple synchronized cameras have also been collected.
All corpora contain a large number of speakers
and both continuous large-vocabulary speech,
as well as a "control" small-vocabulary (connected digits) set.
In addition to AVASR, Prof. Potamianos
has been conducting research
on the problems of audio-visual
speech activity detection,
speech enhancement, and related audio-visual technologies,
all of which share significant commonalities with the AVASR problem.
An overview of his work on audio-visual speech technologies
can be found in his
relevant keynote talk.
Earlier Research Work:
In addition to the above areas,
Prof. Potamianos has worked on statistical language modeling
during a Postdoctoral appointment with the Center
for Language and Speech Processing (CLSP)
at Johns Hopkins (1994-1996)
and on statistical models for image analysis as part of his Ph.D. work
at Hopkins (1990-1994).
In more detail:
Work on Statistical Language Modeling (1994-1996):
Improving the performance of available language models is essential
in the quest for reliable ASR, as well as improvements
in machine translation, optical character recognition, and spelling correction.
Two critical issues in language modeling are the partition of the observed
"history" space into equivalence classes,
as well as the estimation of
conditional probabilities of the next word,
given the observed equivalence class, based on typically sparse data.
In his Postdoctoral research, Dr. Potamianos investigated both problems:
Equivalence classes have been determined by means of
n-gram language models, decision tree,
and decision trellis classifiers, whereas conditional probabilities
have been estimated from sparse data using variants
of the classical deleted interpolation smoothing algorithm.
More specifically, Dr. Potamianos developed a baseline n-gram language model,
employing the widely available Brown corpus.
In the process, he devised an improved smoothing algorithm
that significantly reduced test data perplexity over the traditional
deleted interpolation scheme.
He then studied three variants of decision tree language models for the
same problem, and obtained the best results by using a K-means type
clustering algorithm to design decision tree splits.
The decision tree language model was further improved by a
merge-split algorithm, which converted the decision tree
into a decision trellis. This approach gave encouraging results
on the Brown corpus.
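The deleted-interpolation idea above can be sketched as a linear mixture of bigram and unigram relative frequencies, with the mixture weight estimated on held-out data by EM. This is a generic textbook illustration, not the improved algorithm developed in the work described:

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a training token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    return uni, bi

def interp_prob(w, h, uni, bi, lam, total):
    """Interpolated bigram probability:
    P(w|h) = lam * f_bigram(w|h) + (1 - lam) * f_unigram(w)."""
    f_bi = bi[(h, w)] / uni[h] if uni[h] else 0.0
    f_uni = uni[w] / total
    return lam * f_bi + (1 - lam) * f_uni

def em_lambda(heldout, uni, bi, total, iters=20):
    """Estimate the interpolation weight on held-out bigrams with EM:
    the E-step computes each event's posterior of coming from the
    bigram component, the M-step re-normalizes the expected counts."""
    lam = 0.5
    for _ in range(iters):
        num = den = 0.0
        for h, w in zip(heldout, heldout[1:]):
            f_bi = bi[(h, w)] / uni[h] if uni[h] else 0.0
            f_uni = uni[w] / total
            p = lam * f_bi + (1 - lam) * f_uni
            if p > 0:
                num += lam * f_bi / p
                den += 1
        lam = num / den if den else lam
    return lam
```

When the held-out data is well predicted by the bigram component, EM pushes the weight toward 1; sparser or mismatched data pulls it toward the unigram back-off.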
Ph.D. Thesis Work on Image Analysis
Using Markov Random Fields (1990-1994):
Dr. Potamianos' doctoral thesis focused on the theory and applications of
Markov random fields (MRFs) in image processing and analysis.
MRFs belong to a
well known exponential parametric family of random field models,
and are extensively used for modeling spatial interaction phenomena
in terms of a few parameters. Although conceptually simple,
their probability distribution has a rather involved form,
due to the intractable nature of its normalizing constant.
This constant is known as the partition function
and, in the general case, lacks a closed-form expression,
thus hindering statistical inference (e.g., parameter
estimation and hypothesis testing) of fully or partially observed
MRF images. Reliable estimation of the partition
function (and, hence, of likelihoods) would
allow replacing various ad hoc and only moderately successful MRF parameter
estimation techniques with efficient maximum likelihood parameter estimation.
That would result in improved performance in problems
such as image restoration, segmentation, texture modeling, and classification.
Motivated by the above facts, Dr. Potamianos' dissertation concentrated on
the problem of efficiently estimating the partition function of MRF models.
A stochastic simulation (i.e., Monte Carlo)
approach was proposed for this purpose,
with a number of Monte Carlo algorithms introduced
and rigorously analyzed in terms of computational complexity
and statistical properties of the resulting estimators.
This unified analysis allowed a comparative study of the algorithms,
and was backed up by extensive simulation experiments.
The best such algorithm was then successfully applied
to maximum likelihood statistical inference of MRF images.
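The core Monte Carlo idea can be illustrated on a tiny Ising-type MRF, small enough that the partition function is still computable exactly by enumeration for comparison. The simple-sampling estimator below is only a conceptual sketch, far cruder than the algorithms analyzed in the dissertation:

```python
import itertools
import math
import random

def ising_energy(config, n):
    """Nearest-neighbour Ising energy on an n x n grid, free boundaries;
    config is a flat sequence of +/-1 spins."""
    e = 0
    for i in range(n):
        for j in range(n):
            s = config[i * n + j]
            if j + 1 < n:
                e -= s * config[i * n + j + 1]
            if i + 1 < n:
                e -= s * config[(i + 1) * n + j]
    return e

def exact_partition(n, beta):
    """Brute-force Z by enumerating all 2^(n*n) spin configurations."""
    return sum(math.exp(-beta * ising_energy(c, n))
               for c in itertools.product((-1, 1), repeat=n * n))

def mc_partition(n, beta, samples, rng):
    """Simple-sampling Monte Carlo estimate of the partition function:
    Z = 2^(n*n) * E_uniform[exp(-beta * E(x))]."""
    acc = 0.0
    for _ in range(samples):
        c = [rng.choice((-1, 1)) for _ in range(n * n)]
        acc += math.exp(-beta * ising_energy(c, n))
    return (2 ** (n * n)) * acc / samples
```

For larger grids or lower temperatures, simple sampling becomes hopelessly inefficient, which is precisely why more sophisticated estimators, of the kind studied in the dissertation, are needed.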
Original contributions of the dissertation included:
(a) Development of new partition function estimation (PFE) algorithms.
(b) Analysis and classification of all Monte Carlo PFE algorithms
into two categories, in terms of their computational complexity.
(c) Suggestion of a practical and highly efficient PFE algorithm, based on the
above-mentioned analysis.
(d) Application of this algorithm to Monte Carlo maximum likelihood
based parameter estimation and hypothesis testing of fully and
partially observed MRF images, as well as to image restoration.
Last Updated on April 20, 2012