Gerasimos Potamianos: Research Work

Prof. Potamianos' core research work focuses on the general area of multimodal and multisensory perception technologies, primarily speech, with emphasis on smart space / ambient intelligence environments, where multiple far-field sensors provide the signals to be processed. His research activities in this area are currently partially funded by the EU-FP7 project DIRHA (2012-2014), performed in collaboration with the "Athena" Research Center-IAMU. In the past, this work was supported through the Programme on Ambient Intelligence during his tenure at ICS / FORTH (2008-2009), the EU-funded project INDIGO (2009) while at IIT / NCSR "D", and the projects NETCARITY, DICIT, and CHIL (2004-2008) during his employment at the IBM T.J. Watson Research Center. A related focus of Prof. Potamianos' work since 1996 has been the problem of audio-visual speech processing, a research area in which he currently holds the EU Marie Curie re-integration grant AVISPIRE (2009-2013), in collaboration with IIT / NCSR "D". Several highlights of work in these areas are given next, together with additional recent and past research activities.

Research work on the CHIL, NETCARITY, and DICIT EU Projects: Work under these three projects in general has aimed at the development of unobtrusive far-field perception and interaction technologies using multiple audio and/or visual sensors. More details on the projects and Prof. Potamianos' work in them follow:

CHIL, standing for "Computers in the Human Interaction Loop", has been an Integrated Project within the FP6 European Union framework programme with the participation of 15 partner sites from nine countries, under the joint coordination of the Fraunhofer Institute für Informations- und Datenverarbeitung (IITB) and the Integrated Systems Lab (ISL) of the University of Karlsruhe (UKA), Germany. CHIL focused on analyzing, understanding, and facilitating human interaction during lectures and meetings inside smart rooms equipped with multiple far-field audio and visual sensors. Using these sensors, the CHIL consortium effort concentrated on detecting, classifying, and understanding human activity in the space, addressing the basic questions of the "who", "where", "what", "when", and "how" of the interaction. In the CHIL vision, computers fade into the background, reduced to discreet observers of human activity, ready to provide services proactively and implicitly supporting the meeting participants. An overview of the CHIL activities of Dr. Potamianos' team, while at IBM, can be found in his keynote talk at VisHCI'06. Briefly, these efforts focused on:

  • Speech technologies using far-field sensors in the CHIL scenarios of interest, i.e., seminars (lectures/meetings) inside smart rooms. In particular, work concentrated on automatic speech recognition (ASR), speech activity detection (SAD), and speaker diarization (SPKR) technologies, focusing on acoustic and language modeling, as well as multi-channel processing. The developed technologies were evaluated in the Rich Transcription (RT) evaluation campaigns, overseen by the National Institute of Standards and Technology (NIST) in 2006 and 2007.

  • Computer vision technology for tracking the lecture / meeting participants using multiple fixed cameras with overlapping views. Relevant work focused on both 3D head- and 2D-face tracking in the 3D space and available 2D camera views, respectively. Multiple initialization and tracking algorithms were developed, including a novel extension of mean shift tracking to 3D, adaptive subspace tracking with a "forgetting mechanism", and a variant of the IBM smart surveillance engine. The developed technologies were benchmarked in the evaluation campaign for the Classification of Events, Activities and Relationships (CLEAR).

  • Development of a smart room for data collection and demos. This was a regular conference room retrofitted with numerous audio-visual sensors connected to multiple computers. In particular, the smart room infrastructure contained 9 cameras, 152 microphones, and 7 computers running Linux, inter-connected via 1 Gb ethernet, as well as dedicated audio and video data links to the sensors. Ten interactive seminars (meetings) were recorded in this smart room in support of the 2006 and 2007 RT and CLEAR evaluation campaigns, constituting part of the CHIL corpus.

NETCARITY has been an Integrated Project within the FP6 European Union framework programme with the participation of 15 partners. The project focused on researching and testing technologies that help older people improve their well-being, independence, safety, and health at home. In particular, Dr. Potamianos' efforts at IBM focused on the development of acoustic scene analysis algorithms that can detect dangerous events (for example, falls), in order to protect elderly people living alone at home. Such analysis can complement information from additional sensors, for example a 3D range camera and a wearable accelerometer, in order to improve the fall detection rate and reduce false alarms. Additional efforts of Dr. Potamianos' team, while at IBM, focused on multi-channel acoustic scene analysis in smart homes for the classification of activities of daily living (ADLs).

DICIT, standing for "Distant-Talking Interfaces for Control of Interactive TV", has been a STREP within FP6, coordinated by the SHINE group at FBK with the participation of 7 partners. DICIT has focused on the development of advanced technologies for far-field speech processing and acoustic scene analysis, as well as multimodal dialogue design and natural language interpretation in multiple languages, with a specific application in mind: the multimodal control (primarily by voice) of interactive TV systems from the comfort of one's couch, using microphone arrays for speech sensory input. For this project, while at IBM, Dr. Potamianos supervised team work in the areas of natural language modeling and understanding, dialogue design and implementation, as well as development of far-field automatic speech recognition systems for English and Italian.

Research Work on Audio-Visual Speech Processing: This field has been the main emphasis of Dr. Potamianos' research work since 1996. The work is motivated by human speech production and perception, and by the fact that visual information plays an important role in the latter. Not surprisingly, visual information turns out to be beneficial to a number of speech technologies. For example, it can dramatically improve automatic speech recognition accuracy in noise (similarly to human lipreading), help disambiguate the "who" and "when" of the active speaker in multi-party interaction (speech activity detection, speaker localization), help with person recognition (identification, verification), and improve speech "delivery" through the use of visual speech synthesis (avatars, photo-realistic talking faces). Dr. Potamianos has been actively investigating many of these aspects of audio-visual speech processing over the past several years, with his main concentration being the topic of audio-visual automatic speech recognition (AVASR). In particular, he has been working on algorithms for face detection and tracking, extraction of speech-informative visual features, integration of audio and visual features into the speech recognition process, as well as development of AVASR prototypes. He has published over 60 articles in this general area that have received over 700 citations in the literature.

Some highlights of his recognized contributions to the field include his participation in the summer 2000 Workshop at the Johns Hopkins University (WS'00), teaching at the 2001 ELSNET Summer School, a Tutorial at ICIP 2003, Plenary Talks at AVSP 2003 and VisHCI 2006, a Panel participation at MMSP 2006, an IBM Research Accomplishment award in 2002, and a special mention of the work as part of the 2006 North American Frost and Sullivan Award for Excellence in Research in the speech recognition field awarded to the IBM Corporation. Furthermore, Dr. Potamianos has been a Guest Co-Editor of two special journal issues on this general topic, namely in the EURASIP JASP 2002 and the IEEE TASLP 2009 journals. He has also received the Best Paper Award at ICME 2005 and was co-author of the Best Student Paper at Interspeech 2007.

Some highlights of Prof. Potamianos' research work on audio-visual speech technologies include:

  • Visual front end: Work has been focusing on face and facial feature detection and tracking, as well as the extraction of visual features relevant to speech. For the former, AdaBoost and GMM based classification schemes have been used. For visual feature extraction, the focus has been on appearance-based methods, employing pattern recognition and image compression techniques, including linear discriminant analysis, the discrete cosine transform, and mutual information approaches. In addition, comparisons with alternative visual features have been conducted, for example geometric features and active-appearance model based ones. The visual front end has been extended to handle profile view data, as well as mouth-only region data provided by a specially designed audio-visual headset.

  • Fusion for speech recognition: Work has been focusing on HMM-based integration approaches for optimal gains in recognition performance. Techniques for feature, decision, and hybrid fusion have been developed, including asynchronous integration of the audio and visual streams. Of particular interest is ongoing research on stream reliability modeling and integration, including algorithms for training global or locally adaptive stream weights. As a result of this work, large improvements in speech recognition performance have been demonstrated, as compared to traditional audio-only ASR systems.

  • Prototype systems: Efficient AVASR algorithms have been implemented based on the IBM ViaVoice engine platform. The resulting AVASR prototype operates in real-time, and it can accept input from various visual sensors, including a web-cam, a specially designed audio-visual headset, and a camera for the automobile environment.

  • Databases: Dr. Potamianos has overseen the collection of state-of-the-art corpora while at both AT&T and IBM. In particular, datasets collected during his tenure at IBM Research allow large-scale AVASR experiments in a variety of environments, ranging from controlled settings ("studio"-like) to significantly more challenging domains, such as offices, automobiles, broadcast news, and smart rooms. In addition to these sets that contain full-face data, sets containing mouth-only frontal data using a specially designed audio-visual headset, as well as multi-view (frontal / profile) data using multiple synchronized cameras, have also been collected. All corpora contain a large number of speakers and both continuous large-vocabulary speech and a "control" small-vocabulary (connected digits) set.

In addition to AVASR, Prof. Potamianos has been conducting research on the problems of audio-visual speaker recognition, speech activity detection, speech enhancement, and speech synthesis. All share significant commonalities with the AVASR problem. An overview of his work on audio-visual speech technologies can be found in his keynote talk at VisHCI'06 and invited lecture at ASRU'09.
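
As a rough illustration of the decision fusion with stream weights mentioned in the fusion item above, the following minimal Python sketch combines per-class audio and visual log-likelihoods using exponents that sum to one. It is an assumption-laden toy, not the IBM system: the function names, the two-word vocabulary in the usage note, and the sigmoid mapping from signal-to-noise ratio to audio weight (with its `midpoint_db` and `slope` parameters) are all hypothetical, and the weight is hand-set here rather than trained.

```python
import math


def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight):
    """Weighted sum of per-class log-likelihoods from the two streams.

    audio_weight is the audio stream exponent; the visual stream
    receives 1 - audio_weight, so the two exponents sum to one.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_ll[label] + visual_weight * visual_ll[label]
        for label in audio_ll
    }


def classify(audio_ll, visual_ll, audio_weight):
    """Return the class label with the highest fused score."""
    fused = fuse_log_likelihoods(audio_ll, visual_ll, audio_weight)
    return max(fused, key=fused.get)


def audio_weight_from_snr(snr_db, midpoint_db=10.0, slope=0.3):
    """Hypothetical reliability mapping: trust audio more at high SNR.

    A sigmoid in the SNR (in dB); both parameters are illustrative
    assumptions, standing in for weights that would normally be trained.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))
```

For instance, with audio likelihoods of 0.4 and 0.6 for the words "one" and "nine", and visual likelihoods of 0.8 and 0.2 (all passed as logarithms), `classify` returns "nine" when the audio weight is 0.9 and "one" when it is 0.3: the decision follows whichever stream is deemed more reliable.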

Earlier Research Work: In addition to the above areas, Prof. Potamianos has worked on statistical language modeling during a Postdoctoral appointment with the Center for Language and Speech Processing (CLSP) at Johns Hopkins (1994-1996) and on statistical models for image analysis as part of his Ph.D. work at Hopkins (1990-1994). In more detail:

  • Work on Statistical Language Modeling (1994-1996): Improving the performance of available language models is essential in the quest for reliable ASR, as well as improvements in machine translation, optical character recognition, and spelling correction. Two critical issues in language modeling are the partition of the observed "history" space into equivalence classes, as well as the estimation of conditional probabilities of the next word, given the observed equivalence class, based on typically sparse data. In his Postdoctoral research, Dr. Potamianos has investigated both problems: Equivalence classes have been determined by means of n-gram language models, decision tree, and decision trellis classifiers, whereas conditional probabilities have been estimated from sparse data using variants of the classical deleted interpolation smoothing algorithm. More specifically, Dr. Potamianos developed a baseline n-gram language model, employing the widely available Brown corpus. In the process, he devised an improved smoothing algorithm that significantly reduced test data perplexity over the traditional smoothing approach. He then studied three variants of decision tree language models for the same problem, and obtained the best results by using a K-means type clustering algorithm to design decision tree splits. The decision tree language model was further improved by a merge-split algorithm, which converted the decision tree into a decision trellis. This approach gave encouraging results on the Brown corpus.

  • Ph.D. Thesis Work on Image Analysis Using Markov Random Fields (1990-1994): Dr. Potamianos' doctoral thesis focused on the theory and applications of Markov random fields (MRFs) in image processing and analysis. MRFs belong to a well known exponential parametric family of random field models, and are extensively used for modeling spatial interaction phenomena in terms of a few parameters. Although conceptually simple, their probability distribution has a rather involved form, due to the intractable nature of its normalizing constant. This constant is known as the partition function and, in the general case, lacks a closed-form expression, thus hindering statistical inference (e.g., parameter estimation and hypothesis testing) of fully or partially observed MRF images. Reliable estimation of the partition function (and hence of likelihoods) would allow replacing various ad-hoc and only moderately successful MRF parameter estimation techniques with efficient maximum likelihood parameter estimation. That would result in improved performance in problems such as image restoration, segmentation, texture modeling, and classification. Motivated by these facts, Dr. Potamianos' dissertation concentrated on the problem of efficiently estimating the partition function of MRF models. A stochastic simulation (i.e., Monte Carlo) approach was proposed for this purpose, with a number of Monte Carlo algorithms introduced and rigorously analyzed in terms of computational complexity and statistical properties of the resulting estimators. This unified analysis allowed a comparative study of the algorithms, and was backed up by extensive simulation experiments. The best such algorithm was then successfully applied to maximum likelihood statistical inference of MRF images. Original contributions of the dissertation included: (a) Development of new partition function estimation (PFE) algorithms. (b) Analysis and classification of all Monte Carlo PFE algorithms into two categories, in terms of their computational complexity. (c) Suggestion of a practical and most efficient PFE algorithm, based on the above mentioned analysis. (d) Application of this algorithm to Monte Carlo maximum likelihood based parameter estimation and hypothesis testing of fully and partially observed MRF images, as well as to image restoration.
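
To make the partition function problem above concrete, here is a minimal Python sketch of the simplest possible Monte Carlo estimator, checked against brute-force enumeration on a tiny lattice. It is not one of the dissertation's algorithms: the binary Ising-type model with a single interaction parameter `beta`, the free boundary conditions, the uniform sampling scheme, and all function names are assumptions chosen only for illustration.

```python
import itertools
import math
import random


def neighbor_pairs(rows, cols):
    """Horizontally and vertically adjacent site pairs (free boundaries)."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            site = r * cols + c
            if c + 1 < cols:
                pairs.append((site, site + 1))
            if r + 1 < rows:
                pairs.append((site, site + cols))
    return pairs


def unnormalized_prob(spins, pairs, beta):
    """exp(beta * sum of products over neighboring spins in {-1, +1})."""
    return math.exp(beta * sum(spins[i] * spins[j] for i, j in pairs))


def exact_partition_function(rows, cols, beta):
    """Sum over all 2^(rows*cols) configurations; feasible only when tiny."""
    pairs = neighbor_pairs(rows, cols)
    configurations = itertools.product((-1, 1), repeat=rows * cols)
    return sum(unnormalized_prob(s, pairs, beta) for s in configurations)


def monte_carlo_partition_function(rows, cols, beta, num_samples, seed=0):
    """Estimate Z as 2^n times the mean unnormalized probability
    of configurations drawn uniformly at random."""
    rng = random.Random(seed)
    pairs = neighbor_pairs(rows, cols)
    num_sites = rows * cols
    total = 0.0
    for _ in range(num_samples):
        spins = [rng.choice((-1, 1)) for _ in range(num_sites)]
        total += unnormalized_prob(spins, pairs, beta)
    return (2 ** num_sites) * total / num_samples
```

On a 3 x 3 lattice with a weak interaction, the estimate from `monte_carlo_partition_function` can be compared directly with `exact_partition_function`. The variance of this naive estimator grows quickly with the interaction strength and the lattice size, which illustrates why more efficient estimation algorithms are needed for realistic images.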

Last Updated on April 20, 2012
