Electrical Engineering and Computer Science

AI Seminar & Natural Language Processing Seminar

The Vision-Language Interface in Machines and Humans

Jeffrey Mark Siskind

Associate Professor
School of Electrical and Computer Engineering, Purdue University
Tuesday, November 18, 2014
4:00pm - 5:30pm
3725 BBB

About the Event

In the first part of the talk, I will discuss a method for the vision-language interface in machines: a unified cost function that integrates object detection, tracking, event recognition, and natural language semantics. The roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions, can guide activity recognition, allowing more robust object detection and tracking than is possible without sentential guidance. A general framework scores a video-sentence pair and produces object tracks that delineate the participants in the video that correspond to the sentential roles. This framework supports searching large video corpora for clips that depict a sentential query by scoring each clip and returning a ranked list of clips. Compositional semantics can encode subtle meaning distinctions between two sentences that have the same words but different meanings: `The person rode the horse' vs. `The horse rode the person.' We demonstrate this approach by searching for 141 sentential queries involving people and horses interacting with each other in 10 full-length Hollywood movies.

In the second part of the talk, I will discuss an investigation of the vision-language interface in humans: the question of how the human brain represents simple compositions of constituents (actors, verbs, objects, directions, and locations). Subjects viewed videos during neuroimaging (fMRI) sessions, and sentential descriptions of those videos were identified by decoding the brain representations based only on their fMRI activation patterns. Constituents (e.g., `fold' and `shirt') were independently decoded from a single presentation. Independent constituent classification was then compared to joint classification of aggregate concepts (e.g., `fold-shirt'); results were similar as measured by accuracy and correlation. The brain regions used for independent constituent classification are largely disjoint and largely cover those used for joint classification. This allows recovery of sentential descriptions of stimulus videos by composing the results of the independent constituent classifiers. Furthermore, classifiers trained on the words one set of subjects thinks of when watching a video can recognize sentences a different subject thinks of when watching a different video.

Joint work with Andrei Barbu, Daniel P. Barrett, Wei Chen, N. Siddharth, Caiming Xiong, Haonan Yu, Jason J. Corso, Christiane D. Fellbaum, Catherine Hanson, Stephen Jose Hanson, Sebastien Helie, Evguenia Malaia, Barak A. Pearlmutter, Thomas Michael Talavage, and Ronnie B. Wilbur.


Jeffrey M. Siskind received the B.A. degree in computer science from the Technion, Israel Institute of Technology, Haifa, in 1979, the S.M. degree in computer science from the Massachusetts Institute of Technology (M.I.T.), Cambridge, in 1989, and the Ph.D. degree in computer science from M.I.T. in 1992. He did a postdoctoral fellowship at the University of Pennsylvania Institute for Research in Cognitive Science from 1992 to 1993. He was an assistant professor at the University of Toronto Department of Computer Science from 1993 to 1995, a senior lecturer at the Technion Department of Electrical Engineering in 1996, a visiting assistant professor at the University of Vermont Department of Computer Science and Electrical Engineering from 1996 to 1997, and a research scientist at NEC Research Institute, Inc. from 1997 to 2001. He joined the Purdue University School of Electrical and Computer Engineering in 2002, where he is currently an associate professor. His research interests include machine vision, artificial intelligence, cognitive science, computational linguistics, child language acquisition, and programming languages and compilers.

Additional Information

Sponsor(s): Toyota

Open to: Public