Schedule

9:00 - 9:05 (PST) Welcome
9:05 - 11:00 (PST) Paper session (Session chair: Arsha Nagrani)

[Paper]  [Video] A Local-to-Global Approach to Multi-modal Movie Scene Segmentation Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin
[Paper]  [Video] Audio-Visual SfM towards 4D reconstruction under dynamic scenes Takashi Konno, Kenji Nishida, Katsutoshi Itoyama, Kazuhiro Nakadai
[Paper]  [Video] Co-Learn Sounding Object Visual Grounding and Visually Indicated Sound Separation in A Cycle Yapeng Tian, Di Hu, Chenliang Xu
Q&A session
[Paper]  [Video] Deep Audio Prior: Learning Sound Source Separation from a Single Audio Mixture Yapeng Tian, Chenliang Xu, Dingzeyu Li
[Paper]  [Video] Weakly-Supervised Audio-Visual Video Parsing Toward Unified Multisensory Perception Yapeng Tian, Dingzeyu Li, Chenliang Xu
[Paper]  [Video] What comprises a good talking-head video generation? Lele Chen, Guofeng Cui, Ziyi Kou, Haitian Zheng, Chenliang Xu
Q&A session
[Paper]  [Video] A Two-Stage Framework for Multiple Sound-Source Localization Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin
[Paper]  [Video] BatVision with GCC-PHAT Features for Improved Sound to Vision Predictions Jesper Christensen, Sascha A Hornauer, Stella Yu
[Paper]  [Video] Heterogeneous Scene Analysis via Self-supervised Audiovisual Learning Di Hu, Zheng Wang, Haoyi Xiong, Dong Wang, Feiping Nie, Dejing Dou
Q&A session
[Paper]  [Video] Does Ambient Sound Help? - Audiovisual Crowd Counting Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing Dou, Xiaoxiang Zhu
[Paper]  [Video] An end-to-end approach for visual piano transcription A. Sophia Koepke, Olivia Wiles, Yael Moses, Andrew Zisserman
[Paper]  [Video] Visual Self-Supervision by Facial Reconstruction for Speech Representation Learning Abhinav Shukla, Stavros Petridis, Maja Pantic
Q&A session


11:00 - 11:30 (PST) Invited talk
[Video]
Lorenzo Torresani
Self-supervised Video Models from Sound and Speech
11:30 - 12:00 (PST) Invited talk
[Video]
Linda Smith
Sight, sounds, hands: Learning object names from the infant point of view
 
12:00 - 12:30 (PST) Invited talk
[Video]
Adam Finkelstein
Optical Audio Capture: Recovering Sound from Turn-of-the-century Sonorine Postcards
 

12:30 - 2:00 (PST) Invited paper talks (Session chair: Ruohan Gao)

[Paper]  [Video] What Makes Training Multi-Modal Classification Networks Hard? Weiyao Wang, Du Tran, Matt Feiszli
[Paper]  [Video] Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C V Jawahar
[Paper]  [Video] Multi-modal Self-Supervision from Generalized Data Transformations Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi
Q&A session
[Paper]  [Video] VGGSound: A Large-Scale Audio-Visual Dataset Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
[Paper]  [Video] Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
[Paper]  [Video] Epic-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
[Paper]  [Video] Telling Left From Right: Learning Spatial Correspondence of Sight and Sound Karren Yang, Bryan Russell, Justin Salamon
Q&A session


2:00 - 2:30 (PST) Invited talk
[Video]
Doug James
Advances in Audiovisual Simulation
 
2:30 - 3:00 (PST) Invited talk
[Video]
David Harwath
Vision as a Rosetta Stone for Speech
 
3:00 - 3:30 (PST) Invited talk
[Video]
Kristen Grauman
Sights, Sounds, and 3D Spaces
 

Summary


In recent years, there have been many advances in learning from visual and auditory data. While these modalities have traditionally been studied in isolation, researchers have increasingly been developing algorithms that learn from both. This has produced exciting developments in automatic lip-reading, multi-modal representation learning, and audio-visual action recognition.

Since nearly every internet video has an audio track, the prospect of learning from paired audio-visual data, whether through new forms of unsupervised learning or by simply incorporating sound into existing vision algorithms, is intuitively appealing, and this workshop will cover recent advances in this direction. But it will also touch on higher-level questions, such as what information sound conveys that vision doesn't, the merits of sound versus other "supplemental" modalities such as text and depth, and the relationship between visual motion and sound. We'll also discuss how these techniques are being used to create new audio-visual applications, such as in speech processing and video editing.

Previous workshops: 2018, 2019

Presentation instructions

  • Authors of accepted papers can present a 5-minute (or shorter) talk about their work. Please submit the video by June 13th (11:59 PST) to CMT, following the CVPR oral instructions here (uploading as a .mp4 file).
  • We'll have a paper presentation session from 9:00am to 11:00am PST on June 15. During this session, we'll play the pre-recorded talks, with time for Q&A from authors (if they are present). We'll also release the videos on our website for offline viewing.
  • Please also submit the camera-ready version of your paper via CMT by June 13th (11:59 PST). Papers will be available on our website.
  • Looking forward to seeing you there!

Organizers


Andrew Owens
University of Michigan

Jiajun Wu
Stanford

Ruohan Gao
UT Austin

Arsha Nagrani
Oxford

Hang Zhao
Waymo

William Freeman
MIT/Google

Andrew Zisserman
Oxford

Jean-Charles Bazin
KAIST

Antonio Torralba
MIT

Kristen Grauman
UT Austin / Facebook