Paper Interaction Tool (Link)

Main conference program guide (Link)

Dataset Contributions table can be found (here) 

2-page "At-a-Glance" summary (Link)


Workshop/Tutorial Program (Link)

At a Glance Workshop/Tutorials (Link)


06/16  Virtual links will be emailed on Friday 6/17 and also posted here. There will be a virtual POSTER SESSION held on June 28th at 10:00 AM CDT and 10:00 PM CDT.  Authors can choose to attend either or both for additional Q&A.  The papers presented will be available here.


All times are based on the US Central Time Zone.

Registration Hours:

Saturday 6/18       4PM – 6PM
Sunday 6/19         7AM – 5PM
Monday 6/20         7AM – 5PM
Tuesday 6/21        7AM – 5PM
Wednesday 6/22      7:30AM – 4PM
Thursday 6/23       8AM – 2PM
Friday 6/24         8AM – 2PM

Registration will be in the Mosaic Lounge just inside the Julia Street entrance of the convention center.  Please have your CLEAR or SafeExpo app updated and your QR code or confirmation number ready for quicker badge retrieval. If you did not complete the CLEAR or SafeExpo process, please remember to bring your valid PCR test results and check in at the far end of the registration counters. Reminder: there will be NO BADGE reprints. Anyone caught sharing CVPR badges in person or online will be expelled from CVPR.


Program Overview

Main Conference

Tuesday, June 21 - Friday, June 24, 2022
Tuesday, June 21 - Thursday, June 23, 2022

Workshops & Tutorials

Sunday, June 19 & Monday, June 20, 2022

CVPR Evening Event

Wednesday, June 22, 2022


Paper Sessions    https://docs.google.com/spreadsheets/d/1ilb2Wqea0uVKxlJsdJHR-3IVaWeqsWyk2M2aGmYVAPE


KEYNOTE                          Tuesday, June 21               5:00PM  CDT

Josh Tenenbaum     Professor    Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology


Josh Tenenbaum is Professor of Computational Cognitive Science at the Massachusetts Institute of Technology in the Department of Brain and Cognitive Sciences, the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Center for Brains, Minds and Machines (CBMM).  He received a BS from Yale University (1993) and a PhD from MIT (1999). His long-term goal is to reverse-engineer intelligence in the human mind and brain, and use these insights to engineer more human-like machine intelligence.  In cognitive science, he is best known for developing theories of cognition as probabilistic inference in structured generative models, and applications to concept learning, causal reasoning, language acquisition, visual perception, intuitive physics, and theory of mind.  In AI, he and his group have developed widely influential models for nonlinear dimensionality reduction, probabilistic programming, and Bayesian unsupervised learning and structure discovery. His current research focuses on the development of common sense in children and machines, common sense scene understanding in humans and machines, and models of learning as program synthesis. His work has been recognized with awards at conferences in Cognitive Science, Philosophy and Psychology, Computer Vision, Neural Information Processing Systems, Reinforcement Learning and Decision Making, and Robotics. He is the recipient of the Troland Research Award from the National Academy of Sciences (2012), the Howard Crosby Warren Medal from the Society of Experimental Psychologists (2015), the R&D Magazine Innovator of the Year (2018), and a MacArthur Fellowship (2019), and he is an elected member of the American Academy of Arts and Sciences.


Learning to see the human way


Computer vision is one of the great AI success stories.  Yet we are still far from having machine systems that can reliably and robustly see everything a human being sees in an image or in the real world.  Despite rapid advances in self-supervised visual and multimodal representation learning, we are also far from having systems that can learn to see as richly as a human does, from so little data, or that can learn new visual concepts or adapt their representations as quickly as a human does. And even today's remarkable generative image synthesis systems imagine the world in a very different and fundamentally less flexible way than human beings do.  How can we close these gaps?  I will describe several core insights from the study of human vision and visual cognitive development that run counter to the dominant trends in today's computer vision and machine learning world, but that can motivate and guide an alternative approach to building practical machine vision systems.  

Technically, this approach rests on advances in differentiable and probabilistic programming: hybrids of neural, symbolic and probabilistic modeling and inference that can be more robust, more flexible and more data-efficient than purely neural approaches to learning to see. New probabilistic programming platforms offer to make these approaches scalable as well. Conceptually, this approach draws on classic proposals for understanding vision as "inverse graphics", "analysis by synthesis" or "inference to the best explanation", and the notion that at least some high-level architecture for scene representation is built into the brain by evolution rather than learned from experience, reflecting invariant properties of the physical world. Learning then enables, enriches and extends these built-in representations; it does not create them from scratch. I will show a few examples of recent machine vision successes based on these ideas, from our group and others.  But the hardest problems are still very open.  I will highlight some "Grand Challenge" tasks for building machines that learn to see like people: problems that far outstrip the abilities of any current system, and that I hope can inspire the next steps towards progress for computer vision researchers regardless of which approach they favor.


KEYNOTE                          Wednesday, June 22               5:00PM  CDT

Xuedong Huang     Technical Fellow     Chief Technology Officer Azure AI


Dr. Xuedong Huang is a Technical Fellow in the Cloud and AI group at Microsoft and the Chief Technology Officer for Azure AI. Huang oversees the Azure Cognitive Services team and manages researchers and engineers around the world who are building artificial intelligence-based services to power intelligent applications. He has held a variety of responsibilities in research, incubation, and production to advance Microsoft’s AI stack from deep learning infrastructure to enabling new experiences. In 1993, Huang joined Microsoft to establish the company's speech technology group, where he led Microsoft’s spoken language efforts for several decades. In addition to bringing speech recognition to the mass market, Huang led his team in achieving historic human parity milestones in speech recognition, machine translation, conversational question answering, machine reading comprehension, image captioning and commonsense language understanding. Huang’s team provides AI-based Azure Vision, Speech, Language and Decision services that power popular worldwide applications, from Microsoft products such as Office 365, Dynamics, Xbox and Teams to numerous third-party services and applications. Huang is a recognized executive and scientific leader for his contributions to the software and AI industry. He is an Institute of Electrical and Electronics Engineers (IEEE) and Association for Computing Machinery (ACM) Fellow. Huang was named Asian American Engineer of the Year (2011), one of Wired Magazine's 25 Geniuses Who Are Creating the Future of Business (2016), and one of AI World's Top 10 (2017). Huang holds over 170 patents and has published over 100 papers and two books.

Before Microsoft, Huang received his PhD in Electrical Engineering from the University of Edinburgh and was on the faculty at Carnegie Mellon University. He is a recipient of the Allen Newell Award for Research Excellence.

Toward Integrative AI with Computer Vision

Abstract:
The pace of innovation in AI over the past decade has been remarkable. Thanks to big data, computing power and modern network architecture, we are seeing a wave of continuous breakthroughs find their way into people’s everyday lives. In my role as a technology leader developing both the science and engineering of AI, I see the unrealized potential of making AI more useful in the real world for every person and organization. While modern AI has reached human parity on a few well-defined, narrow research benchmarks such as speech recognition, SuperGLUE, and image captioning, a rapidly growing number of disjointed AI tasks are needed to mimic human intelligence in understanding the open and complex world. As each AI task is often defined by the statistics manifested from large amounts of task-specific data, we end up building expensive silos without a synergistic way of knowledge sharing and transferring among the different AI tasks.

In this keynote, I will share our progress on Integrative AI, a multi-lingual, multi-modal approach addressing this challenge using a holistic, semantic representation to unify various tasks in speech, language, and vision. To apply Integrative AI to computer vision, we have been developing a foundation model called Florence, which introduces the concept of a semantic layer through large-scale image and language pretraining. Our Florence model distills visual knowledge and reasoning into an image and text transformer to enable zero-shot and few-shot capabilities for common computer vision tasks such as recognition, detection, segmentation, and captioning. By bridging the gap between textual representation and various downstream vision tasks, we not only achieved state-of-the-art results on dozens of benchmarks such as ImageNet-1K zero-shot, COCO segmentation, VQA and Kinetics-600, but also discovered novel results in image understanding.

We are encouraged by these preliminary successes and believe we have only scratched the surface of Integrative AI. As language is at the core of human intelligence, I foresee that the semantic layer will empower computer vision to go beyond visual perception and connect pixels seamlessly to the core of human intelligence: intent, reasoning, and decision. 


KEYNOTE                          Thursday, June 23               5:00PM  CDT

Kavita Bala    Dean     Ann S. Bowers College of Computing and Information Science, Cornell University


Kavita Bala is the dean of the Ann S. Bowers College of Computing and Information Science at Cornell University. Bala received her S.M. and Ph.D. from the Massachusetts Institute of Technology (MIT). Before becoming dean, she served as the chair of the Cornell Computer Science department.  Bala leads research in computer vision and computer graphics, spanning recognition and visual search; material modeling and acquisition using physics and learning; physically based scalable rendering; and perception. She co-founded GrokStyle, a visual recognition AI company that drew IKEA as a client and was acquired by Facebook in 2019.

Bala is the recipient of the SIGGRAPH Computer Graphics Achievement Award (2020) and the IIT Bombay Distinguished Alumnus Award (2021), and she is an Association for Computing Machinery (ACM) Fellow (2019) and Fellow of the SIGGRAPH Academy (2020). Bala has received multiple Cornell teaching awards (2006, 2009, 2015).  Bala has authored the graduate-level textbook “Advanced Global Illumination” and has served as the Editor-in-Chief of Transactions on Graphics (TOG).

Understanding Visual Appearance from Micron to Global Scale


We use visual input to understand and explore the world we live in: the shapes we hold, the materials we touch and feel, the activities we do, the scenes we navigate, and more.  With the ability to image the world, from micron resolution in CT images to planet-scale satellite imagery, we can build a deeper understanding of the appearance of individual objects, and also build our collective understanding of world-scale events as recorded through visual media.

In this talk, I will describe my group's research on better visual understanding, including graphics models for realistic visual appearance and rendering, reconstruction of shape and materials, and visual search and recognition for world-scale discovery of visual patterns and trends across geography and time.


PANEL                          Friday, June 24               5:00PM  CDT

Embodied Computer Vision

Moderator: Martial Hebert (CMU)


Martial Hebert is University Professor at Carnegie Mellon University, where he is currently Dean of the School of Computer Science. His research interests include computer vision, machine learning for computer vision, 3-D representations and 3-D computer vision, and autonomous systems.

Kristen Grauman (UT Austin, Meta AI)


Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Director in Facebook AI Research (FAIR).  Her research in computer vision and machine learning focuses on visual recognition, video, and embodied perception.  Before joining UT-Austin in 2007, she received her Ph.D. at MIT.  She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and recipient of the 2013 Computers and Thought Award.  She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award).  She has served as Associate Editor-in-Chief for PAMI and Program Chair of CVPR 2015 and NeurIPS 2018.

Nicholas Roy (Zoox, MIT)


Nicholas Roy is a Principal Software Engineer at Zoox and the Bisplinghoff Professor of Aeronautics & Astronautics at MIT. He received his B.Sc. in Physics and Cognitive Science in 1995 and his M.Sc. in Computer Science in 1997, both from McGill University. He received his Ph.D. in Robotics from Carnegie Mellon University in 2003. He founded and led Project Wing at Google [X] from 2012 to 2014, and is the Director of Systems Engineering under MIT's Quest for Intelligence (on leave). He and his students have received best paper or best student paper awards at many conferences (ICRA, RSS, ICMI, ICAPS, PLANS). He has made research contributions to planning under uncertainty, autonomous navigation and motion planning, machine learning, human-computer interaction and aerial robotics.

Michael Ryoo (Stony Brook University, Google)


Michael S. Ryoo is a SUNY Empire Innovation Associate Professor in the Department of Computer Science at Stony Brook University, and is also a staff research scientist at Robotics at Google. His research focuses on video representations, self-supervision, neural architectures, egocentric perception, and robot action policy learning. He was previously an assistant professor at Indiana University Bloomington and a staff researcher within the Robotics Section of NASA's Jet Propulsion Laboratory (JPL). He received his Ph.D. from the University of Texas at Austin in 2008 and his B.S. from Korea Advanced Institute of Science and Technology (KAIST) in 2004. His paper on robot-centric activity recognition at ICRA 2016 received the Best Paper Award in Robot Vision. He has given a number of tutorials at past CVPRs (including CVPR 2022/2019/2018/2014/2011) and organized previous workshops on Egocentric (First-Person) Vision.

PC Hosts:
Kristin Dana (Rutgers University, Steg AI)
Dimitris Samaras (Stony Brook University)


This year CVPR is celebrating Juneteenth in a city with outsize significance in Black history. As a nod to the Black artistic and musical tradition of New Orleans, CVPR is working with a local artist to feature a special screen print on the back of the conference t-shirt.


Self-taught artist M. Sani was born in Cameroon, also known as the “hinge of Africa.” He left his country to pursue his dream of sharing his art with the world and settled briefly in Lafayette, Louisiana, before moving to New Orleans. After Katrina hit, he was forced to close his art gallery and moved to St. Louis, Missouri. Five years later, unwilling to give up on his dream, he re-opened the art gallery on the famed Royal Street in the French Quarter of New Orleans. His work has been featured at the Festival National des Arts et de la Culture (FENAC) in Cameroon, where he represented the Adamawa region in Ngaoundéré and Ebolowa. He has also had art placements in the New Orleans Jazz and Heritage Festival and other premier festivals. M. Sani is a working artist in an ever-changing art culture.

His work has been described as bold, abstract, and primordial. “My inspirations come from my dreams and nature,” M. Sani says. “Each of my paintings tells a history. You just need to observe and listen.” He uses oil, acrylic, mixed-media, and collage techniques. His process includes dripping, squeezing, and splashing paint. He says he sometimes paints on reclaimed wood from old historic houses. His work combines the cultures of Cameroon and New Orleans. He likes to make people think with his work. “Some important themes such as the spiritual influence of religions, tribal ceremonies, and music are present in my work,” M. Sani says. When M. Sani says he’s working on his musicians, he is referring to his All That Jazz, Let’s Jazz it Up, and Colors of Jazz collections. The collections often feature silhouettes of musicians playing jazz into the early hours of the morning. M. Sani is looking to turn his gallery into an art space and hopes to bring back the “Little Artists Session,” a workshop for kids who want to express themselves creatively. He is also looking into selling his work virtually as NFTs.