Thoughts on the next 20 years in computer vision

April 6, 2016


The past few years have brought a remarkable change in the public's awareness and acceptance of computer vision systems. In the 1980s, we were often met with disbelief, a chuckle, or a puzzled look when we talked about our research, even from other scientists and engineers. Others thought it trivial because "any three-year-old can do that." Now, nearly everyone is aware of self-driving cars, medical image analysis, face recognition, and more. It's no longer "science fiction"; it's engineering fact, and it's on the evening news and in an application on your phone.

The progress from a small group of researchers working in a niche area to a range of commercial successes is a credit to our persistence as a scientific community. The public's recognition of computer vision's potential, with all the benefits (and harms) that it can bring, opens significant new application domains, brings new sources of funding, creates economic opportunity, and suggests fascinating new problems to address. Faster, smaller, cheaper processors and smaller, more rugged sensors have done their part. Our work is now entering societal and legal domains that bring new levels of scrutiny and constraint; this process will continue unabated for some time.

Machine learning's role, while important, will be less significant than many now believe. As we encounter its limitations, we will rediscover vision applications that go beyond detection. We will tackle important problems that come without large training sets and well-defined ground truth, and which must be solved on smaller, often mobile, platforms. As we do, we will also develop rich, yet compact, representations and rediscover the value of a good model. There will be renewed work in object class inference, perceptual organization, and other middle-level problems that learning allowed us to bypass for a time, but which it seems ill-suited to address. General vision systems must infer identity and function for things never before encountered, or for things (such as weather systems) that defy traditional modeling or fail to provide a suitably rich training set. These systems will require data structures and fast algorithms for handling structural representations, combined with appropriate measures of uncertainty or confidence.

With that said, I do not envision a strict dichotomy between learning-based and other approaches. Rather, I foresee a spectrum of blended approaches, often coupled to data analytics for the analysis, retrieval, and cataloging of video and audio data collected from, for example, surveillance, portable, and wearable cameras. It will be essential to extract information reliably and quickly on demand, spatio-temporally registered with other data. There will be legal implications and societal and cultural sensitivities. The processing of video data acquired in public spaces will be held to legal standards of evidence, and image and video forensics will become extremely important for verifying results used in legal proceedings.