I am Efstratios Gavves and I am a researcher in the area of Computer Vision and Machine Learning. Computer Vision has been gaining more and more popularity lately, especially with the expensive and impressive buy-outs of spin-off research companies by tech giants. Is computer vision, however, just hype, a trend likely to be forgotten in a couple of years, or is it here to stay? Since explaining things helps one understand them better and also improves one's narrative skills, I thought I would start (yet another :-P) blog about Computer Vision and Machine Learning. I will often try to discuss things from a more humane (hopefully) perspective, without many equations, so that even non-specialists can understand what Computer Vision is about. Today I will start with my first blog post and discuss a brief history of Computer Vision.
The early days
The first attempt to solve (!) the problem of Computer Vision was made by Seymour Papert. This first attempt was brave enough to be referred to as the summer vision project. Apparently, the summer vision project was not as successful as intended, otherwise we would not be here talking about it. And the reason why it was not successful is that computer vision is more complicated than most people would think. It is not about translating the lights and the colors and the shades into pixels. It is about translating the pixels into abstract mathematical concepts, which challenge basic philosophical definitions: what is a concept, what is an object, and why is a chair a chair? In fact, according to cognitive research the human brain, an energy-devouring organ, devotes somewhere between 40% and 70% of its capacity to processing the visual signals that the eye receives, which is not a coincidence. So, why is Computer Vision so difficult?
Understanding the difficulty of computer vision is extremely challenging, as we humans have been training ourselves to recognize our environment ever since we were newborns. Newborns, by the way, can see only very blurry images for quite a long time. Perhaps a good example of how difficult vision is can be found in the picture on the left (this picture was also on the cover of my thesis). In this picture we have what appears to be a simple triangle on a table. Now, try to explain this picture. Does it feel right? Apparently not! There is something definitely off with this picture. So, if such an ordinary image causes so much trouble for a human with so many years of training, imagine the trouble we need to go through to teach a computer how to see.
As in many other disciplines of science and engineering, the first thing to come to mind for solving vision-related problems was models inspired by the brain. In 1958 Frank Rosenblatt presented his new algorithm, the perceptron, which is a form of neural network. Rosenblatt demonstrated his perceptron on automatically classifying images containing tanks camouflaged in a forest, as compared to images that showed forests only. Although in the research experiments the algorithm seemed to be successful enough, it turned out to fail in field tests. The reason was that the pictures were heavily biased towards certain weather conditions: the images with the tanks were taken on cloudy days, whereas the images with just the forest were taken on sunny days. Hence, the algorithm did not learn to recognize tanks, but to tell sunny days from cloudy days. Soon, and despite their natural appeal, neural networks fell into disdain, although later research proved that this was unjustified to a large extent.
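To make the tank story a bit more concrete, here is a minimal sketch of the classic perceptron learning rule in plain Python. The single "mean brightness" feature and all the numbers are made up for illustration and are not taken from Rosenblatt's experiments. They only mimic the bias described above: every tank picture is dark (cloudy) and every forest picture is bright (sunny).

```python
def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Perceptron rule: adjust the weights only when a sample is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * v for w, v in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0
            error = y - predicted  # 0 when the prediction is already correct
            weights = [w + lr * error * v for w, v in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 ("tank") or 0 ("forest only")."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


# Hypothetical data: one feature per image, its mean brightness in [0, 1].
# Label 1 = tank, label 0 = forest only. Tanks happen to be photographed
# on cloudy (dark) days, forests on sunny (bright) days.
train_x = [[0.20], [0.25], [0.30], [0.80], [0.85], [0.90]]
train_y = [1, 1, 1, 0, 0, 0]

weights, bias = train_perceptron(train_x, train_y)
train_accuracy = sum(
    predict(weights, bias, x) == y for x, y in zip(train_x, train_y)
) / len(train_x)

# A tank photographed on a sunny day: bright image, true label 1.
sunny_tank_prediction = predict(weights, bias, [0.85])

print("training accuracy:", train_accuracy)
print("prediction for a sunny tank:", sunny_tank_prediction)
```

On this toy data the perceptron separates the training set perfectly, yet it labels the sunny tank as "forest only", because brightness is the only thing it could learn from.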
The middle ages
After this early catastrophe, which stalled scientific progress in AI and computer vision, researchers focused mostly on solving image processing problems. Image processing involves pixel-level operations, like finding the edges in an image, applying many of the cool filters that Photoshop has, or compressing images without losing the essential content. Although this progress was great, and especially helpful for practical, industrial applications, it was not what Computer Vision was meant to be, namely a way to interpret the visual world. Then, in the early 90’s, neural networks appeared again, this time under the name convolutional neural networks, and were able to solve a problem that was challenging for its time: digit recognition for bank cheques. Despite their success, neural networks were still unable to perform well on harder tasks with three-dimensional objects in unconstrained pictures. Once again, they were not favored by the computer vision community.
The golden years
And all of a sudden we are in the 90’s. Together with the Backstreet Boys and the Spice Girls, the 90’s and 00’s also saw the actual birth of modern Computer Vision. All of a sudden a plethora of methods were proposed to tackle generic, hardcore computer vision problems, such as object classification, object detection and segmentation, face recognition, etc., which I will not cover now, as this history of Computer Vision would no longer be brief. What was the result? We started having “smart” cameras that were able to detect our faces (yes, that was a paper about 15 years ago). We witnessed the baby steps of applications like Google Goggles. We welcomed cool hardware like the Kinect. We started having smart parking systems in cars, creepy robots playing rock, paper, scissors and many, many more cool applications. Personally, I attribute the big bang of Computer Vision to four reasons.
One reason for the sudden success was the discovery of key feature extraction and representation algorithms. The introduction of the SIFT feature in 1999 opened the doors to viable object recognition. The SIFT feature was efficient and accurate, and it allowed for a precise comparison between the same object in different images. At the same time local keypoint extraction algorithms started springing up, which were able to “magically” discover the “interesting” locations in an image. Also, normalized cuts were able to segment the object boundaries from a noisy scene, therefore allowing a more refined recognition focused on the object only and discarding the background. And finally, perhaps the king of them all together with SIFT, the Bag-of-Words. The Bag-of-Words was a method proposed for describing the content of an image in a very simple and straightforward manner. The innovation of the Bag-of-Words was its very simplicity. Until then, researchers had been trying to accurately model the three-dimensional geometry of objects, often building models that were either geometrically accurate but computationally too complex, or computationally feasible but overly simplistic. Bag-of-Words, however, suggested that geometry is not as important, or to phrase it better, that it is not worth the effort. Instead, we should focus only on what an image looks like when we look at small patches and mix them together, ignoring where each patch came from. See the picture on the left for example. We can understand that we have an airplane, or a face, or a bike in the respective pictures without really having seen the whole picture. In essence, the Bag-of-Words model attempted to reach the final goal of classifying the image content without trying to solve an intermediate problem, that of describing (geometrically) all the objects that can appear in an image (as the father of statistical learning theory would also advise). Needless to say, Bag-of-Words caused a seismic shift in Computer Vision research.
Researchers now applied SIFT with Bag-of-Words on every other vision problem. Not long after, robust object detection was also within reach.
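For the curious, the core of the Bag-of-Words idea fits in a few lines. The sketch below is only an illustration under simplifying assumptions: the "patch descriptors" are made-up two-dimensional points rather than real 128-dimensional SIFT descriptors, and the codebook of "visual words" is written by hand, whereas in practice it is usually learned by clustering (for example with k-means) the descriptors of many training images. The function names are my own and do not come from any library.

```python
from collections import Counter


def nearest_word(descriptor, codebook):
    """Index of the visual word closest to the descriptor (squared Euclidean distance)."""
    def squared_distance(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def bag_of_words(descriptors, codebook):
    """Normalized histogram of visual-word counts; patch positions are ignored."""
    if not descriptors:
        return [0.0] * len(codebook)
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts[i] / len(descriptors) for i in range(len(codebook))]


# Hypothetical codebook with three visual words.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical descriptors of four patches extracted from one image.
patches = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.8), (1.1, -0.1)]

histogram = bag_of_words(patches, codebook)

# Changing the order of the patches changes nothing: the spatial layout is thrown away.
reordered = bag_of_words(list(reversed(patches)), codebook)

print(histogram)  # [0.25, 0.5, 0.25]
print(reordered == histogram)  # True
```

The resulting histogram is the whole description of the image. It says how often each kind of patch appears and nothing about where, which is exactly the "geometry is not worth the effort" bet.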
Another very influential factor in the sudden rise of Computer Vision was the introduction of a powerful learning algorithm, the support vector machine, or SVM as most people call it. Support vector machines are general-purpose, out-of-the-box classifiers. A classifier is a “black box”, which gets as input examples of things (images, documents, songs, whatever) and returns as output a classification line. You can imagine the classification line as a sieve, with our images being the seeds. Is the seed/image smaller than the sieve’s holes? Then the seed/image is a “car”. Otherwise it is not a car. What support vector machines brought was a very robust, accurate and easy-to-use classifier.
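To give a feeling of what the "classification line" looks like, below is a rough sketch of a linear SVM in plain Python. It is trained with simple subgradient descent on the regularized hinge loss, which is one of several ways to train an SVM and not necessarily what off-the-shelf packages do internally. The two-dimensional data, the learning rate `lr`, the regularization strength `lam` and the number of epochs are all arbitrary choices made for illustration.

```python
def train_linear_svm(samples, labels, epochs=100, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularizer always shrinks the weights a little (prefers a wide margin).
            w = [wi - lr * lam * wi for wi in w]
            # Samples inside the margin or on the wrong side pull the line towards them.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Which side of the classification line w.x + b = 0 does x fall on?"""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical two-dimensional features: +1 = "car", -1 = "not a car".
train_x = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
train_y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(train_x, train_y)
train_predictions = [classify(w, b, x) for x in train_x]

print("weights:", w, "bias:", b)
print("training predictions:", train_predictions)
print("new examples:", classify(w, b, [4, 4]), classify(w, b, [-4, -4]))
```

Once training is done, the whole "black box" boils down to the weights `w` and the bias `b`. Classifying a new image is just checking on which side of the line its features fall, which is part of why SVMs were so easy to use.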
The revolution in Computer Vision research was further nourished by the introduction of public, open challenges (like here and here), where everyone could participate with his/her favorite algorithm in classifying images, videos and more. These challenges were a catalyst. For one, they provided a common ground for discussion. Researchers now met regularly at conferences and workshops to discuss further developments and had a fair basis for comparison. Research was no longer restricted to dark laboratories in basements. Everybody was working on the same, clear target. What is more, open challenges led to open databases. Everyone could now download large amounts of data, and data is very important for all data sciences, computer vision and machine learning included. Finally, public challenges and the general openness led to a more transparent and open culture, with code often being publicly available either from the researchers themselves or from open-source third parties.
Interestingly, during the golden decade of Computer Vision research there were two more revolutions. First, the social media revolution. People were now sharing everything with others, including pictures. Suddenly, researchers acquired a cheap and limitless source of training data, which led to the new era of Big Data. While the concept of Big Data deserves a post of its own, we can say for now that Big Data unlocked new capabilities in learning and predicting. Second, computer hardware started becoming very cheap and ridiculously powerful. Modern CPUs and GPUs are nowadays able to sustain a tremendous amount of computation for just a couple of bucks.
After the golden era, Computer Vision arrived at the present day, where it can finally start fulfilling its prehistoric promises. All of a sudden, the lessons learned in the previous decade, the abundance of data and the tremendous power of modern hardware brought neural networks, the forgotten child of AI, to the surface again. Today, based on the modern version of neural networks, namely deep learning networks (see picture on the left), which I will revisit in a later post, we are able to classify the image content very, very accurately. In fact, it could be that computers are now reaching a human level of accuracy in certain tasks. Also, although it could be just a coincidence or my own bias, cracking vision was the trigger for general-purpose AI to draw a huge amount of attention. Nowadays, many big tech companies are investing a lot in bringing these technologies to real-life applications, a topic which I will also analyze further in a separate post.
Well, what is left to do then? Computer Vision is not solved yet. In fact, we are now at such a level of accuracy that we can go to the next level and really try to understand an image to its full extent. An image is not just the one or two objects that appear in it. It is the whole interaction of objects, it is the whole story that this interaction tells. And this level of content only explodes when we go from still images to moving images, that is, video. Of course, one could still say: just apply deep learning once more, problem solved. However, it is not as simple as that. We should not forget that neural networks worked only when enough data and computational power were available. For describing such high-level image and video content, therefore, one cannot really say that the current databases and hardware are enough. And even if they were enough, one cannot confidently say that the current learning methods would suffice for more complicated tasks, like “find all the videos in which a brown-white cat plays with the red laser on a colourful carpet”.
So, to conclude, computer vision is alive and kicking, and an awesome field for doing research. Although we are still not at human-level accuracy, after many, many years of persistent, amazing, inspiring research we are now one (or many) step(s) closer. And the future looks brighter than ever.