An Open Letter to Jitendra Malik

(APROPOS OF THE FOUNDATION MODELS WORKSHOP)

For my Computer Vision & AI friends
Dear Jitendra,

I was very glad to hear your comments on the Foundation Models promoted by the Stanford group, during a workshop on the topic. You argued that Foundation Models really have no foundation, because cognition and language are built on the sensorimotor intelligence that came before them – grounding/embodiment (you presented an evolutionary argument and a developmental argument). You are of course right, at first glance!

I guess due to lack of time, you did not mention that the arguments you presented are not your own ideas but those of a school of thought represented by many people in many countries, under a variety of names (active/embodied/animate/grounded/... perception). From a technical standpoint, it amounts to involving the motor system in the process of perception, an idea that the mainstream computer vision intelligentsia considers heresy. For example, the topic has been excluded from CVPR for a long time. Such notions are reinforced by the current zeitgeist: OpenAI, for example, abandoned robotics. The real reason they did so is that they could not do anything interesting with it, but it was presented as if you don't really need robotics to achieve AI, which is nonsensical. This is the view of many linguists as well, and it shows the chasm that exists between them and the sensorimotor people – bad news for AI. Most AI people subscribe to "disembodied AI".

What does it mean to involve the motor system in perception? It means many things, but I will give an example. Let's say you look at a face or an object close to you. How do you know how far away it is? The computer vision textbooks that we teach will say that we use stereo, or motion, or some other process and we compute the distance (let's say in meters). That's how machine vision systems operate. But this is not how biological systems operate. They know no meters or inches. I know how far the face or the object is because I know how to move my hand so that I can touch it at any point, or I know how to throw a stone to hit it. That is, the motor system is our measuring device, and as such, it is involved almost everywhere. My friend Luciano Fadiga from Ferrara, Italy sent me yesterday a hot-off-the-press paper showing that there are mirror neurons in the rat as well. It looks like these multimodal perception/action neurons exist in every living creature. We cannot ignore the motor system.
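To make "the motor system as measuring device" concrete, here is a minimal Python sketch; the two-link arm, its link lengths and the targets are all hypothetical. The answer to "how far?" is a motor program that reaches the point, or nothing at all if no reach exists.

```python
# Hedged illustration of "distance as a motor program": instead of reporting
# meters, report the joint angles a two-link arm would need to touch the point.
# The arm, its link lengths, and the targets are hypothetical placeholders.
import numpy as np

L1, L2 = 0.3, 0.25                      # upper arm / forearm lengths (assumption)

def reach(x, y):
    """Return shoulder/elbow angles that place the fingertip at (x, y)."""
    c2 = (x**2 + y**2 - L1**2 - L2**2) / (2 * L1 * L2)
    if abs(c2) > 1:
        return None                     # out of reach: no motor program exists
    elbow = np.arccos(c2)
    shoulder = np.arctan2(y, x) - np.arctan2(L2 * np.sin(elbow),
                                             L1 + L2 * np.cos(elbow))
    return shoulder, elbow

# "How far is the object?"  Answer: the reach that touches it, not a number in meters.
print(reach(0.4, 0.2))                  # a feasible motor program
print(reach(2.0, 0.0))                  # None: beyond arm's length, so walk or throw
```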

Why do we have a brain (why did we develop a brain)? The answer of the sensorimotor school of thought is: so that we can move and do things. The Foundation Model people will say that we have a brain so that we can think. Move or think? That is the question.

Embodied cognition people like me, for example, will say that if I have the "move", then I can figure out the "think". It is more natural – this is how it happened in evolution. This is the true bio-inspired approach. But the Foundation Model people can also say that if I have the "think", then I can figure out what parts of the "move" I need to ground my thinking. It means going backwards along the evolutionary scale. It is unnatural, but it is a serious Gedankenexperiment.

Let me now come to the main argument of my letter. The (silent) conclusion that the Foundation Models are not "the right ones" because they have no sensorimotor foundation is not necessarily right. Why? Because they can acquire one, by traversing the evolutionary ladder backwards! This would mean starting with a symbolic conceptual system based on words and extending it to become multimodal. How? Well, due to a freaky sequence of historical events and an extraordinary amount of luck, the foundation models are not really working with language. They are working with high dimensional vectors representing structured objects. Linguists and computer scientists call them embeddings, i.e. words are mapped to high dimensional vectors. So, what the foundation models are doing is transforming and manipulating high dimensional vectors. These vectors usually represent words, but they don't have to. They use them with language because the encoding developed for words is good enough; there is no comparable code yet for vision, sound, and the other modalities.
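To illustrate what "working with high dimensional vectors" means in practice, here is a minimal sketch with a toy vocabulary and randomly initialized vectors standing in for learned embeddings; only the shapes and the operations matter, not the particular numbers.

```python
# Words become points in a high-dimensional space; the "language" the model
# actually manipulates is these vectors, not the strings themselves.
# Toy vocabulary and random vectors are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "cat", "stone", "throw"]           # toy vocabulary
dim = 300                                          # a typical embedding size
E = {w: rng.standard_normal(dim) for w in vocab}   # stand-in for learned embeddings

def cosine(u, v):
    """Similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Everything downstream (attention, prediction, inference) is arithmetic on
# vectors like these; the word identities are only labels at the interface.
print(cosine(E["dog"], E["cat"]))
```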

Most linguists think of language as just "words"; most of them do not worry about meaning. However, the linguists and the symbolic AI people are here with us, and they have been doing a lot of interesting work. How can we capitalize on what they have done and take advantage of their progress toward the achievement of AI? It is hard to do, because in this case you would start with the symbolic conceptual (language) system and you would want to go down to the sensorimotor system while retaining the connections. The only way to do it, in my opinion, is to start with the conceptual system given by language and augment it to make it multimodal, using self-supervised learning and gargantuan datasets.

Indeed, given that they (the foundation models) already have some conceptual system available (because of language), the way to build AI from the Foundation Models perspective would be to develop a multimodal conceptual system (whether I see a dog, hear a dog, or read the word dog, the system takes me to the same vector). This multimodal conceptual system, equipped with prediction, inference and planning, is some kind of mind, developed not as evolution worked, but differently. It would not be a true bio-inspired solution, but it has a lot of potential. GPT and BERT are not bio-inspired either.
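A hedged sketch of what "the same vector for a seen, heard, or read dog" could look like mechanically: modality-specific encoders, here just random linear maps standing in for learned networks, projecting into one shared space. In a trained system, self-supervised alignment on paired data would pull the three embeddings together; the sizes and names below are assumptions for illustration only.

```python
# Each modality gets its own encoder, all projecting into one shared space
# where "dog seen", "dog heard" and "dog read" should land near the same point.
import numpy as np

rng = np.random.default_rng(1)
D_SHARED = 128                                   # shared conceptual space (assumption)

# Hypothetical modality-specific feature sizes and stand-in "encoders".
W_image = rng.standard_normal((D_SHARED, 512))   # stand-in for a vision encoder
W_audio = rng.standard_normal((D_SHARED, 256))   # stand-in for a sound encoder
W_text  = rng.standard_normal((D_SHARED, 300))   # stand-in for a word encoder

def embed(W, features):
    v = W @ features
    return v / np.linalg.norm(v)

# In a trained system, contrastive self-supervised learning on paired data would
# align these three embeddings of "dog"; here they only show the shapes involved.
dog_image = embed(W_image, rng.standard_normal(512))
dog_audio = embed(W_audio, rng.standard_normal(256))
dog_text  = embed(W_text,  rng.standard_normal(300))

print(dog_image @ dog_text)   # cosine similarity in the shared space
```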

So if we are able to develop (learn) codes for translating images, videos, motor control sequences, sounds, tactile information and smells into high dimensional vectors – as word2vec, for example, does for words – then the foundation models will become a very powerful system, some kind of mind. Its contents will be just high dimensional vectors representing images, sounds, objects, motions, actions and events, thus solving the INTEGRATION problem. The math of this enterprise is what is known as Hyperdimensional (HD) Computing, introduced by Kanerva (and by Plate before him – Vector Symbolic Architectures). By introducing the appropriate multiplication and addition rules, we can bind "things" (vectors) to each other, create sets, achieve multimodal learning and turn cognition and inference into linear algebra. Currently there is a revolution in HD computing, completely orthogonal to the Stanford group, fueled also by the fact that the HD operations can run on neuromorphic low power hardware.
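To show what the multiplication and addition rules buy us, here is a toy HD computing sketch in the Kanerva style: random bipolar vectors, binding by element-wise multiplication, bundling by addition followed by a sign, and similarity by a normalized dot product. The "dog seen/heard/read" record below is purely illustrative.

```python
# Toy hyperdimensional computing: bind modality to concept, bundle the results
# into one record, then recover the concept by unbinding. All vectors are random.
import numpy as np

rng = np.random.default_rng(2)
D = 10000                                   # HD vectors are very wide on purpose

def hd_vector():
    return rng.choice([-1, 1], size=D)      # random bipolar vector

def bind(a, b):                             # associate two items; self-inverse
    return a * b

def bundle(*vs):                            # superpose items into a set-like vector
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):                              # normalized dot product in [-1, 1]
    return float(a @ b) / D

# "dog" bound to the channel it arrived through, all bundled into one record.
dog, seen, heard, read = (hd_vector() for _ in range(4))
record = bundle(bind(dog, seen), bind(dog, heard), bind(dog, read))

# Unbinding with "seen" recovers a vector much closer to "dog" than chance.
print(sim(bind(record, seen), dog))   # ~0.5: structured recall
print(sim(hd_vector(), dog))          # ~0.0: unrelated vector
```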


So, there is some hope for the Foundation Models. I wish them well!


With best regards,

Yiannis

P.S.: I sketched here in broad terms a research plan to get us from a language-based symbolic conceptual system to a multimodal conceptual system, in a Foundation Model style. The details of the self-supervised learning and of the datasets were left out. If the Foundation Model people attempt to do this, I am afraid they will leave out the motor system again, but they will use vision, sound and the other senses. So, even if they leave out the motor system, they will have accomplished something significant. I suspect they will do this because they will not know which motor system to emulate – a drone's, a humanoid's, a self-driving car's? Frankly, thinking in terms of learning from data cripples you when it comes to the motor system, because what you learned from drone data you cannot apply to a crawling or walking robot. This is why OpenAI abandoned robotics. A different approach is needed. In my lab in Maryland we pursue the approach of a symbolic description of movement (action) – the action grammars. In other words, you make the motor system abstract, and you learn that abstraction (agent independent). Then you can transfer it to other robot morphologies and architectures (see the sketch below). Google, FB and the usual suspects that will most probably finance this operation would love multimodal conceptual systems, because they would allow them to do everything they already do much better, but they wouldn't really care about the motor system. So, I think, the real problem will still be open. There is some hope for the rest of us.
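For illustration only, a minimal sketch of the "abstract, agent-independent action description" idea: a symbolic plan over action primitives, with each embodiment supplying its own realization of every symbol. The grammar, primitives and controllers below are hypothetical placeholders, not my lab's actual formalism.

```python
# An action as a symbolic plan over primitives; each robot morphology supplies
# its own low-level realization of the same abstract symbols.
PICK_AND_PLACE = ["REACH(obj)", "GRASP(obj)", "MOVE(obj, target)", "RELEASE(obj)"]

# Hypothetical per-robot realizations of the same abstract symbols.
ARM_CONTROLLERS = {
    "REACH":   "inverse kinematics to pre-grasp pose",
    "GRASP":   "close two-finger gripper",
    "MOVE":    "joint-space trajectory to target",
    "RELEASE": "open gripper",
}
DRONE_CONTROLLERS = {
    "REACH":   "fly to hover above object",
    "GRASP":   "engage suction hook",
    "MOVE":    "fly carrying payload",
    "RELEASE": "disengage hook",
}

def realize(plan, controllers):
    """Map the abstract action symbols onto a specific embodiment."""
    return [(step, controllers[step.split("(")[0]]) for step in plan]

# The same symbolic plan transfers across morphologies; only the grounding changes.
for step, low_level in realize(PICK_AND_PLACE, DRONE_CONTROLLERS):
    print(step, "->", low_level)
```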