BUILDING BLOCKS OF AI
Basic building blocks of AI-Powered Jarvis — the Iron Man Assistant
Not long ago I used to think that a Jarvis from Iron Man could not be a reality anytime soon, at least not for decades. But this talk by Andrej Karpathy (posted below) cleared the blurriness and opened up the possibilities in front of me. The talk is about Recurrent Neural Networks (RNNs), one of the deep learning tools that are part of the machine learning toolset, which is in turn part of the artificial intelligence domain. Imagine what the remaining parts can do!
Let me first present his video talk, given at a Deep Learning Meetup in London. I will then point out the real-life applications that he mentions at different times during his presentation. I have tried to keep the language simple and to the point for readers with just an introductory knowledge of AI/ML.
Visualizing and Understanding Recurrent Networks | SkillsCast
Deep Learning London Meetup community cast. Andrej Karpathy: Recurrent Neural Networks (RNNs), and specifically a…
Please scroll to the locations indicated below in this video to listen to specific parts of the presentation.
- At 08:00 — How an RNN is trained for character-by-character prediction of English text. It takes the first four characters of the word hello, i.e. ‘h’, ‘e’, ‘l’, ‘l’, and predicts the last character, ‘o’. (This is no big deal, but a good start!)
- At 16:40 — How an RNN takes the works of Shakespeare and trains itself to produce more writing in his style. Writing that makes little sense, but that Shakespeare never actually wrote.
- At 18:30 — The RNN is trained on mathematical proofs of theorems and formulas. Then, at test time, when given a few starting lines, it goes ahead and derives or “proves” a new theorem altogether. The proofs are not completely right, but they are very close.
- At 20:00 — The RNN is trained overnight on the source files of the Linux operating system, and it then starts generating code by itself. (Now, this is what I call too much!) It follows indentation, opens and closes brackets, and respects the structure of ‘for’ loops and ‘if’ statements. The code is not very logical, but it looks very close to the real thing.
- At 23:10 — We see examples where it creates recipes, music, Bible verses, and other seemingly sensible content that is completely novel.
- At 35:00 — Finally, the network is trained on images paired with text describing each image. When the trained network is then given test images, it says what is happening in each image. This was not very accurate when the talk was delivered, but it is pretty accurate now, with more computing power and more labeled data fed into the system. This part also uses Convolutional Neural Networks (CNNs) for the image-identification part, which I will talk about later.
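The first demo above (character-by-character prediction on “hello”) can be sketched in a few dozen lines of plain numpy. This is only a toy illustration of the idea, not the exact model from the talk; the hidden size, learning rate, and iteration count are arbitrary choices of mine.

```python
import numpy as np

# Toy character-level RNN: learn to predict the next character of "hello".
np.random.seed(0)
text = "hello"
chars = sorted(set(text))                  # ['e', 'h', 'l', 'o']
ix = {c: i for i, c in enumerate(chars)}
V, H = len(chars), 8                       # vocabulary size, hidden size

Wxh = np.random.randn(H, V) * 0.1          # input -> hidden
Whh = np.random.randn(H, H) * 0.1          # hidden -> hidden (the "recurrence")
Why = np.random.randn(V, H) * 0.1          # hidden -> output

def one_hot(c):
    v = np.zeros(V)
    v[ix[c]] = 1.0
    return v

inputs, targets = text[:-1], text[1:]      # 'hell' -> 'ello'

for step in range(1000):                   # simple full-batch gradient descent
    h = np.zeros(H)
    hs, ps = [h], []
    dWxh, dWhh, dWhy = 0, 0, 0
    # forward: unroll the RNN over the four input characters
    for cin in inputs:
        h = np.tanh(Wxh @ one_hot(cin) + Whh @ hs[-1])
        y = Why @ h
        p = np.exp(y - y.max()); p /= p.sum()   # softmax over next character
        hs.append(h); ps.append(p)
    # backward: backpropagation through time
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[ix[targets[t]]] -= 1.0
        dWhy += np.outer(dy, hs[t + 1])
        dh = Why.T @ dy + dh_next
        draw = (1.0 - hs[t + 1] ** 2) * dh       # back through tanh
        dWxh += np.outer(draw, one_hot(inputs[t]))
        dWhh += np.outer(draw, hs[t])
        dh_next = Whh.T @ draw
    for W, dW in ((Wxh, dWxh), (Whh, dWhh), (Why, dWhy)):
        W -= 0.1 * np.clip(dW, -1.0, 1.0)        # clipped gradient step

# After training, feed 'h','e','l','l' and read off the predicted next character.
h = np.zeros(H)
for c in "hell":
    h = np.tanh(Wxh @ one_hot(c) + Whh @ h)
pred = chars[int(np.argmax(Why @ h))]
print(pred)
```

Note that the hidden state is what lets the network tell the first ‘l’ (followed by another ‘l’) apart from the second ‘l’ (followed by ‘o’): the same input character produces different predictions depending on what came before it.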
This talk was given in 2015, and deep learning has progressed exponentially since then. In recent years, many papers have been published on bidirectional RNNs, which take input from future text to improve their interpretation of earlier text. Let me elaborate with an example. You might have noticed, while talking to Apple’s voice assistant Siri, that once you start speaking it displays some text but corrects itself within a few seconds as you continue to speak. It is taking input from the later words and revising its earlier interpretation so that the whole sentence makes sense in context. Something called an “LSTM network” helps it remember the context even across a long spoken sentence. However, training a network to do such a marvelous task requires a lot of computational power and a lot of data. One way to get data labeled is Amazon Mechanical Turk, a platform where data seekers post their requirements and labelers provide the service for a fee. This is a relatively cheap way to collect labeled data, and companies in countries with lower labor costs take part, benefiting both the worker and the consumer of the data.
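To make the LSTM idea concrete, here is a single LSTM step written out in numpy. This is a generic sketch of the standard cell, not Siri’s actual model; the input and hidden sizes, the random weights, and the toy sequence are all placeholders. The point is the three gates, which decide what old context to keep, what new information to write, and what to expose.

```python
import numpy as np

# One step of a standard LSTM cell in plain numpy, showing the gating that
# lets the network "remember" context across a long sentence.
np.random.seed(1)
D, H = 4, 3                                # input size, hidden size (arbitrary)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on the concatenated [h_prev, x].
Wf, Wi, Wo, Wc = (np.random.randn(H, H + D) * 0.1 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)                    # forget gate: how much old memory to keep
    i = sigmoid(Wi @ z)                    # input gate: how much new info to write
    o = sigmoid(Wo @ z)                    # output gate: how much memory to expose
    c_tilde = np.tanh(Wc @ z)              # candidate memory content
    c = f * c_prev + i * c_tilde           # cell state: the long-term "context"
    h = o * np.tanh(c)                     # hidden state: what the next layer sees
    return h, c

# Run the cell over a toy 5-step input sequence; c carries context the whole way.
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    x = np.random.randn(D)
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```

Because the cell state `c` is updated additively (old memory times the forget gate, plus gated new content) rather than being completely rewritten at every step, information can survive across many time steps, which is exactly what a vanilla RNN struggles with.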
Convolutional Neural Networks (CNN)
Moving ahead, let me bring up the possibilities that CNNs open for us. CNNs can do a whole lot of tasks in image and video processing, which we see in self-driving cars, or in the Iron Man suit for that matter!
Here is a simplified list of tasks that CNNs can perform. I will take the running example of the Iron Man suit to relate it better.
- Detection: Whether there is any object in the image at all!
- Classification: It can say whether the image/video contains a missile or a rocket (from a given set of categories)!
- Localization: It can locate where exactly the missile is and where the bridge is!
- Recognition: Whether the missile is from Stark Industries or from a different nation!
- Verification: Among the ten missiles that Stark Industries built, which one is this? Verify its identity and order it to go back (which sadly did not happen in the actual movie :P)
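All of these tasks are built on one core operation: sliding a small filter over an image and computing a dot product at each location. Here is a minimal sketch of that operation in numpy, using a hand-made vertical-edge filter on a synthetic image; real CNNs learn thousands of such filters instead of hand-coding them.

```python
import numpy as np

# The core CNN operation: a 2D convolution (technically cross-correlation).
# A small kernel slides over the image; each output value is the dot product
# of the kernel with the image patch under it. Detection and localization
# are built on top of feature maps produced this way.
def conv2d(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))   # "valid" output size
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                         # dark left half, bright right half
kernel = np.array([[-1.0, 1.0]])           # hand-made vertical-edge detector

fmap = conv2d(image, kernel)
print(fmap)                                # non-zero only where the edge is
```

The feature map is zero everywhere except along the column where the brightness jumps, which is exactly the "where is the object" signal that localization layers then aggregate.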
Combining both RNNs and CNNs, we get what we have seen Jarvis doing in Iron Man: helping Tony Stark while flying by detecting and identifying objects, localizing them, and verifying them (CNN), while in parallel communicating (bi-RNN) with Tony about those objects and asking for his response so it can act accordingly. It also remembers the previous conversation (LSTM) with Tony to suggest and assist him in his work.
Most commercialized products, from voice assistants to AirWorks aerial data analysis, and from robotic pet dogs to radiology assistants, use advanced versions of CNNs and RNNs to accomplish their tasks. This field only started picking up around fifteen years ago, and we have many decades to go.