Is web data driving the development of Artificial Intelligence?
Artificial Intelligence (AI) has been one of the most discussed topics of the past two years. Does it already exist? When will it arrive? Will we ever be able to build a machine with human-like intelligence? In this article, I want to look at some recent trends that suggest how AI might be on its way.
Does web data drive the development of AI?
If you have noticed a lot of discussion around Big Data, Machine Learning, and Deep Learning in the past few months, you are not alone. For any new technology or way of understanding the world to develop at scale, there has to be an underlying driving force, and that force usually becomes evident as research progresses and trends emerge.
These three terms have become some of the most common ways of describing how we extract knowledge from data. The datasets involved continue to grow exponentially, and it is precisely such large datasets that let researchers apply Machine Learning and Deep Learning techniques, whose popularity has surged recently because of their relevance to this kind of data.
The goal is usually to train a machine learning algorithm on large amounts of web, text, image, or video data (the training set) and then apply it to new, unseen data (the test set). Training runs for many iterations until the quality of the algorithm's predictions falls within a certain threshold.
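As a rough illustration of this train-until-threshold loop, here is a minimal Python sketch. The model, the toy dataset, and the 0.95 accuracy threshold are placeholder assumptions, not anything described in the article:

```python
# Minimal sketch of the train/evaluate loop described above.
# Model, data, and the 0.95 threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Stand-in for a large web/text/image dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
threshold, max_iters = 0.95, 100

for iteration in range(max_iters):
    # One pass (iteration) over the training data.
    model.partial_fit(X_train, y_train, classes=np.unique(y))
    accuracy = accuracy_score(y_test, model.predict(X_test))
    if accuracy >= threshold:  # stop once predictions are "good enough"
        print(f"Reached {accuracy:.2%} after {iteration + 1} iterations")
        break
```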
To extract knowledge from these large-scale datasets, Machine Learning and Deep Learning algorithms based on "unsupervised learning" have been used increasingly. Instead of training machines on hand-labeled examples from the real world (pictures tagged as kittens, people dancing, and so on), the models are trained directly on raw, unlabeled web data such as news articles.
This process has some clear advantages:
1) Unsupervised learning lets us discover hidden features or relations in data without any additional information such as labels. This is easiest to see with images: we can feed an algorithm thousands of cat pictures, and it will pick up on features that appear in most of them (whiskers, pointed ears, and so on); based on the color distribution of the images, it might even separate cats from dogs (see the sketch after this list).
2) Unsupervised learning scales to very large datasets precisely because no labels are involved, so a large team of researchers is not needed to annotate the web data by hand. The data only has to be grouped into rough categories, after which researchers can look for new trends within those groups or use them as labels for supervised algorithms later on.
3) Training these models can take days or weeks depending on the size of the dataset, but once a model is trained it can be applied to new test sets without retraining, which significantly reduces the time needed before it is put to use.
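As a concrete illustration of point 1, here is a minimal unsupervised-learning sketch: it clusters images by their color histograms with k-means, without ever seeing a label. The file paths and the choice of two clusters (say, "cats" vs. "dogs") are assumptions made purely for the example:

```python
# Minimal unsupervised-learning sketch: cluster images by color histogram.
# Image paths and the number of clusters are illustrative assumptions.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def color_histogram(path, bins=8):
    """Flattened RGB histogram used as a simple, label-free feature vector."""
    img = np.asarray(Image.open(path).convert("RGB").resize((64, 64)))
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins, bins, bins))
    return hist.flatten() / hist.sum()

paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # hypothetical files
features = np.array([color_histogram(p) for p in paths])

# k-means groups the images purely by feature similarity -- no labels needed.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)  # e.g. [0, 1, 0]: which cluster each image fell into
```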
What are some recent examples of AI?
The use cases are endless! 1) Researchers at Google Brain recently created an algorithm that can "hallucinate" missing parts of images. After being trained on 30,000 car pictures, it learned to fill in details that were missing from new car images. This kind of machine learning could be used to improve self-driving cars or any other application where it is vital not to miss details such as road signs. Even more so because, with Deep Learning models, there is usually no human intervention required after the initial setup and training.
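To make the idea of "filling in" missing image regions concrete, here is a minimal sketch of one common approach: a small convolutional autoencoder trained to reconstruct images from masked copies. The tiny architecture and the random stand-in data are assumptions for illustration, not the Google Brain model itself:

```python
# Minimal inpainting sketch: train a small autoencoder to reconstruct images
# from copies with a masked-out square. Architecture and data are illustrative
# assumptions, not the model described in the article.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def mask_center(images, size=8):
    """Zero out a square patch so the model must 'hallucinate' it back."""
    masked = images.copy()
    h, w = images.shape[1:3]
    y0, x0 = (h - size) // 2, (w - size) // 2
    masked[:, y0:y0 + size, x0:x0 + size, :] = 0.0
    return masked

# Stand-in data: random 32x32 RGB images (a real project would use car photos).
images = np.random.rand(256, 32, 32, 3).astype("float32")
masked = mask_center(images)

autoencoder = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.Conv2D(3, 3, activation="sigmoid", padding="same"),  # reconstructed image
])
autoencoder.compile(optimizer="adam", loss="mse")

# Input: masked image; target: original image -> the model learns to fill the hole.
autoencoder.fit(masked, images, epochs=2, batch_size=32, verbose=0)
restored = autoencoder.predict(masked[:1])
```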
2) A recent paper published by Facebook showed how far their AI has come in understanding speech. The average human makes only about one mistake per 5,000 words (roughly 99.98% accuracy), while Facebook's algorithm reached 99.38%. The gap might not look like much on paper, but this is a new record and an excellent milestone for AI research, showing how far the technology has come.
Google Brain has also published a set of tools called TensorFlow, a framework for training new machine learning models on large datasets. Some frameworks already existed at the time, but TensorFlow allows faster training and integrates easily with other libraries, services, and programming languages, making it easier for the general public to adopt. What does this mean? It means that Machine Learning will become even more accessible to everyone!
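As a taste of how accessible this makes things, here is a minimal TensorFlow sketch that defines, trains, and evaluates a tiny classifier. The random toy data and the two-layer network are choices made just to keep the example short:

```python
# Minimal TensorFlow/Keras sketch: define, train, and evaluate a tiny classifier.
# The random toy data and the two-layer network are illustrative choices only.
import numpy as np
import tensorflow as tf

# Toy dataset: 1,000 samples with 20 features and a binary label.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("int32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
loss, accuracy = model.evaluate(X, y, verbose=0)
print(f"accuracy: {accuracy:.2%}")
```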
What are some future challenges for AI?
Accessibility of large-scale model training. Most of the frameworks available today need specific GPUs (Graphics Processing Units) to train models in a reasonable time; those GPUs are expensive, and without them training slows down dramatically. The software side is another challenge, since setting up training (and running it without errors!) still requires extensive knowledge of coding, operating systems, and so on. This makes it hard for people outside the research community, or students, to adopt this kind of technology: they may have access to neither the costly hardware nor the expertise needed for error-free training and transfer learning.
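One way to lower both the hardware and the expertise barrier is transfer learning: reusing a model that someone else already trained and fitting only a small new layer on top. Below is a minimal sketch using a pretrained Keras image model; the stand-in data and the two-class setup are assumptions for illustration:

```python
# Minimal transfer-learning sketch: reuse a pretrained image model and train
# only a small classification head. Dataset and the 2-class setup are assumed.
import numpy as np
import tensorflow as tf

# Pretrained feature extractor (ImageNet weights), frozen so that only the new
# head is trained -- far cheaper than training a large model from scratch.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet", pooling="avg"
)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # new 2-class head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stand-in data; a real project would load its own labeled images here.
X = np.random.rand(64, 96, 96, 3).astype("float32")
y = np.random.randint(0, 2, size=64)
model.fit(X, y, epochs=1, batch_size=16, verbose=0)
```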
How should we evaluate the quality of the models/algorithms?
This is a hard problem because, in unsupervised learning, there is no labeled ground truth to compare against. Common workarounds include metrics such as perplexity (for language models) or BLEU and ROUGE (for generated text), which try to approximate how well a model has captured the structure of the data. Still, these metrics are not always reliable for every task.
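For instance, BLEU scores a generated sentence by how much its word n-grams overlap with a reference sentence. Here is a minimal sketch using NLTK; the reference and candidate sentences are made up for the example:

```python
# Minimal sketch of scoring generated text with BLEU using NLTK.
# The reference and candidate sentences are made-up examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]   # ground-truth tokens
candidate = ["the", "cat", "is", "on", "the", "mat"]      # model output tokens

# Smoothing avoids a zero score when some n-gram orders have no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the reference
```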
Security concerns
Algorithms trained with Deep Learning techniques can be attacked. If someone slips maliciously labeled examples into a large training set (data poisoning, as in the cat vs. dog example above), the model learns the wrong thing; and carefully crafted inputs can also fool an already-trained model into producing outputs it was never meant to produce. A related idea, "Generative Adversarial Networks" (GANs), pits a generator that tries to produce realistic data against a discriminator that tries to tell real from fake, and the same adversarial thinking can be turned against deployed AI systems.
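To show what "carefully crafted inputs" can mean in practice, here is a minimal sketch of the fast gradient sign method (FGSM), a standard way of building an adversarial example. The tiny untrained model and the random image are placeholder assumptions, and the technique is a common illustration rather than anything from the article:

```python
# Minimal FGSM sketch: nudge an input image in the direction that increases the
# model's loss, so a human sees no change but the model may now misclassify it.
# The tiny untrained model and random image are placeholder assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

image = tf.random.uniform((1, 32, 32, 3))   # stand-in for a real photo
label = tf.constant([3])                    # the image's true class
epsilon = 0.01                              # perturbation size (kept tiny)

with tf.GradientTape() as tape:
    tape.watch(image)
    loss = loss_fn(label, model(image))

# Move each pixel slightly in the direction that increases the loss.
gradient = tape.gradient(loss, image)
adversarial = tf.clip_by_value(image + epsilon * tf.sign(gradient), 0.0, 1.0)
```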
Who is responsible when an AI makes a mistake?
Should self-driving cars be banned if they are involved in accidents, even when the accident is not the algorithm's fault? And if an algorithm is at fault, who takes the blame: the people who built the dataset, or the programmers? These are still big open questions that will have to be addressed sooner or later.