We may be stuck at home, but that didn’t stop over 3,000 enthusiastic AI experts from joining us for The Artificial Intelligence Festival on June 8-12.
We had some awesome presentations & product demonstrations, but the fun didn’t stop there: our Slack channel was alive with Q&A.
We’ve pulled out some of the top questions and answers to share with the community. Check ’em out below… And if you’d like to continue the conversation, be sure to check out the #ai-questions channel on the AIAI Slack.
Gary Brown, Director of AI Marketing at Intel, opened the festival with a bang 💥 and got us all thinking: Can Accelerating AI at the Edge Help The World?
Q. Is Intel in the process of integrating dedicated ML hardware into their CPUs? If so, will it be compatible with TensorFlow, and will the usage be similar to running a model on a GPU?
Yes, there are two ways we’re integrating ML acceleration into CPUs. One is with Intel Xeon processors that support something we call DL Boost to handle DNNs more efficiently. We also have integrated GPUs in our Intel Core processors (like the ones in laptops and tablets, which are also used in edge AI applications like smart signage and edge servers doing camera analytics)... we can have a long conversation about this for sure. And our tools support common frameworks like TensorFlow - check out the OpenVINO toolkit.
Gary Brown, Intel
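For readers who want to try that last pointer, here is a minimal sketch of running a converted model on an Intel CPU with the OpenVINO Python runtime. The model file name and input shape are placeholders, and exact API calls vary between OpenVINO releases; this follows the pattern of the newer openvino.runtime interface.

```python
import numpy as np
from openvino.runtime import Core  # OpenVINO Python runtime (2022+ style API)

core = Core()
# "model.xml" is a placeholder for an IR file converted from e.g. a TensorFlow model.
model = core.read_model("model.xml")
compiled = core.compile_model(model, "CPU")   # CPU plugin; uses DL Boost where available

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed NCHW input shape
result = compiled([x])[compiled.output(0)]
print(result.shape)
```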
Ajay Nair, Product Management Lead, Edge TPU, Google went on to talk about how and why machine learning applications are moving towards local/edge inference.
Q. One of the reservations about machine learning and AI is the carbon footprint left by training “big” models. How do these shifts take climate change into account in order to be eco-friendly?
AI training, like any other compute, is essentially high-performance compute. ML accelerators help perform the same compute at higher efficiency and a lower power footprint. Also, running AI inference as close to the edge as possible makes moving data to the cloud for inference unnecessary, so there’s a huge reduction in power footprint.
To add on, several AI use cases are actually designed to lower power consumption, from running heavy machinery efficiently to intelligently turning off power in rooms and buildings, etc.
Ajay Nair, Google
Jonathan Rubin, Senior Scientist, Philips Research North America gave us an overview of deep learning models that have been developed for a range of medical applications.
Q. In many health-tech startups I have noticed that a lot of the deep learning models are not built from scratch but are either used directly or adapted through means like transfer learning (which makes sense, as there’s no point reinventing the wheel). I was hoping for your insight on whether this paradigm will shift, or whether more focus will be put on academic or other institutions that do build models from scratch, given that the need for AI in this field is more pressing than ever.
Selecting between transfer learning and training from scratch often comes down to how much labeled data is available for training.
If labeled data is limited, utilizing pre-trained weights can help boost model performance by borrowing information from other datasets; that information is then encoded in the earlier layers of the network. So, in general, the decision to train from scratch or apply transfer learning depends on the exact type of data, how much is available, and how long you have to train your network.
Jonathan Rubin, Philips Research North America
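As a concrete (and heavily simplified) illustration of the transfer-learning route Jonathan describes, here is a minimal Keras sketch. The backbone, input size, and three-class head are assumptions for a hypothetical small labeled dataset, not anything from the talk.

```python
import tensorflow as tf

# Borrow features learned on a large dataset, then train only a small new head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False   # keep the borrowed early-layer features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(3, activation="softmax"),   # e.g. 3 hypothetical classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # with your own datasets
```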
Day 2 saw a lively morning with Claus Danielson, Principal Research Scientist, Mitsubishi Electric Research Laboratories (MERL), presenting the invariant-set motion planner developed at MERL and its application to autonomous driving.
Q. Which “disturbance models” are used to safeguard the invariant models? E.g., how do you account for things like flat tires, street conditions, and other things that can break and influence the car’s state-space dynamics?
There are two types of disturbances we consider: parametric model uncertainty and additive disturbances. Both are set-based, so we do not need the actual values of the parameters/disturbances, just a range of possible values.
Claus Danielson, MERL
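To make the set-based idea concrete, here is a toy sketch (made-up numbers and a generic interval-arithmetic check, not MERL’s actual planner): a candidate box of states is robustly invariant if the one-step reachable set, computed over the whole range of uncertain parameters and additive disturbances, stays inside the box.

```python
import numpy as np

# x+ = A x + w, where the entries of A are only known up to intervals
# (parametric uncertainty) and w lies in a box W (additive disturbance).

def interval_matvec(A_lo, A_hi, x_lo, x_hi):
    """Outer-bound A @ x when A and x are given element-wise as intervals."""
    n = A_lo.shape[0]
    lo, hi = np.zeros(n), np.zeros(n)
    for i in range(n):
        # The range of each product term is bounded by its four endpoint products.
        prods = np.stack([A_lo[i] * x_lo, A_lo[i] * x_hi,
                          A_hi[i] * x_lo, A_hi[i] * x_hi])
        lo[i] = prods.min(axis=0).sum()
        hi[i] = prods.max(axis=0).sum()
    return lo, hi

A0 = np.array([[0.5, 0.1], [0.0, 0.6]])                      # nominal closed-loop dynamics
A_lo, A_hi = A0 - 0.05, A0 + 0.05                            # parameter range
x_lo, x_hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])    # candidate box X
w_lo, w_hi = np.array([-0.1, -0.1]), np.array([0.1, 0.1])    # disturbance set W

nxt_lo, nxt_hi = interval_matvec(A_lo, A_hi, x_lo, x_hi)
nxt_lo, nxt_hi = nxt_lo + w_lo, nxt_hi + w_hi
print("robustly invariant:", bool(np.all(nxt_lo >= x_lo) and np.all(nxt_hi <= x_hi)))
```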
The morning progressed with Matthew Mattina, Distinguished Engineer & Senior Director of Arm’s Machine Learning Research Lab, talking us through ML on the edge, specifically the hardware and models for machine learning on constrained platforms.
Q. You went over how to transform a model from training to inference to fit within hardware constraints, for CNN based models. What kind of techniques have been used for RNN and Transformer based models?
A key difference between CNN models and RNN and Transformer models is that RNNs/Transformers usually have lots of fully connected/dense layers. So channel pruning, which is used to remove entire channels/output feature maps in CNNs, doesn’t really apply to RNNs/Transformers.
For RNNs/Transformers, I see magnitude-based pruning being quite useful. In fact, we often see that >90% of the weights in fully connected layers can be pruned (set to 0) with almost zero accuracy loss. There are also a range of matrix decomposition techniques that can be applied to RNNs/Transformers… things like singular value decomposition.
Matthew Mattina, ARM
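To make the two techniques above concrete, here is an illustrative NumPy sketch on a random dense weight matrix (not a real model): magnitude pruning of the smallest 90% of weights, and a rank-r SVD factorization of a fully connected layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024))          # stand-in for a dense layer's weights

# Magnitude pruning: zero out the 90% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(W), 0.90)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)
print("sparsity:", 1.0 - np.count_nonzero(W_pruned) / W.size)   # ~0.90

# SVD decomposition: replace W with two smaller rank-r factors,
# cutting parameters from 1024*1024 down to 2*1024*r.
r = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]                        # 1024 x r
B = Vt[:r, :]                               # r x 1024
W_lowrank = A @ B                           # low-rank approximation of W
```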
Saber Moradi, Hardware Architect ML/AI at GrAI Matter Labs, gave a talk on Energy-efficient Deep Neural Network Acceleration, presenting methods for exploiting the sparse changes in real-world signals such as audio and video streams to design more efficient compute architectures for deep neural network accelerators.
Q. When I used to work in the chip design industry, CMOS based cells were always used because it was the most robust and well understood technology. CMOS wasn’t always the best, but it was reliable and that’s what counts. Is there currently an industry trend towards using custom neuromorphic computing cells?
Also, for the event-driven algorithm, I understand that you are trying to prevent the switching of transistors when the input data is the same; the switching is what consumes the power. So is the event-driven algorithm done at the SW level or the HW level? And how much is the power overhead of running this algorithm?
In our technology we use standard CMOS cells, but there are efforts to build custom neuromorphic designs with emergent technologies such as ReRAM, etc. As you correctly pointed out, reliability is a big concern in such approaches, and it adds more complexity to algorithm development. That being said, there are research works arguing that we don’t need deterministic computing for fault-tolerant neuromorphic algorithms. At GML we are focused on integrating neuromorphic features such as event-driven processing with deep learning models. Hope that helps; I’d be glad to follow up on that in case you are interested.
On the second question, we address this point at both the SW and HW levels, but these features mainly need to be supported in hardware with minimal algorithmic changes. Given the sparsity of video or audio streams, event-driven algorithms require a significantly lower number of operations, which saves energy in both on-chip communication and compute. In regard to power consumption, we have a sample example with a 15x improvement in the number of operations; if you are interested, please check our next talk by Mahesh Makhijani, where he shares more insights on the power and performance figures for deep learning models on the GML chip.
Saber Moradi, GrAI Matter Labs
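The event-driven idea is easy to illustrate in software (a toy sketch, not GML’s hardware implementation): for a dense layer, only the inputs that changed since the last frame trigger multiply-accumulates, so a sparsely changing signal needs far fewer operations.

```python
import numpy as np

def event_driven_update(W, y_prev, x_prev, x_new, threshold=1e-3):
    """Update y = W @ x using only the inputs that changed ("events")."""
    delta = x_new - x_prev
    events = np.abs(delta) > threshold          # which inputs actually changed
    y_new = y_prev + W[:, events] @ delta[events]
    return y_new, int(events.sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 1024))
x_prev = rng.normal(size=1024)
y_prev = W @ x_prev

# Simulate a sparsely changing signal: only ~5% of inputs change between frames.
x_new = x_prev.copy()
idx = rng.choice(1024, size=51, replace=False)
x_new[idx] += rng.normal(size=51)

y_new, n_events = event_driven_update(W, y_prev, x_prev, x_new)
print("active inputs:", n_events, "of 1024")
print("max error vs full recompute:", np.abs(y_new - W @ x_new).max())
```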
Where are we with AutoML?
What are the challenges of applying AutoML/NAS to problems across the industry?
Adam Kraft, Machine Learning Engineer from Google, revealed all in his talk on Day 3.
Q. Would it be possible to optimize controllers in general by having another, higher-level “controller” that treats a “controller<>trainer” setup as its “trainer”?
Yes! Great question. There is active research on this. Some research is looking into using this multi-level AutoML to further reduce the need for human hyperparameter tuning. Other research is looking into using a higher-level “controller” to help optimize another “controller” that you care about, for instance in reinforcement learning.
Adam Kraft, Google
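As a toy illustration of the multi-level idea (a deliberately simple sketch with a fake objective, not Google’s AutoML): an outer controller tunes the search space that an inner random-search controller uses, treating the entire inner search as the thing being optimized.

```python
import math
import random

def train_and_score(lr):
    """Stand-in for real training: a noisy objective peaking near lr = 1e-2."""
    return -(math.log10(lr) + 2.0) ** 2 + random.gauss(0, 0.1)

def inner_controller(log_lr_low, log_lr_high, trials=10):
    """Inner controller: random search over learning rate within given log10 bounds."""
    best = -float("inf")
    for _ in range(trials):
        lr = 10 ** random.uniform(log_lr_low, log_lr_high)
        best = max(best, train_and_score(lr))
    return best

# Outer controller: random search over the bounds handed to the inner controller.
best_score, best_bounds = -float("inf"), None
for _ in range(5):
    low = random.uniform(-5, -2)
    high = low + random.uniform(0.5, 2.0)
    score = inner_controller(low, high)
    if score > best_score:
        best_score, best_bounds = score, (low, high)

print("best inner-search bounds found:", best_bounds)
```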
Christian Graber and Mahesh Makhijani from GrAI Matter Labs gave us a groundbreaking product demonstration of GrAI One, the industry’s first sparsity-enabled AI accelerator for the edge.
Q. Congratulations on the first proof-of-concept chip! That is a major accomplishment. I spoke with your colleague Saber Moradi yesterday regarding the event-driven algorithm. I understand that this is built into the SDK. Would it be possible for the SDK to apply the event-driven and time-sparsity techniques to run the model on different general-purpose hardware? For example, on a smartphone?
Good question. The ability to leverage sparsity does require the hardware we have developed, which can be easily integrated as a co-processor in smartphones. Regular general-purpose hardware cannot leverage the temporal and spatial aspects at run time. You can use techniques to make your neural network sparse before deployment, and those can run on general-purpose hardware as well as ours; however, general-purpose hardware would not be able to leverage the real-time, input-based sparsity (which can offer over 20x performance gains depending on the environment).
Q. For the network on chip configuration, is there any frequency or voltage scaling for processors that are idle or have low loads?
Yes, we do have techniques to deal with idle or low loads. Happy to provide more details under NDA.
Mahesh Makhijani, GrAI Matter Labs
The final festival day did not disappoint, with Shuo Zhang, Senior Machine Learning Engineer at Bose Corporation, discussing the application of NLP in music information technology in light of the latest transformations brought about by deep learning, which enables machines to make sense of the world through multimodal music and sound data.
Q. One of the problems with using CNNs to process sequential data such as audio is the inability to detect patterns across long sequences. This is why Transformer-based architectures have been so popular recently: their attention mechanism is designed to solve this exact problem. So my question is, how do the CNN architectures that you’ve presented solve this problem?
I totally agree; that’s why CRNN is a popular architecture for audio ML tasks as well, and I personally find it more effective than a plain CNN. But it is undeniable that many large-scale audio event detection models (such as Google’s VGGish) are still purely CNN-based. I think it has to do with the specific audio task at hand, whether sequence modeling is important to that task, and how important it is to predict at a fine resolution (such as per-frame prediction).
For instance, in Google’s VGGish formulation, they only output predictions at a 1s resolution: they take audio chunks at that level, perform an STFT, and obtain a 96 by 64 spectrogram to feed into the network. It is definitely arguable that temporal modeling is important in these tasks, but empirically it depends. Another thing people have tried is something like a dilated CNN, which increases the receptive field; with that you can output framewise predictions by keeping the output temporal dimension the same as the input (only pooling along the frequency axis), which is also a common technique in CRNNs.
Shuo Zhang, Bose Corporation
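To illustrate that last point, here is a minimal Keras sketch of a spectrogram CNN that pools only along the frequency axis and dilates along time, so all 96 input frames survive to the output for framewise predictions. Shapes and class count are assumptions; this is neither Bose’s nor Google’s actual model.

```python
import tensorflow as tf

n_frames, n_mels, n_classes = 96, 64, 10   # e.g. a 96x64 log-mel patch, 10 event classes

inputs = tf.keras.Input(shape=(n_frames, n_mels, 1))
x = tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2))(x)   # pool frequency only
x = tf.keras.layers.Conv2D(64, (3, 3), padding="same", dilation_rate=(2, 1),
                           activation="relu")(x)        # dilate along time
x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2))(x)
x = tf.keras.layers.Conv2D(64, (3, 3), padding="same", dilation_rate=(4, 1),
                           activation="relu")(x)
# Collapse the remaining frequency axis, keep all 96 time steps.
x = tf.keras.layers.Lambda(lambda t: tf.reduce_mean(t, axis=2))(x)  # (batch, 96, 64)
outputs = tf.keras.layers.Dense(n_classes, activation="sigmoid")(x) # per-frame labels

model = tf.keras.Model(inputs, outputs)
model.summary()  # output shape: (None, 96, n_classes)
```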
If you made it this far, thanks for reading! And as mentioned be sure to check out the AIAI Slack to engage with like-minded AI enthusiasts and keep these conversations going!
A generation of human intelligence will inspire the next generation of machine intelligence 😎