Agenda
Hardware 101: the Family
Hardware 101: Number Representation
1. Algorithms for Efficient Inference
1.1 Pruning Neural Networks
Iteratively Retrain to Recover Accuracy
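The pipeline is: train the dense network, prune away the low-magnitude weights, retrain the surviving weights, and iterate. A minimal NumPy sketch of magnitude-based pruning (the `retrain` step in the comment is a hypothetical fine-tuning routine, not shown):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights so that `sparsity` fraction become zero."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Iterative prune -> retrain loop (retrain() is a hypothetical fine-tuning step
# that must keep the masked-out weights at zero):
# for sparsity in (0.5, 0.7, 0.9):
#     w, mask = prune_by_magnitude(w, sparsity)
#     w = retrain(w, mask)
```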
Pruning RNN and LSTM
Accuracy even improves slightly after pruning and retraining.
Pruning Changes Weight Distribution
1.2 Weight Sharing
Trained Quantization
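In Deep Compression, weight sharing is done by k-means clustering the weights of a layer into a small codebook (e.g. 16 centroids, i.e. 4-bit indices); during retraining, the gradients of weights that share a centroid are summed to fine-tune that centroid. A rough NumPy sketch of the clustering step only (centroid retraining omitted):

```python
import numpy as np

def kmeans_weight_sharing(weights, n_clusters=16, n_iters=20):
    """Cluster weights into n_clusters shared values (a 4-bit codebook for 16 clusters)."""
    w = weights.ravel()
    # Linear initialization over the weight range, as suggested in Deep Compression.
    centroids = np.linspace(w.min(), w.max(), n_clusters)
    for _ in range(n_iters):
        assign = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centroids[c] = w[assign == c].mean()
    shared = centroids[assign].reshape(weights.shape)   # weights replaced by centroids
    indices = assign.reshape(weights.shape)             # only these indices are stored
    return shared, indices, centroids
```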
How Many Bits do We Need?
Pruning + Trained Quantization Work Together
Huffman Coding
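After pruning and weight sharing, the remaining codebook indices are very unevenly distributed, so Huffman coding assigns the frequent indices shorter codes. A small sketch that computes only the code lengths, using Python's heapq:

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the symbol stream."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {s: 1 for s in freq}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    uid = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:      # every merge adds one bit to these symbols' codes
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, uid, syms1 + syms2))
        uid += 1
    return lengths
```

The average bits per stored index is then the frequency-weighted mean of these code lengths, which is what gets compared against the fixed-width index encoding.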
Summary of Deep Compression
Results: Compression Ratio
SqueezeNet
Compressing SqueezeNet
1.3 Quantization
Quantizing the Weight and Activation
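The lecture's fixed-point scheme picks a per-layer radix point from the dynamic range of the weights and activations; a closely related and simpler variant is symmetric linear quantization to int8, sketched below:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a float tensor to signed 8-bit integers."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12   # one scale per tensor (per layer)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(64).astype(np.float32)
q, s = quantize_int8(x)
print(np.max(np.abs(dequantize(q, s) - x)))   # quantization error, roughly bounded by scale/2
```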
**Quantization Result**: 8 bits is chosen.
1.4 Low Rank Approximation
Low Rank Approximation for Conv: similar to an Inception module
Low Rank Approximation for FC: matrix factorization
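For a fully-connected layer, the weight matrix W (out x in) can be factorized with a truncated SVD into two thin matrices, replacing one large matrix-vector product with two small ones. A minimal NumPy sketch:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (out x in) as U_r @ V_r with U_r: out x rank, V_r: rank x in."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # absorb singular values into the first factor
    V_r = Vt[:rank, :]
    return U_r, V_r

# y = W @ x becomes y = U_r @ (V_r @ x);
# parameter count drops from out*in to rank*(out + in).
W = np.random.randn(1024, 4096)
U_r, V_r = low_rank_factorize(W, rank=64)
print(np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))   # relative approximation error
```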
1.5 Binary / Ternary Net
Trained Ternary Quantization
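Trained Ternary Quantization keeps full-precision weights during training; in the forward pass each weight is mapped to {-W_n, 0, +W_p} using a threshold set as a fraction t of the largest weight magnitude, and the two scale factors W_p and W_n are themselves learned by backpropagation. A sketch of the forward ternarization only (t = 0.05 is an illustrative value; learning W_p / W_n is not shown):

```python
import numpy as np

def ternarize(w, w_p, w_n, t=0.05):
    """Map full-precision weights to {-w_n, 0, +w_p} using a magnitude threshold."""
    delta = t * np.max(np.abs(w))
    q = np.zeros_like(w)
    q[w > delta] = w_p       # learned positive scale
    q[w < -delta] = -w_n     # learned negative scale
    return q
```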
Weight Evolution during Training
Error Rate on ImageNet
1.6 Winograd Transformation
3x3 DIRECT Convolutions
Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs
3x3 WINOGRAD Convolutions:
Transform Data to Reduce Math Intensity
Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs
Winograd convolution: we need 16xC FMAs for 4 outputs, i.e. 2.25x fewer FMAs
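Per channel, the F(2,3) Winograd transform produces 2 outputs of a 3-tap filter with 4 multiplies instead of 6; tiling it in 2D gives the 16-vs-36 FMA count above. A small NumPy sketch of the 1D F(2,3) case, checked against direct convolution (B, G, A are the standard Winograd transform matrices):

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """2 outputs of a 3-tap correlation over a 4-element input tile, using 4 multiplies."""
    return At @ ((G @ g) * (Bt @ d))

d = np.random.randn(4)   # input tile
g = np.random.randn(3)   # filter
direct = np.array([d[0:3] @ g, d[1:4] @ g])      # direct sliding dot product (6 multiplies)
print(np.allclose(winograd_f23(d, g), direct))   # True
```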
2. Hardware for Efficient Inference
A common goal: minimize memory access
Google TPU
Roofline Model: Identify Performance Bottleneck
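The roofline model caps attainable throughput by either peak compute or by memory bandwidth times the arithmetic intensity (operations per byte fetched). A sketch with illustrative numbers (not measured specs):

```python
def attainable_ops(intensity_ops_per_byte, peak_ops, mem_bandwidth_bytes):
    """Roofline: performance is the lower of the compute roof and the memory roof."""
    return min(peak_ops, intensity_ops_per_byte * mem_bandwidth_bytes)

peak = 90e12   # illustrative: 90 TOPS peak compute
bw = 30e9      # illustrative: 30 GB/s memory bandwidth
for intensity in (10, 100, 1000, 10000):
    print(intensity, attainable_ops(intensity, peak, bw) / 1e12, "TOPS")
```

Low-intensity workloads sit under the slanted (memory-bound) part of the roof; only high-intensity workloads reach peak compute.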
Log Rooflines for CPU, GPU, TPU
EIE: the First DNN Accelerator for Sparse, Compressed Model:
Zero values are neither stored nor computed.
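The idea EIE exploits is that after pruning and ReLU, both the weight matrix and the activation vector are sparse, so multiply-accumulates are issued only where both operands are nonzero. A dense-array NumPy sketch of that dataflow (the real EIE stores weights in a compressed sparse column format and distributes rows across processing elements):

```python
import numpy as np

def sparse_matvec(W, x):
    """Matrix-vector product that skips zero activations and zero weights."""
    out = np.zeros(W.shape[0])
    for j, a in enumerate(x):
        if a == 0:                      # skip zero activations (dynamic sparsity)
            continue
        col = W[:, j]
        nz = np.nonzero(col)[0]         # skip zero weights (static sparsity)
        out[nz] += col[nz] * a
    return out

W = np.where(np.random.rand(8, 8) < 0.1, np.random.randn(8, 8), 0.0)  # ~90% sparse weights
x = np.maximum(np.random.randn(8), 0)                                  # ReLU activations
print(np.allclose(sparse_matvec(W, x), W @ x))                         # True
```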
EIE Architecture
Micro Architecture for each PE
Comparison: Throughput
Comparison: Energy Efficiency
3. Algorithms for Efficient Training
3.1 Parallelization
Data Parallel – Run multiple inputs in parallel
Parameter Update
Parameter updates are shared across workers (e.g. through a parameter server).
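In data parallelism each worker holds a full copy of the model and processes a different slice of the batch; the gradients are then combined and a single update is applied to the shared weights. A toy NumPy sketch of one synchronous step:

```python
import numpy as np

def parameter_server_step(params, worker_grads, lr=0.01):
    """Average the workers' gradients and apply one SGD update to the shared weights."""
    avg_grad = np.mean(worker_grads, axis=0)
    return params - lr * avg_grad

params = np.zeros(4)
worker_grads = [np.random.randn(4) for _ in range(8)]   # one gradient per worker
params = parameter_server_step(params, worker_grads)
```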
Model-Parallel Convolution – by output region (x,y)
Model Parallel Fully-Connected Layer (M x V)
Summary of Parallelism
3.2 Mixed Precision with FP16 and FP32
Mixed Precision Training
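The recipe is: keep an FP32 master copy of the weights, run the forward and backward passes in FP16, scale the loss so small gradients do not underflow in FP16, and unscale before the FP32 weight update. A toy NumPy sketch for a linear model (loss_scale = 128 is an illustrative value):

```python
import numpy as np

def mixed_precision_step(master_w, x, y, lr=0.1, loss_scale=128.0):
    """One SGD step: FP16 forward/backward with loss scaling, FP32 master-weight update."""
    w16, x16 = master_w.astype(np.float16), x.astype(np.float16)
    pred = x16 @ w16                                              # FP16 forward pass
    err16 = (pred - y.astype(np.float16)) * np.float16(loss_scale / len(y))
    grad16 = x16.T @ err16                                        # FP16 backward pass (scaled)
    grad = grad16.astype(np.float32) / loss_scale                 # unscale in FP32
    return master_w - lr * grad                                   # update FP32 master weights

w_master = np.zeros(8, dtype=np.float32)
x = np.random.randn(32, 8).astype(np.float32)
y = np.random.randn(32).astype(np.float32)
w_master = mixed_precision_step(w_master, x, y)
```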
Result comparison:
3.3 Model Distillation
The student model has a much smaller model size than the teacher.
Softened outputs reveal the dark knowledge
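Dividing the logits by a temperature T > 1 before the softmax softens the teacher's output distribution, exposing the relative probabilities of the wrong classes (the "dark knowledge") for the student to mimic. A NumPy sketch of the combined loss (T = 4 and alpha = 0.9 are illustrative hyperparameters):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target cross-entropy (at temperature T) plus hard-label cross-entropy."""
    soft_targets = softmax(teacher_logits, T)
    soft_loss = -np.sum(soft_targets * np.log(softmax(student_logits, T) + 1e-12), axis=-1).mean()
    hard_probs = softmax(student_logits)
    hard_loss = -np.log(hard_probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss

teacher = np.random.randn(16, 10)   # logits from the large teacher model
student = np.random.randn(16, 10)   # logits from the small student model
labels = np.random.randint(0, 10, size=16)
print(distillation_loss(student, teacher, labels))
```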
3.4 DSD: Dense-Sparse-Dense Training
DSD produces the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks (CNNs / RNNs / LSTMs).
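The schedule is literally dense, then sparse, then dense: train the full model, prune and retrain under the sparsity mask, then restore the pruned connections (re-initialized to zero) and retrain the full model again. A schematic sketch (train_fn and prune_fn are hypothetical callbacks standing in for the real training and pruning routines):

```python
def dsd_train(w, train_fn, prune_fn, sparsity=0.5):
    """Dense -> Sparse -> Dense training schedule (schematic sketch)."""
    w = train_fn(w, mask=None)           # 1. Dense: train the full model
    w, mask = prune_fn(w, sparsity)      # 2. Sparse: prune small weights,
    w = train_fn(w, mask=mask)           #    retrain under the fixed sparsity mask
    w = train_fn(w, mask=None)           # 3. Dense: re-enable pruned weights (at zero)
    return w                             #    and retrain the full model
```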
DSD: Intuition
DSD is General Purpose: Vision, Speech, Natural Language
DSD on Caption Generation
4. Hardware for Efficient Training
GPU / TPU
Google Cloud TPU
Future
Outlook: the Focus for Computation