Deep Learning | CS231n Study Notes -- 15. Efficient Methods and Hardware for Deep Learning

Agenda

Hardware 101: the Family


Hardware 101: Number Representation
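
As a small illustration of number representation, the sketch below reads the bit layout of two IEEE 754 floating-point formats from NumPy. It assumes the formats of interest are FP16 and FP32; the helper name `float_layout` is made up for this example.

```python
import numpy as np

def float_layout(dtype):
    """Return (sign bits, exponent bits, mantissa bits) for a float dtype."""
    info = np.finfo(dtype)
    return 1, int(info.nexp), int(info.nmant)

for dtype in (np.float16, np.float32):
    sign, exponent, mantissa = float_layout(dtype)
    info = np.finfo(dtype)
    # Fewer exponent bits mean a smaller range; fewer mantissa bits mean
    # coarser precision (a larger machine epsilon).
    print(dtype.__name__, sign, exponent, mantissa,
          "max =", info.max, "eps =", info.eps)
```

The narrower format trades range and precision for half the storage and memory traffic.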



1. Algorithms for Efficient Inference
1.1 Pruning Neural Networks


Iteratively Retrain to Recover Accuracy
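
A minimal sketch of iterative magnitude pruning, assuming "pruning" here means zeroing the weights with the smallest absolute value. The function name, the sparsity schedule, and the random weights are illustrative, and the retraining step is only marked by a comment.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights; return (pruned, mask)."""
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))      # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 5))
for sparsity in (0.25, 0.5, 0.75):            # prune a little more each round
    w, mask = magnitude_prune(w, sparsity)
    # Retraining would go here: take gradient steps on the surviving
    # weights and re-apply the mask (w *= mask) after every update.
```

Pruning in several rounds, with retraining between them, gives the surviving weights a chance to compensate before more connections are removed.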


Pruning RNN and LSTM


Accuracy improves slightly after pruning:


Pruning Changes Weight Distribution


1.2 Weight Sharing
Trained Quantization
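
A rough sketch of weight sharing, assuming the weights are clustered with 1-D k-means so that each weight is stored as a short index into a small codebook. The function name, the linear centroid initialization, and the 2-bit setting are choices made for this example; fine-tuning the centroids with gradients is omitted.

```python
import numpy as np

def kmeans_quantize(weights, bits=2, iters=20):
    """Cluster weights into 2**bits shared values; return (indices, codebook)."""
    flat = weights.ravel().astype(np.float64)
    k = 2 ** bits
    codebook = np.linspace(flat.min(), flat.max(), k)   # linear initialization
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for c in range(k):
            members = flat[idx == c]
            if members.size:                 # leave empty clusters unchanged
                codebook[c] = members.mean()
    idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
    return idx.reshape(weights.shape), codebook

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
idx, codebook = kmeans_quantize(w, bits=2)
w_shared = codebook[idx]    # reconstructed weights use at most 4 distinct values
```

Storage then needs 2 bits per weight for the indices plus four full-precision codebook entries, instead of 32 bits per weight.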


How Many Bits do We Need?


Pruning + Trained Quantization Work Together


Huffman Coding
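
To show why Huffman coding helps, the sketch below computes Huffman code lengths for a made-up sequence of quantized weight indices: frequent indices get shorter codes than rare ones. The index sequence and the function name are illustrative only.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate case: one symbol
        return {next(iter(freq)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, a = heapq.heappop(heap)        # two least frequent subtrees
        n2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2   # skewed index distribution
lengths = huffman_code_lengths(indices)
total_bits = sum(lengths[s] for s in indices)     # 28 bits
fixed_bits = 2 * len(indices)                     # 32 bits at 2 bits per index
```

The saving grows with the skew of the distribution; a uniform distribution gains nothing.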


Summary of Deep Compression


Results: Compression Ratio


SqueezeNet


Compressing SqueezeNet


1.3 Quantization
Quantizing the Weight and Activation
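
The notes do not spell out the quantization scheme, so the sketch below uses a common simple one as a stand-in: symmetric linear quantization to 8-bit integers with a single scale factor. The function names are hypothetical, and the input is assumed to contain at least one non-zero value.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a float array to int8 plus a scale."""
    scale = float(np.max(np.abs(x))) / 127.0      # assumes x is not all zeros
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
max_error = float(np.max(np.abs(x - x_hat)))      # at most about scale / 2
```

The same routine can be applied to weights and to activations; activations usually need their range estimated from sample data first.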
**Quantization Result**: 8 bits is the chosen precision
1.4 Low Rank Approximation
Low Rank Approximation for Conv: similar to the Inception module


Low Rank Approximation for FC: matrix factorization
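
One way to factorize a fully connected layer is truncated SVD, sketched below. The choice of SVD, the layer sizes, and the rank are assumptions for illustration; the weight matrix is built to be exactly rank 8 so the factorization is lossless here.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Factor W (m x n) into A (m x rank) and B (rank x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]     # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 128))   # exactly rank 8
A, B = low_rank_factors(W, rank=8)

x = rng.normal(size=128)
y_full = W @ x          # 64 * 128 = 8192 multiplies
y_low = A @ (B @ x)     # 8 * 128 + 64 * 8 = 1536 multiplies
```

For a real layer the weight matrix is only approximately low rank, so the rank controls a trade-off between speed and accuracy.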


1.5 Binary / Ternary Net
Trained Ternary Quantization
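
A sketch of the forward ternarization step only, assuming each weight is mapped to one of three values (a positive scale, zero, or a negative scale) by thresholding against a fraction of the largest magnitude. The threshold factor `t` and the two scale values are placeholders; in trained ternary quantization the scales are learned, which is not shown.

```python
import numpy as np

def ternarize(w, w_pos, w_neg, t=0.05):
    """Map each weight to +w_pos, 0, or -w_neg using a magnitude threshold."""
    threshold = t * np.max(np.abs(w))
    out = np.zeros_like(w)
    out[w > threshold] = w_pos        # large positive weights
    out[w < -threshold] = -w_neg      # large negative weights
    return out                        # everything in between stays zero

w = np.array([-1.0, -0.02, 0.0, 0.03, 0.8])
w_ternary = ternarize(w, w_pos=0.7, w_neg=0.9)
```

Each ternary weight needs only 2 bits of storage, plus two scale values per layer.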


Weight Evolution during Training


Error Rate on ImageNet


1.6 Winograd Transformation
3x3 DIRECT Convolutions


Direct convolution: each output needs 9 x C FMAs (a 3x3 window over C input channels), so 4 outputs need 9 x C x 4 = 36 x C FMAs
3x3 WINOGRAD Convolutions:
Transform Data to Reduce Math Intensity


Direct convolution: 9 x C x 4 = 36 x C FMAs for 4 outputs
Winograd convolution: 16 x C FMAs for 4 outputs, i.e. 36 / 16 = 2.25x fewer FMAs
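
The counts above can be checked on a single channel with the F(2x2, 3x3) Winograd algorithm. The transform matrices below are the standard ones from Lavin and Gray's formulation; the function names and the random test data are illustrative.

```python
import numpy as np

BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)      # input transform
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)           # filter transform
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)     # output transform

def winograd_2x2_3x3(tile, kernel):
    """2x2 outputs from a 4x4 input tile and a 3x3 kernel."""
    U = G @ kernel @ G.T       # 4x4 transformed filter, can be precomputed
    V = BT @ tile @ BT.T       # 4x4 transformed input tile
    M = U * V                  # the 16 element-wise multiplies
    return AT @ M @ AT.T

def direct_2x2_3x3(tile, kernel):
    """Reference: sliding-window correlation, 9 multiplies per output."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(tile[i:i + 3, j:j + 3] * kernel)
    return out

rng = np.random.default_rng(0)
tile, kernel = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
y_winograd = winograd_2x2_3x3(tile, kernel)
y_direct = direct_2x2_3x3(tile, kernel)
```

The transforms themselves use only additions and multiplications by constants, which is why only the 16 element-wise products are counted.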
2. Hardware for Efficient Inference
Hardware for Efficient Inference: a common goal is to minimize memory access


Google TPU


Roofline Model: Identify Performance Bottleneck
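
The roofline model bounds attainable throughput by the smaller of peak compute and memory bandwidth times operational intensity. The sketch below encodes that formula; the hardware numbers are invented for illustration and do not describe any real chip.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Roofline bound: min(compute roof, memory roof at this intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)

PEAK = 10_000        # hypothetical peak compute, GFLOP/s
BANDWIDTH = 500      # hypothetical memory bandwidth, GB/s
ridge = PEAK / BANDWIDTH   # intensity where the two roofs meet (FLOP/byte)

low = attainable_gflops(PEAK, BANDWIDTH, 5)    # memory-bound: 2500 GFLOP/s
high = attainable_gflops(PEAK, BANDWIDTH, 50)  # compute-bound: 10000 GFLOP/s
```

A workload left of the ridge point is limited by memory bandwidth, so adding compute units does not help it; raising operational intensity does.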


Log Rooflines for CPU, GPU, TPU


EIE: the First DNN Accelerator for Sparse, Compressed Model:
Zero values are neither stored nor computed
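
The idea can be sketched in software as a sparse matrix-vector product that stores only non-zero weights (in compressed sparse column form) and skips columns whose input activation is zero. This is a simplified illustration, not a model of the EIE hardware; the helper names and the small matrix are made up.

```python
def to_csc(dense):
    """Store only the non-zero weights, column by column."""
    n_rows, n_cols = len(dense), len(dense[0])
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))      # end of column j in values
    return values, row_idx, col_ptr, n_rows

def sparse_matvec(csc, x):
    """Compute y = W x, touching only non-zero weights and activations."""
    values, row_idx, col_ptr, n_rows = csc
    y = [0.0] * n_rows
    macs = 0
    for j, xj in enumerate(x):
        if xj == 0:                      # skip zero activations entirely
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[p]] += values[p] * xj
            macs += 1
    return y, macs

W = [[0, 2, 0, 0],
     [1, 0, 0, 3],
     [0, 0, 0, 0]]
x = [1, 0, 2, 4]
y, macs = sparse_matvec(to_csc(W), x)    # a dense product would need 12 MACs
```

Here only 3 of 12 weights are stored and only 2 multiply-accumulates are executed.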


EIE Architecture


Micro Architecture for each PE


Comparison: Throughput


Comparison: Energy Efficiency


3. Algorithms for Efficient Training
3.1 Parallelization
Data Parallel – Run multiple inputs in parallel


Parameter Update
The workers update a shared set of parameters
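
A toy simulation of data parallelism with a shared parameter update: the batch is split across four simulated workers, each computes a gradient on its shard, and the averaged gradient updates the shared weights. The linear-regression loss, the worker count, and the learning rate are assumptions made for this example.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of 0.5 * mean((X w - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)

# Each of 4 simulated workers sees an equal slice of the batch.
shards = zip(np.split(X, 4), np.split(y, 4))
worker_grads = [grad_mse(w, Xs, ys) for Xs, ys in shards]

avg_grad = np.mean(worker_grads, axis=0)   # reduce step on the shared parameters
w_new = w - 0.1 * avg_grad                 # one update, then broadcast to workers
full_grad = grad_mse(w, X, y)              # single-worker reference
```

With equal shard sizes the averaged gradient matches the full-batch gradient, so the parallel step is mathematically the same as the serial one.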


Model-Parallel Convolution – by output region (x,y)


Model Parallel Fully-Connected Layer (M x V)
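
A small sketch of one way to parallelize the matrix-vector product M x V across workers, assuming the matrix is split by output rows so that each worker computes a slice of the output. The sizes and the worker count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))      # layer weights: 6 outputs, 4 inputs
v = rng.normal(size=4)           # input activations, sent to every worker

row_blocks = np.split(M, 3, axis=0)          # each worker owns 2 output rows
partial = [block @ v for block in row_blocks]
y = np.concatenate(partial)                  # gather the output slices
```

Each worker holds only a third of the weights, at the cost of broadcasting the input and gathering the output.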


Summary of Parallelism


3.2 Mixed Precision with FP16 and FP32


Mixed Precision Training
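
One common motivation for keeping an FP32 copy of the weights is that small updates vanish in FP16. The sketch below shows this with a single scalar weight; the learning rate and gradient values are invented for the example.

```python
import numpy as np

lr = np.float32(1e-3)            # hypothetical learning rate
grad = np.float16(0.1)           # gradient computed in FP16

# Update applied directly to an FP16 weight: the step (about 1e-4) is
# smaller than FP16 can resolve near 1.0, so the weight does not move.
w16 = np.float16(1.0)
w16_after = np.float16(w16 - np.float16(lr) * grad)

# Update applied to an FP32 master copy, then cast down to FP16 for the
# next forward pass; the master copy keeps accumulating small steps.
w32 = np.float32(1.0)
w32_after = np.float32(w32 - lr * np.float32(grad))
w16_for_forward = np.float16(w32_after)
```

After enough steps the accumulated change in the FP32 copy becomes large enough to show up in the FP16 weights as well.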


Comparison of results:


3.3 Model Distillation
The student model is much smaller than the teacher model


Softened outputs reveal the dark knowledge
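
Softening is usually done by dividing the logits by a temperature before the softmax. The sketch below compares a teacher's output at temperature 1 and temperature 10; the logits and class order are invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [10.0, 5.0, 1.0]     # hypothetical scores for three classes
hard = softmax(teacher_logits, temperature=1.0)    # nearly one-hot
soft = softmax(teacher_logits, temperature=10.0)   # relative similarities visible
```

At temperature 1 almost all probability sits on the top class; at temperature 10 the ranking of the other classes becomes visible, which is the extra signal the student trains on.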



3.4 DSD: Dense-Sparse-Dense Training


DSD produces the same model architecture but can find a better optimization solution: it arrives at a better local minimum and achieves higher prediction accuracy across a wide range of deep neural networks, including CNNs, RNNs, and LSTMs.
DSD: Intuition


DSD is General Purpose: Vision, Speech, Natural Language


DSD on Caption Generation


4. Hardware for Efficient Training
GPU / TPU


Google Cloud TPU


Future

Outlook: the Focus for Computation

