推荐系统论文进阶|CTR预估论文精读(十一)--Deep Interest Evolution Network(DIEN) 人工智能|深度学习|python|DIEN

1. 摘要 Deep Interest Evolution Network (DIEN) uses interest extractor layer to capture temporal interests from history behavior sequence. At this layer, an auxiliary loss is proposed to supervise interest extracting at each step. As user interests are diverse, especially in the e-commerce system, interest evolving layer is proposed to capture interest evolving process that is relative to the target item. At interest evolving layer, attention mechanism is embedded into the sequential structure novelly, and the effects of relative interests are strengthened during interest evolution.
2. 核心创新点
文章图片

2.1 Interest Extractor层
引入辅助loss帮助GRU的隐状态更好地表示用户兴趣；
Interest Extractor层：合适的兴趣表征是兴趣演化模型的基础。在Interest Extractor层，模型使用 GRU 来对用户行为之间的依赖进行建模；GRU的输入是用户按时间排序的行为序列，也就是行为对应的商品。
引入辅助loss的原因：
作者指出 GRU 只能学习行为之间的依赖，并不能很好反映用户兴趣。L t a r g e t L_{target} Ltarget? 只包含了最终兴趣的监督信息（因为最终兴趣导致了点击行为），而中间的历史状态h t h_t ht? 并不能得到监督信息来指导学习。我们知道兴趣可能会导致产生多个连续行为。所以模型引入辅助 loss。
具体来说就是用行为b t + 1 b_{t+1} bt+1? 来指导h t h_t ht? 的学习，正样本就是当前点击对应的下一个item，负样本就是从 item set 中随机抽取的 item。
引入辅助loss的好处：

正如作者强调的，辅助loss可以帮助GRU的隐状态更好地表示用户兴趣；（更好地表征用户兴趣特征）
RNN在长序列建模场景下梯度传播可能并不能很好的影响到序列开始部分，如果在序列的每个部分都引入一个辅助的监督信号，则可一定程度降低优化难度；（降低长序列特征的训练难度）
辅助loss可以给embedding层的学习带来更多语义信息，学习到item对应的更好的embedding；（为用户行为序列中item对应的embedding引入更多语义）

2.2 Interest Evolving层
引入带注意力更新门的GRU来模拟用户对不同商品的点击行为受不同兴趣的影响；
Interest Evolving 层：随着外部环境和内部认知的变化，用户兴趣也不断变化。用户对不同商品的点击行为受不同兴趣的影响。
Interest Evolving 层对与 target item 相关的兴趣演化轨迹进行建模。
作者提出了带注意力更新门的 GRU 结果也就是AUGRU。通过使用兴趣状态和 target item 计算得到的相关性，AUGRU 增强相关兴趣的影响，同时减弱不相关兴趣的影响。
3. 具体实现 3.1 Interest Extractor层
Interest Extractor层：电商网站用户的行为数据一般都很丰富。对于行为之间的依赖进行建模，作者使用GRU结果（权衡效果和性能）。
普通 GRU 的计算方式如下：
u t = σ ( W u i t + U u h t ? 1 + b u ) r t = σ ( W r i t + U r h t ? 1 + b r ) h ~ t = tanh ? ( W h i t + r t ° U h h t ? 1 + b h ) h t = ( 1 ? u t ) ° h t ? 1 + u t ° h ~ t \begin{aligned} \mathbf{u}_{t} &=\sigma\left(W^{u} \mathbf{i}_{t}+U^{u} \mathbf{h}_{t-1}+\mathbf{b}^{u}\right) \\ \mathbf{r}_{t} &=\sigma\left(W^{r} \mathbf{i}_{t}+U^{r} \mathbf{h}_{t-1}+\mathbf{b}^{r}\right) \\ \tilde{\mathbf{h}}_{t} &=\tanh \left(W^{h} \mathbf{i}_{t}+\mathbf{r}_{t} \circ U^{h} \mathbf{h}_{t-1}+\mathbf{b}^{h}\right) \\ \mathbf{h}_{t} &=\left(\mathbf{1}-\mathbf{u}_{t}\right) \circ \mathbf{h}_{t-1}+\mathbf{u}_{t} \circ \tilde{\mathbf{h}}_{t} \end{aligned} ut?rt?h~t?ht??=σ(Wuit?+Uuht?1?+bu)=σ(Writ?+Urht?1?+br)=tanh(Whit?+rt?°Uhht?1?+bh)=(1?ut?)°ht?1?+ut?°h~t??
其中σ \sigma σ 是 sigmoid 开关信息， ° \circ ° 是 element-wise product， W u , W r , W h ∈ R n H × n I W^u, W^r, W^h \in \mathbb{R} ^{n_H \times n_I} Wu,Wr,Wh∈RnH?×nI? ， U z , U r , U h ∈ n H × n H U^z, U^r, U^h \in n_H \times n_H Uz,Ur,Uh∈nH?×nH?，其中n H n_H nH? 是隐藏层size， n I n_I nI? 是输入size， i t i_t it? 是GRU 输入， i t = e b [ t ] i_t = e_b [t] it?=eb?[t] 表示用户采取的第 t 个行为， h t h_t ht? 表示第 t 个隐藏状态；
接下来介绍辅助损失L a u x L_{aux} Laux?，假设有 N 对行为 embedding 序列{ e b i , e ^ b i } ∈ D B , i ∈ 1 , 2 , ? ? , N \left\{\mathbf{e}_{b}^{i}, \hat{\mathbf{e}}_{b}^{i}\right\} \in \mathcal{D}_{\mathcal{B}}, i \in 1,2, \cdots, N {ebi?,e^bi?}∈DB?,i∈1,2,?,N，其中e b i ∈ R T × n E \mathbf{e}_{b}^{i} \in \mathbb{R}^{T \times n_E} ebi?∈RT×nE? 表示正样本行为序列， e ^ b i ∈ R T × n E \hat{\mathbf{e}}_{b}^{i} \in \mathbb{R}^{T \times n_E} e^bi?∈RT×nE? 表示负样本行为序列， T T T 是用户行为长度， e b i [ t ] ∈ G \mathbf{e}_{b}^{i}[t] \in \mathcal{G} ebi?[t]∈G 表示用户i i i 点击的第t t t 个 item embedding， G \mathcal{G} G 是总的 item 候选集， e ^ b i [ t ] ∈ G ? e b i [ t ] \hat{\mathbf{e}}_{b}^{i}[t] \in \mathcal{G}-\mathbf{e}_{b}^{i}[t] e^bi?[t]∈G?ebi?[t] 表示总候选集中出去当前当前第 t 个点击item之外的候选集（负样本序列），则辅助loss函数如下：
L a u x = ? 1 N ( ∑ i = 1 N ∑ t log ? σ ( h t i , e b i [ t + 1 ] ) + log ? ( 1 ? σ ( h t i , e ^ b i [ t + 1 ] ) ) ) \begin{aligned} L_{a u x}=-& \frac{1}{N}\left(\sum_{i=1}^{N} \sum_{t} \log \sigma\left(\mathbf{h}_{t}^{i}, \mathbf{e}_{b}^{i}[t+1]\right)\right.\\ &\left.+\log \left(1-\sigma\left(\mathbf{h}_{t}^{i}, \hat{\mathbf{e}}_{b}^{i}[t+1]\right)\right)\right) \end{aligned} Laux?=??N1?(i=1∑N?t∑?logσ(hti?,ebi?[t+1])+log(1?σ(hti?,e^bi?[t+1])))?
其中σ ( x 1 , x 2 ) = 1 1 + exp ? ( ? [ x 1 , x 2 ] ) \sigma\left(\mathbf{x}_{\mathbf{1}}, \mathbf{x}_{\mathbf{2}}\right)=\frac{1}{1+\exp \left(-\left[\mathbf{x}_{1}, \mathbf{x}_{2}\right]\right)} σ(x1?,x2?)=1+exp(?[x1?,x2?])1? 是 sigmoid 激活函数， h t i h_t^i hti? 是用户i i i 对应 GRU 第t t t 个隐藏状态；
则 CTR 模型最终的损失函数为：
L = L t a r g e t + α ? L a u x L = L_{target} + \alpha * L_{aux} L=Ltarget?+α?Laux?
其中α \alpha α 是平衡最终预测和兴趣表示的超参数。

其中辅助loss中的正样本采用hidden state中对应的下一个点击item embedding，而负样本采用抛开此点击item embedding中负采样得到的负样本（实际使用的是曝光未点击中的样本中采样得到）；
文章中最后提到：As for industrial dataset, we take the ad that been shown to user while not been clicked as negative instance；也就是从曝光给该用户但该用户并没有点击中的item里采样，但具体是用户的所有曝光未点击的item还是单次下发曝光未点击的item，文中并未详细说明。

3.2 Interest Evolving层

Interest Evolving层：

这部分结合了注意力机制中的局部激活能力和GRU的序列学习能力来建模用户兴趣演化。
该层 GRU 的输入就是上一层 GRU 的输出， i t ′ = h t i_t^\prime = h_t it′?=ht?，输出是h t ′ h_t^\prime ht′? ，最后一个状态h T ′ h_T^\prime hT′? 作为序列层输出和其他特征concat一起传给全连接层。
attention 部分系数计算方式如下：
a t = exp ? ( h t W e a ) ∑ j = 1 T exp ? ( h j W e a ) a_{t}=\frac{\exp \left(\mathbf{h}_{t} W \mathbf{e}_{a}\right)}{\sum_{j=1}^{T} \exp \left(\mathbf{h}_{j} W \mathbf{e}_{a}\right)} at?=∑j=1T?exp(hj?Wea?)exp(ht?Wea?)?
其中e α e_\alpha eα? 通过候选 item 不同 field 特征的concat 得到（如 candidate item embedding 和 candidate item cate embedding）：
【推荐系统论文进阶|CTR预估论文精读(十一)--Deep Interest Evolution Network(DIEN)】其中如何将 Attention 得到的权重引入 GRU 一种有三种实现方式：

GRU with attentional input(AIGRU)：较为简单，直接将 attention系数和输入相乘：
i t ′ = h t ? α t i_t^\prime = h_t * \alpha_t it′?=ht??αt?
Attention based GRU(AGRU)：采用问答领域文章提到的一种方法，直接将 attention 系数来替换 GRU 的update gate，直接对隐状态进行更新：
h t ′ = ( 1 ? a t ) ? h t ? 1 ′ + a t ? h ~ t ′ \mathbf{h}_{t}^{\prime}=\left(1-a_{t}\right) * \mathbf{h}_{t-1}^{\prime}+a_{t} * \tilde{\mathbf{h}}_{t}^{\prime} ht′?=(1?at?)?ht?1′?+at??h~t′?
GRU with attentional update gate（AUGRU）
虽然 AGRU 使用 attention 系数来直接控制隐状态的更新，但是它使用了一个标量（ α t \alpha_t αt?）来代替了update gate（u t u_t ut? ），也就是忽略了不同维度重要的区别。
作者提出了增加了 attentional update gate 的 GRU 结构实现 attention 机制和 GRU 的无缝结合：
u ~ t ′ = a t ? u t ′ h t ′ = ( 1 ? u ~ t ′ ) ° h t ? 1 ′ + u ~ t ′ ° h ~ t ′ \begin{aligned} \tilde{\mathbf{u}}_{t}^{\prime} &=a_{t} * \mathbf{u}_{t}^{\prime} \\ \mathbf{h}_{t}^{\prime} &=\left(1-\tilde{\mathbf{u}}_{t}^{\prime}\right) \circ \mathbf{h}_{t-1}^{\prime}+\tilde{\mathbf{u}}_{t}^{\prime} \circ \tilde{\mathbf{h}}_{t}^{\prime} \end{aligned} u~t′?ht′??=at??ut′?=(1?u~t′?)°ht?1′?+u~t′?°h~t′??

4. 总结作者提出了带注意力更新门的 GRU 结果也就是 AUGRU，通过使用兴趣状态和 target item 计算得到的相关性，AUGRU 增强相关兴趣的影响，同时减弱不相关兴趣的影响，同时，为了更好地建模用户行为序列特征，作者使用了辅助 loss 函数，。

DIEN 在建模行为序列的同时引入 Attention 机制，建模与当前候选 item 相关的兴趣演化，这点不难想到，巧妙的是通过u ~ t ′ = a t ? u t ′ \tilde{\mathbf{u}}_{t}^{\prime} =a_{t} * \mathbf{u}_{t}^{\prime} u~t′?=at??ut′? 将 Attention score 同 GRU 的 update gate 结合，实现 Attention 机制与 GRU 的巧妙连接；
第二点巧妙之处在于引入行为序列训练的辅助 loss 函数，用行为b t + 1 b_{t+1} bt+1? 来指导 GRU 隐藏状态h t h_t ht? 的学习（GRU 只能学习行为之间的依赖，并不能很好反映用户兴趣），更好地建模了用户兴趣特征，降低行为序列特征训练难度，提升行为序列 item 的特征表征效果；

5. 参考文献 [1] Deep Interest Evolution Network for Click-Through Rate Prediction
[2] 详解阿里之Deep Interest Evolution Network(AAAI 2019)