09.C2W4.Word Embeddings with Neural Networks

Overview
Basic Word Representations
- Integers
- One-hot vectors
Word Embeddings
- Meaning as vectors
- Word embedding vectors
Word embedding process
Word Embedding Methods
- Basic word embedding methods
- Advanced word embedding methods
Continuous Bag-of-Words Model
- Center word prediction: rationale
- Creating a training example
- From corpus to training
Cleaning and Tokenization
- Cleaning and tokenization matters
- Example in Python
  - corpus
  - libraries
  - code
Sliding Window of Words in Python
Transforming Words into Vectors
- Transforming center words into vectors
- Transforming context words into vectors
- Final prepared training set
Architecture of the CBOW Model
Dimensions
- single input
- batch input
Activation Functions
- Rectified Linear Unit (ReLU)
- Softmax
- Softmax: example
Training a CBOW Model: Cost Function
- Loss
- Cross-entropy loss
Training a CBOW Model: Forward Propagation
- Forward propagation
- Cost
Training a CBOW Model: Backpropagation and Gradient Descent
- Backpropagation
- Gradient descent
Extracting Word Embedding Vectors
- option 1
- option 2
- option 3
Evaluating Word Embeddings
- Intrinsic evaluation
- Extrinsic Evaluation

Overview

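This section covers some basic applications of word embeddings, followed by more advanced ones.

Learning objectives (familiarity with neural networks is assumed):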
●Identify the key concepts of word representations
●Generate word embeddings
●Prepare text for machine learning
●Implement the continuous bag-of-words model

Basic Word Representations

Integers

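The simplest approach assigns each word a unique integer. Its advantage is simplicity.
Its drawback is that the integers carry no semantic information about the words.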

One-hot vectors

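A one-hot vector represents a word as a 0/1 vector whose length equals the vocabulary size.
Each word is represented by a 1 in its own position and 0 everywhere else.

Integer encodings and one-hot vectors can be converted into each other.

One-hot encoding is simple and implies no ordering among the words.
However, it still carries no semantic information, and the vectors become very long when the vocabulary is large.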
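The conversion between integer encodings and one-hot vectors can be sketched as follows. This is a minimal illustration: the toy vocabulary, the index assignment, and the helper names `index_to_one_hot` and `one_hot_to_index` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy vocabulary; the index assignment is an assumption.
vocab = ["am", "because", "happy", "i", "learning"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def index_to_one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(vocab_size)
    vec[index] = 1.0
    return vec

def one_hot_to_index(vec):
    """Recover the integer encoding: the position of the single 1."""
    return int(np.argmax(vec))

happy_vec = index_to_one_hot(word_to_index["happy"], len(vocab))
print(happy_vec.tolist())           # [0.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot_to_index(happy_vec))  # 2
```

The vector length grows with the vocabulary, which is the size problem noted above.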

Word Embeddings

Meaning as vectors

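Can a vector carry meaning? Yes. A low-dimensional example shows how.
The original figure is a sentiment-scoring example that places words on an axis according to their sentiment score.

It contains eight words: spider, boring, kitten, happy, anger, paper, excited, rage.
The words appear in four groups of two, each word followed by a sentiment score in parentheses that indicates how strongly it is associated with a sentiment.
Group 1: spider (-2.52), boring (-2.08). These scores are negative, suggesting an association with negative sentiment.
Group 2: kitten (-1.53), happy (-0.91). These scores are closer to zero but still negative, suggesting mildly negative or neutral sentiment.
Group 3: anger (0.03), paper (1.09). These scores run from near zero to positive, suggesting neutral to positive sentiment.
Group 4: excited (2.31), rage. No score is given for rage, but from context it is probably associated with strongly negative sentiment.
A scale at the bottom of the figure runs from -2 to 2 and is divided into three regions: negative, 0 (neutral) and positive.
A y-axis can also be added to represent how abstract or concrete a word is.
Such a low-dimensional representation loses some precision. For example, spider and snake end up at the same point, which is not reasonable.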

Word embedding vectors

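Word embedding vectors have two advantages:
Low dimension (relative to one-hot vectors)
Embed meaning
Note:
One-hot vectors and word embedding vectors are both word vectors. In many contexts, however, "word vectors" refers specifically to word embedding vectors, which are also called word embeddings.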

Word embedding process

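Corpus: the corpus matters for generating word embeddings. To embed the vocabulary of a specific domain, include as much text from that domain as possible, because a word's meaning depends heavily on its context. For example, apple is a fruit in an agricultural corpus and a company in a technology corpus.
Embedding method: typically a machine learning model trained in a self-supervised way.
Overall, the corpus is fed to the embedding method, which produces the word embeddings.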

Word Embedding Methods

Basic word embedding methods

●word2vec (Google, 2013)
○Continuous bag of words (CBOW)
○Continuous skip gram / Skip gram with negative sampling (SGNS)
●Global Vectors (GloVe) (Stanford, 2014)
●fastText (Facebook, 2016)
○Supports out of vocabulary (OOV) words
○Fast to train

Advanced word embedding methods

Deep learning, contextual embeddings
●BERT (Google, 2018)
●ELMo (Allen Institute for AI, 2018)
●GPT-2 (OpenAI, 2019)
…
These are all pretrained models that can be fine-tuned.

Continuous Bag-of-Words Model

Center word prediction: rationale

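Word embeddings are a by-product of the CBOW task. The main task is prediction: given the context words, predict the center word.
Words are related to their context. In the original figure's example, given a large enough corpus, the model learns to predict that the missing word is related to dogs.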

Creating a training example

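Center word: the word to be predicted. In this example it is "happy".
Context words: the words surrounding the center word, which supply the context. In this example they are "I", "am", "because" and "I" ("I" appears twice).
Window size: the total number of words in the window, including the center word. In this example it is 5.
Context half-size: the number of context words taken on each side of the center word. In this example it is 2, so the window extends 2 words to the left and 2 words to the right of the center word.
Window: the center word together with the context words on both sides. With a context half-size of C, the window size is 2C + 1.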

From corpus to training

Following the example above, sliding a window of size 5 over "I am happy because I am learning" produces these training examples:

Context words	Center word
I am because I	happy
am happy I am	because
happy because am learning	I

Cleaning and Tokenization

Cleaning and tokenization matters

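Data cleaning is an important preprocessing step. Its goal is to improve the quality of the text so that it is better suited to later analysis and model training.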
●Letter case
●Punctuation
●Numbers
●Special characters
●Special words

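Letter case:
Cleaning may convert all text to lowercase (or uppercase) to remove differences that come only from capitalization.
For example, "Hello" and "hello" both become "hello", so the model does not treat them as two different words.

Punctuation:
Cleaning may delete or replace punctuation marks, because they are unimportant for some NLP tasks or can interfere with the model.
For example, removing the exclamation mark and question mark from "Hello! How are you?" gives "Hello How are you".

Numbers:
Cleaning usually replaces or deletes numbers, because they carry no meaning for some text analysis tasks or introduce noise.
For example, deleting "3" from "I have 3 apples" gives "I have apples".

Special characters:
Special characters are non-alphanumeric symbols such as @, #, $ and %. Removing them simplifies the text and keeps them from interfering with the model.
For example, deleting "@" and "." from "email@example.com" gives "emailexamplecom".

Special words:
Cleaning may remove words that are common but unhelpful for the analysis, such as stop words ("and", "the" and so on) or particular industry terms.
For example, removing stop words such as "the" and "over" from "The quick brown fox jumps over the lazy dog".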

Example in Python

corpus

Who ❤️"word embeddings" in 2020? I do!!!

libraries

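# pip install nltk
# pip install emoji
import re  # regular expressions, used below to normalize punctuation
import nltk
from nltk.tokenize import word_tokenize
import emoji
nltk.download('punkt')  # download the pre-trained Punkt tokenizer for English
# newer NLTK releases may also need: nltk.download('punkt_tab')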

code

corpus = 'Who ❤️"word embeddings" in 2020? I do!!!'
data = re.sub(r'[,!?;-]+', '.', corpus)

Result:
Who ❤️"word embeddings" in 2020. I do.

data = nltk.word_tokenize(data) # tokenize string to words

Result:
['Who', '❤️', '``', 'word', 'embeddings', "''", 'in', '2020', '.', 'I', 'do', '.']

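# keep alphabetic tokens, full stops and emoji; lowercase everything
data = [ch.lower() for ch in data
        if ch.isalpha()
        or ch == '.'
        or emoji.emoji_count(ch) > 0  # emoji.get_emoji_regexp() was removed in emoji 2.0
       ]

Result:
['who', '❤️', 'word', 'embeddings', 'in', '.', 'i', 'do', '.']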

Sliding Window of Words in Python

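def get_windows(words, C):
    i = C  # index of the first center word
    while i < len(words) - C:
        center_word = words[i]
        # C words to the left of the center plus C words to the right
        context_words = words[(i - C):i] + words[(i + 1):(i + C + 1)]
        yield context_words, center_word
        i += 1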

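With C = 2, i starts at 2, the index of the first center word, happy. The loop stops once i reaches len(words) - C, so the last center word is the third word from the end, and i advances by one word per iteration.
yield turns the function into a generator, so it hands back one (context words, center word) pair per iteration.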

for x, y in get_windows(
    ['i', 'am', 'happy', 'because', 'i', 'am', 'learning'],
    2
):
    print(f'{x}\t{y}')

Result:
['i', 'am', 'because', 'i']	happy
['am', 'happy', 'i', 'am']	because
['happy', 'because', 'am', 'learning']	i

Transforming Words into Vectors

With the context words and center words in hand, the next step is to turn them into vectors.

Transforming center words into vectors

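Corpus: I am happy because I am learning
Vocabulary: am, because, happy, I, learning
Each center word is represented by its one-hot vector:
Word	One-hot vector
am	[1; 0; 0; 0; 0]
because	[0; 1; 0; 0; 0]
happy	[0; 0; 1; 0; 0]
I	[0; 0; 0; 1; 0]
learning	[0; 0; 0; 0; 1]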

Transforming context words into vectors

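The context words are represented by the average of their one-hot vectors. When the center word is happy, the context words are I, am, because, I, so the average is (am + because + 2·I) / 4 = [0.25; 0.25; 0; 0.5; 0].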

Final prepared training set

Context words	Context words vector	Center word	Center word vector
I am because I	[0.25; 0.25; 0; 0.5; 0]	happy	[0; 0; 1; 0; 0]
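The row above can be reproduced with a short sketch. The helper names `one_hot` and `context_vector` are assumptions for illustration, and the words are lowercased so that "I" becomes "i".

```python
import numpy as np

# Vocabulary from the example corpus, lowercased and sorted.
vocab = ["am", "because", "happy", "i", "learning"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector of a single word."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def context_vector(context_words):
    """Average of the one-hot vectors of the context words."""
    return np.mean([one_hot(w) for w in context_words], axis=0)

x = context_vector(["i", "am", "because", "i"])  # input for center word "happy"
y = one_hot("happy")                              # target
print(x.tolist())  # [0.25, 0.25, 0.0, 0.5, 0.0]
print(y.tolist())  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Because "i" occurs twice among the four context words, its position gets 2/4 = 0.5.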

Architecture of the CBOW Model

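CBOW is a shallow feed-forward neural network with an input layer, a single hidden layer, and an output layer. The hidden and output layers each have a weight matrix and a bias vector, followed by an activation function that adds non-linearity.
Input layer: receives the context vector, that is, the average of the one-hot vectors of the context words taken from the text "I am happy because I am learning". The input is this numeric vector, not the raw text.

Context words and center word: the context words form the input, and the center word is the target that the network learns to predict.

W1, W2 (weights): the weight matrices of the network. W1 (N×V) connects the input layer to the hidden layer, and W2 (V×N) connects the hidden layer to the output layer.

b1, b2 (biases): the bias vectors added before the activation function of the hidden layer (b1) and of the output layer (b2).

Hidden layer: follows the input layer. Its N neurons transform the input and extract features.

Output layer: follows the hidden layer. It has V neurons, one per word in the vocabulary, and produces the final prediction.

Vector: the context words and the center word are each converted into a fixed-size numeric vector so that the network can process them.

ReLU (Rectified Linear Unit): a common activation function that adds non-linearity and helps the model learn more complex features.

softmax: an activation function used in the output layer. In multi-class tasks it turns the outputs into a probability distribution.

V = 5: the vocabulary size. Because one-hot encoding is used, it is also the dimension of the input vector.

X: the input vector (or, for a batch, the input matrix).

Other hyperparameters can also be configured, such as N, the word embedding size.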

Dimensions

single input

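For a single input, x is a V×1 column vector, W1 is N×V and b1 is N×1, so z1 and h are N×1. W2 is V×N and b2 is V×1, so z and ŷ are V×1.
If the inputs are row vectors instead of column vectors, transpose the matrices and reverse the order of the factors in each matrix product, for example $z_1 = x W_1^\intercal + b_1$.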

batch input

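The previous part used a single example to show the dimensions of each part of CBOW. In practice, to speed things up, a batch of examples is usually passed in at once. batch_size is a hyperparameter; the original figure shows the case batch_size = m.
The m column vectors are placed side by side to form the V×m input matrix X.
The bias is written here as a capital B. The original b1 is N×1; when it is added to the N×m matrix $W_1 X$, NumPy broadcasts it automatically to N×m.

Note how the vectors in the input matrix correspond to the predictions in the output matrix (highlighted in green in the original figure): the i-th column of X yields the i-th column of the prediction matrix.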
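The batch dimensions and the broadcasting of the bias can be checked with a small sketch. The sizes V = 5, N = 3, m = 4 and the random values are assumptions chosen only to make the shapes visible.

```python
import numpy as np

V, N, m = 5, 3, 4  # vocabulary size, embedding size, batch size (toy values)
rng = np.random.default_rng(0)

W1 = rng.normal(size=(N, V))  # weights: input -> hidden
b1 = rng.normal(size=(N, 1))  # bias as a column vector
X = rng.random(size=(V, m))   # m input column vectors side by side

# b1 has shape (N, 1); NumPy broadcasts it across the m columns.
Z1 = W1 @ X + b1
print(Z1.shape)  # (3, 4)

# The explicit matrix B1 repeats b1 in every column and gives the same result.
B1 = np.tile(b1, (1, m))
print(np.allclose(Z1, W1 @ X + B1))  # True
```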

Activation Functions

Rectified Linear Unit (ReLU)

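ReLU needs little introduction. It has many variants, such as Leaky ReLU.
The input passes through $W_1$ and $b_1$, then through ReLU:
$z_1 = W_1 x + b_1\\ h= ReLU(z_1)$
ReLU is defined as:
$ReLU(x)=\max(0,x)$
Its graph is flat at 0 for negative inputs and follows the line y = x for positive inputs.

ReLU is applied element-wise, so every negative entry of $z_1$ becomes 0 in $h$ and every positive entry is left unchanged.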
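A minimal sketch of ReLU applied to a hidden-layer pre-activation. The values in `z1` are made up for illustration.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

z1 = np.array([[5.1], [-6.3], [3.1], [-4.6]])  # made-up column vector
h = relu(z1)
print(h.ravel().tolist())  # [5.1, 0.0, 3.1, 0.0]
```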

Softmax

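Softmax takes a linear transformation of the hidden layer output as its input:
$z = W_2 h + b_2\\ \hat y=softmax(z)$
Softmax turns a set of real numbers into numbers between 0 and 1 that sum to 1, so they can be read as probabilities.
For the CBOW model, the output is the predicted probability of each vocabulary word being the center word.

The formula for $\hat y_i$ is given below. It exponentiates each $z_i$ and divides by the sum of all the exponentials, which normalizes the outputs so that they sum to 1.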
$\hat y_i=\cfrac{e^{z_i}}{\sum_{j=1}^Ve^{z_j}}$

Softmax: example

The prediction is happy, because its probability is the largest.
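The prediction step can be sketched as below. The scores in `z` are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability trick that is an addition here, not part of the formula above.

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # shifting by max(z) does not change the result
    return e / np.sum(e)

vocab = ["am", "because", "happy", "i", "learning"]
z = np.array([9.0, 8.0, 11.0, 10.0, 8.5])  # made-up output-layer scores
y_hat = softmax(z)

print(y_hat.round(3))                # one probability per vocabulary word
print(vocab[int(np.argmax(y_hat))])  # happy
```

The predicted word is the one whose position holds the largest probability.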

Training a CBOW Model: Cost Function

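Loss

In machine learning, loss is the difference, or error, between a model's prediction and the actual value. Training aims to minimize the loss function so that the model's predictions move closer to the true values.

The loss function measures model performance by quantifying the gap between predictions and true values. Different tasks use different loss functions. For example:
Classification problems commonly use the cross-entropy loss.
Regression problems commonly use the mean squared error (MSE).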

Cross-entropy loss

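CBOW uses the cross-entropy loss:
$J=-\sum_{k=1}^Vy_k\log \hat y_k$
Here $y$ is the one-hot vector of the actual center word and $\hat y$ is the vector of predicted probabilities, both of length V.
Take the corpus:
I am happy because I am learning
For the first five words the center word is happy, so $y$ has a 1 in the position of happy and 0 elsewhere.

Following the formula, take the logarithm of each predicted probability, multiply it by the corresponding entry of $y$, sum the products, and negate the sum.

When the prediction is close to the true value, that is, when $\hat y$ gives happy a high probability, the loss is small.
If the model instead predicts am as the center word, happy receives a low probability and the loss is large.
Because $y$ is one-hot, the loss simplifies to:
$J=-\log \hat y_{actual\space word}$
For example, if the predicted probability of the actual center word is 0.01:

J = -log 0.01 = 4.61. Note that log here denotes the natural logarithm ln.
Plotted against the predicted probability of the actual word, the simplified loss is very large near 0 and falls to 0 at probability 1.

The larger the predicted probability of the correct center word, the smaller the loss, and vice versa.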
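The two cases can be compared numerically. The probability vectors below are made-up values, with the vocabulary order am, because, happy, I, learning assumed; the function name `cross_entropy_loss` is also an assumption.

```python
import numpy as np

def cross_entropy_loss(y, y_hat):
    """J = -sum_k y_k * log(y_hat_k), with log the natural logarithm."""
    return float(-np.sum(y * np.log(y_hat)))

y = np.array([0, 0, 1, 0, 0])  # actual center word: happy

good = np.array([0.083, 0.03, 0.611, 0.225, 0.051])  # most mass on happy
bad = np.array([0.96, 0.01, 0.01, 0.01, 0.01])       # most mass on am

print(round(cross_entropy_loss(y, good), 2))  # 0.49
print(round(cross_entropy_loss(y, bad), 2))   # 4.61
```

Since only the entry of the actual word survives the multiplication by `y`, each result equals `-log` of the probability given to happy.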

Training a CBOW Model: Forward Propagation

The training process consists of:
●Forward propagation
●Cost
●Backpropagation and gradient descent

Forward propagation

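The CBOW architecture section already covers forward propagation. In batch mode, the input matrix X passes through the hidden layer and then the output layer. Try describing each step in your own words:
$Z_1 = W_1 X + B_1\\ H = ReLU(Z_1)\\ Z = W_2 H + B_2\\ \hat Y = softmax(Z)$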
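A batch forward pass can be sketched as follows, assuming the column-vector convention used in this article (W1 is N×V, W2 is V×N, examples are columns of X). The function names and the random toy parameters are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax_columns(Z):
    """Softmax applied independently to every column of Z."""
    E = np.exp(Z - Z.max(axis=0, keepdims=True))  # shift for numerical stability
    return E / E.sum(axis=0, keepdims=True)

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1  # hidden pre-activation, shape (N, m)
    H = relu(Z1)      # hidden activation, shape (N, m)
    Z = W2 @ H + b2   # output scores, shape (V, m)
    return Z1, H, softmax_columns(Z)

V, N, m = 5, 3, 4  # toy sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(N, V)), rng.normal(size=(N, 1))
W2, b2 = rng.normal(size=(V, N)), rng.normal(size=(V, 1))
X = rng.random(size=(V, m))

Z1, H, Y_hat = forward(X, W1, b1, W2, b2)
print(Y_hat.shape)        # (5, 4)
print(Y_hat.sum(axis=0))  # every column sums to 1
```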

Cost

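The terms "cost" and "loss" are both used for functions that measure the gap between a model's predictions and the actual values. They are often used interchangeably, but strictly speaking the loss is computed for a single example, while the cost is computed over a whole set of examples. In practice, "minimizing the loss" usually means minimizing the cost function, because that is the overall objective optimized during training.
In this section, the cost is the average loss over a batch. For a batch of m examples:
$J_{batch}=-\cfrac{1}{m}\sum_{i=1}^m\sum_{j=1}^Vy_j^{(i)}\log \hat y_j^{(i)}$
Writing $J^{(i)}$ for the loss of the i-th example, which already includes the minus sign, this simplifies to:
$J_{batch}=\cfrac{1}{m}\sum_{i=1}^mJ^{(i)}$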
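The two forms of the batch cost can be checked against each other with a small sketch. The matrices below are made-up values for V = 3 and m = 2, with one example per column.

```python
import numpy as np

def batch_cost(Y, Y_hat):
    """Average cross-entropy loss over the m columns (examples)."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(Y_hat)) / m)

Y = np.array([[0, 1],
              [1, 0],
              [0, 0]])             # one-hot targets
Y_hat = np.array([[0.2, 0.5],
                  [0.7, 0.25],
                  [0.1, 0.25]])    # predicted probabilities

per_example = -np.sum(Y * np.log(Y_hat), axis=0)  # J^(i) for each column
print(round(batch_cost(Y, Y_hat), 4))             # 0.5249
print(np.isclose(batch_cost(Y, Y_hat), per_example.mean()))  # True
```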

Training a CBOW Model: Backpropagation and Gradient Descent

The goal of training is to minimize the cost. The batch cost is a function of four sets of parameters:
$J_{batch}=f(W_1,W_2,b_1,b_2)$
Backpropagation: calculate the partial derivatives of the cost with respect to the weights and biases.
Gradient descent: use those derivatives to update the weights and biases.

Backpropagation

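$\cfrac{\partial J_{batch}}{\partial W_1}=\cfrac{1}{m}\left(W_2^\intercal (\hat Y-Y)\odot \mathbb{1}[Z_1>0]\right)X^\intercal$
$\cfrac{\partial J_{batch}}{\partial W_2}=\cfrac{1}{m}(\hat Y-Y)H^\intercal$
$\cfrac{\partial J_{batch}}{\partial b_1}=\cfrac{1}{m}\left(W_2^\intercal (\hat Y-Y)\odot \mathbb{1}[Z_1>0]\right)1_m^\intercal$
$\cfrac{\partial J_{batch}}{\partial b_2}=\cfrac{1}{m}(\hat Y-Y)1_m^\intercal$
Here $Z_1 = W_1 X + B_1$ is the hidden-layer pre-activation, $\odot$ is element-wise multiplication, and $\mathbb{1}[Z_1>0]$ is the derivative of ReLU: 1 where an entry of $Z_1$ is positive and 0 elsewhere. $1_m$ is a row vector of m ones, so $1_m^\intercal$ is a column vector, and multiplying a matrix by it sums each row of the matrix.
In practice this row sum is computed with NumPy's sum function: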

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])  # example matrix
np.sum(a, axis=1, keepdims=True)  # sums each row and keeps a column shape: [[6], [15]]

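Backpropagation applies the chain rule to compute these partial derivatives. The full derivation is not covered here; existing functions can be used to carry out the computation.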
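The gradients can be sketched in NumPy and verified with a finite-difference check. This sketch uses the chain-rule form in which the hidden-layer gradient is masked by `Z1 > 0` (the derivative of ReLU); the function names, toy sizes and random data are assumptions for illustration.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1
    H = np.maximum(0, Z1)                         # ReLU
    Z = W2 @ H + b2
    E = np.exp(Z - Z.max(axis=0, keepdims=True))  # column-wise softmax
    return Z1, H, E / E.sum(axis=0, keepdims=True)

def cost(X, Y, W1, b1, W2, b2):
    Y_hat = forward(X, W1, b1, W2, b2)[2]
    return -np.sum(Y * np.log(Y_hat)) / X.shape[1]

def backward(X, Y, W1, b1, W2, b2):
    m = X.shape[1]
    Z1, H, Y_hat = forward(X, W1, b1, W2, b2)
    dZ = Y_hat - Y                # gradient at the output scores, (V, m)
    dZ1 = (W2.T @ dZ) * (Z1 > 0)  # back through ReLU, (N, m)
    grad_W1 = dZ1 @ X.T / m
    grad_W2 = dZ @ H.T / m
    grad_b1 = np.sum(dZ1, axis=1, keepdims=True) / m  # row sums
    grad_b2 = np.sum(dZ, axis=1, keepdims=True) / m
    return grad_W1, grad_W2, grad_b1, grad_b2

V, N, m = 5, 3, 4
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(N, V)), rng.normal(size=(N, 1))
W2, b2 = rng.normal(size=(V, N)), rng.normal(size=(V, 1))
X = rng.random(size=(V, m))
Y = np.eye(V)[:, rng.integers(0, V, size=m)]  # random one-hot targets

grad_W1, grad_W2, grad_b1, grad_b2 = backward(X, Y, W1, b1, W2, b2)

# Finite-difference check on a single entry of W1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (cost(X, Y, W1_plus, b1, W2, b2)
           - cost(X, Y, W1_minus, b1, W2, b2)) / (2 * eps)
print(abs(numeric - grad_W1[0, 0]) < 1e-6)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check when implementing these formulas.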

Gradient descent

Hyperparameter: learning rate $\alpha$
$W_1:= W_1-\alpha\cfrac{\partial J_{batch}}{\partial W_1}$
$W_2:= W_2-\alpha\cfrac{\partial J_{batch}}{\partial W_2}$
$b_1:= b_1-\alpha\cfrac{\partial J_{batch}}{\partial b_1}$
$b_2:= b_2-\alpha\cfrac{\partial J_{batch}}{\partial b_2}$
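The four update rules share one pattern, which can be sketched with a dictionary of parameters. The helper name `gradient_descent_step` and the numeric values are assumptions for illustration; only two small parameters are shown to keep the arithmetic visible.

```python
import numpy as np

def gradient_descent_step(params, grads, alpha):
    """Move every parameter against its gradient, scaled by the learning rate."""
    return {name: params[name] - alpha * grads[name] for name in params}

params = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"W1": np.array([[0.5, -1.0]]), "b1": np.array([[1.0]])}

updated = gradient_descent_step(params, grads, alpha=0.1)
print(updated["W1"])  # W1 - 0.1 * grad_W1
print(updated["b1"])  # b1 - 0.1 * grad_b1
```

A smaller learning rate gives smaller steps, and a positive gradient entry always lowers the corresponding parameter.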

Extracting Word Embedding Vectors

There are three options.

option 1

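Use each column of $W_1$ as the embedding column vector of one vocabulary word. $W_1$ has V columns, one per word in the vocabulary, in the same order as the words in the input X (highlighted in blue in the original figure).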

option 2

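Use each row of $W_2$ as the embedding row vector of one vocabulary word. $W_2$ has V rows, one per word in the vocabulary, in the same order as the words in the input X (highlighted in blue in the original figure).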

option 3

Average the two to obtain the N×V matrix $W_3$, and use each of its columns as the embedding column vector of one vocabulary word:
$W_3=0.5(W_1+W_2^\intercal)$
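The three options can be sketched as follows. The weights here are random stand-ins for trained parameters, and each result is stored with one row per word (the transpose of the column layout used above) so that a word's vector can be looked up by index; both choices are assumptions for illustration.

```python
import numpy as np

vocab = ["am", "because", "happy", "i", "learning"]
word_to_index = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3

rng = np.random.default_rng(0)
W1 = rng.normal(size=(N, V))  # stand-in for the trained input-to-hidden weights
W2 = rng.normal(size=(V, N))  # stand-in for the trained hidden-to-output weights

emb_option1 = W1.T               # columns of W1, stored as rows
emb_option2 = W2                 # rows of W2
emb_option3 = 0.5 * (W1.T + W2)  # transpose of W3 = 0.5 * (W1 + W2^T)

happy_vec = emb_option3[word_to_index["happy"]]
print(happy_vec.shape)  # (3,)
```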

Evaluating Word Embeddings

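There are two main approaches: intrinsic evaluation and extrinsic evaluation. Intrinsic evaluation tests how well the embeddings capture relationships between words, while extrinsic evaluation tests how useful they are in an actual application. Both matter, because embeddings can do well on intrinsic tests and still fail to support the end application, in which case they are not a success. In practice, the two are usually combined to get a full picture of performance.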

Intrinsic evaluation

Analogies
Clustering
Visualization
Analogies test relationships between words. The table below shows two common kinds, plus a case where the answer is ambiguous:

Analogies	example
Semantic analogies	“France” is to “Paris” as “Italy” is to <?>
Syntactic analogies	“seen” is to “saw” as “been” is to <?>
Ambiguity	“wolf” is to “pack” as “bee” is to <?> → swarm? colony?
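One common way to answer an analogy with embeddings, which is an assumption here rather than something this article specifies, is vector arithmetic followed by a nearest-neighbor search using cosine similarity. The 2-D embeddings below are hand-picked hypothetical values, not trained vectors.

```python
import numpy as np

# Hypothetical 2-D embeddings chosen by hand for illustration.
embeddings = {
    "France": np.array([1.0, 1.0]),
    "Paris": np.array([1.0, 3.0]),
    "Italy": np.array([3.0, 1.0]),
    "Rome": np.array([3.0, 3.0]),
    "bee": np.array([-2.0, 0.5]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

print(solve_analogy("France", "Paris", "Italy"))  # Rome
```

With real embeddings several candidates can score almost equally, which is the ambiguity shown in the last row of the table.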
Clustering