Theory lesson: C2W2. Part-of-Speech (POS) Tagging and Hidden Markov Models
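In this assignment you will develop part-of-speech (POS) tagging skills, that is, the process of assigning a part-of-speech tag (noun, verb, adjective, ...) to every word in an input text. Tagging is hard because some words take different tags in different contexts, for example: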
- The whole team played well. [adverb]
- You are doing well for yourself. [adjective]
- Well, this assignment took me forever to complete. [interjection]
- The well is dry. [noun]
- Tears were beginning to well in her eyes. [verb]
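POS tagging helps in understanding the meaning of a sentence. It matters a great deal for search queries: identifying proper nouns, organizations, stock symbols, and similar items improves everything from speech recognition to search. This assignment covers:
- Learning how part-of-speech tagging works
- Computing the transition matrix A of a hidden Markov model
- Computing the emission matrix B of a hidden Markov model
- Implementing the Viterbi algorithm
- Computing the accuracy of the model
First, import the packages: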
# Importing packages and loading in the data set
from utils_pos import get_word_tag, preprocess
import pandas as pd
from collections import defaultdict
import math
import numpy as np
import w2_unittest
0 Data Sources
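Two tagged datasets from the Wall Street Journal (WSJ) are used; the meaning of the POS tags is given in the linked tag reference.
- One dataset (WSJ_02-21.pos) is used for training.
- The other dataset (WSJ_24.pos) is used for testing.
- The tagged training data has been preprocessed to form a vocabulary (hmm_vocab.txt; see the bundled download).
- The vocabulary contains the words that occur two or more times in the training set.
- The vocabulary is also augmented with a set of "unknown word tokens", described below.
The training set is used to build the emission, transition, and tag counts.
The test set (WSJ_24.pos) is read in to create y.
- It contains both the test text and the true tags.
- The test set has also been preprocessed to strip the tags, forming test_words.txt (downloadable; the code below loads it as ./data/test.words).
- After this text is read in, it is processed further with the functions provided in utils_pos.py to identify the ends of sentences and to handle words that are not in the vocabulary.
- The result is the list prep, the preprocessed text used to test the POS tagger.
A POS tagger will encounter words that are not in its dataset.
- To improve accuracy, these words are analyzed further during preprocessing to extract any available hints about their appropriate tag.
- For example, the suffix "ize" hints that a word is a verb, as in "final-ize" or "character-ize".
- Custom unknown-word tokens such as '--unk_verb--' or '--unk_noun--' replace the unknown words in both the training corpus and the test corpus.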
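The suffix-based hinting described above can be sketched as follows. This is only an illustration: the function name `assign_unk_sketch`, the suffix lists, and the order of the checks are assumptions, and the helper that actually ships in utils_pos.py may differ. Only the token names are taken from the vocabulary file.

```python
import string

# Hypothetical suffix lists chosen for illustration only;
# the helper in utils_pos.py may use different ones.
NOUN_SUFFIXES = ("ness", "ment", "tion", "ship")
VERB_SUFFIXES = ("ize", "ify", "ate")
ADJ_SUFFIXES = ("able", "ous", "ful", "ish")
ADV_SUFFIXES = ("ward", "wards", "wise")


def assign_unk_sketch(word):
    """Map an out-of-vocabulary word to one of the unknown-word tokens."""
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(ch in string.punctuation for ch in word):
        return "--unk_punct--"
    if any(ch.isupper() for ch in word):
        return "--unk_upper--"
    # str.endswith accepts a tuple and checks every suffix in it
    if word.endswith(NOUN_SUFFIXES):
        return "--unk_noun--"
    if word.endswith(VERB_SUFFIXES):
        return "--unk_verb--"
    if word.endswith(ADJ_SUFFIXES):
        return "--unk_adj--"
    if word.endswith(ADV_SUFFIXES):
        return "--unk_adv--"
    # No hint found: fall back to the generic unknown token
    return "--unk--"


for w in ["finalize", "characterize", "happiness", "Zyx", "qwrt"]:
    print(w, "->", assign_unk_sketch(w))
```

Checking digits, punctuation, and capitals before suffixes is a design choice of this sketch; a different order would change which token a word such as "B2B-ize" receives.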
Load the training corpus:
# load in the training corpus
with open("./data/WSJ_02-21.pos", 'r') as f:
    training_corpus = f.readlines()
print("A few items of the training corpus list")
print(training_corpus[0:5])
Output:
A few items of the training corpus list
['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']
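Each corpus item is a single string of the form word, tab, tag, newline. A minimal sketch of turning such a line into a (word, tag) pair is shown here. The name `parse_line_sketch` is hypothetical, and two behaviours are assumptions rather than something this post confirms: that a blank line marks a sentence boundary mapped to ('--n--', '--s--'), and that every out-of-vocabulary word becomes the generic '--unk--' token.

```python
def parse_line_sketch(line, vocab):
    """Split one corpus line into (word, tag).

    Assumptions: a blank line is a sentence boundary, and any word
    missing from vocab is replaced by the generic '--unk--' token.
    """
    parts = line.split()
    if not parts:
        # Blank line: treat it as the end of a sentence
        return "--n--", "--s--"
    word, tag = parts
    if word not in vocab:
        word = "--unk--"
    return word, tag


# Toy vocabulary, invented for this example
toy_vocab = {"In": 0, "an": 1, "review": 2}
for line in ["In\tIN\n", "an\tDT\n", "Oct.\tNNP\n", "\n"]:
    print(parse_line_sketch(line, toy_vocab))
```

With this toy vocabulary, "Oct." is not found and is therefore emitted as '--unk--' while keeping its NNP tag.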
# read the vocabulary data, split by each line of text, and save the list
with open("./data/hmm_vocab.txt", 'r') as f:
    voc_l = f.read().split('\n')
print("A few items of the vocabulary list")
print(voc_l[0:50])
print()
print("A few items at the end of the vocabulary list")
print(voc_l[-50:])
Output:
A few items of the vocabulary list
['!', '#', '$', '%', '&', "'", "''", "'40s", "'60s", "'70s", "'80s", "'86", "'90s", "'N", "'S", "'d", "'em", "'ll", "'m", "'n'", "'re", "'s", "'til", "'ve", '(', ')', ',', '-', '--', '--n--', '--unk--', '--unk_adj--', '--unk_adv--', '--unk_digit--', '--unk_noun--', '--unk_punct--', '--unk_upper--', '--unk_verb--', '.', '...', '0.01', '0.0108', '0.02', '0.03', '0.05', '0.1', '0.10', '0.12', '0.13', '0.15']
A few items at the end of the vocabulary list
['yards', 'yardstick', 'year', 'year-ago', 'year-before', 'year-earlier', 'year-end', 'year-on-year', 'year-round', 'year-to-date', 'year-to-year', 'yearlong', 'yearly', 'years', 'yeast', 'yelled', 'yelling', 'yellow', 'yen', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'yields', 'you', 'young', 'younger', 'youngest', 'youngsters', 'your', 'yourself', 'youth', 'youthful', 'yuppie', 'yuppies', 'zero', 'zero-coupon', 'zeroing', 'zeros', 'zinc', 'zip', 'zombie', 'zone', 'zones', 'zoning', '{', '}', '']
Create a dictionary whose keys are the words and whose values are unique integers:
# vocab: dictionary that has the index of the corresponding words
vocab = {}
# Get the index of the corresponding words.
for i, word in enumerate(sorted(voc_l)):
    vocab[word] = i
print("Vocabulary dictionary, key is the word, value is a unique integer")
cnt = 0
for k, v in vocab.items():
    print(f"{k}:{v}")
    cnt += 1
    if cnt > 20:
        break
Output:
Vocabulary dictionary, key is the word, value is a unique integer
:0
!:1
#:2
$:3
%:4
&:5
':6
'':7
'40s:8
'60s:9
'70s:10
'80s:11
'86:12
'90s:13
'N:14
'S:15
'd:16
'em:17
'll:18
'm:19
'n':20
Load the test corpus:
# load in the test corpus
with open("./data/WSJ_24.pos", 'r') as f:
    y = f.readlines()
print("A sample of the test corpus")
print(y[0:10])
Output:
A sample of the test corpus
['The\tDT\n', 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n', 'will\tMD\n', 'be\tVB\n', 'taken\tVBN\n', 'from\tIN\n', 'several\tJJ\n', 'vantage\tNN\n']
The test corpus carries POS tags. For prediction we need a version with the tags removed:
#corpus without tags, preprocessed
_, prep = preprocess(vocab, "./data/test.words")
print('The length of the preprocessed test corpus: ', len(prep))
print('This is a sample of the test_corpus: ')
print(prep[0:10])
Output:
The length of the preprocessed test corpus: 34199
This is a sample of the test_corpus:
['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--']
1 POS Tagging
1.1 Training
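Start with the simplest case: words that do not have more than one POS tag.
- For example, "is" is a verb and carries no other POS tag.
- In the WSJ corpus, 86% of the tokens are unambiguous (they have only one tag).
- About 14% are ambiguous (they have more than one tag).
Before predicting POS tags, build three dictionaries.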
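Transition counts
transition_counts records how many times each POS tag occurs immediately after another POS tag. These counts are used to estimate
$$P(t_i \mid t_{i-1}) \tag{1}$$
the probability of the tag at position $i$ given the tag at position $i-1$.
To compute Equation (1), create a transition_counts dictionary where
- the keys are (prev_tag, tag)
- the values are the number of times the two tags appear in that order.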
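Emission counts
emission_counts is used to compute the probability of a word given its POS tag:
$$P(w_i \mid t_i) \tag{2}$$
In this dictionary
- the keys are (tag, word)
- the values are the number of times that pair appears in the training set.
Tag counts
The last dictionary is tag_counts:
- the keys are the tags
- the values are the number of times each tag appears.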
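To make the link between the three dictionaries and Equations (1) and (2) concrete, here is a small sketch that turns counts into unsmoothed relative-frequency estimates. The counts are toy values invented for the example, and the helper names `transition_prob` and `emission_prob` are hypothetical. This section only builds the counts, so treat the division by the tag count as an assumption about how they are used later.

```python
# Toy counts, invented for illustration (not taken from the WSJ corpus)
transition_counts = {("DT", "NN"): 6, ("DT", "JJ"): 2, ("NN", "VBZ"): 3}
emission_counts = {("NN", "well"): 1, ("NN", "team"): 4, ("RB", "well"): 5}
tag_counts = {"DT": 8, "NN": 5, "RB": 5}


def transition_prob(prev_tag, tag):
    """Unsmoothed estimate of P(tag | prev_tag) from the counts."""
    return transition_counts.get((prev_tag, tag), 0) / tag_counts[prev_tag]


def emission_prob(tag, word):
    """Unsmoothed estimate of P(word | tag) from the counts."""
    return emission_counts.get((tag, word), 0) / tag_counts[tag]


print(transition_prob("DT", "NN"))  # 6 / 8 = 0.75
print(emission_prob("NN", "team"))  # 4 / 5 = 0.8
print(emission_prob("NN", "zinc"))  # unseen pair, so 0.0
```

An unseen pair gets probability zero here, which is the usual motivation for adding smoothing when the counts are turned into matrices.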
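Exercise 01
Write the function create_dictionaries, which takes training_corpus and returns the three dictionaries described above, in the order emission_counts, transition_counts, tag_counts.
The function uses defaultdict, a subclass of dict.
- A standard Python dictionary raises a KeyError when you try to access a key that is not currently in it.
- In contrast, a defaultdict creates an entry of the type given as its argument; in this function that is an integer with a default value of 0.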
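The difference between the two dictionary types can be seen in a short, self-contained example (the tag pairs used as keys are arbitrary):

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is incremented
plain = {}
try:
    plain[("DT", "NN")] += 1
except KeyError as err:
    print("plain dict raised KeyError:", err)

# defaultdict(int) creates the missing entry with value 0 first
counts = defaultdict(int)
counts[("DT", "NN")] += 1
counts[("DT", "NN")] += 1
print(counts[("DT", "NN")])  # 2

# Merely reading a missing key also inserts it with the default value
print(counts[("NN", "VB")])  # 0
print(len(counts))           # 2
```

The last two lines show a side effect worth remembering: looking up an unseen key in a defaultdict adds it, so membership tests should use `in` rather than indexing.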
# UNQ_C1 GRADED FUNCTION: create_dictionaries
def create_dictionaries(training_corpus, vocab, verbose=True):
    """
    Input:
        training_corpus: a corpus where each line has a word followed by its tag.
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output:
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    # initialize the dictionaries using defaultdict
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    # Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
    prev_tag = '--s--'
    # use 'i' to track the line number in the corpus
    i = 0
    # Each item in the training corpus contains a word and its POS tag
    # Go through each word and its tag in the training corpus
    for word_tag in training_corpus:
        # Increment the word_tag count
        i += 1
        # Every 50,000 words, print the word count
        if i % 50000 == 0 and verbose:
            print(f"word count = {i}")
        ### START CODE HERE ###
        # get the word and tag using the get_word_tag helper function (imported from utils_pos.py)
        # the function is defined as: get_word_tag(line, vocab)
        word, tag = get_word_tag(word_tag, vocab)
        # Increment the transition count for the previous tag and the current tag
        transition_counts[(prev_tag, tag)] += 1
        # Increment the emission count for the tag and word
        emission_counts[(tag, word)] += 1
        # Increment the tag count
        tag_counts[tag] += 1
        # Set the previous tag to this tag (for the next iteration of the loop)
        prev_tag = tag
        ### END CODE HERE ###
    return emission_counts, transition_counts, tag_counts
Run:
emission_counts, transition_counts, tag_counts = create_dictionaries(training_corpus, vocab)
Output:
word count = 50000
word count = 100000
word count = 150000
word count = 200000
word count = 250000
word count = 300000
word count = 350000
word count = 400000
word count = 450000
word count = 500000
word count = 550000
word count = 600000
word count = 650000
word count = 700000
word count = 750000
word count = 800000
word count = 850000
word count = 900000
word count = 950000
Print all the POS tags:
# get all the POS states
states = sorted(tag_counts.keys())
print(f"Number of POS tags (number of 'states'): {len(states)}")
print("View these POS tags (states)")
print(states)
Output:
Number of POS tags (number of 'states'): 46
View these POS tags (states)
['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']
The tags above are extracted from the training set and include an auxiliary tag: '--s--' marks the start of a sentence.
Print a few transition examples, emission examples, and an example of a word with several POS tags:
print("transition examples: ")
for ex in list(transition_counts.items())[:3]:
    print(ex)
print()
print("emission examples: ")
for ex in list(emission_counts.items())[200:203]:
    print(ex)
print()
print("ambiguous word example: ")
for tup, cnt in emission_counts.items():
    if tup[1] == 'back':
        print(tup, cnt)
Output:
transition examples:
(('--s--', 'IN'), 5050)
(('IN', 'DT'), 32364)
(('DT', 'NNP'), 9044)
emission examples:
(('DT', 'any'), 721)
(('NN', 'decrease'), 7)
(('NN', 'insider-trading'), 5)
ambiguous word example:
('RB', 'back') 304
('VB', 'back') 20
('RP', 'back') 84
('JJ', 'back') 25
('NN', 'back') 29
('VBP', 'back') 4
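The same idea generalizes from one word to the whole dictionary. The sketch below groups emission counts by word and keeps the words that have more than one tag. The counts for 'back' are copied from the output above; the entry for 'is' is a made-up count added only so that the example contains an unambiguous word.

```python
from collections import defaultdict

emission_counts = {
    ("RB", "back"): 304, ("VB", "back"): 20, ("RP", "back"): 84,
    ("JJ", "back"): 25, ("NN", "back"): 29, ("VBP", "back"): 4,
    ("VBZ", "is"): 100,  # hypothetical count, not from the corpus
}

# Regroup the (tag, word) counts as word -> {tag: count}
tags_per_word = defaultdict(dict)
for (tag, word), count in emission_counts.items():
    tags_per_word[word][tag] = count

# A word is ambiguous when it was seen with more than one tag
ambiguous = {w: t for w, t in tags_per_word.items() if len(t) > 1}
print(sorted(ambiguous))  # ['back']

# Share of 'back' occurrences that carry its most frequent tag, RB
share_rb = ambiguous["back"]["RB"] / sum(ambiguous["back"].values())
print(f"{share_rb:.3f}")  # 304 / 466, about 0.652
```

For 'back', always guessing RB would be right for roughly 65% of its occurrences in the training data, which is the intuition behind the baseline in the next section.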
1.2 Testing
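Use the emission_counts dictionary to test the accuracy of POS tagging.
- For each word in the preprocessed test corpus prep, assign a POS tag.
- Then use the original tagged test corpus y to compute the percentage of words you tagged correctly.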
Exercise 02
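Implement the function predict_pos, which computes the accuracy of the model.
- To assign a POS tag to a word, choose the most frequent POS for that word in the training set.
- Then evaluate how well this approach works. Each time you predict from the most frequent POS of a given word, check whether the word's actual POS is the same. If it is, the prediction was correct.
- Compute the accuracy as the number of correct predictions divided by the total number of words for which you predicted a POS tag.
The function predict_pos iterates over a preprocessed sequence of words, predicts the POS of each word from the emission counts, compares the prediction with the true tag, and computes the accuracy. A step-by-step walkthrough follows.
Inputs:
- prep: the preprocessed list of words, one word per element.
- y: the original tagged corpus, a list of entries that each hold a word and its POS tag.
- emission_counts: a dictionary whose keys are (tag, word) tuples and whose values are the number of times the word occurs with that tag.
- vocab: the vocabulary, a dictionary whose keys are words and whose values are indices.
- states: the sorted list of all possible POS tags.
Output:
- accuracy: the fraction of words whose predicted POS matches the true POS.
Function logic:
- Initialize the number of correct predictions, num_correct, to 0.
- Collect all (tag, word) keys from emission_counts.keys() into the set all_words; the code never actually uses this set.
- Compute total, the number of entries in the corpus y.
- Iterate over the words in prep together with the corresponding entries of y.
- Check that each entry of y contains both a word and a POS tag; if not, skip it.
- For each word in prep:
  - Check whether the word is in the vocabulary vocab.
  - If it is, iterate over all possible tags in states.
  - For each tag, build the key (tag, word) and check whether it exists in emission_counts.
  - If it does, read the count stored under that key.
  - If the count is larger than the current maximum count_final, update count_final and the predicted tag pos_final.
  - If the predicted tag pos_final matches the true tag true_label, increment num_correct.
- Compute accuracy as num_correct divided by total.
- Return the accuracy.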
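The core of the steps above, picking the tag with the largest emission count and scoring it against the true tags, can be sketched on toy data. The helper name `most_frequent_tag`, the two-word test sequence, and its gold tags are assumptions made for this example; the counts for 'back' come from the earlier output and the count for 'the' is invented.

```python
def most_frequent_tag(word, emission_counts, states):
    """Return the tag with the largest emission count for this word."""
    best_tag, best_count = "", 0
    for tag in states:
        count = emission_counts.get((tag, word), 0)
        # Keep the running maximum, as count_final does in predict_pos
        if count > best_count:
            best_tag, best_count = tag, count
    return best_tag


emission_counts = {
    ("RB", "back"): 304, ("VB", "back"): 20, ("RP", "back"): 84,
    ("JJ", "back"): 25, ("NN", "back"): 29, ("VBP", "back"): 4,
    ("DT", "the"): 10,  # hypothetical count, not from the corpus
}
states = sorted({tag for tag, _ in emission_counts})

# Hypothetical test words and gold tags
words = ["the", "back"]
gold = ["DT", "RP"]

predictions = [most_frequent_tag(w, emission_counts, states) for w in words]
accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(predictions)  # ['DT', 'RB']
print(accuracy)     # 0.5
```

The second word is tagged RB because that is its most frequent tag, even though the hypothetical gold tag is RP. This is exactly the kind of error a context-free baseline cannot avoid.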
Question: why is count_final needed? (See footnote 1.)
# UNQ_C2 GRADED FUNCTION: predict_pos
def predict_pos(prep, y, emission_counts, vocab, states):
    '''
    Input:
        prep: a preprocessed version of 'y'. A list with the 'word' component of the tuples (untagged, preprocessed words).
        y: a corpus composed of a list of tuples where each tuple consists of (word, POS)
        emission_counts: a dictionary where the keys are (tag,word) tuples and the value is the count
        vocab: a dictionary where keys are words in vocabulary and value is an index
        states: a sorted list of all possible tags for this assignment
    Output:
        accuracy: fraction of words that were classified correctly
    '''
    # Initialize the number of correct predictions to zero
    num_correct = 0
    # Get the (tag, word) tuples, stored as a set
    all_words = set(emission_counts.keys())
    # Get the number of (word, POS) tuples in the corpus 'y'
    total = len(y)
    for word, y_tup in zip(prep, y):
        # Split the (word, POS) string into a list of two items
        y_tup_l = y_tup.split()
        # Verify that y_tup contain both word and POS
        if len(y_tup_l) == 2:
            # Set the true POS label for this word
            true_label = y_tup_l[1]
        else:
            # If the y_tup didn't contain word and POS, go to next word
            continue
        count_final = 0
        pos_final = ''
        # If the word is in the vocabulary...
        if word in vocab:
            for pos in states:
                ### START CODE HERE ###
                # define the key as the tuple containing the POS and word
                key = (pos, word)
                # check if the (pos, word) key exists in the emission_counts dictionary
                if key in emission_counts.keys():
                    # get the emission count of the (pos,word) tuple
                    count = emission_counts[key]
                    # keep track of the POS with the largest count
                    if count_final < count:
                        # update the final count (largest count)
                        count_final = count
                        # update the final POS
                        pos_final = pos
            # If the final POS (with the largest count) matches the true POS:
            if pos_final == true_label:
                # Update the number of correct predictions
                num_correct += 1
    ### END CODE HERE ###
    accuracy = num_correct / total
    return accuracy
Test:
accuracy_predict_pos = predict_pos(prep, y, emission_counts, vocab, states)
print(f"Accuracy of prediction using predict_pos is {accuracy_predict_pos:.4f}")
Output:
Accuracy of prediction using predict_pos is 0.8889
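Not bad. Next, a hidden Markov model will be used to raise the accuracy above 95%.
Footnote 1: A word can have several POS tags, so the prediction uses the tag with the highest observed count.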