C2W1.Assignment.Parts-of-Speech Tagging (POS).Part1

Theory lesson: C2W2.Part-of-Speech (POS) Tagging and Hidden Markov Models



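In this assignment you will develop part-of-speech (POS) tagging skills, that is, the process of assigning a part-of-speech tag (noun, verb, adjective, ...) to every word in an input text. Tagging is difficult because some words can take different tags depending on the context, for example: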

  • The whole team played well. [adverb]
  • You are doing well for yourself. [adjective]
  • Well, this assignment took me forever to complete. [interjection]
  • The well is dry. [noun]
  • Tears were beginning to well in her eyes. [verb]

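POS tagging helps to better understand the meaning of a sentence. It is essential for search queries: identifying proper nouns, organizations, stock symbols, or anything similar greatly improves everything from speech recognition to search. This assignment covers:

  • Learning how part-of-speech tagging works
  • Computing the transition matrix A of a hidden Markov model
  • Computing the emission matrix B of a hidden Markov model
  • Implementing the Viterbi algorithm
  • Computing the accuracy of the model

First, import the packages: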

# Importing packages and loading in the data set 
from utils_pos import get_word_tag, preprocess  
import pandas as pd
from collections import defaultdict
import math
import numpy as np
import w2_unittest

0 Data Sources

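Two tagged datasets from the Wall Street Journal (WSJ) are used; the meaning of the POS tags is explained in the tag reference linked from the original post.

  • One dataset (WSJ_02-21.pos) is used for training.
  • The other dataset (WSJ_24.pos) is used for testing.
  • The tagged training data has been preprocessed into a vocabulary (hmm_vocab.txt; see the resources bundled with this post).
  • The words in the vocabulary are those that occur two or more times in the training set.
  • The vocabulary is also augmented with a set of "unknown word tokens", described below.

The training set is used to build the emission, transition, and tag counts.
The test set (WSJ_24.pos) is read in to create y:

  • It contains both the test text and the true tags.
  • The test set has also been preprocessed to strip the tags, producing test.words (available for download).
  • After this text is read in, it is processed further with the functions provided in utils_pos.py to identify sentence ends and to handle words that are not in the vocabulary.
  • The result is the list prep, the preprocessed text used to test the POS tagger.

A POS tagger will encounter words that are not in its dataset.

  • To improve accuracy, these words are analyzed further during preprocessing to extract any available hints about their appropriate tag.
  • For example, the suffix "ize" hints that a word is a verb, as in "final-ize" or "character-ize".
  • Custom unknown-word tokens such as "--unk_verb--" or "--unk_noun--" replace the unknown words in both the training corpus and the test corpus, so they appear in both.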
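
The suffix idea can be sketched as follows. This is only an illustration and not the helper shipped in utils_pos.py: the function name assign_unk_sketch, the short suffix lists, and the order of the checks are all assumptions made for this example. Only the token names are taken from the vocabulary file.

```python
import string

# Hypothetical suffix lists, chosen for illustration only.
NOUN_SUFFIXES = ("ness", "ment", "ship", "tion", "ism")
VERB_SUFFIXES = ("ize", "ify", "ate")
ADJ_SUFFIXES = ("able", "ous", "ful", "ish")
ADV_SUFFIXES = ("ward", "wards", "wise")


def assign_unk_sketch(word):
    """Map an out-of-vocabulary word to an unknown-word token.

    The order of the checks is an assumption of this sketch.
    """
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(ch in string.punctuation for ch in word):
        return "--unk_punct--"
    if any(ch.isupper() for ch in word):
        return "--unk_upper--"
    # str.endswith accepts a tuple and matches any of its entries
    if word.endswith(NOUN_SUFFIXES):
        return "--unk_noun--"
    if word.endswith(VERB_SUFFIXES):
        return "--unk_verb--"
    if word.endswith(ADJ_SUFFIXES):
        return "--unk_adj--"
    if word.endswith(ADV_SUFFIXES):
        return "--unk_adv--"
    return "--unk--"


for w in ["characterize", "zorbness", "quickly", "3M"]:
    print(w, "->", assign_unk_sketch(w))
```

A word that matches none of the hints falls back to the generic "--unk--" token, so the tagger still has a vocabulary entry for it.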

Load the training corpus:

# load in the training corpus
with open("./data/WSJ_02-21.pos", 'r') as f:
    training_corpus = f.readlines()

print(f"A few items of the training corpus list")
print(training_corpus[0:5])

Output:
A few items of the training corpus list
['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']

# read the vocabulary data, split by each line of text, and save the list
with open("./data/hmm_vocab.txt", 'r') as f:
    voc_l = f.read().split('\n')

print("A few items of the vocabulary list")
print(voc_l[0:50])
print()
print("A few items at the end of the vocabulary list")
print(voc_l[-50:])

Output:
A few items of the vocabulary list
['!', '#', '$', '%', '&', "'", "''", "'40s", "'60s", "'70s", "'80s", "'86", "'90s", "'N", "'S", "'d", "'em", "'ll", "'m", "'n'", "'re", "'s", "'til", "'ve", '(', ')', ',', '-', '--', '--n--', '--unk--', '--unk_adj--', '--unk_adv--', '--unk_digit--', '--unk_noun--', '--unk_punct--', '--unk_upper--', '--unk_verb--', '.', '...', '0.01', '0.0108', '0.02', '0.03', '0.05', '0.1', '0.10', '0.12', '0.13', '0.15']

A few items at the end of the vocabulary list
['yards', 'yardstick', 'year', 'year-ago', 'year-before', 'year-earlier', 'year-end', 'year-on-year', 'year-round', 'year-to-date', 'year-to-year', 'yearlong', 'yearly', 'years', 'yeast', 'yelled', 'yelling', 'yellow', 'yen', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'yields', 'you', 'young', 'younger', 'youngest', 'youngsters', 'your', 'yourself', 'youth', 'youthful', 'yuppie', 'yuppies', 'zero', 'zero-coupon', 'zeroing', 'zeros', 'zinc', 'zip', 'zombie', 'zone', 'zones', 'zoning', '{', '}', '']

Create a dictionary whose keys are the words and whose values are integers:

# vocab: dictionary that has the index of the corresponding words
vocab = {}

# Get the index of the corresponding words. 
for i, word in enumerate(sorted(voc_l)): 
    vocab[word] = i       
    
print("Vocabulary dictionary, key is the word, value is a unique integer")
cnt = 0
for k,v in vocab.items():
    print(f"{k}:{v}")
    cnt += 1
    if cnt > 20:
        break

Output:
Vocabulary dictionary, key is the word, value is a unique integer
:0
!:1
#:2
$:3
%:4
&:5
':6
'':7
'40s:8
'60s:9
'70s:10
'80s:11
'86:12
'90s:13
'N:14
'S:15
'd:16
'em:17
'll:18
'm:19
'n':20

Load the test corpus:

# load in the test corpus
with open("./data/WSJ_24.pos", 'r') as f:
    y = f.readlines()
    
print("A sample of the test corpus")
print(y[0:10])

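Output:
A sample of the test corpus
['The\tDT\n', 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n', 'will\tMD\n', 'be\tVB\n', 'taken\tVBN\n', 'from\tIN\n', 'several\tJJ\n', 'vantage\tNN\n']

As the sample shows, the test corpus carries POS tags. The tags now have to be stripped so that predictions can be made on the bare words: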

#corpus without tags, preprocessed
_, prep = preprocess(vocab, "./data/test.words")     

print('The length of the preprocessed test corpus: ', len(prep))
print('This is a sample of the test_corpus: ')
print(prep[0:10])

Output:
The length of the preprocessed test corpus: 34199
This is a sample of the test_corpus:
['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--']

1 POS Tagging

1.1 Training

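Start with the simple case: words that do not have more than one POS tag.

  • For example, "is" is a verb and has no other POS tag.
  • In the WSJ corpus, 86% of the tokens are unambiguous (they have only one tag).
  • About 14% are ambiguous (they have more than one tag).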
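
One way such a share could be measured is sketched below on a toy corpus. The five (word, tag) pairs are invented for this example and are not WSJ data; a token counts as ambiguous here when its word is seen with more than one tag anywhere in the corpus.

```python
from collections import defaultdict

# Toy (word, tag) pairs, invented for this example.
toy_corpus = [("the", "DT"), ("back", "NN"), ("is", "VBZ"),
              ("back", "RB"), ("the", "DT")]

# Collect the set of tags observed for every word
tags_per_word = defaultdict(set)
for word, tag in toy_corpus:
    tags_per_word[word].add(tag)

# Count the tokens whose word was seen with more than one tag
ambiguous_tokens = sum(1 for word, _ in toy_corpus
                       if len(tags_per_word[word]) > 1)
ambiguous_share = ambiguous_tokens / len(toy_corpus)

print(f"{ambiguous_tokens} of {len(toy_corpus)} tokens are ambiguous")
```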

Before predicting POS tags, build three dictionaries.

Transition counts

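transition_counts records how many times each POS tag occurs right after another POS tag. These counts are used to compute
$$P(t_i \mid t_{i-1}) \tag{1}$$
the probability of the tag at position $i$ given the tag at position $i-1$.
To compute equation (1), create a transition_counts dictionary in which

  • the keys are (prev_tag, tag)
  • the values are the number of times those two tags appeared in that order.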
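
How such counts turn into an estimate of equation (1) can be sketched on a toy tag sequence. The sequence, the toy_ variable names, and the helper transition_prob are assumptions of this example; the estimate is the plain unsmoothed ratio of counts.

```python
from collections import defaultdict

# Toy tag sequence; '--s--' is the start state used in the assignment.
toy_tags = ["DT", "NN", "VBZ", "DT", "NN", "."]

toy_transition_counts = defaultdict(int)
toy_prev_counts = defaultdict(int)

prev_tag = "--s--"
for tag in toy_tags:
    toy_transition_counts[(prev_tag, tag)] += 1  # count the pair in order
    toy_prev_counts[prev_tag] += 1               # count how often prev_tag leads a pair
    prev_tag = tag


def transition_prob(prev_tag, tag):
    """Unsmoothed estimate of P(tag | prev_tag) from the toy counts."""
    total = toy_prev_counts.get(prev_tag, 0)
    if total == 0:
        return 0.0
    # .get avoids inserting missing keys into the defaultdict
    return toy_transition_counts.get((prev_tag, tag), 0) / total


print(transition_prob("DT", "NN"))
print(transition_prob("NN", "VBZ"))
```

In the toy sequence every DT is followed by NN, while NN is followed once by VBZ and once by the full stop, so the two estimates differ.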

Emission counts

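emission_counts is used to compute the probability of a word given its POS tag:
$$P(w_i \mid t_i) \tag{2}$$
In this dictionary

  • the keys are (tag, word)
  • the values are the number of times that pair appeared in the training set.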
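
A small sketch of equation (2) on toy data follows. The six (word, tag) pairs, the toy_ names, and the helper emission_prob are invented for illustration; the estimate is the unsmoothed ratio count(tag, word) / count(tag).

```python
from collections import defaultdict

# Toy (word, tag) pairs, invented for this example.
toy_tagged = [("the", "DT"), ("back", "NN"), ("is", "VBZ"),
              ("back", "RB"), ("the", "DT"), ("door", "NN")]

toy_emission_counts = defaultdict(int)
toy_tag_counts = defaultdict(int)
for word, tag in toy_tagged:
    toy_emission_counts[(tag, word)] += 1  # key order is (tag, word)
    toy_tag_counts[tag] += 1


def emission_prob(tag, word):
    """Unsmoothed estimate of P(word | tag) from the toy counts."""
    total = toy_tag_counts.get(tag, 0)
    if total == 0:
        return 0.0
    return toy_emission_counts.get((tag, word), 0) / total


print(emission_prob("NN", "back"))
print(emission_prob("DT", "the"))
```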

Tag counts

The last dictionary is the tag counts:

  • the keys are the tags
  • the values are the number of times each tag appeared.

Exercise 01

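Write the function create_dictionaries. It takes training_corpus (and vocab) and returns the three dictionaries described above, in the order emission_counts, transition_counts, tag_counts.
The function uses defaultdict, a subclass of dict.

  • A standard Python dictionary raises a KeyError when you access a key that is not in the dictionary.
  • A defaultdict instead creates the missing entry by calling the factory passed as its argument. In this function the factory is int, so the default value is 0.
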
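
The difference between the two dictionary types can be seen in a short sketch; the key ("DT", "NN") is just an example value.

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is incremented
plain = {}
try:
    plain[("DT", "NN")] += 1
except KeyError as err:
    missing_key = err.args[0]
    print("KeyError for", missing_key)

# defaultdict(int) creates the missing entry with value 0 first
counts = defaultdict(int)
counts[("DT", "NN")] += 1
counts[("DT", "NN")] += 1
print(counts[("DT", "NN")])

# Caveat: merely reading a missing key also inserts it
present_before = ("NN", "VB") in counts
value = counts[("NN", "VB")]
present_after = ("NN", "VB") in counts
print(present_before, value, present_after)
```

Because a plain read inserts the key, membership tests (key in d) or d.get(key, 0) are the safer choice when you only want to look a count up.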
# UNQ_C1 GRADED FUNCTION: create_dictionaries
def create_dictionaries(training_corpus, vocab, verbose=True):
    """
    Input: 
        training_corpus: a corpus where each line has a word followed by its tag.
        vocab: a dictionary where keys are words in vocabulary and value is an index
    Output: 
        emission_counts: a dictionary where the keys are (tag, word) and the values are the counts
        transition_counts: a dictionary where the keys are (prev_tag, tag) and the values are the counts
        tag_counts: a dictionary where the keys are the tags and the values are the counts
    """
    
    # initialize the dictionaries using defaultdict
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)
    
    # Initialize "prev_tag" (previous tag) with the start state, denoted by '--s--'
    prev_tag = '--s--' 
    
    # use 'i' to track the line number in the corpus
    i = 0 
    
    # Each item in the training corpus contains a word and its POS tag
    # Go through each word and its tag in the training corpus
    for word_tag in training_corpus:
        
        # Increment the word_tag count
        i += 1
        
        # Every 50,000 words, print the word count
        if i % 50000 == 0 and verbose:
            print(f"word count = {i}")
            
        ### START CODE HERE ###
        # get the word and tag using the get_word_tag helper function (imported from utils_pos.py)
        # the function is defined as: get_word_tag(line, vocab)
        word, tag = get_word_tag(word_tag, vocab)
        
        # Increment the transition count for the previous word and tag
        transition_counts[(prev_tag, tag)] += 1
        
        # Increment the emission count for the tag and word
        emission_counts[(tag, word)] += 1

        # Increment the tag count
        tag_counts[tag] += 1

        # Set the previous tag to this tag (for the next iteration of the loop)
        prev_tag = tag
        
        ### END CODE HERE ###
        
    return emission_counts, transition_counts, tag_counts

Run:

emission_counts, transition_counts, tag_counts = create_dictionaries(training_corpus, vocab)

Output:
word count = 50000
word count = 100000
word count = 150000
word count = 200000
word count = 250000
word count = 300000
word count = 350000
word count = 400000
word count = 450000
word count = 500000
word count = 550000
word count = 600000
word count = 650000
word count = 700000
word count = 750000
word count = 800000
word count = 850000
word count = 900000
word count = 950000

Print all the POS tags to take a look:

# get all the POS states
states = sorted(tag_counts.keys())
print(f"Number of POS tags (number of 'states'): {len(states)}")
print("View these POS tags (states)")
print(states)

Output:
Number of POS tags (number of 'states'): 46
View these POS tags (states)
['#', '$', "''", '(', ')', ',', '--s--', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']

The POS tags above were extracted from the training set. They include auxiliary tags such as '--s--', which marks the start of a sentence.
Print a few transition examples, emission examples, and an example of a word with several POS tags:

print("transition examples: ")
for ex in list(transition_counts.items())[:3]:
    print(ex)
print()

print("emission examples: ")
for ex in list(emission_counts.items())[200:203]:
    print (ex)
print()

print("ambiguous word example: ")
for tup,cnt in emission_counts.items():
    if tup[1] == 'back': print (tup, cnt) 

Output:

transition examples: 
(('--s--', 'IN'), 5050)
(('IN', 'DT'), 32364)
(('DT', 'NNP'), 9044)

emission examples: 
(('DT', 'any'), 721)
(('NN', 'decrease'), 7)
(('NN', 'insider-trading'), 5)

ambiguous word example: 
('RB', 'back') 304
('VB', 'back') 20
('RP', 'back') 84
('JJ', 'back') 25
('NN', 'back') 29
('VBP', 'back') 4
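
As a quick check on these numbers, the most frequent tag for "back" and its share of all occurrences can be computed directly. The counts are copied from the output above; the variable names are made up for this sketch.

```python
# Emission counts for the word 'back', copied from the output above.
back_counts = {"RB": 304, "VB": 20, "RP": 84, "JJ": 25, "NN": 29, "VBP": 4}

# The tag with the largest count is what a most-frequent-tag baseline predicts
best_tag = max(back_counts, key=back_counts.get)
total_back = sum(back_counts.values())
share = back_counts[best_tag] / total_back

print(best_tag, total_back, round(share, 3))
```

Always predicting RB for "back" would therefore be right for roughly 65% of its training occurrences and wrong for the rest, which is the kind of error that context from neighbouring tags is meant to reduce.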

1.2 Testing

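Use the emission_counts dictionary to test the accuracy of POS tagging.

  • For the preprocessed test corpus prep, assign a POS tag to every word in the corpus.
  • Then use the original tagged test corpus y to compute the percentage of words you tagged correctly.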

Exercise 02

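Implement the function predict_pos to compute the accuracy of the model.

  • To tag a word, assign it the POS that occurs most often for that word in the training set.
  • Then evaluate how well this approach works. Each time a prediction is made from the most frequent POS of a word, check whether the word's actual POS is the same. If it is, the prediction is correct.
  • Compute the accuracy as the number of correct predictions divided by the total number of words for which a POS tag was predicted.

predict_pos iterates over a preprocessed word sequence, predicts the POS of each word from the emission counts, and compares the prediction with the true tag to compute the accuracy. A step-by-step walkthrough of the code follows.
Input parameters:

  • prep: the preprocessed word list, where each element is a word.
  • y: the original tagged corpus, a list of lines that each hold a word and its POS.
  • emission_counts: a dictionary whose keys are (tag, word) tuples and whose values are the counts, i.e. how often the word occurred with that tag.
  • vocab: the vocabulary, a dictionary whose keys are words and whose values are indices.
  • states: a sorted list of all possible POS tags.

Output:

  • accuracy: the number of words whose predicted POS matches the true tag, divided by the total number of words.

Function logic:

  1. Initialize the number of correct predictions num_correct to 0.
  2. Collect all (tag, word) keys from emission_counts.keys() into the set all_words; this set is not used anywhere else in the code.
  3. Compute total, the number of entries in the corpus y.
  4. Iterate over prep and y in parallel, one word and one tagged entry at a time.
  5. Check that each entry of y contains both a word and a POS; if it does not, skip it.
  6. For each word in prep:
    • Check whether the word is in the vocabulary vocab.
    • If it is, loop over all possible POS tags in states.
    • For each POS, build the key (pos, word) and check whether it exists in the emission_counts dictionary.
    • If it exists, read the count stored under that key.
    • If this count is larger than the largest count seen so far, count_final, update count_final and the predicted POS pos_final.
  7. If the predicted POS pos_final matches the true tag true_label, increment num_correct.
  8. Compute accuracy as num_correct divided by total.
  9. Return the accuracy.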
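
The steps above can be sketched on toy data. This is a variant and not the graded function: it first builds a word-to-best-tag table in a single pass over the counts instead of looping over states for every word, and it divides by the number of well-formed lines instead of len(y). All toy_ names, the data, and the helper baseline_accuracy are invented for illustration. Ties also resolve differently: here the first key encountered wins, while the graded loop favours the tag that comes first in the sorted states list.

```python
# Toy counts and vocabulary, invented for this example.
toy_emission_counts = {("RB", "back"): 3, ("NN", "back"): 1,
                       ("DT", "the"): 5, ("NN", "door"): 2}
toy_vocab = {"back": 0, "door": 1, "the": 2}

# One pass over the counts: remember the most frequent tag of each word
best = {}
for (tag, word), count in toy_emission_counts.items():
    if word not in best or count > best[word][1]:
        best[word] = (tag, count)


def baseline_accuracy(words, gold_lines):
    """Share of well-formed gold lines whose tag equals the most frequent tag."""
    num_correct = 0
    total = 0
    for word, line in zip(words, gold_lines):
        parts = line.split()
        if len(parts) != 2:  # skip lines without both word and tag
            continue
        total += 1
        if word in toy_vocab and word in best and best[word][0] == parts[1]:
            num_correct += 1
    return num_correct / total if total else 0.0


toy_words = ["the", "back", "door", "--n--", "back"]
toy_gold = ["the\tDT\n", "back\tNN\n", "door\tNN\n", "\n", "back\tRB\n"]
toy_accuracy = baseline_accuracy(toy_words, toy_gold)
print(toy_accuracy)
```

The second token is tagged NN in the toy gold data but the baseline predicts RB, the more frequent tag for "back", so three of the four scored tokens are correct.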

Question: why is count_final needed? [1]

# UNQ_C2 GRADED FUNCTION: predict_pos
def predict_pos(prep, y, emission_counts, vocab, states):
    '''
    Input: 
        prep: a preprocessed version of 'y'. A list containing only the words, with the tags removed.
        y: the tagged corpus, a list of strings where each string holds a word and its POS separated by whitespace
        emission_counts: a dictionary where the keys are (tag,word) tuples and the value is the count
        vocab: a dictionary where keys are words in vocabulary and value is an index
        states: a sorted list of all possible tags for this assignment
    Output: 
        accuracy: fraction of words that were tagged correctly (num_correct / total)
    '''
    
    # Initialize the number of correct predictions to zero
    num_correct = 0
    
    # Get the (tag, word) tuples, stored as a set
    all_words = set(emission_counts.keys())
    
    # Get the number of (word, POS) tuples in the corpus 'y'
    total = len(y)
    for word, y_tup in zip(prep, y): 

        # Split the (word, POS) string into a list of two items
        y_tup_l = y_tup.split()
        
        # Verify that y_tup contain both word and POS
        if len(y_tup_l) == 2:
            
            # Set the true POS label for this word
            true_label = y_tup_l[1]

        else:
            # If the y_tup didn't contain word and POS, go to next word
            continue
    
        count_final = 0
        pos_final = ''
        
        # If the word is in the vocabulary...
        if word in vocab:
            for pos in states:

            ### START CODE HERE (Replace instances of 'None' with your code) ###
            
                # define the key as the tuple containing the POS and word
                key = (pos,word)

                # check if the (pos, word) key exists in the emission_counts dictionary
                if key in emission_counts.keys(): # Replace None in this line with the proper condition.

                    # get the emission count of the (pos,word) tuple
                    count = emission_counts[key]

                    # keep track of the POS with the largest count
                    if count_final<count: # Replace None in this line with the proper condition.

                        # update the final count (largest count)
                        count_final = count

                        # update the final POS
                        pos_final = pos

            # If the final POS (with the largest count) matches the true POS:
            if pos_final==true_label: # Replace None in this line with the proper condition.
                # Update the number of correct predictions
                num_correct += 1
            
    ### END CODE HERE ###
    accuracy = num_correct / total
    
    return accuracy

Test:

accuracy_predict_pos = predict_pos(prep, y, emission_counts, vocab, states)
print(f"Accuracy of prediction using predict_pos is {accuracy_predict_pos:.4f}")

Output:
Accuracy of prediction using predict_pos is 0.8889

Not bad. Next, a hidden Markov model will be used to raise the accuracy above 95%.


  1. Because a word can have several POS tags; the prediction uses the tag that was observed most often for that word.
