Python Machine Learning NLP: Word2vec Movie Review Modeling

Contents

  • Overview
  • Word Vectors
  • Word Vector Dimensionality
  • Code Implementation
    • Preprocessing
    • Main Program

Overview
Starting today, we begin a journey into natural language processing (NLP). NLP lets machines process, understand, and work with human language, serving as a bridge between machine language and human language.


Word Vectors
Let's start with what a word vector actually is. When we hand text to an algorithm, the computer cannot understand the raw text, and this is the problem word vectors were created to solve. Simply put, a word vector represents a word as a vector of numbers.
[Figure: words represented as numeric vectors]

When we describe a person, we use indicators such as height and weight, and these indicators can be treated as a vector. Once we have vectors, we can measure similarity in a variety of ways, as in the sketch below.
[Figure: describing a person with a feature vector]
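For example, with two people each described by a (height, weight, age) vector, cosine similarity gives a single similarity score. A minimal NumPy sketch (the numbers are made up for illustration):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means very similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two people described as (height cm, weight kg, age) vectors
person_a = np.array([175.0, 70.0, 30.0])
person_b = np.array([180.0, 75.0, 28.0])

print(cosine_similarity(person_a, person_b))  # close to 1: the two are similar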

So how do we describe the features of language? We split the text into individual words, and then build features at the level of words.
[Figure: building features at the word level]


Word Vector Dimensionality
The higher the dimensionality of a word vector, the more information it can encode, and the more reliable the similarity computations built on it tend to be.
A 50-dimensional word vector:
[Figure: a 50-dimensional word vector]

Shown as a heat map:
[Figure: heat map of a 50-dimensional word vector]
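A heat map like this is easy to produce yourself: draw the vector as a one-row image. A minimal matplotlib sketch (the vector here is random, standing in for a trained word vector):

import numpy as np
import matplotlib.pyplot as plt

# A made-up 50-dimensional vector standing in for a real word vector
vec = np.random.uniform(-1, 1, 50)

plt.imshow(vec.reshape(1, -1), cmap="RdBu", aspect="auto")
plt.yticks([])          # a single row, so hide the y axis
plt.xlabel("dimension")
plt.colorbar()
plt.show()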

[Figure: heat maps of several related words compared]

As the figures show, similar words have similar patterns in their feature representation, which is evidence that the word features are meaningful. You can check this yourself with pretrained vectors, as in the sketch that follows.
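A hedged sketch using gensim's downloader API to load pretrained 50-dimensional GloVe vectors (this is not part of the article's own code, and the first call downloads roughly 66 MB):

import gensim.downloader as api

# Download and load pretrained 50-dimensional GloVe word vectors
wv = api.load("glove-wiki-gigaword-50")

print(wv["king"].shape)                 # (50,)
print(wv.similarity("king", "queen"))   # high: related words have similar vectors
print(wv.similarity("king", "banana"))  # much lower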

Code Implementation
Preprocessing
import itertools
import re

import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Stop words, one per line
stop_words = pd.read_csv("data/stopwords.txt", index_col=False, quoting=3,
                         sep="\n", names=["stop_words"])
stop_words = [word.strip() for word in stop_words["stop_words"].values]


def load_train_data():
    """Load the labeled training data."""
    data = pd.read_csv("data/labeledTrainData.tsv", sep="\t", escapechar="\\")
    print(data[:5])
    print("Number of training reviews:", len(data))  # 25,000
    return data


def load_test_data():
    """Load the unlabeled data."""
    data = pd.read_csv("data/unlabeledTrainData.tsv", sep="\t", escapechar="\\")
    print("Number of test reviews:", len(data))  # 50,000
    return data


def pre_process(text):
    # Strip HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Keep letters only (drop punctuation and digits)
    text = re.sub("[^a-zA-Z]", " ", text)
    # Lowercase and split into words
    words = text.lower().split()
    # Remove stop words
    words = [w for w in words if w not in stop_words]
    return " ".join(words)


def split_train_data():
    data = pd.read_csv("data/train.csv")
    print(data.head())

    # Extract bag-of-words features
    vec = CountVectorizer(max_features=5000)
    vec.fit(data["review"])
    train_data_features = vec.transform(data["review"]).toarray()
    print(train_data_features.shape)
    # Vocabulary (use get_feature_names_out() on scikit-learn >= 1.0)
    print(vec.get_feature_names())

    # Split into train / validation sets
    X_train, X_test, y_train, y_test = train_test_split(
        train_data_features, data["sentiment"], test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test


def test():
    data = pd.read_csv("data/test.csv")
    print(data.head())

    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

    def split_sentences(review):
        # Split a review into sentences, dropping empty ones
        raw_sentences = tokenizer.tokenize(review.strip())
        return [s for s in raw_sentences if len(s) > 0]

    sentences = sum(data["review"][:10].apply(split_sentences), [])


def visualize(cm, classes, title="Confusion matrix", cmap=plt.cm.Blues):
    """Plot a confusion matrix."""
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()


if __name__ == '__main__':
    # # Preprocess the training data and save it
    # train_data = load_train_data()
    # train_data["review"] = train_data["review"].apply(pre_process)
    # print(train_data.head())
    # train_data.to_csv("data/train.csv")

    # # Preprocess the test data and save it
    # test_data = load_test_data()
    # test_data["review"] = test_data["review"].apply(pre_process)
    # print(test_data.head())
    # test_data.to_csv("data/test.csv")

    split_train_data()
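As a quick sanity check of pre_process, here is a hypothetical call (the exact output depends on the contents of data/stopwords.txt):

raw = "<br />This movie was GREAT!!! I loved the acting."
print(pre_process(raw))
# e.g. "movie great loved acting"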


Main Program
import pandas as pd
import nltk
from gensim.models.word2vec import Word2Vec


def pre_process():
    """Tokenize the cleaned reviews (requires: nltk.download("punkt"))."""
    data = pd.read_csv("data/test.csv")
    print(data.head())

    # One token list per review
    result = []
    for line in data["review"]:
        result.append(nltk.word_tokenize(line))
    return result


def main():
    # Tokenized corpus
    word_list = pre_process()

    # Word2Vec training parameters
    num_features = 300   # Word vector dimensionality
    min_word_count = 40  # Minimum word count
    num_workers = 4      # Number of threads to run in parallel
    context = 10         # Context window size
    model_name = '{}features_{}minwords_{}context.model'.format(
        num_features, min_word_count, context)

    # Train the Word2Vec model
    # (vector_size is the gensim >= 4.0 name; gensim 3.x called it "size")
    model = Word2Vec(sentences=word_list, workers=num_workers,
                     vector_size=num_features, min_count=min_word_count,
                     window=context)

    # Save the model
    model.save(model_name)


def test():
    # Load the trained model
    model = Word2Vec.load("300features_40minwords_10context.model")

    # Which word does not belong with the others?
    match = model.wv.doesnt_match(['man', 'woman', 'child', 'kitchen'])
    print(match)

    # Nearest neighbors in the vector space
    print(model.wv.most_similar("boy"))
    print(model.wv.most_similar("bad"))


if __name__ == '__main__':
    test()
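Before the word vectors can feed a downstream classifier, each review still needs a single fixed-length feature vector. A common approach, sketched here as an assumption rather than taken from the article's own code, is to average the Word2Vec vectors of the words the model knows:

import numpy as np
from gensim.models.word2vec import Word2Vec

model = Word2Vec.load("300features_40minwords_10context.model")

def review_vector(tokens, model):
    """Average the vectors of all in-vocabulary tokens."""
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        # No known words: fall back to a zero vector
        return np.zeros(model.wv.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

print(review_vector(["good", "movie"], model).shape)  # (300,)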

Output:
2021-09-16 20:36:40.791181: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
   Unnamed: 0      id  sentiment                                             review
0           0  5814_8          1  stuff moment mj ve started listening music wat...
1           1  2381_9          1  classic war worlds timothy hines entertaining ...
2           2  7759_3          0  film starts manager nicholas bell investors ro...
3           3  3630_4          0  assumed praised film filmed opera didn read do...
4           4  9495_8          1  superbly trashy wondrously unpretentious explo...
[... padded token-ID sequences ...]
[[0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]]
2021-09-16 20:36:46.488438: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-16 20:36:46.489070: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: /usr/lib/x86_64-linux-gnu/libcuda.so.1: file too short; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu
2021-09-16 20:36:46.489097: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-16 20:36:46.489128: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (313c6f2d15e2): /proc/driver/nvidia/version does not exist
2021-09-16 20:36:46.489488: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-16 20:36:46.493241: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 200)         14684800
_________________________________________________________________
lstm (LSTM)                  (None, 200)               320800
_________________________________________________________________
dropout (Dropout)            (None, 200)               0
_________________________________________________________________
dense (Dense)                (None, 64)                12864
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130
=================================================================
Total params: 15,018,594
Trainable params: 15,018,594
Non-trainable params: 0
_________________________________________________________________
None
2021-09-16 20:36:46.792534: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-09-16 20:36:46.830442: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
Epoch 1/2
313/313 [==============================] - 101s 315ms/step - loss: 0.5581 - accuracy: 0.7229 - val_loss: 0.3703 - val_accuracy: 0.8486
Epoch 2/2
313/313 [==============================] - 98s 312ms/step - loss: 0.2174 - accuracy: 0.9195 - val_loss: 0.3016 - val_accuracy: 0.8822

This concludes the detailed walkthrough of modeling movie reviews with Word2vec in Python. For more material on natural language processing, see the other related articles on 脚本之家.
