Classification Algorithms: Naive Bayes Classifier

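Preface
This program is based on a news text classification experiment and uses a naive Bayes classifier to carry out the classification task. It runs smoothly under Python 3.6; the places that need changing for Python 2.x are pointed out in the comments.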

requirements: pandas, numpy, scikit-learn
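To see implementations of other classic algorithms, have a look at my other collections.
Analysis of experimental results
Naive Bayes models are widely used to classify large volumes of Internet text. Because the model assumes that features are conditionally independent given the class, the number of parameters to be estimated drops from exponential to linear in the number of features, which greatly reduces memory consumption and computation time. The same strong assumption is also its limitation: training cannot take the relationships between features into account, so the model performs poorly on classification tasks whose features are strongly correlated.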
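To make the independence assumption concrete, here is a minimal from-scratch sketch of a multinomial naive Bayes classifier in pure Python. It is not the scikit-learn implementation used in the program below; the function names (`train_multinomial_nb`, `predict`) and the four-document toy corpus are made up for illustration, and `alpha=1.0` is assumed as the Laplace smoothing constant. The point to notice is that the model stores only one likelihood per (class, word) pair, and that a document's score is just a sum of per-word log likelihoods.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)    # word frequencies per class
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    log_prior = {y: math.log(n / len(docs)) for y, n in class_docs.items()}
    log_lik = {}
    for y in class_docs:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for y in log_prior:
        # Independence assumption: per-word log likelihoods simply add up.
        # Words that never appeared in training are skipped.
        scores[y] = log_prior[y] + sum(log_lik[y][w] for w in doc if w in log_lik[y])
    return max(scores, key=scores.get)

# Toy, made-up corpus (hypothetical data, not taken from 20 Newsgroups)
train_docs = [
    "the team won the hockey game".split(),
    "goalie saved the puck".split(),
    "the rocket reached orbit".split(),
    "nasa launched a satellite into orbit".split(),
]
train_labels = ["hockey", "hockey", "space", "space"]

log_prior, log_lik = train_multinomial_nb(train_docs, train_labels)
prediction = predict("the satellite reached orbit".split(), log_prior, log_lik)
n_params = sum(len(per_word) for per_word in log_lik.values())
print(prediction, n_params)
```

With 2 classes and a 16-word vocabulary the sketch stores 2 x 16 likelihoods, which is the linear growth mentioned above; a model of the full joint distribution over word combinations would need exponentially many.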
Source code
# import the news data loader from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups
# the first call downloads the data set, so an internet connection is needed
news=fetch_20newsgroups(subset='all')
# check the details and scale of the news data
# print(len(news.data))
# print(news.data[0])
# data preprocessing
# note: on Python 2.7 with an old scikit-learn, use sklearn.cross_validation instead of sklearn.model_selection
#from sklearn.cross_validation import train_test_split #DeprecationWarning
from sklearn.model_selection import train_test_split # train_test_split from sklearn.model_selection splits the data
# take 25 percent of the data at random for testing and the rest for training
X_train,X_test,y_train,y_test = train_test_split(news.data,news.target,test_size=0.25,random_state=33)
# import the text feature extraction module, which turns documents into word-count vectors
from sklearn.feature_extraction.text import CountVectorizer
vec=CountVectorizer()
# learn the vocabulary from the training set only, then reuse it for the test set
X_train=vec.fit_transform(X_train)
X_test=vec.transform(X_test)
# import and initialize the naive Bayes model with default settings
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()
# train the model on the training set
mnb.fit(X_train,y_train)
# predict the target labels of the test set
y_predict=mnb.predict(X_test)
# import classification_report to evaluate model performance
from sklearn.metrics import classification_report
# get the accuracy from the score function of the mnb model
print('The accuracy of Naive Bayes Classifier is',mnb.score(X_test,y_test))
# get precision, recall and f1-score from classification_report
print(classification_report(y_test,y_predict,target_names=news.target_names))
Program output (Ubuntu 16.04, Python 3.6):
The accuracy of Naive Bayes Classifier is 0.8397707979626485
                           precision    recall  f1-score   support
alt.atheism                     0.86      0.86      0.86       201
comp.graphics                   0.59      0.86      0.70       250
comp.os.ms-windows.misc         0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware        0.60      0.88      0.72       240
comp.sys.mac.hardware           0.93      0.78      0.85       242
comp.windows.x                  0.82      0.84      0.83       263
misc.forsale                    0.91      0.70      0.79       257
rec.autos                       0.89      0.89      0.89       238
rec.motorcycles                 0.98      0.92      0.95       276
rec.sport.baseball              0.98      0.91      0.95       251
rec.sport.hockey                0.93      0.99      0.96       233
sci.crypt                       0.86      0.98      0.91       238
sci.electronics                 0.85      0.88      0.86       249
sci.med                         0.92      0.94      0.93       245
sci.space                       0.89      0.96      0.92       221
soc.religion.christian          0.78      0.96      0.86       232
talk.politics.guns              0.88      0.96      0.92       251
talk.politics.mideast           0.90      0.98      0.94       231
talk.politics.misc              0.79      0.89      0.84       188
talk.religion.misc              0.93      0.44      0.60       158
avg / total                     0.86      0.84      0.82      4712
[Finished in 4.6s]
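To help read the report, here is a small sketch of how its columns relate to each other. It assumes that the f1-score is the harmonic mean of precision and recall and that the "avg / total" row is the support-weighted average of the per-class values, which is my understanding of how this version of classification_report works. The helper names are my own; the recall and support values are copied from the output above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_average(values, supports):
    """Average of per-class values, weighted by the number of test samples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

# Per-class recall and support, in the same order as the report
recalls = [0.86, 0.86, 0.10, 0.88, 0.78, 0.84, 0.70, 0.89, 0.92, 0.91,
           0.99, 0.98, 0.88, 0.94, 0.96, 0.96, 0.96, 0.98, 0.89, 0.44]
supports = [201, 250, 248, 240, 242, 263, 257, 238, 276, 251,
            233, 238, 249, 245, 221, 232, 251, 231, 188, 158]

avg_recall = weighted_average(recalls, supports)
print('support-weighted recall:', round(avg_recall, 2))

# comp.os.ms-windows.misc: high precision but very low recall gives a low f1-score
print('f1 for precision 0.89, recall 0.10:', f1_from(0.89, 0.10))
```

The support-weighted recall rounds to the same 0.84 as the reported accuracy, since weighting each class's recall by its support counts the correctly classified test samples. Because the inputs here are already rounded to two decimals, a recomputed f1-score can differ from the report in the last digit.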
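Corrections are welcome, including corrections to the English and to the program. Questions are welcome too; let's keep at it and improve together. [Classification Algorithms: Naive Bayes Classifier] This program is entirely my own work, typed in character by character; please credit the source when reposting.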
