How to plot a confusion matrix nicely in Python

From Wikipedia, the free encyclopedia
Terminology and derivations
condition positive (P)
the number of real positive cases in the data
condition negatives (N)
the number of real negative cases in the data
true positive (TP)
eqv. with hit
true negative (TN)
eqv. with correct rejection
false positive (FP)
eqv. with false alarm, Type I error
false negative (FN)
eqv. with miss, Type II error
sensitivity, recall, hit rate, or true positive rate (TPR):
  $\mathrm{TPR} = \frac{\mathrm{TP}}{P} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$
specificity, selectivity, or true negative rate (TNR):
  $\mathrm{TNR} = \frac{\mathrm{TN}}{N} = \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$
precision or positive predictive value (PPV):
  $\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$
negative predictive value (NPV):
  $\mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FN}}$
miss rate or false negative rate (FNR):
  $\mathrm{FNR} = \frac{\mathrm{FN}}{P} = \frac{\mathrm{FN}}{\mathrm{FN}+\mathrm{TP}} = 1 - \mathrm{TPR}$
fall-out or false positive rate (FPR):
  $\mathrm{FPR} = \frac{\mathrm{FP}}{N} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}} = 1 - \mathrm{TNR}$
false discovery rate (FDR):
  $\mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TP}} = 1 - \mathrm{PPV}$
false omission rate (FOR):
  $\mathrm{FOR} = \frac{\mathrm{FN}}{\mathrm{FN}+\mathrm{TN}} = 1 - \mathrm{NPV}$
accuracy (ACC):
  $\mathrm{ACC} = \frac{\mathrm{TP}+\mathrm{TN}}{P+N} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$
F1 score (the harmonic mean of precision and sensitivity):
  $F_{1} = 2\cdot\frac{\mathrm{PPV}\cdot\mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN}}$
Matthews correlation coefficient (MCC):
  $\mathrm{MCC} = \frac{\mathrm{TP}\times\mathrm{TN}-\mathrm{FP}\times\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$
Informedness or Bookmaker Informedness (BM):
  $\mathrm{BM} = \mathrm{TPR} + \mathrm{TNR} - 1$
Markedness (MK):
  $\mathrm{MK} = \mathrm{PPV} + \mathrm{NPV} - 1$
Sources: Fawcett (2006), Powers (2011), and Ting (2011)
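To make the definitions above concrete, here is a minimal Python sketch (not part of the article; the counts are illustrative) that computes the main derived rates from the four cells of a binary confusion matrix:

TP, FP, FN, TN = 5, 2, 3, 17      # illustrative counts, not data from the text
P = TP + FN                       # condition positive
N = TN + FP                       # condition negative
TPR = TP / float(P)               # sensitivity, recall, hit rate
TNR = TN / float(N)               # specificity, selectivity
PPV = TP / float(TP + FP)         # precision, positive predictive value
NPV = TN / float(TN + FN)         # negative predictive value
FNR = 1 - TPR                     # miss rate
FPR = 1 - TNR                     # fall-out
ACC = (TP + TN) / float(P + N)    # accuracy
F1 = 2 * PPV * TPR / (PPV + TPR)  # harmonic mean of precision and recall
MCC = (TP * TN - FP * FN) / ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
BM = TPR + TNR - 1                # informedness
MK = PPV + NPV - 1                # markedness
print(TPR, TNR, PPV, NPV, ACC, F1, MCC, BM, MK)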
In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).
It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table).
If a classification system has been trained to distinguish between cats, dogs and rabbits, a confusion matrix will summarize the results of testing the algorithm for further inspection. Assuming a sample of 27 animals — 8 cats, 6 dogs, and 13 rabbits, the resulting confusion matrix could look like the table below:
                     Actual class
                   Cat   Dog   Rabbit
Predicted Cat       5     2      0
Predicted Dog       3     3      2
Predicted Rabbit    0     1     11
In this confusion matrix, of the 8 actual cats, the system predicted that three were dogs, and of the six dogs, it predicted that one was a rabbit and two were cats. We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well. All correct predictions are located in the diagonal of the table, so it is easy to visually inspect the table for prediction errors, as they will be represented by values outside the diagonal.
In predictive analytics, a table of confusion (sometimes also called a confusion matrix) is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives. This allows more detailed analysis than the mere proportion of correct classifications (accuracy). Accuracy is not a reliable metric for the real performance of a classifier, because it will yield misleading results if the data set is unbalanced (that is, when the numbers of observations in different classes vary greatly). For example, if there were 95 cats and only 5 dogs in the data set, a particular classifier might classify all the observations as cats. The overall accuracy would be 95%, but in more detail the classifier would have a 100% recognition rate for the cat class but a 0% recognition rate for the dog class.
Assuming the confusion matrix above, its corresponding table of confusion, for the cat class, would be:
                        Actual class
                     Cat                 Non-cat
Predicted Cat        5 true positives     2 false positives
Predicted Non-cat    3 false negatives   17 true negatives
The final table of confusion would contain the average values for all classes combined.
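The same numbers can be reproduced programmatically. The sketch below (scikit-learn assumed; the label vectors are made up so that they are consistent with the counts described in the text) builds the 3-class confusion matrix and then collapses it into the binary table of confusion for the cat class:

from sklearn.metrics import confusion_matrix

y_true = ['cat'] * 8 + ['dog'] * 6 + ['rabbit'] * 13
y_pred = (['cat'] * 5 + ['dog'] * 3 +                    # the 8 actual cats
          ['cat'] * 2 + ['dog'] * 3 + ['rabbit'] * 1 +   # the 6 actual dogs
          ['dog'] * 2 + ['rabbit'] * 11)                 # the 13 actual rabbits

labels = ['cat', 'dog', 'rabbit']
cm = confusion_matrix(y_true, y_pred, labels=labels)     # rows = actual, columns = predicted
print(cm)

# collapse the multiclass matrix into the table of confusion for the "cat" class
tp = cm[0, 0]
fp = cm[:, 0].sum() - tp   # predicted cat but actually something else
fn = cm[0, :].sum() - tp   # actual cat but predicted something else
tn = cm.sum() - tp - fp - fn
print(tp, fp, fn, tn)      # expected: 5 2 3 17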
Let us define an experiment from P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 confusion matrix, as follows:
The rows of the table hold the predicted condition (positive or negative) and the columns hold the true condition (condition positive or condition negative); the margins give several derived measures:

Prevalence = Σ condition positive / Σ total population
Accuracy (ACC) = (Σ true positive + Σ true negative) / Σ total population
For the predicted condition positive row:
Positive predictive value (PPV), precision = Σ true positive / Σ predicted condition positive
False discovery rate (FDR) = Σ false positive / Σ predicted condition positive
For the predicted condition negative row:
False omission rate (FOR) = Σ false negative / Σ predicted condition negative
Negative predictive value (NPV) = Σ true negative / Σ predicted condition negative
True positive rate (TPR), recall, sensitivity, probability of detection = Σ true positive / Σ condition positive
False positive rate (FPR), fall-out, probability of false alarm = Σ false positive / Σ condition negative
Positive likelihood ratio (LR+) = TPR / FPR
Diagnostic odds ratio (DOR) = LR+ / LR-
F1 score = 2 / (1/recall + 1/precision)
False negative rate (FNR), miss rate = Σ false negative / Σ condition positive
Specificity (SPC), true negative rate (TNR) = Σ true negative / Σ condition negative
Negative likelihood ratio (LR-) = FNR / TNR
Fawcett, Tom (2006). "An Introduction to ROC Analysis" (PDF). Pattern Recognition Letters. 27 (8): 861–874.
Powers, David M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation" (PDF). Journal of Machine Learning Technologies. 2 (1): 37–63.
Ting, Kai Ming (2011). Encyclopedia of Machine Learning. Springer.
Stehman, Stephen V. (1997). "Selecting and interpreting measures of thematic classification accuracy". Remote Sensing of Environment. 62 (1): 77–89.

Scikit-learn: machine learning in Python (Bayesian learning)
Chapter 2: Naive Bayes.
Naive Bayes is a simple but very powerful classifier based on a probabilistic model derived from Bayes' theorem. Essentially, it determines the probability that an instance belongs to a class based on the probability of each of its feature values, under the assumption that all features are independent of one another. One very successful application of Naive Bayes is natural language processing (NLP), where large amounts of labelled data (usually text files) are available to serve as the training set for the algorithm.
This section shows how to use Naive Bayes for text classification. The dataset is a collection of text documents, each assigned to a category, and we train a Naive Bayes model to predict the category of a new, unseen document. The dataset shipped with scikit-learn contains about 19,000 newsgroup posts from 20 different topics, ranging from politics and religion to sports and science.
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')  # load the data and assign it to news
Note that the data is stored as a list of text contents rather than a matrix. Also, the book uses Python 2 while I use Python 3, so the code differs slightly from the book.
print(type(news.data),type(news.target),type(news.target_names))
Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)
print(news.target_names)
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
print(len(news.data))
print(len(news.target))
print(news.data[0])
From: Mamatha Devineni Ratnam
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
NNTP-Posting-Host: po4.andrew.cmu.edu
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
print(news.target[0], news.target_names[news.target[0]])  # target holds the index of the category name
10 rec.sport.hockey  # indices start at 0
Preprocessing the data:
The machine learning algorithms in this book can only work with numeric data, so the text has to be converted into numeric features.
At the moment we have only one feature, the text content itself, so we need functions that turn a text into a meaningful set of numeric features. Intuitively, we look at which words (or, more precisely, tokens, including numbers and punctuation) occur in the texts of each category, and try to characterize each category by the frequency distribution of those tokens. The sklearn.feature_extraction.text module provides utilities for building numeric feature vectors from text documents.
Before transforming the data, we split it into a training set and a test set: in random order, 75% of the instances go to training and 25% to testing.
SPLIT_PETC=0.75
split_size=int(len(news.data)*SPLIT_PETC)
x_train=news.data[:split_size]
x_test=news.data[split_size:]
y_train=news.target[:split_size]
y_test=news.target[split_size:]
There are three ways of turning the text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. (They differ in how the numeric features are computed.)
CountVectorizer mainly builds a dictionary of the words in the corpus; each instance is then converted into a vector of numeric features in which each element counts how many times a particular word occurs in the text.
HashingVectorizer applies a hashing function to map tokens to feature indices, and then counts occurrences as CountVectorizer does.
TfidfVectorizer works like CountVectorizer, but with a more advanced calculation called Term Frequency Inverse Document Frequency (TF-IDF), a statistic that measures how important a word is to a document in a corpus: it rewards words that are frequent in the current document relative to how often they appear in the whole corpus, giving a normalized result and avoiding over-weighting words that are frequent everywhere. (A small comparison sketch follows.)
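To see the difference concretely, here is a small sketch (not from the book) that applies the three vectorizers to a made-up toy corpus:

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

toy_corpus = ['the cat sat on the mat', 'the dog ate my homework', 'the cat ate the fish']

count_vec = CountVectorizer()
print(count_vec.fit_transform(toy_corpus).toarray())   # raw per-document word counts
print(count_vec.get_feature_names())                   # the vocabulary that was learned

tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(toy_corpus).toarray())   # counts reweighted by TF-IDF and normalized

hash_vec = HashingVectorizer(n_features=16)            # no vocabulary kept, only a hash function
print(hash_vec.transform(toy_corpus).shape)

Note that HashingVectorizer stores no vocabulary, which saves memory at the price of not being able to map a feature index back to the word it came from.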
Training a Naive Bayes classifier:
We build a classifier out of a feature vectorizer and the actual Bayes classifier, using MultinomialNB from the sklearn.naive_bayes module; Pipeline from sklearn.pipeline combines the vectorizer and the classifier into a single estimator. Here we combine MultinomialNB with each of the three vectorizers mentioned above, giving three different classifiers, and compare which one does better with the default parameters.
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
clf_1 = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB()),])
clf_2 = Pipeline([('vect', HashingVectorizer(non_negative=True)), ('clf', MultinomialNB()),])
clf_3 = Pipeline([('vect', TfidfVectorizer()), ('clf', MultinomialNB()),])
Define a function that performs K-fold cross-validation of a classifier on the given x and y:
from sklearn.cross_validation import cross_val_score, KFold
import numpy as np
from scipy.stats import sem
def evaluate_cross_validation(clf, x, y, K):
    # create a k-fold cross validation iterator of k=5 folds
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, x, y, cv=cv)
    print(scores)
    print(("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores)))
Then run 5-fold cross-validation for each of the classifiers:
clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)
The results are as follows:
[ 0.....8458477 ]
Mean score:0.853 (+/-0.003)
Mean score:0.770 (+/-0.005)
Mean score:0.850 (+/-0.004)
As we can see, CountVectorizer and TfidfVectorizer perform better than HashingVectorizer. Let us continue with TfidfVectorizer and try to improve the results by parsing the documents into tokens with a different regular expression.
The default regular expression is ur"\b\w\w+\b", which considers alphanumeric characters and the underscore; we may also want to keep hyphens and dots so that tokens such as Wi-Fi stay whole.
The new regular expression is ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b":
clf_4 = Pipeline([('vect', TfidfVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)), ('clf', MultinomialNB()),])  # Python 3 does not support the ur prefix
evaluate_cross_validation(clf_4,news.data,news.target,5)
The results are as follows:
[ 0.....8588485 ]
Mean score:0.865 (+/-0.003)
The score improves from 0.850 to 0.865.
There is another useful parameter, stop_words, which lets us pass a list of words we do not want counted, such as words that are too frequent, or words we do not a priori expect to provide information about the particular topic.
Define a function to read the stop words:
def get_stop_words():
    result = set()
    for line in open('stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result
Then build a new classifier:
clf_5 = Pipeline([('vect', TfidfVectorizer(stop_words=get_stop_words(), token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)), ('clf', MultinomialNB()),])
evaluate_cross_validation(clf_5,news.data,news.target,5)
The results are as follows:
Mean score:0.889 (+/-0.003)
The score improves from 0.865 to 0.889.
Looking at the parameters of MultinomialNB, the most important one is alpha, the smoothing parameter. Its default value is 1.0; let us set it to 0.1:
clf_6 = Pipeline([('vect', TfidfVectorizer(stop_words=get_stop_words(), token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b",)), ('clf', MultinomialNB(alpha=0.1)),])
The results are as follows:
Mean score:0.915 (+/-0.001)
The score improves from 0.889 to 0.915. Next, let us look at how different alpha values affect the result; a possible sweep is sketched below.
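A possible version of that experiment (the candidate values below are arbitrary) simply reuses the evaluate_cross_validation helper defined earlier:

for alpha in [1.0, 0.5, 0.1, 0.05, 0.01]:
    clf = Pipeline([
        ('vect', TfidfVectorizer(stop_words=get_stop_words(),
                                 token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")),
        ('clf', MultinomialNB(alpha=alpha)),
    ])
    print('alpha =', alpha)
    evaluate_cross_validation(clf, news.data, news.target, 5)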
Model evaluation:
Define a function that trains the model on the entire training set and evaluates its accuracy on both the training and the testing sets.
from sklearn import metrics
def train_and_evaluate(clf, x_train, x_test, y_train, y_test):
    clf.fit(x_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(x_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(x_test, y_test))
    y_pred = clf.predict(x_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))
train_and_evaluate(clf_6,x_train,x_test,y_train,y_test)
Accuracy on training set:
Accuracy on testing set:
As shown above, the results are decent; the accuracy on the testing set also reaches roughly 0.91.
How to plot a confusion matrix nicely in Python
When doing classification we often need to plot a confusion matrix. scikit-learn provides an example of this, but the default plot does not satisfy our requirements: the ticks should be offset so the grid lines fall between the cells, each label name should sit in the middle of its row and column, each cell should display its accuracy value, and the label names must appear on both axes (with a usable layout even when the names are long). The matplotlib code below handles these points; it expects a labels.txt file with one class name per line and a predict.txt file whose lines contain the true and the predicted label index:

# -*- coding: utf-8 -*-
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

# load the label names, one per line
labels = []
file = open('labels.txt', 'r')
lines = file.readlines()
for line in lines:
    labels.append(line.strip())
file.close()

# load the true and predicted labels; each line contains "true_label predict_label"
y_true = []
y_pred = []
file = open('predict.txt', 'r')
lines = file.readlines()
for line in lines:
    y_true.append(int(line.split(" ")[0].strip()))
    y_pred.append(int(line.split(" ")[1].strip()))
file.close()

tick_marks = np.array(range(len(labels))) + 0.5

def plot_confusion_matrix(cm, title='Confusion Matrix', cmap=plt.cm.binary):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    xlocations = np.array(range(len(labels)))
    plt.xticks(xlocations, labels, rotation=90)
    plt.yticks(xlocations, labels)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# compute the confusion matrix and normalize it by row (per true class)
cm = confusion_matrix(y_true, y_pred)
np.set_printoptions(precision=2)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print(cm_normalized)

plt.figure(figsize=(12, 8), dpi=120)

# write the normalized value into the middle of each cell
ind_array = np.arange(len(labels))
x, y = np.meshgrid(ind_array, ind_array)
for x_val, y_val in zip(x.flatten(), y.flatten()):
    c = cm_normalized[y_val][x_val]
    if c > 0.01:
        plt.text(x_val, y_val, "%0.2f" % (c,), color='red', fontsize=7, va='center', ha='center')

# offset the ticks so the grid falls between the cells, and hide the tick marks themselves
plt.gca().set_xticks(tick_marks, minor=True)
plt.gca().set_yticks(tick_marks, minor=True)
plt.gca().xaxis.set_ticks_position('none')
plt.gca().yaxis.set_ticks_position('none')
plt.grid(True, which='minor', linestyle='-')
plt.gcf().subplots_adjust(bottom=0.15)

plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

# show confusion matrix
plt.show()

The result is shown in the figure below.
回答问题,赢新手礼包本文是 2014 年 12 月我在布拉格经济大学做的名为‘ Python 数据科学’讲座的笔记。欢迎通过&&进行提问和评论。
本次讲座的目的是展示一些关于机器学习的高级概念。该笔记中用具体的代码来做演示,大家可以在自己的电脑上运行(需要安装 IPython,如下所示)。
本次讲座的听众需要了解一些基础的编程(不一定是 Python),并拥有一点基本的数据挖掘背景。本次讲座不是机器学习专家的“高级演讲”。
这些代码实例创建了一个有效的、可执行的原型系统:一个使用“spam”(垃圾信息)或“ham”(非垃圾信息)对英文手机短信(”短信类型“的英文)进行分类的 app。
整套代码使用&&语言。 python 是一种在管线(pipeline)的所有环节(I/O、数据清洗重整和预处理、模型训练和评估)都好用的通用语言。尽管 python 不是唯一选择,但它灵活、易于开发,性能优越,这得益于它成熟的科学计算生态系统。Python 庞大的、开源生态系统同时避免了任何单一框架或库的限制(以及相关的信息丢失)。
IPython notebook,是 Python 的一个工具,它是一个以&HTML 形式呈现的交互环境,可以通过它立刻看到结果。我们也将重温其它广泛用于数据科学领域的实用工具。
想交互运行下面的例子(选读)?
1. 安装免费的 Anaconda Python 发行版,其中已经包含 Python 本身。
2. 安装“自然语言处理”库——TextBlob:。
3. 下载本文的源码(网址:&并运行:$ ipython notebook data_science_python.ipynb
4. 观看 IPython notebook 基本用法教程&。
5. 运行下面的第一个代码,如果执行过程没有报错,就可以了。
端到端的例子:自动过滤垃圾信息
%matplotlib inline
import matplotlib.pyplot as plt
import csv
from textblob import TextBlob
import pandas
import sklearn
import cPickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.learning_curve import learning_curve
Step 1: Load data, look around
Let us skip the real first step (gathering domain knowledge and understanding what we are trying to solve, which is very important in practice), go straight to downloading the zip file used by this demo, and unpack it into the data subdirectory. You should see a file called SMSSpamCollection, about 0.5 MB in size:
$ ls -l data
total 1352
-rw-r--r--@ 1 kofola  staff  477907 Mar 15  2011 SMSSpamCollection
-rw-r--r--@ 1 kofola  staff    5868 Apr 18  2011 readme
-rw-r-----@ 1 kofola  staff  203415 Dec  1 15:30 smsspamcollection.zip
This file contains more than 5,000 SMS phone messages (see the readme file for more info):
messages = [line.rstrip() for line in open('./data/SMSSpamCollection')]
print len(messages)
A collection of texts is also sometimes called a "corpus". Let us print the first ten messages in this SMS corpus:
for message_no, message in enumerate(messages[:10]):
    print message_no, message
0 ham&&&&Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 ham&& Ok lar... Joking wif u oni...
2 spam&&Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply over18's
3 ham&& U dun say so early hor... U c already then say...
4 ham&& Nah I don't think he goes to usf, he lives around here though
5 spam&&FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, ?1.50 to rcv
6 ham&& Even my brother is not like to speak with me. They treat me like aids patent.
7 ham&& As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8 spam&&WINNER!! As a valued network customer you have been selected to receivea ?900 prize reward! To claim call . Claim code KL341. Valid 12 hours only.
9 spam&&Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on
We are looking at a TSV file (tab-separated values): the first column is a label saying whether the message is ham (normal) or spam, and the second column is the message itself.
This corpus will be our labelled training set. Using these ham/spam examples, we will train a machine learning model to automatically tell ham from spam. Then, with the trained model, we will be able to label arbitrary unlabelled messages as ham or spam.
We can use Python's Pandas library to handle the TSV file (as well as CSV or Excel files) for us:
messages = pandas.read_csv('./data/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
print messages
&&&& label&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&message
0&&&&&&ham&&Go until jurong point, crazy.. Available only ...
1&&&&&&ham&&&&&&&&&&&&&&&&&&&&&&Ok lar... Joking wif u oni...
2&&&& spam&&Free entry in 2 a wkly comp to win FA Cup fina...
3&&&&&&ham&&U dun say so early hor... U c already then say...
4&&&&&&ham&&Nah I don't think he goes to usf, he lives aro...
5&&&& spam&&FreeMsg Hey there darling it's been 3 week's n...
6&&&&&&ham&&Even my brother is not like to speak with me. ...
7&&&&&&ham&&As per your request 'Melle Melle (Oru Minnamin...
8&&&& spam&&WINNER!! As a valued network customer you have...
9&&&& spam&&Had your mobile 11 months or more? U R entitle...
10&&&& ham&&I'm gonna be home soon and i don't want to tal...
11&&&&spam&&SIX chances to win CASH! From 100 to 20,000 po...
12&&&&spam&&URGENT! You have won a 1 week FREE membership ...
13&&&& ham&&I've been searching for the right words to tha...
14&&&& ham&&&&&&&&&&&&&&&&I HAVE A DATE ON SUNDAY WITH WILL!!
15&&&&spam&&XXXMobileMovieClub: To use your credit, click ...
16&&&& ham&&&&&&&&&&&&&&&&&&&&&&&& Oh k...i'm watching here:)
17&&&& ham&&Eh u remember how 2 spell his name... Yes i di...
18&&&& ham&&Fine if that?s the way u feel. That?s the way ...
19&&&&spam&&England v Macedonia - dont miss the goals/team...
20&&&& ham&&&&&&&&&&Is that seriously how you spell his name?
21&&&& ham&&&&I‘m going to try for 2 months ha ha only joking
22&&&& ham&&So ü pay first lar... Then when is da stock co...
23&&&& ham&&Aft i finish my lunch then i go str down lor. ...
24&&&& ham&&Ffffffffff. Alright no way I can meet up with ...
25&&&& ham&&Just forced myself to eat a slice. I'm really ...
26&&&& ham&&&&&&&&&&&&&&&&&&&& Lol your always so convincing.
27&&&& ham&&Did you catch the bus ? Are you frying an egg ...
28&&&& ham&&I'm back & we're packing the car now, I'll...
29&&&& ham&&Ahhh. Work. I vaguely remember that! What does...
...&&&&...&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&...
5544&& ham&&&&&&&&&& Armand says get your ass over to epsilon
5545&& ham&&&&&&&&&&&& U still havent got urself a jacket ah?
5546&& ham&&I'm taking derek & taylor to walmart, if I...
5547&& ham&&&&&&Hi its in durban are you still on this number
5548&& ham&&&&&&&& Ic. There are a lotta childporn cars then.
5549&&spam&&Had your contract mobile 11 Mnths? Latest Moto...
5550&& ham&&&&&&&&&&&&&&&& No, I was trV
5551&& ham&&You know, wot people wear. T shirts, jumpers, ...
5552&& ham&&&&&&&&Cool, what time you think you can get here?
5553&& ham&&Wen did you get so spiritual and deep. That's ...
5554&& ham&&Have a safe trip to Nigeria. Wish you happines...
5555&& ham&&&&&&&&&&&&&&&&&&&&&&&&Hahaha..use your brain dear
5556&& ham&&Well keep in mind I've only got enough gas for...
5557&& ham&&Yeh. Indians was nice. Tho it did kane me off ...
5558&& ham&&Yes i have. So that's why u texted. Pshew...mi...
5559&& ham&&No. I meant the calculation is the same. That ...
5560&& ham&&&&&&&&&&&&&&&&&&&&&&&&&&&& Sorry, I'll call later
5561&& ham&&if you aren't here in the next&&<#&&&hou...
5562&& ham&&&&&&&&&&&&&&&&&&Anything lor. Juz both of us lor.
5563&& ham&&Get me out of this dump heap. My mom decided t...
5564&& ham&&Ok lor... Sony ericsson salesman... I ask shuh...
5565&& ham&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&Ard 6 like dat lor.
5566&& ham&&Why don't you wait 'til at least wednesday to ...
5567&& ham&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& Huh y lei...
5568&&spam&&REMINDER FROM O2: To get 2.50 pounds free call...
5569&&spam&&This is the 2nd time we have tried 2 contact u...
5570&& ham&&&&&&&&&&&&&& Will ü b going to esplanade fr home?
5571&& ham&&Pity, * was in mood for that. So...any other s...
5572&& ham&&The guy did some bitching but I acted like i'd...
5573&& ham&&&&&&&&&&&&&&&&&&&&&&&& Rofl. Its true to its name
[5574 rows x 2 columns]
We can also use pandas to look at aggregate statistics conveniently:
messages.groupby('label').describe()
[describe() output: for ham, the most frequent message is "Sorry, I'll call later"; for spam, "Please call our customer service representativ...".]
How long are the messages?
messages['length'] = messages['message'].map(lambda text: len(text))
print messages.head()
  label                                            message  length
0   ham  Go until jurong point, crazy.. Available only ...     111
1   ham                      Ok lar... Joking wif u oni...      29
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...     155
3   ham  U dun say so early hor... U c already then say...      49
4   ham  Nah I don't think he goes to usf, he lives aro...      61
messages.length.plot(bins=20, kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x10dd7a990>
messages.length.describe()
mean       80.604593
std        59.919970
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64
Which are the extremely long messages?
print list(messages.message[messages.length > 900])
["For me the love should start with attraction.i should feel that I need her every time
around me.she should be the first thing which comes in my thoughts.I would start the day and
end it with her.she should be there every time I dream.love will be then when my every
breath has her name.my life should happen around her.my life will be named to her.I would
cry for her.will give all my happiness and take all her sorrows.I will be ready to fight
with anyone for her.I will be in love when I will be doing the craziest things for her.love
will be when I don't have to proove anyone that my girl is the most beautiful lady on the
whole planet.I will always be singing praises for her.love will be when I start up making
chicken curry and end up makiing sambar.life will be the most beautiful then.will get every
morning and thank god for the day because she is with me.I would like to say a lot..will
tell later.."]
Is there any difference in message length between spam and ham?
messages.hist(column='length', by='label', bins=50)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x11270da50>,
&&&&&& <matplotlib.axes._subplots.AxesSubplot object at 0x>], dtype=object)
Great, but how can we make the computer understand plain text messages themselves? Can it make sense of all this gibberish?
Step 2: Data preprocessing
In this section we convert the raw messages (sequences of characters) into vectors (sequences of numbers);
the mapping is not one-to-one; we will use the bag-of-words model, in which each unique word is represented by one number.
As in step 1, let us write a function that splits a message into its individual words:
def split_into_tokens(message):
    message = unicode(message, 'utf8')  # convert bytes into proper unicode
    return TextBlob(message).words
Here is part of the original text again:
messages.message.head()
0&&&&Go until jurong point, crazy.. Available only ...
1&&&&&&&&&&&&&&&&&&&&&&&&Ok lar... Joking wif u oni...
2&&&&Free entry in 2 a wkly comp to win FA Cup fina...
3&&&&U dun say so early hor... U c already then say...
4&&&&Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object
And here is the same text after tokenization:
messages.message.head().apply(split_into_tokens)
0&&&&[Go, until, jurong, point, crazy, Available, o...
1&&&&&&&&&&&&&&&&&&&&&& [Ok, lar, Joking, wif, u, oni]
2&&&&[Free, entry, in, 2, a, wkly, comp, to, win, F...
3&&&&[U, dun, say, so, early, hor, U, c, already, t...
4&&&&[Nah, I, do, n't, think, he, goes, to, usf, he...
Name: message, dtype: object
Questions for natural language processing (NLP):
Do capital letters carry information?
Do the different forms of a word ("goes" vs. "go") carry information?
Do interjections and determiners carry information?
In other words, we want to normalize the text better.
We use TextBlob to get the part-of-speech (POS) tags:
TextBlob("Hello world, how is it going?").tags&&# list of (word, POS) pairs
[(u'Hello', u'UH'),
(u'world', u'NN'),
(u'how', u'WRB'),
(u'is', u'VBZ'),
(u'it', u'PRP'),
(u'going', u'VBG')]
and normalize the words into their base form (lemmas):
def split_into_lemmas(message):
    message = unicode(message, 'utf8').lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words]
messages.message.head().apply(split_into_lemmas)
0 [go, until, jurong, point, crazy, available, o...
1 [ok, lar, joking, wif, u, oni]
2 [free, entry, in, 2, a, wkly, comp, to, win, f...
3 [u, dun, say, so, early, hor, u, c, already, t...
4 [nah, i, do, n't, think, he, go, to, usf, he, ...
Name: message, dtype: object
Much better. You can probably think of many more ways to improve the preprocessing: decoding HTML entities (the &amp; and &lt; we saw above); filtering out stop words (pronouns and the like); adding more features, such as a flag for messages written in all capitals, and so on. A hedged sketch of some of these ideas follows.
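As a sketch of those ideas (the helper names below are made up, and the code follows the Python 2 style used in the rest of this notebook), one could extend the lemmatizer and add a simple extra feature:

import HTMLParser   # Python 2 standard library

def split_into_lemmas_extended(message):
    message = unicode(message, 'utf8')
    # decode HTML entities such as &amp; and &lt;
    message = HTMLParser.HTMLParser().unescape(message)
    words = TextBlob(message.lower()).words
    # drop very short tokens as a crude stop-word filter
    return [word.lemma for word in words if len(word) > 2]

def message_is_shouting(message):
    # extra boolean feature: does the message contain words written in ALL CAPS?
    return any(w.isupper() and len(w) > 1 for w in message.split())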
Step 3: Converting data to vectors
Now we convert each message, represented as a list of tokens (lemmas), into a vector that a machine learning model can understand.
Doing that with the bag-of-words model requires three steps:
1. counting how many times each word occurs in each message (term frequency);
2. weighting the counts, so that frequent tokens get lower weight (inverse document frequency);
3. normalizing the vectors to unit length, to abstract from the original text length (L2 norm).
Each vector has as many dimensions as there are unique words in the SMS corpus.
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(messages['message'])
print len(bow_transformer.vocabulary_)
Here we used scikit-learn (sklearn), a powerful Python machine learning library, which contains a multitude of methods and options.
Let us take one message and get its bag-of-words counts as a vector, using the new bow_transformer:
message4 = messages['message'][3]
print message4
U dun say so early hor... U c already then say...
bow4 = bow_transformer.transform([message4])
print bow4
print bow4.shape
&&(0, 1158)&&&&&&1
&&(0, 1899)&&&& 1
&&(0, 2897)&&&& 1
&&(0, 2927)&&&& 1
&&(0, 4021)&&&& 1
&&(0, 6736)&&&& 2
&&(0, 7111)&&&& 1
&&(0, 7698)&&&& 1
&&(0, 8013)&&&& 2
&&(1, 8874)
So, nine unique words in message number 4; two of them appear twice, the rest only once. Sanity check: which words appear twice?
print bow_transformer.get_feature_names()[6736]
print bow_transformer.get_feature_names()[8013]
The bag-of-words counts for the entire SMS corpus form a large, sparse matrix:
messages_bow = bow_transformer.transform(messages['message'])
print 'sparse matrix shape:', messages_bow.shape
print 'number of non-zeros:', messages_bow.nnz
print 'sparsity: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))
sparse matrix shape: (5574, 8874)
number of non-zeros: 80272
sparsity: 0.16%
Finally, after the counting, term weighting and normalization can be done with TF-IDF, using scikit-learn's TfidfTransformer:
tfidf_transformer = TfidfTransformer().fit(messages_bow)
tfidf4 = tfidf_transformer.transform(bow4)
print tfidf4
&&(0, 8013)&&&&&&0.
&&(0, 7698)&&&& 0.
&&(0, 7111)&&&& 0.
&&(0, 6736)&&&& 0.
&&(0, 4021)&&&& 0.
&&(0, 2927)&&&& 0.
&&(0, 2897)&&&& 0.
&&(0, 1899)&&&& 0.
&&(0, 1158)&&&& 0.
What is the IDF (inverse document frequency) of the word "u"? And of the word "university"?
print tfidf_transformer.idf_[bow_transformer.vocabulary_['u']]
print tfidf_transformer.idf_[bow_transformer.vocabulary_['university']]
To transform the entire bag-of-words corpus into a TF-IDF corpus at once:
messages_tfidf = tfidf_transformer.transform(messages_bow)
print messages_tfidf.shape
(5574, 8874)
There are a multitude of ways in which data can be preprocessed and vectorized. These two steps, also called "feature engineering", are typically the most time-consuming and least exciting parts of a prediction task, but they are very important and require experience. The trick is to evaluate constantly: analyze the model errors, improve data cleaning and preprocessing, brainstorm new features, evaluate, and repeat.
Step 4: Training a model, detecting spam
With messages represented as vectors, we can train our spam/ham classifier. This part is pretty straightforward, and there are many libraries that implement the training algorithms.
We will use scikit-learn here, choosing the Naive Bayes classifier to start with:
%time spam_detector = MultinomialNB().fit(messages_tfidf, messages['label'])
CPU times: user 4.51 ms, sys: 987 us, total: 5.49 ms
Wall time: 4.77 ms
Let us try classifying a random message:
print 'predicted:', spam_detector.predict(tfidf4)[0]
print 'expected:', messages.label[3]
predicted: ham
expected: ham
Hooray! You can try it with your own texts, too.
A natural question is: how many messages do we classify correctly overall?
all_predictions = spam_detector.predict(messages_tfidf)
print all_predictions
['ham' 'ham' 'spam' ..., 'ham' 'ham' 'ham']
print 'accuracy', accuracy_score(messages['label'], all_predictions)
print 'confusion matrix\n', confusion_matrix(messages['label'], all_predictions)
print '(row=expected, col=predicted)'
accuracy 0.
confusion matrix
[[4827    0]
 [ 170  577]]
(row=expected, col=predicted)
plt.matshow(confusion_matrix(messages['label'], all_predictions), cmap=plt.cm.binary, interpolation='nearest')
plt.title('confusion matrix')
plt.colorbar()
plt.ylabel('expected label')
plt.xlabel('predicted label')
<matplotlib.text.Text at 0x>
From this confusion matrix we can compute precision and recall, or their combination (the harmonic mean), F1:
print classification_report(messages['label'], all_predictions)
&&&&&&&&&&&& precision&&&&recall&&f1-score&& support
&&&&&&&&ham&&&&&& 0.97&&&&&&1.00&&&&&&0.98&&&&&&4827
&&&&&& spam&&&&&& 1.00&&&&&&0.77&&&&&&0.87&&&&&& 747
avg / total&&&&&& 0.97&&&&&&0.97&&&&&&0.97&&&&&&5574
There are quite a few possible metrics for evaluating model performance; which one is the most suitable depends on the task. For example, the cost of mispredicting "spam" as "ham" is probably much lower than the cost of mispredicting "ham" as "spam". One way to weight the two error types differently is sketched below.
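One possible way to express that asymmetry (a sketch, not part of the original notebook) is the F-beta score, which trades recall against precision for the spam class; here it is computed by hand from the confusion matrix cells:

cm = confusion_matrix(messages['label'], all_predictions)   # rows/columns ordered ham, spam
tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]     # treat spam as the positive class
precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
for beta in (0.5, 1.0, 2.0):    # beta < 1 favours precision, beta > 1 favours recall
    fbeta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    print 'beta=%.1f  F-beta=%.3f' % (beta, fbeta)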
Step 5: How to run experiments?
In the above "evaluation", we committed a cardinal sin. For the sake of a simple demonstration we evaluated accuracy on the same data we used for training. Never evaluate on your training data. This is wrong.
Such an evaluation tells us nothing about the true predictive power of the model: if we simply memorized every example seen during training, the training accuracy would be very close to 100%, yet we would still be unable to classify any new messages.
A proper way is to split the data into a training set and a test set; only the training data may be used for fitting and tuning the model, and the test data must not be used in any way during that process. This ensures the model does not "cheat", and the final evaluation on the test data gives a fair representation of its true predictive performance.
msg_train, msg_test, label_train, label_test = \
    train_test_split(messages['message'], messages['label'], test_size=0.2)
print len(msg_train), len(msg_test), len(msg_train) + len(msg_test)
4459 1115 5574
As requested, the test data make up 20% of the entire dataset (1,115 of the 5,574 records), and the rest is training data (4,459 out of 5,574).
Let us recap the entire pipeline up to this point, putting all the steps into a scikit-learn Pipeline:
def split_into_lemmas(message):
    message = unicode(message, 'utf8').lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words]

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
A common practice is to partition the training set again, into smaller subsets, for example 5 equally sized subsets. We then train the model on 4 of them and compute accuracy on the last one (called the "validation set"). Repeating this 5 times, taking a different subset for validation each time, gives us a sense of the model's "stability". If the scores differ wildly across subsets, something is probably wrong (bad data, or high model variance). Go back, analyze the errors, re-check the input data for validity, re-check the data cleaning.
In our case, everything goes smoothly, though:
scores = cross_val_score(pipeline,  # steps to convert raw messages into models
                         msg_train,  # training data
                         label_train,  # training labels
                         cv=10,  # split data randomly into 10 parts: 9 for training, 1 for scoring
                         scoring='accuracy',  # which scoring metric?
                         n_jobs=-1,  # -1 = use all cores = faster
                         )
print scores
[ 0.&&0.&&0.&&0.&&0.&&0.
&&0.&&0.&&0.&&0.]
The scores are indeed a little worse than when we trained on the full dataset above (accuracy 0.97 on 5,574 training examples), but they are fairly stable:
print scores.mean(), scores.std()
A natural question is how we can improve this model. The score is already quite high, but how would we go about improving a model in general?
Naive Bayes is an example of a high-bias, low-variance classifier (simple, stable, not prone to overfitting). Examples of the opposite, low-bias, high-variance classifiers (prone to overfitting), are k-nearest neighbours (kNN) and decision trees. Bagging (random forests) is one way of reducing variance, by training many (high-variance) models and averaging them.
In other words:
high bias = the classifier is opinionated; it has its own ideas, and the data can only change them so much. On the other hand, there is not much room to overfit either (left image).
low bias = the classifier is more obedient, but also more neurotic; it will do exactly what you ask it to, which, as everybody knows, can get you into trouble (right image).
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
%time plot_learning_curve(pipeline, "accuracy vs. training set size", msg_train, label_train, cv=5)
CPU times: user 382 ms, sys: 83.1 ms, total: 465 ms
Wall time: 28.5 s
<module 'matplotlib.pyplot' from '/Volumes/work/workspace/vew/sklearn_intro/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>
(We have effectively trained on 64% of all available data: we reserved 20% for the test set above, and the 5-fold cross-validation reserves another 20% for validation, leaving 0.8 * 0.8 * 5574 = 3,567 training examples.)
Since both the training and the cross-validation score keep improving as the amount of data grows, we see that with this little data the model is not complex/flexible enough to capture all the nuance. In this particular case the effect is not very pronounced, since the accuracies are high anyway.
At this point, we have two options:
use more training data, to overcome low model complexity;
use a more complex (lower-bias) model to start with, to get more out of the existing data.
Over the last years, as collecting massive training datasets became easier and machines got faster, approach 1 has become more and more popular (simpler algorithms, more data). Straightforward algorithms such as Naive Bayes also have the added benefit of being easier to interpret (compared with more complex black-box models like neural networks).
Knowing how to evaluate models properly, we can now explore how different parameters affect performance.
Step 6: How to tune parameters?
What we have seen so far is only the tip of the iceberg: there are many other parameters to tune. One example is which algorithm to use for training.
We have used Naive Bayes above, but scikit-learn supports many classifiers out of the box: support vector machines, nearest neighbours, decision trees, ensemble methods...
We can ask: what is the effect of IDF weighting on accuracy? Does the extra processing cost of lemmatization (compared to just using plain words) really help?
Let us find out:
params = {
    'tfidf__use_idf': (True, False),
    'bow__analyzer': (split_into_lemmas, split_into_tokens),
}

grid = GridSearchCV(
    pipeline,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    n_jobs=-1,  # number of cores to use -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(label_train, n_folds=5),  # what type of cross validation to use
)
%time nb_detector = grid.fit(msg_train, label_train)
print nb_detector.grid_scores_
CPU times: user 4.09 s, sys: 291 ms, total: 4.38 s
Wall time: 20.2 s
[mean: 0.94752, std: 0.00357, params: {'tfidf__use_idf': True, 'bow__analyzer': <function split_into_lemmas at 0x>}, mean: 0.92958, std: 0.00390, params: {'tfidf__use_idf': False, 'bow__analyzer': <function split_into_lemmas at 0x>}, mean: 0.94528, std: 0.00259, params: {'tfidf__use_idf': True, 'bow__analyzer': <function split_into_tokens at 0x>}, mean: 0.92868, std: 0.00240, params: {'tfidf__use_idf': False, 'bow__analyzer': <function split_into_tokens at 0x>}]
(The best parameter combination is printed first: in this case it is use_idf=True with the split_into_lemmas analyzer.)
A quick sanity check:
print nb_detector.predict_proba(["Hi mom, how are you?"])[0]
print nb_detector.predict_proba(["WINNER! Credit for free!"])[0]
predict_proba returns the predicted probability for each class (ham, spam). In the first case, the message is predicted to be ham with probability above 99% and spam with probability below 1%. So, if forced to choose, the model labels the message as "ham":
print nb_detector.predict(["Hi mom, how are you?"])[0]
print nb_detector.predict(["WINNER! Credit for free!"])[0]
The overall score on the test data, which was not used in any way during training:
predictions = nb_detector.predict(msg_test)
print confusion_matrix(label_test, predictions)
print classification_report(label_test, predictions)
[[973   0]
 [ 46  96]]
             precision    recall  f1-score   support

        ham       0.95      1.00      0.98       973
       spam       1.00      0.68      0.81       142

avg / total       0.96      0.96      0.96      1115
This is the realistic predictive performance we can expect from our ham-detection pipeline using lemmatization, TF-IDF and the Naive Bayes classifier.
Let us try another classifier: support vector machines (SVM). SVMs train quickly and need relatively little parameter tuning (though a bit more than Naive Bayes), which makes them a good starting point for working with text data.
pipeline_svm = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),
    ('tfidf', TfidfTransformer()),
    ('classifier', SVC()),  # <== change here
])

# pipeline parameters to automatically explore and tune
param_svm = [
    {'classifier__C': [1, 10, 100, 1000], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10, 100, 1000], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm = GridSearchCV(
    pipeline_svm,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(label_train, n_folds=5),  # what type of cross validation to use
)
%time svm_detector = grid_svm.fit(msg_train, label_train) # find the best combination from param_svm
print svm_detector.grid_scores_
CPU times: user 5.24 s, sys: 170 ms, total: 5.41 s
Wall time: 1min 8s
[mean: 0.98677, std: 0.00259, params: {'classifier__kernel': 'linear', 'classifier__C': 1}, mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 10}, mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 100}, mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 1000}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 1}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 10}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 10}, mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 100}, mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 100}, mean: 0.98722, std: 0.00280, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1000}, mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 1000}]
So, apparently, a linear kernel with C=1 is the best parameter combination.
A sanity check again:
print svm_detector.predict(["Hi mom, how are you?"])[0]
print svm_detector.predict(["WINNER! Credit for free!"])[0]
print confusion_matrix(label_test, svm_detector.predict(msg_test))
print classification_report(label_test, svm_detector.predict(msg_test))
[[965   8]
 [ 13 129]]
             precision    recall  f1-score   support

        ham       0.99      0.99      0.99       973
       spam       0.94      0.91      0.92       142

avg / total       0.98      0.98      0.98      1115
This is the realistic predictive performance we can expect from our spam-detection pipeline when using SVMs.
Step 7: Productionalizing a predictor
With the basic analysis and tuning done, the real (engineering) work begins.
The final step toward a production predictor is to train it once more on the entire dataset, to make full use of all available data. We use the best parameters found via the cross-validation above, of course. This is very similar to what we did at the beginning, but now we have a much better understanding of the model's behaviour and stability, having evaluated it on different training/test subsets.
The final predictor can be serialized to disk, so that the next time we want to use it we can skip the training and load the trained model directly:
# store the spam detector to disk after training
with open('sms_spam_detector.pkl', 'wb') as fout:
    cPickle.dump(svm_detector, fout)
# ...and load it back, whenever needed, possibly on a different machine
svm_detector_reloaded = cPickle.load(open('sms_spam_detector.pkl'))
The loaded object behaves identically to the original:
print 'before:', svm_detector.predict([message4])[0]
print 'after:', svm_detector_reloaded.predict([message4])[0]
before: ham
after: ham
Another important part of going to production is performance. After the rapid, iterative model tuning and parameter search, a well-performing model can be translated into a different language and optimized. Could we trade a few accuracy points for a smaller, faster model? Is it worth optimizing memory usage, or sharing memory across processes with mmap?
Note that optimization is not always necessary; start from the actual requirements.
Other things to consider: a production pipeline also needs robustness (service failover, redundancy, load balancing), monitoring (including automatic alerts on anomalies), and HR substitutability (avoiding "knowledge silos" around how things are done, obscure or locked-in technologies, and black-art tuning of results). These days the open-source world offers viable solutions in all of these areas, and thanks to OSI-approved open-source licences, all the tools shown today can be used commercially free of charge.
Other practical concepts
data sparsity
online learning, data streams
mmap for memory sharing, system "cold start" load times
scalability, distributed (cluster) processing
Unsupervised learning
Most data is not structured. Gaining insight into such data, which comes with no labels of its own (otherwise it would be supervised learning!).
How can we train anything without labels? What kind of sorcery is this?
"Words that occur in similar contexts tend to have similar meanings." Context = a sentence, a document, a sliding window...
Check out word2vec: a simple model, lots of data (Google News, 100 billion words, no labels). A minimal sketch follows.
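As an illustration of the idea (not part of the original talk), here is a minimal word2vec sketch using the gensim library, with the lemmatized SMS messages standing in as a corpus; the corpus is far too small to learn good vectors, and the calls follow the older gensim API:

from gensim.models import Word2Vec

sentences = messages.message.apply(split_into_lemmas).tolist()
w2v_model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=2)
print w2v_model.most_similar('free')   # words that occur in contexts similar to "free"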
What next?
A static (non-interactive) HTML rendering of this notebook is available online (you are probably looking at it already, but just in case).
The interactive notebook source is on GitHub (see the installation instructions above).