Stanford Machine Learning - Week 3 (Classification, Logistic Regression, Overfitting and Its Solutions)



Logistic Regression



1. Classification



The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, $y \in \{0, 1\}$. 0 is also called the negative class, and 1 the positive class.


In short, classification means using a set of feature values to divide a data set into different categories, so the final output $y$ is a discrete value, as in classifying spam email.




2. Hypothesis Function
The hypothesis function in logistic regression has the same essence and meaning as the one in linear regression; only its form changes.


We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples, such as the ones below, where this method performs very poorly.


Example 1:

In the picture above, we could express the hypothesis with the following rule:
When $h_\theta(x) \geq 0.5$, predict $y = 1$;
when $h_\theta(x) < 0.5$, predict $y = 0$. (Why 0.5? Week 6 of the course covers this; briefly, you can raise the threshold, say to 0.9, to improve precision, but that does not mean the model is optimal at that point.)
The problem with this representation is that as soon as one more data point is added (as in the figure below), the rule no longer works.

Example 2:
In logistic regression, $0 \leq h_\theta(x) \leq 1$ (because $h_\theta(x)$ represents the probability that $y = 1$), whereas in linear regression $h_\theta(x)$ may be greater than 1 or less than 0, and its value does not represent the probability of anything.
Logistic Regression Model:
$h_\theta(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1+e^{-z}}$;
$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$.


$h_\theta(x)$ will give us the probability that our output is 1. For example, $h_\theta(x) = 0.7$ gives us a probability of 70% that our output is 1.


$h_\theta(x) = P(y=1 \mid x; \theta)$ (the probability that $y = 1$)
$P(y=0 \mid x; \theta) + P(y=1 \mid x; \theta) = 1$

Here $g$ is called the sigmoid function or the logistic function.
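As a quick illustration, here is a minimal Octave sketch of the sigmoid and the hypothesis; the function name `sigmoid` is just a convention of this sketch, not something defined in the course code shown later.

```matlab
function g = sigmoid(z)
  % Sigmoid / logistic function; works element-wise on scalars, vectors and matrices
  g = 1 ./ (1 + exp(-z));
end

% For a single example x (column vector, with the bias term x(1) = 1):
% h = sigmoid(theta' * x);   % h is the estimated probability that y = 1
```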




3. Decision Boundary


In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:


To obtain a discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:


$h_\theta(x) \geq 0.5 \to y = 1$
$h_\theta(x) < 0.5 \to y = 0$


That is, when $h_\theta(x)$ is at least 0.5 we can take $y$ to be 1, because the probability exceeds one half.




Also, from the graph of $h_\theta(x) = g(\theta^T x) = g(z)$ (where $z = \theta^T x$), we can draw the following conclusions:
When $z \geq 0$, $g(z) \geq 0.5$; that is, $h_\theta(x) \geq 0.5$ and $y = 1$.
When $z < 0$, $g(z) < 0.5$; that is, $h_\theta(x) < 0.5$ and $y = 0$.
It follows immediately that:
when $\theta^T x \geq 0$, $y = 1$; when $\theta^T x < 0$, $y = 0$. In other words, $\theta^T x$ splits the data set into two parts, so we call the straight (or curved) line $\theta^T x = 0$ the decision boundary. Note that the decision boundary is purely a property of the hypothesis function and nothing else.


The Decision Boundary is a property of the hypothesis, including the parameters $\theta_0, \theta_1, \theta_2, \cdots$; it is the line that separates the area where y = 0 from the area where y = 1. It is created by our hypothesis function, and the data set is only used to fit the parameters theta.


Let's look at an example:

Given $\theta_0 = -3, \theta_1 = 1, \theta_2 = 1$, we have $h_\theta(x) = g(-3 + x_1 + x_2)$. From the derivation above:
$\theta^T x = -3 + x_1 + x_2 \geq 0 \to y = 1$
$\theta^T x = -3 + x_1 + x_2 < 0 \to y = 0$
So the data set is split by the **decision boundary** $\theta^T x = -3 + x_1 + x_2 = 0$ into two parts: the upper-right region where $y = 1$ and the lower-left region where $y = 0$.
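A small Octave sketch of this example; the two sample points in `X` are made up purely for illustration:

```matlab
theta = [-3; 1; 1];                 % theta_0, theta_1, theta_2 from the example
X = [1 1 1;                         % each row is [1, x1, x2] (bias term first)
     1 3 3];
h = 1 ./ (1 + exp(-(X * theta)));   % h_theta(x) for every example
pred = (X * theta >= 0);            % same as h >= 0.5: 1 above the boundary, 0 below
```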






4. Cost Function



We cannot use the same cost function that we use for linear regression, because the logistic function would cause the output to be wavy, as in the figure on the left above, causing many local optima. In other words, it would not be a convex function.
Instead, our cost function for logistic regression looks like:


$J(\theta) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$;
$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & \text{if } y = 1 \\ -\log(1 - h_\theta(x)), & \text{if } y = 0 \end{cases}$
Note: $h_\theta(x)$, with $0 \leq h_\theta(x) \leq 1$, is the probability that $y = 1$.


Since $y$ only takes the values 0 and 1, the expression can also be written as:
$J(\theta) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\displaystyle\sum_{i=1}^{m}y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$


A vectorized implementation is:
$h = g(X\theta)$
$J(\theta) = \frac{1}{m}\left(-y^T\log(h) - (1-y)^T\log(1-h)\right)$
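A direct Octave transcription of this vectorized form, as a sketch; `X`, `y` and `theta` are assumed to follow the course conventions (design matrix with a leading column of ones, label vector, parameter vector):

```matlab
function J = logisticCost(theta, X, y)
  % Vectorized logistic regression cost J(theta)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                     % h = g(X*theta)
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end
```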




If our correct answer ‘y’ is 0:
then the cost function will be 0 if our hypothesis function also outputs 0;
then the cost function will approach infinity if our hypothesis approaches 1.
If our correct answer ‘y’ is 1:
then the cost function will be 0 if our hypothesis function outputs 1;
then the cost function will approach infinity if our hypothesis approaches 0.
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.






5. Gradient Descent
With the cost function in hand, the next step is to minimize $J(\theta)$ using gradient descent. Whether in the linear regression model or the logistic regression model, the basic form of gradient descent is the same; only the form of $J(\theta)$ changes.


Gradient Descent
Remember that the general form of gradient descent is:


Repeat {
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
}




In logistic regression:
$J(\theta) = -\frac{1}{m}\left[\displaystyle\sum_{i=1}^{m}y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]$
So, after taking the derivative, the update rule is as follows:


We can work out the derivative part using calculus to get:


Repeat {
$\theta_j := \theta_j - \frac{\alpha}{m}\displaystyle\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$
}


Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta.


Here $h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$, whereas in linear regression $h_\theta(x) = \theta^T x$.


A vectorized implementation is:


$\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - \vec{y}\right)$
For the derivation, see the earlier post 《关于梯度下降算法的矢量化过程》 (on vectorizing gradient descent).
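In Octave, one step of this vectorized update might look like the following sketch (assuming `X` already contains the column of ones and `alpha` is the learning rate):

```matlab
% One vectorized gradient descent step for logistic regression
m = length(y);
h = 1 ./ (1 + exp(-(X * theta)));             % hypothesis g(X*theta)
theta = theta - (alpha / m) * (X' * (h - y)); % simultaneous update of all theta_j
```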




6. Advanced Optimization



"Conjugate gradient", “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave provides them.


One way to describe gradient descent is that it does two things: first, it computes $J(\theta)$; second, it computes $\frac{\partial}{\partial\theta_j}J(\theta)$. Besides gradient descent, three other algorithms can do these two things (and thus optimize the parameters $\theta$); they are faster than gradient descent and do not require manually choosing $\alpha$, but they are more complex. Because of that complexity, we should not write these algorithms ourselves but use open-source libraries instead. All we have to do is write the cost function and tell MATLAB/Octave which algorithm to use to optimize the parameters.

As shown in the figure, we now use the MATLAB/Octave function fminunc to compute $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$, and finally obtain the optimized value of the parameters $\theta$.



You set a few options: `options` is a data structure that stores the options you want. Setting the gradient objective parameter to 'on' means you are indeed going to provide a gradient to this algorithm. We set the maximum number of iterations to, say, one hundred, and give it an initial guess for theta, a 2-by-1 vector.


optTheta      % holds the final optimized parameter values
functionVal   % holds the value of the cost function at those parameters
exitFlag      % indicates whether the optimization converged (1 means converged)
@costFunction % a function handle to the function costFunction

function [ jVal, gradient ] = costFunction( theta )
  % This function has two return values:
  %   jVal     - the value of the cost function
  %   gradient - the partial derivatives with respect to each of the two parameters
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end

>> options = optimset('GradObj', 'on', 'MaxIter', 100);
>> initialTheta = zeros(2,1);
>> [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)

So, whether in logistic regression or linear regression, all we need to provide is the part inside the red rectangle in the figure below.





7. Multi-class Classification: One-vs-all

In short, multi-class means the output $y$ is no longer limited to just 0 and 1. The idea for solving this is to split the training set into two parts each time, i.e. one-vs-all.


One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.



How we handle it:



We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.


To solve this problem, we train three classifiers $h_\theta^{(1)}(x), h_\theta^{(2)}(x), h_\theta^{(3)}(x)$ following figures 1, 2 and 3, which output the probabilities that $y = \text{class } 1$, $y = \text{class } 2$ and $y = \text{class } 3$, respectively. When a new $x$ comes in, we evaluate $h_\theta^{(i)}(x)$ in each of the three classifiers and pick the class with the largest value.
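A minimal Octave sketch of the prediction step; `all_theta`, a matrix whose i-th row holds the parameters of classifier i, is my own assumed layout rather than something defined in this post:

```matlab
% One-vs-all prediction: pick the class whose classifier gives the highest probability.
% X is m-by-(n+1) with the bias column already added; all_theta is K-by-(n+1).
probs = 1 ./ (1 + exp(-(X * all_theta')));   % m-by-K matrix; column i holds h_theta^(i)(x)
[~, pred] = max(probs, [], 2);               % index of the most probable class for each example
```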




8. Overfitting
Since there is overfitting, there must also be a corresponding underfitting. Simply put, overfitting means the hypothesis function is too complex: although it fits the training set perfectly, it cannot predict new data well.
