# Classification

To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn’t work well because classification is not actually a linear function.

The classification problem is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then x(i) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. Hence, y∈{0,1}. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+.” Given x(i), the corresponding y(i) is also called the label for the training example.

# Decision Boundary

In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

hθ(x) ≥ 0.5 → y = 1
hθ(x) < 0.5 → y = 0

The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:

g(z) ≥ 0.5 when z ≥ 0

Remember that g(z) = 1 / (1 + e^−z), so when z = 0, g(z) = 1/2; as z → ∞, g(z) → 1; and as z → −∞, g(z) → 0.

From these statements we can now say:

θᵀx ≥ 0 ⇒ y = 1
θᵀx < 0 ⇒ y = 0

The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
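As a minimal sketch (the data and θ values below are made up for illustration, not taken from the course), the rule "predict 1 exactly when θᵀx ≥ 0" looks like:

```python
import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # h(x) = g(theta' x); predict 1 when h(x) >= 0.5,
    # which happens exactly when theta' x >= 0
    return 1 if sigmoid(theta @ x) >= 0.5 else 0

# theta = [-3, 1, 1] gives the decision boundary x1 + x2 = 3
theta = np.array([-3.0, 1.0, 1.0])
print(predict(theta, np.array([1.0, 2.0, 2.0])))  # x1 + x2 = 4 > 3, so 1
print(predict(theta, np.array([1.0, 0.5, 0.5])))  # x1 + x2 = 1 < 3, so 0
```

Points on one side of the line x1 + x2 = 3 are labelled 1 and points on the other side 0, which is exactly the decision boundary described above.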

# Multiple Features

Linear regression with multiple variables is also known as “multivariate linear regression”.

We now introduce notation for equations where we can have any number of input variables.

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + ⋯ + θnxn

In order to develop intuition about this function, we can think about θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc. x1 will be the number of square meters in the house, x2 the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

hθ(x) = θᵀx

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.

Remark: Note that, for convenience, in this course we assume x0(i) = 1 for i = 1, …, m. This allows us to do matrix operations with θ and x, making the two vectors θ and x(i) match each other element-wise (that is, have the same number of elements: n + 1).
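With x0 = 1 in place, the hypothesis collapses to a single dot product; a minimal numpy sketch (the θ values and features are invented for illustration):

```python
import numpy as np

theta = np.array([50.0, 0.25, 10.0])  # theta0 (base price), theta1, theta2
x = np.array([1.0, 2000.0, 3.0])      # x0 = 1, square meters, floors

# h(x) = theta0*x0 + theta1*x1 + theta2*x2, i.e. theta transpose times x
h = theta @ x
print(h)  # 50 + 0.25*2000 + 10*3 = 580.0
```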

# Gradient Descent For Multiple Variables

The gradient descent equation itself is generally the same form; we just have to repeat it for our n features:

repeat until convergence: {

θj := θj − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · xj(i)      for j := 0, 1, …, n

}

The following image compares gradient descent with one variable to gradient descent with multiple variables:
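The update loop can be sketched as follows (toy data and illustrative names; this is not the course's exercise code):

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    # X is m x (n+1) with a leading column of ones (x0 = 1)
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # all theta_j are updated simultaneously from the same gradient
        grad = (X.T @ (X @ theta - y)) / m
        theta -= alpha * grad
    return theta

# toy data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 3.0, 4.0, 6.0])
theta = gradient_descent(X, y, alpha=0.3, iters=5000)
print(np.round(theta, 3))  # close to [1, 2, 3]
```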

# Gradient Descent in Practice I – Feature Scaling

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

−1 ≤ x(i) ≤ 1

or

−0.5 ≤ x(i) ≤ 0.5

These aren’t exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

xi := (xi − μi) / si

Where μi is the average of all the values for feature (i) and si is the range of values (max – min), or si is the standard deviation.

Note that dividing by the range and dividing by the standard deviation give different results. The quizzes in this course use the range; the programming exercises use the standard deviation.

For example, if xi represents housing prices with a range of 100 to 2000 and a mean value of 1000, then:

xi := (price − 1000) / 1900
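A minimal numpy sketch of both techniques together (the prices below are made up):

```python
import numpy as np

x = np.array([100.0, 500.0, 1000.0, 1500.0, 2000.0])  # housing prices

mu = x.mean()          # mean normalization: subtract the average
s = x.max() - x.min()  # feature scaling: divide by the range (max - min)
x_scaled = (x - mu) / s

print(x_scaled)  # mean is now 0 and the range is exactly 1
```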

# Gradient Descent in Practice II – Learning Rate

Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.

Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10^−3. However, in practice it is difficult to choose this threshold value.

It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:

If α is too small: slow convergence.

If α is too large: J(θ) may not decrease on every iteration and thus may not converge.
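Both checks can be sketched in a few lines (made-up data; `cost` and `descend` are illustrative names, not course code):

```python
import numpy as np

def cost(theta, X, y):
    # squared-error cost J(theta)
    m = len(y)
    return ((X @ theta - y) ** 2).sum() / (2 * m)

def descend(X, y, alpha, iters=500):
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(theta, X, y))
    return history

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

history = descend(X, y, alpha=0.1)
# with a sufficiently small alpha, J(theta) decreases on every iteration
assert all(b <= a for a, b in zip(history, history[1:]))

# automatic convergence test: declare convergence once J drops by less than 1e-3
converged = next(i for i, (a, b) in enumerate(zip(history, history[1:])) if a - b < 1e-3)
print("converged after", converged, "iterations")
```

Plotting `history` with matplotlib would give exactly the debugging plot described above.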

# Features and Polynomial Regression

We can improve our features and the form of our hypothesis function in a couple different ways.

We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1x2.

### Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
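For instance, extra columns can be added to the design matrix and gradient descent run on them unchanged (a sketch with made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # a single feature, e.g. house size

# quadratic, cubic-style and square-root variants of the same feature; each
# new column is then treated by gradient descent as just another feature
X = np.column_stack([np.ones_like(x), x, x ** 2, np.sqrt(x)])
print(X.shape)  # (4, 4)
```

With features like x² and √x, the feature scaling from the previous section becomes especially important, since the ranges diverge quickly.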

## Machine Learning – Linear Algebra Basics

1 Basic Concepts and Notation

Ax = b, where:

• The notation A ∈ R^(m×n) denotes a matrix with m rows and n columns, all of whose entries are real numbers.
• The notation x ∈ R^n denotes a vector with n elements. By convention, an n-dimensional vector is treated as an n × 1 matrix, i.e. a column vector. To denote a row vector (a 1 × n matrix), we write xᵀ (the transpose of x, whose definition is explained later).
• The i-th element of a vector x is written xi.

• We use aij (or Aij, Ai,j, etc.) to denote the entry in the i-th row and j-th column:

2 Matrix Multiplication
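As a quick illustration of these definitions (the entries are invented):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # A is a 3x2 matrix, i.e. A in R^(3x2)
x = np.array([1.0, 1.0])    # x is a column vector in R^2

b = A @ x  # a (3x2) matrix times a 2-vector yields a vector in R^3
print(b)   # the entries 3, 7 and 11
```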

## Machine Learning – Gradient Descent

So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That’s where gradient descent comes in.

Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

For example, the distance between each ‘star’ in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ0,θ1). Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.

repeat until convergence: {

θj := θj − α ∂/∂θj J(θ0, θ1)      (for j = 0 and j = 1)

}

where

j = 0, 1 represents the feature index number.

At each iteration, one should simultaneously update the parameters θ0, θ1, …, θn. Updating a specific parameter prior to calculating another one within the same iteration would yield a wrong implementation.
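A tiny sketch of why the order matters, using the made-up cost J(θ0, θ1) = (θ0 + θ1 − 1)², whose two partial derivatives happen to be identical:

```python
def d(theta0, theta1):
    # partial derivative of J(t0, t1) = (t0 + t1 - 1)^2
    # with respect to either parameter
    return 2 * (theta0 + theta1 - 1)

alpha = 0.1
t0, t1 = 0.0, 0.0

# correct: both derivatives are evaluated at the old (t0, t1)
new0 = t0 - alpha * d(t0, t1)
new1 = t1 - alpha * d(t0, t1)

# wrong: t1's derivative already sees the updated t0
bad0 = t0 - alpha * d(t0, t1)
bad1 = t1 - alpha * d(bad0, t1)

print(new0, new1)  # both 0.2
print(bad0, bad1)  # 0.2 and roughly 0.16: a different trajectory
```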

In this video we explored the scenario where we used one parameter θ1 and plotted its cost function to implement a gradient descent. Our formula for a single parameter was :

Repeat until convergence: {

θ1 := θ1 − α d/dθ1 J(θ1)

}

Regardless of the sign of the slope, θ1 eventually converges to its minimum value. The following graph shows that when the slope is negative, the value of θ1 increases, and when it is positive, the value of θ1 decreases.

On a side note, we should adjust our parameter α to ensure that the gradient descent algorithm converges in a reasonable time. Failure to converge or too much time to obtain the minimum value imply that our step size is wrong.

# Gradient Descent For Linear Regression

When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:

repeat until convergence: {

θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i))
θ1 := θ1 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x(i)

}

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.

So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.
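Since J is convex with a single global minimum, batch gradient descent should land on the same θ as the closed-form least-squares solution; a quick check on made-up data:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])  # exactly y = 0 + 2*x

theta = np.zeros(2)
for _ in range(20000):
    # batch update: every training example enters every step
    theta -= 0.05 * (X.T @ (X @ theta - y)) / len(y)

closed_form, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta, 4), np.round(closed_form, 4))  # both near [0, 2]
```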

# Cost Function – Intuition I

If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by hθ(x)) which passes through these scattered data points.

Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distances of the scattered points from the line will be the least. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ0,θ1) will be 0. The following example shows the ideal situation where we have a cost function of 0.

When θ1=1, we get a slope of 1 which goes through every single data point in our model. Conversely, when θ1=0.5, we see the vertical distance from our fit to the data points increase.

This increases our cost function to 0.58. Plotting several other points yields the following graph:
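The figures above can be reproduced numerically; assuming the lesson’s data set is (1,1), (2,2), (3,3), which matches the quoted costs of 0 and 0.58:

```python
def J(theta1, data):
    # squared-error cost for h(x) = theta1 * x (theta0 fixed at 0)
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

data = [(1, 1), (2, 2), (3, 3)]
print(J(1.0, data))  # 0.0 -- the line passes through every point
print(J(0.5, data))  # 0.5833..., the "0.58" mentioned above
```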

# Cost Function – Intuition II

A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. An example of such a graph is the one to the right below.

Taking any color and going along the ‘circle’, one would expect to get the same value of the cost function. For example, the three green points found on the green line above have the same value for J(θ0,θ1) and as a result, they are found along the same line. The circled x displays the value of the cost function for the graph on the left when θ0 = 800 and θ1= -0.15. Taking another h(x) and plotting its contour plot, one gets the following graphs:

When θ0 = 360 and θ1 = 0, the value of J(θ0,θ1) in the contour plot gets closer to the center thus reducing the cost function error. Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

The graph above minimizes the cost function as much as possible, and consequently the resulting values of θ1 and θ0 tend to be around 0.12 and 250 respectively. Plotting those values on the graph to the right seems to put our point in the center of the innermost ‘circle’.

# Machine Learning Honor Code

We strongly encourage students to form study groups, and discuss the lecture videos (including in-video questions). We also encourage you to get together with friends to watch the videos together as a group. However, the answers that you submit for the review questions should be your own work. For the programming exercises, you are welcome to discuss them with other students, discuss specific algorithms, properties of algorithms, etc.; we ask only that you not look at any source code written by a different student, nor show your solution code to other students.

# Guidelines for Posting Code in Discussion Forums

### Scenario 1: Code to delete

Why Delete?: The reason is that if there is a simple fix provided by a student, a quick copy and paste with a small edit will provide credit without individual effort.

Learner Question: A student substitutes words for the math operators, but includes the variable names (or substitutes the equivalent greek letters (θ for ‘theta’, etc). This student also provides a sentence-by-sentence, line by line, description of exactly what their code implements. “The first line of my script has the equation “hypothesis equals theta times X”, but I get the following error message…”.

Why Delete?: This should be deleted. “Spelling out” the code in English is the same as using the regular code.

### Scenario 2: Code not to delete

Learner Question: How do I subset a matrix to eliminate the intercept?

Mentor Response: This probably would be okay, especially if the person posting makes an effort to not use familiar variable names, or to use a context which has nothing to do with the contexts in the assignments.

It is clearly OK to show examples of Octave code to demonstrate a technique, even if the technique itself is directly applicable to a programming problem at hand, as long as what is typed cannot be “cut and pasted” into the program at hand.

E.g. how do I set column 1 of a matrix to zero? Try this in your Octave work area:

>> A = magic(3)

>> A(:,1) = 0

The above is always acceptable (in my understanding). Demonstrating techniques and learning the language/syntax are important Forum activities.

## What is Machine Learning?

Two definitions of Machine Learning are offered. Arthur Samuel described it as: “the field of study that gives computers the ability to learn without being explicitly programmed.” This is an older, informal definition.

Tom Mitchell provides a more modern definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

In general, any machine learning problem can be assigned to one of two broad classifications:

Supervised learning and Unsupervised learning.

## Supervised Learning

In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Example 1:

Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.

We could turn this example into a classification problem by instead making our output about whether the house “sells for more or less than the asking price.” Here we are classifying the houses based on price into two discrete categories.

Example 2:

(a) Regression – Given a picture of a person, we have to predict their age on the basis of the given picture

(b) Classification – Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.

## Unsupervised Learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The “Cocktail Party Algorithm”, allows you to find structure in a chaotic environment. (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

# Model Representation

To establish notation for future use, we’ll use x(i) to denote the “input” variables (living area in this example), also called input features, and y(i) to denote the “output” or target variable that we are trying to predict (price). A pair (x(i),y(i)) is called a training example, and the dataset that we’ll be using to learn—a list of m training examples (x(i),y(i));i=1,...,m—is called a training set. Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y to denote the space of output values. In this example, X = Y = ℝ.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

## Machine Learning – Octave

Octave, a handy tool for machine learning:

>> 5+6
ans =  11
>> 3-2
ans =  1
>> 5*8
ans =  40
>> 1/2
ans =  0.50000
>> 2^6
ans =  64

>> 1 == 2  %false
ans = 0
>> 1 ~= 2  %true
ans =  1
>> 8>1 && 0  %AND
ans = 0
>> 9>1 || 1  %OR
ans =  1
>> xor(1,0)
ans =  1

>> PS1('>>>');
>>>

>>>a = 3
a =  3
>>>a = 3;  # the semicolon suppresses printing
>>>

>>>a = 3.14;
>>>a
a =  3.1400
>>>disp(a);
3.1400
>>>disp(sprintf('2 decimals: %0.2f', a));
2 decimals: 3.14

>>>a=pi
a =  3.1416
>>>format long
>>>a
a =  3.14159265358979
>>>format short
>>>a
a =  3.1416

>>>A = [1 2; 3 4; 5 6]
A =

   1   2
   3   4
   5   6

>>>a = [1 2;
3 4;
5 6]
a =

   1   2
   3   4
   5   6

>>>v = [1 2 3]
v =

   1   2   3

>>>v = [1; 2; 3]
v =

   1
   2
   3

>>>v = 1:0.1:2
v =

Columns 1 through 4:

1.0000    1.1000    1.2000    1.3000

Columns 5 through 8:

1.4000    1.5000    1.6000    1.7000

Columns 9 through 11:

1.8000    1.9000    2.0000

>>>v = 1:6
v =

1   2   3   4   5   6

>>>ones(2,3)
ans =

   1   1   1
   1   1   1

>>>w = ones(1,3)
w =

   1   1   1

>>>w = rand(3,3)
w =

0.91025   0.82671   0.14067
0.90400   0.34350   0.51289
0.25501   0.24975   0.80750

% Gaussian distribution (normal distribution)
>>>w = randn(1,3)
w =

-0.052546  -1.786869   0.754202

w = -6 + sqrt(10)*(randn(1,10000));
hist(w)
hist(w, 50)

>> eye(4)
ans =

Diagonal Matrix

   1   0   0   0
   0   1   0   0
   0   0   1   0
   0   0   0   1

>> help

For help with individual commands and functions type

help NAME

(replace NAME with the name of the command or function you would like to learn more about).

For a more detailed introduction to GNU Octave, please consult the
manual.  To read the manual from the prompt type

doc

GNU Octave is supported and developed by its user community.

>> A = [1 2; 3 4; 5 6]
A =

   1   2
   3   4
   5   6

>> size(A)
ans =

   3   2

>> sz = size(A)
sz =

   3   2

>> size(sz)
ans =

   1   2

>> size(A,1)
ans =  3
>> size(A,2)
ans =  2

>> V = [1 2 3 4]
V =

   1   2   3   4

>> length(V)
ans =  4
>> length(A)
ans =  3

>> pwd
ans = C:\Users\xin
>> cd 'E:\TEMPsrc\octave'
>> pwd
ans = E:\TEMPsrc\octave
>> ls
>> who
Variables in the current scope:

a    ans  b    c
>> whos
Variables in the current scope:

Attr Name        Size                     Bytes  Class
==== ====        ====                     =====  =====
     a           1x1                          8  double
     ans         1x17                        17  char
     b           1x1                          8  double
     c           1x1                          8  double
     d           3x2                         48  double

Total is 26 elements using 89 bytes
>> who
Variables in the current scope:

a    ans  b    c    d

>> clear a
>> who
Variables in the current scope:

ans  b    c    d
>> save hello.mat d
>> clear
>> who
>> v = [1 2; 3 4; 5 6; 7 8; 9 0]
v =

   1   2
   3   4
   5   6
   7   8
   9   0

>> save hello.txt v -ascii  %save as text(ASCII)
>> A = [1 2; 3 4; 5 6]
A =

   1   2
   3   4
   5   6

>> A(3,2)
ans =  6
>> A(2,:)
ans =

   3   4

>> A(:,2)
ans =

   2
   4
   6

>> A([1 3], :)
ans =

   1   2
   5   6

>> A(:,2) = [10;11;12]
A =

    1   10
    3   11
    5   12

>> A = [A, [100;101;102]]
A =

     1    10   100
     3    11   101
     5    12   102

>> A(:)
ans =

     1
     3
     5
    10
    11
    12
   100
   101
   102

>> A = [1 2; 3 4; 5 6];
>> B = [11 12; 13 14; 15 16];
>> C = [A B]
C =

    1    2   11   12
    3    4   13   14
    5    6   15   16

>> C = [A; B]
C =

    1    2
    3    4
    5    6
   11   12
   13   14
   15   16

>> A = [1 2; 3 4; 5 6];
>> B = [11 12; 13 14; 15 16];
>> C = [1 1; 2 2];
>> A*C
ans =

    5    5
   11   11
   17   17

>> A .* B
ans =

   11   24
   39   56
   75   96

>> A .^ 2
ans =

    1    4
    9   16
   25   36

>> V = [1; 2; 3];
>> 1 ./ V
ans =

1.00000
0.50000
0.33333

>> 1 ./ A
ans =

1.00000   0.50000
0.33333   0.25000
0.20000   0.16667

>> log(V)
ans =

0.00000
0.69315
1.09861

>> exp(V)
ans =

2.7183
7.3891
20.0855

>> abs(V)
ans =

   1
   2
   3

>> v = [1;2;3]
v =

   1
   2
   3

>> v + ones(length(v), 1)
ans =

   2
   3
   4

>> v + ones(3,1)
ans =

   2
   3
   4

>> v + 1
ans =

   2
   3
   4

>> A
A =

   1   2
   3   4
   5   6

>> A'
ans =

   1   3   5
   2   4   6

>> a = [1 15 2 0.5]
a =

1.00000   15.00000    2.00000    0.50000

>> val = max(a)
val =  15
>> [val, ind] = max(a)
val =  15
ind =  2

>> a
a =

    1.00000   15.00000    2.00000    0.50000

>> a < 3
ans =

   1   0   1   1

>> find(a < 3)
ans =

   1   3   4

>> A = magic(3)
A =

   8   1   6
   3   5   7
   4   9   2

>> [r, c] = find(A >= 7)
r =

   1
   3
   2

c =

   1
   2
   3

>> a
a =

1.00000   15.00000    2.00000    0.50000

>> sum(a)
ans =  18.500
>> prod(a)
ans =  15
>> floor(a)
ans =

    1   15    2    0

>> ceil(a)
ans =

    1   15    2    1

>> max(rand(3), rand(3))
ans =

0.957477   0.083887   0.459507
0.799441   0.975439   0.927632
0.888604   0.942436   0.612661

>> A
A =

   8   1   6
   3   5   7
   4   9   2

>> max(A, [], 1)
ans =

   8   9   7
>> max(max(A))
ans =  9
>> max(A(:))
ans =  9

>> A = magic(5)
A =

   17   24    1    8   15
   23    5    7   14   16
    4    6   13   20   22
   10   12   19   21    3
   11   18   25    2    9

>> sum(A,1)
ans =

   65   65   65   65   65

>> sum(A,2)
ans =

   65
   65
   65
   65
   65
>> sum(sum(A.*eye(5)))
ans =  65

>> eye(9)
ans =

Diagonal Matrix

   1   0   0   0   0   0   0   0   0
   0   1   0   0   0   0   0   0   0
   0   0   1   0   0   0   0   0   0
   0   0   0   1   0   0   0   0   0
   0   0   0   0   1   0   0   0   0
   0   0   0   0   0   1   0   0   0
   0   0   0   0   0   0   1   0   0
   0   0   0   0   0   0   0   1   0
   0   0   0   0   0   0   0   0   1

>> flipud(eye(9))
ans =

Permutation Matrix

   0   0   0   0   0   0   0   0   1
   0   0   0   0   0   0   0   1   0
   0   0   0   0   0   0   1   0   0
   0   0   0   0   0   1   0   0   0
   0   0   0   0   1   0   0   0   0
   0   0   0   1   0   0   0   0   0
   0   0   1   0   0   0   0   0   0
   0   1   0   0   0   0   0   0   0
   1   0   0   0   0   0   0   0   0

>> A = magic(3)
A =

   8   1   6
   3   5   7
   4   9   2

>> pinv(A)
ans =

0.147222  -0.144444   0.063889
-0.061111   0.022222   0.105556
-0.019444   0.188889  -0.102778

>> temp = pinv(A)
temp =

0.147222  -0.144444   0.063889
-0.061111   0.022222   0.105556
-0.019444   0.188889  -0.102778

>> temp * A
ans =

1.00000   0.00000  -0.00000
-0.00000   1.00000   0.00000
0.00000   0.00000   1.00000

>> t=[0:0.01:0.98];
>> y1 = sin(2*pi*4*t);
>> plot(t,y1);
>> t=[0:0.01:0.98];
>> y2 = cos(2*pi*4*t);
>> plot(t,y2);
>> plot(t, y1);
>> hold on;
>> plot(t, y2, 'r');
>> xlabel('time')
>> ylabel('value')
>> legend('sin', 'cos')
>> title('my plot')
>> print -dpng 'myplot.png'
>> close
>> figure(1); plot(t, y1);
>> figure(2); plot(t, y2);
>> subplot(1,2,1);
>> plot(t, y1);
>> subplot(1,2,2);
>> plot(t, y2);
>> axis([0.5 1 -1 1])
>> clf;
>> A = magic(5);
>> imagesc(A)
>> imagesc(A), colorbar, colormap gray;


## Python – Simple Recognition

# -*- TomYuan
import numpy as np
import os
import matplotlib.pyplot as plt
'''
0000000000000000000000000
0000000000000000000000000
0000000001111111000000000
0000000010000000100000000
0000000100000000010000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000001000000000001000000
0000000100000000010000000
0000000010000000100000000
0000000001111111000000000
0000000000000000000000000
0000000000000000000000000
'''
# define array (25 * 22) {m * n}
m = 25
n = 22

def readMatrix(fileName):
    matrix = [0] * n
    fr = open(fileName)
    l = 0
    # put all numbers into the array, one row per line of the file
    for line in fr.readlines():
        matrix[l] = [0] * (len(line) - 1)
        for i in range(1, len(line)):
            # print(line[i - 1])
            matrix[l][i - 1] = line[i - 1]
        l = l + 1
    return matrix

def calMatrix(matrix):
    # walk the interior cells and score each '1' together with its 8 neighbours
    ret = 0
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            if matrix[i][j] == '1':
                ret = ret + 1
            # upper and left neighbours
            if matrix[i - 1][j] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
            if matrix[i - 1][j - 1] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
            if matrix[i][j - 1] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
            # lower and right neighbours
            if matrix[i + 1][j] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
            if matrix[i + 1][j + 1] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
            if matrix[i][j + 1] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
            # diagonal neighbours
            if matrix[i - 1][j + 1] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
            if matrix[i + 1][j - 1] == '1':
                ret = ret + 2
            else:
                ret = ret - 2
    return ret


## Python Least Squares

# -*- TomYuan
import numpy as np
import os
import matplotlib.pyplot as plt

xcord = []
ycord = []

def drawScatterDiagram(fileName):
    fr = open(fileName)
    for line in fr.readlines():
        line_arr = line.split(",")
        xcord.append(float(line_arr[0]))
        ycord.append(float(line_arr[1]))

drawScatterDiagram("ml.txt")

def leastSquare(xcord, ycord):
    # accumulate sum(x^2), sum(x), sum(x*y) and sum(y)
    t1 = 0; t2 = 0; t3 = 0; t4 = 0
    for num in range(0, len(xcord)):
        t1 += xcord[num] * xcord[num]
        t2 += xcord[num]
        t3 += xcord[num] * ycord[num]
        t4 += ycord[num]
    # closed-form slope a and intercept b for y = a*x + b
    a = (t3 * len(xcord) - t2 * t4) / (t1 * len(xcord) - t2 * t2)
    b = (t1 * t4 - t2 * t3) / (t1 * len(xcord) - t2 * t2)
    return a, b

def lastDraw():
    plt.scatter(xcord, ycord, s=30, c='red')
    print(ycord[len(ycord) - 1])
    a, b = leastSquare(xcord, ycord)

    # draw the fitted line between x1 and x2
    x1 = float(xcord[0])
    x2 = 1.0

    y1 = a * x1 + b
    y2 = a * x2 + b

    plt.plot([x1, x2], [y1, y2])
    plt.show()

lastDraw()


## Neural Networks (2)

### Analysing a Neural Network in Unity3D

• Neuron
• Input value: InValue as float;
Desired value: DesiredValue as float;
Error: Error as float;
Value: Value

NeuronConnection {
Neuron neuron;   // the neuron
float weight;    // connection weight
};


## Convolutional Neural Networks

### Convolutional Neural Networks (CNN)

First we need to be clear that Deep Learning is the umbrella term for all deep learning algorithms, while a CNN is an application of deep learning to image processing.
The basic structure of a typical CNN includes two kinds of layers:

• Feature extraction layer
• Each neuron’s input is connected to a local receptive field of the previous layer, from which it extracts local features. Once a local feature has been extracted, its positional relationship to the other features is fixed as well.

• Feature mapping layer
• Each computational layer of the network consists of multiple feature maps. Each feature map is a plane, and all neurons on the same plane share equal weights.
Before digging into convolutional neural networks you need a clear understanding of plain neural networks; please refer to the posts on neural networks in this blog, then continue with the material and discussion below.

## Genetic Algorithms (2)

### Core Representation of a Genetic Algorithm

/**
* Pc: probability of crossover
* Pm: probability of mutation
* M: population size
* G: maximum number of generations
* Tf: target fitness; once any individual's fitness exceeds Tf, evolution stops
*/
Step 1
Initialise the parameters Pc, Pm, M, G, Tf and randomly generate the first generation population, called P1
Set P = P1
do {
    compute the fitness F(i) of each individual in P
    initialise an empty population newP
    do {
        select 2 individuals using fitness-proportional selection
        if (rnd1 < Pc) {
            perform crossover
        }
        if (rnd2 < Pm) {
            perform mutation
        }
        put the two resulting individuals into newP, i.e. the new individuals join the new population
    } until (M individuals have been created)
    P = newP
} until (some chromosome reaches the target fitness, or the number of generations > G)


Here we can see that the random selection and the way offspring are produced need careful tuning; with bad settings the population may well go extinct, which in our terms means we never obtain a solution.
Nature has one more rule here: survival of the fittest.
Our algorithm benefits from the corresponding optimisation:
Elitism: to prevent the best solution produced during evolution from being destroyed by crossover and mutation, the best individual of each generation can be copied unchanged into the next generation.

### A Concrete Example

• Understanding the example
• Maximise f(x1, x2) = x1^3 + x2^2 + (x1 * (x1 – x2)), where x1 ∈ {-5, -3, 1, 3, 5} and x2 ∈ {0, 2, 4}.
This problem can of course be solved in many ways, and exhaustive search is very fast; the example only serves to make the genetic algorithm easier to understand thoroughly.
Step 1: Encoding
Here we define the mapping as
[
[0] = -5,
[1] = -3,
[2] = 0,
[3] = 1,
[4] = 2,
[5] = 3,
[6] = 4,
[7] = 5
]
8 values fit into 4 binary digits, so x1 and x2 together need only 8 binary digits; for example {0110|0110} represents [x1 = 3, x2 = 4].
Step 2: Generate the population, paying attention to its size and the domain of each variable; here is a bit of JS code to test this.

Step 3: Randomly select parents and produce offspring through crossover and mutation (preferring individuals with higher fitness).

The code below is only a sketch of the overall process, since there is no mutation yet and the encoding could likely be done in a better way.

console.log("genetic algorithm");
var everyone = [];
var number = 200;
function in_array(search, array){
    for(var i in array){
        if(array[i]==search){
            return true;
        }
    }
    return false;
}
// generate one 4-bit chromosome from the given scope
var genChromosome = function(scope) {
    var timestamp = new Date().getTime();
    var index = Math.ceil(Math.random() * timestamp) % scope.length;
    var chromosome = scope[index].toString(2);
    while (chromosome.length < 4) {
        chromosome = "0" + chromosome;
    }
    return chromosome;
}

// compute the fitness of one individual
var calFitness = function(omo) {
    var codes = [-5, -3, 0, 1, 2, 3, 4, 5];
    var arr1 = [-5, -3, 1, 3, 5];
    var arr2 = [0, 2, 4];
    var x1 = codes[parseInt(omo.substr(0, 4), 2)];
    var x2 = codes[parseInt(omo.substr(4, 4), 2)];
    if (x1 != undefined && x2 != undefined && in_array(x1, arr1) && in_array(x2, arr2)) {
        return x1 * x1 * x1 + x2 * x2 + (x1 * (x1 - x2));
    }
    return -9999;
}

function sortNumber(a, b)
{
    return a - b;
}

$('#genUnit').click(function() {
    $('#geti').html('');
    var scope1 = [0, 1, 3, 5, 7];
    var scope2 = [2, 4, 6];
    // generate `number` individuals
    everyone = [];
    for (var i = 0; i < number; i++) {
        var new_omo = genChromosome(scope1) + genChromosome(scope2);
        everyone.push(new_omo);
    }
    for (var i = 0; i < everyone.length; i++) {
        $('#geti').append(everyone[i] + " ");
        if ((i + 1) % 10 == 0) {
            $('#geti').append("");
        }
    }
});

// crossover function
var cross = function(omo1, omo2) {
    // here every four bits form one chromosome trait
    // (strings are immutable in JS, so work on character arrays)
    var a = omo1.split("");
    var b = omo2.split("");
    var timestamp = new Date().getTime();
    var rnd = Math.ceil(Math.random() * timestamp) % 4;
    if (rnd <= 1) {
        // swap the first four bits
        for (var i = 0; i < 4; i++) {
            var tmp = a[i];
            a[i] = b[i];
            b[i] = tmp;
        }
    } else if (rnd <= 3) {
        // swap the last four bits
        for (var i = 4; i < 8; i++) {
            var tmp = a[i];
            a[i] = b[i];
            b[i] = tmp;
        }
    }
    var rnd_next = Math.ceil(Math.random() * timestamp) % 2;
    return (rnd_next == 0 ? a : b).join("");
}
// mutation function
var variation = function(omo1, omo2) {
    // mutate one bit, then do the crossover;
    // not needed for now, so just select one parent directly
    var timestamp = new Date().getTime();
    var rnd_next = Math.ceil(Math.random() * timestamp) % 2;
    var ret;
    if (rnd_next == 0) {
        ret = omo1;
    } else {
        ret = omo2;
    }
    return ret;
}
// termination test
var finish = function() {
    // here we simply look at the 50th generation
}

$('#genNextUnit').click(function() {
    if (everyone.length == 0) { return }
    // at least 5 generations; stop when the best fitness dominates 75% of the
    // population, or after at most 50 generations
    var g_num = 0;
    while (g_num < 50) {
        // cull roughly the worst 20%; keep the best, the rest is random
        var fitness_score = [];
        for (var i = 0; i < everyone.length; i++) {
            fitness_score.push(parseInt(calFitness(everyone[i])));
        }
        fitness_score.sort(sortNumber);
        var over = Math.ceil(fitness_score.length * 0.2);
        for (var i = 0; i < over; i++) {
            fitness_score.shift();
        }
        var best = fitness_score[fitness_score.length - 1];
        // produce the next generation
        var new_generation = [];
        while (new_generation.length < number) {
            var curr_unit;
            // selection
            var timestamp = new Date().getTime();
            var choose_fitness1 = everyone[Math.ceil(Math.random() * timestamp) % everyone.length];
            var choose_fitness2 = everyone[Math.ceil(Math.random() * timestamp) % everyone.length];
            if (calFitness(choose_fitness1) == best && calFitness(choose_fitness2) == best) {
                // crossover
                curr_unit = cross(choose_fitness1, choose_fitness2);
                if (Math.ceil(Math.random() * timestamp) % 100 < 2) {
                    // mutation
                    curr_unit = variation(choose_fitness1, choose_fitness2);
                }
            } else if (Math.ceil(Math.random() * timestamp) % 100 > 50) {
                // crossover
                curr_unit = cross(choose_fitness1, choose_fitness2);
                if (Math.ceil(Math.random() * timestamp) % 100 < 2) {
                    // mutation
                    curr_unit = variation(choose_fitness1, choose_fitness2);
                }
            }
            if (curr_unit != undefined) {
                if (calFitness(curr_unit) > -9999) {
                    new_generation.push(curr_unit);
                }
            }
        }
        everyone = new_generation;
        g_num = g_num + 1;
    }
    var fitness_score = [];
    for (var i = 0; i < everyone.length; i++) {
        fitness_score.push(parseInt(calFitness(everyone[i])));
    }
    console.log(everyone[0]);
    fitness_score.sort(sortNumber);
    var best_number = fitness_score[fitness_score.length - 1];
    $('#zidai').html(best_number);
    // 01110010
});


## Genetic Algorithms (1)

Genetic algorithms, also called evolutionary algorithms, are a heuristic search technique inspired by Darwin’s theory of evolution.

### The Theory of Evolution

Population

Organisms evolve as a group; such a group is called a population.

Individual

A single organism within a population.

Gene

A single hereditary factor.

Chromosome

A set of genes.

Competition and survival of the fittest

Individuals better adapted to the environment get more chances to reproduce, so their offspring become more numerous. Poorly adapted individuals get fewer chances to reproduce, so their offspring dwindle.

Heredity and mutation

A new individual inherits part of its genes from each parent, and with some probability a gene mutates.

Summary

During reproduction, gene crossover and gene mutation occur; poorly adapted individuals are gradually eliminated while well-adapted individuals multiply. After N generations of natural selection, the surviving individuals all have very high fitness, and among them is quite likely the fittest individual ever produced.

### Genetic Algorithms

A genetic algorithm models the problem to be solved as a process of biological evolution. Copying, crossover and mutation produce the next generation of solutions; solutions with a low fitness value are gradually discarded while solutions with a high fitness value multiply. After N generations of evolution, an individual with a very high fitness value is likely to have emerged.
Below we walk through the steps of a genetic algorithm using the 0-1 knapsack problem.

• Encoding
• The solutions of a problem must be encoded as strings before a genetic algorithm can be applied. The simplest encoding is binary: the solution is encoded as an array of binary digits. For example, an integer solution can be written in binary form. Using a 0-1 string as a solution of the 0-1 knapsack problem is itself a binary encoding.
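As a sketch of binary encoding for the 0-1 knapsack problem (the item values, weights, and capacity below are made up):

```python
values = [6, 10, 12]  # value of each item (hypothetical)
weights = [1, 2, 3]   # weight of each item (hypothetical)
capacity = 5

def fitness(chromosome):
    # a 0-1 string encodes which items are taken: '1' means taken
    w = sum(wt for bit, wt in zip(chromosome, weights) if bit == "1")
    v = sum(val for bit, val in zip(chromosome, values) if bit == "1")
    return v if w <= capacity else 0  # infeasible solutions score 0

print(fitness("011"))  # items 2 and 3: weight 5 <= 5, value 22
print(fitness("111"))  # weight 6 > 5, infeasible, so 0
```

Crossover and mutation then operate directly on these 0-1 strings.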

• Selection
• Select chromosomes to produce the next generation. A common selection strategy is proportional selection, also known as the roulette wheel algorithm, illustrated in the figure below.

Each chromosome in the population is assigned one slice of a pie chart. The size of the slice is proportional to the chromosome’s fitness score: the higher the fitness, the larger its slice of the pie. To select a chromosome, spin the wheel; when it stops, whichever slice the pointer rests on, the chromosome corresponding to that slice is selected.

If the generated random number is 0.81, individual number 6 is selected.

// roulette wheel selection sketch
/*
* randomly select one individual according to the configured probabilities;
* P[i] is the probability that individual i is selected
*/
int RWS()
{
    m = 0;
    r = Random(0, 1); // r is a random number between 0 and 1
    for (i = 1; i <= N; i++)
    {
        /* if the random number falls within m ~ m + P[i], individual i is
         * considered selected, so i is selected with probability P[i]
         */
        m = m + P[i];
        if (r <= m) return i;
    }
}

• Crossover
• Two chromosomes exchange part of their genes to construct two new chromosomes for the next generation.
Parent chromosomes:
00000|011100000000|10000
11100|000001111110|00101
After random crossover:
00000|000001111110|10000
11100|011100000000|00101

• Mutation
• A gene in a newly produced chromosome may flip with some probability; this is called mutation.
Before mutation: 000001110000000010000
After mutation: 000001110000100010000
We write Pc for the probability of chromosome crossover and Pm for the probability of chromosome mutation.
Fitness function: used to evaluate the fitness of a chromosome, written f(x). Sometimes the chromosome’s fitness function must be distinguished from the problem’s objective function. For example, the objective of the 0-1 knapsack problem is the total value of the items taken, but using that value directly as the chromosome’s fitness function may not be appropriate. The fitness function should be positively correlated with the objective function, and can be obtained by transforming it.

## Neural Networks (1)

• Neural networks in biology
• The human brain consists of white matter and grey matter. Put simply, the grey matter receives and issues commands while the white matter transmits them, and the smallest processing unit is the neuron, shown in the figure below:

Neurons exchange signals through an electro-chemical process. Input signals arrive from other neurons: the axon terminals (i.e. the endings) of those neurons meet this neuron’s dendrites and form synapses, through which the signals enter the cell. We can view this process like signal processing in a modern computer, operating on a series of 0s and 1s; a brain cell likewise has only two states, excited and not excited (i.e. inhibited). The strength of the emitted signal does not change; only its frequency does. By some method we do not yet understand, the neuron sums all the signals arriving at its dendritic synapses, and if the total exceeds some threshold the neuron fires and sends an electrical signal along its axon to other neurons. If the total does not reach the threshold, the neuron does not fire.
Because the body contains an enormous number of neurons, each working in parallel as an independent processing unit, the human brain has the following characteristics:
1 It can learn without supervision.
2 It is redundant against damage.
3 It processes information extremely efficiently.
4 It is good at generalisation.
5 It is conscious.

• Digital neural networks
• In the previous section we saw that biological neural networks conduct information through neurons, so an artificial neural network also needs a conducting medium, which we call the artificial neuron, sketched below:

As the figure shows, every input entering an artificial neuron is associated with a weight w, and it is these weights that determine the overall activity of the network. For now, imagine all the weights set to small random numbers between -1 and 1. Because a weight can be positive or negative, it influences its associated input differently: a positive weight excites, a negative weight inhibits.
When the input signals enter the neuron, their values are multiplied by the corresponding weights and fed into the big circle in the figure. The big circle is essentially a function, called the activation function, which adds up all these new, weight-adjusted inputs to form a single activation value. The activation is a floating-point number that can likewise be positive or negative. The function’s output, which is also the neuron’s output, is then produced from the activation: if the activation exceeds some threshold, the neuron outputs a 1; if it is below the threshold, it outputs a 0.


With this understanding, we will from now on simply call the artificial neuron a neuron. A neuron can have one or more inputs, written x1, x2, x3, x4, …, xn (n is the number of inputs), with corresponding weights w1, w2, w3, w4, …, wn. The activation is then A = x1*w1 + x2*w2 + x3*w3 + … + xn*wn, which matches the meaning shown in the figure. We can regard (x1, w1), (x2, w2), … as n-dimensional vectors. As before, if the activation exceeds the threshold the neuron’s output is 1, otherwise 0.
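The activation rule just described, A = x1*w1 + x2*w2 + … + xn*wn with a step output, can be sketched as follows (the weights and threshold are arbitrary example values):

```python
def neuron(inputs, weights, threshold):
    # activation A = x1*w1 + x2*w2 + ... + xn*wn
    activation = sum(x * w for x, w in zip(inputs, weights))
    # step output: 1 if the activation exceeds the threshold, else 0
    return 1 if activation > threshold else 0

print(neuron([1, 0, 1], [0.5, -0.4, 0.3], threshold=0.6))  # 0.8 > 0.6, fires: 1
print(neuron([0, 1, 1], [0.5, -0.4, 0.3], threshold=0.6))  # -0.1 <= 0.6: 0
```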

• What can a neuron be used for?
• Now that we understand the neuron, let us see what it can actually do. In the brain, biological neurons are connected to other neurons; to create an artificial neural network, artificial neurons must be connected in the same way. There are many ways to connect them, but we generally connect the neurons layer by layer as shown below. This type of network is called a feed-forward network, because the output of every layer of neurons is fed forward to the next layer until the output of the whole network is obtained.

Let us make this concrete with an example. Neural networks are often used for pattern recognition; concretely, a network is good at mapping an input state (the pattern it is trying to recognise) to an output state (the pattern it was trained to recognise).
Take character recognition as an example. Imagine a panel made of 8×8 cells, each containing a small lamp that can independently be switched on (cell bright) or off (cell dark), so the panel can display the ten digit symbols. The figure below shows the digit 4.

To recognise the display, we design the following network: the panel is the input, and the output is 1 or 0, where 1 means the network has confirmed the digit “4” is displayed and 0 means it is not. So the network needs 64 inputs, one per cell of the image, a hidden layer of many neurons, and an output layer consisting of a single neuron to which all the hidden outputs are fed.
Once the network is built, it must be trained to recognise the digit “4”. One way to do that is the following: first initialise all the network’s weights to arbitrary values. Then feed it a series of inputs, in this case inputs representing different panel configurations. For each configuration, check what the output is and adjust the weights accordingly. If the input pattern we feed the network is not a “4”, we know the output should be 0, so the weights for every non-“4” character are adjusted so the output tends toward 0; when the pattern for “4” is fed in, the weights are adjusted so the output tends toward 1. This type of training is called supervised learning, and the data used for training is called the training set. The weights can be adjusted in many different ways; the most common method for this kind of problem is backpropagation, which we will cover later. Besides this supervised kind of training there is also training that needs no teacher at all, called unsupervised learning.