 Activation function
 Logistic neurons
$y=\frac{1}{1+e^{-z}}$
 $\frac{dy}{dz}=y(1-y)$
 The softmax function
 $y_i=\frac{e^{z_i}}{\sum\limits_{j:j \in group}{e^{z_j}}}$
$\frac{\partial y_i}{\partial z_i} = y_i(1-y_i)$
 Use Cross entropy as the cost function. (See more at http://peterroelants.github.io/posts/neural_network_implementation_intermezzo02/)
$E = -\sum\limits_{j:j \in group}{\hat{y}_j \log y_j}$
 $\frac{\partial E}{\partial z_i}=y_i-\hat{y}_i$
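A quick numpy check of the softmax / cross-entropy gradient (the logits and one-hot target here are arbitrary examples):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y_hat * np.log(y))  # E = -sum_j y_hat_j log y_j

z = np.array([2.0, 1.0, 0.1])
y_hat = np.array([1.0, 0.0, 0.0])      # one-hot target
y = softmax(z)

grad = y - y_hat                       # dE/dz_i = y_i - y_hat_i

# compare with a numerical derivative of E w.r.t. z_0
eps = 1e-6
z_shift = z.copy()
z_shift[0] += eps
numeric = (cross_entropy(softmax(z_shift), y_hat) - cross_entropy(y, y_hat)) / eps
print(abs(grad[0] - numeric) < 1e-4)  # True
```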
 General tricks
 Dropout (See more at https://arxiv.org/abs/1207.0580)
 A useful way to regularize the net.
 Details: randomly remove some hidden units in a layer for each training example
 Advantage: It stops hidden units from relying too much on other hidden units
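A minimal sketch of a dropout mask. This is the "inverted dropout" variant, which rescales the surviving units at training time so the expected activity is unchanged; the paper instead halves the outgoing weights at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5):
    # randomly zero each hidden activity, then rescale the survivors
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(10000)
out = dropout(h)
print(out.mean())   # stays near 1.0 even though roughly half the units are zero
```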
 Tricks to speed up the learning procedure (See more at http://ruder.io/optimizinggradientdescent/)
 Use momentum
$v(t) = \alpha v(t-1)-\varepsilon \frac{\partial E}{\partial w}(t)$
 $\begin{aligned}\Delta w(t) & = v(t)\\
& = \alpha v(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)\\
& = \alpha \Delta w(t-1) - \varepsilon \frac{\partial E}{\partial w}(t) \end{aligned}$
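A toy illustration of the momentum update, minimizing $E(w)=w^2$ (the step size and momentum values are arbitrary choices):

```python
def momentum_step(w, v, grad, alpha=0.9, eps=0.1):
    v = alpha * v - eps * grad   # v(t) = alpha v(t-1) - eps dE/dw
    return w + v, v              # delta w(t) = v(t)

w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2 * w)  # dE/dw = 2w for E = w^2
print(abs(w) < 0.05)  # True: converges near the minimum at 0
```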
 Use separate adaptive learning rates for each parameter
$\Delta w_{i,j} = -\varepsilon g_{i,j} \frac{\partial E}{\partial w_{i,j}}$
$g_{i, j}(t) = \begin{cases}g_{i,j}(t-1) + \delta & \text{if } \frac{\partial E}{\partial w_{i,j}}(t)\cdot\frac{\partial E}{\partial w_{i,j}}(t-1) > 0 \\ g_{i,j}(t-1) \times (1-\delta) & \text{otherwise}\end{cases}$
$\delta$ is a value set by hand ($0.05$ is usually a good choice)
 This method ensures that local learning rates increase when the gradients don’t change sign and decay rapidly when oscillations start
 Use rmsprop
$F(w, t) = \delta F(w, t-1) + (1-\delta)\left(\frac{\partial E}{\partial w}\right)^2$
 $\Delta w = -\frac{\varepsilon}{\sqrt{F(w, t)}} \cdot \frac{\partial E}{\partial w}$
$\delta$ is a value set by hand ($0.9$ is usually a good choice)
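A toy rmsprop step on the same kind of 1-D problem (the values are arbitrary; the small constant added to the square root just avoids division by zero):

```python
import numpy as np

def rmsprop_step(w, F, grad, eps=0.01, delta=0.9):
    F = delta * F + (1 - delta) * grad ** 2     # moving average of squared gradient
    w = w - eps * grad / (np.sqrt(F) + 1e-8)    # divide the step by sqrt(F)
    return w, F

w, F = 5.0, 0.0
for _ in range(1000):
    w, F = rmsprop_step(w, F, 2 * w)            # dE/dw = 2w for E = w^2
print(abs(w) < 0.1)  # True: settles near the minimum at 0
```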
 Preventing overfitting
 Get more data
 Average many different models
 Use Bayesian Approach
Bayes' Theorem: $P(A \mid B)=\frac{P(A)P(B \mid A)}{P(B)}$
 Deriving Bayes' Theorem: $P(A \mid B)=\frac{P(A, B)}{P(B)}=\frac{P(A)P(B \mid A)}{P(B)}$
 Understanding
Let $P(A)$ be the prior probability, assuming we already know the prior distribution.
 Then we observe new data $B$, so we can compute the likelihood $P(B \mid A)$ using the prior distribution.
 After that, we use $P(B)$ to renormalize, where $P(B) = \int\limits_A{P(B \mid A)P(A)\,dA}$.
 Finally, our posterior distribution is $P(A \mid B)=\frac{P(A)P(B \mid A)}{P(B)}$.
 The Bayesian interpretation of weight decay
Let $W$ be the set of weights in a network. We add Gaussian noise $\epsilon \sim N(0, \sigma_B^2)$ to the output of the network, and let $P(A \mid W)$ denote the probability that we get the correct answer $A$ after adding the noise to the output $B$ with weights $W$.
Considering the Maximum A Posteriori (MAP) estimate, we are trying to maximize $P(W \mid A) = \frac{P(W)P(A \mid W)}{P(A)}$, which is the same as minimizing $-\log P(W \mid A) = -\log P(W) - \log P(A \mid W) + \log P(A)$. As $\log P(A)$ has nothing to do with $W$, we can just ignore that term. So we can define our cost as $C = -\log P(W) - \log P(A \mid W)$, and we try to minimize the cost $C$.
Let us regularize the weights $W$ by imposing the Gaussian prior $N(0,\sigma_w^2)$; then $C = \frac{1}{2\sigma_w^2}\sum\limits_{i}{w_i^2}+\frac{1}{2\sigma_B^2}\sum\limits_c{(y_c-\hat{y}_c)^2} + k$ where $k$ is the sum of some constant terms.
If we take that equation, multiply through by $2\sigma_{B}^2$, and ignore the constant $k$, we get a new cost function $C' = \sum\limits_c{(y_c-\hat{y}_c)^2} + \frac{\sigma_B^2}{\sigma_w^2}\sum\limits_{i}{w_i^2}$.
 As we can see here, the weight decay parameter in this case should not be an arbitrary hack.
 See more at $L^2$ regularization is equivalent to Gaussian Prior and Bayesian Methods for Neural Networks
 Empirical Bayes
As we mentioned above, for $L^2$ weight decay to work well, we need to know the variance of the weight prior $\sigma_w^2$ and the noise variance $\sigma_B^2$.
 Steps
 Start with guesses for both the noise variance and the weight prior variance.
 While not yet bored
 Do some learning using the ratio of the variances as the weight penalty coefficient.
 Reset the noise variance to be the variance of the residual errors.
 Reset the weight prior variance to be the variance of the distribution of the actual learned weights.
 Go back to the start of this loop.
 We don’t need a validation set to choose the weight penalty coefficient.
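A toy version of this loop. Ridge regression is used here as a hypothetical stand-in for "do some learning" so that each iteration has a closed form; the data and initial guesses are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(scale=2.0, size=5)
y = X @ true_w + rng.normal(scale=0.5, size=200)   # true noise std is 0.5

sigma_B2, sigma_w2 = 1.0, 1.0        # guesses for noise / weight-prior variance
for _ in range(10):                  # "while not yet bored"
    lam = sigma_B2 / sigma_w2        # ratio of variances = weight penalty
    w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)   # ridge fit
    sigma_B2 = np.var(y - X @ w)     # reset to the variance of residual errors
    sigma_w2 = np.var(w)             # reset to the variance of learned weights
print(sigma_B2)   # settles near the true noise variance of 0.25
```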
 Full Bayesian Learning
 Instead of trying to find the best single setting of parameters we compute the full posterior distribution over all possible parameter settings.
This approach makes overfitting disappear, because there's no reason why the amount of data should influence our prior beliefs about the complexity of the model. So this approach allows us to use complicated models even when we do not have much data.
 This approach does not involve any gradient descent so there are no local optimum issues.
 Practical Full Bayesian Learning
 Use Markov Chain Monte Carlo to approximate the posterior probability
We can use the sampling noise from minibatches as the noise that the MCMC method needs.
Use a model that has the right capacity
 Limit the number of hidden layers and the number of units per layer
 Use early stopping
Use weight decay
 Introduce noise
 $y = \sum\limits_i{x_iw_i}$
Consider $L^2$ weight decay and Gaussian noise
 Let $x_i^{noisy}$ equal $x_i+\epsilon_i$ where $\epsilon_i$ is sampled from $N(0,\sigma^{2}_{i})$
 Then $y^{noisy} = y + \sum\limits_{i}{w_i\epsilon_i}$
$$\begin{aligned}E\bigl((y^{noisy}-\hat{y})^2\bigr) & = E\Bigl(\bigl((y - \hat{y}) + \sum\limits_{i}{w_i\epsilon_i}\bigr)^2\Bigr)\\
& = (y - \hat{y})^2 + E\bigl(2(y-\hat{y})\sum\limits_{i}{w_i\epsilon_i}\bigr) + E\bigl((\sum\limits_{i}{w_i\epsilon_i})^2\bigr)\\
& = (y - \hat{y})^2 + E\bigl(\sum\limits_{i}{w_i^2\epsilon_i^2}\bigr)\\
& = (y - \hat{y})^2 + \sum\limits_{i}{w_i^2\sigma_i^2}\end{aligned}$$ So $\sigma_i^2$ is equivalent to an $L^2$ penalty in this simple network.
 If the network is more complicated, Gaussian noise usually works better than $L^2$ weight penalty.
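The equivalence can be checked with a Monte Carlo estimate (the weights, noise levels, input, and target below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
sigma = np.array([0.3, 0.1, 0.2])       # per-input noise standard deviations
x = np.array([0.7, -1.2, 2.0])
y_hat = 1.0                             # arbitrary target

y = w @ x
noise = rng.normal(size=(200_000, 3)) * sigma
y_noisy = (x + noise) @ w

mc = np.mean((y_noisy - y_hat) ** 2)                      # Monte Carlo estimate
closed = (y - y_hat) ** 2 + np.sum(w ** 2 * sigma ** 2)   # formula derived above
print(abs(mc - closed) < 0.05)  # True
```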
 Combine models (cooperation)
 Mathematics analysis
 Combine models
 $\bar{y} = \frac{1}{n} \sum\limits_{i = 1}^{n}{y_i}$
If we use the squared error, then the cost is $(\bar{y}-\hat{y})^2$, and so is its expectation.
 Pick a predictor at random
If we use the squared error, then the expected cost should be
$$\begin{aligned}E\big((\hat{y}-y_i)^2\big) & = E\Big(\big((\hat{y} - \bar{y}) - (y_i - \bar{y})\big)^2\Big)\\
& = E\big((\hat{y} - \bar{y})^2 + (y_i - \bar{y})^2 - 2(\hat{y} - \bar{y})(y_i - \bar{y})\big)\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - 2(\hat{y} - \bar{y})E(y_i-\bar{y})\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big) - \frac{2}{n}(\hat{y} - \bar{y})\sum\limits_{i = 1}^{n}{(y_i - \bar{y})}\\
& = (\bar{y} - \hat{y})^2 + E\big((y_i-\bar{y})^2\big)\end{aligned}$$ 
Comparison
Comparison
Clearly the expected cost of picking a random predictor is no less than that of combining the models, since $E\big((y_i-\bar{y})^2\big) \ge 0$.
 The more the results of the models differ, the greater the advantage of combining models over picking a random model.
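The identity behind this comparison is easy to check numerically (the target and the five predictor outputs below are made up):

```python
import numpy as np

y_hat = 3.0                               # the target
y = np.array([2.0, 3.5, 4.0, 1.5, 3.0])  # outputs of five predictors
y_bar = y.mean()

avg_cost = (y_bar - y_hat) ** 2           # cost of the averaged prediction
pick_cost = np.mean((y - y_hat) ** 2)     # expected cost of a random pick

# the gap is exactly the variance of the predictors' outputs
print(np.isclose(pick_cost - avg_cost, y.var()))  # True
```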
 Ways to make predictors differ
 Use lots of different kinds of models
 Bagging: Train different models on different subsets of the data.
 Boosting: Train a sequence of low capacity models. Weight the training cases differently for each model in the sequence.
 Boosting upweights cases that previous models got wrong.
It focuses the computational resources on modelling the tricky cases.
 Mixtures of Experts (specialization)
 Cluster the training cases into subsets based on the relationship between input and output so that each cluster can be well modelled by an expert.
 Averaging models like what we mentioned above causes cooperation, not specialization.
Cost function that encourages cooperation: $C = (\hat{y}-\bar{y})^2$
Cost function that encourages specialization: $C = \sum\limits_{i}{p_i(\hat{y}-y_i)^2}$ where $p_i$ is the probability of picking expert $i$, assigned by the manager (a separately trained model).
Principal Components Analysis
 This projects $n$-dimensional data onto an $m$-dimensional subspace in which the data have the most variance.
 We reconstruct by using the mean value on the $n-m$ directions that are not represented. The reconstruction error is the sum, over all these unrepresented directions, of the squared differences of the datapoint from the mean.
 Using backpropagation to implement PCA inefficiently
 Try to make the output be the same as the input in a network with a central bottleneck. The activities of the hidden units in the bottleneck form an efficient code.
If the hidden and output layers are linear, it will learn hidden units that are a linear function of the data and minimize the squared reconstruction error, which is exactly what PCA does.
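A small numpy check that projecting onto the top principal direction and using the mean on the discarded direction gives a reconstruction error equal to the discarded variance (the 2-D data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])  # 2-D data, one stretched axis

mean = X.mean(axis=0)
Xc = X - mean
cov = Xc.T @ Xc / len(X)
vals, vecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
top = vecs[:, -1:]                 # keep m = 1 principal direction

X_rec = Xc @ top @ top.T + mean    # mean value on the unrepresented direction
err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(np.isclose(err, vals[0]))    # True: error = variance in the discarded direction
```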
 Neural networks
 Convolutional Neural Networks
 Weight sharing
If a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing
 Pooling
 Get a small amount of translational invariance at each level by averaging some neighbouring replicated detectors to give a single output to the next level.
 Advantage: It reduces the inputs to the next layer, allowing us to have many more different feature maps.
Disadvantage: We lose information about the precise position of things
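A minimal sketch of non-overlapping average pooling (the 4x4 feature map is made up):

```python
import numpy as np

def avg_pool(fm, size=2):
    # average over non-overlapping size x size neighbourhoods
    h, w = fm.shape
    return fm.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)
print(avg_pool(fm))
# [[ 2.5  4.5]
#  [10.5 12.5]]
```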
 Recurrent neural networks
Comparison with other Hidden State Models
 Linear Dynamical Systems
Let $x$ denote the state vector and $A$ denote the transition matrix
 The variation can take two forms:
Flow: $x$ varies continuously with time
 $\frac{d x(t)}{d t} = A \cdot x(t)$
 Mapping: $x$ varies in discrete steps
 $x_{m+1} = A \cdot x_m$
The distribution over the hidden state given the data so far can be computed efficiently
 The hidden states follow a Gaussian distribution
 Hidden Markov Models
 There are a number of things called states. And the system is always in exactly one of those states.
 Transitions between states are stochastic and controlled by a transition matrix
 HMMs have efficient algorithm for inference and learning
 The hidden states follow discrete distribution
An HMM with $n$ states can only store $\log (n)$ bits of information, because at any moment it has to be in exactly one of its $n$ states
 RNNs
 RNNs can remember much more information
 RNNs can potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects
 RNNs are much harder to train
 Training the RNNs with backpropagation
 Expand RNN in time sequence
Constrain the weights expanded from the same original weight to stay identical
 Train as a normal network
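The steps above can be sketched as a forward pass through the unrolled net, where every time step reuses the same weight matrices (the shapes and values below are arbitrary):

```python
import numpy as np

def rnn_forward(xs, W_in, W_hid, h0):
    # the SAME W_in and W_hid are applied at every time step:
    # this is the weight-sharing constraint of the unrolled network
    h, states = h0, []
    for x in xs:
        h = np.tanh(W_in @ x + W_hid @ h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.5, size=(3, 2))    # input -> hidden
W_hid = rng.normal(scale=0.5, size=(3, 3))   # hidden -> hidden
xs = [rng.normal(size=2) for _ in range(4)]  # a length-4 input sequence
states = rnn_forward(xs, W_in, W_hid, np.zeros(3))
print(len(states))  # 4: one hidden state per time step
```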
 Reasons why RNNs are hard to train:
In the forward pass, we use squashing functions (like the logistic) to prevent the activity vectors from exploding
 In the backward pass, the pass is completely linear, so it’s very likely to vanish or explode.
 Four effective ways to learn an RNN
 Long Short Term Memory
 Hessian Free Optimization
 Echo State Networks
Fix the input-to-hidden connections and the hidden-to-hidden connections at random values and only learn the hidden-to-output connections.
 It’s very important to set these connections very carefully.
Key idea: a big random expansion of the input vector can help.
 ESNs demonstrate that it’s very important to initialize weights sensibly.
 Good initialization with momentum
 RNNs with multiplicative connections
 Reasons why we need multiplicative connections
 If we consider an RNN as an automaton, then the hidden state of time $t$, should depend on the conjunction of input of time $t$ and hidden state of time $t – 1$.
Instead of using the inputs to the RNN to provide additive extra input to the hidden units, we can use the current input to choose the whole hidden-to-hidden weight matrix. But this approach requires far too many parameters.
 Details
 Let’s consider this model
(picture from https://www.coursera.org/learn/neuralnetworks/home/welcome)
Then, multiplicative connections can be written as $c_f = (b^{\mathsf{T}}w_f)(a^{\mathsf{T}}u_f)v_f$ or $c_f=(b^{\mathsf{T}}w_f)(v_fu_f^{\mathsf{T}})a$, where $(b^{\mathsf{T}}w_f)$ can be considered a scalar coefficient and $(v_fu_f^{\mathsf{T}})$ can be considered a transition matrix of rank $1$.
In fact, if there are $n$ hidden units, there are $n$ factors, and the description above shows just one of them. All these factors together let the input determine the transition matrix.
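A numerical check of this factorization (all vectors and matrices below are random placeholders): summing each factor's contribution $(b^{\mathsf{T}}w_f)(a^{\mathsf{T}}u_f)v_f$ is the same as applying an input-dependent transition matrix built from rank-1 pieces.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, F = 4, 3, 5              # hidden units, input size, number of factors
a = rng.normal(size=n)         # previous hidden state (group a)
b = rng.normal(size=m)         # current input (group b)
U = rng.normal(size=(n, F))    # the u_f vectors, one per column
V = rng.normal(size=(n, F))    # the v_f vectors, one per column
W = rng.normal(size=(m, F))    # the w_f vectors, one per column

# factored form: one scalar per factor, then a weighted sum of the v_f
c = V @ ((W.T @ b) * (U.T @ a))

# equivalent full transition matrix: input-weighted sum of rank-1 matrices
T = sum((W[:, f] @ b) * np.outer(V[:, f], U[:, f]) for f in range(F))
print(np.allclose(c, T @ a))  # True
```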
 Hopfield Nets
Energy function: $E = -\sum\limits_i{s_ib_i}-\sum\limits_{i<j}{s_is_jw_{ij}}$.
 Hopfield proposed that memories could be energy minima of a neural net.
 Storing a memory using Hopfield’s storage rule
If the units are binary threshold units: $\Delta w_{ij} = 4(s_i-\frac{1}{2})(s_j-\frac{1}{2})$.
 If the units have activities of $1$ and $-1$: $\Delta w_{ij}=s_is_j$
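A tiny Hopfield net using the $\pm 1$ storage rule (the two stored patterns are made up). For brevity this sketch uses synchronous threshold updates, whereas Hopfield's energy argument assumes asynchronous updates; on this example it settles all the same.

```python
import numpy as np

def store(patterns):
    W = patterns.T @ patterns      # w_ij = sum over patterns of s_i s_j
    np.fill_diagonal(W, 0)         # no self-connections
    return W

def recall(W, s, steps=5):
    for _ in range(steps):         # threshold updates settle to an energy minimum
        s = np.where(W @ s >= 0, 1, -1)
    return s

patterns = np.array([[ 1, 1, -1, -1,  1, -1],
                     [-1, 1,  1, -1, -1,  1]])
W = store(patterns)
probe = patterns[0].copy()
probe[0] = -probe[0]               # corrupt one bit of the first memory
print((recall(W, probe) == patterns[0]).all())  # True: the memory is restored
```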
 Dealing with spurious minima in Hopfield Nets
 Let the net settle from a random initial state and then do unlearning.
 Crick and Mitchison proposed unlearning as a model of what dreams are for.
 Storing a memory using pseudolikelihood method
 Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector.
 Boltzmann Machines
 Can be regarded as Hopfield nets with hidden units.
 Instead of using the net to store memories, use it to construct interpretations of sensory input.
 The input is represented by the visible units.
 The interpretation is represented by the states of the hidden units.
 The badness of the interpretation is represented by the energy.
 Use Simulated Annealing to make Boltzmann Machines escape from local minima.
$P(s_i=1)=\frac{1}{1+e^{-\frac{\Delta E_i}{T}}}$ where $T$ is the temperature.
Raising the noise level is equivalent to decreasing all the energy gaps between configurations
 Thermal equilibrium
 The thing that settles down is the probability distribution over configurations
 Using energies to define probabilities
Let $v$ denote the configuration on the visible units and $h$ the configuration on the hidden units.
$P(v, h) = \frac{e^{-E(v, h)}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
 $P(v) = \frac{\sum\limits_h{e^{-E(v, h)}}}{\sum\limits_{g}{\sum\limits_{u}{e^{-E(u, g)}}}}$
 We can use MCMC to sample the probabilities.
The Boltzmann Machine learning algorithm
 The goal of learning: we are trying to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors.
 Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations.
$$\frac{\partial \log P(v)}{\partial w_{ij}}=E_v(s_is_j)-E_{m}(s_is_j)$$
where $E_v$ denotes the expected value of the product of states at thermal equilibrium when $v$ is clamped on the visible units, and $E_m$ denotes the expected value of the product of states at thermal equilibrium with no clamping.

Positive phase: finding hidden configurations that work well with $v$ and lowering their energies.
 Negative phase: finding the joint configurations that are the best competitors and raising their energies.
 Use MCMC to collect the statistics.

Restricted Boltzmann Machines
 Definition: Only one hidden layer and no connections between hidden units.
 In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped
$$P(h_j=1)=\frac{1}{1+e^{-(b_j+\sum\limits_{i \in vis}{v_iw_{ij}})}}$$
 So we can quickly get the exact value of $E_v(v_ih_j)$
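A sketch of that conditional in numpy (the weights and the visible vector are random placeholders). Because there are no hidden-to-hidden connections, all the hidden probabilities come out in one parallel step:

```python
import numpy as np

def hidden_probs(v, W, b_hid):
    # P(h_j = 1 | v) = logistic(b_j + sum_i v_i w_ij), for all j at once
    return 1.0 / (1.0 + np.exp(-(b_hid + v @ W)))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))   # 6 visible units, 3 hidden units
b_hid = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])

p = hidden_probs(v, W, b_hid)
print(p.shape)  # (3,): one probability per hidden unit
```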
 Contrastive divergence (CD)
(picture from https://www.coursera.org/learn/neuralnetworks/home/welcome)
 Learning layers of features by stacking RBMs
 Training steps
 First train a layer of features that receive input directly from the data.
 Then treat the activations of the trained features as if they were the original data and learn features of features in a second hidden layer.
 Then do it again.
 Generating steps
 Get an equilibrium sample from the top level RBM by performing alternating Gibbs sampling for a long time.
Perform a top-down pass to get states for all the other layers.
Modeling real-valued data with an RBM
 Use stepped sigmoid units
 Belief Networks
 Disadvantage of backpropagation
 Requires labeled data.
 The learning time doesn’t scale well.
 It can get stuck in poor local optima.
 Overcoming the limitations of backpropagation by using unsupervised learning
 Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modelling the structure of the sensory input.
 Adjust the weights to maximize the probability that a generative model would have generated the sensory input.
 Definition: a directed acyclic graph composed of stochastic variables.
 Sigmoid Belief Networks (SBN)
 Definition: Belief Networks with logistic neurons.
 Naive learning algorithm
 Get the posterior distribution over hidden states given the observed data. We can use MCMC to do this.
 Get an unbiased sample from the posterior distribution.
 For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.
$\Delta w_{ji}=\epsilon s_j\big(s_i-P(s_i=1)\big)$
 The wakesleep algorithm
We wrongly assume that the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit. Even though this assumption is wrong, the learning of an SBN will still work.
Wake phase: Use the recognition weights to perform a bottom-up pass; this trains the generative weights to reconstruct activities in each layer from the layer above.
 Sleep phase: Use the generative weights to generate samples from the model; this trains the recognition weights to reconstruct activities in each layer from the layer below.
 Strong AI
 2 theories on concept
 The feature theory: A concept is a set of semantic features
 The structuralist theory: The meaning of a concept lies in its relationships to other concepts
God’s Swing
Here’s how some tensors from an Auto Encoder oscillate:
It seems that they are trying to capture something.
Well, this is my first time visualizing a neural network. And it feels like reaching a completely new world!
Titanic: Machine Learning from Disaster
Well, this is my first try on Kaggle.
And the result is not so good:
The first time I saw the picture below, I thought it was just a joke. However, after my first try, I know it's true, even though I tried my best to avoid it.
(picture from: xkcd, CC BYNC 2.5)
PCA vs. Auto Encoder
This is definitely the most impressive picture I’ve seen this month!
(from NN4ML – Geoffrey Hinton)
Practical Tricks
I found two useful websites recently.
1. Acronymify!
This website can find abbreviation from the full name of your project.
A cool abbreviation makes your project much posher!
2. Linggle
This website can provide you with semantic suggestions.
It’s quite useful for students when writing English articles.
Notes on “A Neural Probabilistic Language Model”
Ⅰ. Distributed Representation
1. Fight the curse of dimensionality
As sentences with similar meanings can use quite different words, it's very hard for an n-gram model to generalize. In the paper, they propose using a Distributed Representation to fight the curse of dimensionality. The Distributed Representation gives the model the ability to make use of semantic information, which by itself improves the model. It also lets each training sentence inform the model about an exponential number of semantically neighbouring sentences, which makes the model generalize much better.
2. Deal with out-of-vocabulary words
It's clear that the only thing we need to do is assign the out-of-vocabulary word a distributed representation. To do that, we first consider the unknown word $j$ as a blank that needs to be filled. We use the network to estimate the probability $p_i$ of each word $i$ in the vocabulary. Denoting by $C_i$ the distributed representation of word $i$ in the vocabulary, we assign $\sum\limits_{i \in vocabulary}{C_i \cdot p_i}$ to $C_j$ as the distributed representation of word $j$. After that, we can incorporate word $j$ into the vocabulary and use this slightly larger vocabulary for further computation.
This approach is quite elegant because it is reminiscent of how the human brain handles unknown words.
3. Deal with polysemous words
We can just simply assign multiple distributed representations for a single polysemous word.
4. Comparison with n-gram models
 Using a distributed representation makes use of semantic information and turns discrete random variables into continuous ones. These two features let the network generalize much better than n-gram models.
Ⅱ. Improve computation efficiency
1. Represent the conditional probability with a tree structure
For each prediction, the model only traverses $O(\log n)$ nodes of the tree structure, which greatly reduces the amount of computation needed.
2. Parallel implementation
In this paper, data-based and parameter-based parallel implementations are mentioned.
Ⅲ. Relating to Strong AI
1. Taking advantage of prior knowledge
In this paper, they take advantage of the semantic information.
2. Decompose the whole network into smaller parts
This can make computation faster and make the network easier to adapt to other tasks. In "Opening the black box of Deep Neural Networks via Information", it is said that a large amount of computation is spent compressing the input into an effective representation. So if we can modularize the network and set up a set of general APIs, it could make a huge difference in practical implementations.
Ⅳ. Papers involved
Perceptrons
Definition
A twoclass classifier.
A Feedforward Neural Network without hidden layers.
More specifically: It maps its input $x$ to output $f(x)$ with parameter $\theta$:
$$
\begin{equation}
f(x) = \begin{cases}
1 & \text{if }x \cdot \theta > 0 \\
0 & \text{otherwise}
\end{cases}
\end{equation}
$$
Learning Algorithm
 Randomly initialize $\theta$
 For each example $i$ in the training set, update the parameter $\theta$ as follows:
$$\theta = \theta + (y_i - f(\theta \cdot x_i)) \cdot x_i$$
 Repeat step 2 until the classifier classifies most examples correctly.
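The steps above can be sketched on a toy linearly separable problem. This example learns the AND function; the constant third input plays the role of a bias, and the data and epoch count are arbitrary choices.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    rng = np.random.default_rng(0)
    theta = rng.normal(size=X.shape[1])       # step 1: random initialization
    for _ in range(epochs):                   # step 3: repeat step 2
        for x_i, y_i in zip(X, y):            # step 2: per-example update
            pred = 1 if theta @ x_i > 0 else 0
            theta += (y_i - pred) * x_i
    return theta

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                    # AND of the first two inputs
theta = train_perceptron(X, y)
preds = (X @ theta > 0).astype(int)
print((preds == y).all())  # True
```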
Properties
Let's change perspective: let each example denote a constraint, and the desired $\theta$ lies in the intersection of those constraints. If you know Linear Programming, you will see that this is actually a half-plane intersection problem.
It's clear that if $\theta_1, \theta_2$ are feasible, then $a\theta_1 + b\theta_2$ is feasible when $a + b = 1$ and $a, b \ge 0$. It is easy to prove: a half-plane is a convex set, and the intersection of convex sets is convex. This property helps to prove the convergence of the algorithm.
If you know how to solve Linear Regression problems with Gradient Descent, you will know that we may overshoot the best solution if we choose a learning rate that is too large. The same problem exists in this algorithm, so although it converges, it may not converge to a solution that fits the dataset perfectly. Therefore, consider a "generously feasible solution" that lies within the feasible region by a margin at least as great as the length of the input vector that defines each constraint plane. The algorithm can only be proved to converge to such a "generously feasible solution"; thus "solution" in the proof of convergence below means "generously feasible solution".
Convergence
As you can see from the definition, the Perceptron is a linear classifier. Thus, if the dataset is not linearly separable, this algorithm will not converge.
If you imagine plotting the dataset on a plane (or in a space), then with some knowledge of linear algebra you can easily get the intuition that we are adjusting the decision boundary (separating hyperplane) according to every single example, and that each iteration makes the parameter $\theta$ better. This algorithm does in fact converge, and here is the proof:
Because the feasible solutions form a convex set, the modifications made for other examples won't make the decision boundary worse.
As for a single misclassified example $i$, each modification changes the value of $\theta \cdot x_i$ by $\|x_i\|^2$. The total change needed to classify it correctly is $|\theta \cdot x_i|$. Thus, the maximum number of iterations before the classifier makes the right classification for each example is $O(\max\limits_i{\frac{|\theta \cdot x_i|}{\|x_i\|^2}})$
And here is a more rigorous proof, read it if you like: http://leijun00.github.io/2014/08/perceptron/
Disadvantages
The most well-known disadvantage of this algorithm is that it can't simulate the XOR function. But actually, there are more general theorems, like the "Group Invariance Theorem". So I decided to read Perceptrons: an introduction to computational geometry (Minsky & Papert, 1969) first. Then I will come back to finish this part.
—————————— UPD 2017.8.15 ——————————
I thought Perceptrons: an introduction to computational geometry was a paper. But it's actually a book with 292 pages! So I'm giving up on reading it for now. Maybe I will read it in university?
Summary of “Machine Learning by Andrew NG”
Preface
I found this course on Zhihu. Lots of people recommend it as "the best way to start with Machine Learning", so I spent two weeks finishing this course. Having finished it, I found it a great course as well! Here's the link: https://www.coursera.org/learn/machinelearning/home/welcome. It won't cost you too much time (about 50 hours is enough), but it will lead you to a new world.
Problems in this course
According to Wikipedia, here are 5 subfields:
1. Classification: To divide inputs to known classes.
2. Regression: To estimate the relationships among variables.
3. Clustering: To divide inputs to classes. Unlike in classification, the groups are not known beforehand.
4. Density estimation: To find the distribution of inputs in some space.
5. Dimensionality reduction: To simplify inputs by mapping them into a lowerdimensional space.
In this course, all these 5 topics are involved.
Algorithms in this course
Gradient Descent: A powerful algorithm for solving Classification and (Linear) Regression problems. This algorithm uses the derivative of the cost function to minimize the cost.
Stochastic Gradient Descent: A variant of Gradient Descent. When dealing with a large amount of data, it's much faster than Gradient Descent, but it's a little bit harder to get it to converge.
Mini-Batch Gradient Descent: A variant of Gradient Descent. It costs less time to complete a single iteration than Gradient Descent, though it is slower than Stochastic Gradient Descent; in exchange, it can fit the data better than Stochastic Gradient Descent. You can regard this algorithm as a compromise between the original Gradient Descent and Stochastic Gradient Descent.
Collaborative Filtering: A variant of Gradient Descent. It's often used in Recommender Systems.
Normal Equation: A great way to solve Linear Regression problems. It fits the data exactly in closed form. In this algorithm we have to compute the inverse of a matrix, which takes $O(n^3)$, so it can't deal with datasets with too many features.
Support Vector Machine (SVM): A powerful tool for solving Classification and (Linear) Regression problems. In this course, Andrew explains its application to classification problems, where it can be described as a Large Margin Classifier. Furthermore, the cost function of SVMs is convex, so it won't be trapped in a local optimum. Moreover, with the "kernel trick", it can fit nonlinear hypotheses well.
Neural Network (Backpropagation): The most popular algorithm in Machine Learning. Neural networks try to simulate our brain, so they are believed to be the most promising way to build strong AI. Backpropagation uses the derivative of the cost function to minimize the cost. It's easy to learn and performs well on many problems.
K-Means Algorithm: This algorithm tries to find patterns in data by itself. It divides the data into different, previously unknown classes. It's useful in analysis.
(Multivariate) Gaussian Distribution Algorithm: An algorithm based on the Gaussian Distribution that solves Density Estimation problems. Widely used in Anomaly Detection.
Useful Tricks
 Feature Scaling: Scale data to make algorithms work better. Widely used in Gradient Descent and other algorithms.
One-vs-All: This trick lets you turn your two-class classifier into a multi-class classifier with very little modification.
 Regularization: It’s the most useful way to solve overfitting problems.
Gradient Check: An easy numerical way to determine whether your implementation of the cost function is bug-free.
Random Initialization: A necessary part of Neural Networks. Randomly initializing several times is also a good way to increase the chance of finding the global optimum rather than a local one.
 Train/Validation/Test set: A way to assign your dataset. It’s widely used in almost every single algorithm.
 Learning Curve: A good way to evaluate your algorithm. And it can help you to decide how to improve your algorithm.
Precision / Recall / $F_1$ Score: A good way to evaluate your algorithm, especially when your dataset is skewed.
Principal Component Analysis (PCA): A good way to compress your data by reducing it to a small number of principal components. This can speed up your algorithm, and it can also help you visualize your data.
Ceiling Analysis: A way to analyze the pipeline of your Machine Learning system. It can help you decide which component is most worth optimizing.
Important Ideas
 Build a naive system as fast as possible. Optimize your system later.
Do analyze your system. Let the results of the analysis, rather than intuition, tell you what to do next.
Further Plan
After finishing the course “Machine Learning” by Andrew NG, here are two more courses to take:
1. CS231n: Convolutional Neural Networks for Visual Recognition
2. Neural Networks for Machine Learning
And this is the site of Hungyi Lee: http://speech.ee.ntu.edu.tw/~tlkagk/index.html
There are some good lectures there to read.
—————————— UPD 2017.8.13 ——————————
slides of CS231n can be found here: http://cs231n.stanford.edu/slides/2017/
videos of CS231n can be found on YouTube or Bilibili
—————————— UPD 2018.3.4 ——————————
I found this post very useful: https://zhuanlan.zhihu.com/p/25005808