xzyao

Sample space: the possible outcomes of the experiment. Conceptually, it is a set. It can be discrete, continuous, finite, infinite and so on.

The elements of the sample space must satisfy certain properties: they should be mutually exclusive, collectively exhaustive, and at the right granularity.

Probability Law: assigns probabilities to outcomes or to collections of outcomes. It tells us whether one outcome is much more likely than some other outcome.

Probability Axioms and Properties

Event: a subset of the sample space. Probability is assigned to events.

Axioms:

* Nonnegativity: $P(A)\geq 0$. (No event can have a negative probability.)
* Normalization: $P(\Omega)=1$. (The probability of the entire sample space equals 1.)
* Finite Additivity: If $A\cap B=\emptyset$ (the two events are disjoint), then $P(A\cup B)=P(A)+P(B)$.
* Countable Additivity: If $A_1, A_2, \cdots$ is an infinite sequence of disjoint events, then $P(A_1\cup A_2\cup\cdots)=P(A_1)+P(A_2)+\cdots$. (The probability of a union of disjoint events equals the sum of their probabilities.)

With these axioms, we can infer the following properties:

Properties:

* $P(A)\leq 1$.
* $P(\emptyset)=0$.
* $P(A)+P(A^c)=1$.
* For an event made up of $k$ distinct outcomes, $P(\{s_1, s_2,\cdots,s_k\})=P(s_1)+\cdots+P(s_k)$.
* If $A\subset B$, then $P(A)\leq P(B)$.
* $P(A\cup B)=P(A)+P(B)-P(A\cap B)$.
* $P(A\cup B)\leq P(A)+P(B)$. This property is called the Union Bound.
* $P(A\cup B\cup C)=P(A)+P(A^c\cap B)+P(A^c\cap B^c\cap C)$.

Uniform Law

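Discrete: Assume $\Omega$ consists of $n$ equally likely elements, and assume $A$ consists of $k$ elements. Then $P(A)=\frac{k}{n}$.

Continuous: Probability = Area when $\Omega$ has unit area. In general, $P(A)$ is the area of $A$ divided by the area of $\Omega$.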

Probability Calculation Steps

* Specify the sample space.
* Specify a probability law.
* Identify an event of interest.
* Calculate the probability of the event of interest.
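The four calculation steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the notes: the experiment is a roll of two fair dice, and the names `omega`, `prob`, `sum_is_seven` and `first_is_one` are made up for the example.

```python
from fractions import Fraction
from itertools import product

# Step 1: specify the sample space (36 ordered outcomes of two dice).
omega = set(product(range(1, 7), repeat=2))

# Step 2: specify a probability law (discrete uniform law, P(A) = k / n).
def prob(event):
    return Fraction(len(event & omega), len(omega))

# Step 3: identify the events of interest.
sum_is_seven = {o for o in omega if o[0] + o[1] == 7}
first_is_one = {o for o in omega if o[0] == 1}

# Step 4: calculate the probabilities.
p_seven = prob(sum_is_seven)
p_union = prob(sum_is_seven | first_is_one)

# The union agrees with P(A) + P(B) - P(A and B).
p_check = prob(sum_is_seven) + prob(first_is_one) - prob(sum_is_seven & first_is_one)
print(p_seven, p_union, p_check)
```

Using `Fraction` keeps the results exact, so the comparison between the two ways of computing the union does not suffer from rounding.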

Bonferroni's Inequality

* For any two events $A_1$ and $A_2$, we have $P(A_1\cap A_2)\geq P(A_1)+P(A_2)-1$.
* Generally, we have $P(A_1\cap A_2\cap\cdots\cap A_n)\geq P(A_1)+P(A_2)+\cdots+P(A_n)-(n-1)$.
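As a short worked derivation (a sketch that uses only the complement rule and the Union Bound from the properties listed earlier, together with De Morgan's law), the two-event case can be obtained as follows:

$$P(A_1\cap A_2)=1-P(A_1^c\cup A_2^c)\geq 1-P(A_1^c)-P(A_2^c)=P(A_1)+P(A_2)-1$$

The general case follows the same pattern: apply the Union Bound to the $n$ complements $A_1^c,\cdots,A_n^c$ and replace each $P(A_i^c)$ with $1-P(A_i)$.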

[Questions about Machine Learning] Uncategorized

These are questions that could not be categorized at the time. They are collected here, waiting to be categorized.

Q. What is overfitting?

Q. How to overcome overfitting?

Q. What is underfitting?

Q. How to overcome underfitting?

[Questions about Machine Learning] Boosting

Q. What is AdaBoost?

A. AdaBoost is short for Adaptive Boosting. It was proposed by Yoav Freund and Robert Schapire in 1995. Since it focuses on the classification problem, we use classification as the example. AdaBoost is a classifier made up of many weak classifiers: it combines these weak classifiers into a strong classifier and uses the strong classifier to solve the classification problem. This process can be written as:

$$ F(x)=\sum_{i=1}^n w _i f _i (x) $$

where $f _i$ stands for the $i$-th weak classifier and $w _i$ is the corresponding weight.

It is clear that the adaptive boosting method is exactly the weighted combination of $n$ weak classifiers.

Now let's see its procedure.

Given a dataset containing $n$ points, some are labeled $-1$ (negative) and the others $1$ (positive). In the beginning, every data point gets the same weight $\frac{1}{n}$, that is, $w _i=\frac{1}{n}, i=1,...,n$. (Here $w _i$ is the weight of the $i$-th data point, not the classifier weight in $F(x)$ above.)

Step 1. Fit several weak classifiers to the dataset (a simple linear model, for example), and select the one with the lowest weighted classification error.

Step 2. Calculate the weight for the $i$-th weak classifier:

Step 3.

It is adaptive because the weights of the data points are adjusted toward the wrongly classified ones, and the adjusted weights are then used to train the next classifier.
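To make the procedure more concrete, here is a minimal sketch in Python. It is not taken from the notes, which leave the later steps unfinished: the classifier weight `alpha = 0.5 * ln((1 - err) / err)` and the exponential reweighting of the data points are the textbook AdaBoost formulas, and the weak classifiers are assumed to be one-dimensional threshold rules ("stumps"). The function names and the toy dataset are made up for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    # Weak classifier: predict `polarity` below the threshold, the opposite above.
    return polarity if x < threshold else -polarity

def fit_adaboost(xs, ys, rounds=3):
    n = len(xs)
    sample_w = [1.0 / n] * n  # every data point starts with weight 1/n
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        # Step 1: pick the stump with the lowest weighted classification error.
        best = None
        for t in thresholds:
            for pol in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, sample_w)
                          if stump_predict(x, t, pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        # Step 2 (assumed textbook formula): weight of this weak classifier.
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # Step 3 (assumed textbook update): raise the weights of misclassified
        # points, lower the others, then renormalize so the weights sum to 1.
        sample_w = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
                    for x, y, w in zip(xs, ys, sample_w)]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
    return ensemble

def predict(ensemble, x):
    # Strong classifier: sign of the weighted sum of weak classifiers.
    score = sum(alpha * stump_predict(x, t, pol) for alpha, t, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can classify perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = fit_adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy dataset each individual stump misclassifies at least three points, while the weighted combination of three stumps reproduces all ten labels.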

Q. What is GBDT (a.k.a. Gradient Boosting Decision Tree)?

A. GBDT is an ensemble of decision trees trained by gradient boosting.

Q. What is Gradient Boosting?

A. Gradient Boosting is a machine learning algorithm that can be used to handle multiple tasks, including regression, classification, and ranking.

As in its name, gradient boosting is the combination of gradient descent and boosting.
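A rough sketch may help to show how the two ideas fit together. The code below makes several assumptions that are not in the notes: the task is one-dimensional regression with squared loss, the weak learners are single-split regression stumps, and `fit_stump`, `gradient_boost`, `mse` and the toy data are names invented for the example. For squared loss, the negative gradient with respect to the current prediction is simply the residual, so each round fits a new stump to the residuals (the gradient descent part) and adds it to the ensemble (the boosting part).

```python
def fit_stump(xs, residuals):
    # Fit the best single-split regression stump under squared error.
    # Assumes xs contains at least two distinct values.
    best = None
    pts = sorted(set(xs))
    for a, b in zip(pts, pts[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - p)^2 with respect to p is (y - p).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction suggested by the new stump.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(f, xs, ys):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.2]
model = gradient_boost(xs, ys)
baseline = sum(ys) / len(ys)
print(mse(lambda x: baseline, xs, ys), mse(model, xs, ys))
```

The learning rate shrinks each stump's contribution, which is a common design choice to make the ensemble less sensitive to any single weak learner.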

[Questions about Machine Learning] Preface

This book series is inspired by a Chinese version, named “机器学习500问” (500 Questions about Machine Learning). It is hosted on GitHub.

I also refer to Introduction to Applied Linear Algebra, written by Stephen Boyd at Stanford University, for some definitions and explanations.

[Questions about Machine Learning] Chapter VI Recurrent Neural Network

Q. What is a Recurrent Neural Network?

[Questions about Machine Learning] Chapter V Convolutional Neural Network

Q. What is a Convolutional Neural Network (a.k.a. CNN)?

Q. What are the main components of a CNN?

A. Input, Conv, Activation, Pooling and Fully Connected Layer.

Q. What is convolution?

Q. What are the main parameters in the convolution layer?

Q. What are the different types of convolution layer?

Q. What are the differences between 2D and 3D convolution?

Q. What is pooling?

Q. What are the methods for pooling?

A. There are a few methods.

  • General Pooling

  • Spatial Pyramid Pooling

  • Mean Pooling

  • Max Pooling

Q.

[Questions about Machine Learning] Chapter IV Fundamentals of Deep Learning

Q. What is a neural network?

Q. What is a perceptron? What is a multi-layer perceptron (a.k.a. MLP)?

A.

Q. What are the commonly used neural networks?

A. [Figure: commonly used neural networks]

For more information, please visit the Asimov Institute.

Q. There are so many deep learning frameworks, which one should I choose?

Q. Why do we need deep neural networks? What are they?

Q. Why is it so hard to train a deep neural network?

Q. What are the differences between machine learning and deep learning?

Q. What are forward propagation and backward propagation (a.k.a. FP and BP)?

Q. Still unclear, more examples?

Q. How to calculate the output of a neural network?

Q. What are hyperparameters?

Q. How to find the best values for hyperparameters?

Q. Generally, what are the steps to find the hyperparameters?

Q. What is an activation function? Why do we need it?

Q. What are the commonly used activation functions?

A. Sigmoid, tanh, ReLU, Leaky ReLU, softplus and softmax are some commonly used activation functions.
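For reference, the activation functions named in this answer can be written in a few lines of plain Python. This is a sketch using their standard textbook definitions; the default slope of `0.01` for Leaky ReLU is a common choice and an assumption here, not something stated in the notes.

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Squashes any real number into (-1, 1).
    return math.tanh(z)

def relu(z):
    # Keeps positive values, clips negative values to 0.
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but lets a small fraction of negative values through.
    return z if z > 0 else slope * z

def softplus(z):
    # Smooth approximation of ReLU: log(1 + e^z).
    return math.log1p(math.exp(z))

def softmax(zs):
    # Turns a list of scores into probabilities that sum to 1.
    # Subtracting the maximum first avoids overflow in exp.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), relu(-2.0), leaky_relu(-2.0), softmax([1.0, 2.0, 3.0]))
```

Note that softmax is the only one of these that works on a whole vector of scores rather than on a single number.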

Q. What are the derivatives of those activation functions?

Q. What properties do these activation functions have?

Q. How to choose a proper activation function?

Q. What are the advantages of ReLU? Why is it so popular?

Q. Why can softmax be used for multi-class classification?

Q. Why does tanh have a higher convergence rate?

Q. What is batch size? Why do we need it?

Q. What is normalization? Why do we need it?

Q. What is batch normalization? Why do we need it?

Q. What is fine tuning?

Q.

[Questions about Machine Learning] Chapter II Machine Learning Fundamentals

In this chapter, we will discuss some basic knowledge about machine learning.

Q. What is regression? What is classification?

Q. What is supervised learning? What is semi-supervised learning? What is weakly-supervised learning? What is unsupervised learning?

Q. What are the steps of supervised learning?

Q. What is multi-instance learning?

Q. What is K-Nearest-Neighbor or KNN? What is SVM?

Q. What is a neural network? What are the different types of neural networks?

Q. What is a local optimum? What is a global optimum?

One day, Plato asked Socrates: what is love? Socrates said: I ask you to cross this wheat field and bring back the largest and most golden ear of wheat, but there is a rule: you cannot turn back, and you can only pick once. So Plato set off. After a long time, he came back empty-handed. Socrates asked him why. Plato said: When I walked through the field, I saw a few particularly large and bright ears of wheat, but I always thought there might be bigger and better ones ahead, so I did not pick them. As I kept walking, the wheat I saw never seemed as good as what I had already passed, so in the end I picked nothing. Socrates said meaningfully: this is love.

Q. What are the advantages and disadvantages of common classification algorithms, such as Bayes, Decision Tree or SVM?

Q. Is accuracy a good and comprehensive metric for classification?

Q. If not, what are the metrics for classification algorithms? And what are the metrics for regression algorithms?

Q. What is a good enough classification algorithm?

Q. What is logistic regression? What is Poisson regression?

Q. What are the differences between logistic regression and naïve Bayes?

Q. What are the differences between linear regression and logistic regression?

A. Linear Regression: $f(x)=\theta ^{T}x=\theta _{1}x _{1}+\theta _{2}x _{2}+...+\theta _{n}x _{n}$

Logistic Regression: $f(x)=P(y=1|x;\theta )=g(\theta ^{T}x)$, where $g(z)=\frac{1}{1+e^{-z}}$
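The two formulas differ only in the final step, which a small Python sketch can make visible. The parameter vector `theta` and the input `x` below are made-up values chosen for illustration.

```python
import math

def linear_regression(theta, x):
    # f(x) = theta^T x: an unbounded real-valued prediction.
    return sum(t * xi for t, xi in zip(theta, x))

def logistic_regression(theta, x):
    # f(x) = g(theta^T x) with g(z) = 1 / (1 + e^(-z)):
    # the same linear score, squashed into a probability in (0, 1).
    z = linear_regression(theta, x)
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.5, -1.0]
x = [2.0, 1.0]
linear_out = linear_regression(theta, x)
logistic_out = logistic_regression(theta, x)
print(linear_out, logistic_out)
```

Here the linear score is 0, which the sigmoid maps to a probability of 0.5, the usual decision boundary between the two classes.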

Q. What is a cost function? Why do we need it?

Q. Why does a cost function work?

Q. Why must a cost function have a lower bound? Or why can most cost functions not be negative?

Q. What are some common cost functions?

Q. Why do we use cross entropy to replace quadratic cost?

Q. What is a loss function? Why do we need it?

Q. What are some common loss functions?

Q. Why do we use the log loss function in logistic regression?

Q. How does the log loss function measure the loss?

Q. What is gradient descent? Why do we need it?

Q. What are the advantages and drawbacks of gradient descent?

Q. Still unclear, is there any graph or description?

Q. What are the steps of gradient descent?

Q. How to optimize gradient descent?

Q. What are stochastic gradient descent and batch gradient descent? What are the differences between them?

Q. What is a computation graph? How to calculate its derivatives?

Q. What is Linear Discriminant Analysis or LDA?

Q. What are the steps of LDA?

Q. What is PCA (Principal Component Analysis)? What are its steps?

Q. What are the differences between LDA and PCA?

Q. What are the advantages and drawbacks of LDA and PCA?

Q. Why do we need to reduce the dimension?

Q. What is Kernelized Principal Component Analysis or KPCA?

Q. For machine learning models, what are the commonly used metrics?

Q. What are the relations between bias, error, variance and covariance?

Q. What are empirical error and generalization error?

Q. What is overfitting? What is underfitting? How to solve them respectively?

Q. What is the purpose of cross validation?

Q. What is k-fold cross validation?

Q.

[Questions about Machine Learning] Chapter I Mathematics Fundamentals

In this chapter, we will discuss some basic mathematics knowledge that you need to know for further study.

Q. What are the relations between scalar, vector, matrix, and tensor?

A. A vector is an ordered finite list of numbers. Vectors are usually written as vertical arrays, surrounded by square or curved brackets, as seen below.

$$\begin{pmatrix}-1.1 \\ 0.0 \\ 3.6 \\ -7.2 \end{pmatrix} \text{ or } \begin{bmatrix}-1.1 \\ 0.0 \\ 3.6 \\ -7.2 \end{bmatrix}$$

Sometimes, they are written as numbers separated by commas and surrounded by parentheses, as seen below.

$$(-1.1, 0.0, 3.6, -7.2)$$

A vector is often denoted by a lowercase symbol such as $a$. We can access an element (also known as an entry, coefficient or component) of a vector by its index: the $i$th element of the vector $a$ is denoted as $a_i$, where the subscript $i$ is an integer index. (For a vector of size $n$, $1\leq i\leq n$.)

If two vectors have the same size, and more importantly, each of the corresponding entries is the same, then the two vectors are equal, which is denoted as $a=b$.

A scalar is a number or a value. In most applications, scalars are real numbers. We usually use an italic lowercase symbol to denote a scalar. For example, $\textit{a}$ is a scalar.

A matrix is a rectangular array of numbers, which can also be viewed as a 2-dimensional data table. In such a table, each row represents an item and each column represents a feature of the items. A matrix is usually denoted by a capital letter, $A$ for example.

A tensor is an array with more than 2 dimensions. Generally, if the elements of an array are arranged in a regular grid with several dimensions, we call it a tensor. As with a matrix, we use a capital letter to denote a tensor, $A$ for example. An element of a 3-dimensional tensor is denoted as $A_{i,j,k}$.

Relations between them

A scalar is a 0-dimensional tensor. A vector is a 1-dimensional tensor. For example, with a scalar, we could get the length of a rod, but we cannot know the direction of this rod.

With a vector, we could know both the length and direction of a rod.

With a tensor, we may be able to know both the length and direction of a rod, and we could even know more about the rod (for example, the degree of deflection).
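One way to picture these relations in code is to store each object as a nested Python list and count the nesting depth. This is only an illustrative sketch: the helper `ndim` and the sample values are made up, and it assumes every list is non-empty and regularly shaped.

```python
def ndim(a):
    # Number of dimensions = how many times we can index into the object.
    depth = 0
    while isinstance(a, list):
        depth += 1
        a = a[0]
    return depth

scalar = 3.6                                      # 0 dimensions
vector = [-1.1, 0.0, 3.6, -7.2]                   # 1 dimension
matrix = [[1, 2], [3, 4]]                         # 2 dimensions
tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]     # 3 dimensions

print(ndim(scalar), ndim(vector), ndim(matrix), ndim(tensor))
print(tensor[1][0][1])  # one element of the tensor needs three indices
```

The number of indices needed to reach a single number matches the number of dimensions: none for a scalar, one for a vector, two for a matrix, and three for this tensor.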

Q. What are the differences between tensor and matrix?

A. From the aspect of algebra, the matrix is a generalization of the vector: a vector is a 1-dimensional table and a matrix is a 2-dimensional table. An $n$-dimensional tensor is a so-called $n$-dimensional table. Note that this is not a strict definition of the tensor.

From the aspect of geometry, a matrix is a true geometric quantity: it does not change with the coordinate transformation of the frame of reference. The vector has this property too.

A second-order tensor can be represented by a $3\times 3$ matrix or an $n\times n$ matrix.

A scalar can be regarded as a $1\times 1$ matrix, while a vector with $n$ items can be regarded as a $1\times n$ matrix.

Q. What will happen if I multiply a matrix and a vector?

A. You can only multiply an $m\times n$ matrix by a vector with $n$ items, and the result is a vector with $m$ items. The key is to regard each row of the matrix as a vector and take its inner product with the given vector.

For example, multiply the following $4\times 2$ matrix by a vector with $2$ items:

$$\begin{bmatrix}1 & 2 \\ 0.0 & 1 \\ 3.6 & 3 \\ -7.2 & 2 \end{bmatrix}\begin{bmatrix}-1.1 \\ 0.0 \end{bmatrix}=\begin{bmatrix}-1.1 \\ 0.0 \\ -3.96 \\ 7.92 \end{bmatrix}$$
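The row-by-row rule described in this answer can be sketched in Python as follows. The function name `matvec` and the small integer example are made up for illustration; the matrix is stored as a list of rows.

```python
def matvec(A, x):
    # Every row of an m x n matrix must have exactly n entries,
    # where n is the number of items in the vector.
    if any(len(row) != len(x) for row in A):
        raise ValueError("each matrix row must have as many entries as the vector")
    # Each output entry is the inner product of one row with the vector.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1, 2],
     [0, 1],
     [3, 0]]          # a 3 x 2 matrix
x = [2, -1]           # a vector with 2 items
y = matvec(A, x)      # a vector with 3 items
print(y)
```

The explicit size check mirrors the rule that the number of columns of the matrix must equal the number of items in the vector; without it, `zip` would silently ignore the extra entries.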

Q. What is the norm?

A. In mathematics, a norm is a function that assigns a non-negative length or size to each vector in a vector space, and a strictly positive one to every non-zero vector. There are many different types of norms for a vector or a matrix. For example,

1-norm: $\|x\|_1 = \sum_{i=1}^N |x _i|$

2-norm or Euclidean norm: $\|x\|_2 = \sqrt{\sum_{i=1}^N |x _i|^2}$

3-norm: $\|x\|_3 = \left(\sum_{i=1}^N |x _i|^3\right)^{1/3}$
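As a quick illustration of the vector norms in this answer, they can all be computed with one small Python function. The general $p$-norm form $(\sum_i |x_i|^p)^{1/p}$ used below is the standard definition and an assumption on my part, since the notes do not spell it out; `p_norm` and the sample vector are made-up names and values.

```python
def p_norm(x, p):
    # (|x_1|^p + ... + |x_N|^p)^(1/p), for p >= 1.
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0]
one_norm = p_norm(x, 1)    # sum of absolute values
two_norm = p_norm(x, 2)    # Euclidean length
three_norm = p_norm(x, 3)
print(one_norm, two_norm, three_norm)
```

For this vector the 1-norm is 7 and the 2-norm is 5, and the value keeps shrinking toward the largest absolute entry as $p$ grows.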

Q. What are the norms of matrix and vector?

We define a vector as $\vec{x}=(x_1,x_2,...,x_N)$. Its norm will be:

Q. What is a positive definite matrix?

Q. How to judge whether a matrix is positive definite?

Q. What is a derivative?

Q. How to calculate the derivatives?

Q. What are the differences between derivatives and partial derivatives?

Q. What is an eigenvalue? What is an eigenvector? What is eigenvalue decomposition?

Q. What is a singular value? What is singular value decomposition?

Q. What are the differences between singular values and eigenvalues? What about their decompositions?

Q. What is probability?

Q. What are the differences between a variable and a random variable?

Q. What are the common probability distributions?

Q. What is conditional probability?

Q. What is a joint distribution? What is a marginal distribution? How are they related?

Q. What is the chain rule for conditional probability?

Q. What are independence and conditional independence?

Q. What is expectation? What is variance? What is covariance? What is the correlation coefficient?