Life and computing: ML

Logistic Regression

Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable as well as attributes that are highly correlated with each other. [--> Feature selection required.]
It's a fast model to learn and effective on binary classification problems.
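
One way to act on the feature-selection note is to drop any attribute that is almost a copy of one already kept. The following is a minimal sketch in plain Python; the helper names, the 0.95 threshold, and the toy data are assumptions for illustration, not part of the original notes.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if its |correlation| with every kept column is below threshold.

    `columns` maps a feature name to its list of values.
    The threshold is an assumed cut-off, not a recommended value.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: height_in is nearly a rescaling of height_cm.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age":       [35.0, 22.0, 58.0, 41.0, 19.0],
}
selected = drop_correlated(features)
```

The sketch does not handle constant columns (zero variance), which would have to be filtered out first.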

LDA (Linear Discriminant Analysis)

If you have more than two classes, LDA is the preferred linear classification technique. It assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data beforehand.
It pursues dimensionality reduction while preserving the information that discriminates between classes.

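PCA (unsupervised) ignores class labels: it looks for basis vectors such that, when all samples are projected onto them, the overall distribution is spread out as widely as possible.
LDA (supervised) looks for a projection axis along which samples of the same class cluster together and samples of different classes lie far apart. (Reference explanation 1, reference explanation 2)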

The goal is to find a vector $w$ that maximizes the difference between the class means while minimizing the variance within each class.
LDA will seek to maximize the separation between the different classes by computing the component axes (linear discriminants).
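
For the two-class case, this objective is usually written as the Fisher criterion. The symbols $m_1$, $m_2$ (class means), $S_B$ (between-class scatter), and $S_W$ (within-class scatter) are notation assumed here; only $w$ comes from the notes above.

$$
J(w) = \frac{\left(w^\top m_1 - w^\top m_2\right)^2}{w^\top S_W\, w} = \frac{w^\top S_B\, w}{w^\top S_W\, w},
\qquad
S_B = (m_1 - m_2)(m_1 - m_2)^\top,
\qquad
S_W = \sum_{c \in \{1,2\}} \sum_{x \in C_c} (x - m_c)(x - m_c)^\top
$$

Maximizing $J(w)$ gives, up to scale,

$$
w^{*} \propto S_W^{-1}\,(m_1 - m_2)
$$

The numerator grows as the projected means move apart, and the denominator shrinks as each class becomes tighter along $w$.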

Limitations:

LDA is a parametric technique (it assumes unimodal Gaussian likelihoods), so its classification performance is poor on data that is not Gaussian-distributed.

Naive Bayes (= Simple Bayes)

Naive Bayes is called naive because it assumes that the input variables are statistically independent of one another given the class. This is a strong assumption and unrealistic for real data; nevertheless, the technique is very effective on a large range of complex problems.
When there are too many features, accounting for every dependency between them becomes too complex; the independence assumption is applied as a "simplification" that allows easy and fast decisions.

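Multi-class classification problems can be predicted easily and quickly.

Typical uses: spam filtering, sentiment analysis, disease prediction.

Effective for analyzing categorical data.
However, it cannot make a proper prediction for data in a category that did not appear in the training data.

To avoid this zero-frequency situation, a smoothing technique is needed (e.g., Laplace estimation).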
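
As an illustration of the zero-frequency problem, here is a small sketch of add-one (Laplace) smoothing for one categorical feature. The function name, the `alpha` parameter, and the spam vocabulary are hypothetical.

```python
from collections import Counter

def smoothed_likelihood(values, vocabulary, alpha=1.0):
    """Estimate P(value | class) with add-alpha (Laplace) smoothing.

    values: feature values observed for one class in the training data.
    vocabulary: every category the feature can take, including unseen ones.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(vocabulary)
    return {v: (counts[v] + alpha) / total for v in vocabulary}

# Hypothetical training data: the word "refund" never appears in the spam class.
spam_words = ["free", "win", "free", "prize"]
vocab = ["free", "win", "prize", "refund"]
probs = smoothed_likelihood(spam_words, vocab)
```

Without smoothing, `refund` would get probability 0 and wipe out the whole product of likelihoods; with `alpha=1.0` it gets a small non-zero share, and the probabilities still sum to 1.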

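Bayes' theorem

Posterior: P(c|x) = the posterior probability that a given instance x belongs to a given class c

Posterior = Likelihood × Prior / Predictor

Likelihood: P(x|c) = the conditional probability of observing instance x within class c
(Class) Prior: P(c) = how frequently class c occurs.
Predictor: P(x) = the probability that the instance occurs; it is the same constant for every class, so it is usually dropped from the calculation.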
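
Written out with the naive independence assumption, the resulting decision rule looks as follows. The feature index $i$ and the notation $x = (x_1, \dots, x_d)$ are assumptions added here for illustration.

$$
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)},
\qquad
P(x \mid c) = \prod_{i=1}^{d} P(x_i \mid c)
$$

$$
\hat{c} = \arg\max_{c}\; P(c) \prod_{i=1}^{d} P(x_i \mid c)
$$

Because $P(x)$ does not depend on $c$, it does not change which class attains the maximum.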

KNN (K-Nearest Neighbor)

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset.
KNN and LVQ need normalization.
Tutorial
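
To see why normalization matters for distance-based methods, here is a minimal KNN sketch with min-max scaling. The data, column meanings, and helper names are hypothetical, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import Counter

def scale_row(row, bounds):
    # A constant column (hi == lo) is mapped to 0.0 to avoid dividing by zero.
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]

def min_max_scale(rows):
    """Rescale every column to [0, 1]; returns the scaled rows and the (min, max) per column."""
    bounds = [(min(c), max(c)) for c in zip(*rows)]
    return [scale_row(r, bounds) for r in rows], bounds

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to `query` (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical data: [income in dollars, age in years], labelled by age group.
raw_x = [[30000, 25], [32000, 60], [90000, 30], [95000, 65]]
labels = ["young", "old", "young", "old"]
query = [88000, 62]

scaled_x, bounds = min_max_scale(raw_x)
pred_scaled = knn_predict(scaled_x, labels, scale_row(query, bounds), k=1)
pred_raw = knn_predict(raw_x, labels, query, k=1)
```

On the raw data the income column dominates the distance, so the 62-year-old query lands next to a 30-year-old; after scaling, both columns contribute comparably and the nearest neighbour is the 65-year-old.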

LVQ (Learning Vector Quantization)

An artificial neural network that clusters input vectors around the most similar reference vector.

Reference blog: http://untitledtblog.tistory.com/50

The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
If you discover that KNN gives good results on your dataset, try using LVQ to reduce the memory requirements of storing the entire training dataset.
There is the following claim: kNN is usually superior to LVQ. (In my personal experience, training LVQ takes a considerable amount of time, comparable to RF.)
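
The idea of learning a few reference vectors instead of storing every training instance can be sketched with the basic LVQ1 update rule. This is a simplified illustration, not a tuned implementation; the learning-rate schedule, the initial prototypes, and the toy data are assumptions.

```python
import math

def nearest_prototype(prototypes, x):
    """Index of the prototype (vector, label) whose vector is closest to x."""
    return min(range(len(prototypes)), key=lambda i: math.dist(prototypes[i][0], x))

def train_lvq1(data, prototypes, lr=0.3, epochs=20):
    """LVQ1: pull the winning prototype toward a same-class sample, push it away otherwise."""
    protos = [(list(vec), label) for vec, label in prototypes]
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)   # linearly decaying learning rate (an assumption)
        for x, y in data:
            vec, label = protos[nearest_prototype(protos, x)]
            sign = 1.0 if label == y else -1.0
            for d in range(len(vec)):
                vec[d] += sign * rate * (x[d] - vec[d])
    return protos

def predict(protos, x):
    """Label of the closest prototype."""
    return protos[nearest_prototype(protos, x)][1]

# Hypothetical 2-D data: class "a" near (0, 0), class "b" near (5, 5).
data = [([0.0, 0.2], "a"), ([0.3, 0.0], "a"), ([0.1, 0.4], "a"),
        ([5.0, 5.1], "b"), ([4.8, 5.3], "b"), ([5.2, 4.9], "b")]
# Two prototypes instead of six stored points; the starting positions are arbitrary.
protos = train_lvq1(data, [([1.0, 1.0], "a"), ([4.0, 4.0], "b")])
```

Prediction then only needs the two learned prototypes, which is the memory saving over KNN that the notes describe.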

Decision Trees

On the opposite end of the spectrum, you have decision trees. Unfortunately, decision trees are often implemented as ensembles in practice to improve accuracy and reduce variance. Random forests and gradient boosted trees are simple enough models conceptually, but once you add tens or hundreds of trees, it becomes impractical to try to interpret why the model made a certain prediction.
Decision trees have a high variance and can yield more accurate predictions when used in an ensemble.
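The larger a tree grows, the finer the classification it can make, but the more likely it is to overfit. Pruning is therefore applied to make the tree generalize.
Advantages:

Fast to implement; works well regardless of attribute type; (relatively) less sensitive to outliers; few tuning parameters; handles missing values efficiently; easy to interpret.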
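
As a rough illustration of how a tree grows, the sketch below searches for the single best threshold on one numeric feature using Gini impurity, as CART-style trees do. The notes above do not name a splitting criterion, so Gini is an assumption here, as are the function names and data.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best `value <= threshold` split."""
    best = (None, gini(labels))
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical single feature that separates the classes cleanly at 6.
values = [1, 2, 3, 4, 6, 7, 8, 9]
labels = ["no", "no", "no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(values, labels)
```

A full tree repeats this search inside each side of the split until the nodes are pure, which is where the overfitting risk comes from; limiting the depth or pruning stops that process early.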

Random Forest

If you get good results with an algorithm with high variance (like decision trees), you can often get better results by bagging (Bootstrap Aggregating) that algorithm.

Bagging

Bootstrap

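The term comes from statistics (bootstrapping, 1979, by Bradley Efron).

When the distribution of the whole population cannot be known with certainty, a sample is taken and used to estimate that distribution (on the assumption that the sample represents the population). That is, the distribution of the population is inferred from the relationship between the distribution of the full sample and the distributions of resamples drawn from it.

Bagging improves performance through model averaging.

For regression problems, it reduces variance.
For classification problems, the most frequent result is selected by voting.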
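
The two steps, bootstrap resampling and aggregation, can be sketched as follows. The base learner here is a deliberately trivial "predict the sample mean" model, an assumption chosen only to keep the example self-contained.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x, task="classification"):
    """Combine model outputs: majority vote for classification, mean for regression."""
    outputs = [m(x) for m in models]
    if task == "regression":
        return sum(outputs) / len(outputs)
    return Counter(outputs).most_common(1)[0][0]

def fit_mean_model(sample):
    """Toy base learner (illustration only): always predicts the mean of its sample."""
    mean = sum(sample) / len(sample)
    return lambda x: mean

rng = random.Random(0)
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]
models = [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(200)]
estimate = bagged_predict(models, x=None, task="regression")
```

Each model sees a slightly different resample, so individual predictions vary; averaging 200 of them gives an estimate close to the mean of the original data (6.0).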

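When not to apply it

When the sample data is very small relative to the population.
When the data is very noisy (outliers can increase the distortion).
When the data points are not sufficiently independent of one another.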

Differences from boosting

The models used in bagging are independent of each other, whereas in boosting the models are trained sequentially.
Boosting ends with a weighted vote, whereas bagging uses a simple vote.
Bagging mainly aims to reduce variance; boosting mainly aims to reduce bias.
When the data is free of noise, boosting tends to outperform bagging.
Boosting can overfit, whereas bagging helps mitigate overfitting.

Boosting

AdaBoost (= adaptive boosting)

A technique that builds a more accurate predictive model by combining rules that are only slightly better than random guessing (weak learners/classifiers).

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

Simple ways to combine them: 1) use an average or weighted average; 2) take the majority opinion (vote).

Efficient boosting algorithms were developed by (Schapire 1989), (Freund), and others.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

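In other words, when training with a weak learner produces errors, the weights of the misclassified samples are increased so that a new weak learner focuses on them, and the final result combines the outputs of all learners.

Each new learner is trained with reference to the previous results.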
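
The reweighting loop described above can be sketched with one-dimensional threshold rules ("stumps") as the weak learners. This is a minimal sketch of discrete AdaBoost under simplifying assumptions (labels are +1/-1, a single numeric feature, hypothetical function names), not a production implementation.

```python
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` for x <= threshold and `-polarity` otherwise."""
    threshold, polarity = stump
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    best, best_err = None, float("inf")
    for threshold in xs:
        for polarity in (1, -1):
            stump = (threshold, polarity)
            err = sum(w for x, y, w in zip(xs, ys, weights) if stump_predict(stump, x) != y)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err

def adaboost(xs, ys, rounds=3):
    """Labels must be +1 or -1. Returns a list of (alpha, stump) pairs."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)       # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)     # vote weight of this learner
        ensemble.append((alpha, stump))
        # Misclassified points get heavier, correctly classified points lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data that no single threshold separates: +, +, -, -, +, +.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
```

No single stump classifies this data correctly, but the weighted vote of three stumps does, because each round concentrates on the points the previous rounds got wrong.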

Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

Because the algorithm puts so much attention on correcting mistakes, it is important that you have clean data with outliers removed.

XGBoost

XGBoost is a tree ensemble model: its prediction is the sum of the predictions from a set of classification and regression trees (CART). In that, XGBoost is similar to Random Forests, but it uses a different approach to model training.
Out of the different implementations and variations of gradient boosting algorithms, caret performed best on PCA-preprocessed data in the validation set.
Both xgboost and gbm follow the principle of gradient boosting. There are, however, differences in the modeling details. Specifically, xgboost uses a more regularized model formalization to control over-fitting, which gives it better performance.
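
The "more regularized model formalization" can be made concrete with the objective that XGBoost's documentation gives for boosted trees. The symbols below ($l$ for the loss, $f_k$ for the $k$-th tree, $T$ for its number of leaves, $w_j$ for its leaf scores, and $\gamma$, $\lambda$ for the penalty weights) are introduced here and do not appear in the notes above.

$$
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),
\qquad
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
$$

The $\Omega$ term penalizes trees with many leaves and large leaf scores, which is the over-fitting control mentioned above.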

t-SNE

t-Distributed Stochastic Neighbour Embedding

introduced by van der Maaten and Hinton in 2008

~~instead of looking at directions/axes which maximise information or class separation,~~ t-SNE converts the Euclidean distances between points into conditional probabilities that measure how similar one data point is to another. A Student-t distribution is then used to measure the similarities between the corresponding points in the low-dimensional embedding.

Multiple local minima may occur as the algorithm is identifying clusters/sub-clusters.
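
In the notation of the 2008 paper, the quantities involved are as follows. Here $x_i$ are the original points, $y_i$ their low-dimensional counterparts, $\sigma_i$ a per-point bandwidth, and $n$ the number of points; none of these symbols appear in the notes above.

$$
p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

The cost $C$ is not convex in the $y_i$, which is why different runs can settle in different local minima.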

Model Ensembling

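One of a family of techniques for reducing generalization error.

Ensembling three models that each have 70% accuracy by majority voting can yield about 78% accuracy (assuming their errors are independent).
However, when the models are highly correlated with each other, ensembling brings little benefit.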
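
The 78% figure can be checked directly. The sketch below assumes a binary problem and fully independent models, which is exactly the assumption that breaks down when models are correlated; the function name is hypothetical and `math.comb` assumes Python 3.8 or later.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent models, each correct
    with probability p, give the right answer (n should be odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

three_models = majority_vote_accuracy(0.7, 3)
five_models = majority_vote_accuracy(0.7, 5)
```

For three models this is 0.7³ + 3 × 0.7² × 0.3 = 0.343 + 0.441 = 0.784, and adding more independent models raises it further.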

Rank ensembling: combine the predictions of various models on a given test set; the original models do not need to be retrained.

Voting; Averaging; Rank averaging; Historical averaging
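
Of these, rank averaging is the least self-explanatory, so here is a small sketch. It works only on the prediction lists, so no model is retrained; the function names and the two prediction lists are hypothetical.

```python
def to_ranks(scores):
    """Replace each score by its 0-based rank (ties are broken by position, a simplification)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

def rank_average(predictions):
    """Average the per-model ranks and rescale the result to [0, 1].

    Assumes every model scored the same rows in the same order, and at least two rows.
    """
    n = len(predictions[0])
    all_ranks = [to_ranks(p) for p in predictions]
    mean_ranks = [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]
    return [m / (n - 1) for m in mean_ranks]

# Hypothetical outputs of two models on the same four test rows, on very different scales.
model_a = [0.10, 0.40, 0.35, 0.80]
model_b = [0.001, 0.003, 0.002, 0.009]
blended = rank_average([model_a, model_b])
```

Plain averaging would let `model_a` dominate because its scores are larger; converting to ranks first makes the two models contribute equally.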

Stacking (= stacked generalization): use another predictor to combine the base predictors.

Feature weighted linear stacking; Quadratic weighted stacking; StackNet

Blending: set aside part of the training data (10%) as a holdout set and train the stacker model on the base models' predictions for that holdout set.

Tuesday, February 21, 2017

Learning Algorithms
