machine-learning

This is a continuously updated repository that documents a personal journey of learning data science and machine learning topics.

Goal: Introduce machine learning content in Jupyter Notebook format. The content aims to strike a good balance between mathematical notation, educational from-scratch implementations using Python’s scientific stack (numpy, numba, scipy, pandas, matplotlib, pyspark, etc.), and usage of open-source libraries such as scikit-learn, fasttext, huggingface, onnx, xgboost, lightgbm, pytorch, keras, tensorflow, gensim, h2o, ortools, ray tune, etc.

Documentation Listings

deep learning

Curated notes on deep learning.

model deployment

operation research

reinforcement learning

Notes related to the advertising domain.

Information retrieval. Some examples are demonstrated using ElasticSearch.

time series

Forecasting methods for time series data.

projects

End-to-end projects, including data preprocessing and model building.

ab tests

A/B testing, a.k.a. experimental design. Includes a quick review of the necessary statistical concepts, the methods and workflow/thought process for conducting the test, and caveats to look out for.

model selection

Methods for selecting, improving, and evaluating models/algorithms.

dim reduct

Dimensionality reduction methods.

recsys

Recommendation systems with a focus on matrix factorization methods. Newcomers to the field should go through the first notebook to understand the basics of matrix factorization methods.

trees

Tree-based models for both regression and classification tasks.

clustering

TF-IDF and Topic Modeling are techniques specifically used for text analytics.

keras

For those interested, there’s also a keras cheatsheet that may come in handy.

text classification

Deep learning techniques for text classification are grouped in their own section.

regularization

Building intuition on Ridge and Lasso regularization using scikit-learn.
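As a rough illustration of the kind of intuition this section aims for, the sketch below fits scikit-learn's Ridge and Lasso on synthetic data where only two of five features matter. The data, the alpha values, and the variable names are assumptions made for this example and are not taken from the notebooks.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 5 features,
# but only the first two actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Ridge adds an L2 penalty, Lasso adds an L1 penalty.
# The alpha values are arbitrary choices for this sketch.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```

The point to look for is that the L2 penalty shrinks coefficients toward zero without eliminating them, while a sufficiently strong L1 penalty sets the coefficients of uninformative features exactly to zero.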

networkx

Graph libraries other than networkx are also discussed.

association rule

Also known as market-basket analysis.

big data

Exploring big data tools, such as Spark and H2O.ai. For those interested, there’s also a pyspark rdd cheatsheet and a pyspark dataframe cheatsheet that may come in handy.

data science is software

Best practices for doing data science in Python.

ga

Genetic Algorithm. Math-free explanation and code from scratch.

unbalanced

Choosing the optimal cutoff value for logistic regression under cost-sensitive mistakes (that is, when the cost of misclassification differs between the two classes) on a dataset with unbalanced binary classes, e.g. when the majority of the data points have a positive outcome and only a few have a negative one, or vice versa. The idea extends to any other classification algorithm that can predict class probabilities; this documentation uses logistic regression only for illustration.
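The idea can be sketched as a simple search over candidate cutoffs, picking the one with the lowest total misclassification cost. This is a minimal sketch, not the notebook's implementation: the function names total_cost and best_cutoff, the cost values, and the toy labels and probabilities are all hypothetical, and the probabilities stand in for the output of any fitted classifier.

```python
import numpy as np

def total_cost(y_true, y_prob, cutoff, cost_fp, cost_fn):
    """Total cost of the mistakes made when predicting 1 for y_prob >= cutoff."""
    y_pred = (y_prob >= cutoff).astype(int)
    n_fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    n_fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    return float(cost_fp * n_fp + cost_fn * n_fn)

def best_cutoff(y_true, y_prob, cost_fp, cost_fn, candidates=None):
    """Return the (cutoff, cost) pair with the lowest total cost.

    Ties are broken in favor of the smallest candidate cutoff.
    """
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)
    costs = [total_cost(y_true, y_prob, c, cost_fp, cost_fn) for c in candidates]
    best = int(np.argmin(costs))
    return float(candidates[best]), costs[best]

# Toy unbalanced data (hypothetical): 2 positives out of 10 observations.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.80, 0.60])

# Assume a missed positive is five times as costly as a false alarm.
cutoff, cost = best_cutoff(y_true, y_prob, cost_fp=1.0, cost_fn=5.0)
default_cost = total_cost(y_true, y_prob, 0.5, cost_fp=1.0, cost_fn=5.0)

print("chosen cutoff:", cutoff, "cost:", cost)
print("cost at the default 0.5 cutoff:", default_cost)
```

On this toy data the cost-minimizing cutoff falls below the conventional 0.5, because lowering the cutoff trades a cheap false positive for avoiding an expensive false negative. In practice the cutoff would be chosen on held-out data rather than on the data used to fit the model.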

clustering old

A collection of scattered old clustering documents in R.

linear regression

Python Programming