Some old random notes I took many years ago.

  • overfitting: the model fits the training data too well but fails to generalize to new, unseen data
    • usually caused by having too little data or too many model parameters
    • usual remedies are cross validation (to detect it) and regularization (to curb it); see the sketch after this list
  • always define the business metric before designing models
  • silent failures are common in ML pipelines
  • more engineering than research
    • debuggability requires expert-level system design
    • cost, cost, cost: how to bring it down
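
A minimal sketch of the two usual remedies, assuming scikit-learn is available: k-fold cross validation scores the model only on held-out folds (which exposes overfitting), while the Ridge `alpha` penalty (L2 regularization) shrinks parameters. The dataset and alpha values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Few samples, many features: a setup that overfits easily.
X, y = make_regression(n_samples=50, n_features=200, noise=5.0, random_state=0)

for alpha in (0.01, 1.0, 100.0):  # strength of the L2 penalty (illustrative values)
    model = Ridge(alpha=alpha)
    # 5-fold cross validation: each score comes from a held-out fold,
    # so a model that merely memorizes its training folds scores poorly.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:>6}: mean held-out R^2 = {scores.mean():.3f}")
```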

NLP stuff

  • Tokenization: break a doc into a list of tokens, which can be words or n-grams (see the sketch at the end of these notes)
    • Normalization: cleanup at the token level, e.g. lowercasing or stemming (tokens can be joined back into a doc afterwards)
  • TF-IDF: (term frequency) × (inverse document frequency)
    • TF: how frequently a token occurs within a single doc
    • IDF: how rare a token is across all docs
    • a weighting scheme: measures how important a token is to a doc within a corpus
    • a simple way to turn words/sentences/docs into quantifiable feature vectors (sparse counts, as opposed to learned dense embeddings)
  • corpus: the collection of all docs; the set of unique tokens across them is the vocabulary
    • how the vocabulary is built affects the final weights, e.g. one can tokenize into n-grams
      • this can blow up the vocabulary with a huge number of grams
    • need a stop word list to drop common, uninformative tokens
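
A from-scratch sketch pulling these notes together: tokenize, normalize, drop stop words, then weight each token by tfidf(t, d) = tf(t, d) · log(N / df(t)). Pure Python; the token regex, stop list, and toy docs are illustrative assumptions.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on", "and", "are"}  # toy stop word list

def tokenize(doc: str) -> list[str]:
    # Tokenization + normalization: lowercase, keep alphanumeric runs,
    # drop stop words. Real pipelines may also stem or lemmatize.
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens: list[str], n: int) -> list[str]:
    # Using n-grams as terms enlarges the vocabulary quickly: every
    # window of n tokens becomes its own term (the blow-up noted above).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # df(t): number of docs containing token t (the doc-level count behind IDF).
    df = Counter(t for tokens in token_lists for t in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            # tf(t, d) = count / total; idf(t) = log(N / df(t)).
            # Tokens that appear in every doc get weight 0; rare ones score high.
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return weights

docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are pets",
]
for doc, w in zip(docs, tf_idf(docs)):
    top = sorted(w.items(), key=lambda kv: -kv[1])[:2]
    print(doc, "->", [(t, round(s, 3)) for t, s in top])
```

Swapping `tokens` for `tokens + ngrams(tokens, 2)` inside `tf_idf` would add bigrams as terms, which shows concretely how fast an n-gram vocabulary grows.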