Some old random notes I took many years ago.

  • overfitting: the model fits the training data too well but fails to generalize to new, unseen data
    • usually caused by having too little data or too many model parameters
    • usual remedies are cross validation (to detect it) and regularization (to curb it); see the sketch after this list
  • always define the business metric before designing models
  • silent failures are common in ML pipelines
  • more engineering than research
    • debuggability requires expert-level system design
    • cost, cost, cost: how to bring it down
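
A minimal sketch of the two usual remedies, assuming scikit-learn is available: k-fold cross validation scores the model only on held-out folds (which exposes overfitting), while the Ridge `alpha` penalty (L2 regularization) shrinks parameters. The dataset and alpha values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Few samples, many features: a setup that overfits easily.
X, y = make_regression(n_samples=50, n_features=200, noise=5.0, random_state=0)

for alpha in (0.01, 1.0, 100.0):  # strength of the L2 penalty (illustrative values)
    model = Ridge(alpha=alpha)
    # 5-fold cross validation: each score comes from a held-out fold,
    # so a model that merely memorizes its training folds scores poorly.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:>6}: mean held-out R^2 = {scores.mean():.3f}")
```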

NLP stuff

  • Tokenization: break a doc into a list of tokens, which can be words or n-grams (see the sketch at the end of these notes)
    • Normalization: cleanup at the token level, e.g. lowercasing or stemming (tokens can be joined back into a doc afterwards)
  • TF-IDF: (term frequency) × (inverse document frequency)
    • TF: how frequently a token occurs within a single doc
    • IDF: how rare a token is across all docs
    • a weighting scheme: measures how important a token is to a doc within a corpus
    • a simple way to turn words/sentences/docs into quantifiable feature vectors (sparse counts, as opposed to learned dense embeddings)
  • corpus: the collection of all docs; the set of unique tokens across them is the vocabulary
    • how the vocabulary is built affects the final weights, e.g. one can tokenize into n-grams
      • this can blow up the vocabulary with a huge number of grams
    • need a stop word list to drop common, uninformative tokens
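
A from-scratch sketch pulling these notes together: tokenize, normalize, drop stop words, then weight each token by tfidf(t, d) = tf(t, d) · log(N / df(t)). Pure Python; the token regex, stop list, and toy docs are illustrative assumptions.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on", "and", "are"}  # toy stop word list

def tokenize(doc: str) -> list[str]:
    # Tokenization + normalization: lowercase, keep alphanumeric runs,
    # drop stop words. Real pipelines may also stem or lemmatize.
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens: list[str], n: int) -> list[str]:
    # Using n-grams as terms enlarges the vocabulary quickly: every
    # window of n tokens becomes its own term (the blow-up noted above).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # df(t): number of docs containing token t (the doc-level count behind IDF).
    df = Counter(t for tokens in token_lists for t in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            # tf(t, d) = count / total; idf(t) = log(N / df(t)).
            # Tokens that appear in every doc get weight 0; rare ones score high.
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return weights

docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are pets",
]
for doc, w in zip(docs, tf_idf(docs)):
    top = sorted(w.items(), key=lambda kv: -kv[1])[:2]
    print(doc, "->", [(t, round(s, 3)) for t, s in top])
```

Swapping `tokens` for `tokens + ngrams(tokens, 2)` inside `tf_idf` would add bigrams as terms, which shows concretely how fast an n-gram vocabulary grows.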