This Crash Course introduces some basic machine learning topics and provides hands-on exercises using TensorFlow. The course does not cover concepts in great detail; instead, it surveys common topics with a few examples, and can be a good complement to the Coursera machine learning course by Andrew Ng. You can find the detailed learning topics I summarized on GitHub. Here, I will just list some of the bullet points I learned from this Crash Course.
-
L2 Loss (Least Squares Error)
-
Mean Square Error (MSE) is the average squared loss per example.
-
Although MSE is commonly used in machine learning, it's neither the only practical loss function nor the best loss function for all circumstances.
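A minimal sketch of MSE in plain Python (my own example, not from the course):

```python
def mse(predictions, targets):
    """Mean squared error: the average squared loss per example."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

print(mse([2.0, 3.0], [1.0, 5.0]))  # (1^2 + 2^2) / 2 = 2.5
```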
-
Stochastic Gradient Descent and Mini-Batch Gradient Descent
-
A large dataset with randomly sampled examples probably contains redundant data.
-
Redundancy becomes more likely as the batch size grows
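A toy illustration of mini-batch SGD on one-weight linear regression (my own sketch, not course code; the data and learning rate are made up):

```python
import random

def minibatch_sgd(xs, ys, batch_size, steps, lr=0.005):
    """Fit y ~ w * x by computing the gradient on a random mini-batch each step."""
    w = 0.0
    for _ in range(steps):
        batch = random.sample(range(len(xs)), batch_size)
        # gradient of the batch MSE with respect to w
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= lr * grad
    return w

random.seed(0)
xs = [float(i) for i in range(1, 11)]
ys = [3.0 * x for x in xs]  # true weight is 3
w = minibatch_sgd(xs, ys, batch_size=3, steps=500)
```

Each step touches only `batch_size` examples, which is why redundancy in large datasets makes small batches work well.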
-
TensorFlow API Hierarchy
Contains different levels of APIs
- Highest level of abstraction to solve problems, but may be less flexible
- If you need additional flexibility, move one level lower
Workflow To Use TensorFlow
- Select features and targets from data
- Use TensorFlow to build feature columns and an optimizer with hyperparameters such as the learning rate
- Build an estimator based on step 2
- Train the model for the specified number of steps
- Calculate the training and validation loss
- Tune learning rate, steps and batch size to reduce loss
- Try different features
- Try new features synthesized from old features
- Cap outliers
-
Batch Size
- Steps are the total number of training iterations. One step calculates the loss from one batch and uses that value to modify the model's weights once
- Batch size is the number of examples (randomly) selected for each single step
- Total number of trained examples = batch size × steps
- Periods control the granularity of reporting
- Number of training examples in each period = (batch size × steps)/periods
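The arithmetic above, with illustrative numbers of my own choosing:

```python
batch_size = 100
steps = 500
periods = 10

# every step consumes one batch, so:
total_examples = batch_size * steps              # 50,000 examples seen in training
examples_per_period = total_examples // periods  # 5,000 examples per reporting period
```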
-
Overfitting
- Split the data into training set, validation set and test set
-
Representations
- A string can be represented as a binary vector using one-hot encoding
- A binary vector that only has one element of 1 and all others 0
- 1 means the feature belongs to some category
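One-hot encoding as a quick sketch (the vocabulary is a made-up example):

```python
def one_hot(value, vocabulary):
    """Encode a string as a binary vector with exactly one 1."""
    return [1 if v == value else 0 for v in vocabulary]

vocab = ["red", "green", "blue"]
print(one_hot("green", vocab))  # [0, 1, 0]
```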
-
Properties of Good Features
- Appear with a non-zero value more than a small handful of times in the dataset: avoid rarely used values
- Clear and obvious meaning: sanity check
- Shouldn't take on magic values: split into 2 separate features, one representing whether the value exists and the second indicating the value itself
- The definition of a feature should not change over time
- Should not have crazy outlier values
-
Good Habits: know your data
- Visualize
- Debug
- Monitor
-
Feature Scaling
-
Scale the features to [-1, 1], or use Z-score scaling so most values fall in [-3, 3]
-
Logarithmic scaling
-
Clip feature
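The three scaling methods above as small helpers (my own sketch; the clipping bounds are arbitrary examples):

```python
import math

def z_score(values):
    """Center on the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def clip(value, low, high):
    """Cap a feature to a fixed range to tame outliers."""
    return max(low, min(high, value))

def log_scale(value):
    """Logarithmic scaling; log1p keeps 0 mapped to 0."""
    return math.log1p(value)
```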
-
The binning trick: create several boolean bins, each mapping to a new feature and allow model to fit a different value for each bin
E.g. there's no linear relationship between latitude and house price, but individual latitudes may be a good indicator of house value. Bin by location or by quantile (quantile binning ensures each bin has the same number of examples).
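The binning trick as a sketch: map a numeric value to a one-hot vector over bins, so a linear model can learn a separate weight per bin (the latitude boundaries below are invented for illustration):

```python
def bin_one_hot(value, boundaries):
    """Map a numeric value to a one-hot vector over len(boundaries) + 1 bins."""
    index = sum(1 for b in boundaries if value >= b)
    vec = [0] * (len(boundaries) + 1)
    vec[index] = 1
    return vec

# hypothetical latitude boundaries
print(bin_one_hot(37.7, [34.0, 36.0, 38.0, 40.0]))  # [0, 0, 1, 0, 0]
```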
-
Scrubbing Data
- Many examples are not reliable because of:
- Omitted values
- Duplicate values
- Bad label
- Bad feature values
- Detect bad data by
- Histogram
- Min and max
- Mean and median
- Standard deviation
-
Pearson Correlation Coefficient
- Measures the linear correlation between target and feature, and between feature and feature
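A plain-Python Pearson correlation (my own sketch):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient: covariance over the product of std devs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Values near +1 or -1 suggest a strong linear relationship; near 0, a weak one.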
-
Feature crosses
- Combine several features together (polynomial terms), and incorporate nonlinear learning into linear learner
- For boolean features, feature crosses may be very sparse
- Combining feature crosses with massive data is one efficient strategy for learning highly complex models
- Bucketized column
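A feature cross of two one-hot (bucketized) features is just their outer product, which is why crosses of boolean features can be very sparse (my own sketch):

```python
def feature_cross(one_hot_a, one_hot_b):
    """Outer product of two one-hot vectors: a one-hot over all combinations."""
    return [a * b for a in one_hot_a for b in one_hot_b]

lat_bin = [0, 1, 0]  # e.g. a bucketized latitude
lon_bin = [1, 0]     # e.g. a bucketized longitude
print(feature_cross(lat_bin, lon_bin))  # [0, 0, 1, 0, 0, 0]
```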
-
Regularization: Avoid Model Complexity When Possible
- Minimize loss(data|model) + complexity(model)
- Smaller weights: complexity as the sum of the squares of the weights
- Smaller number of features with nonzero weights
- Performing L2 regularization has following effect:
- Encourage weight values toward 0
- Encourage the mean of the weights toward 0, with a normal(Gaussian) distribution
- Increasing the lambda value strengthens the regularization effect
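The L2-regularized objective above, written out (my own sketch; the data loss and weights are arbitrary numbers):

```python
def l2_regularized_loss(data_loss, weights, lam):
    """loss(data|model) + lambda * complexity, with complexity = sum of squared weights."""
    return data_loss + lam * sum(w ** 2 for w in weights)

# larger lambda -> the penalty term dominates -> weights are pushed toward 0
print(l2_regularized_loss(0.5, [0.2, -0.4, 0.1], lam=0.1))
```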
-
Logistic Regression
- Used as:
- Probabilities(expectation)
- Classifications
- Regularization is very important for logistic regression:
- L2 regularization
- Early stopping
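Logistic regression prediction in a few lines (my own sketch; the weights and features are invented):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Logistic regression output: a probability in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

p = predict_proba([0.5, -0.25], bias=0.1, features=[2.0, 4.0])
label = 1 if p >= 0.5 else 0  # thresholding turns the probability into a class
```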
-
Classification
- Evaluation metrics:
- Accuracy, but it breaks down on class-imbalanced sets with extremely few positives or negatives
- True positives, false positive, true negatives, false negatives
- ROC curve:
- Each point is the TP and FP rate at one decision threshold
- AUC: area under the ROC curve
- Gives an aggregate measure of performance across all possible classification thresholds
- Prediction bias:
- Should have average of predictions == average of observations; otherwise the model is biased
- Bias is a canary:
- Zero bias alone does not mean everything in your system is perfect
- But it's a good sanity check
- If having bias:
- Incomplete feature set
- Buggy pipeline
- Biased training set
- Don't fix bias with a calibration layer; fix the model instead
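The metrics above, computed from the four confusion-matrix counts (my own sketch; the counts are a made-up class-imbalanced example):

```python
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true-positive rate on a ROC curve
    return accuracy, precision, recall

# class-imbalanced case: accuracy looks fine while recall is poor
acc, prec, rec = metrics(tp=1, fp=1, tn=90, fn=8)
```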
-
L1 Regularization
- Feature crosses: crosses of sparse features may significantly increase the feature space
- Model size (RAM) becomes huge
- Noisy coefficients (overfitting)
- Penalize the sum of abs(weights)
- Convex problem (L0 regularization is non-convex, thus hard to optimize)
- Encourage sparsity unlike L2
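One standard way to see why L1 encourages sparsity, unlike L2, is the soft-thresholding (proximal) update; this is my own illustration, not something the course walks through:

```python
def l1_penalty(weights, lam):
    """L1 complexity term: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def soft_threshold(w, lam):
    """Proximal step for L1: shrinks weights toward 0 and sets small ones exactly to 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

out = [soft_threshold(w, 0.1) for w in [0.05, -0.3, 0.02]]
# small weights become exactly 0 -> a sparse model
```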
-
Neural Networks:
- Nonlinearity:
- ReLU: rectified linear unit, max(0, x), usually a little better than sigmoid
- Sigmoid
- Backpropagation
- Gradients can vanish
- The gradients for the lower layers can become very small, so those layers may train very slowly
- Relu activation function may help prevent vanishing gradients
- Gradients can explode
- If the weights in the network are very large, the gradients can become too large to converge
- Large batch size and slower learning rate may help
- ReLU layers can die
- If the weighted sum is smaller than 0, the ReLU unit can get stuck and no gradient can flow through during backpropagation
- Lowering the learning rate may keep ReLU units from dying
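The two activations and their gradients, which is where the vanishing-gradient and dead-ReLU behavior comes from (my own sketch):

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs (gradients survive); 0 for negative inputs (a dead
    # ReLU passes no gradient at all)
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    # at most 0.25, so stacked sigmoid layers shrink gradients layer by layer
    return s * (1 - s)
```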
- Normalizing feature values
- Have reasonable scales
- Roughly 0-centered, [-1, 1] range often helps
- Help gradient descent work, avoid NaN trap
- Avoiding outlier values also helps
- Methods
- Linear scaling
- Hard cap to max, min
- Log scaling
- Dropout regularization
- Works by randomly dropping out units in the network for a single gradient step
- The more you drop out, the stronger the regularization
- 0.0 -> no dropout regularization
- 1.0 -> drop everything out and learn nothing
- Intermediate values are helpful
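Dropout as a sketch (my own example, using the inverted-dropout convention of scaling the survivors so the expected activation is unchanged):

```python
import random

def dropout(activations, rate):
    """Zero each unit with probability `rate`; scale survivors to preserve the expected sum."""
    if rate >= 1.0:
        return [0.0] * len(activations)  # drop everything out and learn nothing
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(1)
out = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5)
```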
-
Multi-Class Neural Networks
- One-vs-all Multi-Class
- Create one unique output for each possible class
- Train that on a signal of "my class" vs "all other classes"
- Can do it in a deep network, or with separate models
- Reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises
- SoftMax Multi-Class
- Adds an additional constraint: requires the outputs of the one-vs-all nodes to sum to 1
- Helps the training converge quickly
- Outputs can be interpreted as probabilities
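The softmax normalization itself, which enforces the sum-to-1 constraint (my own sketch; the logits are arbitrary):

```python
import math

def softmax(logits):
    """Exponentiate and normalize so outputs are positive and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # the largest logit gets the largest probability
```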
- What to use and when
- Multi-class, single-label classification
- An example may be a member of only one class
- Constraint that classes are mutually exclusive is helpful structure
- Useful to encode this in the loss
- Use one SoftMax loss for all possible classes
- If an example may be a member of more than one class, use multiple logistic regressions instead
- SoftMax Options
- Full SoftMax
- Brute force, calculate for all classes
- Fairly cheap when the number of classes is small, but prohibitively expensive when the number of classes climbs
- Candidate sampling
- Calculate for all positive labels, but only for a random sample of negatives
- Improves efficiency in problems that have a large number of classes
-
Confusion matrix
-
Embedding
- E.g. using 0s and 1s to represent whether each movie was watched is not very efficient. Instead, build a dictionary mapping each feature to an integer (movie number)
- Higher dimensional embedding can more accurately represent the relationship between input values
- More dimensions increase the chance of overfitting and lower training efficiency
- Empirical rule: Dimensions = (possible values)^(1/4)
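The fourth-root rule of thumb in code (my own sketch; the vocabulary sizes are example numbers):

```python
def embedding_dims(possible_values):
    """Empirical rule of thumb: dimensions ~ fourth root of the number of values."""
    return round(possible_values ** 0.25)

# e.g. a 10,000-movie vocabulary suggests roughly a 10-dimensional embedding
print(embedding_dims(10000))
```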
- Embedding as a tool:
- Embedding maps items to low-dimensional real vectors in a way that similar items are close to each other
- Apply to dense data to create a meaningful similarity metric
- Jointly embedding diverse data types defines a similarity between them
-
Categorical data
- Most efficiently represented via sparse tensors, which are tensors with very few non-zero elements
- Bag-of-word representation
- A useful embedding may be on the order of hundreds of dimensions
- Word2vec, developed by Google
-
System level component: don't have to build everything
- Reuse generic ML code
- Google Cloud ML solutions include Dataflow and TF Serving
- Components can be found on platforms like Spark, Hadoop, etc
-
Static vs. Dynamic Training
- Static Model - trained offline
- Easy to build and test
- Still requires monitoring the inputs
- Easy to go stale
- Dynamic model - trained online
- Continue to feed training data over time, regularly sync out updated version
- Use progressive validation instead of batch training and testing
- Need monitoring, model rollback & data quarantine capabilities
- Adapts to changes; avoids the staleness issue
-
Offline inference
- Make all possible predictions in a batch, using MapReduce or similar
- Write to table, then feed these to cache/lookup table
- Upside:
- Don't need to worry about the cost of inference
- Can use batch quota
- Post-verification on the training data before pushing
- Downside:
- Only predict things we know about - bad for long tail
- Latency measured in hours or days
-
Online inference
- Predict on demand, using a server
- Upside:
- Predict new items as they come in - great for the long tail
- Downside:
- Compute intensive, latency sensitive - may limit model complexity
- Monitoring needs are more intensive
-
Feature management
- Input data determines ML system behavior
- We write unit tests for software libraries, what about data?
- Care is required when choosing input signal
- Maybe even more care than when deciding upon which software libraries to depend
- Reliability
- Versioning
- Necessity
- Correlations
- Feedback loops
-
Machine Learning Guidelines: rules of machine learning
- Keep first model simple
- Focus on ensuring the pipeline is correct
- Use simple and observable metrics for training and validation
- Monitor input features
- Treat model configuration as code: review it, check it in
- Document everything, especially failures