Consider TPOT your Data Science Assistant. TPOT is a Python tool that automatically creates and optimizes Machine Learning pipelines using genetic programming.
TPOT will automate the most tedious part of Machine Learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
[Figure: An example Machine Learning pipeline]
Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.
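For readers who have not seen a scikit-learn pipeline before, here is a minimal hand-written sketch of one. It is not output generated by TPOT, and the choice of steps (feature scaling followed by logistic regression) is only an assumption made for illustration. It also assumes scikit-learn 0.18 or later, where `train_test_split` lives in `sklearn.model_selection`.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=42
)

# A hand-written two-step pipeline: scale the features, then classify.
# The step names "scale" and "classify" are arbitrary labels.
example_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Fit on the training split, then measure accuracy on the held-out split.
example_pipeline.fit(X_train, y_train)
test_accuracy = example_pipeline.score(X_test, y_test)
print("Testing accuracy: %.3f" % test_accuracy)
```

A pipeline found by TPOT may chain different preprocessing steps and models, but it is built from the same kind of scikit-learn components.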
TPOT is still under active development and we encourage you to check back on this repository regularly for updates.
Please see the repository license for the licensing and usage information for TPOT.
TPOT is built on top of several existing Python libraries, including:
- NumPy
- SciPy
- pandas
- scikit-learn
- DEAP
Except for DEAP, all of the necessary Python packages can be installed via the Anaconda Python distribution, which we strongly recommend that you use. We also strongly recommend that you use Python 3 over Python 2 if you're given the choice.
NumPy, SciPy, pandas, and scikit-learn can be installed in Anaconda via the command:
conda install numpy scipy pandas scikit-learn
DEAP can be installed with pip via the command:
pip install deap
Finally, to install TPOT, run the following command:
pip install tpot
Please file a new issue if you run into installation problems.
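When tracking down an installation problem, it can help to check which of the libraries listed above Python can actually locate. The following is a small sketch using only the standard library. The helper name `find_missing_modules` is made up for this example, and it assumes Python 3.4 or later for `importlib.util.find_spec`.

```python
import importlib.util

# Import names of the libraries listed above, plus TPOT itself.
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["numpy", "scipy", "pandas", "sklearn", "deap", "tpot"]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules were found.")
```

Including the list of missing modules in an issue report makes the problem easier to reproduce.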
TPOT can be used in two ways: via code and via the command line. We will eventually develop a GUI for TPOT.
We've taken care to design the TPOT interface to be as similar as possible to scikit-learn.
TPOT can be imported just like any regular Python module. To import TPOT, type:
from tpot import TPOT
then create an instance of TPOT as follows:
from tpot import TPOT
pipeline_optimizer = TPOT()
Note that you can pass several parameters to the TPOT instantiation call:
- generations: The number of generations to run pipeline optimization for. Must be > 0. The more generations you give TPOT to run, the longer it takes, but it's also more likely to find better pipelines.
- population_size: The number of pipelines in the genetic algorithm population. Must be > 0. The more pipelines in the population, the slower TPOT will run, but it's also more likely to find better pipelines.
- mutation_rate: The mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
- crossover_rate: The crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to "breed" every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
- random_state: The random number generator seed for TPOT. Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.
- verbosity: How much information TPOT communicates while it's running. 0 = none, 1 = minimal, 2 = all
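To give a feel for how the mutation and crossover rates relate to the population size, here is a toy sketch in plain Python. It is not TPOT's or DEAP's actual implementation: it assumes a simplified model in which each pipeline in the population is independently crossed over, mutated, or copied unchanged. The function name `count_variations` and the rate values used below are chosen only for illustration.

```python
import random


def count_variations(population_size, mutation_rate, crossover_rate, seed=42):
    """Count how many individuals one generation would mutate or cross over.

    Toy model: each individual draws one random number and is crossed over,
    mutated, or copied unchanged depending on where that number falls.
    """
    if not 0.0 <= mutation_rate <= 1.0 or not 0.0 <= crossover_rate <= 1.0:
        raise ValueError("rates must be in the range [0.0, 1.0]")
    if mutation_rate + crossover_rate > 1.0:
        raise ValueError("in this toy model the rates may not sum to more than 1.0")

    rng = random.Random(seed)  # fixed seed so repeated runs give the same counts
    counts = {"crossover": 0, "mutation": 0, "unchanged": 0}
    for _ in range(population_size):
        draw = rng.random()
        if draw < crossover_rate:
            counts["crossover"] += 1
        elif draw < crossover_rate + mutation_rate:
            counts["mutation"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Illustrative values only.
counts = count_variations(population_size=100, mutation_rate=0.9, crossover_rate=0.05)
print(counts)
```

Under this model, a higher rate means more pipelines are changed each generation, and a fixed seed plays the same role as the random number generator seed described above: the same inputs give the same counts on every run.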
Some example code with custom TPOT parameters might look like:
from tpot import TPOT
pipeline_optimizer = TPOT(generations=100, rng_seed=42, verbosity=2)
Now TPOT is ready to work! You can pass TPOT some data with a scikit-learn-like interface:
from tpot import TPOT
pipeline_optimizer = TPOT(generations=100, rng_seed=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
then evaluate the final pipeline as such:
from tpot import TPOT
pipeline_optimizer = TPOT(generations=100, rng_seed=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
pipeline_optimizer.score(training_features, training_classes, testing_features, testing_classes)
Note that you need to pass the training data to the score() function so the pipeline re-trains the scikit-learn models on the training data.
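The re-train-then-score pattern described above can be sketched with a toy model in plain Python. This is not TPOT's code: `MajorityClassifier` and `retrain_and_score` are hypothetical names, and the data is made up for the example.

```python
from collections import Counter


class MajorityClassifier:
    """Toy model that always predicts the most common training class."""

    def fit(self, features, classes):
        self.majority_class_ = Counter(classes).most_common(1)[0][0]
        return self

    def predict(self, features):
        return [self.majority_class_ for _ in features]


def retrain_and_score(model, training_features, training_classes,
                      testing_features, testing_classes):
    """Fit the model on the training data, then return testing accuracy."""
    model.fit(training_features, training_classes)
    predictions = model.predict(testing_features)
    correct = sum(p == t for p, t in zip(predictions, testing_classes))
    return correct / len(testing_classes)


accuracy = retrain_and_score(
    MajorityClassifier(),
    [[0], [1], [2], [3]], ["a", "a", "a", "b"],  # training data
    [[4], [5]], ["a", "b"],                      # testing data
)
print(accuracy)  # 0.5
```

The model only ever sees the training data during fitting, so the returned accuracy reflects how well it generalizes to the testing data.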
To use TPOT via the command line, enter the following command to see the parameters that TPOT can receive:
tpot --help
The following parameters will display along with their descriptions:
- -i / INPUT_FILE: The path to the data file to optimize the pipeline on. Make sure that the class column in the file is labeled as "class".
- -is / INPUT_SEPARATOR: The character used to separate columns in the input file. Commas (,) and tabs (\t) are the most common separators.
- -g / GENERATIONS: The number of generations to run pipeline optimization for. Must be > 0. The more generations you give TPOT to run, the longer it takes, but it's also more likely to find better pipelines.
- -p / POPULATION: The number of pipelines in the genetic algorithm population. Must be > 0. The more pipelines in the population, the slower TPOT will run, but it's also more likely to find better pipelines.
- -mr / MUTATION_RATE: The mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
- -xr / CROSSOVER_RATE: The crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to "breed" every generation. We don't recommend that you tweak this parameter unless you know what you're doing.
- -s / RANDOM_STATE: The random number generator seed for TPOT. Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.
- -v / VERBOSITY: How much information TPOT communicates while it's running. 0 = none, 1 = minimal, 2 = all
An example command-line call to TPOT may look like:
tpot -i data/mnist.csv -is , -g 100 -s 42 -v 2
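The input file layout described above (one column labeled "class", with a configurable column separator) can be illustrated with a short standard-library sketch. This is not how TPOT itself loads data. The function name `split_features_and_classes` and the column names `pixel_0` and `pixel_1` are made up for the example, and it assumes every feature column holds numeric values.

```python
import csv
import io


def split_features_and_classes(csv_text, separator=","):
    """Split delimited text into feature names, feature rows, and class labels."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=separator)
    if reader.fieldnames is None or "class" not in reader.fieldnames:
        raise ValueError('input must contain a column labeled "class"')

    # Every column other than "class" is treated as a numeric feature.
    feature_names = [name for name in reader.fieldnames if name != "class"]
    features, classes = [], []
    for row in reader:
        features.append([float(row[name]) for name in feature_names])
        classes.append(row["class"])
    return feature_names, features, classes


example_csv = "pixel_0,pixel_1,class\n0,16,3\n8,2,7\n"
feature_names, features, classes = split_features_and_classes(example_csv)
print(feature_names)  # ['pixel_0', 'pixel_1']
print(features)       # [[0.0, 16.0], [8.0, 2.0]]
print(classes)        # ['3', '7']
```

For a tab-separated file, pass `separator="\t"`, which mirrors the role of the input separator option described above.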
Below is a minimal working example with scikit-learn's digits data set, a small MNIST-like practice data set.
from tpot import TPOT
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75)
tpot = TPOT(generations=5)
tpot.fit(X_train, y_train)
tpot.score(X_train, y_train, X_test, y_test)
Running this code should discover a pipeline that achieves ~98% testing accuracy.
We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.
Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, please file a new issue on this repository so we can review your issue.
TPOT was developed in the Computational Genetics Lab with funding from the NIH. We're incredibly grateful for their support during the development of this project!