Kipoi: accelerating the community exchange and reuse of predictive models for genomics #886
Goals
Main points
Pros
Cons
Hi Evan, thanks for reviewing! Let me comment on the cons points.
The existing dependency management systems (e.g. pip and conda) need to be used properly and their installations continuously tested to guarantee easy usage. Kipoi addresses this in two steps: (i) it requires the user to specify the required dependencies in a consistent, machine-readable manner (e.g. not as part of a free-text README file), and (ii) once per day and upon every pull request, it automatically installs those dependencies and runs model predictions, making sure that the specified dependencies indeed work as expected. Without these tests, an update of a particular package (if not pinned to a particular version) might break the code unnoticed. Why don't we freeze all package versions for each model/dataloader? Doing so would prevent using multiple models in the same conda environment and would hence restrict the user.
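To illustrate, here is a minimal sketch of the kind of check this implies, using only the generic kipoi Python API. The model name "my_model" and the example file paths are placeholders, and the actual test infrastructure may do more (e.g. creating a fresh environment from the declared dependencies first):

```python
import kipoi

# Hypothetical model name and example files; the real tests use the
# example files bundled with each model in the kipoi/models repository.
model_name = "my_model"

# Check that the model and its dataloader can be loaded and used together.
model = kipoi.get_model(model_name)
Dl = kipoi.get_dataloader_factory(model_name)

data = Dl(fasta_file="example.fa", intervals_file="example_intervals.tsv").load_all()
preds = model.predict_on_batch(data["inputs"])
print(preds)
```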
Hyperparameter optimization packages such as hyperopt require the user to specify an objective function returning the loss. Arbitrary Python code can be executed in this objective function, so a model can also be loaded from Kipoi and fine-tuned on a new dataset. Here is a sketch of how I would implement transfer learning using hyperopt via kopt (https://github.com/Avsecz/kopt, a hyperopt wrapper for training Keras models):

```python
import numpy as np
import kipoi
import keras.layers as kl
from keras.models import Model, Sequential
from keras.optimizers import Adam
from kopt import CompileFN, KMongoTrials, test_fn
from hyperopt import fmin, tpe, hp
def data():
    """Return the whole training and test sets. Generator/iterator support is on my TODO list."""
    Dl = kipoi.get_dataloader_factory("my_model")
    train = Dl(fasta_file="path.fa", intervals_file="path.tsv").load_all()
    test = Dl(fasta_file="path.fa", intervals_file="test_path.tsv").load_all()
    return (train['inputs'], train['targets']), (test['inputs'], test['targets'])
def model(train, transfer_to, lr=0.001, base_model='Divergent421', n_tasks=1):
    # n_tasks: number of prediction tasks for the new output layer
    base = kipoi.get_model(base_model)
    # Transferred part: base model truncated at the layer `transfer_to`
    tmodel = Model(base.model.inputs,
                   base.model.get_layer(transfer_to).output)
    # New part: fresh sigmoid output layer on top of the transferred features
    top_model = Sequential([kl.Dense(n_tasks,
                                     activation="sigmoid",
                                     input_shape=tmodel.output_shape[1:])])
    # Stack the transferred and new parts
    final_model = Sequential([tmodel, top_model])
    final_model.compile(Adam(lr), "binary_crossentropy", ['acc'])
    return final_model
# Specify the optimization metrics
db_name="kipoi"
exp_name="model1"
objective = CompileFN(db_name, exp_name,
                      data_fn=data,
                      model_fn=model,
                      loss_metric="acc",           # which metric to optimize for
                      loss_metric_mode="max",      # try to maximize the metric
                      valid_split=.2,              # use 20% of the training data for the validation set
                      save_model='best',           # checkpoint the best model
                      save_results=True,           # save the results as .json (in addition to mongoDB)
                      save_dir="./saved_models/")  # place to store the models
# define the hyper-parameter ranges
# see https://github.com/hyperopt/hyperopt/wiki/FMin for more info
hyper_params = {
    "data": {},
    "model": {
        "lr": hp.loguniform("m_lr", np.log(1e-4), np.log(1e-2)),   # 0.0001 - 0.01
        "transfer_to": hp.choice("m_tt", ("dense_1", "dense_2")),  # transfer a different number of layers
        "base_model": "Divergent421",
    },
    "fit": {
        "epochs": 20
    }
}
# test model training, on a small subset for one epoch
test_fn(objective, hyper_params)
# run hyper-parameter optimization
# (KMongoTrials stores results in MongoDB, so a reachable mongod instance is needed)
trials = KMongoTrials(db_name, exp_name,
                      ip="localhost",
                      port=22334)
best = fmin(objective, hyper_params, trials=trials, algo=tpe.suggest, max_evals=100)
```
We disabled the option to force push, hence commits can't be purged. We are not using GitHub for backups in the canonical sense (e.g. backing up a PC disk), for which the article lists specifically designed solutions like CrashPlan. However, I agree that Git LFS has storage limits and we should consider other alternatives in the future. Regarding archiving, one idea would be to deposit a model to Zenodo and then make a pull request with a link to the kipoi/models repository (started an issue here). This would make the model directly citable with a DOI link.
I totally agree with you. We should be consistent and include all authors from the manuscript as the model authors (DeepCpG_DNA). DeepSEA correctly lists the first author first in the YAML file; however, I just noticed a bug which may swap the author list when summarizing the authors from multiple models.
Thank you for the response. I agree that Zenodo is probably a good long-term solution for archival. I think that Crossref can be used for retrieving manuscript information, and I assume that there are a number of publisher-specific APIs to look up manuscript information for a given DOI. I believe that doi2bib does this pretty well, and they have their web app code in a public GitHub repo.
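For instance, here is a minimal sketch (not from the thread) of retrieving the author list for a DOI via the public Crossref REST API, using the preprint DOI mentioned later in this thread; it assumes the DOI is registered with Crossref (bioRxiv preprints generally are):

```python
import requests

def crossref_authors(doi):
    """Fetch the author list for a DOI from the Crossref REST API."""
    r = requests.get("https://api.crossref.org/works/" + doi)
    r.raise_for_status()
    work = r.json()["message"]
    # Crossref returns authors as a list of {"given": ..., "family": ...} records
    return [a.get("given", "") + " " + a.get("family", "") for a in work.get("author", [])]

print(crossref_authors("10.1101/375345"))
```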
@Avsecz I recommend checking out @dhimmel's Manubot Python package for this. It was originally developed as part of the collaborative writing platform that we used to write this review manuscript (the deep review). Now it is a standalone package that can take a DOI, arXiv ID, PubMed ID, PubMed Central ID, or URL and return structured citation information, including authors.
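As a rough sketch, assuming Manubot is installed and that its `manubot cite` command accepts `doi:`-prefixed citation keys and prints CSL JSON (as its documentation describes), the same lookup could be done by shelling out to Manubot and parsing its output:

```python
import json
import subprocess

def manubot_csl(citekey):
    """Return the CSL JSON records Manubot produces for a citation key such as 'doi:10.1101/375345'."""
    out = subprocess.run(["manubot", "cite", citekey],
                         check=True, capture_output=True, text=True)
    return json.loads(out.stdout)

records = manubot_csl("doi:10.1101/375345")
# Print the family names of the authors of the first (and only) record
print([author.get("family") for author in records[0].get("author", [])])
```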
https://doi.org/10.1101/375345
Kipoi (not the paper) was previously mentioned in #837