Biomedical ontologies are large graph data structures that describe concepts in biology and medicine. An ongoing area of research is how to integrate ontologies from different sub-domains automatically, since they are too large to integrate by hand. This poster describes an elaborate approach using machine learning and distributed computing, implemented in Python.
How do you combine huge graphs of data based on their natural-language attributes over a Hadoop cluster using Python? This poster answers that question. It also covers practical experience with dexy, py.test, scikit-learn, nltk, rdflib and mrjob.
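
To make the idea of combining graphs by their natural-language attributes concrete, here is a minimal sketch that is not taken from the poster itself. It assumes each ontology has been reduced to a mapping from concept ID to a text label, and it uses plain token-overlap (Jaccard) similarity as a stand-in for whatever learned model the actual work uses. The function names, concept IDs, labels and threshold are all hypothetical, and only the standard library is used.

```python
import re


def tokens(label):
    """Lowercase a label and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_concepts(onto_a, onto_b, threshold=0.5):
    """Return (id_a, id_b, score) for label pairs scoring at or above threshold.

    This compares every pair of labels, which is quadratic in ontology size.
    """
    matches = []
    for id_a, label_a in onto_a.items():
        for id_b, label_b in onto_b.items():
            score = jaccard(tokens(label_a), tokens(label_b))
            if score >= threshold:
                matches.append((id_a, id_b, score))
    # Best-scoring candidate matches first.
    return sorted(matches, key=lambda m: -m[2])


# Hypothetical concept IDs and labels, made up for illustration only.
anatomy = {"A:1": "Heart valve", "A:2": "Left lung"}
disease = {"D:1": "heart valve disease", "D:2": "Lung, left"}

for id_a, id_b, score in match_concepts(anatomy, disease):
    print(id_a, id_b, round(score, 2))
```

The all-pairs loop is the part that stops scaling for ontologies with many thousands of concepts, which is one reason a job of this shape might be split across a cluster rather than run on a single machine.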