Discover the most effective run-time strategies on the OpenAI o1-preview model, improving accuracy in medical language tasks.
The post Advances in run-time strategies for next-generation foundation models appeared first on Microsoft Research.
]]>Groundbreaking advancements in frontier language models are progressing rapidly, paving the way for boosts in accuracy and reliability of generalist models, making them highly effective in specialized domains. As part of our ongoing exploration of foundation model capabilities, we developed Medprompt last year—a novel approach to maximize model performance on specialized domain and tasks without fine-tuning. By leveraging multiphase prompting, Medprompt optimizes inference by identifying the most effective chain-of-thought (CoT) examples at run time and drawing on multiple calls to refine output. When deployed with GPT-4, Medprompt achieved an impressive 90.2% accuracy on the MedQA benchmark (USMLE-style), outperforming all other methods.
Less than a year later, our tests show the OpenAI o1-preview demonstrated superior performance over Medprompt, reaching 96% on the same benchmark (Figure 1)—without using sophisticated prompt guidance and control. This advancement is driven by the model’s integration of run-time strategies at its core, enabling state-of-the-art results on medical licensing exams in the United States and Japan, medical subsets of the Massive Multitask Language Understanding (MMLU) benchmark, and nursing exams (NCLEX) as shown in Figure 2.
These results are notable, prompting us to publish our recent study, findings, and analyses, From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond (opens in new tab). But the numbers are only part of the story. In this blog, we discuss prompting strategies to make the most of o1-preview models and other factors to consider as well as directions forward for run-time strategies.
The introduction of the OpenAI o1 model series marks a significant shift from prior GPT models. Unlike GPT, o1 models are trained using reinforcement learning (RL) techniques that enable them to “think” before generating outputs. While Medprompt relies on a cascade of operations with GPT-4 at run time guided by a multistage prompt, the o1 series incorporates this run-time reasoning directly into its RL-based design. The built-in functionality enables the o1 models to significantly outperform even the best results using GPT-4 and Medprompt. The performance gains come with a notable tradeoff: its per-token cost was approximately six times that of GPT-4o at the time of our evaluation. While the results for GPT-4o with Medprompt fall short of o1-preview model performance, the combination offers a more cost-effective alternative. The cost-benefit tradeoffs are highlighted in the following figure, with the x-axis presented on a logarithmic scale.
The o1-preview model exhibits distinct run-time behaviors compared to the GPT series. While some of our more dynamic prompting strategies performed better than expected with o1-preview models, our most tried-and-true strategy was anything but consistent throughout our evaluation. Figure 4 captures specific performance results for Tailored Prompt, Ensembling, and Few-Shot Prompting on o1-preview. Here’s a summary of our findings:
Spotlight: Blog post
We expanded our research to include a new multilingual benchmark based on the Japanese national medical licensing exam. The JMLE (Japanese Medical Licensing Examination) is written in Japanese and administered in February 2024, after the o1-preview model’s knowledge cutoff. Even without translation to English, the o1-preview model achieved a remarkable score of 98.2% accuracy (Figure 5), well above the exam’s minimum passing score of approximately 80%.
For fun, we conducted tests to determine whether increasing the number of reasoning tokens could improve performance. Our findings showed that by adjusting the prompt, we were able to consistently increase the number of reasoning tokens used by o1-preview, and the increase was directly correlated with improved performance as demonstrated in Figure 6.
Bottom line: There’s a little something for everyone when it comes to run-time strategies. We’re excited by the performance gains from GPT models to o1-preview models. While these improvements are significant, so is the cost. For those needing proven accuracy on a budget, Medprompt leveraging calls to GPT-4 is a viable option for medicine and beyond. We summarize the relative performance of prompting strategies in Figure 7 to determine the best option, or check out the paper for a detailed breakdown of every dataset, experimental configuration, and prompt template (opens in new tab).
We highlighted several considerations in the paper that are worth checking out. Here are three opportunities that are top of mind:
The post Advances in run-time strategies for next-generation foundation models appeared first on Microsoft Research.
]]>TamGen uses generative AI to design new drug candidate compounds to treat TB, going beyond traditional methods by generating novel chemical structures. Learn how a collaboration with the Global Health Drug Discovery Institute is making this possible.
The post Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation appeared first on Microsoft Research.
]]>The Global Health Drug Discovery Institute (opens in new tab) (GHDDI) and Microsoft Research have reached a milestone in tuberculosis (TB) drug research with TamGen (opens in new tab), an open-source (opens in new tab), transformer-based chemical language model for developing target-specific drug compounds. Working in collaboration, the joint team successfully identified several promising inhibitors for a TB protease, with the most effective compound showing significant bioactivity. Research shows that TamGen can also optimize existing molecules by designing target-aware molecule fragments, potentially enabling the discovery of novel compounds that build on a known molecular core structure.
Generative AI is opening new avenues for scientific exploration by allowing computers to autonomously learn and produce original content. TamGen offers a new approach to drug discovery by applying the principles of generative AI to molecular design. Unlike traditional methods, which depend on systematically screening known compounds—a process that is long, complex, and costly due to its reliance on empirical knowledge and the time-consuming task of exploring a vast chemical library—generative AI provides opportunities for designing entirely new chemical structures.
TamGen goes beyond analyzing existing data by generating chemically diverse compounds that conventional approaches might miss. Figure 1 shows that generative AI expands chemical exploration, allowing for a deeper and more comprehensive search for therapeutic solutions compared to traditional methods.
TamGen’s workflow uses generative AI to design target-specific chemical compounds. Building on the success of large language models (LLMs), we adapted a similar approach for molecular generation, using a training method like that of GPT models, which involves next-token prediction. Molecules were first converted into a simplified molecular input line entry system (SMILES)—a notation representing molecular structures as symbol sequences, similar to text. We then developed a protein encoder to process information about proteins, including their 3D structure.
A contextual encoder combines insights from medical professionals with data on the protein target and existing compounds that have proven to be effective or promising. Using expert knowledge and computational analysis, this encoder guides the compound generator to produce new molecules that are more likely to bind to a given protein. This workflow is illustrated in Figure 2.
To evaluate TamGen’s performance, we compared it to five other common methods used to create 3D shapes of molecules intended to bind to certain proteins. We evaluated these methods using the CrossDocked benchmark, a dataset used in AI research to assess the quality of molecule generation conditioned on a target protein.
Evaluation metrics:
The findings, illustrated in Figure 3, show TamGen’s overall performance. While other methods may produce compounds that bind more strongly, they often include multiple interconnected ring structures. Research indicates that more of these structures can lower synthesis accessibility (SAS) and increase cellular toxicity, making these compounds harder to develop. We believe that molecular pretraining of the model contributed to the overall effectiveness of the compounds TamGen generated.
To ensure real-world applicability, we also validated our findings in a hands-on lab environment. Here, we focused on the ClpP protease in Mycobacterium tuberculosis as the target because it plays a significant role in the bacterium’s survival under stress conditions. We proposed the Design-Refine-Test pipeline to effectively identify molecular compounds for TB drug discovery.
Design stage: We began by using TamGen to analyze the binding pocket of the protease, where molecules can attach and influence its function. TamGen generated about 2,600 potential compounds that could fit into this pocket. We assessed these compounds based on how well they could attach to the protease and their predicted biological effects, narrowing it down to four promising candidates.
Refine stage: Next, we entered the four compounds into TamGen, along with three molecular fragments that had been validated in previous lab experiments. This generated a total of 8,600 new compounds, which we screened again using the same criteria, eventually narrowing the selection to 296 compounds.
Test stage: Because synthesizing all 296 compounds wasn’t feasible, we identified similar compounds available in commercial libraries and tested their initial activity against TB. Five compounds showed promising results. We then synthesized one of the originals and two variants of another. Additionally, we categorized the generated compounds into clusters, selected the top 10% from each cluster based on docking scores, and after manual review, synthesized eight more compounds.
The team from Microsoft Research generated the compounds by TamGen, and the GHDDI team conducted binding analysis, structure–activity relationship studies, and lab experiments to verify the compounds’ inhibitory effect on the ClpP protease, measuring their capacity to interfere with or reduce its activity. Lower IC50 values signify greater potency. Out of the 16 compounds tested, 14 showed strong inhibitory activity measuring under 40 µM, indicating high potential. The most effective compound had a measured IC50 value of 1.88 µM.
In addition to generating new molecules, TamGen can optimize existing ones by designing smaller molecular fragments. In this fragment generation process, TamGen builds on a given protein target and a molecular core structure to design new compounds around that core. By incorporating information about the target protein, it generates fragments that are highly specific to the target. This approach moves beyond traditional methods that rely on pre-existing databases, which often limit both novelty and effectiveness of molecular fragments.
For fragment generation, we adjusted the input to TamGen’s compound generator. We modified the SMILES string to ensure it ended at the desired growth site. This was done by specifying the fragment we wanted to retain and its connection point for further growth. The tailored SMILES string was then fed into the compound generator to extend the molecule.
We evaluated this method by targeting the ClpP protease for TB, achieving a more than tenfold improvement in the binding affinity of the generated compound compared to the original. Some compounds also demonstrated slow binding, indicating potential for prolonged action and improved selectivity for the target protein.
TamGen showcases the transformative potential of generative AI in drug design, combining advanced molecular modeling with researcher-AI collaboration. Tasks that once took years can now be accomplished in a fraction of the time. This research underscores AI’s expanding role in drug discovery and its promise for developing effective treatments against persistent infectious diseases like TB.
Looking ahead, we plan to integrate advanced techniques into TamGen, including diffusion models for generating 3D structures, reinforcement learning to apply physical constraints, and molecular dynamics simulations to capture proteins’ shifting shapes. These enhancements aim to improve how well generated compounds bind to target proteins, increase their ability to be synthesized, and strengthen other critical drug properties.
The post Accelerating drug discovery with TamGen: A generative AI approach to target-aware molecule generation appeared first on Microsoft Research.
]]>Introducing a new approach to graph-enabled RAG. LazyGraphRAG needs no prior summarization of source data, avoiding prohibitive up-front indexing costs. It’s inherently scalable in cost and quality across multiple methods and search mechanisms.
The post LazyGraphRAG: Setting a new standard for quality and cost appeared first on Microsoft Research.
]]>The GraphRAG project (opens in new tab) aims to expand the class of questions that AI systems can answer over private datasets by leveraging the implicit relationships within unstructured text.
A key advantage of GraphRAG over conventional vector RAG (or “semantic search”) is its ability to answer global queries that address the entire dataset, such as “what are the main themes in the data?”, or “what are the most important implications for X?”. Conversely, vector RAG excels for local queries where the answer resembles the query and can be found within specific text regions, as is typically the case for “who”, “what”, “when”, and “where” questions.
In recent blog posts, we have shared two new query mechanisms that exploit the rich, summary-based data index created by GraphRAG to improve local search performance and global search costs, respectively.
In this blog post, we introduce a radically different approach to graph-enabled RAG that requires no prior summarization of the source data, avoiding the up-front indexing costs that may be prohibitive for some users and use cases. We call this approach “LazyGraphRAG”.
A key advantage of LazyGraphRAG is its inherent scalability in terms of both cost and quality. Across a range of competing methods (standard vector RAG, RAPTOR (opens in new tab), and GraphRAG local (opens in new tab), global (opens in new tab), and DRIFT (opens in new tab) search mechanisms), LazyGraphRAG shows strong performance across the cost-quality spectrum as follows:
LazyGraphRAG is coming soon to our open-source GraphRAG library (opens in new tab), providing a unified query interface for both local and global queries over a lightweight data index that is comparable in cost to standard vector RAG.
LazyGraphRAG aims to blend the advantages of vector RAG and Graph RAG while overcoming their respective limitations:
LazyGraphRAG combines best-first and breadth-first search dynamics in an iterative deepening manner (Table 1). Compared to the global search mechanism of full GraphRAG, this approach is “lazy” in ways that defer LLM use and dramatically increase the efficiency of answer generation. Overall performance can be scaled via a single main parameter – the relevance test budget – that controls the cost-quality trade-off in a consistent manner.
GraphRAG | LazyGraphRAG | |
---|---|---|
Build index | Uses an LLM to extract and describe entities and their relationships, b) uses an LLM to summarize all observations of each entity and relationship, c) uses graph statistics to optimize the entity graph and extract hierarchical community structure | Uses NLP noun phrase extraction to extract concepts and their co-occurrences, b) uses graph statistics to optimize the concept graph and extract hierarchical community structure |
Summarize index | Uses an LLM to summarize entities and relationships in each community | None – the “lazy” approach defers all LLM use until query time |
Refine query | None – the original query is used throughout | Uses an LLM to a) identify relevant subqueries and recombine them into a single expanded query, b) refine subqueries with matching concepts from the concept graph |
Match query | None – all queries are answered using all community summaries (breadth first) | For each of q subqueries [3-5]: – Uses text chunk embeddings and chunk-community relationships to first rank text chunks by similarity to the query, then rank communities by the rank of their top-k text chunks (best first) – Uses an LLM-based sentence-level relevance assessor to rate the relevance of the top-k untested text chunks from communities in rank order (breadth first) – Recurses into relevant sub-communities after z successive communities yield zero relevant text chunks (iterative deepening) – Terminates when no relevant communities remain or relevance test budget / q is reached |
Map answers | Uses an LLM to answer the original query over random batches of community summaries in parallel | For each of q subqueries [3-5]: – Builds a subgraph of concepts from the relevant text chunks – Uses the community assignments of concepts to group related chunks together – Uses an LLM to extract subquery-relevant claims from groups of related chunks as a way of focusing on relevant content only – Ranks and filters extracted claims to fit a pre-defined context window size |
Reduce answers | Uses an LLM to answer the original query using the mapped answers | Uses an LLM to answer the expanded query using the extracted map claims |
We compared LazyGraphRAG at varying levels of relevance test budget to a range of competing methods, as follows:
Condition | Description |
---|---|
Z100_Lite | LazyGraphRAG with a relevance test budget of 100 and using a low-cost LLM model at all steps |
Z500 | LazyGraphRAG with a relevance test budget of 500, using a low-cost LLM for relevance tests and a more advanced (higher cost) LLM for query refinement and map/reduce answer generation |
Z1500 | LazyGraphRAG with a relevance test budget of 1,500, using a low-cost LLM for relevance tests and a more advanced (higher cost) LLM for query refinement and map/reduce answer generation |
C1 | GraphRAG Global Search at community level 1 |
C2 | GraphRAG Global Search at community level 2 |
C3_Dynamic | GraphRAG Global Search at community level 3 using dynamic community selection (a cost saving solution introduced in our previous blog post) |
LS | GraphRAG Local Search |
DRIFT | GraphRAG DRIFT Search |
SS_8K | Semantic Search with vector RAG retrieving 8k tokens of input |
SS_64K | Semantic Search with vector RAG retrieving 64k tokens of input (included to examine the effect of long-context window on vector RAG) |
RAPTOR | A hierarchical clustering-based approach to RAG that shares some similar features to LazyGraphRAG |
Figure 1 shows LazyGraphRAG win rates against each of the eight competing conditions. At the lowest budget level of 100 relevance tests using a low-cost LLM model (same cost as SS_8K), LazyGraphRAG significantly outperforms all conditions on local and global queries, except for GraphRAG global search conditions for global queries. At an increased budget of 500 relevance tests using a more advanced LLM model (4% of the query cost of C2), LazyGraphRAG significantly outperforms all conditions on both local and global queries. LazyGraphRAG’s win rates continue to increase as the relevance test budget increases to 1,500, demonstrating its scalability in terms of cost vs. quality.
An example comparison is shown in Table 3.
Query: Describe what factors individuals in the United States should consider when selecting a health insurance plan during the open enrollment period for 2024 health insurance plans.
Vector RAG (64K context) | LazyGraphRAG |
---|---|
# Factors to Consider When Selecting a Health Insurance Plan During Open Enrollment for 2024 Selecting a health insurance plan during the open enrollment period can be a complex process, but understanding the key factors to consider can help individuals make informed decisions. Here are some important aspects to keep in mind: ## Coverage and Networks One of the primary considerations is the coverage network of the plan. Insurers form coverage networks of hospitals and doctors, and it is crucial to ensure that your preferred healthcare providers are included in the network of the plan you are considering. Many marketplace plans, such as Health Maintenance Organizations (HMOs) and Exclusive Provider Organizations (EPOs), do not cover claims for most non-emergency care outside their networks [Data: Sources (47419, 47421, 47420, 47410, 47408)]. ## Costs: Premiums, Deductibles, and Out-of-Pocket Maximums Understanding the costs associated with a health insurance plan is essential. This includes the monthly premiums, deductibles, copays, and coinsurance. Plans are typically categorized into different levels such as platinum, gold, silver, and bronze, with bronze plans generally having lower premiums but higher out-of-pocket costs [Data: Sources (47411, 47412, 47416, 47414, 47418)]. Additionally, it is important to consider the out-of-pocket maximum, which is the most you will have to pay in a year for covered services before the insurance starts covering all costs [Data: Sources (47416, 47415)]. ## Tax Credits and Cost-Sharing Reductions Individuals with lower incomes may qualify for income-based tax credits to help reduce the cost of premiums. It is important to accurately estimate your income for the coming year when applying for these credits to avoid having to repay any excess credits received [Data: Sources (47412, 47413)]. Additionally, those with incomes below 250% of the federal poverty level may qualify for cost-sharing reductions if they select a silver plan, which can lower deductibles and out-of-pocket maximums [Data: Sources (47417, 47418)]. ## Special Considerations for Medicaid and Medicare For those transitioning from Medicaid, it is important to understand that marketplace plans may come with higher premiums and copays compared to Medicaid. Additionally, many people who have been removed from Medicaid may be eligible to enroll in marketplace plans through an extended enrollment period [Data: Sources (6284, 6285, 6286, 6287)]. For seniors, the Medicare open enrollment period allows for switching between traditional Medicare and Medicare Advantage plans, with considerations for prescription drug coverage and additional benefits like dental and vision [Data: Sources (72060, 72061, 72062, 72063, 72064)]. ## Assistance and Resources Navigating the complexities of health insurance can be challenging, and seeking assistance from insurance agents or government-funded navigators can be beneficial. These professionals can help individuals understand their options and make informed decisions without recommending specific plans [Data: Sources (47424, 47425, 47426)]. ## Conclusion Selecting the right health insurance plan requires careful consideration of various factors including coverage networks, costs, eligibility for tax credits and cost-sharing reductions, and available assistance. By thoroughly evaluating these aspects, individuals can choose a plan that best meets their healthcare needs and financial situation for the upcoming year. | ### Types of Health Insurance Plans When selecting a health insurance plan during the open enrollment period for 2024, individuals in the United States have several options to consider: 1. **Marketplace Plans**: These are available under the Affordable Care Act (ACA) and are categorized into platinum, gold, silver, and bronze levels. Bronze plans generally have lower premiums but higher out-of-pocket costs, while platinum plans have higher premiums but lower out-of-pocket costs [Data: Sources (47411, 47420, 47421)]. 2. **Low-Cost Marketplace Plans**: These often come with high deductibles, requiring patients to pay thousands of dollars before most coverage kicks in. They also have annual out-of-pocket maximums that can exceed $9,000 for individuals and $18,000 for families [Data: Sources (47415, 47416, 47414)]. 3. **Exclusive Provider Organizations (EPOs) and Health Maintenance Organizations (HMOs)**: EPOs and HMOs generally restrict patients to a network of doctors and require a primary care doctor to direct care. They tend to be cheaper but lack out-of-network flexibility [Data: Sources (47420, 47421, 43218, 43217)]. 4. **Preferred Provider Organizations (PPOs)**: These plans allow for out-of-network services but at a higher cost. They offer more flexibility compared to HMOs and EPOs [Data: Sources (43217)]. 5. **High-Deductible Health Plans (HDHPs)**: Defined as plans with a deductible of at least $1,600 for individual coverage or $3,200 for family coverage, with out-of-pocket maximums of no more than $8,050 or $16,100, respectively. HDHPs usually have lower premiums, and sometimes companies contribute to a health savings account (HSA) to help cover the deductible [Data: Sources (43227, 43226)]. 6. **Medicare Advantage**: These are privately run versions of the federal government’s Medicare program, mostly for people aged 65 and over. They often include prescription drug coverage and may offer additional benefits like dental or vision coverage not provided by traditional Medicare [Data: Sources (72063, 72061, 72060, 72062)]. 7. **Short-Term Health Insurance Plans**: These plans are limited to three months and can only be renewed for a maximum of four months under new rules. They are intended for temporary coverage but often lack comprehensive benefits [Data: Sources (97999, 97995, 97996, 97997)]. ### Cost Factors: Premiums, Deductibles, Co-pays, and Out-of-Pocket Maximums The overall cost of health insurance plans in 2024 is influenced by several factors: – **Premiums**: This is the set monthly cost you pay for your health insurance plan. Premiums have been rising, with a notable increase of 7% for both family and single plans in 2023, partly due to inflation [Data: Sources (83383, 83382, 83384, 83385, 83381, +more)]. – **Deductibles**: The amount you pay out-of-pocket for health care services before your insurance starts to pay. For HDHPs, the deductible is at least $1,600 for individual coverage or $3,200 for family coverage [Data: Sources (43226, 43225)]. – **Co-pays and Co-insurance**: These are the costs you pay each time you receive a medical service. Co-pays are fixed amounts, while co-insurance is a percentage of the service cost. – **Out-of-Pocket Maximums**: This is the maximum amount you will pay for covered services in a year. For example, HDHPs have out-of-pocket maximums of no more than $8,050 for individual coverage or $16,100 for family coverage [Data: Sources (43227, 43226)]. ### Provider Networks: In-Network vs. Out-of-Network The network of healthcare providers is a crucial factor in selecting a health insurance plan: – **In-Network Providers**: These are doctors and hospitals that have agreements with your insurance plan to provide services at lower rates. Ensuring your preferred doctors and specialists are in-network can save you significant costs [Data: Sources (43216, 47419)]. – **Out-of-Network Providers**: Services from these providers are usually more expensive and may not be covered at all, except in emergencies. PPO plans offer some out-of-network coverage but at a higher cost, while HMOs and EPOs generally do not cover non-emergency out-of-network care [Data: Sources (43217, 47421)]. ### Specific Medical Needs and Services When selecting a health insurance plan, individuals should consider their specific medical needs: – **Prescription Drugs**: Ensure that your medications are covered by the plan’s formulary, as drug coverage can change annually [Data: Sources (43220, 43218, 43219)]. – **Mental Health Services**: Coverage for mental health treatments is essential, especially with new rules pushing insurers to increase their coverage of these services [Data: Sources (97031, 97028, 97027, 97030, 97033, +more)]. – **Chronic Conditions**: Plans should cover ongoing treatments and medications for chronic conditions. Medicare Supplement Insurance (Medigap) can help cover gaps in Medicare for chronic disease management [Data: Sources (93367, 93368)]. – **Preventive Care**: Coverage for preventive services like cancer screenings and HIV prevention is mandated under the ACA, though its future is uncertain due to ongoing legal battles [Data: Sources (71106, 71109, 71098, 71099, 71100, +more)]. ### Key Dates and Steps for Open Enrollment The open enrollment period for 2024 health insurance plans involves several key dates and steps: – **Marketplace Plans**: Open enrollment starts on November 1, 2023, and runs through mid-December in most states, ending on January 16, 2024 [Data: Sources (47419, 47411, 47416, 47421, 47409, +more)]. – **Medicare**: Open enrollment for Medicare runs from October 15, 2023, to December 7, 2023. During this period, individuals can choose between traditional Medicare, Medicare Advantage plans, and prescription drug plans [Data: Sources (72061, 72063, 72060, 72062)]. – **Special Enrollment Periods**: Individuals who lose coverage due to life events like job loss or moving may qualify for special enrollment periods. For example, those removed from Medicaid may enroll in marketplace plans through July 2024 [Data: Sources (6288, 6289)]. By considering these factors, individuals can make informed decisions about their health insurance coverage for 2024, ensuring they select plans that best meet their medical needs and financial situations. |
LazyGraphRAG shows that it is possible for a single, flexible query mechanism to substantially outperform a diverse range of specialized query mechanisms across the local-global query spectrum, and to do so without the up-front costs of LLM data summarization. Its very fast and almost-free indexing make LazyGraphRAG ideal for one-off queries, exploratory analysis, and streaming data use cases, while its ability to smoothly increase answer quality with increasing relevance test budget makes it a valuable tool for benchmarking RAG approaches in general (e.g., “RAG approach X beats LazyGraphRAG with budget Y for task Z”).
Does this mean that all graph-enabled RAG should be lazy? We believe the answer is no, for three reasons:
We will be exploring these directions in the coming period, with all advances (including LazyGraphRAG itself) released via the GraphRAG GitHub repository (opens in new tab). Stay tuned!
The post LazyGraphRAG: Setting a new standard for quality and cost appeared first on Microsoft Research.
]]>Research manager Karin Strauss and members of the DNA Data Storage Project reflect on the path to developing a synthetic DNA–based system for archival data storage, including the recent open-source release of its most powerful algorithm for DNA error correction.
The post Ideas: The journey to DNA data storage appeared first on Microsoft Research.
]]>Behind every emerging technology is a great idea propelling it forward. In the Microsoft Research Podcast series Ideas, members of the research community at Microsoft discuss the beliefs that animate their research, the experiences and thinkers that inform it, and the positive human impact it targets.
Accommodating the increasing amounts of digital data the world is producing requires out-of-the-box thinking. In this episode, guest host Karin Strauss, an innovation strategist and senior principal research manager at Microsoft, brings together members of her team to explore a more sustainable, more cost-effective alternative for archival data storage: synthetic DNA. Strauss, Principal Researcher Bichlien Nguyen, Senior Researcher Jake Smith, and Partner Research Manager Sergey Yekhanin discuss how Microsoft Research’s contributions have helped bring “science fiction,” as Strauss describes it, closer to reality, including its role in establishing the DNA Data Storage Alliance to foster collaboration in developing the technology and to establish specifications for interoperability. They also talk about the scope of collaboration with other fields, such as the life sciences and electrical and mechanical engineering, and the coding theory behind the project, including the group’s most powerful algorithm for DNA error correction, Trellis BMA, which is now open source.
[TEASER]
[MUSIC PLAYS UNDER DIALOGUE]
JAKE SMITH: This really starts from the fundamental data production–data storage gap, where we produce way more data nowadays than we could ever have imagined years ago. And it’s more than we can practically store in magnetic media. And so we really need a denser medium on the other side to contain that. DNA is extremely dense. It holds far, far more information per unit volume, per unit mass than any storage media that we have available today. This, along with the fact that DNA is itself a relatively rugged molecule—it lives in our body; it lives outside our body for thousands and thousands of years if we, you know, leave it alone to do its thing—makes it a very attractive media.
BICHLIEN NGUYEN: It’s such a futuristic technology, right? When you begin to work on the tech, you realize how many disciplines and domains you actually have to reach in and leverage. It’s really interesting, this multidisciplinarity, because we’re, in a way, bridging software with wetware with hardware. And so you, kind of, need all the different disciplines to actually get you to where you need to go.
SERGEY YEKHANIN: We all work for Microsoft; we are all Microsoft researchers. Microsoft isn’t a startup. But that team, the team that drove the DNA Data Storage Project, it did feel like a startup, and it was something unusual and exciting for me.
SERIES INTRO: You’re listening to Ideas, a Microsoft Research Podcast that dives deep into the world of technology research and the profound questions behind the code. In this series, we’ll explore the technologies that are shaping our future and the big ideas that propel them forward.
[MUSIC FADES]
GUEST HOST KARIN STRAUSS: I’m your guest host Karin Strauss, a senior principal research manager at Microsoft. For nearly a decade, my colleagues and I—along with a fantastic and talented group of collaborators from academia and industry—have been working together to help close the data creation–data storage gap. We’re producing far more digital information than we can possibly store. One solution we’ve explored uses synthetic DNA as a medium, and over the years, we’ve contributed to steady and promising progress in the area. We’ve helped push the boundaries of how much DNA writer can simultaneously store, shown that full automation is possible, and helped create an ecosystem for the commercial success of DNA data storage. And just this week, we’ve made one of our most advanced tools for encoding and decoding data in DNA open source. Joining me today to discuss the state of DNA data storage and some of our contributions are several members of the DNA Data Storage Project at Microsoft Research: Principal Researcher Bichlien Nguyen, Senior Researcher Jake Smith, and Partner Research Manager Sergey Yekhanin. Bichlien, Jake, and Sergey, welcome to the podcast.
BICHLIEN NGUYEN: Thanks for having us, Karin.
SERGEY YEKHANIN: Thank you so much.
JAKE SMITH: Yes, thank you.
STRAUSS: So before getting into the details of DNA data storage and our work, I’d like to talk about the big idea behind the work and how we all got here. I’ve often described the DNA Data Storage Project as turning science fiction into reality. When we started the project in 2015, though, the idea of using DNA for archival storage was already out there and had been for over five decades. Still, when I talked about the work in the area, people were pretty skeptical in the beginning, and I heard things like, “Wow, why are you thinking about that? It’s so far off.” So, first, please share a bit of your research backgrounds and then how you came to work on this project. Where did you first encounter this idea, what do you remember about your initial impressions—or the impressions of others—and what made you want to get involved? Sergey, why don’t you start.
YEKHANIN: Thanks so much. So I’m a coding theorist by training, so, like, my core areas of research have been error-correcting codes and also computational complexity theory. So I joined the project probably, like, within half a year of the time that it was born, and thanks, Karin, for inviting me to join. So, like, that was roughly the time when I moved from a different lab, from the Silicon Valley lab in California to the Redmond lab, and actually, it just so happened that at that moment, I was thinking about what to do next. Like, in California, I was mostly working on coding for distributed storage, and when I joined here, that effort kept going. But I had some free cycles, and that was the moment when Karin came just to my office and told me about the project. So, indeed, initially, it did feel a lot like science fiction. Because, I mean, we are used to coding for digital storage media, like for magnetic storage media, and here, like, this is biology, and, like, why exactly these kind of molecules? There are so many different molecules. Like, why that? But honestly, like, I didn’t try to pretend to be a biologist and make conclusions about whether this is the right medium or the wrong medium. So I tried to look into these kinds of questions from a technical standpoint, and there was a lot of, kind of, deep, interesting coding questions, and that was the main attraction for me. At the same time, I wasn’t convinced that we will get as far as we actually got, and I wasn’t immediately convinced about the future of the field, but, kind of, just the depth and the richness of the, what I’ll call, technical problems, that’s what made it appealing for me, and I, kind of, enthusiastically joined. And also, I guess, the culture of the team. So, like, it did feel like a startup. Like, we all work for Microsoft; we’re all Microsoft researchers. Microsoft isn’t a startup. But that team, the team that drove the DNA Data Storage Project, it did feel like a startup, and it was something unusual and exciting for me.
NGUYEN: Oh, I love that, Sergey. So my background is in organic chemistry, and Karin had reached out to me, and I interviewed not knowing what Karin wanted. Actually … so I took the job kind of blind because I was like, “Hmm, Microsoft Research? … DNA biotech? …” I was very, very curious, and then when she told me that this project was about DNA data storage, I was like, this is a crazy, crazy idea. I definitely was not sold on it, but I was like, well, look, I get to meet and work with so many interesting people from different backgrounds that, one, even if it doesn’t work out, I’m going to learn something, and, two, I think it could work, like it could work. And so I think that’s really what motivated me to join.
SMITH: The first thing that you think when you hear about we’re going to take what is our hard drive and we’re going to turn that into DNA is that this is nuts. But, you know, it didn’t take very long after that. I come from a chemistry, biotech-type background where I’ve been working on designing drugs, and there, DNA is this thing off in the nethers, you know. You look at it every now and then to see what information it can tell you about, you know, what maybe your drug might be hitting on the target side, and it’s, you know, that connection—that the DNA contains the information in the living systems, the DNA contains the information in our assays, and why could the DNA not contain the information that we, you know, think more about every day, that information that lives in our computers—as an extremely cool idea.
STRAUSS: Through our work, we’ve had years to wrap our heads around DNA data storage. But, Jake, could you tell us a little bit about how DNA data storage works and why we’re interested in looking into the technology?
SMITH: So you mentioned it earlier, Karin, that this really starts from the fundamental data production–data storage gap, where we produce way more data nowadays than we could ever have imagined years ago. And it’s more than we can practically store in magnetic media. This is a problem because, you know, we have data; we have recognized the value of data with the rise of large language models and these other big generative models. The data that we do produce, our video has gone from, you know, substantially small, down at 480 resolution, all the way up to things at 8K resolution that now take orders of magnitude more storage. And so we really need a denser medium on the other side to contain that. DNA is extremely dense. It holds far, far more information per unit volume, per unit mass than any storage media that we have available today. This, along with the fact that DNA is itself a relatively rugged molecule—it lives in our body; it lives outside our body for thousands and thousands of years if we, you know, leave it alone to do its thing—makes it a very attractive media, particularly compared to the traditional magnetic media, which has lower density and a much shorter lifetime on the, you know, scale of decades at most.
So how does DNA data storage actually work? Well, at a very high level, we start out in the digital domain, where we have our information represented as ones and zeros, and we need to convert that into a series of A’s, C’s, T’s, and G’s that we could then actually produce, and this is really the domain of Sergey. He’ll tell us much more about how this works later on. For now, let’s just assume we’ve done this. And now our information, you know, lives in the DNA base domain. It’s still in the digital world. It’s just represented as A’s, C’s, T’s, and G’s, and we now need to make this physical so that we can store it. This is accomplished through large-scale DNA synthesis. Once the DNA has been synthesized with the sequences that we specified, we need to store it. There’s a lot of ways we can think about storing it. Bichlien’s done great work looking at DNA encapsulation, as well as, you know, other more raw just DNA-on-glass-type techniques. And we’ve done some work looking at the susceptibility of DNA stored in this unencapsulated form to things like atmospheric humidity, to temperature changes and, most excitingly, to things like neutron radiation. So we’ve stored our data in this physical form, we’ve archived it, and coming back to it, likely many years in the future because the properties of DNA match up very well with archival storage, we need to convert it back into the digital domain. And this is done through a technique called DNA sequencing. What this does is it puts the molecules through some sort of machine, and on the other side of the machine, we get out, you know, a noisy representation of what the actual sequence of bases in the molecules were. We have one final step. We need to take this series of noisy sequences and convert it back into ones and zeros. Once we do this, we return to our original data and we’ve completed, let’s call it, one DNA data storage cycle.
STRAUSS: We’ll get into this in more detail later, but maybe, Sergey, we dig a little bit on encoding-decoding end of things and how DNA is different as a medium from other types of media.
YEKHANIN: Sure. So, like, I mean, coding is an important aspect of this whole idea of DNA data storage because we have to deal with errors—it’s a new medium—but talking about error-correcting codes in the context of DNA data storage, so, I mean, usually, like … what are error-correcting codes about? Like, on the very high level, right, I mean, you have some data—think of it as a binary string—you want to store it, but there are errors. So usually, like, in most, kind of, forms of media, the errors are bit flips. Like, you store a 0; you get a 1. Or you store a 1; you get a 0. So these are called substitution errors. The field of error-correcting codes, it started, like, in the 1950s, so, like, it’s 70 years old at least. So we, kind of, we understand how to deal with this kind of error reasonably well, so with substitution errors. In DNA data storage, the way you store your data is that given, like, some large amount of digital data, you have the freedom of choosing which short DNA molecules to generate. So in a DNA molecule, it’s a sequence of the bases A, G, C, and T, and you have the freedom to decide, like, which of the short molecules you need to generate, and then those molecules get stored, and then during the storage, some of them are lost; some of them can be damaged. There can be insertions and deletions of bases on every molecule. Like, we call them strands. So you need redundancy, and there are two forms of redundancy. There’s redundancy that goes across strands, and there is redundancy on the strand. And so, yeah, so, kind of, from the error-correcting side of things, like, we get to decide what kind of redundancy we want to introduce—across strands, on the strand—and then, like, we want to make sure that our encoding and decoding algorithms are efficient. So that’s the coding theory angle on the field.
NGUYEN: Yeah, and then, you know, from there, once you have that data encoded into DNA, the question is how do you make that data on a scale that’s compatible with digital data storage? And so that’s where a lot of the work came in for really automating the synthesis process and also the reading process, as well. So synthesis is what we consider the writing process of DNA data storage. And so, you know, we came up with some unique ideas there. We made a chip that enabled us to get to the densities that we needed. And then on the reading side, we used different sequencing technologies. And it was great to see that we could actually just, kind of, pull sequencing technologies off the shelf because people are so interested in reading biological DNA. So we explored the Illumina technologies and also Oxford Nanopore, which is a new technology coming in the horizon. And then preservation, too, because we have to make sure that the data that’s stored in the DNA doesn’t get damaged and that we can recover it using the error-correcting codes.
STRAUSS: Yeah, absolutely. And it’s clear that—and it’s also been our experience that—DNA data storage and projects like this require more than just a team of computer scientists. Bichlien, you’ve had the opportunity to collaborate with many people in all different disciplines. So do you want to talk a little bit about that? What kind of expertise, you know, other disciplines that are relevant to bringing DNA data storage to reality?
NGUYEN: Yeah, well, it’s such a futuristic technology, right? When you begin to work on the tech, you realize how many disciplines and domains you actually have to reach in and leverage. One concrete example is that in order to fabricate an electronic chip to synthesize DNA, we really had to pull in a lot of material science research because there’s different capabilities that are needed when trying to use liquid on a chip. We, you know, have to think about DNA data storage itself. And that’s a very different beast than, you know, the traditional storage mediums. And so we worked with teams who literally create, you know, these little tiny micro- or nanocapsules in glass and being able to store that there. It’s really interesting, this multidisciplinarity, because we’re, in a way, bridging software with wetware with hardware. And so you, kind of, need all the different disciplines to actually get you to where you need to go.
STRAUSS: Yeah, absolutely. And, you know, building on, you know, collaborators, I think one area that was super interesting, as well, and was pretty early on in the project was building that first end-to-end system that we collaborated with University of Washington, the Molecular Information Systems Lab there, to build. And really, at that point, you know, there had been work suggesting that DNA data storage was viable, but nobody had really shown an end-to-end system, from beginning to end, and in fact, my manager at the time, Doug Carmean, used to call it the “bubble gum and shoestring” system. But it was a crucial first step because it shows it was possible to really fully automate the process. And there have been several interesting challenges there in the system, but we noticed that one particularly challenging one was synthesis. That first system that we built was capable of storing the word “hello,” and that was all we could store. So it wasn’t a very high-capacity system. But in order to be able to store a lot more volumes of data instead of a simple word, we really needed much more advanced synthesis systems. And this is what both Bichlien and Jake ended up working on, so do you want to talk a little bit about that and the importance of that particular work?
SMITH: Yeah, absolutely. As you said, Karin, the amount of DNA that is required to store the massive amount of data we spoke about earlier is far beyond the amount of DNA that’s needed for any, air quotes, traditional applications of synthetic DNA, whether it’s your gene construction or it’s your primer synthesis or such. And so we really had to rethink how you make DNA at scale and think about how could this actually scale to meet the demand. And so Bichlien started out looking at a thing called a microelectrode array, where you have this big checkerboard of small individual reaction sites, and in each reaction site, we used electrochemistry in order to control base by base—A, C, T, or G by A, C, T, or G—the sequence that was growing at that particular reaction site. We got this down to the nanoscale. And so what this means practically is that on one of these chips, we could synthesize at any given time on the order of hundreds of millions of individual strands. So once we had the synthesis working with the traditional chemistry where you’re doing chemical synthesis—each base is added in using a mixture of chemicals that are added to the individual spots—they’re activated. But each coupling happens due to some energy you prestored in the synthesis of your reagents. And this makes the synthesis of those reagents costly and themselves a bottleneck. And so taking, you know, a look forward at what else was happening in the synthetic biology world, the, you know, next big word in DNA synthesis was and still is enzymatic synthesis, where rather than having to, you know, spend a lot of energy to chemically pre-activate reagents that will go in to make your actual DNA strands, we capitalize on nature’s synthetic robots—enzymes—to start with less-activated, less-expensive-to-get-to, cheaply-produced-through-natural-processes substrates, and we use the enzymes themselves, toggling their activity over each of the individual chips, or each of the individual spots on our checkerboard, to construct DNA strands. And so we got a little bit into this project. You know, we successfully showed that we could put down selectively one base at a given time. We hope that others will, kind of, take up the work that we’ve put out there, you know, particularly our wonderful collaborators at Ansa who helped us design the enzymatic system. And one day we will see, you know, a truly parallelized, in this fashion, enzymatic DNA system that can achieve the scales necessary.
NGUYEN: It’s interesting to note that even though it’s DNA and we’re still storing data in these DNA strands, chemical synthesis and enzymatic synthesis provide different errors that you see in the actual files, right, in the DNA files. And so I know that we talked to Sergey about how do we deal with these new types of errors and also the new capabilities that you can have, for example, if you don’t control base by base the DNA synthesis.
YEKHANIN: This whole field of DNA data storage, like, the technologies on the biology side are advancing rapidly, right. And there are different approaches to synthesis. There are different approaches to sequencing. And, presumably, the way the storage is actually done, like, is also progressing, right, and we had works on that. So there is, kind of, this very general, kind of, high-level error profile that you can say that these are the type of errors that you encounter in DNA data storage. Like, in DNA molecules—just the sequence of these bases, A, G, C, T, in maybe a length of, like, 200 or so and you store a very, very large number of them—the errors that you see is that some of these strands, kind of, will disappear. Some of these strings can be torn apart like, let’s say, in two pieces, maybe even more. And then on every strand, you also encounter these errors—insertions, deletions, substitutions—with different rates. Like, the likelihood of all kinds of these errors may differ very significantly across different technologies that you use on the biology side. And also there can be error bursts somehow. Maybe you can get an insertion of, I don’t know, 10 A’s, like, in a row, or you can lose, like, you know, 10 bases in a row. So if you don’t, kind of, quantify, like, what are the likelihoods of all these bad events happening, then I think this still, kind of, fits at least the majority of approaches to DNA data storage, maybe not exactly all of them, but it fits the majority. So when we design coding schemes, we are trying also, kind of, to look ahead in the sense that, like, we don’t know, like, in five years, like, how will these error profiles, how will it look like. So the technologies that we develop on the error-correction side, we try to keep them very flexible, so whether it’s enzymatic synthesis, whether it’s Nanopore technology, whether it’s Illumina technology that is being used, the error-correction algorithms would be able to adapt and would still be useful. But, I mean, this makes also coding aspect harder because, [LAUGHTER] kind of, you want to keep all this flexibility in mind.
STRAUSS: So, Sergey, we are at an interesting moment now because you’re open sourcing the Trellis BMA piece of code, right, that you published a few years ago. Can you talk a little bit about that specific problem of trace reconstruction and then the paper specifically and how it solves it?
YEKHANIN: Absolutely, yeah, so this Trellis BMA paper for that we are releasing the source code right now, this is, kind of, this is the latest in our sequence of publications on error-correction for DNA data storage. And I should say that, like, we already discussed that the project is, kind of, very interdisciplinary. So, like, we have experts from all kinds of fields. But really even within, like, within this coding theory, like, within computer science/information theory, coding theory, in our algorithms, we use ideas from very different branches. I mean, there are some core ideas from, like, core algorithm space, and I won’t go into these, but let me just focus, kind of, on two aspects. So when we just faced this problem of coding for DNA data storage and we were thinking about, OK, so how to exactly design the coding scheme and what are the algorithms that we’ll be using for error correction, so, I mean, we’re always studying the literature, and we came up on this problem called trace reconstruction that was pretty popular—I mean, somewhat popular, I would say—in computer science and in statistics. It didn’t have much motivation, but very strong mathematicians had been looking at it. And the problem is as follows. So, like, there is a long binary string picked at random, and then it’s transmitted over a deletion channel, so some bits—some zeros and some ones—at certain coordinates get deleted and you get to see, kind of, the shortened version of the string. But you get to see it multiple times. And the question is, like, how many times do you need to see it so that you can get a reasonably accurate estimate of the original string that was transmitted? So that was called trace reconstruction, and we took a lot of motivation—we took a lot of inspiration—from the problem, I would say, because really, in DNA data storage, if we think about a single strand, like, a single strand which is being stored, after we read it, we usually get multiple reads of this string. And, well, the errors there are not just deletions. There are insertions, substitutions, and, like, inversive errors, but still we could rely on this literature in computer science that already had some ideas. So there was an algorithm called BMA, Bitwise Majority Alignment. We extended it—we adopted it, kind of, for the needs of DNA data storage—and it became, kind of, one of the tools in our toolbox for error correction.
So we also started to use ideas from literature on electrical engineering, what are called convolutional error-correcting codes and a certain, kind of, class of algorithms for decoding errors in these convolutional error-correcting codes called, like, I mean, Trellis is the main data structure, like, Trellis-based algorithms for decoding convolutional codes, like, Viterbi algorithm or BCJR algorithm. Convolutional codes allow you to introduce redundancy on the string. So, like, with algorithms kind of similar to BMA, like, they were good for doing error correction when there was no redundancy on the strand itself. Like, when there is redundancy on the strand, kind of, we could do some things, but really it was very limited. With Trellis-based approaches, like, again inspired by the literature in electrical engineering, we had an approach to introduce redundancy on the strand, so that allowed us to have more powerful error-correction algorithms. And then in the end, we have this algorithm, which we call Trellis BMA, which, kind of, combines ideas from both fields. So it’s based on Trellis, but it’s also more efficient than standard Trellis-based algorithms because it uses ideas from BMA from computer science literature. So this is, kind of, this is a mix of these two approaches. And, yeah, that’s the paper that we wrote about three years ago. And now we’re open sourcing it. So it is the most powerful algorithm for DNA error correction that we developed in the group. We’re really happy that now we are making it publicly available so that anybody can experiment with the source code. Because, again, the field has expanded a lot, and now there are multiple groups around the globe that work just specifically on error correction apart from all other aspects, so, yeah, so we are really happy that it’s become publicly available to hopefully further advance the field.
STRAUSS: Yeah, absolutely, and I’m always amazed by, you know, how, it is really about building on other people’s work. Jake and Bichlien, you recently published a paper in Nature Communications. Can you tell us a little bit about what it was, what you exposed the DNA to, and what it was specifically about?
NGUYEN: Yeah. So that paper was on the effects of neutron radiation on DNA data storage. So, you know, when we started the DNA Data Storage Project, it was really a comparison, right, between the different storage medias that exist today. And one of the issues that have come up through the years of development of those technologies was, you know, hard errors and soft errors that were induced by radiation. So we wanted to know, does that maybe happen in DNA? We know that DNA, in humans at least, is affected by radiation from cosmic rays. And so that was really the motivation for this type of experiment. So what we did was we essentially took our DNA files and dried them and threw them in a neutron accelerator, which was fantastic. It was so exciting. That’s, kind of, the merge of, you know, sci fi with sci fi at the same time. [LAUGHS] It was fantastic. And we irradiated for over 80 million years—
STRAUSS: The equivalent of …
NGUYEN: The equivalent of 80 million years.
STRAUSS: Yes, because it’s a lot of radiation all at the same time, …
NGUYEN: It’s a lot of radiation …
STRAUSS: … and it’s accelerated radiation exposure?
NGUYEN: Yeah, I would say it’s accelerated aging with radiation. It’s an insane amount of radiation. And it was surprising that even though we irradiated our DNA files with that much radiation, there wasn’t that much damage. And that’s surprising because, you know, we know that humans, if we were to be irradiated like that, it would be disastrous. But in, you know, DNA, our files were able to be recovered with zero bit errors.
STRAUSS: And why that difference?
NGUYEN: Well, we think there’s a few reasons. One is that when you look at the interaction between a neutron and the actual elemental composition of DNA—which is basically carbons, oxygens, and hydrogens, maybe a phosphorus—the neutrons don’t interact with the DNA much. And if it did interact, we would have, for example, a strand break, which based on the error-correcting codes, we can recover from. So essentially, there’s not much … one, there’s not much interaction between neutrons and DNA, and second, we have error-correcting codes that would prevent any data loss.
STRAUSS: Awesome, so yeah, this is another milestone that contributes towards the technology becoming a reality. There are also other conditions that are needed for technology to be brought to the market. And one thing I’ve worked on is to, you know, create the DNA Data Storage Alliance; this is something Microsoft co-founded with Illumina, Twist Bioscience, and Western Digital. And the goal there was to essentially provide the right conditions for the technology to thrive commercially. We did bring together multiple universities and companies that were interested in the technology. And one thing that we’ve seen with storage technologies that’s been pretty important is standardization and making sure that the technology’s interoperable. And, you know, we’ve seen stalemate situations like Blu-ray and high-definition DVD, where, you know, really we couldn’t decide on a standard, and the technology, it took a while for the technology to be picked up, and the intent of the DNA Data Storage [Alliance] is to provide an ecosystem of companies, universities, groups interested in making sure that this time, it’s an interoperable technology from the get-go, and that increases the chances of commercial adoption. As a group, we often talk about how amazing it is to work for a company that empowers us to do this kind of research. And for me, one of Microsoft Research’s unique strengths, particularly in this project, is the opportunity to work with such a diverse set of collaborators on such a multidisciplinary project like we have. How do you all think where you’ve done this work has impacted how you’ve gone about it and the contributions you’ve been able to make?
NGUYEN: I’m going to start with if we look around this table and we see who’s sitting at it, which is two chemists, a computer architect, and a coding theorist, and we come together and we’re like, what can we make that would be super, super impactful? I think that’s the answer right there, is that being at Microsoft and being in a culture that really fosters this type of interdisciplinary collaboration is the key to getting a project like this off the ground.
SMITH: Yeah, absolutely. And we should acknowledge the gigantic contributions made by our collaborators at the University of Washington. Many of them would fall in not any of these three categories. They’re electrical engineers, they’re mechanical engineers, they’re pure biologists that we worked with. And each of them brought their own perspective, and particularly when you talk about going to a true end-to-end system, those perspectives were invaluable as we were trying to fit all the puzzle pieces together.
STRAUSS: Yeah, absolutely. We’ve had great collaborations over time—University of Washington, ETH Zürich, Los Alamos National Lab, ChipIr, Twist Bioscience, Ansa Biotechnologies. Yeah, it’s been really great and a great set of different disciplines, all the way from coding theorists to the molecular biology and chemistry, electrical and mechanical engineering. One of the great things about research is there’s never a shortage of interesting questions to pursue, and for us, this particular work has opened the door to research in adjacent domains, including sustainability fields. DNA data storage requires small amounts of materials to accommodate the large amounts of data, and early on, we wanted to understand if DNA data storage was, as it seemed, a more sustainable way to store information. And we learned a lot. Bichlien and Jake, you had experience in green chemistry when you came to Microsoft. What new findings did we make, and what sustainability benefits do we get with DNA data storage? And, finally, what new sustainability work has the project led to?
NGUYEN: As a part of this project, if we’re going to bring new technologies to the forefront, you know, to the world, we should make sure that they have a lower carbon footprint, for example, than previous technologies. And so we ran a life cycle assessment—which is a way to systematically evaluate the environmental impacts of anything of interest—and we did this on DNA data storage and compared it to electronic storage medium[1], and we noticed that if we were able to store all of our digital information in DNA, that we would have benefits associated with carbon emissions. We would be able to reduce that because we don’t need as much infrastructure compared to the traditional storage methods. And there would be an energy reduction, as well, because this is a passive way of archival data storage. So that was, you know, the main takeaways that we had. But that also, kind of, led us to think about other technologies that would be beneficial beyond data storage and how we could use the same kind of life cycle thinking towards that.
SMITH: This design approach that you’ve, you know, talked about us stumbling on, not inventing but seeing other people doing in the literature and trying to implement ourselves on the DNA Data Storage Project, you know, is something that can be much bigger than any single material. And where we think there’s a, you know, chance for folks like ourselves at Microsoft Research to make a real impact on this sustainability-focused design is through the application of machine learning, artificial intelligence—the new tools that will allow us to look at much bigger design spaces than we could previously to evaluate sustainability metrics that were not possible when everything was done manually and to ultimately, you know, at the end of the day, take a sustainability-first look at what a material should be composed of. And so we’ve tried to prototype this with a few projects. We had another wonderful collaboration with the University of Washington where we looked at recyclable circuit boards and a novel material called a vitrimer that it could possibly be made out of[2]. We’ve had another great collaboration with the University of Michigan, where we’ve looked at the design of charge-carrying molecules in these things called flow batteries that have good potential for energy smoothing in, you know, renewables production, trying to get us out of that day-night, boom-bust cycle[3]. And we had one more project, you know, this time with collaborators at the University of Berkeley, where we looked at, you know, design of a class of materials called a metal organic framework, which have great promise in low-energy-cost gas separation, such as pulling CO2 out of the, you know, plume of a smokestack or, you know, ideally out of the air itself[4].
STRAUSS: For me, the DNA work has made me much more open to projects outside my own research area—as Bichlien mentioned, my core research area is computer architecture, but we’ve ventured in quite a bit of other areas here—and going way beyond my own comfort zone and really made me love interdisciplinary projects like this and try, really try, to do the most important work I can. And this is what attracted me to these other areas of environmental sustainability that Bichlien and Jake covered, where there’s absolutely no lack of problems. Like them, I’m super interested in using AI to solve many of them. So how do each of you think working on the DNA Data Storage Project has influenced your research approach more generally and how you think about research questions to pursue next?
YEKHANIN: It definitely expanded the horizons a lot, like, just, kind of, just having this interactions with people, kind of, whose core areas of research are so different from my own and also a lot of learning even within my own field that we had to do to, kind of, carry this project out. So, I mean, it was a great and rewarding experience.
NGUYEN: Yeah, for me, it’s kind of the opposite of Karin, right. I started as an organic chemist and then now really, one, appreciate the breadth and depth of going from a concept to a real end-to-end prototype and all the requirements that you need to get there. And then also, really the importance of having, you know, a background in computer science and really being able to understand the lingo that is used in multidisciplinary projects because you might say something and someone else interprets it very differently, and it’s because you’re not speaking the same language. And so that understanding that you have to really be … you have to learn a little bit of vocabulary from each person and understand how they contribute and then how your ideas can contribute to their ideas has been really impactful in my career here.
SMITH: Yeah, I think the key change in approach that I took away—and I think many of us took away from the DNA Data Storage Project—was rather than starting with an academic question, we started with a vision of what we wanted to happen, and then we derived the research questions from analyzing what would need to happen in the world—what are the bottlenecks that need to be solved in order for us to achieve, you know, that goal? And this is something that we’ve taken with us into the sustainability-focused research and, you know, something that I think will affect all the research I do going forward.
STRAUSS: Awesome. As we close, let’s reflect a bit on what a world in which DNA data storage is widely used might look like. If everything goes as planned, what do you hope the lasting impact of this work will be? Sergey, why don’t you lead us off.
YEKHANIN: Sure, I remember that, like, when … in the early days when I started working on this project actually, you, Karin, told me that you were taking an Uber ride somewhere and you were talking to the taxi driver, and the taxi driver—I don’t know if you remember that—but the taxi driver mentioned that he has a camera which is recording everything that’s happening in the car. And then you had a discussion with him about, like, how long does he keep the data, how long does he keep the videos. And he told you that he keeps it for about a couple of days because it’s too expensive. But otherwise, like, if it weren’t that expensive, he would keep it for much, much longer because, like, he wants to have these recordings if later somebody is upset about the ride and, I don’t know, he is getting sued or something. So this is, like, this is one small narrow application area where DNA data storage would clearly, kind of, if it happens, then it will solve it. Because then, kind of, this long-term archival storage will become very cheap, available to everybody; it would become a commodity basically. There are many things that will be enabled, like this helping the Uber drivers, for instance. But also one has to think of, of course, like, about, kind of, the broader implications so that we don’t get into something negative because again this power of recording everything and storing everything, it can also lead to some use cases that might be, kind of, morally wrong. So, again, hopefully by the time that we get to, like, really wide deployments of this technology, the regulation will also be catching up and the, like, we will have great use cases and we won’t have bad ones. I mean, that’s how I think of it. But definitely there are lots of, kind of, great scenarios that this can enable.
SMITH: Yeah. I’ll grab onto the word you use there, which is making DNA a commodity. And one of the things that I hope comes out of this project, you know, besides all the great benefits of DNA data storage itself is spillover benefits into the field of health—where if we make DNA synthesis at large scale truly a commodity thing, which I hope some of the work that we’ve done to really accelerate the throughput of synthesis will do—then this will open new doors in what we can do in terms of gene synthesis, in terms of, like, fundamental biotech research that will lead to that next set of drugs and, you know, give us medications or treatments that we could not have thought possible if we were not able to synthesize DNA and related molecules at that scale.
NGUYEN: So much information gets lost because of just time. And so I think being able to recover really ancient history that humans wrote in the future, I think, is something that I really hope could be achieved because we’re so information rich, but in the course of time, we become information poor, and so I would like for our future generations to be able to understand the life of, you know, an everyday 21st-century person.
STRAUSS: Well, Bichlien, Jake, Sergey, it’s been fun having this conversation with you today and collaborating with you in all of this amazing project [MUSIC] and all the research we’ve done together. Thank you so much.
YEKHANIN: Thank you, Karin.
SMITH: Thank you.
NGUYEN: Thanks.
[MUSIC FADES]
[1] The team presented the findings from their life cycle assessment of DNA data storage in the paper Architecting Datacenters for Sustainability: Greener Data Storage using Synthetic DNA.
[2] For more information, check out the podcast episode Collaborators: Sustainable electronics with Jake Smith and Aniruddh Vashisth and the paper Recyclable vitrimer-based printed circuit boards for sustainable electronics.
[3] For more information, check out the podcast episode Collaborators: Renewable energy storage with Bichlien Nguyen and David Kwabi.
[4] For more information, check out the paper MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design.
The post Ideas: The journey to DNA data storage appeared first on Microsoft Research.
]]>Yasuyuki Matsushita rejoins Microsoft to lead the new Microsoft Research Asia - Tokyo lab. Learn more about his journey and his perspective on the Tokyo lab's role in evolution of AI.
The post Introducing Yasuyuki Matsushita: Tackling societal challenges with AI at Microsoft Research Asia – Tokyo appeared first on Microsoft Research.
]]>Earlier this year, Microsoft Research announced (opens in new tab) its newest lab in Tokyo, Japan. Today, we are celebrating its grand opening, reinforcing Microsoft Research’s commitment to AI research across the Asia-Pacific region. This new lab will focus on embodied AI, well-being and neuroscience, societal AI, and industry innovation—all areas that align with Japan’s socioeconomic priorities. This initiative will enhance collaboration with local academic and industrial partners, contributing to global innovation and talent development.
We recently spoke with Yasuyuki Matsushita, head of the newly established Tokyo lab. Matsushita, who worked at Microsoft Research Asia from 2003 to 2015, served as a professor at Osaka University for the past decade, before returning in October. He reflects on his journey, the evolution of technology, and the opportunities ahead for Microsoft Research Asia – Tokyo.
Question: We are excited to have you leading the new lab in Tokyo. You worked at Microsoft Research Asia in Beijing from 2003 to 2015 before transitioning to academia. What motivated you to return after nearly a decade?
Yasuyuki Matsushita: Microsoft Research Asia has always been an exceptional place for conducting cutting-edge research, especially in the AI era. Earlier this year, I learned about Microsoft Research Asia’s expansion, including the establishment of a new lab in Tokyo. This presented an exciting opportunity to make a meaningful impact both locally and globally, sparking my motivation to return. Additionally, Microsoft is at the forefront of AI advancements, making this an ideal moment to re-engage. I’m confident that my work can contribute meaningfully to this dynamic field. The pace of AI development today is unmatched, making this an exhilarating time to be involved.
Question: Now that you’ve been back for a few weeks, from your perspective, what has changed at Microsoft Research Asia, and what has remained the same since you were last here?
Yasuyuki Matsushita: The most immediate change I’ve noticed is the array of employee tools and resources, which have evolved significantly over the past decade. I’m still familiarizing myself with these new systems, designed to optimize efficiency and collaboration. Over the past ten years, Microsoft has played a key role in driving digital transformation for other companies, and it has also transformed internally.
Beyond these changes, much of what made Microsoft Research Asia unique remains the same. The culture and people continue to foster an environment of innovation and collaboration. The organization still attracts exceptional talent, and the spirit of research is as vibrant as ever. One of its greatest strengths is its open, collaborative approach. It has maintained long-standing partnerships with universities and research institutions, which encourage cross-regional, cross-cultural, and interdisciplinary exchanges. This synergy stimulates innovation and supports industry development. The commitment to excellence remains at the heart of Microsoft Research Asia’s identity.
Question: With Microsoft Research Asia expanding regionally to places like Tokyo, Vancouver, Singapore, and Hong Kong, what are your plans as the head of the Tokyo lab, and how do you see it contributing to the region’s innovation ecosystem?
Yasuyuki Matsushita: My primary goal is to align the Tokyo lab’s growth with Microsoft Research Asia’s mission to advance science and technology for the benefit of humanity. The research efforts we’re focusing on in this lab aim to address pressing societal issues while advancing AI technologies to benefit society as a whole.
For instance, Japan’s aging population presents unique challenges that require efficient societal solutions—an issue faced by many nations today. Through our research, we aim to generate insights that can be applied globally to proactively address and mitigate such challenges.
Japan also has a strong legacy of scientific research in fields like electronics, materials science, and robotics. Its advanced industrial base, featuring renowned companies across the automotive, electronics, and machinery sectors, provides rich application scenarios for our research outcomes. Additionally, Japan’s robust education system supplies an intellectual foundation crucial for our in-depth research.
We’re dedicated to maintaining open research practices. By publishing our findings and open-sourcing our tools, we ensure our work benefits the broader industry and enriches the global knowledge pool. Our goal is to share insights that drive societal progress and innovation worldwide.
Question: Talent is at the heart of Microsoft Research’s mission and culture. What kind of talent is Microsoft Research Asia – Tokyo looking for? In what ways can the Tokyo lab enhance its efforts to cultivate the next generation of tech innovators for the region?
Yasuyuki Matsushita: One of the key advantages of being part of Microsoft is the close connection we have to real-world applications. This bridge between research and practice allows our work to have a direct societal impact, ensuring that innovative technology results in meaningful and beneficial outcomes.
When recruiting new talent, we seek bright, self-driven individuals with an innate curiosity and a passion for solving societal challenges. The most vital trait we look for is a deep desire to understand the “why” behind complex problems. While technical expertise is essential, a commitment to addressing social issues fuels creativity and drives meaningful progress. This blend of curiosity and purpose sparks innovation and propels us forward at Microsoft Research Asia.
At the Tokyo lab, a core part of our vision is cultivating the next wave of tech innovators. We plan to build on the legacy of successful talent programs that Microsoft Research Asia has championed throughout the region, like joint research initiatives, visiting scholar programs, and internship opportunities. These provide early career professionals and students with invaluable hands-on experiences, equipping them with essential research skills and deepening their understanding of complex technological challenges.
We’re committed to creating a nurturing environment where talent can thrive, collaborate, and contribute to the global tech landscape. By combining innovation with real-world impact, we aim to inspire the next generation to push boundaries and advance society.
Question: In today’s world, everything is moving toward digitization and intelligence. Ten years ago, your research focused on photometry and video analysis. Can you share some key outcomes from that period and explain how you see emerging technologies like AI influencing the field of computer vision?
Yasuyuki Matsushita: Back then, my research centered on computer vision, specifically on photometry for 3D reconstruction and video analysis aimed at enhancing video quality. One of the standout projects during that period was the development of a gigapixel camera capable of capturing high-resolution 3D information. This camera played a crucial role in the Dunhuang Mogao Grottoes project, which sought to digitally preserve the cultural heritage of Dunhuang’s murals and Buddha shrines with unprecedented accuracy.
Another notable project was the development of video stabilization technology, which was integrated into Windows 7 as part of Media Foundation. This technology improved video quality by compensating for unwanted camera movements, delivering smoother and more professional-looking output. The creation of real-time algorithms capable of processing and stabilizing video was groundbreaking at that time.
Since then, the introduction of deep learning, large datasets, and sophisticated neural network architectures has propelled computer vision to new heights. Tasks that were once considered difficult, such as object detection, recognition, and segmentation, are now standard with modern AI techniques. Current research continues to push the boundaries by exploring innovative network architectures, new learning strategies, and enhanced datasets. A particularly exciting trend is the use of AI in real-world interactive scenarios, leading to the emergence of embodied AI, which is a major focus of my current work.
Question: Your current research interests include embodied AI, which is also one of the key areas at Microsoft Research Asia – Tokyo. What exactly is embodied AI, and how does it differ from robotics?
Yasuyuki Matsushita: Embodied AI goes beyond traditional robotics. While robots are typically machines equipped with actuators designed to execute specific tasks, embodied AI focuses on developing intelligent systems that can perform complex tasks while understanding and interacting within physical and virtual environments. Robotics and AI have developed independently, but embodied AI is the convergence of these two fields, integrating AI with physical agents that can perceive, act, and learn in dynamic real-world environments.
This field is inherently interdisciplinary, involving aspects such as robotic control, reinforcement learning, spatial awareness, human-robot interaction, reasoning, and more. For instance, embodied AI includes the ability to infer cause and effect, such as understanding that an unsupported laptop will fall due to gravity. These types of interactions and interpretations stem from engaging with and understanding the physical world, making embodied AI an exciting and multifaceted area of research.
Given the complexity of embodied AI, no single organization can cover all aspects of its development alone. We look forward to collaborating with local industry and academic institutions in Japan, leveraging their expertise alongside our strengths in AI to advance the field.
Question: You’ve had an extensive career spanning academia and industry. From your experience as both an educator and a researcher, what advice would you give to young people interested in pursuing research in computer vision and AI?
Yasuyuki Matsushita: For students interested in computer vision and AI, a strong foundation in mathematics and computer science is essential, even as specific research topics and technologies evolve. A deep understanding of fundamental mathematical concepts, such as gradients, Jacobians, and vector spaces, is indispensable. Mastery of these principles will be beneficial regardless of changes in programming languages or development platforms.
Maintaining a mindset of continuous learning is equally important, as the field is constantly evolving. For example, deep learning was not as prominent a decade ago but is now central to the field. At Microsoft, we emphasize the importance of a growth mindset—being adaptable, open to new technologies, and willing to pivot with industry advancements. Early career professionals should cultivate the ability to quickly acquire new skills while building on their foundational knowledge. This adaptability is key to long-term success in research and development.
The post Introducing Yasuyuki Matsushita: Tackling societal challenges with AI at Microsoft Research Asia – Tokyo appeared first on Microsoft Research.
]]>BiomedParse reimagines medical image analysis, integrating advanced AI to capture complex insights across imaging types—a step forward for diagnostics and precision medicine.
The post BiomedParse: A foundation model for smarter, all-in-one biomedical image analysis appeared first on Microsoft Research.
]]>In cancer diagnosis or advanced treatments like immunotherapy, every detail in a medical image counts. Radiologists and pathologists rely on these images to track tumors, understand their boundaries, and analyze how they interact with surrounding cells. This work demands pinpoint accuracy across several tasks—identifying whether a tumor is present, locating it precisely, and mapping its contours on complex CT scans or pathology slides.
Yet, these crucial steps—object recognition, detection, and segmentation—are often tackled separately, which can limit the depth of analysis. Current tools like MedSAM (opens in new tab) and SAM (opens in new tab) focus on segmentation only, thus missing out on the opportunity to blend these insights holistically and relegating object as an afterthought.
In this blog, we introduce BiomedParse (opens in new tab), a new approach for holistic image analysis by treating object as the first-class citizen. By unifying object recognition, detection, and segmentation into a single framework, BiomedParse allows users to specify what they’re looking for through a simple, natural-language prompt. The result is a more cohesive, intelligent way of analyzing medical images that supports faster, more integrated clinical insights.
While biomedical segmentation datasets abound, there are relatively few prior works on object detection and recognition in biomedicine, let alone datasets covering all three tasks. To pretrain BiomedParse, we created the first such dataset by harnessing OpenAI’s GPT-4 for data synthesis from standard segmentation datasets (opens in new tab).
BiomedParse is a single foundation model that can accurately segment biomedical objects across nine modalities, as seen in Figure 1, outperforming prior best methods while requiring orders of magnitude fewer user operations, as it doesn’t require an object-specific bounding box. By learning semantic representation for individual object types, BiomedParse’s superiority is particularly pronounced in the most challenging cases with irregularly shaped objects. Through joint pretraining of object recognition, detection, and segmentation, BiomedParse opens new possibilities for holistic image analysis and image-based discovery in biomedicine.
Back in 2005, researchers first introduced the concept of “image parsing”—a unified approach to image analysis that jointly conducts object recognition, detection, and segmentation. Built on Bayesian networks, this early model offered a glimpse into a future of joint learning and reasoning in image analysis, though it was limited in scope and application. Fast forward to today, cutting-edge advances in generative AI have breathed new life into this vision. With our model, BiomedParse, we have created a foundation for biomedical image parsing that leverages interdependencies across the three subtasks, thus addressing key limitations in traditional methods. BiomedParse enables users to simply input a natural-language description of an object, which the model uses to predict both the object label and its segmentation mask, thus eliminating the need for a bounding box (Figure 1c). In other words, this joint learning approach lets users segment objects based on text alone.
on-demand event
We created the first dataset for biomedical imaging parsing by harnessing GPT-4 for large-scale data synthesis from 45 existing biomedical segmentation datasets (Figure 1a and 1b). The key insight is to leverage readily available natural-language descriptions already in these datasets and use GPT-4 to organize this often messy, unstructured text with established biomedical object taxonomies.
Specifically, we use GPT-4 to help create a unifying biomedical object taxonomy for image analysis and harmonize natural language descriptions from existing datasets with this taxonomy. We further leverage GPT-4 to synthesize additional variations of object descriptions to facilitate more robust text prompting.
This enables us to construct BiomedParseData, a biomedical image analysis dataset comprising over 6 million sets of images, segmentation masks, and text descriptions drawn from more than 1 million images. This dataset includes 64 major biomedical object types, 82 fine-grained subtypes, and spans nine imaging modalities.
We evaluated BiomedParse on a large held-out test set with 102,855 image-mask-label sets across 64 major object types in nine modalities. BiomedParse outperformed prior best methods such as MedSAM and SAM, even when oracle per-object bounding boxes were provided. In the more realistic setting when MedSAM and SAM used a state-of-the-art object detector (Grounding DINO) to propose bounding boxes, BiomedParse outperformed them by a wide margin, between 75 and 85 absolute points in dice score (Figure 2a). BiomedParse also outperforms a variety of other prominent methods such as SegVol, Swin UNETR, nnU-Net, DeepLab V3+, and UniverSeg.
Biomedical objects often have complex and irregular shapes, which present significant challenges for segmentation, even with oracle bounding box. By joint learning with object recognition and detection, BiomedParse learns to model object-specific shapes, and its superiority is particularly pronounced for the most challenging cases (Figure 3). Encompassing a large collection of diverse object types in nine modalities, BiomedParseData also provides a much more realistic representation of object complexity in biomedicine.
By operating through a simple text prompt, BiomedParse requires substantially less user effort than prior best methods that typically require object-specific bounding boxes, especially when an image contains a large number of objects (Figure 2c). By modeling object recognition threshold, BiomedParse can detect invalid prompt and reject segmentation requests when an object is absent from the image. BiomedParse can be used to recognize and segment all known objects in an image in one fell swoop (Figure 4). By scaling holistic image analysis, BiomedParse can potentially be applied to key precision health applications such as early detection, prognosis, treatment decision support, and progression monitoring.
Going forward, there are numerous growth opportunities. BiomedParse can be extended to handle more modalities and object types. It can be integrated into advanced multimodal frameworks such as LLaVA-Med (opens in new tab) to facilitate conversational image analysis by “talking to the data.” To facilitate research in biomedical image analysis, we have made BiomedParse open-source (opens in new tab) with Apache 2.0 license. We’ve also made it available on Azure AI (opens in new tab) for direct deployment and real-time inference. For more information, check out our demo. (opens in new tab)
BiomedParse is a joint work with Providence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering, and brings collaboration from multiple teams within Microsoft*. It reflects Microsoft’s larger commitment to advancing multimodal generative AI for precision health, with other exciting progress such as GigaPath (opens in new tab), BiomedCLIP (opens in new tab), LLaVA-Rad (opens in new tab), BiomedJourney (opens in new tab), MAIRA (opens in new tab), Rad-DINO (opens in new tab), Virchow (opens in new tab).
(Acknowledgment footnote) *: Within Microsoft, it is a wonderful collaboration among Health Futures, MSR Deep Learning, and Nuance.
Paper co-authors: Theodore Zhao, Yu Gu, Jianwei Yang (opens in new tab), Naoto Usuyama (opens in new tab), Ho Hin Lee, Sid Kiblawi, Tristan Naumann (opens in new tab), Jianfeng Gao (opens in new tab), Angela Crabtree, Jacob Abel, Christine Moung-Wen, Brian Piening, Carlo Bifulco, Mu Wei, Hoifung Poon (opens in new tab), Sheng Wang (opens in new tab)
The post BiomedParse: A foundation model for smarter, all-in-one biomedical image analysis appeared first on Microsoft Research.
]]>Retrieval-augmented generation (RAG) helps AI systems provide more information to a large language model (LLM) when generating responses to user queries. A new method for conducting “global” queries can optimize the performance of global search in GraphRAG.
The post GraphRAG: Improving global search via dynamic community selection appeared first on Microsoft Research.
]]>Retrieval-augmented generation (RAG) allows AI systems to provide additional information and context to a large language model (LLM) when generating a response to a user query. However, traditional RAG-based methods can struggle to retrieve information that requires high-level knowledge of the entire dataset, especially with abstract and global questions such as the keywordless query: “Catch me up on the last two weeks of updates.” These types of queries are known as “global” queries, as they require holistic understanding of the dataset to answer the question. GraphRAG aims to tackle these questions in two main steps: indexing and query. The indexing engine first breaks down a collection of text documents into segments which are then clustered into hierarchical communities with entities and relationships connecting each segment up through higher levels of abstraction. We then use an LLM to generate a summary of each community, known as a community report. The indexing engine thus creates a hierarchical knowledge graph of the dataset, with each level in the hierarchy representing a different level of abstraction and summarization of the original material. In the query step, GraphRAG uses this structured knowledge to provide additional context to the LLM to help answer the question. In this blog post, we show a new method for conducting “global” queries that efficiently utilizes the knowledge graph representation and optimizes the performance of global search in GraphRAG.
The global search (opens in new tab) algorithm in GraphRAG aims to answer abstract questions that require knowledge of the entire dataset. It generates answers by searching over communities at a predetermined level in the knowledge graph. Then the LLM combines and summarizes all the community reports at this level of abstraction. Finally, the summary is used as additional context for the LLM to generate the response to the user question. This map-reduce process allows the LLM to select relevant text from all the community reports to generate its final answer. This static approach is expensive and inefficient because it includes many lower-level reports that are not informative to the user query. Since it is unlikely that all community reports, especially at a high level, are relevant in answering the query, an approach that first considers the relevancy of the report prior to the resource-intensive map-reduce operation is highly desirable.
Here, we introduce dynamic community selection to the global search algorithm, which leverages the knowledge graph structure of the indexed dataset. Starting from the root of the knowledge graph, we use an LLM to rate how relevant a community report is in answering the user question. If the report is deemed irrelevant, we simply remove it and its nodes (or sub-communities) from the search process. On the other hand, if the report is deemed relevant, we then traverse down its child nodes and repeat the operation. Finally, only relevant reports are passed to the map-reduce operation to generate the response to the user. Figure 1 illustrates the dynamic community selection process in action.
The dynamic global search approach has two main benefits. First, it prunes irrelevant reports early on, reducing the total number of community reports to be considered in the map-reduce operation. Second, it enables users to search the entire knowledge graph, instead of predefining a static community level, and can lead to more detailed answers. This allows it to collect information at various levels of abstraction. Moreover, the rating operation is a classification problem, which is considerably easier to perform than summarization and text generation, therefore, a less complex model can be used. In our experiments leveraging OpenAI’s models, a GPT-4o-mini rater achieved a very similar retrieval rate as a GPT-4o rater, while operating at a fraction of both cost and time. Overall, we use the smaller and more cost-effective model, GPT-4o-mini, in the rate operation to prune any irrelevant community reports, then we use GPT-4o to perform the map-reduce operation to generate the final response.
To demonstrate the cost saving that dynamic global search brings while maintaining a similar response quality, we evaluated the two methods side by side on a dataset from AP News. We tested static and dynamic search on 50 global questions and assessed the final response quality using an LLM evaluator. Moreover, we compared the total token cost of the two methods. To compare the two methods directly, we constrained the maximum search depth on dynamic global search so that both methods used the same underlying information.
We use an LLM evaluator to select the best response (i.e. win rate) on 3 key metrics:
Spotlight: Event Series
The quality of responses generated from dynamic community selection are comparable to its static counterpart while reducing the total token cost. Our LLM evaluation shows that the output quality of the two methods is similar in the three key metrics across the 50 global questions on the AP News dataset, with no statistical significance between them. More importantly, we observed a significant reduction of total token costs when using the new method, with an average cost reduction of 77% over the existing static global search at community level 1. This is due to the large number of community reports being eliminated via the rating process, thus requiring fewer prompt and output tokens needed in the map-reduce operation. For instance, the existing static global search method processes about 1,500 level 1 community reports in the map-reduce operation, while only 470 community reports on average are selected in dynamic search to generate the final answer.
Moreover, if we allow dynamic search to continue the rating process further to deeper level community reports, we observe an improvement in its final responses. Here, we conducted the same experiment but allowed dynamic search to continue until community level 3. Out of the 50 global questions, 29 included more community reports than our static search baseline, suggesting that some community reports at deeper levels are relevant to the user question. Indeed, we observed a moderate and statistically significant improvement in both comprehensiveness and empowerment. Using an LLM evaluator to score pairs of responses, we observe that dynamic global search scores a win rate of 58% and 60%, respectively, against static search at level 1. Nevertheless, while the rating operation is performed by a smaller model and hence induces negligible cost, it can still lead to a higher overall cost due to the increased number of community reports that the map-reduce operation processes. In this experiment, the total cost with dynamic search at level 3 is 34% higher on average. Table 1 summarizes the results of static search at level 1 against dynamic search at level 1 and 3.
Dynamic search | Num. queries | Win rate against static search (level 1) | Relative cost | ||
Comprehensiveness | Diversity | Empowerment | |||
Level 1 | 50 | 49.5% | 46.0% | 48.0% | -77% |
Level 3 | 29* | 58.8% | 48.0% | 60.0% | +34% |
Table 1. LLM evaluation on dynamic search vs static search at different community levels on global questions. We show the win rates of dynamic search at level 1 and level 3 against static search at level 1, that is the number of times the LLM evaluator prefers the generated responses from dynamic search over static search. Green background indicates there is statistical significance in the win rates, and grey cells if it is not significant. The right-most column shows the average relative token cost against static search at level 1. * Note that we only evaluate answers from dynamic search at community level 3, which contains more community reports than static search at level 1.
Overall, dynamic community selection achieved a cost reduction of 77% while achieving output quality similar to the baseline static approach. These appear to be promising numbers, but what of the result quality? Let’s examine some direct side-by-side comparisons with an example:
Query: Across the dataset, describe the common trends in vaccination rates for major diseases. | |
Static global search at level 1 | Global search with dynamic community selection at level 3 |
### Common Trends in Vaccination Rates for Major Diseases #### Decline in Vaccination Rates A significant trend observed across various regions is the decline in vaccination rates for major diseases. For instance, Minnesota has experienced a decrease in immunization rates, leading to recent measles outbreaks [Data: Reports (335)]. Similarly, the COVID-19 pandemic has exacerbated this issue, causing a drop in childhood immunization rates in Michigan and other areas [Data: Reports (1231, 915)]. This decline is often attributed to misinformation about vaccines, particularly the false claim linking the MMR vaccine to autism [Data: Reports (912, 1335)]. #### Impact of COVID-19 The COVID-19 pandemic has had a profound impact on vaccination rates. Efforts to increase COVID-19 vaccination rates have been a major focus of public health policies, but the pandemic has also led to a decline in routine vaccinations among children [Data: Reports (920, 194)]. The pandemic has influenced public attitudes towards health measures and vaccinations, contributing to broader public health challenges [Data: Reports (920)].#### Regional Variations Vaccination rates vary significantly by region. West Virginia, for example, has one of the strictest school vaccination policies in the U.S., resulting in high child immunization rates [Data: Reports (1199)]. In contrast, Idaho has the highest overall childhood vaccination exemption rate in the U.S., posing significant public health risks [Data: Reports (998)]. Illinois faces challenges with low vaccination rates in schools, leading to measles outbreaks [Data: Reports (108, 82)]. #### Public Health Initiatives Various public health initiatives have been implemented to address vaccination rates. The CDC plays a crucial role in monitoring and reporting on vaccination rates for diseases such as measles, mumps, and polio, and provides vital information on preventable diseases [Data: Reports (914)]. Programs like the ‘Do It For Babydog’ sweepstakes in West Virginia encourage vaccination among residents [Data: Reports (1199)]. Additionally, the Bill & Melinda Gates Foundation is heavily involved in funding and supporting vaccination programs for major diseases such as polio, malaria, HIV, and cholera [Data: Reports (1091)]. ### Conclusion The dataset reveals a complex landscape of vaccination rates influenced by regional policies, public health initiatives, and the impact of the COVID-19 pandemic. While some regions have managed to maintain high vaccination rates through strict policies and public health campaigns, others face significant challenges due to misinformation and exemption rates. Global efforts and collaborations remain crucial in addressing these challenges and improving vaccination coverage worldwide. |
### Common Trends in Vaccination Rates for Major Diseases #### Decline in Vaccination Rates A significant trend observed across the dataset is the decline in vaccination rates for various diseases, including measles, mumps, rubella (MMR), and polio. This decline is particularly evident among U.S. kindergartners, where vaccination coverage has dropped from 95.2% during the 2019-2020 school year to 93.1% in the 2022-2023 school year. This reduction falls below the 95% threshold needed to achieve herd immunity, putting approximately 250,000 kindergartners at risk each year [Data: Reports (5668, 15469, 15703, 15481, 15471, +more)]. #### Impact of the COVID-19 Pandemic The COVID-19 pandemic has had a profound impact on vaccination rates. Efforts to promote COVID-19 vaccinations have been a major focus, with initiatives in regions like Michigan and Illinois offering free meals to students and promoting updated COVID-19 shots [Data: Reports (19125, 15466)]. However, the pandemic has also disrupted routine vaccination programs, leading to lower coverage for diseases such as measles and contributing to recent outbreaks [Data: Reports (15464, 15674, 15482, 15481, +more)]. #### Regional Variations and Legislative Measures There are notable regional variations in vaccination rates and exemption rates across the United States. For example, West Virginia has implemented strict immunization requirements for children entering school, resulting in high child immunization rates [Data: Reports (5674, 18874, 18899)]. In contrast, states like Minnesota and Illinois have seen increases in non-medical exemptions, contributing to lower vaccination rates and recent outbreaks [Data: Reports (15483, 15481, 108, 2705, +more)]. #### Efforts to Improve Vaccination Rates Various initiatives and legislative measures have been introduced to address declining vaccination rates. For instance, the Government of Sindh introduced a polio vaccination bill that includes provisions for imprisonment for parents who do not vaccinate their children [Data: Reports (15398)]. In the U.S., the CDC has recommended new COVID-19 shots for everyone aged 6 months and older and has launched initiatives to ensure equitable access to vaccines, especially in developing countries [Data: Reports (15847, 15571, 15691, 15694, +more)]. ### Conclusion The dataset reveals a complex landscape of vaccination rates influenced by the COVID-19 pandemic, vaccine hesitancy, misinformation, and regional variations. While efforts to improve vaccination rates are ongoing, the decline in immunization coverage poses significant public health risks, highlighting the need for continued vigilance and proactive measures to ensure high vaccination rates and prevent outbreaks of vaccine-preventable diseases. |
Table 2. Generated response from static search (level 1) and dynamic search (level 3) to the same global question on the AP News dataset.
Table 2 shows an example output from static search at level 1 and dynamic search at level 3 to the same question. While the two outputs contain similar high-level topics, the response from dynamic search provided specific data such as the reduction of vaccination rates in certain demographics. We also notice that the response from dynamic search made significantly more references to the source material, indicated by “[Data Reports]” in the text. By selectively providing information that is relevant to the question, this alleviates the map-reduce operation from having to filter and process all the community reports all at once, and therefore it can generate a response that is more comprehensive and specific to the user question.
Overall, dynamic community selection proposes an alternative method to perform global search in GraphRAG by leveraging the indexed knowledge graph and the usage of cheaper LLM models in the rate relevancy operation. These changes led to lower total token cost and potential improvements to response detail and quality.
You can experiment with dynamic global search on the GraphRAG GitHub repository (opens in new tab).
Dynamic global search is the second of several major optimizations to GraphRAG that are being explored. If you are interested in optimizations for local questions, please check out our recent blog post on DRIFT search. Stay tuned for our upcoming work, where we explore a radically different approach to graph-enabled RAG that is significantly more cost-efficient while improving answer quality for both local and global questions.
The post GraphRAG: Improving global search via dynamic community selection appeared first on Microsoft Research.
]]>Orca-AgentInstruct, from Microsoft Research, can generate diverse, high-quality synthetic data at scale to post-train and fine-tune base LLMs for expanded capabilities, continual learning, and increased performance.
The post Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators appeared first on Microsoft Research.
]]>Our work on Orca and Orca 2 demonstrated the power of using synthetic data for the post-training of small language models and getting them to levels of performance previously found only in much larger language models. Orca-AgentInstruct is another step in this direction, where we explore using agentic flows to generate diverse and high-quality data at scale. Orca-AgentInstruct is an agentic solution for synthetic-data generation. By leveraging an agentic framework, AgentInstruct can generate tailored datasets, comprising both prompts and responses, from raw data sources, paving the way to building a synthetic data factory for model fine-tuning.
The efficacy of this approach is exemplified by the substantial improvement observed by fine-tuning a base Mistral 7-billion-parameter model and using AgentInstruct to generate a 25-million-pair dataset. The fine-tuned model (which we refer to as Orca-3-Mistral) showcases a notable performance gain across multiple benchmarks. For example, it shows 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH, 45% improvement on AlpacaEval, and a 31.34% reduction of inaccurate or unreliable results across multiple summarization benchmarks.
We are making a 1-million-pair subset (orca-agentinstruct-1M) of this dataset publicly available, along with a report describing the data generation procedure, to encourage research on synthetic data generation and finetuning of language models.
Synthetic Data Accelerated LLM Development: Over the past year, using synthetic data has greatly advanced the training of large language models (LLMs). It sped up model training at all stages, from pre-training (e.g., Phi-3) to instruction-tuning (e.g., Orca and WizardLM) and reinforcement learning from human feedback (e.g., Direct Nash Optimization).
Generating high-quality synthetic data is hard: On the other hand, research indicates that pre-training models on synthetic data produced by other models can result in model collapse, causing models to progressively degrade. Similar concerns have been raised regarding the use of synthetic data for post-training, suggesting that it might lead to an imitation process where the trained model learns only stylistic features rather than actual capabilities.
This discrepancy may be attributed to the challenge of generating high-quality and diverse synthetic data. Successful use of synthetic data involves significant human effort in curating and filtering the data to ensure high quality.
Synthetic data meets agents: Another major development we witnessed during the past year is the rise of agentic (especially multi-agent) workflows, such as with AutoGen. Agentic workflows can generate high-quality data, which surpasses the capabilities of the underlying LLMs, by using flows with reflection and iteration that enable agents to look back at solutions, generate critiques, and improve solutions. They can also use tools like search APIs, calculators, and code interpreters to address LLM limitations.
Multi-agent workflows bring in additional benefits as well, such as simulating scenarios where we can generate both new prompts and the corresponding responses. They also enable automation of data-generation workflows, reducing or eliminating the need for unnecessary human intervention on some tasks.
AgentInstruct: Generating synthetic data for post-training or finetuning often relies on an existing prompt set that is either used as is or as seeds for generating more instructions. In this work, we generalize the problem settings to a broader objective of generating an abundant amount of diverse, challenging, and high-quality data to teach a particular skill to an AI model. We refer to this setting as generative teaching.
AgentInstruct is an agentic solution for generative teaching. AgentInstruct uses raw documents as input to create demonstration and feedback data. When generic data is used as seeds, AgentInstruct can be used to teach an LLM a general capability, such as writing, reasoning, or retrieval-augmented generation (RAG). Domain specific data, like retail or finance, can also be used as seeds to improve the model in a certain specialization. AgentInstruct can create:
Using raw data as seeds offers two advantages: it is plentiful, allowing AgentInstruct to generate large-scale and diverse datasets, and it encourages learning general skills instead of benchmark-specific ones by avoiding using existing prompts.
Spotlight: Blog post
We anticipate agentic flows becoming increasingly important throughout the model-training lifecycle, including pre-training, post-training, and specialization, and ultimately enabling the creation of a synthetic data factory for model customization and continuous improvement. This has the potential to drive AI advances across multiple industries by making high-quality model training more efficient and accessible.
Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgou, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah
The post Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators appeared first on Microsoft Research.
]]>The efficient simulation of molecules has the potential to change how the world understands biological systems and designs new drugs and biomaterials. Tong Wang discusses AI2BMD, an AI-based system designed to simulate large biomolecules with speed and accuracy.
The post Abstracts: November 14, 2024 appeared first on Microsoft Research.
]]>Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.
In this episode, Microsoft Senior Researcher Tong Wang joins guest host Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science, to discuss “Ab initio characterization of protein molecular dynamics with AI2BMD.” In the paper, which was published by the scientific journal Nature, Wang and his coauthors detail a system that leverages AI to advance the state of the art in simulating the behavior of large biomolecules. AI2BMD, which is generalizable across a wide range of proteins, has the potential to advance solutions to scientific problems and enhance biomedical research in drug discovery, protein design, and enzyme engineering.
[MUSIC]
BONNIE KRUFT: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.
[MUSIC FADES]
I’m Bonnie Kruft, partner and deputy director of Microsoft Research AI for Science and your host for today. Joining me is Tong Wang, a senior researcher at Microsoft. Tong is the lead author of a paper called “Ab initio characterization of protein molecular dynamics with AI2BMD,” which has just been published by the top scientific journal Nature. Tong, thanks so much for joining us today on Abstracts!
TONG WANG: Thank you, Bonnie.
KRUFT: Microsoft Research is one of the earliest institutions to apply AI in biomolecular simulation research. Why did the AI for Science team choose this direction, and—with this work specifically, AI2BMD—what problem are you and your coauthors addressing, and why should people know about it?
WANG: So as Richard Feynman famously said, “Everything that living things do can be understood in terms of the jigglings and the wigglings of atoms.” To study the mechanisms behind the biological processes and to develop biomaterials and drugs requires a computational approach that can accurately characterize the dynamic motions of biomolecules. When we review the computational research for biomolecular structure, we can get two key messages. First, in recent years, predicting the crystal, or static, protein structures with methods powered by AI has achieved great success and just won the Nobel Prize in Chemistry in the last month. However, characterizing the dynamic structures of proteins is more meaningful for biology, drug, and medicine fields but is much more challenging. Second, molecular dynamics simulation, or MD, is one of the most widely used approaches to study protein dynamics, which can be roughly divided into classical molecular dynamics simulation and quantum molecular dynamics simulation. Both approaches have been developed for more than a half century and won Nobel Prize. Classical MD is fast but less accurate, while quantum MD is very accurate but computationally prohibitive for the protein study. However, we need both the accuracy and the efficiency to detect the biomechanisms. Thus, applying AI in biomolecular simulation can become the third way to achieve both ab initio—or first principles—accuracy and high efficiency. In the winter of 2020, we have foreseen the trend that AI can make a difference in biomolecular simulations. Thus, we chose this direction.
KRUFT: It took four years from the idea to the launch of AI2BMD, and there were many important milestones along the way. First, talk about how your work builds on and/or differs from what’s been done previously in this field, and then give our audience a sense of the key moments and challenges along the AI2BMD research journey.
WANG: First, I’d like to say applying AI in biomolecular simulation is a novel research field. For AI-powered MD simulation for large biomolecules, there is no existing dataset, no well-designed machine learning model for the interactions between the atoms and the molecules, no clear technical roadmap, no mature AI-based simulation system. So we face various new challenges every day. Second, there are some other works exploring this area at the same time. I think a significant difference between AI2BMD and other works is that other works require to generate new data and train the deep learning models for any new proteins. So it takes a protein-specific solution. As a contrast, AI2BMD proposes a generalizable solution for a wide range of proteins. To achieve it, as you mentioned, there are some key milestones during the four-year journey. The first one is we proposed the generalizable protein fragmentation approach that divides proteins into the commonly used 20 kinds of dipeptides. Thus, we don’t need to generate data for various proteins. Instead, we only need to sample the conformational space of such dipeptides. So we built the protein unit dataset that contains about 20 million samples with ab initio accuracy. Then we proposed ViSNet, the graph neural network for molecular geometry modeling as the machine learning potential for AI2BMD. Furthermore, we designed AI2BMD simulation system by efficiently leveraging CPUs and GPUs at the same time, achieving hundreds of times simulation speed acceleration than one year before and accelerating the AI-driven simulation with only ten to a hundred millisecond per simulation step. Finally, we examined AI2BMD on energy, force, free energy, J coupling, and many kinds of property calculations for tens of proteins and also applied AI2BMD in the drug development competition. All things are done by the great team with science and engineering expertise and the great leadership and support from AI for Science lab.
KRUFT: Tell us about how you conducted this research. What was your methodology?
WANG: As exploring an interdisciplinary research topic, our team consists of experts and students with biology, chemistry, physics, math, computer science, and engineering backgrounds. The teamwork with different expertise is key to AI2BMD research. Furthermore, we collaborated and consulted with many senior experts in the molecular dynamics simulation field, and they provided very insightful and constructive suggestions to our research. Another aspect of the methodology I’d like to emphasize is learning from negative results. Negative results happened most of the time during the study. What we do is to constantly analyze the negative results and adjust our algorithm and model accordingly. There’s no perfect solution for a research topic, and we are always on the way.
KRUFT: AI2BMD got some upgrades this year, and as we mentioned at the top of the episode, the work around the latest system was published in the scientific journal Nature. So tell us, Tong—what is new about the latest AI2BMD system?
WANG: Good question. We posted a preliminary version of AI2BMD manuscript on bioRxiv last summer. I’d like to share three important upgrades through the past one and a half year. The first is hundreds of times of simulation speed acceleration for AI2BMD, which becomes one of the fastest AI-driven MD simulation system and leads to perform much longer simulations than before. The second aspect is AI2BMD was applied for many protein property calculations, such as enthalpy, heat capacity, folding free energy, pKa, and so on. Furthermore, we have been closely collaborating with the Global Health Drug Discovery Institute, GHDDI, a nonprofit research institute founded and supported by the Gates Foundation, to leverage AI2BMD and other AI capabilities to accelerate the drug discovery processes.
KRUFT: What significance does AI2BMD hold for research in both biology and AI? And also, what impact does it have outside of the lab, in terms of societal and individual benefits?
WANG: Good question. For biology, AI2BMD provides a much more accurate approach than those used in the past several decades to simulate the protein dynamic motions and study the bioactivity. For AI, AI2BMD proves AI can make a big difference to the dynamic protein structure study beyond AI for the protein static structure prediction. Raised by AI2BMD and other works, I can foresee there is a coming age of AI-driven biomolecular simulation, providing binding free-energy calculation with quantum simulation accuracy for the complex of drug and the target protein for drug discovery, detecting more flexible biomolecular conformational changes that molecular mechanics cannot do, and opening more opportunities for enzyme engineering and vaccine and antibody design.
KRUFT: AI is having a profound influence on the speed and breadth of scientific discovery, and we’re excited to see more and more talented people joining us in this space. What do you want our audience to take away from this work, particularly those already working in the AI for Science space or looking to enter it?
WANG: Good question. I’d like to share three points from my research experience. First is aim high. Exploring a disruptive research topic is better than doing 10 incremental works. In the years of research, our organization always encourages us to do the big things. Second is persistence. I remembered a computer scientist previously said about 90% of the time during research is failure and frustration. The rate is even higher when exploring a new research direction. In AI2BMD study, when we suffered from research bottlenecks that cannot be tackled for several months, when we received critical comments from reviewers, when some team members wanted to give up and leave, I always encourage everyone to persist, and we will make it. More importantly, the foundation of persistence is to ensure your research direction is meaningful and constantly adjust your methodology from failures and critical feedback. The third one is real-world applications. Our aim is to leverage AI for advancing science. Proposing scientific problems is a first step, then developing AI tools and evaluating on benchmarks and, more importantly, examining its usefulness in the real-world applications and further developing your AI algorithms. In this way, you can close the loop of AI for Science research.
KRUFT: And, finally, Tong, what unanswered questions or unsolved problems remain in this area, and what’s next on the agenda for the AI2BMD team?
WANG: Well, I think AI2BMD is a starting point for the coming age of AI-driven MD for biomolecules. There are lots of new scientific questions and challenges coming out in this new field. For example, how to expand the simulated molecules from proteins to other kinds of biomolecules; how to describe the biochemical reactions during the simulations; how to further improve the simulation efficiency and robustness; and how to apply it for more real-world scenarios. We warmly welcome any people from both academic and industrial fields to work together with us to make the joint efforts to push the frontier of this new field moving forward.
[MUSIC]
KRUFT: Well, Tong, thank you for joining us today, and to our listeners, thanks for tuning in. If you want to read the full paper on AI2BMD, you can find a link at aka.ms/abstracts, or you can read it on the Nature website. See you next time on Abstracts!
[MUSIC FADES]
The post Abstracts: November 14, 2024 appeared first on Microsoft Research.
]]>Modular models can democratize AI development while unlocking new benefits and use cases. Modularized AI can be more flexible, more compliant, and cheaper to develop—requiring less data and fewer compute resources to train expert models.
The post Toward modular models: Collaborative AI development enables model accountability and continuous learning appeared first on Microsoft Research.
]]>Today, development of generalizable AI models requires access to sufficient data and compute resources, which may create challenges for some researchers. Democratizing access to technology across the research community can advance the development of generalizable AI models. By applying the core software development concept of modularity to AI, we can build models that are powerful, efficient, adaptable, and transparent.
Until recently, AI models were primarily built using monolithic architecture. Though powerful, these models can be challenging to customize and edit compared to modular models with easily interpretable functional components. Today, developers employ modularity to make services more reliable, faster to refine, and easier for multiple users to contribute to simultaneously. One promising research direction that supports this involves shifting AI development towards a modular approach (opens in new tab), which could enhance flexibility and improve scalability.
One such approach is to use numerous fine-tuned models designed for specific tasks, known as expert models, and coordinate them to solve broader tasks (see Towards Modular LLMs by Building and Reusing a Library of LoRAs – Microsoft Research (opens in new tab), Learning to Route Among Specialized Experts for Zero-Shot Generalization (opens in new tab)). These expert models can be developed in a decentralized way. Similar to the benefits of using a microservice architecture, this modular AI approach can be more flexible, cheaper to develop, and more compliant with relevant privacy and legal policies. However, while substantial research has been done on training optimization, coordination methods remain largely unexplored.
Our team is exploring the potential of modular models by focusing on two themes: i) optimizing the training of expert models and ii) refining how expert models coordinate to form a collaborative model. One method for coordinating expert models is to adaptively select the most relevant independently developed expert models for specific tasks or queries. This approach, called MoErging, is similar to Mixture-of-Experts (MoE) approaches but differs in that the routing mechanism is learned after the individual experts are trained. As an initial step, we contributed to creating a taxonomy for organizing recent MoErging methods with the goal of helping establish a shared language for the research community and facilitating easier and fairer comparisons between different methods.
Most MoErging methods were developed within the past year, so they don’t reference each and are difficult to compare. To enable comparison of MoErging methods, we recently collaborated on a survey that establishes a taxonomy for comparing methods and organizes MoErging design choices into three steps:
Each step is broken down into more detailed choices. For example, in expert design, expert training can be custom or standard, and training data can be private or shared. Custom training requires MoErging to have specific training procedures, while the standard training does not. Similarly, shared data means that the training data must be accessible for routing. Otherwise, the training data is considered private.
The benefits of modular models discussed below assume that training data doesn’t need to be shared. However, a review of current MoErging methods finds that some approaches do require sharing training data, making certain benefits no longer applicable.
Spotlight: AI-POWERED EXPERIENCE
The survey evaluates 29 different MoErging methods using its taxonomy, which categorizes the design choices into two expert design choices, five routing design choices, and two application design options, shown in Figure 1.
One takeaway from the survey is that most MoErging methods can be grouped into four categories based on their routing design choices:
While the differences within each category are minor, the differences across categories are significant because they determine the level of data access required for implementation. As a result, data access is a primary factor in determining which methods are applicable and feasible in various settings.
Our taxonomy also covers recent approaches to building agentic systems, which could be viewed as specific types of MoErging methods where experts are full language models and routing decisions are made on a step-by-step or example-by-example basis. The optimal level for MoErging may vary depending on the task and the computational resources available to each stakeholder.
Modular models can unlock new benefits and use cases for AI, offering a promising approach to addressing challenges in current AI development. Moving forward, further substantial research is needed to validate this potential and assess feasibility.
Modular AI may:
In this blog, we focused on a type of modular approach that involves training foundation models, which requires substantial compute power and large amounts of data. Despite the advantages of modularity, such as increased flexibility, efficiency, and adaptability, the development of foundation models remains resource-intensive, necessitating high-performance computing and robust datasets to support fine-tuning.
Recent work has begun to address these challenges by distributing the pretraining process of foundation models (opens in new tab). Looking ahead, a promising research direction focuses on exploring how to create a minimal dataset for training “empty foundation models” while shifting most of their capabilities to external pluggable modules.
Modular methods are evolving rapidly, and we’re excited by their potential. Modularity has the capacity to democratize AI development, improve model accountability, and support efficient continuous learning. With the MoErging taxonomy, we aim to establish a shared language that fosters engagement within the research community. This research is in the early stages, and we welcome community collaboration. If you’re interested in working with us, please reach out to [email protected].
We would like to thank paper collaborators: Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Edoardo Ponti, Zhan Su, Matheus Pereira, Nicolas Le Roux, Nabil Omi, Siddhartha Sen, Anurag Sarkar, Jordan T. Ash, Oleksiy Ostapenko, and Laurent Charlin.
The post Toward modular models: Collaborative AI development enables model accountability and continuous learning appeared first on Microsoft Research.
]]>