Most Retrieval-Augmented Generation (RAG) pipelines today stop at a single task – retrieve, generate, and respond. That model works, but it's not intelligent. It doesn't adapt, retain memory, or coordinate reasoning across multiple tools. That's where Agentic AI RAG changes the game.

A Smarter Architecture for Adaptive Reasoning

In a traditional RAG setup, the LLM acts as a passive generator. In an Agentic RAG system, it becomes an active problem-solver, supported by a network of specialized components that collaborate like an intelligent team. Here's how it works:

Agent Orchestrator – The decision-maker that interprets user intent and routes requests to the right tools or agents. It's the core logic layer that turns a static flow into an adaptive system.

Context Manager – Maintains awareness across turns, retaining relevant context and passing it to the LLM. This eliminates "context resets" and improves answer consistency over time.

Memory Layer – Divided into Short-Term (session-based) and Long-Term (persistent or vector-based) memory, it allows the system to learn from experience. Every interaction strengthens the model's knowledge base.

Knowledge Layer – The foundation. It combines similarity search, embeddings, and multi-granular document segmentation (sentence, paragraph, recursive) for precision retrieval.

Tool Layer – Includes the Search Tool, Vector Store Tool, and Code Interpreter Tool, each acting as a functional agent that executes specialized tasks and returns structured outputs.

Feedback Loop – Every user response feeds insights back into the vector store, creating a continuous learning and improvement cycle.

Why It Matters

Agentic RAG transforms an LLM from a passive responder into a cognitive engine capable of reasoning, memory, and self-optimization. This shift isn't just technical, it's strategic. It defines how AI systems will evolve inside organizations: from one-off assistants to adaptive agents that understand context, learn continuously, and execute with autonomy.
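To make the architecture above concrete, here is a minimal, framework-free Python sketch of the orchestrator / memory / tool-layer pattern. All names, the routing heuristic, and the stubbed retrieval, code-execution, and generation calls are illustrative assumptions, not any specific product's API.

```python
# Minimal sketch of the orchestrator / memory / tool-layer idea described above.
# Retrieval, code execution, and the LLM call are stubbed out.

from dataclasses import dataclass, field

@dataclass
class Memory:
    short_term: list = field(default_factory=list)   # session turns
    long_term: list = field(default_factory=list)    # persisted insights

    def remember(self, turn: str) -> None:
        self.short_term.append(turn)

def vector_store_tool(query: str) -> str:
    # Placeholder for similarity search over an embedded document store.
    return f"[retrieved chunks relevant to: {query}]"

def code_interpreter_tool(query: str) -> str:
    # Placeholder for executing generated code in a sandbox.
    return f"[computed result for: {query}]"

def orchestrate(query: str, memory: Memory) -> str:
    """Route the request to a tool, assemble context, and 'generate'."""
    # Toy intent routing: analytic requests go to the interpreter,
    # everything else goes to retrieval.
    tool = code_interpreter_tool if "calculate" in query.lower() else vector_store_tool
    evidence = tool(query)

    # Context manager: combine recent session memory with fresh evidence.
    context = " | ".join(memory.short_term[-3:] + [evidence])

    answer = f"Answer({query}) using context: {context}"  # stand-in for the LLM call
    memory.remember(f"user: {query} -> {answer}")          # feedback loop
    return answer

if __name__ == "__main__":
    mem = Memory()
    print(orchestrate("What does our refund policy say?", mem))
    print(orchestrate("Calculate churn for Q3", mem))
```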
AI Language Processing
Explore top LinkedIn content from expert professionals.
-
I shared a new tutorial + experiments on finetuning LLMs for classification efficiently. In this video, I explain how to convert a decoder-style LLM into a classifier. Many business problems are text classification problems, and if classification is all we need for a given task, using "smaller" and cheaper LLMs makes a lot of sense! (But, of course, also always run a simple logistic regression or naive Bayes baseline to determine if you even need a small LLM.)

🧪 In addition, I also ran a series of 19 experiments to answer some "what if" questions around finetuning pretrained LLMs for classification. Here, I kept things simple and small (e.g., GPT-2 on a toy binary classification task). Here's a snapshot summary of some of the interesting ones:

1) As expected, training on the last token yields much better performance than the first
2) Training the last transformer block is way better than just the last layer
3) LoRA performs on par with or better than full finetuning, while being faster and more memory-efficient
4) Padding to full context length hurts performance
5) No padding or smart position selection leads to consistently higher accuracy
6) Surprisingly, training from random weights isn't much worse than starting from pretrained weights
7) Averaging embeddings over all tokens can improve performance slightly at little cost

The full video is available here: https://lnkd.in/gcfqR2mH

PS: If you are wondering why GPT instead of BERT: you can of course also use BERT. Based on experiments on the 50k Movie Review dataset, it's interesting that this 3x smaller LLM performs on par with BERT (actually slightly better). (ModernBERT, then again, is 2% better.)
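As a rough illustration of that recipe (not the code from the video), here is a small PyTorch/Transformers sketch that attaches a classification head to GPT-2, unfreezes only the last transformer block, and reads logits off the last non-padded token; the dataset, training loop, and hyperparameters are left out.

```python
# Sketch of the "decoder LLM as classifier" recipe: swap the LM head for a
# small classification head and read logits from the LAST token position.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

class GPT2Classifier(torch.nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        # Freeze everything, then unfreeze only the last transformer block
        # (finding #2 above: better than tuning just the output layer).
        for p in self.backbone.parameters():
            p.requires_grad = False
        for p in self.backbone.h[-1].parameters():
            p.requires_grad = True
        self.head = torch.nn.Linear(self.backbone.config.n_embd, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Index the last *real* (non-padded) token per sequence (finding #1).
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.head(last_hidden)

model = GPT2Classifier()
batch = tokenizer(["great movie", "terrible plot"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)  # (2, num_classes)
```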
-
Fine-tuned larger language models and longer context lengths eliminate the need for retrieval from external knowledge/vector databases, right? ... Not quite! NVIDIA asked the same question last month. They published a new paper (https://lnkd.in/gfn3Jubc) examining how well very large finetuned LLMs with longer context lengths compare to shorter-context, RAG-supported LLMs.

They explore two main questions:
1. Retrieval augmentation versus a long context window: which one is better for downstream tasks?
2. Can both methods be combined to get the best of both worlds?

In short, they found:
1. RAG outperforms long context alone
2. Yes, they perform better together; RAG works better with longer context than with shorter context

The main finding presented in the paper was that "retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes".

Some more details:
1. RAG matters more than context window size: an LLM with a 4K context window using simple retrieval augmentation at generation can achieve performance comparable to a finetuned LLM with a 16K context window
2. RAG is also faster: augmenting generation with retrieval not only performs better but also requires significantly less computation and is much faster at generation
3. RAG works even better as parameter count increases, because smaller 6-7B LLMs have relatively worse zero-shot capability to incorporate the retrieved chunked context. Perhaps counterintuitively, the performance benefits of RAG are more pronounced the larger the language model gets; experiments were done with 43B- and 70B-parameter LLMs
4. RAG works even better as context length increases: retrieval-augmented long-context LLMs (e.g., 16K and 32K) can obtain better results than a retrieval-augmented 4K-context LLM, even when fed the same top-5 chunks of evidence
5. Retrieval-augmented LLaMA2-70B with a 32K context window outperforms GPT-3.5-turbo-16k, Davinci-003, and the non-retrieval LLaMA2-70B-32k baseline on question answering and query-based summarization
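For intuition, here is a toy sketch of the "retrieval augmentation at generation" setup the paper evaluates: rank candidate chunks against the query, keep the top k, and prepend them to the prompt so even a 4K-context model sees the most relevant evidence. The bag-of-words scorer below is a stand-in for a real embedding model and retriever.

```python
# Toy retrieval-augmented prompt assembly; the scorer is a placeholder for
# a real embedding-based retriever, and the chunks are made up.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(query: str, chunks: list[str], top_k: int = 5, budget_chars: int = 4000) -> str:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]
    evidence = "\n".join(ranked)[:budget_chars]   # stay inside the context window
    return f"Use the evidence to answer.\n\nEvidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"

chunks = ["The 2023 report covers revenue.", "Context windows grew to 32K.",
          "Retrieval picks the most relevant passages.", "Unrelated note about lunch."]
print(build_prompt("How does retrieval pick passages?", chunks, top_k=2))
```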
-
Few-shot Text Classification predicts the label of a given text after training on just a handful of labeled examples. It's a powerful technique for real-world situations with scarce labeled data. SetFit is a fast, accurate few-shot NLP classification model perfect for intent detection in GenAI chatbots.

In the pre-ChatGPT era, Intent Detection was an essential aspect of chatbots like Dialogflow. Chatbots would only respond to intents or topics that the developers explicitly programmed, ensuring they would stick closely to their intended use and prevent prompt injections. OpenAI's ChatGPT changed that with its incredible reasoning abilities, which allowed an LLM to decide how to answer users' questions on various topics without explicitly programming a flow for handling each topic. You just "prompt" the LLM on which topics to respond to and which to decline, and let the LLM decide. However, numerous examples in the post-ChatGPT era have repeatedly shown how finicky a pure prompt-based approach is.

In my journey working with LLMs over the past year+, one of the most reliable methods I've found to restrict LLMs to a desired domain is to follow a 2-step approach that I've spoken about in the past: https://lnkd.in/g6cvAW-T
1. Preprocessing guardrail: an LLM call and heuristic rules to decide if the user's input is from an allowed topic.
2. LLM call: the chatbot logic, such as Retrieval-Augmented Generation.

The downside of this approach is the significant latency added by the additional LLM call in step 1. The solution is simple: replace the LLM call with a lightweight model that detects if the user's input is from an allowed topic. In other words, good old Intent Detection! With SetFit, you can build a highly accurate multi-label text classifier with as few as 10-15 examples per topic, making it an excellent choice for label-scarce intent detection problems. Following the documentation from the links below, I could train a SetFit model in seconds and get an inference time of <50ms on the CPU! If you're using an LLM as a few- or zero-shot classifier, I recommend checking out SetFit instead!

SetFit Paper: https://lnkd.in/gy88XD3b
SetFit GitHub: https://lnkd.in/gC8br-EJ
🤗 SetFit Few-Shot Learning Blog on Hugging Face: https://lnkd.in/gaab_tvJ
🤗 SetFit Multi-Label Classification: https://lnkd.in/gz9mw4ey
🗣️ Intents in DialogFlow: https://lnkd.in/ggNbzxH6

Follow me for more tips on building successful ML and LLM products!
Medium: https://lnkd.in/g2jAJn5
X: https://lnkd.in/g_JbKEkM
#generativeai #llm #nlp #artificialintelligence #mlops #llmops
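Here is a hedged sketch of that 2-step setup with SetFit as the guardrail. The intent labels, example texts, and base checkpoint are illustrative, and the SetFitTrainer interface shown is the pre-1.0 SetFit API, so details may differ across library versions.

```python
# Sketch: SetFit replaces the preprocessing LLM call in the guardrail step.
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# A handful of labeled examples per intent (0 = billing, 1 = tech support, 2 = out of scope).
train_ds = Dataset.from_dict({
    "text": ["I was charged twice", "update my payment method",
             "my app keeps crashing", "wifi sync is broken",
             "tell me a joke", "write my homework essay"],
    "label": [0, 0, 1, 1, 2, 2],
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=train_ds)
trainer.train()

def handle(user_input: str) -> str:
    intent = int(model.predict([user_input])[0])
    if intent == 2:   # step 1: preprocessing guardrail declines off-topic input
        return "Sorry, I can only help with billing or technical support."
    return f"(route to the RAG chatbot with intent={intent})"   # step 2: the real LLM call

print(handle("my invoice looks wrong"))
```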
-
Fine-tuning a model with just a prompt sounds like a joke until you try it. Prompt engineering with a general-purpose model can only get you so far. Prompt engineering influences how a model uses its knowledge, but it does not introduce new knowledge into the mix. If you want complete control over the results of your model, you need fine-tuning. But fine-tuning is hard:
• You need a curated dataset (hard)
• You need distributed training pipelines (hard + expensive)
• You need a lot of compute (hard)
Fine-tuning takes time, money, and skill, and most companies are short on all three.

Here is where the idea of vibe-tuning comes in. Vibe-tuning is a method for fine-tuning a small language model using only a natural language prompt. You describe what you want, and the tuner generates synthetic data, sets up distillation, fine-tunes the model, and evaluates the results. The first time I heard about this was from DistilLabs. They are currently automating the entire fine-tuning process:
1. You provide a prompt describing the task
2. The platform generates and labels synthetic training data
3. You pick a Teacher model (say gpt-oss-120b) and a Student model (say llama-3.2-3B)
4. The platform distills, fine-tunes, benchmarks, and delivers a downloadable small language model
5. You can deploy this model and start using it right away

The technique builds on model distillation: transferring knowledge from a large "teacher" model to a compact "student" model that's cheaper and faster. Honestly, this is huge. You can literally teach a model your company's tone, classification rules, or tool-calling logic by writing a few sentences in English. Here is an article explaining how this works: https://lnkd.in/eDNTBg2F
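For a sense of what such a pipeline does under the hood, here is a rough, framework-free sketch (not DistilLabs' actual code): a large teacher synthesizes labeled examples from the task prompt, then a small student is fine-tuned on them. call_teacher and finetune_student are hypothetical stand-ins.

```python
# Rough sketch of a prompt-to-model distillation pipeline; both functions are
# hypothetical placeholders, not a real platform API.

TASK_PROMPT = "Classify support tickets as 'billing', 'bug', or 'other'."

def call_teacher(prompt: str, n: int) -> list[dict]:
    # In a real pipeline this would hit a large teacher model and ask it to
    # generate diverse inputs plus labels for the task described in the prompt.
    return [{"text": f"synthetic ticket {i}", "label": "other"} for i in range(n)]

def finetune_student(examples: list[dict], student: str) -> str:
    # Placeholder for supervised fine-tuning / distillation of the small model.
    print(f"fine-tuning {student} on {len(examples)} synthetic examples")
    return f"{student}-finetuned"

synthetic = call_teacher(TASK_PROMPT, n=500)
model_id = finetune_student(synthetic, student="llama-3.2-3B")
print("deployable model:", model_id)
```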
-
Exciting New Research: Injecting Domain-Specific Knowledge into Large Language Models

I just came across a fascinating comprehensive survey on enhancing Large Language Models (LLMs) with domain-specific knowledge. While LLMs like GPT-4 have shown remarkable general capabilities, they often struggle with specialized domains such as healthcare, chemistry, and legal analysis that require deep expertise. The researchers (Song, Yan, Liu, and colleagues) have systematically categorized knowledge injection methods into four key paradigms:

1. Dynamic Knowledge Injection - This approach retrieves information from external knowledge bases in real time during inference, combining it with the input for enhanced reasoning. It offers flexibility and easy updates without retraining, though it depends heavily on retrieval quality and can slow inference.

2. Static Knowledge Embedding - This method embeds domain knowledge directly into model parameters through fine-tuning. PMC-LLaMA, for instance, extends LLaMA 7B by pretraining on 4.9 million PubMed Central articles. While offering faster inference without retrieval steps, it requires costly updates when knowledge changes.

3. Modular Knowledge Adapters - These introduce small, trainable modules that plug into the base model while keeping original parameters frozen (see the sketch after this post). This parameter-efficient approach preserves general capabilities while adding domain expertise, striking a balance between flexibility and computational efficiency.

4. Prompt Optimization - Rather than retrieving external knowledge, this technique focuses on crafting prompts that guide LLMs to leverage their internal knowledge more effectively. It requires no training but depends on careful prompt engineering.

The survey also highlights impressive domain-specific applications across biomedicine, finance, materials science, and human-centered domains. For example, in biomedicine, domain-specific models like PMC-LLaMA-13B significantly outperform general models like LLaMA2-70B by over 10 points on the MedQA dataset, despite having far fewer parameters.

Looking ahead, the researchers identify key challenges, including maintaining knowledge consistency when integrating multiple sources and enabling cross-domain knowledge transfer between distinct fields with different terminologies and reasoning patterns. This research provides a valuable roadmap for developing more specialized AI systems that combine the broad capabilities of LLMs with the precision and depth required for expert domains. As we continue to advance AI systems, this balance between generality and specialization will be crucial.
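To illustrate paradigm 3, here is a minimal PyTorch sketch of a modular knowledge adapter: a small bottleneck module is inserted after a frozen base layer, so only the adapter's weights absorb the domain knowledge. The dimensions and base layer are placeholders, not any particular published adapter design.

```python
# Minimal adapter sketch: freeze the base layer, train only the bottleneck.
import torch

class Adapter(torch.nn.Module):
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = torch.nn.Linear(dim, bottleneck)
        self.up = torch.nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual keeps base behavior

class AdaptedLayer(torch.nn.Module):
    def __init__(self, base: torch.nn.Module, dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep general capabilities frozen
        self.adapter = Adapter(dim)                     # only these params are trained

    def forward(self, x):
        return self.adapter(self.base(x))

layer = AdaptedLayer(torch.nn.Linear(768, 768), dim=768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable params:", trainable)                   # a tiny fraction of the base model
```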
-
AI's Umwelt and the Conditions of Meaning: Interpretation, Cognition, and Alien Epistemology

The "stochastic parrot" critique, introduced by Emily Bender, Timnit Gebru, and colleagues in 2021, has become a dominant framework for denying that large language models possess cognitive capacities. On this view, systems such as GPT generate plausible language by reproducing statistical patterns in training data, without understanding or meaning. Meaning is said to arise only when a human interprets the output. The system itself remains inert with respect to sense.

N. Katherine Hayles challenges this conclusion by revising the conditions under which meaning is said to occur. She defines cognition as a process that interprets information in contexts connected to meaning, a formulation that explicitly decouples cognition from consciousness, intentionality, and symbolic reference. Once interpretation, not subjectivity, becomes the criterion, the question shifts from whether AI understands like humans to how interpretation operates within nonhuman systems.

Hayles draws on the concept of umwelt to describe these bounded horizons of interpretation. An umwelt designates the domain within which information can register as relevant and actionable for a given system. It does not describe an experiential interior or semantic autonomy. Rather, meaning arises within an umwelt as a function of interpretive responsiveness under material limits.

LLMs instantiate such limits materially. Textual input is segmented into tokens and transformed into vectors positioned within a high-dimensional space learned through training. This space does not encode meanings symbolically. It encodes relations of proximity, difference, and contextual salience shaped by patterned co-occurrence across vast corpora. Attention mechanisms modulate these relations dynamically, weighting which vectors matter in a given context. Text generation proceeds through the selection of subsequent tokens based on these weighted relations rather than through retrieval of stored meanings or execution of rules.

Within Hayles's framework, these operations count as interpretation. The system continuously selects among alternatives relative to contextual conditions internal to its architecture and training history. These selections are neither random nor imposed externally at the moment of reading. They emerge within the system's operational horizon. Meaning, in this sense, is generated within the AI's umwelt rather than conferred retroactively by human interpretation.

This does not imply that AI meanings resemble human meanings or that they are accessible in the same way. They are umwelt-relative, shaped by material architecture and inferential constraint. Hayles's point is not to elevate machine language to human status, but to reject the assumption that meaning must mirror human semantics to count at all. This marks an alien epistemology grounded in constraint rather than human semantics.
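As a purely numerical illustration of the mechanism described above (token vectors, attention weights over context, and selection among weighted alternatives), here is a toy sketch; the vocabulary and vectors are invented for the example and make no claim about any real model.

```python
# Toy illustration: context-weighted selection of the next token.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["river", "bank", "money", "water"]
embeddings = {w: rng.normal(size=8) for w in vocab}       # learned positions in vector space

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

context = ["river", "bank"]                               # the system's current horizon
query = embeddings[context[-1]]
attn = softmax(np.array([query @ embeddings[w] for w in context]))   # which context matters
summary = sum(a * embeddings[w] for a, w in zip(attn, context))      # weighted context vector

next_probs = softmax(np.array([summary @ embeddings[w] for w in vocab]))
print(dict(zip(vocab, next_probs.round(3))))              # a selection among alternatives
```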
-
Knowledge Graphs (KGs) have long been the unsung heroes behind technologies like search engines and recommendation systems. They store structured relationships between entities, helping us connect the dots in vast amounts of data. But with the rise of LLMs, KGs are evolving from static repositories into dynamic engines that enhance reasoning and contextual understanding. This transformation is gaining significant traction in the research community. Many studies are exploring how integrating KGs with LLMs can unlock new possibilities that neither could achieve alone. Here are a couple of notable examples:

• Personalized Recommendations with Deeper Insights: Researchers introduced a framework called Knowledge Graph Enhanced Language Agent (KGLA). By integrating knowledge graphs into language agents, KGLA significantly improved the relevance of recommendations. It does this by understanding the relationships between different entities in the knowledge graph, which allows it to capture subtle user preferences that traditional models might miss. For example, if a user has shown interest in Italian cooking recipes, KGLA can navigate the knowledge graph to find connections between Italian cuisine, regional ingredients, famous chefs, and cooking techniques. It then uses this information to recommend content that aligns closely with the user's deeper interests, such as recipes from a specific region in Italy or cooking classes by renowned Italian chefs. This leads to more personalized and meaningful suggestions, enhancing user engagement and satisfaction. (See here: https://lnkd.in/e96EtwKA)

• Real-Time Context Understanding: Another study introduced the KG-ICL model, which enhances real-time reasoning in language models by leveraging knowledge graphs. The model creates "prompt graphs" centered around user queries, providing context by mapping relationships between entities related to the query. Imagine a customer support scenario where a user asks about "troubleshooting connectivity issues on my device." The KG-ICL model uses the knowledge graph to understand that "connectivity issues" could involve Wi-Fi, Bluetooth, or cellular data, and "device" could refer to various models of phones or tablets. By accessing related information in the knowledge graph, the model can ask clarifying questions or provide precise solutions tailored to the specific device and issue. This results in more accurate and relevant responses in real time, improving the customer experience. (See here: https://lnkd.in/ethKNm92)

By combining structured knowledge with advanced language understanding, we're moving toward AI systems that can reason in a more sophisticated way and handle complex, dynamic tasks across various domains. How do you think the combination of KGs and LLMs will influence your business?
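This is neither paper's code, but the underlying pattern is easy to sketch: walk a small knowledge graph around the entities in a query and turn the discovered triples into extra context for the LLM prompt. The graph contents, hop limit, and prompt wording below are made-up illustrations.

```python
# Sketch: build a "prompt graph"-style context block from a tiny KG.
from collections import deque

KG = {
    "connectivity issues": [("can involve", "wifi"), ("can involve", "bluetooth")],
    "wifi": [("fixed by", "router restart")],
    "bluetooth": [("fixed by", "re-pairing the device")],
}

def neighborhood(entity: str, hops: int = 2) -> list[tuple[str, str, str]]:
    triples, seen, queue = [], {entity}, deque([(entity, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue
        for relation, target in KG.get(node, []):
            triples.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return triples

def prompt_with_kg(query: str, entity: str) -> str:
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in neighborhood(entity))
    return f"Known facts:\n{facts}\n\nUser question: {query}\nAnswer using the facts above."

print(prompt_with_kg("How do I troubleshoot connectivity issues?", "connectivity issues"))
```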
-
These 6 Research Papers Are The Pillars Of The AI Systems you use today. They are the reason why AI systems today understand language, solve problems, reason step by step, and scale so effectively. Every AI Engineer should read them.

1. Attention Is All You Need (2017)
* Introduced the Transformer architecture, replacing older RNN/CNN models.
* Allowed models to focus on the most relevant parts of the data through the "attention" mechanism.
* Became the backbone of almost every modern LLM, including GPT, Gemini, and Claude.
* Link: https://lnkd.in/ejMS4ne6

2. BERT: Pre-training of Deep Bidirectional Transformers (2018)
* Introduced masked language modeling: predicting missing words during pretraining.
* Enabled deeper contextual understanding of language.
* Significantly improved performance on tasks like search, classification, and question answering.
* Link: https://lnkd.in/eWKCcPJH

3. Language Models are Few-Shot Learners (GPT-3, 2020)
* Proved that scaling up model size unlocks emergent abilities.
* Showed that models can perform new tasks with just a few examples, without retraining.
* Shifted AI from narrow, task-specific tools to powerful general-purpose systems.
* Link: https://lnkd.in/eW2NsDdh

4. Scaling Laws for Neural Language Models (2020)
* Demonstrated how performance scales predictably with model size, data, and compute.
* Provided a roadmap for building and scaling frontier models.
* Influenced how today's largest LLMs are planned and developed.
* Link: https://lnkd.in/ee-KkEjN

5. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)
* Showed that prompting models to "think step by step" greatly enhances reasoning (see the short prompt sketch below).
* Enabled better performance on complex tasks requiring logical steps.
* Became a core technique in prompting, reasoning pipelines, and agentic AI systems.
* Link: https://lnkd.in/ejsu_mqZ

6. LLaMA: Open and Efficient Foundation Language Models (2023)
* Proved that strong LLMs don't require massive compute resources.
* Delivered efficient and open-source models that perform exceptionally well.
* Sparked the open-source LLM revolution and democratized access to advanced AI.
* Link: https://lnkd.in/eppy7hFu

♻️ Repost this to help your network get started. Follow Sivasankar Natarajan for more.
#GenAI #LLM #AIAgents #AgenticAI
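As a small illustration of the chain-of-thought idea from paper 5, here is a sketch comparing a direct prompt with a step-by-step prompt that includes one worked example; call_llm is a placeholder for whichever model client you use.

```python
# Direct prompt vs. chain-of-thought prompt; call_llm() is a stand-in.

def call_llm(prompt: str) -> str:
    return "(model output)"   # plug in your favorite LLM client here

question = ("A cafe sold 23 coffees in the morning and 3 times as many in the "
            "afternoon. How many coffees did it sell in total?")

direct_prompt = f"Q: {question}\nA:"

cot_prompt = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls does he have?\n"
    "A: Let's think step by step. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {question}\n"
    "A: Let's think step by step."
)

print(call_llm(direct_prompt))
print(call_llm(cot_prompt))   # reasoning steps tend to improve multi-step answers
```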
-
Brilliant in some cases and dumb in others! I'm a heavy user of LLMs for many of the tasks that I do, but… Large Language Models (LLMs) can appear brilliant in some areas and surprisingly bad in others because of the way they are designed and trained.

1. Training Data Bias and Coverage
LLMs are trained on vast amounts of text data from the internet, research papers, books, and code repositories. They perform well in areas where they have seen a lot of high-quality data (e.g., general knowledge, programming, mathematics). However, they struggle in areas where data is sparse, biased, or highly nuanced, leading to gaps in reasoning.

2. Pattern Recognition vs. True Understanding
LLMs are pattern recognition engines, not true reasoning machines. They generate responses based on statistical likelihood rather than deep conceptual understanding. This means they can sound intelligent without actually "thinking," leading to confident but incorrect answers in complex situations.

3. Lack of Real-World Experience
LLMs do not have real-world experience: they cannot observe, experiment, or interact with the physical world. This makes them excellent at answering structured, well-documented questions but bad at reasoning about real-world uncertainties.

4. Difficulty with Logic and Consistency
While LLMs can follow logical rules, they often struggle with multi-step reasoning, consistency across responses, and self-correction. A simple fact recall might be perfect, but when asked to extend logic to a new situation, the model can make obvious mistakes.

5. Overfitting to User Inputs
LLMs tend to mirror the structure and assumptions of the input they receive. If a user provides leading or biased questions, the model may generate an answer that aligns with those biases rather than critically analyzing the question.

6. Struggles with Small-Data Scenarios
LLMs are designed for big-picture knowledge but struggle with specific, small-sample reasoning (e.g., experimental setups, statistical overfitting). They can generalize well over large datasets but may fail in cases that require deep domain expertise.

7. Computational Constraints
LLMs operate under finite compute budgets: they truncate memory, which makes long-term dependencies difficult to track. This can make them great at short, factual questions but weak at complex, multi-step problems requiring extended context.

As for agentic AI doing data science … draw your own conclusion.