Software Performance Optimization

Explore top LinkedIn content from expert professionals.

  • View profile for Benjamin Oldén

    Solopreneurship for Experts | Helping experts turn their expertise into income

    5,175 followers

I've noticed a pattern: Most developers focus on writing clean code. And yes, readable, well-structured code matters. But here's the truth: Clean code means nothing if it's solving the wrong problem. I’ve seen engineers spend days polishing code for a feature that users didn’t need… Optimizing functions that didn’t move the business forward… Refactoring components that weren’t even in use. All because we confuse good-looking code with good thinking. But great developers do this instead: - They pause before writing. - They ask better questions. - They focus on the real problem, not the prettiest solution. If you want to level up as an engineer, don’t just aim for elegant syntax. Aim for meaningful outcomes. Your code shouldn’t just be clean, it should be right. Because great code doesn’t just look good. It drives results. How do you make sure you're solving the right problem before you start coding? P.S. If you’re a software engineer tired of getting ghosted after applying, check out the Software Engineer Resume System. It’s the exact framework I used to turn 150+ applications into 40+ offers, no fluff, just the system that gets results: https://lnkd.in/dqCp4EHw

  • View profile for Sebastian Raschka, PhD

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    214,354 followers

My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling. And finally, we also load openly available pretrained weights into our scratch-built model architecture. Along with this pretraining tutorial, I also have bonus material on speeding up LLM training. These tips apply not just to LLMs but also to other transformer-based models like vision transformers: 1. Instead of saving the causal mask, create the causal mask on the fly to reduce memory usage (here it has minimal effect, but it can add up in long-context models like Llama 3.2 with 131k-input-token support) 2. Use tensor cores (only works for Ampere GPUs like A100 and newer) 3. Use the fused CUDA kernels for `AdamW` by setting `fused=True` 4. Pre-allocate and re-use GPU memory via the pinned memory setting in the data loader 5. Switch from 32-bit float to 16-bit brain float (bfloat16) precision 6. Replace from-scratch implementations of attention mechanisms, layer normalizations, and activation functions with PyTorch counterparts that have optimized CUDA kernels 7. Use FlashAttention for more efficient memory read and write operations 8. Compile the model 9. Optimize the vocabulary size 10. After saving memory with the steps above, increase the batch size (a minimal sketch combining several of these appears below) Video tutorial: https://lnkd.in/gDRycWea PyTorch speed-ups: https://lnkd.in/gChvGCJH
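
    To make a few of these concrete, here is a minimal PyTorch sketch combining tips 2–5 and 8–10. It is illustrative rather than the tutorial's actual code: the tiny stand-in model, the random token data, and the hyperparameters are assumptions, not the scratch-built GPT from the video.

    ```python
    # Minimal sketch of speed-up tips 2-5 and 8-10 (assumes a CUDA GPU; TinyLM and the
    # random token corpus are stand-ins for the tutorial's scratch-built GPT and dataset).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset

    torch.set_float32_matmul_precision("high")        # 2. allow TF32 tensor cores (Ampere and newer)

    VOCAB, CTX = 50304, 256                           # 9. vocab size padded to a multiple of 64
    tokens = torch.randint(0, VOCAB, (512, CTX + 1))  # fake corpus: 512 token sequences
    train_ds = TensorDataset(tokens[:, :-1], tokens[:, 1:])

    train_loader = DataLoader(
        train_ds,
        batch_size=32,                                # 10. raise once the steps below free memory
        pin_memory=True,                              # 4. pinned host memory for faster H2D copies
        shuffle=True,
    )

    class TinyLM(nn.Module):                          # stand-in for the scratch-built GPT
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, 256)
            self.head = nn.Linear(256, VOCAB)
        def forward(self, x):
            return self.head(self.emb(x))

    device = torch.device("cuda")
    model = torch.compile(TinyLM().to(device))        # 8. compile the model

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)  # 3. fused AdamW kernel

    for inputs, targets in train_loader:
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):      # 5. bfloat16 autocast
            loss = F.cross_entropy(model(inputs).flatten(0, 1), targets.flatten())
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    ```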

  • View profile for Aishwarya Srinivasan
    605,256 followers

Most people still think of LLMs as “just a model.” But if you’ve ever shipped one in production, you know it’s not that simple. Behind every performant LLM system, there’s a stack of decisions about pretraining, fine-tuning, inference, evaluation, and application-specific tradeoffs. This diagram captures it well: LLMs aren’t one-dimensional. They’re systems. And each dimension introduces new failure points or optimization levers. Let’s break it down: 🧠 Pre-Training Start with modality. → Text-only models like LLaMA, UL2, PaLM have predictable inductive biases. → Multimodal ones like GPT-4, Gemini, and LaVIN introduce more complex token fusion, grounding challenges, and cross-modal alignment issues. Understanding the data diet matters just as much as parameter count. 🛠 Fine-Tuning This is where most teams underestimate complexity: → PEFT strategies like LoRA and Prefix Tuning help with parameter efficiency, but can behave differently under distribution shift. → Alignment techniques (RLHF, DPO, RAFT) aren’t interchangeable. They encode different human preference priors. → Quantization and pruning decisions will directly impact latency, memory usage, and downstream behavior. ⚡️ Efficiency Inference optimization is still underexplored. Techniques like dynamic prompt caching, paged attention, speculative decoding, and batch streaming make the difference between real-time and unusable. The infra layer is where GenAI products often break. 📏 Evaluation One benchmark doesn’t cut it. You need a full matrix: → NLG (summarization, completion), NLU (classification, reasoning), → alignment tests (honesty, helpfulness, safety), → dataset quality, and → cost breakdowns across training + inference + memory. Evaluation isn’t just a model task, it’s a systems-level concern. 🧾 Inference & Prompting Multi-turn prompts, CoT, ToT, and ICL all behave differently under different sampling strategies and context lengths. Prompting isn’t trivial anymore. It’s an orchestration layer in itself. Whether you’re building for legal, education, robotics, or finance, the “general-purpose” tag doesn’t hold. Every domain has its own retrieval, grounding, and reasoning constraints. ------- Follow me (Aishwarya Srinivasan) for more AI insight and subscribe to my Substack to find more in-depth blogs and weekly updates in AI: https://lnkd.in/dpBNr6Jg

  • View profile for Pavan Belagatti

    AI Evangelist | Technology Leader | Developer Advocate | Speaker | Tech Content Creator | Building the AI Ecosystem

    99,099 followers

    Don't use LLMs blindly—check if they align with your needs. Not all LLMs are created equal. Here’s how to measure whether they’re right for your use case👇 Evaluating LLMs is critical to assess their performance, reliability, and suitability for specific tasks. Without evaluation, it would be impossible to determine whether a model generates coherent, relevant, or factually correct outputs, particularly in applications like translation, summarization, or question-answering. Evaluation ensures models align with human expectations, avoid biases, and improve iteratively. Different metrics cater to distinct aspects of model performance: Perplexity quantifies how well a model predicts a sequence (lower scores indicate better familiarity with the data), making it useful for gauging fluency. ROUGE-1 measures unigram (single-word) overlap between model outputs and references, ideal for tasks like summarization where content overlap matters. BLEU focuses on n-gram precision (e.g., exact phrase matches), commonly used in machine translation to assess accuracy. METEOR extends this by incorporating synonyms, paraphrases, and stemming, offering a more flexible semantic evaluation. Exact Match (EM) is the strictest metric, requiring verbatim alignment with the reference, often used in closed-domain tasks like factual QA where precision is paramount. Each metric reflects a trade-off: EM prioritizes literal correctness, while ROUGE and BLEU balance precision with recall. METEOR and Perplexity accommodate linguistic diversity, rewarding semantic coherence over exact replication. Choosing the right metric depends on the task—e.g., EM for factual accuracy in trivia, ROUGE for summarization breadth, and Perplexity for generative fluency. Collectively, these metrics provide a multifaceted view of LLM capabilities, enabling developers to refine models, mitigate errors, and align outputs with user needs. The table’s examples, such as EM scoring 0 for paraphrased answers, highlight how minor phrasing changes impact scores, underscoring the importance of context-aware metric selection. Know more about how to evaluate LLMs: https://lnkd.in/gfPBxrWc Here is my complete in-depth guide on evaluating LLMs: https://lnkd.in/gjWt9jRu Follow me on my YouTube channel so you don't miss any AI topic: https://lnkd.in/gMCpfMKh
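
    As a rough illustration of how a few of these metrics behave, here is a small from-scratch sketch (not the scoring code from the linked guides): exact match, ROUGE-1 F1 from unigram overlap, and perplexity as the exponential of an average per-token cross-entropy. Production evaluations would normally rely on established metric implementations rather than hand-rolled ones.

    ```python
    import math
    from collections import Counter

    def exact_match(prediction: str, reference: str) -> int:
        # Strictest metric: 1 only if the strings match verbatim (after trivial normalization).
        return int(prediction.strip().lower() == reference.strip().lower())

    def rouge1_f1(prediction: str, reference: str) -> float:
        # Unigram overlap between prediction and reference, combined into an F1 score.
        pred_counts = Counter(prediction.lower().split())
        ref_counts = Counter(reference.lower().split())
        overlap = sum((pred_counts & ref_counts).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(pred_counts.values())
        recall = overlap / sum(ref_counts.values())
        return 2 * precision * recall / (precision + recall)

    def perplexity(avg_cross_entropy_nats: float) -> float:
        # Perplexity is the exponential of the average per-token cross-entropy (in nats).
        return math.exp(avg_cross_entropy_nats)

    # A paraphrased answer scores 0 on exact match but still earns partial ROUGE-1 credit.
    print(exact_match("The capital of France is Paris", "Paris"))          # 0
    print(round(rouge1_f1("The capital of France is Paris", "Paris"), 2))  # 0.29
    print(round(perplexity(2.3), 1))                                       # ~10.0
    ```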

  • View profile for Armand Ruiz

    building AI systems

    203,126 followers

    Evaluations —or “Evals”— are the backbone for creating production-ready GenAI applications. Over the past year, we’ve built LLM-powered solutions for our customers and connected with AI leaders, uncovering a common struggle: the lack of clear, pluggable evaluation frameworks. If you’ve ever been stuck wondering how to evaluate your LLM effectively, today's post is for you. Here’s what I’ve learned about creating impactful Evals: 𝗪𝗵𝗮𝘁 𝗠𝗮𝗸𝗲𝘀 𝗮 𝗚𝗿𝗲𝗮𝘁 𝗘𝘃𝗮𝗹? - Clarity and Focus: Prioritize a few interpretable metrics that align closely with your application’s most important outcomes. - Efficiency: Opt for automated, fast-to-compute metrics to streamline iterative testing. - Representation Matters: Use datasets that reflect real-world diversity to ensure reliability and scalability. 𝗧𝗵𝗲 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗼𝗳 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: 𝗙𝗿𝗼𝗺 𝗕𝗟𝗘𝗨 𝘁𝗼 𝗟𝗟𝗠-𝗔𝘀𝘀𝗶𝘀𝘁𝗲𝗱 𝗘𝘃𝗮𝗹𝘀 Traditional metrics like BLEU and ROUGE paved the way but often miss nuances like tone or semantics. LLM-assisted Evals (e.g., GPTScore, LLM-Eval) now leverage AI to evaluate itself, achieving up to 80% agreement with human judgments. Combining machine feedback with human evaluators provides a balanced and effective assessment framework. 𝗙𝗿𝗼𝗺 𝗧𝗵𝗲𝗼𝗿𝘆 𝘁𝗼 𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗲: 𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗘𝘃𝗮𝗹 𝗣𝗶𝗽𝗲𝗹𝗶𝗻𝗲 - Create a Golden Test Set: Use tools like Langchain or RAGAS to simulate real-world conditions. - Grade Effectively: Leverage libraries like TruLens or Llama-Index for hybrid LLM+human feedback. - Iterate and Optimize: Continuously refine metrics and evaluation flows to align with customer needs. If you’re working on LLM-powered applications, building high-quality Evals is one of the most impactful investments you can make. It’s not just about metrics — it’s about ensuring your app resonates with real-world users and delivers measurable value.
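
    A minimal sketch of the kind of pluggable eval loop described above, assuming a golden test set held in memory and a scoring function you supply; the `generate` stub and the exact-match scorer are illustrative stand-ins, not part of any of the frameworks mentioned. An LLM-as-judge or human-review scorer can be swapped in through the same `score` hook.

    ```python
    from statistics import mean
    from typing import Callable

    def run_eval(
        golden_set: list[dict],                  # [{"prompt": ..., "expected": ...}, ...]
        generate: Callable[[str], str],          # your LLM call (API wrapper, local model, ...)
        score: Callable[[str, str], float],      # pluggable metric: (output, expected) -> score
    ) -> float:
        """Run every golden example through the model and return the mean score."""
        return mean(score(generate(ex["prompt"]), ex["expected"]) for ex in golden_set)

    # Example usage with a stubbed model call and a trivial exact-match metric;
    # swap in an LLM-as-judge or human-review scorer for more nuanced criteria.
    if __name__ == "__main__":
        golden = [
            {"prompt": "What is 6 x 7?", "expected": "42"},
            {"prompt": "Capital of France?", "expected": "Paris"},
        ]
        fake_llm = lambda prompt: "42"           # stand-in for a real model call
        exact = lambda out, exp: float(out.strip() == exp.strip())
        print(run_eval(golden, fake_llm, exact))  # 0.5
    ```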

  • View profile for Arjun Jain

    Co-Creating Tomorrow’s AI | Research-as-a-Service | Founder, Fast Code AI | Dad to 8-year-old twins

    34,098 followers

    𝗘𝘅𝗽𝗹𝗮𝗶𝗻 𝗧𝗵𝗶𝘀: 𝗟𝗹𝗮𝗺𝗮 𝟯 𝗡𝗲𝗲𝗱𝘀 𝟮.𝟰𝗧𝗕. 𝗬𝗼𝘂𝗿 𝗚𝗣𝗨 𝗛𝗮𝘀 𝟴𝟬𝗚𝗕. 𝗜𝘁 𝗦𝘁𝗶𝗹𝗹 𝗧𝗿𝗮𝗶𝗻𝘀. Training Llama-3 405B needs ~2.4TB with BF16 + 8-bit Adam: • Weights: 810GB • Gradients: 810GB • Optimizer: 810GB (vs 3.24TB with standard Adam!) • Total: ~2.4TB (Illustrative budget—config-dependent; FP32 masters, ZeRO stage, and offload change totals) Your H100? 80GB. You'd need 30+ GPUs just to hold everything. 𝗧𝗵𝗿𝗲𝗲 𝗧𝗿𝗶𝗰𝗸𝘀 𝗧𝗵𝗮𝘁 𝗠𝗮𝗸𝗲 𝗜𝘁 𝗪𝗼𝗿𝗸 𝟭. 𝗗𝗮𝘁𝗮 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split batch. Problem: Each GPU needs 2.4TB. Fix: ZeRO splits it across N GPUs. 𝟮. 𝗠𝗼𝗱𝗲𝗹 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split layers. Problem: Sequential bottleneck. Fix: Pipeline batches. 𝟯. 𝗦𝗲𝗾𝘂𝗲𝗻𝗰𝗲 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹: Split tokens. This is the game changer. 8K tokens → 8 GPUs → 1K each. But attention needs every token to see all others. 𝗧𝗵𝗲 𝗠𝗮𝗴𝗶𝗰 𝗠𝗼𝗺𝗲𝗻𝘁: Instead of moving the 2.4TB model, GPUs only exchange attention keys/values (K,V). Each GPU: • Computes K,V for its 1K tokens (32MB) • Sends to others via all-to-all • Receives 7×32MB = 224MB total • Computes attention, deletes copies 𝟮𝟮𝟰𝗠𝗕 𝗺𝗼𝘃𝗲𝗱 𝗶𝗻𝘀𝘁𝗲𝗮𝗱 𝗼𝗳 𝟮.𝟰𝗧𝗕. That's 10,000x less. 𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁: Combine all three (ZeRO + tensor + pipeline + sequence parallel). Each GPU holds ~75GB instead of 2.4TB. This exact choreography powers ChatGPT, Claude, and every frontier model. Without it? 10K token limits. With it? Entire books in one context. Not magic. Just brilliant engineering making the impossible routine.
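
    The headline memory numbers can be reproduced with back-of-the-envelope arithmetic. The sketch below just redoes that math (BF16 weights and gradients at 2 bytes per parameter; 8-bit Adam keeping two 1-byte states per parameter versus standard Adam's two FP32 states) and, like the post's budget, ignores activations, FP32 master weights, and other config-dependent overhead.

    ```python
    # Back-of-the-envelope training-memory budget for a 405B-parameter model.
    PARAMS = 405e9
    GB = 1e9

    weights_bf16 = PARAMS * 2 / GB       # 2 bytes/param        -> 810 GB
    grads_bf16   = PARAMS * 2 / GB       # 2 bytes/param        -> 810 GB
    adam_8bit    = PARAMS * 2 * 1 / GB   # 2 states x 1 byte    -> 810 GB
    adam_fp32    = PARAMS * 2 * 4 / GB   # 2 states x 4 bytes   -> 3240 GB

    total_8bit = weights_bf16 + grads_bf16 + adam_8bit
    print(f"weights {weights_bf16:.0f} GB, grads {grads_bf16:.0f} GB, optimizer {adam_8bit:.0f} GB")
    print(f"total ~{total_8bit / 1000:.1f} TB with 8-bit Adam "
          f"(vs {adam_fp32 / 1000:.2f} TB for standard Adam states alone)")
    print(f"80GB H100s needed just to hold it: {total_8bit / 80:.0f}+")
    ```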

  • View profile for Greg Coquillo

    Product Leader @AWS | Startup Investor | 2X Linkedin Top Voice for AI, Data Science, Tech, and Innovation | Quantum Computing & Web 3.0 | I build software that scales AI/ML Network infrastructure

    219,627 followers

You need to check out the Agent Leaderboard on Hugging Face! One question that emerges in the midst of AI agent proliferation is “which LLM actually delivers the most?” You’ve probably asked yourself this as well. That’s because LLMs are not one-size-fits-all. While some models thrive in structured environments, others don’t handle the unpredictable real world of tool calling well. The team at Galileo🔭 evaluated 17 leading models in their ability to select, execute, and manage external tools, using 14 highly-curated datasets. Today, AI researchers, ML engineers, and technology leaders can leverage insights from Agent Leaderboard to build the best agentic workflows. Some key insights that you can already benefit from: - A model can rank well but still be inefficient at error handling, adaptability, or cost-effectiveness. Benchmarks matter, but qualitative performance gaps are real. - Some LLMs excel in multi-step workflows, while others dominate single-call efficiency. Picking the right model depends on whether you need precision, speed, or robustness. - While Mistral-Small-2501 leads OSS, closed-source models still dominate tool execution reliability. The gap is closing, but consistency remains a challenge. - Some of the most expensive models barely outperform their cheaper competitors. Model pricing is still opaque, and performance per dollar varies significantly. - Many models fail not in accuracy, but in how they handle missing parameters, ambiguous inputs, or tool misfires. These edge cases separate top-tier AI agents from unreliable ones. Consider the guidance below to get going quickly: 1- For high-stakes automation, choose models with robust error recovery over just high accuracy. 2- For long-context applications, look for LLMs with stable multi-turn consistency, not just a good first response. 3- For cost-sensitive deployments, benchmark price-to-performance ratios carefully. Some “premium” models may not be worth the cost. I expect this leaderboard to evolve over time to highlight how models improve tool-calling effectiveness for real-world use cases. Explore the Agent Leaderboard here: https://lnkd.in/dzxPMKrv #genai #agents #technology #artificialintelligence
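
    One concrete way the "missing parameters / ambiguous inputs" failure mode is handled in practice is to validate a model's proposed tool call against the tool's schema before executing it. The sketch below is a generic illustration of that idea, not code from the Agent Leaderboard or Galileo; the weather-tool schema is hypothetical.

    ```python
    from typing import Any

    # Hypothetical tool schema: required parameter names mapped to expected types.
    WEATHER_TOOL_SCHEMA = {"city": str, "unit": str}

    def validate_tool_call(args: dict[str, Any], schema: dict[str, type]) -> list[str]:
        """Return a list of problems; an empty list means the call is safe to execute."""
        problems = [f"missing parameter: {name}" for name in schema if name not in args]
        problems += [
            f"wrong type for {name}: expected {t.__name__}"
            for name, t in schema.items()
            if name in args and not isinstance(args[name], t)
        ]
        return problems

    # A model-proposed call with a missing parameter is caught instead of misfiring,
    # so the agent can re-prompt the model rather than crash the workflow.
    proposed = {"city": "Austin"}                       # model forgot "unit"
    issues = validate_tool_call(proposed, WEATHER_TOOL_SCHEMA)
    print(issues or "ok")                               # ['missing parameter: unit']
    ```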

  • View profile for Brij kishore Pandey

    AI Architect | AI Engineer | Generative AI | Agentic AI

    701,609 followers

    A sluggish API isn't just a technical hiccup – it's the difference between retaining and losing users to competitors. Let me share some battle-tested strategies that have helped many  achieve 10x performance improvements: 1. 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝗖𝗮𝗰𝗵𝗶𝗻𝗴 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝘆 Not just any caching – but strategic implementation. Think Redis or Memcached for frequently accessed data. The key is identifying what to cache and for how long. We've seen response times drop from seconds to milliseconds by implementing smart cache invalidation patterns and cache-aside strategies. 2. 𝗦𝗺𝗮𝗿𝘁 𝗣𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻 𝗜𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 Large datasets need careful handling. Whether you're using cursor-based or offset pagination, the secret lies in optimizing page sizes and implementing infinite scroll efficiently. Pro tip: Always include total count and metadata in your pagination response for better frontend handling. 3. 𝗝𝗦𝗢𝗡 𝗦𝗲𝗿𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 This is often overlooked, but crucial. Using efficient serializers (like MessagePack or Protocol Buffers as alternatives), removing unnecessary fields, and implementing partial response patterns can significantly reduce payload size. I've seen API response sizes shrink by 60% through careful serialization optimization. 4. 𝗧𝗵𝗲 𝗡+𝟭 𝗤𝘂𝗲𝗿𝘆 𝗞𝗶𝗹𝗹𝗲𝗿 This is the silent performance killer in many APIs. Using eager loading, implementing GraphQL for flexible data fetching, or utilizing batch loading techniques (like DataLoader pattern) can transform your API's database interaction patterns. 5. 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻 𝗧𝗲𝗰𝗵𝗻𝗶𝗾𝘂𝗲𝘀 GZIP or Brotli compression isn't just about smaller payloads – it's about finding the right balance between CPU usage and transfer size. Modern compression algorithms can reduce payload size by up to 70% with minimal CPU overhead. 6. 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻 𝗣𝗼𝗼𝗹 A well-configured connection pool is your API's best friend. Whether it's database connections or HTTP clients, maintaining an optimal pool size based on your infrastructure capabilities can prevent connection bottlenecks and reduce latency spikes. 7. 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝗟𝗼𝗮𝗱 𝗗𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻 Beyond simple round-robin – implement adaptive load balancing that considers server health, current load, and geographical proximity. Tools like Kubernetes horizontal pod autoscaling can help automatically adjust resources based on real-time demand. In my experience, implementing these techniques reduces average response times from 800ms to under 100ms and helps handle 10x more traffic with the same infrastructure. Which of these techniques made the most significant impact on your API optimization journey?
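
    For the caching point in particular, the cache-aside pattern is the usual starting place: check the cache, fall back to the database on a miss, and write the result back with a TTL. Below is a hedged sketch using redis-py; `fetch_user_from_db`, the key layout, and the TTL are illustrative choices, not a prescribed configuration.

    ```python
    import json
    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    CACHE_TTL_SECONDS = 300  # tune per data volatility; the acceptable staleness window is a product decision

    def fetch_user_from_db(user_id: int) -> dict:
        # Placeholder for the real (slow) database query.
        return {"id": user_id, "name": "example"}

    def get_user(user_id: int) -> dict:
        """Cache-aside: serve from Redis on a hit, read through and write back on a miss."""
        key = f"user:{user_id}"
        cached = r.get(key)
        if cached is not None:                       # cache hit: skip the database entirely
            return json.loads(cached)
        user = fetch_user_from_db(user_id)           # cache miss: hit the database once
        r.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)
        return user
    ```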

  • View profile for Denys Linkov

    Head of ML @ Wisedocs | ML Lecturer @ U of T

    22,006 followers

How well do LLMs solve real coding tasks? Cool new paper from Princeton + UChicago. From the paper's intro: "In the real world, software engineering is not as simple. Fixing a bug might involve navigating a large repository, understanding the interplay between functions in different files, or spotting a small error in convoluted code. Inspired by this, we introduce SWE-bench, a benchmark that evaluates LMs in a realistic software engineering setting" Some interesting parts of the paper: 1) The researchers created a dataset based on GitHub PRs for open issues, trying to represent real coding problems. 2) They use a 3-step pipeline to choose issues: i) PRs from popular Python libraries ii) PRs that resolve an issue and include tests iii) PRs whose solutions pass the tests and whose dependencies work 3) Success is measured by having LLMs generate patches, applying them to the code, and seeing if the tests pass 4) LLMs do poorly right off the bat, achieving sub-5% accuracy for a working solution. "Claude 2 and GPT-4 solve a mere 3.6% and 1.3% of instances respectively, even when provided with an oracle retriever" 5) To make it easier, they give LLMs access to an oracle, which provides more information on how to solve the issue, such as which files to look into, using BM25-based retrieval. 6) They finetune versions of LLaMA on a training set to see how well it does. The models show improvement but still resolve less than 5% of issues. 7) There is even a challenge in correctly applying patches, as noted in the graphic 8) "Generating patches is easier than generating whole files" 9) "Language models tend to generate shorter, simpler edits." 10) "Difficulty does not correlate with issue resolution date." - Other than GPT-4, which might suggest it was trained on some of these PRs Very interesting paper and benchmark, will likely inspire lots of subsequent research.
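
    The success criterion in point 3 boils down to "apply the generated patch, run the repository's tests, and check the exit code." A stripped-down sketch of that mechanism is below; it is an illustration, not SWE-bench's actual evaluation harness, and the repo path, patch file, and test command in the usage comment are placeholders.

    ```python
    import subprocess

    def patch_resolves_issue(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
        """Apply a model-generated patch and report whether the test suite passes."""
        applied = subprocess.run(
            ["git", "apply", patch_file], cwd=repo_dir, capture_output=True
        )
        if applied.returncode != 0:      # many generations fail here: the patch doesn't even apply
            return False
        tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
        return tests.returncode == 0     # resolved only if the issue's tests now pass

    # Illustrative usage (paths and test command are placeholders):
    # patch_resolves_issue("/tmp/some_repo", "model_patch.diff", ["pytest", "-q"])
    ```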

  • View profile for Pooya Eimandar

    Stay Hungry Stay Foolish

    5,725 followers

    A paper released last year by Bilokon and one of his PhD students, Burak Gunduz, looks at 12 techniques for reducing latency in C++ code, as follows: 🚀 Lock-free programming: A concurrent programming paradigm involving multi-threaded algorithms which, unlike their traditional counterparts, do not employ the usage of mutual exclusion mechanisms, such as locks, to arbitrate access to shared resources. 🚀 SIMD instructions: Instructions that take advantage of the parallel processing power of contemporary CPUs, allowing the simultaneous execution of multiple operations. 🚀 Mixing data types: When a computation involves both float and double types, implicit conversions are required. If only float computations are used, performance improves. 🚀 Signed vs unsigned: Ensuring consistent signedness in comparisons to avoid conversions. 🚀 Prefetching: Explicitly loading data into cache before it is needed to reduce data fetch delays, particularly in memory-bound applications. 🚀 Branch reduction: Predicting conditional branch outcomes to allow speculative code execution. 🚀 Slowpath removal: Minimizing the execution of rarely executed code paths. 🚀 Short-circuiting: Logical expressions cease evaluation when the final result is determined. 🚀 Inlining: Incorporating the body of a function at each point the function is called, reducing function call overhead and enabling further optimization by the compiler. 🚀 Constexpr: Computations marked as constexpr are evaluated at compile time, enabling constant folding and efficient code execution by eliminating runtime calculations. 🚀 Compile-time dispatch: Techniques like template specialization or function overloading that ensure optimized code paths are chosen at compile time based on type or value, avoiding runtime dispatch and enabling early optimization decisions. 🚀 Cache warming: To minimize memory access time and boost program responsiveness, data is preloaded into the CPU cache before it’s needed. Reference: https://lnkd.in/dDfYJyw6 #technology #tech #cpp #programming
