Improving LLMs’ Reasoning In Production - The Structured Approach

This guide on achieving better reasoning performance from LLMs complements our guide on prompt engineering, which explains how to improve LLM setups by systematic and programmatic means, enabling flexibility while keeping operational variability to a minimum. Of course, the language component of a prompt can also be inspected and improved, a practice known as in-context learning (ICL). There is a large body of papers on how to achieve better LLM reasoning performance through this technique, which is all about the choice of words and their sequence in a prompt. For anyone interested, the well-known NLP technique Retrieval-Augmented Generation (RAG) can be seen as a sub-technique within the concept of ICL. Let’s dive into an overview of some of the interesting word games that have been unearthed so far.

Chain of Thought (CoT) Prompting (Zero-shot and Few-shot Examples)

Chain of Thought Zero Shot Example Prompt from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac2efa338e7d6eb96ab22_Untitled (6).png

Chain of Thought Few Shot Example Prompt from the paper:

Zero-shot CoT uses special instructions to trigger CoT prompting without any examples. For example, the instruction “Let’s think step by step” works well for many tasks. The final answer is extracted from the last step of the thinking process.

Few-shot CoT uses some examples of questions and answers with reasoning chains to guide LLMs to use CoT prompting. The examples are given before the actual question. The final answer is also extracted from the last step of the thinking process.

In a nutshell, CoT prompting improves the performance of LLMs on many tasks, especially math and logic problems. However, it does not work for all tasks and sometimes gives wrong or multiple answers.
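To make the two variants concrete, here is a minimal Python sketch that simply builds the prompt strings; the arithmetic questions are illustrative examples in the style of the paper, and you would pass the resulting strings to whichever model client you use.

```python
# Minimal sketch of zero-shot and few-shot CoT prompts. Building the strings is
# all there is to it; send them to your model client of choice.

QUESTION = (
    "A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more. "
    "How many apples does it have now?"
)

# Zero-shot CoT: a trigger phrase is enough to elicit step-by-step reasoning.
zero_shot_prompt = f"Q: {QUESTION}\nA: Let's think step by step."

# Few-shot CoT: prepend worked examples whose answers contain a reasoning chain.
few_shot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {QUESTION}\nA:"
)

# In both cases the final answer is read from the last step of the reasoning chain.
print(zero_shot_prompt)
print(few_shot_prompt)
```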

Reflexion Prompting (Reflect on x first and then reply)

Reflexion Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac29da4b01c81f001a769_Untitled (1).png

Reflexion works by making agents verbally reflect on the feedback they receive from their tasks and use those reflections to improve their performance. Reflexion does not require changing the model weights and can handle different kinds of feedback, whether numeric scores or free-form text. It outperforms baseline agents and is especially effective for tasks such as decision-making and code generation. Because the LLM will flag the output as “reflected”, it is not very useful in vanilla settings for use cases where being perceived as human is important to the success of the product.
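To make the loop concrete, here is a minimal Python sketch. The `call_llm` and `run_checks` helpers are hypothetical placeholders (for example, `run_checks` could be a unit-test runner when the task is code generation); only the overall act, reflect, retry structure is the point.

```python
# Sketch of a Reflexion-style loop: attempt the task, collect feedback,
# reflect verbally on that feedback, then retry with the reflections in the prompt.
# `call_llm` and `run_checks` are hypothetical placeholders you would supply.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your LLM provider.")

def run_checks(attempt: str) -> tuple[bool, str]:
    raise NotImplementedError("Return (passed, feedback), e.g. from a unit-test runner.")

def reflexion_loop(task: str, max_attempts: int = 3) -> str:
    reflections: list[str] = []
    attempt = ""
    for _ in range(max_attempts):
        memory = "\n".join(reflections) or "(none yet)"
        attempt = call_llm(
            f"Task: {task}\n\nReflections from earlier attempts:\n{memory}\n\nYour solution:"
        )
        passed, feedback = run_checks(attempt)
        if passed:
            break
        # Verbal self-reflection on the feedback; no model weights are changed.
        reflections.append(call_llm(
            f"Task: {task}\nAttempt:\n{attempt}\nFeedback: {feedback}\n"
            "Reflect on what went wrong and how to do better on the next attempt."
        ))
    return attempt
```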

Directional Stimulus Prompt (Do x while taking y related to x into account)

Directional Stimulus Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac2aa9cd8d1e54d9fc174_Untitled (2).png

Directional stimulus prompts act as instance-specific additional inputs that guide LLMs toward desired outcomes, such as including specific keywords in a generated summary. It’s very simple and straightforward. Like prompt alternating, you are most likely using this technique in your setups already!
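Here is a minimal sketch of what that looks like in practice. The keyword hint is assumed to come from somewhere upstream, e.g. a small policy model or a plain keyword extractor; the function name and wording are illustrative, not taken from the paper.

```python
# Sketch of a directional stimulus prompt: an instance-specific hint
# (here, keywords) is appended to the task instruction.

def build_summary_prompt(article: str, hint_keywords: list[str]) -> str:
    return (
        f"Article:\n{article}\n\n"
        "Summarize the article in 2-3 sentences.\n"
        f"Hint: make sure the summary mentions {', '.join(hint_keywords)}."
    )

print(build_summary_prompt(
    "Bob Barker returned to host 'The Price Is Right' for one special episode...",
    ["Bob Barker", "The Price Is Right", "one special episode"],
))
```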

Tree of Thoughts (ToT) Prompting (Multiple experts vote on the right answer)

Tree of Thoughts Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac3adf60de0f185288b7b_Untitled (8).png

The Tree of Thoughts technique is effective for solving logic problems, such as finding the most likely location of a lost watch. It asks the LLM to imagine three different experts who are trying to answer a question based on a short story. The LLM has to generate each expert’s steps of thinking, along with their critiques of one another and the likelihood of each assertion. It also has to follow the rules of science and physics, and backtrack or correct itself if it finds a flaw in its logic. ToT is fun for riddles and very specific problems. It’s quite an academic technique and only useful for very specific conversational AI use cases.
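The single-prompt “three experts” variant described above fits in one template. A minimal sketch, with the wording paraphrased rather than quoted from the paper:

```python
# Sketch of the single-prompt "panel of experts" Tree of Thoughts variant.

def build_tot_prompt(question: str) -> str:
    return (
        "Imagine three different experts are answering this question.\n"
        "Each expert writes down one step of their thinking, shares it with "
        "the group, critiques the others, and estimates how likely their own "
        "assertion is, before moving on to the next step.\n"
        "All experts must follow the rules of science and physics, and must "
        "backtrack or correct themselves if they find a flaw in their logic.\n"
        f"Question: {question}"
    )

print(build_tot_prompt(
    "Bob put his watch down while tidying the living room, the kitchen and "
    "the garage. Where is the watch most likely to be now?"
))
```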

Reasoning And Acting (ReAct) Prompting (Combining reasoning step with possibility to act)

ReAct Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac35a5bf4d19bcbe68207_Untitled (7).png

ReAct prompts consist of four components: a primary prompt instruction, ReAct steps, reasoning thoughts, and action commands. Setting up relevant sources of knowledge and APIs that can help the LLM perform the actions is vital. We do not recommend using ReAct in high-stakes environments.
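A stripped-down sketch of such a loop is below. `call_llm` and the single `search` tool are hypothetical placeholders, and the Thought/Action/Observation format is spelled out in the prompt; a real setup would add error handling for exactly the tool failures and action loops discussed later.

```python
# Sketch of a ReAct loop: the model alternates Thought / Action / Observation
# until it emits a final answer. `call_llm` and the tool are placeholders.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your LLM provider.")

def search(query: str) -> str:
    raise NotImplementedError("Return a short snippet from your knowledge source.")

TOOLS = {"search": search}

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(
            transcript
            + "\nContinue with one 'Thought:' line and either an "
              "'Action: tool[input]' line or a 'Final Answer:' line."
        )
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # e.g. "Action: search[Eiffel Tower height]"
            action = step.split("Action:", 1)[1].strip().splitlines()[0]
            tool_name, arg = action.split("[", 1)
            observation = TOOLS[tool_name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```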

Reasoning WithOut Observation (ReWOO) Prompting (Experts with clearly separated roles, e.g. Planner and Solver)

Reasoning WithOut Observation Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac32d5d6af89f75fe18ad_Untitled (4).png

ReWOO performs better than ReAct despite not relying on current and previous observations. ReAct suffers from tool failures, action loops, and lengthy prompts, while ReWOO can generate reasonable plans but sometimes arrives at incorrect expectations or wrong conclusions. Improving the tool responses and the Solver prompt is vital for good ReWOO reasoning performance.
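A minimal sketch of the Planner, Worker, and Solver split follows. `call_llm` and the stub tools are hypothetical placeholders, and the `#E1 = tool[...]` plan notation is a simplification of the scheme used in the paper.

```python
# Sketch of ReWOO: the Planner writes the whole plan up front (no intermediate
# observations), Workers execute the planned tool calls, and the Solver reasons
# over the plan plus the collected evidence. All helpers here are placeholders.
import re

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your LLM provider.")

TOOLS = {
    "search": lambda query: "...stub search snippet...",
    "calculate": lambda expression: "...stub calculation result...",
}

def rewoo(question: str) -> str:
    # 1. Planner: a single LLM call lays out every step and tool call in advance.
    plan = call_llm(
        f"Question: {question}\n"
        "Write a plan as lines of the form '#E1 = search[...]' or "
        "'#E2 = calculate[...]', where later steps may reference earlier "
        "evidence variables such as #E1."
    )
    # 2. Workers: run the planned tool calls and collect the evidence.
    evidence: dict[str, str] = {}
    for var, tool, arg in re.findall(r"(#E\d+) = (\w+)\[(.*?)\]", plan):
        evidence[var] = TOOLS[tool](arg)
    # 3. Solver: a final LLM call combines the plan and the evidence.
    return call_llm(
        f"Question: {question}\nPlan:\n{plan}\nEvidence:\n{evidence}\n"
        "Answer the question using the plan and evidence above."
    )
```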

As an aside, Chain of Hindsight (CoH) is a related feedback-driven technique that is easy to optimize and does not rely on reinforcement learning or reward functions, unlike previous methods.

Chain of Density (CoD) Prompting (Summarizing recursively and keeping equal word length)

Chain of Density Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac3413c2b2cc68da6baf5_Untitled (5).png

CoD is a big deal. We think that semantic compression is the golden goose of AI, a technique that will keep on giving.

Here’s what the prompt instructs the LLM to do:

  • Identify informative entities from the article that are missing from the previous summary.
  • Write a new summary of the same length that covers every entity and detail from the previous summary plus the missing entities.
  • Repeat these two steps five times, making each summary more concise and entity-dense than the previous one.

The purpose of CoD is to write effective summaries that capture the main points and details of a text recursively, making the output denser with every iteration. It uses fusion, compression, and removal of uninformative phrases to make space for additional entities. Of course, the output can be saved in a convenient JSON format for further processing and function calling.
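A minimal sketch of such a prompt is below, with `call_llm` again a hypothetical placeholder; the two-step instruction and the JSON output shape follow the description above, though the exact wording differs from the paper’s prompt.

```python
# Sketch of a Chain of Density prompt: the model repeats the two steps a fixed
# number of times and returns every iteration as JSON for downstream processing.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your LLM provider.")

def chain_of_density(article: str, iterations: int = 5) -> list[dict]:
    prompt = (
        f"Article:\n{article}\n\n"
        f"Repeat the following two steps {iterations} times:\n"
        "1. Identify 1-3 informative entities from the article that are missing "
        "from the previous summary.\n"
        "2. Write a new summary of identical length that covers every entity and "
        "detail from the previous summary plus the missing entities.\n"
        "Use fusion and compression, and remove uninformative phrases to make room.\n"
        'Answer in JSON: a list of {"missing_entities": ..., "denser_summary": ...} '
        "objects, one per iteration."
    )
    # A production setup would validate the JSON before trusting it.
    return json.loads(call_llm(prompt))
```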

Sparring Time With Opsie!

Opsie is our (imaginary) external audit & consulting sparring partner who answers all the naïve and uncomfortable questions. Let’s spar!

Q: Reflexion prompts are not useful in settings where the perception of human likeness is important. Given the increasing need for human-like interactions in customer service, does this limit Reflexion's applicability?

Opsie: Reflexion prompting may not be ideal for applications requiring human-like empathy and conversational nuance. However, it can be invaluable in scenarios requiring rigorous self-analysis and correction, such as data entry validation, educational tools, and automated debugging or technical support systems.

Q: Adding instance-specific inputs can guide LLMs effectively but increase the complexity of prompt management. How would you handle the additional overhead and ensure consistency in the prompts provided?

Opsie: Managing prompt complexity involves using centralized systems to track and standardize prompts. Leveraging modular prompt design patterns and developing robust testing frameworks to validate prompt effectiveness consistently can mitigate the risks of increased complexity.

Q: How well do Directional Stimulus Prompts perform in edge cases where specific keywords or hints may not align with the context of the output? How do you prevent the model from overly focusing on the hint and missing the broader context?

Opsie: Handling edge cases requires incorporating fallback mechanisms that trigger alternative prompts or additional processing layers. Continuous monitoring and retraining with edge case data can help models learn to balance hints with broader context effectively.

Q: How do you envision deploying ToT in practical applications outside riddles and logic problems? Are there specific domain areas where ToT could be particularly beneficial?

Opsie: Beyond riddles and logic problems, ToT can be beneficial in domains requiring structured decision-making or scenario analysis, such as strategic planning, legal reasoning, medical diagnosis support, and financial forecasting, enabling exploration of complex problem spaces comprehensively.

Q: Generating and voting on multiple expert opinions can be resource-heavy. How do you ensure this is computationally viable for real-time applications?

Opsie: Ensuring computational viability involves batch processing, parallel computing techniques, and leveraging hardware accelerators. Hybrid approaches where ToT is applied sparingly or for pre-processed batches can maintain a balance. Micro-optimization to prune less likely successful paths early can also reduce computational load.

Q: Both ReAct and ReWOO can suffer from tool failures, action loops, and lengthy prompts. What mechanisms would you put in place to detect and address these issues proactively?

Opsie: Proactive detection and addressing involve implementing monitoring and exception-handling mechanisms. Using watchdog timers, anomaly detection algorithms, and fallback strategies to simpler prompt methods can help. Continuous logging and analysis of tool performance data enable proactive adjustments.

Q: With ReAct not recommended for high-stakes environments, what criteria would you use to evaluate the risk and determine whether these techniques are suitable for a given use case?

Opsie: Evaluating risk for high-stakes environments involves comprehensive assessment mechanisms, analyzing factors like error tolerance, decision impact, and reliability requirements. Simulation and testing in controlled environments, combined with a phased rollout strategy, can mitigate risk. A human-in-the-loop approach adds an additional safety layer for critical applications.

Q: How do you ensure that critical nuances and contextual richness are not lost while creating entity-dense summaries through semantic compression?

Opsie: Maintaining nuances and context involves using attention mechanisms to weigh key elements more heavily and incorporating redundancy checks where the original context is referenced. This ensures critical information is preserved during compression phases.

Q: CoD involves multiple iterations of summarization, which could be time-consuming. How do you measure the trade-off between precision and performance in time-sensitive applications?

Opsie: Measuring the trade-off between precision and performance involves defining metrics and thresholds. Adaptive iteration approaches, where the depth of summarization is dynamically adjusted based on the context’s complexity and required precision, help maintain efficiency.

Q: How do you seamlessly integrate these prompting techniques with Retrieval-Augmented Generation (RAG) to leverage their combined strengths?

Opsie: Seamless integration with RAG involves optimizing the retrieval phase to align with the specific needs of the prompt method being applied. Unified embeddings, context-aware retrieval systems, query expansion, relevance feedback, and dynamic context adjustment enhance retrieval accuracy and relevance.
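As one concrete, deliberately simplified illustration of that combination: retrieved passages can simply be injected as context ahead of a CoT-style instruction. `retrieve` and `call_llm` are hypothetical placeholders for your vector store and model client.

```python
# Sketch of combining retrieval (RAG) with a CoT-style prompt: retrieved
# passages become the context, followed by the step-by-step instruction.

def retrieve(query: str, k: int = 3) -> list[str]:
    raise NotImplementedError("Return the top-k passages from your vector store.")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Wire this up to your LLM provider.")

def rag_cot_answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return call_llm(
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Using only the context above, let's think step by step."
    )
```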

How To Improve Reasoning For LLMs?

Each approach has different advantages and disadvantages. You need to tinker and work your way up until your use case and the inputs and outputs of your setup are aligned. So which technique is best? In our view, CoT, CoD, Directional Stimuli, and good old RAG are the way to build chatbots. ReAct and ReWOO are more experimental and could be used for content generation.

Each of these techniques offers unique ways to enhance the performance of LLMs, making them more flexible, reliable, and context-aware. They are all based on the principle of prompt engineering, which involves the strategic use of prompts to guide the model's responses pragmatically and with minimal variability.

Start Your Project Today

If this work is of interest to you, then we’d love to talk to you. Please get in touch with our experts and we can chat about how we can help you.

Send us a message and we’ll get right back to you. ->
