Improving LLM Reasoning In Production - The Structured Approach

This guide to better LLM reasoning performance complements our guide on prompt engineering, which explains how to improve LLM setups by systematic and programmatic means so that they stay flexible while operational variability is kept to a minimum. The language component of a prompt can also be inspected and improved. The model then picks up the task from the instructions and examples placed in the prompt itself, which is called in-context learning (ICL). Many papers describe how to improve LLM reasoning performance through this technique, which is all about the choice of words and their sequence in a prompt. For anyone interested, the well-known NLP technique Retrieval-Augmented Generation (RAG) can be seen as a sub-technique within the concept of ICL. Let’s dive into an overview of some interesting word games that have been unearthed so far.

Chain of Thought (CoT) Prompting (Zero shot and Few shot Examples)

Chain of Thought Zero Shot Example Prompt from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac2efa338e7d6eb96ab22_Untitled (6).png

Chain of Thought Few Shot Example Prompt from the paper:

Zero-shot CoT uses special instructions to trigger CoT prompting without any examples. For example, the instruction “Let’s think step by step” works well for many tasks. The final answer is extracted from the last step of the thinking process.

Few-shot CoT uses some examples of questions and answers with reasoning chains to guide LLMs to use CoT prompting. The examples are given before the actual question. The final answer is also extracted from the last step of the thinking process.

In a nutshell, CoT prompting improves the performance of LLMs on many tasks, especially math and logic problems. However, it does not work for all tasks and sometimes gives wrong or multiple answers.
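To make the two variants concrete, here is a minimal Python sketch of how the prompts could be assembled and how a final answer could be pulled from the last reasoning step. The function names, the `Q:`/`A:` layout, and the numeric answer extraction are our own assumptions for illustration and are not taken from the paper.

```python
import re

COT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: no examples, only the trigger instruction."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot(examples, question: str) -> str:
    """Few-shot CoT: worked examples with reasoning chains come before the real question.

    `examples` is a list of (question, reasoning, answer) tuples (hypothetical format).
    """
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_answer(completion: str):
    """Return the last number on the last non-empty line, or None if there is none."""
    lines = [line for line in completion.strip().splitlines() if line.strip()]
    if not lines:
        return None
    numbers = re.findall(r"-?\d+(?:\.\d+)?", lines[-1])
    return numbers[-1] if numbers else None


# Illustrative usage with a made-up example and a made-up model completion.
demo_examples = [
    ("A shelf holds 3 books and 2 more are added. How many books are there?",
     "The shelf starts with 3 books. Adding 2 gives 3 + 2 = 5.", "5"),
]
demo_prompt = few_shot_cot(demo_examples, "A box holds 4 pens and 6 more are added. How many pens are there?")
demo_answer = extract_answer("The box starts with 4 pens.\nAdding 6 gives 4 + 6 = 10.\nThe answer is 10.")
```

Extracting only from the last line mirrors the idea that the final answer sits at the end of the reasoning chain. When the model returns several candidate answers, this simple rule picks one of them silently, so production setups usually need a stricter output format.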

Reflexion Prompting (Reflect on x first and then reply)

Reflexion Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac29da4b01c81f001a769_Untitled (1).png

Reflexion works by making agents verbally reflect on the feedback they receive from their tasks and use those reflections to improve their performance on the next attempt. Reflexion does not require changing the model weights and can handle different kinds of feedback, such as numbers or words. It outperforms baseline agents and is especially effective for tasks such as making decisions and writing code. Because the LLM will flag the output as “reflected”, it is not very useful in vanilla settings for use cases where being perceived as human is important for the success of the product.
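The loop described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: `actor`, `evaluator`, and `reflector` are hypothetical callables that would wrap LLM calls or test runners in a real setup, and the demo functions below are stand-ins that do not call a model.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    task: str,
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str], Tuple[bool, str]],
    reflector: Callable[[str, str, str], str],
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Retry a task, feeding verbal reflections on earlier failures back to the actor."""
    reflections: List[str] = []  # memory lives in the prompt, not in the model weights
    attempt = ""
    for _ in range(max_trials):
        attempt = actor(task, reflections)
        success, feedback = evaluator(attempt)
        if success:
            break
        reflections.append(reflector(task, attempt, feedback))
    return attempt, reflections


# Stand-in components for demonstration only.
def demo_actor(task: str, reflections: List[str]) -> str:
    return "4" if reflections else "5"  # "improves" once a reflection exists


def demo_evaluator(attempt: str) -> Tuple[bool, str]:
    if attempt == "4":
        return True, "correct"
    return False, "the sum is wrong"


def demo_reflector(task: str, attempt: str, feedback: str) -> str:
    return f"My answer '{attempt}' was rejected because {feedback}; recompute carefully."


answer, notes = reflexion_loop("What is 2 + 2?", demo_actor, demo_evaluator, demo_reflector)
```

The `max_trials` cap is a design choice of this sketch: it bounds cost and latency when the evaluator never reports success.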

Directional Stimulus Prompt (Do x while taking y related to x into account)

Directional Stimulus Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac2aa9cd8d1e54d9fc174_Untitled (2).png

Directional stimulus prompts act as instance-specific additional inputs that guide LLMs toward a desired outcome, such as including specific keywords in a generated summary. It’s very simple and straightforward. Like Prompt Alternating, you are most likely using this technique in your setups already!
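As a rough illustration of the keyword case, the following Python sketch adds a per-input hint to a summarization prompt and checks whether the hint was honored. The prompt wording and both function names are assumptions made for this example, not the template from the paper.

```python
from typing import List


def directional_stimulus_prompt(article: str, keywords: List[str], max_sentences: int = 3) -> str:
    """Build a summarization prompt with an instance-specific keyword hint."""
    hint = "; ".join(keywords)
    return (
        f"Article: {article}\n"
        f"Keywords: {hint}\n"
        f"Q: Summarize the article in at most {max_sentences} sentences "
        f"and incorporate the keywords above.\n"
        f"A:"
    )


def missing_keywords(summary: str, keywords: List[str]) -> List[str]:
    """Return the hinted keywords that do not appear in the summary (case-insensitive)."""
    lowered = summary.lower()
    return [k for k in keywords if k.lower() not in lowered]


# Illustrative usage with made-up text.
hint_words = ["battery", "recall"]
dsp_prompt = directional_stimulus_prompt("The maker announced a recall after battery faults.", hint_words)
not_covered = missing_keywords("The company issued a recall.", hint_words)
```

A check like `missing_keywords` is one cheap way to detect outputs that ignored the hint, or to trigger a retry.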

Tree of Thoughts (ToT) Prompting (Multiple experts vote on the right answer)

Tree of Thoughts Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac3adf60de0f185288b7b_Untitled (8).png

The Tree of Thoughts technique is effective for solving logic problems, such as finding the most likely location of a lost watch. It asks the LLM to imagine three different experts who are trying to answer a question based on a short story. The LLM has to generate the steps of thinking for each expert, as well as their critiques and the likelihoods of their assertions. It also has to follow the rules of science and physics, and backtrack or correct itself if it finds a flaw in its logic. ToT is fun for riddles and very specific problems. It’s quite an academic technique and only useful for very specific and conversational AI use cases.
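The single-prompt variant described above can be approximated with a small prompt builder. The rule wording below paraphrases the description in this section and is our own assumption, as are the function name and the sample story; it is not the prompt from the paper.

```python
def tree_of_thoughts_prompt(story: str, question: str, n_experts: int = 3) -> str:
    """Build a single prompt that asks the model to simulate several reasoning experts."""
    rules = [
        f"Imagine {n_experts} different experts are answering the question below.",
        "Each expert writes down one step of their thinking and shares it with the group.",
        "Each expert then critiques the other steps and states the likelihood "
        "that their own current assertion is correct.",
        "All reasoning must respect the laws of science and physics.",
        "If an expert finds a flaw in their logic, they backtrack or correct themselves.",
        "Continue step by step until the experts agree on the single most likely answer.",
    ]
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return f"{numbered}\n\nStory: {story}\nQuestion: {question}"


# Illustrative usage with a made-up story.
tot_prompt = tree_of_thoughts_prompt(
    story="Carlos wore his watch at the pool, put it on a towel, and later shook the towel out on the lawn.",
    question="Where is the watch most likely to be?",
)
```

Because every expert step is generated in one completion, output length grows quickly with `n_experts`, which is worth keeping in mind for latency-sensitive use cases.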

Reasoning And Acting (ReAct) Prompting (Combining reasoning step with possibility to act)

ReAct Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac35a5bf4d19bcbe68207_Untitled (7).png

ReAct prompts consist of four components: a primary prompt instruction, ReAct steps, reasoning thoughts, and action commands. Setting up relevant sources of knowledge and APIs that the LLM can call to perform the actions is vital. We do not recommend using ReAct in high-stakes environments.
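A minimal sketch of the thought, action, and observation cycle might look as follows in Python. The `Action: tool[input]` syntax, the `finish` action, the toy `add` tool, and the scripted stand-in for the model are all assumptions for this example. A real setup would replace `scripted_llm` with an actual model call.

```python
import re
from typing import Callable, Dict, Optional

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")


def react_loop(
    question: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate between model steps (thought + action) and tool observations."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):  # hard cap guards against endless action loops
        step = llm(transcript)  # expected shape: "Thought: ...\nAction: tool[input]"
        transcript += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            return None  # the model produced no parsable action
        name, arg = match.groups()
        if name == "finish":
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"
    return None


# Scripted stand-in for a model, for demonstration only.
script = iter([
    "Thought: I should add the two numbers.\nAction: add[2, 3]",
    "Thought: The observation gives the sum.\nAction: finish[5]",
])


def scripted_llm(transcript: str) -> str:
    return next(script)


demo_tools = {"add": lambda text: str(sum(int(part) for part in text.split(",")))}
react_answer = react_loop("What is 2 + 3?", scripted_llm, demo_tools)
```

Note that the transcript grows with every step, which is one source of the lengthy prompts associated with ReAct.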

Reasoning WithOut Observation (ReWOO) Prompting (Experts with clearly separated roles, e.g. Planner and Solver)

Reasoning WithOut Observation Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac32d5d6af89f75fe18ad_Untitled (4).png

ReWOO performs better than ReAct despite not relying on current and previous observations. ReAct suffers from tool failures, action loops, and lengthy prompts, while ReWOO can generate reasonable plans but sometimes has incorrect expectations or draws wrong conclusions. Good ReWOO performance therefore depends on improving the tool responses and the Solver prompt.
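The separation of roles can be sketched as follows: a Planner writes the whole plan up front with evidence placeholders, a worker fills the placeholders by calling tools, and a Solver receives the plan plus evidence in one prompt. The `#E1 = tool[input]` plan syntax, the function names, and the toy tools are assumptions for this Python example, and the plan is hard-coded where a Planner model would normally produce it.

```python
import re
from typing import Callable, Dict

PLAN_RE = re.compile(r"(#E\d+)\s*=\s*(\w+)\[(.*)\]")


def run_plan(plan: str, tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Execute each plan line in order, substituting earlier evidence into later inputs."""
    evidence: Dict[str, str] = {}
    for line in plan.splitlines():
        match = PLAN_RE.search(line)
        if match is None:
            continue  # free-text "Plan:" lines carry no tool call
        var, tool_name, arg = match.groups()
        # Replace placeholders such as #E1 with the evidence collected so far.
        arg = re.sub(r"#E\d+", lambda m: evidence.get(m.group(0), m.group(0)), arg)
        evidence[var] = tools[tool_name](arg)
    return evidence


def solver_prompt(question: str, plan: str, evidence: Dict[str, str]) -> str:
    """Assemble the single prompt the Solver model would receive."""
    facts = "\n".join(f"{var}: {value}" for var, value in evidence.items())
    return f"Question: {question}\nPlan:\n{plan}\nEvidence:\n{facts}\nAnswer:"


# Hard-coded plan and toy tools, for demonstration only.
demo_plan = (
    "Plan: add the two numbers\n"
    "#E1 = add[2, 3]\n"
    "Plan: double the result\n"
    "#E2 = double[#E1]"
)
demo_tools = {
    "add": lambda text: str(sum(int(part) for part in text.split(","))),
    "double": lambda text: str(2 * int(text)),
}
demo_evidence = run_plan(demo_plan, demo_tools)
demo_solver_prompt = solver_prompt("What is twice 2 + 3?", demo_plan, demo_evidence)
```

Because the model is not called between tool executions, the plan cannot react to a surprising tool result, which is where the incorrect expectations mentioned above can creep in.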

Chain of Hindsight (CoH) is easy to optimize and does not rely on reinforcement learning or reward functions, unlike previous methods.

Chain of Density (CoD) Prompting (Summarizing recursively and keeping equal word length)

Chain of Density Prompt Example from the paper:

https://cdn.prod.website-files.com/63811f0b654325b008a7c1dc/650ac3413c2b2cc68da6baf5_Untitled (5).png

CoD is a big deal. We think that semantic compression is the golden goose of AI, a technique that will keep on giving.

Here’s what the prompt instructs the LLM to do:

  • Identify informative entities from the article that are missing from the previous summary.
  • Write a new summary of the same length that covers every entity and detail from the previous summary plus the missing entities.
  • Repeat these two steps five times, making each summary more concise and entity-dense than the previous one.

The purpose of CoD is to write effective summaries that capture the main points and details of a text recursively, making the output denser with every iteration. It uses fusion, compression, and removal of uninformative phrases to make space for additional entities. Of course, the output can be saved in a convenient JSON format for further processing and function calling.
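The steps above can be sketched as a prompt builder plus a small parser for the JSON output. The prompt wording paraphrases the steps listed in this section, and the JSON keys `missing_entities` and `denser_summary` as well as the function names are assumptions for this Python example rather than a quotation of the paper's prompt.

```python
import json


def chain_of_density_prompt(article: str, rounds: int = 5) -> str:
    """Build a prompt that asks for repeated, increasingly entity-dense summaries."""
    return (
        f"Article: {article}\n\n"
        "You will write increasingly concise, entity-dense summaries of the article above.\n"
        f"Repeat the following 2 steps {rounds} times.\n"
        "Step 1. Identify informative entities from the article that are missing "
        "from the previous summary.\n"
        "Step 2. Write a new summary of the same length that covers every entity and "
        "detail from the previous summary plus the missing entities.\n"
        'Answer in JSON: a list of objects with the keys "missing_entities" and "denser_summary".'
    )


def final_summary(completion: str) -> str:
    """Parse the JSON list returned by the model and keep the last, densest summary."""
    steps = json.loads(completion)
    return steps[-1]["denser_summary"]


# Illustrative usage with a made-up completion.
cod_prompt = chain_of_density_prompt("Some article text.")
demo_completion = json.dumps([
    {"missing_entities": "recall", "denser_summary": "The maker announced a recall."},
    {"missing_entities": "battery", "denser_summary": "The maker recalled units over battery faults."},
])
densest = final_summary(demo_completion)
```

Keeping the intermediate summaries in the parsed list, instead of only the last one, makes it possible to pick a less dense version when the final one reads too compressed.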

Sparring Time With Opsie!

Opsie is an audit & advocatus diaboli sparring partner.

Given the increasing need for human-like interactions in customer service, does this limit Reflexion's applicability?

Reflexion prompting may not be ideal for applications requiring human-like empathy and conversational nuance. However, it can be invaluable in scenarios requiring rigorous self-analysis and correction, such as data entry validation, educational tools, and automated debugging or technical support systems.

How would you handle the additional overhead and ensure consistency in the prompts provided?

Managing prompt complexity involves using centralized systems to track and standardize prompts. Leveraging modular prompt design patterns and developing robust testing frameworks to validate prompt effectiveness consistently can mitigate the risks of increased complexity.

With Directional Stimulus prompts, how do you prevent the model from overly focusing on the hint and missing the broader context?

Handling edge cases requires incorporating fallback mechanisms that trigger alternative prompts or additional processing layers. Continuous monitoring and retraining with edge case data can help models learn to balance hints with broader context effectively.

Are there specific domain areas where ToT could be particularly beneficial?

Beyond riddles and logic problems, ToT can be beneficial in domains requiring structured decision-making or scenario analysis, such as strategic planning, legal reasoning, medical diagnosis support, and financial forecasting, enabling exploration of complex problem spaces comprehensively.

How do you ensure this is computationally viable for real-time applications?

Ensuring computational viability involves batch processing, parallel computing techniques, and leveraging hardware accelerators. Hybrid approaches where ToT is applied sparingly or for pre-processed batches can maintain a balance. Micro-optimization to prune less likely successful paths early can also reduce computational load.

What mechanisms would you put in place to detect and address these issues proactively?

Proactive detection and addressing involve implementing monitoring and exception-handling mechanisms. Using watchdog timers, anomaly detection algorithms, and fallback strategies to simpler prompt methods can help. Continuous logging and analysis of tool performance data enable proactive adjustments.
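As one possible shape for such a safeguard, the Python sketch below wraps a tool-using pipeline in a timeout and falls back to a simpler prompt method when the pipeline is slow or raises an error. The function names, the logger name, and the stand-in pipelines are hypothetical. A production watchdog would also need cancellation, since this sketch does not stop a worker thread that has timed out.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

logger = logging.getLogger("prompt_watchdog")


def run_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    question: str,
    timeout_s: float = 5.0,
) -> str:
    """Run the primary pipeline under a timeout and use the fallback on any failure."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, question)
    try:
        return future.result(timeout=timeout_s)
    except Exception as exc:  # covers the timeout and errors raised inside `primary`
        logger.warning("primary pipeline failed (%r); using fallback", exc)
        return fallback(question)
    finally:
        pool.shutdown(wait=False)  # do not block on a worker that is still running


# Stand-in pipelines for demonstration only.
def failing_agent(question: str) -> str:
    raise RuntimeError("tool unavailable")


def slow_agent(question: str) -> str:
    time.sleep(0.2)
    return f"[agent] {question}"


def simple_prompt(question: str) -> str:
    return f"[zero-shot] {question}"


after_error = run_with_fallback(failing_agent, simple_prompt, "What is 2 + 3?")
after_timeout = run_with_fallback(slow_agent, simple_prompt, "What is 2 + 3?", timeout_s=0.05)
```

The logged warnings double as the performance data mentioned above: counting them per tool shows which integrations fail or stall most often.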

With ReAct not recommended for high-stakes environments, what criteria would you use to evaluate the risk and determine whether these techniques are suitable for a given use case?

Evaluating risk for high-stakes environments involves comprehensive assessment mechanisms, analyzing factors like error tolerance, decision impact, and reliability requirements. Simulation and testing in controlled environments, combined with a phased rollout strategy, can mitigate risk. A human-in-the-loop approach adds an additional safety layer for critical applications.

How do you ensure that critical nuances and contextual richness are not lost while creating entity-dense summaries through semantic compression?

Maintaining nuances and context involves using attention mechanisms to weigh key elements more heavily and incorporating redundancy checks where the original context is referenced. This ensures critical information is preserved during compression phases.

How do you measure the trade-off between precision and performance in time-sensitive applications?

Measuring the trade-off between precision and performance involves defining metrics and thresholds. Adaptive iteration approaches, where the depth of summarization is dynamically adjusted based on the context’s complexity and required precision, help maintain efficiency.

How do you seamlessly integrate these prompting techniques with Retrieval-Augmented Generation (RAG) to leverage their combined strengths?

Seamless integration with RAG involves optimizing the retrieval phase to align with the specific needs of the prompt method being applied. Unified embeddings, context-aware retrieval systems, query expansion, relevance feedback, and dynamic context adjustment enhance retrieval accuracy and relevance.

How To Improve Reasoning For LLMs?

Each approach has different advantages and disadvantages. You need to tinker and work your way up until your use case and the input/output of your setup are aligned. So which technique is best? In our view, CoT, CoD, Directional Stimulus prompts and good old RAG are the way to build chatbots. ReAct and ReWOO are more experimental and could be used for content generation.

Each of these techniques offers unique ways to enhance the performance of LLMs, making them more flexible, reliable, and context-aware. They are all based on the principle of prompt engineering, which involves the strategic use of prompts to guide the model's responses pragmatically and with minimal variability.
