This guide on achieving better reasoning performance from LLMs complements our guide on prompt engineering, which explains how to improve LLM setups by systematic and programmatic means, enabling flexibility while keeping operational variability to a minimum. Of course, the language component of a prompt can also be inspected and improved, a practice known as in-context learning (ICL). There is a large body of papers on how to achieve better LLM reasoning performance through this technique, which is all about the choice of words and their sequence in a prompt. For anyone interested, the well-known NLP technique Retrieval-Augmented Generation (RAG) is a sub-technique within the concept of ICL. Let’s dive into an overview of some interesting word games that have been unearthed so far.
Chain of Thought Zero Shot Example Prompt from the paper:
Chain of Thought Few Shot Example Prompt from the paper:
Zero-shot CoT uses special instructions to trigger CoT prompting without any examples. For example, the instruction “Let’s think step by step” works well for many tasks. The final answer is extracted from the last step of the thinking process.
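As a concrete illustration, here is a minimal zero-shot CoT sketch in Python; the `chat` helper, the trigger phrase, and the answer-extraction step are our own assumptions rather than the paper's code.

```python
# Minimal zero-shot CoT sketch. `chat` is a stand-in for whatever completion
# call your stack provides; the trigger phrase and the answer-extraction step
# are illustrative assumptions, not the paper's code.

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call (swap in your own client)."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Step 1: trigger the reasoning chain with the zero-shot instruction.
    reasoning = chat(f"Q: {question}\nA: Let's think step by step.")
    # Step 2: extract the final answer from the last step of the chain.
    return chat(
        f"Q: {question}\nA: Let's think step by step.\n{reasoning}\n"
        "Therefore, the final answer is:"
    ).strip()
```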
Few-shot CoT uses some examples of questions and answers with reasoning chains to guide LLMs to use CoT prompting. The examples are given before the actual question. The final answer is also extracted from the last step of the thinking process.
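A few-shot variant looks much the same, with worked demonstrations prepended to the question. The demonstration below is a classic CoT-style example; `chat` is again a placeholder for your own completion call.

```python
# Few-shot CoT sketch: worked examples with reasoning chains are placed before
# the actual question. The demonstration is a classic CoT-style example.

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call."""
    raise NotImplementedError

FEW_SHOT_EXAMPLES = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def few_shot_cot(question: str) -> str:
    reasoning = chat(f"{FEW_SHOT_EXAMPLES}\nQ: {question}\nA:")
    # The final answer sits at the end of the generated chain, so a simple
    # split on the answer marker is enough for toy tasks.
    return reasoning.rsplit("The answer is", 1)[-1].strip(" .\n")
```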
In a nutshell, CoT prompting improves the performance of LLMs on many tasks, especially math and logic problems. However, it does not work for all tasks and sometimes gives wrong or multiple answers.
Reflexion Prompt Example from the paper:
Reflexion works by making agents verbally reflect on the feedback they receive from their tasks and use those reflections to improve their performance. Reflexion does not require changing the model weights and can handle different kinds of feedback, such as scalar scores or free-form text. It outperforms baseline agents and is especially effective for tasks such as decision-making and code generation. Because the LLM will flag its output as “reflected”, it is not very useful in vanilla settings for use cases where being perceived as human is important for the success of the product.
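A rough sketch of the Reflexion loop, assuming a generic `chat` completion helper and a task-specific `evaluate` callback of your own (unit tests, a score, a judge model, and so on); the prompt wording is ours, not the paper's.

```python
# Rough Reflexion loop: act, get feedback, reflect verbally, retry with the
# accumulated reflections in context. No weights are changed. `chat` and
# `evaluate` are placeholders for your own model call and task feedback.

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call."""
    raise NotImplementedError

def reflexion_loop(task: str, evaluate, max_trials: int = 3) -> str:
    reflections: list[str] = []   # verbal memory across trials
    attempt = ""
    for _ in range(max_trials):
        memory = "\n".join(f"- {r}" for r in reflections) or "(none yet)"
        attempt = chat(f"Task: {task}\nPast reflections:\n{memory}\nYour attempt:")
        ok, feedback = evaluate(attempt)  # returns (success flag, feedback text)
        if ok:
            break
        # Ask the model to reflect on the feedback it just received.
        reflections.append(chat(
            f"Task: {task}\nAttempt: {attempt}\nFeedback: {feedback}\n"
            "In one or two sentences, reflect on what went wrong and how to fix it:"
        ))
    return attempt
```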
Directional Stimulus Prompt Example from the paper:
Directional stimulus prompts act as instance-specific additional inputs that guide LLMs towards desired outcomes, such as including specific keywords in a generated summary. It’s very simple and straightforward. Like Prompt Alternating, you are most likely already using this technique in your setups!
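A minimal sketch of the idea, with the hint passed in by hand rather than generated by the small policy model the paper uses; `chat` and `summarize_with_hint` are hypothetical names.

```python
# Directional stimulus sketch: instance-specific hints (here, keywords) are
# appended to the prompt to steer the output. In the paper the hints come from
# a small tuned policy model; in this sketch they are simply passed in by hand.

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call."""
    raise NotImplementedError

def summarize_with_hint(article: str, keywords: list[str]) -> str:
    hint = "; ".join(keywords)
    return chat(
        f"Article: {article}\n"
        f"Hint (keywords the summary must include): {hint}\n"
        "Write a two-sentence summary that incorporates the hint:"
    )
```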
Tree of Thoughts Prompt Example from the paper:
The Tree of Thoughts technique is effective for solving logic problems, such as finding the most likely location of a lost watch. It asks the LLM to imagine three different experts who are trying to answer a question based on a short story. The LLM has to generate each expert’s steps of thinking, along with their critiques of each other and the likelihoods of their assertions; it also has to follow the rules of science and physics, and backtrack or correct itself if it finds a flaw in its logic. ToT is fun for riddles and very specific problems. It’s quite an academic technique and only useful for very specific conversational AI use cases.
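The “three experts” framing can be packed into a single prompt, as in the hypothetical sketch below; a fuller ToT implementation would branch, score, and prune intermediate thoughts programmatically.

```python
# Condensed "three experts" flavour of Tree of Thoughts, packed into a single
# prompt. A fuller ToT implementation would generate, score, and prune branches
# programmatically; this template wording is our own paraphrase.

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call."""
    raise NotImplementedError

TOT_TEMPLATE = """Imagine three different experts are answering this question.
Each expert writes down one step of their thinking, then shares it with the group.
They critique each other's steps, estimate how likely each assertion is to be
correct, and stick to the rules of science and physics. If any expert realises
their reasoning is flawed, they backtrack, correct themselves, or drop out.
The question is: {question}"""

def tree_of_thoughts(question: str) -> str:
    return chat(TOT_TEMPLATE.format(question=question))
```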
ReAct Prompt Example from the paper:
ReAct prompts consist of four components: a primary prompt instruction, ReAct steps, reasoning thoughts, and action commands. Setting up relevant sources of knowledge and APIs that can help the LLM perform the actions is vital. We do not recommend using ReAct in high-stakes environments.
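A minimal ReAct loop might look like the sketch below, assuming a toy tool registry and a generic `chat` helper; the action grammar, parsing, and stop conditions are our own illustrative choices.

```python
# Minimal ReAct loop: the model alternates Thought and Action lines, we run the
# named tool, feed the Observation back, and stop at Finish.
import re

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call."""
    raise NotImplementedError

TOOLS = {
    "Search": lambda q: f"(search results for: {q})",           # stub; plug in a real API
    "Calculate": lambda e: str(eval(e, {"__builtins__": {}})),  # toy only: never eval untrusted input
}

def react(question: str, max_steps: int = 5) -> str:
    instruction = (
        "Answer the question by interleaving Thought, Action and Observation lines.\n"
        "Available actions: Search[query], Calculate[expression], Finish[answer].\n"
    )
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = chat(instruction + transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        action = re.search(r"(Search|Calculate|Finish)\[(.*?)\]", step)
        if not action:
            continue
        name, arg = action.groups()
        if name == "Finish":
            return arg
        transcript += f"Observation: {TOOLS[name](arg)}\n"
    return transcript  # no Finish reached: return the full trace for inspection
```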
Reasoning WithOut Observation Prompt Example from the paper:
ReWOO performs better than ReAct despite not relying on current and previous observations. ReAct suffers from tool failures, action loops, and lengthy prompts, while ReWOO can generate reasonable plans but sometimes forms incorrect expectations or draws wrong conclusions. Improving the tool responses and the Solver prompt is vital for good ReWOO reasoning performance.
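For comparison with the ReAct sketch above, here is a rough ReWOO-style pipeline under the same assumptions (a `chat` placeholder and a toy tool registry); the plan grammar and the tiny parser are illustrative, not the paper's implementation.

```python
# ReWOO sketch: a Planner writes the whole plan up front with #E placeholders,
# Workers execute the tool calls, and a Solver reasons over plan plus evidence
# in one final call, without intermediate observations steering the plan.
import re

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call."""
    raise NotImplementedError

TOOLS = {"Search": lambda q: f"(search results for: {q})"}  # toy registry, as in the ReAct sketch

def rewoo(question: str) -> str:
    # 1. Planner: a single call, made before any tool is used.
    plan = chat(
        "Make a plan to answer the question. For each step write:\n"
        "Plan: <reasoning>  #E<n> = ToolName[input, which may reference earlier #E<k>]\n"
        f"Question: {question}"
    )
    # 2. Workers: run each tool call, substituting earlier evidence into the input.
    evidence: dict[str, str] = {}
    for ref, tool, arg in re.findall(r"(#E\d+)\s*=\s*(\w+)\[(.*?)\]", plan):
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        evidence[ref] = TOOLS.get(tool, lambda x: "(unknown tool)")(arg)
    # 3. Solver: combine the plan and the collected evidence into a final answer.
    evidence_block = "\n".join(f"{k}: {v}" for k, v in evidence.items())
    return chat(
        f"Solve the task using the plan and the evidence.\nQuestion: {question}\n"
        f"Plan:\n{plan}\nEvidence:\n{evidence_block}\nAnswer:"
    )
```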
Chain of Hindsight (CoH) is easy to optimize and does not rely on reinforcement learning or reward functions, unlike previous methods.
Chain of Density Prompt Example from the paper:
CoD is a big deal. We think that semantic compression is the golden goose of AI, a technique that will keep on giving.
Here’s what the prompt instructs the LLM to do:
The purpose of CoD is to write effective summaries that capture the main points and details of a text recursively, making the output denser with every iteration. It uses fusion, compression, and the removal of uninformative phrases to make space for additional entities. Of course, the output can be saved in a convenient JSON format for further processing and function calling.
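A minimal sketch of a CoD-style call, paraphrasing the published prompt; the `chain_of_density` helper, the number of rounds, and the JSON parsing are our own assumptions.

```python
# Chain of Density sketch: ask for a fixed number of increasingly dense
# summaries of identical length, returned as JSON. The wording paraphrases the
# published CoD prompt; rounds and entity counts are tunable assumptions.
import json

def chat(prompt: str) -> str:
    """Placeholder for your LLM completion call."""
    raise NotImplementedError

COD_PROMPT = """Article: {article}

You will generate increasingly concise, entity-dense summaries of the article.
Repeat the following two steps {rounds} times:
Step 1. Identify 1-3 informative entities from the article that are missing from the previous summary.
Step 2. Write a new, denser summary of identical length that covers every entity from the previous summary plus the missing ones, using fusion and compression and removing uninformative phrases.
Answer in JSON: a list of {rounds} objects with the keys "missing_entities" and "denser_summary"."""

def chain_of_density(article: str, rounds: int = 5) -> list[dict]:
    raw = chat(COD_PROMPT.format(article=article, rounds=rounds))
    return json.loads(raw)  # one {"missing_entities": ..., "denser_summary": ...} object per round
```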
Opsie: Reflexion prompting may not be ideal for applications requiring human-like empathy and conversational nuance. However, it can be invaluable in scenarios requiring rigorous self-analysis and correction, such as data entry validation, educational tools, and automated debugging or technical support systems.
Opsie: Managing prompt complexity involves using centralized systems to track and standardize prompts. Leveraging modular prompt design patterns and developing robust testing frameworks to validate prompt effectiveness consistently can mitigate the risks of increased complexity.
Opsie: Handling edge cases requires incorporating fallback mechanisms that trigger alternative prompts or additional processing layers. Continuous monitoring and retraining with edge case data can help models learn to balance hints with broader context effectively.
Opsie: Beyond riddles and logic problems, ToT can be beneficial in domains requiring structured decision-making or scenario analysis, such as strategic planning, legal reasoning, medical diagnosis support, and financial forecasting, enabling exploration of complex problem spaces comprehensively.
Opsie: Ensuring computational viability involves batch processing, parallel computing techniques, and leveraging hardware accelerators. Hybrid approaches where ToT is applied sparingly or for pre-processed batches can maintain a balance. Micro-optimization to prune less likely successful paths early can also reduce computational load.
Opsie: Proactive detection and addressing involve implementing monitoring and exception-handling mechanisms. Using watchdog timers, anomaly detection algorithms, and fallback strategies to simpler prompt methods can help. Continuous logging and analysis of tool performance data enable proactive adjustments.
Opsie: Evaluating risk for high-stakes environments involves comprehensive assessment mechanisms, analyzing factors like error tolerance, decision impact, and reliability requirements. Simulation and testing in controlled environments, combined with a phased rollout strategy, can mitigate risk. A human-in-the-loop approach adds an additional safety layer for critical applications.
Opsie: Maintaining nuances and context involves using attention mechanisms to weigh key elements more heavily and incorporating redundancy checks where the original context is referenced. This ensures critical information is preserved during compression phases.
Opsie: Measuring the trade-off between precision and performance involves defining metrics and thresholds. Adaptive iteration approaches, where the depth of summarization is dynamically adjusted based on the context’s complexity and required precision, help maintain efficiency.
Opsie: Seamless integration with RAG involves optimizing the retrieval phase to align with the specific needs of the prompt method being applied. Unified embeddings, context-aware retrieval systems, query expansion, relevance feedback, and dynamic context adjustment enhance retrieval accuracy and relevance.
Each approach has different advantages and disadvantages. You need to tinker and iterate until your use case is aligned with the inputs and outputs of your setup. So which technique is best? In our view, CoT, CoD, Directional Stimuli and good old RAG are the way to build chatbots. ReAct and ReWOO are more experimental and could be used for content generation.
Each of these techniques offers unique ways to enhance the performance of LLMs, making them more flexible, reliable, and context-aware. They are all based on the principle of prompt engineering, which involves the strategic use of prompts to guide the model's responses pragmatically and with minimal variability.
If this work is of interest to you, then we’d love to talk to you. Please get in touch with our experts and we can chat about how we can help you.