The Strengths, Weaknesses and Dangers of LLMs

Much has been said lately about the wonders of Large Language Models (LLMs). Most of these accolades are deserved. Ask ChatGPT to describe the General Theory of Relativity and you will get a very good (and accurate) answer. At the end of the day, however, ChatGPT is still a computer program (as are all other LLMs) that is blindly executing its instructions. It understands the General Theory of Relativity no better than your favorite pet does. Unfortunately, we use “human-like” words to describe the techniques engineers use to create LLMs - for example, “machine learning” and “training.” This is misleading, because an LLM does not have a mind like a human’s.

There is a certain irony here - how can a non-thinking chatbot correctly summarize the findings of one of the smartest people who ever lived? To understand this contradictory nature of LLMs, let’s explore them in terms of their Strengths, Weaknesses, and Dangers, and see how we can use data, and a storage solution like MinIO, to exploit the first and mitigate the other two. These are techniques that every engineer should keep in the back of their mind as they train, test and deploy LLMs for their organization.

The Strength

The strength of LLMs is that they are trained to model the probability distribution of words in the training set used to create them. If the training set is sufficiently large (e.g., a corpus of Wikipedia articles or public code on GitHub), then the models will have a vocabulary and a corresponding probability distribution that make their results appear as if they have a real-world understanding of the text they output. Let’s examine another example in more detail - this time from philosophy. Ask ChatGPT the question, “What does ‘cogito, ergo sum’ mean and who wrote it?” and you will get something similar to the text below.

"Cogito, ergo sum" is a Latin philosophical proposition that translates to "I think, therefore I am" in English. This statement is famously associated with René Descartes, a French philosopher, mathematician, and scientist. Descartes expressed this idea in his work "Discourse on the Method," published in 1637. The phrase reflects Descartes' attempt to establish a foundational truth that cannot be doubted—the certainty of one's own existence as a thinking being.

LLMs produce results like this using probability distributions. It works something like this: the model starts by looking at the text of the question and determines that the word “Cogito” has the highest probability of being the first word of the answer. From there, it looks at the question and the first word of the answer to determine the word with the highest probability of coming next. This continues, word by word, until a special “end of answer” token is determined to have the highest probability.
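To make this concrete, below is a minimal sketch of that greedy decoding loop in Python. The next_token_probs function and its hard-coded probabilities are invented stand-ins for the neural network itself; a real LLM conditions on the full prompt and scores every token in a vocabulary of tens of thousands of entries.

```python
# A toy illustration of greedy next-token decoding. The "model" is
# faked by next_token_probs with hard-coded, hypothetical numbers;
# a real LLM scores every token in its vocabulary at each step.
def next_token_probs(context: str) -> dict[str, float]:
    table = {
        "": {"Cogito": 0.92, "The": 0.05, "<eos>": 0.03},
        "Cogito": {",": 0.95, "ergo": 0.03, "<eos>": 0.02},
        "Cogito ,": {"ergo": 0.97, "sum": 0.02, "<eos>": 0.01},
        "Cogito , ergo": {"sum": 0.98, "<eos>": 0.02},
        "Cogito , ergo sum": {"<eos>": 0.99, "means": 0.01},
    }
    return table.get(context, {"<eos>": 1.0})

def generate() -> str:
    # A real model would also condition on the user's question here.
    answer: list[str] = []
    while True:
        probs = next_token_probs(" ".join(answer))
        token = max(probs, key=probs.get)  # greedy: pick the most probable
        if token == "<eos>":               # the special "end of answer" token
            break
        answer.append(token)
    return " ".join(answer)

print(generate())  # -> Cogito , ergo sum
```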

This ability to generate a natural language response based on billions of probabilities is not something to be feared - rather, it is something that should be exploited for business value. The results get even better when you use modern techniques. For example, with Retrieval Augmented Generation (RAG) and fine-tuning, you can teach an LLM about your specific business. Achieving these human-like results will require data, and your infrastructure will need a strong data storage solution.
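As a rough sketch of how RAG works: passages relevant to the user’s question are retrieved from your own documents and prepended to the prompt, so the model answers from your data rather than from its training set alone. Everything below is self-contained for illustration - the word-overlap retrieval and the stubbed call_llm would be replaced by a vector database and your model provider’s API in a real system.

```python
# A minimal, self-contained sketch of Retrieval Augmented Generation.
# Real systems use embeddings plus a vector index for retrieval and a
# hosted LLM for generation; both are stubbed out here.
DOCUMENTS = [
    "Our returns policy allows refunds within 30 days of purchase.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve_context(question: str, top_k: int = 2) -> list[str]:
    # Score each document by shared words with the question - a crude
    # stand-in for embedding similarity search.
    words = set(question.lower().split())
    ranked = sorted(DOCUMENTS,
                    key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    # Stub: a real implementation calls your model provider here.
    return f"(model response to a {len(prompt)}-character prompt)"

def answer_with_rag(question: str) -> str:
    passages = retrieve_context(question)
    # Prepend retrieved passages so the model grounds its answer in
    # your business data instead of guessing from training data.
    prompt = ("Answer using only the context below.\n\nContext:\n"
              + "\n---\n".join(passages)
              + f"\n\nQuestion: {question}\nAnswer:")
    return call_llm(prompt)

print(answer_with_rag("Can I get a refund after two weeks?"))
```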

Not only can these next-token prediction capabilities be used to generate great text for your chatbot or marketing copy, but they can also be used to enable automated decision-making within your application. Given cleverly constructed prompts that contain a problem statement and information about the APIs (“functions”) that can be called, an LLM’s command of language enables it to generate an answer explaining which “function” should be called. For example, in a conversational weather app, a user could ask, “Do I need a rain jacket if I’m going to Fenway Park tonight?” With some clever prompting, an LLM could extract the location (Boston, MA) from the query and determine how to formulate a request to the Weather.com Precipitation API.
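Here is what that pattern looks like with the OpenAI Python SDK’s tool-calling interface. The get_precipitation function and its parameters are hypothetical stand-ins for a real weather API, and the model name is an assumption - substitute whatever model you deploy.

```python
# Sketch of LLM function calling with the OpenAI Python SDK.
# get_precipitation is a hypothetical stand-in for a weather API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_precipitation",
        "description": "Get the precipitation forecast for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. Boston, MA",
                },
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: use whichever model you deploy
    messages=[{
        "role": "user",
        "content": "Do I need a rain jacket if I'm going to Fenway Park tonight?",
    }],
    tools=tools,
)

# If the model decides a function call is needed, it returns the name
# and JSON arguments; your application then makes the actual request.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
# e.g. get_precipitation {"location": "Boston, MA"}
```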

For a long time, the hardest part of building software was the interface between natural language and syntactic systems such as API calls. Now, ironically, that might be one of the simplest parts. As with text generation, the quality and reliability of an LLM’s function-calling behavior can be improved with fine-tuning and reinforcement learning from human feedback (RLHF).

Now that we understand what LLMs are good at and why, let’s investigate what LLMs cannot do.

The Weakness

LLMs cannot think, understand or reason. This is their fundamental limitation. Language models lack the ability to reason about a user’s question; they are probability machines that produce a really good guess at the answer. No matter how good a guess is, it is still a guess, and whatever produces these guesses will eventually produce something that is not true. In generative AI, this is known as a “hallucination.”

When a model is trained correctly, hallucinations can be kept to a minimum. Fine-tuning and RAG also greatly reduce them. The bottom line: training a model correctly, fine-tuning it and giving it relevant context (RAG) all require data, along with the infrastructure to store it at scale and serve it in a performant manner.

Let’s look at one more aspect of LLMs, which I’ll classify as a danger because it impacts our ability to test them.

The Danger

The most popular use of LLMs is generative AI. Generative AI does not produce a specific answer that can be compared to a known result. This is in contrast to other AI use cases, which make a specific prediction that can be easily tested. It is straightforward to test models for image detection, categorization and regression. But how do you test LLMs used for generative AI in a way that is impartial, faithful to the facts and scalable? How can you be sure that the complex answers LLMs generate are correct if you are not an expert yourself? Even if you are an expert, human reviewers cannot be part of the automated testing that occurs in a CI/CD pipeline.
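The contrast shows up clearly in a unit test. A classifier returns a label from a fixed set, so an exact comparison against ground truth works; a generated answer has no single correct string to compare against. The two model functions below are trivial stubs used only to make the point runnable.

```python
# Why discriminative models are easy to test and generative ones are not.
# classify_image and generate_answer are trivial stubs for illustration.
def classify_image(path: str) -> str:
    return "cat"  # stand-in for a real image classifier

def generate_answer(question: str) -> str:
    return ('"Cogito, ergo sum" translates to "I think, therefore I am" '
            "and was written by Rene Descartes.")  # stand-in for an LLM

def test_classifier() -> None:
    # Exact comparison against a known label: trivially automatable.
    assert classify_image("cat_photo.jpg") == "cat"

def test_generative_model() -> None:
    answer = generate_answer("What does 'cogito, ergo sum' mean?")
    # There is no single correct string: two runs may word the same
    # correct answer differently, and a fluent-but-wrong answer could
    # still pass a naive keyword check like this one.
    assert "Descartes" in answer  # necessary, but nowhere near sufficient

test_classifier()
test_generative_model()
```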

There are a few benchmarks in the industry that can help. GLUE (General Language Understanding Evaluation) is used to evaluate and measure the performance of LLMs. It consists of a set of tasks that assess a model’s ability to process human language. SuperGLUE is an extension of the GLUE benchmark that introduces more challenging language tasks, involving coreference resolution, question answering and other complex linguistic phenomena.
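If you want to experiment with these benchmarks, the Hugging Face datasets library hosts them. A minimal sketch for pulling one GLUE task - here SST-2, the sentiment-classification task - looks like this:

```python
# Load one GLUE task with the Hugging Face datasets library.
# Swap "sst2" for any other GLUE task name ("mnli", "qnli", "rte", ...).
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2["validation"][0])
# e.g. {'sentence': "it 's a charming and often affecting journey . ",
#       'label': 1, 'idx': 0}
```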

While the benchmarks above are helpful, a big part of the solution should be your own data collection. Consider logging all questions and answers and building your own tests from what you find. This, too, will require a data infrastructure built to scale and perform.
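A minimal sketch of that logging with the MinIO Python SDK might look like the following; the endpoint, credentials and bucket name are placeholders for your own deployment.

```python
# Log each question/answer pair to a MinIO bucket as a JSON object.
# The endpoint, credentials and bucket name are placeholders.
import io
import json
import uuid
from datetime import datetime, timezone

from minio import Minio

client = Minio("minio.example.com:9000",
               access_key="YOUR_ACCESS_KEY",
               secret_key="YOUR_SECRET_KEY",
               secure=True)

BUCKET = "llm-logs"
if not client.bucket_exists(BUCKET):
    client.make_bucket(BUCKET)

def log_interaction(question: str, answer: str) -> None:
    record = json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
    }).encode("utf-8")
    # One object per interaction; these logs later become test cases.
    client.put_object(BUCKET, f"qa/{uuid.uuid4()}.json",
                      io.BytesIO(record), length=len(record),
                      content_type="application/json")
```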

Conclusion

There you have it: the strengths, weaknesses and dangers of LLMs. If you want to exploit the first and mitigate the other two, you will need data and a storage solution that can handle lots of it.