The Forest Amidst the Trees - The Takeaway from our AI Year

The calendar year 2023 will be a meaningful one, perhaps one of the most meaningful ones, when the history of AI is written. It was, in essence, the big bang. 

It started in late 2022 with OpenAI’s ChatGPT, but it was the response that was so breathtaking. Within months we had Meta’s LLaMA 2, Google’s Bard chatbot followed later in the year by Gemini, Anthropic’s Claude and others. The battle between proprietary and open source raged, and even mighty Google concluded there was no moat to be found. We think that favors open source. 

Either way, the breakneck speed of development has obscured our vision. We tend, and this is particularly true of the media, to focus on the outputs - but the weights, measures, tokens and parameters are the trees - they are not the forest. The forest is the data infrastructure that enables the OpenAIs of the world to work, and it deserves our attention as we close out the year. 

Let’s mix some metaphors and consider the core ingredients of the AI cocktail:

GPUs: The sparkplug of AI operations is the GPU, along with other specialized AI chips (but mostly GPUs). Indispensable for complex computations and parallel processing, they are central to machine learning algorithms and deep learning neural networks.

CPUs and TPUs: CPUs may be an afterthought, but you can’t actually go end to end without them. They will look more like GPUs going forward (more so than GPUs will look like CPUs). 

Object Storage: Object storage offers scalable, flexible, and cost-effective storage for the vast and varied types of data AI systems require, and it does so in a flat namespace, making it ideal for the unstructured data that AI often relies on. Furthermore, it is accessed through the S3 API that developers and ML practitioners have come to know and love. These are some of the reasons why every foundational model was trained on an object store. The filesystem folks can make all the partnership announcements they want; the data science community knows that the alpha and omega of storage is object storage. 

Networking Infrastructure: The network story is criminally overlooked in AI. You simply can’t go fast enough these days. It won’t be long before dual NIC 100 GbE looks slow. Still, it is getting the job done. 

Software and Algorithms: While this year was the big bang, it followed years of steady development in machine learning frameworks and libraries. That work spans CNNs, RNNs, GANs, reinforcement learning, topological data analysis, NLP and other technologies, and it provided the foundation for LLMs, RAG and federated learning. Still, you can’t ignore the massive advancements that came forward in 2023 - they set the stage for what’s next. 

Large-scale Data and Datasets: If GPUs are the sparkplug, data is the fuel of the AI and machine learning engine. Lots of accurate, clean, representative, diverse and current data is needed. It should not matter whether it is structured, semi-structured or unstructured. It needs to be version controlled and provenance tracked. While the data is the star, we cannot overstate the importance of the plumbing that routes, stores and replicates it. 

Security and Compliance: Given the prominence of security everywhere else, we don’t talk about it in the context of AI as much as we should. We do, however, talk about compliance, and for good reason: it covers both explainability and safety. Both are technology plays and an important part of the data infrastructure. 

We suspect we have left a few things out, and some of the sections above could use more detail, but this covers the key ingredients for a successful AI data infrastructure.
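To make two of the ingredients above a little more concrete, here is a minimal Python sketch of what a flat object namespace and version tracking look like in practice. It is a toy, in-memory stand-in and not a real S3 client: the `FlatObjectStore` class, its method names and the example keys are all hypothetical, and deriving a version ID from a content hash is an assumption made for illustration (real object stores assign version IDs their own way).

```python
import hashlib


class FlatObjectStore:
    """Toy in-memory bucket (hypothetical, for illustration only)."""

    def __init__(self):
        # One flat mapping: key -> list of (version_id, data), oldest first.
        # There are no directories; "/" in a key is just another character.
        self._versions = {}

    def put(self, key, data):
        # Assumption: use a short content hash as the version ID so that
        # every write can be traced back to the exact bytes it stored.
        version_id = hashlib.sha256(data).hexdigest()[:12]
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]  # latest version
        for vid, data in versions:
            if vid == version_id:
                return data
        raise KeyError(f"{key}@{version_id}")

    def list_keys(self, prefix=""):
        # "Folders" are simulated by filtering keys on a string prefix.
        return sorted(k for k in self._versions if k.startswith(prefix))


store = FlatObjectStore()
v1 = store.put("datasets/train/part-0001.jsonl", b'{"text": "first draft"}\n')
v2 = store.put("datasets/train/part-0001.jsonl", b'{"text": "cleaned"}\n')
store.put("datasets/eval/part-0001.jsonl", b'{"text": "held out"}\n')

print(store.list_keys("datasets/train/"))
print(store.get("datasets/train/part-0001.jsonl", v1))
```

Overwriting a key keeps the earlier bytes retrievable by version ID, which is the property that lets a training run be tied back to the exact data it saw.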

Yes, 2023 was about LLMs, RAG and weekly breakthroughs, but to return to the original analogy, those are the trees. The forest is the underlying data infrastructure. That is what enabled the progress. That is what will enable progress in 2024. The modern data infrastructure stack doesn’t need AI in the way that AI needs the modern data infrastructure stack. It will continue to be this way for the foreseeable future. It's a complex, often understated mix of components working in unison to harness the true potential of AI. No model, no matter how intricately crafted, can excel beyond the limits of its data and the accompanying infrastructure. We've designed remarkable model architectures, but their full potential is capped by their dependencies on compute, data, networking and storage.

Modern data infrastructure expands our possibilities. Clean pipelines fuel datasets covering more domains with greater accuracy and less bias - instantly improving downstream models. Expanding infrastructure also enables accelerated experimentation by eliminating data bottlenecks.

LLMs’ greatest strength is that they are trained to understand the probability distribution that makes up the real world, or more specifically, the data that makes up their training datasets. However, it is also their greatest weakness. LLMs can produce a really good guess at the answer to a user’s question, but that’s all it is: a guess. As it currently stands, generative AI lacks the ability to reason about a question and think critically. This means that an LLM’s reliability and foundational knowledge are reliant on one thing and one thing alone: web-scale training data. To handle these kinds of data collection and training workloads, an organization needs scalable data infrastructure. Infrastructure determines data breadth and versatility. So for long-term progress untethered from today's blind spots, improving the underlying data fabric promises the widest ripple effect. Data is AI's lifeblood; infrastructure channels it.

As we look to scale AI innovation into 2024 and beyond we are excited to be working on a key component: flexible, software-driven object storage. Capable of delivering performance at scale with the economics to enable ambitious projects, object storage has already established itself as the centerpiece of the software-defined infrastructure stack. Every application in the ecosystem from Anthropic to YOLO leverages an object store. 

We are committed to being the best in the space. Want to learn more? Sign up for the newsletter, download the code or join us on Slack. We are builders and we are in it for the long haul.