Solidigm: Anatomy of a (Murderous) Prompt

I’ve seen Solidigm present at Tech Field Day events several times since I’ve been a delegate. They always have interesting perspectives on how storage deeply influences the efficiency and flexibility of an IT shop’s capability to scale their computing resources to maximum potential.

Wait … How Many Tokens?

My most recent encounter with Solidigm at #AIFD8 prompted me to rethink my perspective on AI. Kapil Karkra, Senior Principal Engineer for AI Solutions and Software, kicked off their presentations with an in-depth explanation of how a typical LLM processes a prompt to return meaningful value to a user. The novel way he explained its inner workings helped me crystallize some new realizations about generative AI inferencing:

  • Because LLMs are naturally non-deterministic, it’s nearly impossible to predict what an AI prompt will actually demand in terms of resources to answer a query.
  • It’s therefore nearly impossible to determine which resources will be needed for inference ahead of time, which also implies it’s difficult to predict how many tokens will need to be spent to retrieve those resources.
  • It’s also possible that I could issue precisely the same prompt twice and the LLM will generate a completely different query each time. I’ve seen this happen with Oracle 26ai’s SELECT AI tools, which typically use Ollama LLM variants to construct SQL queries based on available metadata and instructions.
  • And unless a vector database is in use and precisely the same vector embeddings have been retrieved and then cached within database memory from a prior prompt, a new prompt can’t leverage those previously-retrieved embeddings.

The end result? Even a carefully prepared prompt, however simple, may end up consuming tens of thousands of tokens to produce an answer … even though the original prompt itself may only contain a handful of tokens.
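To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is an illustrative assumption of mine (not a figure from Solidigm's presentation), and the `InferenceRequest` class and its field names are hypothetical. It simply adds up the tokens a short prompt might drag along once system instructions, retrieved grounding chunks, and the generated answer are counted.

```python
from dataclasses import dataclass


@dataclass
class InferenceRequest:
    """Hypothetical token accounting for one prompt (all values assumed)."""
    prompt_tokens: int      # tokens the user actually typed
    system_tokens: int      # instructions and metadata prepended by the app
    retrieved_chunks: int   # grounding chunks pulled in for RAG
    tokens_per_chunk: int   # average size of each retrieved chunk
    output_tokens: int      # tokens generated for the answer

    def context_tokens(self) -> int:
        # Everything added to the prompt before the model sees it.
        return self.system_tokens + self.retrieved_chunks * self.tokens_per_chunk

    def total_tokens(self) -> int:
        return self.prompt_tokens + self.context_tokens() + self.output_tokens

    def amplification(self) -> float:
        # How many total tokens are spent per token the user typed.
        return self.total_tokens() / self.prompt_tokens


# Illustrative values only: a 20-token question with 24 retrieved chunks.
request = InferenceRequest(
    prompt_tokens=20,
    system_tokens=500,
    retrieved_chunks=24,
    tokens_per_chunk=800,
    output_tokens=1200,
)

print(request.total_tokens())          # 20920
print(round(request.amplification()))  # 1046
```

Under these assumed values, a 20-token question turns into roughly 21,000 tokens of work. And because the number of retrieved chunks can differ from one run to the next, so can the total.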

DBA Means Don’t Bother Asking

I have an unusual background; I’m a long-time DBA with a deep background in storage technology. (I worked as a subject matter expert for Hitachi Data Systems in the early 2010s and helped launch the initial VSP.) Leveraging that perspective, I compared how a similar query would be handled by most modern databases today – for example, an Oracle 26ai database.

Bear with me as I explain the memory strategies built over the last few decades:

  • The 26ai query optimizer constructs an optimal execution plan against every table and index needed to satisfy the query using metadata about what’s stored within those objects. (In fact, it might even find that no valid answer exists based on the range of values requested, and then simply return an empty result immediately.)
  • The database then accesses the database objects and returns only the database blocks containing the rows needed from storage.
  • Even better, if the needed blocks were already in the database buffer cache memory in the proper state, the database would just use those blocks.
  • If another user had earlier issued a similar query, the optimizer wouldn’t waste any time building a new execution plan – it would just use the one already in the database’s library cache.
  • Finally, the database blocks and the query plan itself would be cached until no longer needed; they’d be aged out of memory as more recent requests demanded memory be freed.
  • Best of all, my DevOps team can tune queries before they’re ever run. The optimizer can accurately approximate the execution plan and thus catch foolish mistakes like queries that will run nearly forever because of improper joins or lack of indexes on columns that are most often used for selection criteria.
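The caching behavior in that list can be sketched in a few lines of Python. This is a toy model under my own assumptions, not Oracle's actual implementation: `LRUCache`, `BLOCKS_FOR_QUERY`, `build_plan`, `read_block`, and `run_query` are all hypothetical names, the "plan" is nothing more than a tuple of block numbers, and simple least-recently-used eviction stands in for the database's real aging algorithms.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache standing in for a database memory area."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        value = loader(key)                   # the expensive path
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # age out the oldest entry
        return value

    def __contains__(self, key):
        return key in self._items


# Hypothetical catalog: which blocks each query text needs.
BLOCKS_FOR_QUERY = {"q1": (1, 2), "q2": (2, 3), "q3": (4,)}

plan_cache = LRUCache(capacity=2)    # plays the role of the library cache
block_cache = LRUCache(capacity=3)   # plays the role of the buffer cache
physical_reads = []                  # every trip to "storage" is logged here


def build_plan(sql):
    return BLOCKS_FOR_QUERY[sql]     # stand-in for the optimizer's work


def read_block(block_id):
    physical_reads.append(block_id)  # stand-in for a physical I/O
    return f"rows from block {block_id}"


def run_query(sql):
    plan = plan_cache.get_or_load(sql, build_plan)
    return [block_cache.get_or_load(b, read_block) for b in plan]


for sql in ["q1", "q1", "q2", "q3"]:
    run_query(sql)

print(physical_reads)    # [1, 2, 3, 4]
print(plan_cache.hits)   # 1
print(1 in block_cache)  # False
```

The second run of q1 reuses both its plan and its blocks, q2 finds block 2 already in memory, and block 1 is aged out once q3 needs the room. Four queries touching seven blocks cost only four physical reads.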

The whole point of this optimization strategy is to minimize round trips to physical storage. That also limits expensive physical I/O and helps increase application throughput. And it’s practically the exact opposite strategy that AI inference requests use to locate needed data.

Non-Deterministic Prompts Span Different Storage Workload Patterns

Solidigm’s analysis of what their customer base is focused on for AI workloads is particularly pertinent for wringing performance from underlying cloud storage systems. Their research shows at least two key workload patterns that must be accommodated for effective inference throughput.

The first workload pattern is focused on RAG or grounding activity that’s dominated by relatively small random reads against resources typically stored either on local NVMe drives or as independent objects / files.

The second AI pattern resembles a typical data warehouse workload with large-block reads and writes dominating in totally different storage stacks, starting at the key-value (KV) cache and extending into other caches depending on how recently resources were propagated for reuse into those caches.

Solidigm’s storage tiering methodology allows them to simulate what a modern database does intrinsically within its carefully controlled memory cache structures: in essence, systematically maximizing KV cache for higher availability of resources needed to answer AI prompts over longer periods of time while still offering up sufficiently responsive storage for RAG and grounding. It’s not a perfect solution, of course, but it takes into account how their customers’ AI workloads are evolving from earlier emphasis on model training to leveraging powerful LLMs for inference demands.
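For readers who think better in code, here is a generic sketch of what tiering a KV cache might look like. To be clear, this is not Solidigm's implementation: the `Tier` and `TieredKVCache` classes, the tier names ("gpu", "dram", "ssd"), and the capacities are all assumptions of mine, chosen only to show entries being demoted to slower tiers instead of being thrown away, then promoted again on reuse.

```python
from collections import OrderedDict


class Tier:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()   # oldest entry first


class TieredKVCache:
    """Generic tiered cache; tiers are ordered fastest to slowest."""

    def __init__(self, tiers):
        self.tiers = tiers

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return                     # fell off the slowest tier: recompute later
        tier = self.tiers[level]
        tier.entries[key] = value
        tier.entries.move_to_end(key)
        if len(tier.entries) > tier.capacity:
            # Demote the least recently used entry to the next slower tier.
            old_key, old_value = tier.entries.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for tier in self.tiers:        # avoid stale copies in slower tiers
            tier.entries.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        for tier in self.tiers:
            if key in tier.entries:
                value = tier.entries.pop(key)
                self._insert(0, key, value)   # promote to the fastest tier
                return value, tier.name       # report where it was found
        return None, None                     # miss in every tier


# Assumed tier names and sizes, for illustration only.
cache = TieredKVCache([Tier("gpu", 2), Tier("dram", 2), Tier("ssd", 4)])
for prompt_id in ["p1", "p2", "p3"]:
    cache.put(prompt_id, f"kv-for-{prompt_id}")

first = cache.get("p1")    # p1 was pushed out of the fastest tier by p3
second = cache.get("p1")   # the first lookup promoted it back
print(first[1], second[1])  # dram gpu
```

The point of the sketch is the same one the database analogy makes: an entry that no longer fits in the fastest memory is still far cheaper to fetch from a slower tier than to rebuild from scratch.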
