Cisco: Gladys Isn’t Happy. Nexus Hears Her Cry.

I’ve heard from Cisco more than a few times at prior Tech Field Days. As AI continues to penetrate every aspect of IT organizations’ workflows, they offered some interesting insights on the future of monitoring network performance at AI Field Day #8.

When Gladys Isn’t Happy, Watch Out.

When I taught DBAs the advanced Oracle database performance tuning course, I often talked about a mythical user – Gladys – who knew how fast the company’s primary applications should be working better than anyone.

If Gladys wandered by and said “The database is running slow,” you knew you were probably in for a long day and had better pay attention to her complaint … or the next thing you know, your boss would be getting a call from the CIO.

And conversely, if Gladys gave you the thumbs up after a major deployment, you knew you’d done your job.

It appears the folks at Cisco heard of Gladys from their customers – so much so they built a slide around her nebulous complaint.

My point, of course, is it’s relatively simple to diagnose normal, non-AI workloads because network admins, DBAs, and DevOps folks are reasonably skilled at looking at pertinent metrics – I/O wait times, network ping round-trips, application audit logs, average database response time – that quickly point at likely culprit(s).

Nexus: Cisco’s Toolset At the Center of Agentic Operations

AI workloads, however, often defy normal troubleshooting because they span diverse endpoints, including GPUs deployed for intense training, or dedicated compute nodes intended for purely inference operations.

Cisco’s latest strategy for navigating this complex melange is called AgenticOps. At its heart is Cisco AI Canvas, a collection of network-specific models they’ve developed across decades of network monitoring and diagnostic experience, that can deploy an AI agent via MCP to handle AI-centric workload issues.

Cisco demonstrated their AgenticOps strategy concepts through their Nexus Dashboard tools. They showed an AI-assisted diagnosis of networking anomalies related to GPU performance in plain business language, without having to know any of the typical networking CLI commands I’d need to know to drill into what actually was wrong with the corresponding network components and their configurations.

Even better, the Nexus agent framework was able to compare the network configurations against “golden” configurations and recommend potential remediations. As I’m certainly no expert on BGP command line instructions, I’d probably call in some favors from a network expert colleague before I proceeded to apply the recommended fixes.

I found Nexus offered some intriguing statistics during a demo of it within the Splunk Observability Cloud. Tied to a series of Cisco AI PODs, it’s even possible to view overall “tokenomics” for AI-related operations – for example, how much each POD actively costs in several dimensions, including how many tokens (both input and output) have been expended, cumulative token costs, average cost per request, and any other applicable metrics.
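
To make those tokenomics dimensions concrete, here is a minimal sketch of how cumulative token cost and average cost per request could be rolled up from per-request token counts. The prices and the `RequestUsage` record are hypothetical placeholders of my own, not anything taken from Cisco’s or Splunk’s dashboards.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per 1,000 tokens; real rates vary by model.
INPUT_PRICE_PER_1K = 0.50
OUTPUT_PRICE_PER_1K = 1.50


@dataclass
class RequestUsage:
    """Token counts recorded for one inference request (illustrative record)."""
    input_tokens: int
    output_tokens: int


def request_cost(usage: RequestUsage) -> float:
    """Cost of one request, pricing input and output tokens separately."""
    return ((usage.input_tokens / 1000) * INPUT_PRICE_PER_1K
            + (usage.output_tokens / 1000) * OUTPUT_PRICE_PER_1K)


def summarize(usages: list[RequestUsage]) -> dict:
    """Roll up token totals, cumulative cost, and average cost per request."""
    total_cost = sum(request_cost(u) for u in usages)
    return {
        "input_tokens": sum(u.input_tokens for u in usages),
        "output_tokens": sum(u.output_tokens for u in usages),
        "cumulative_cost": total_cost,
        "avg_cost_per_request": total_cost / len(usages) if usages else 0.0,
    }


# Two made-up requests against a single POD.
pod_usage = [RequestUsage(1000, 2000), RequestUsage(3000, 2000)]
summary = summarize(pod_usage)
print(summary)
```

Pricing input and output tokens separately matters because output tokens are commonly billed at a higher rate, so two requests with the same total token count can cost quite different amounts.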

AI Canvas: Human-Agent Collaboration For Issue Resolution

Cisco’s final demonstration showed how their AI Canvas tool made it possible to bring an experienced network engineer into a dialogue with other engineers who were actively attempting to isolate a particular AI workload issue.

The tool was eventually able to help the engineers draw accurate conclusions about what was actually affecting performance: two GPUs were apparently under-utilized because the network connections to those GPUs’ compute nodes were not operating within nominal standards.

Cisco admitted the examples they demonstrated show promise, and they’re planning to expand their Nexus toolsets to their customer base in the future. From my perspective as someone who is definitely not a network know-it-all, I have to admit that even this relatively early stage of their AgenticOps strategy is impressive. At the very least it would keep someone suddenly tasked with figuring out how to solve a complex network issue from acting on minimal intelligence and making an utterly foolish and incorrect decision.

HPE: Agentic AI Demands a Compute Framework, Not Just Clustered GPUs

As part of my first serious Oracle DBA assignment in 2001, the first production database I’d ever built was hosted on an HP8500 ProLiant server – at the time, a serious computing beast with 8 removable CPUs and an enormous 8GB of physical memory.

I spent plenty of frigid days in our server room building storage arrays, attaching servers to storage and networking, and keeping a close eye on the hardware to ensure everything remained fully available and performant. And over the next five years, that ProLiant simply never quit, even when under stress of demanding OLTP and reporting workloads.

Agentic AI Is Inference-Intensive. Robust Compute Is Still Crucial

Of course, that HP8500 is now ancient technology – it would likely melt down if asked to handle 1% of today’s generative AI, RAG, and agentic AI workloads that have dramatically expanded the breadth of requirements for extremely reliable and performant compute resources.

The latest ProLiant series we saw as HPE presented at AI Field Day #8 recognizes these new computing demands in several dimensions.

HPE’s Senior Principal Product Manager Bharath Venkant talked at length with us on the changing dimensions of AI compute, from on-premises through public and hybrid cloud, plus the growing need for AI inference at the edge.

I found HPE’s orientation on AI refreshing: They realize there’s an incredibly divergent breadth of compute requirements for customers as they journey towards whatever AI destination they’re headed for, and they’ve (wisely!) focused on meeting customers wherever they might be right now on that roadmap.

The HPE product continuum I reviewed at AIFD8 offers everything from a DevOps AI “starter pack” sporting two NVIDIA RTX Pro 6000 GPUs all the way up through a continuum of small, medium, and large computing setups.

And for the most extreme AI compute requirements, HPE offers an expansion rack option housing as many as 64 NVIDIA H200 (“Hopper”) GPUs.

HPE’s AI Engine strategy thus promises to simplify an IT shop’s AI-specific compute needs as they traverse shifting training and inference requirements because ever-increasing demands can be met within an identical, reliable hardware framework.

Inference At Razor’s Edge Demands Reliable Compute

I’ve always thought of HPE as focused on building servers for pristine data center racks, so I was pleasantly surprised to see their offerings for AI at “the edge.” Factory automation, warehousing, logistics, and biomedical production immediately popped into mind … but certainly not battlefield situations.

HPE built their ProLiant EL2000 2U server chassis specifically for extreme edge conditions; it’s rated for military-grade ruggedness and resilience and plugs into their EL220 and EL240 Gen12 servers.

Extreme conditions – vibration, temperature, dust, intermittent connectivity, variable access to power and cooling – meant HPE had to build compute components to accommodate some potentially challenging edge constraints.

And while AI workloads at the edge are more likely focused primarily on inference to interpret critical situations and leverage agentic AI to recommend immediate action with human oversight, I’d foresee that data collection for retransmission back to a “home base” for later model fine-tuning could also be a requirement.

Private Cloud AI: Blueprints Sure Help


While I’ve spent the majority of the past decade either presenting at conferences or helping clients build complex database solutions, I’ve also found myself immersed in DIY home remodeling. Fortunately, my successes outnumber my failures, but I’ve noticed my most successful projects start with intense and detailed planning, especially a drawing of what I intend to create.

Unfortunately, it’s an open secret that many IT shops struggle with laying out plans for – much less building – a comprehensive AI solution that fits their organization’s needs. Mark Seither, PCAI Senior Principal Solutions Architect, explained in depth HPE’s alternative to DIY AI: their Private Cloud AI solutions.

The smallest system available – aka their developer system – uses a DL-325 server as a control plane plus a DL-380A that hosts two NVIDIA RTX Pro 6000 GPUs.

This gives system admins and their DevOps teams a chance to get familiar with HPE AI capabilities – what I like to think of as the “poke it with a stick” orientation – before deciding which configuration is likely needed for current production demands.

Moving up the sizing scale, the medium and large configurations shift from RTX Pro 6000s to NVIDIA H200 (“Hopper”) GPUs, with the largest system accommodating 10 H200s. And as mentioned previously, expansion racks can be added that handle up to 64 H200s. (HPE also mentioned they’re planning a 128 H200 expansion rack in the near future.)

If I were tasked with building out a new AI platform strategy, I see several advantages of this approach:

  • Since HPE deploys identical software across the entire platform family, it’s relatively simple to upgrade from, say, the developer system to a larger Private Cloud AI system.
  • Speaking of the value of planning ahead: HPE leveraged NVIDIA’s published blueprints to create the most efficient and lowest-cost options for immediate deployments.
  • Finally, the partnership with NVIDIA offers advantages for AI implementations. NVIDIA’s open-source Nemotron 3 family of models, datasets, developer tools, and inference capabilities look pretty intriguing!

A Real-Life Example: Why Is My Discount Denied?

As an experienced developer, DBA, and presenter about technology, I’ve seen the value of the “Don’t just tell me – show me” maxim, especially when it comes to all things AI. I wasn’t disappointed with an all-too-brief demonstration of HPE’s quick-to-market AI capabilities.

Michael ran a live demo of their AI Essentials toolset to show how simple it would be to construct a chatbot that leveraged Langflow Agentic AI Workshop.

In under five minutes, he had deployed a reasonably sophisticated chatbot using the Nemotron 3 Nano LLMs from NVIDIA that accessed travel reimbursement policies retained in an internal document store. He then simulated requests from customers trying to understand why a particular travel refund request wasn’t handled to their satisfaction.

How Do You Know Where You’re Going If You Don’t Know Where You Are?

So many IT shops these days are hearing the same drumbeat from their CIOs:

Look, we know {generative AI | RAG | agentic AI | predictive AI} is the solution – now go find problems to solve!

I found HPE’s orientation and roadmap for building a practical compute strategy refreshing because it was grounded in real-world use cases and offered a consistent vision for meeting a potential or current customer wherever they are in their AI journey. Their close partnership with NVIDIA adds stability and flexibility wherever that journey may lead.

Scality: AI Has Gone Agentic. Why Hasn’t Your Storage Strategy?

I heard from the folks at Scality at AI Field Day #8 for the first time at a Tech Field Day event, and I was happily surprised to discover some new takes on storage management in this brave new era of everything Agentic AI.

Full disclosure here: I have 25+ years of Oracle DBA experience, and I’ve also built storage arrays from the ground up. I also spent two years as a database subject matter expert at Hitachi Data Systems; my experience there convinced me just how poorly most DBAs and DevOps folks understand storage technology, especially how to monitor it for optimum performance.

Scality’s approach to managing complex storage environments definitely captured my attention.

Storage Performance Tuning Isn’t Easy. Why Let Customers Fumble Through It?

Scality’s CTO Giorgio Regni introduced their approach to modern storage strategies with a clever but refreshing history lesson on how most storage providers have tried to satisfy their customers’ demands for visibility into their storage platforms over the past decade or so:

  • First, customers demanded more control over their storage environments, so providers created complex control structures with a multiplicity of “buttons” anyone could press in hopes performance would improve.
  • When that didn’t work, providers created overladen dashboards that hopefully highlighted what was actually non-performant – as long as you knew exactly where to look at the right time.
  • As dashboards proved a distraction to immediate action, providers built complex alert mechanisms that triggered red flags – often sent right to your mobile phone, because who doesn’t want an alarm at zero dark thirty to fix a failed device?
  • Finally, providers decided the real solution for tuning storage performance was to give customers the power to configure anything they wanted through complex config files.

Giorgio’s point? These approaches have given humans complete control over their storage architecture, but unfortunately that meant humans have become the control plane. And that’s trapped system reliability engineers, storage administrators, and maybe even the occasional unfortunate dinosaur DBA who’s their 3rd-tier backup into making performance decisions that exceed their expertise.

Obviously, this approach is unsustainable, especially in light of the complex modern AI systems’ demand to provide peak performance of the underlying storage, whether those workloads are intense inference, long-running model training, or a mixture of both.

An Agentic Approach: Scality ADI

Scality chose AIFD8 as their venue to announce the release of their latest product – Scality ADI, short for Autonomous Data Infrastructure. Essentially their product provides a single view of the totality of an enterprise’s storage resources as if it were a single AWS S3 storage endpoint.

While this approach to solving storage flexibility isn’t necessarily dramatically different from some of their competitors’ offerings, what intrigued me was their implementation of their agentic AI assistant, Guardian.

A Scality ADI user can leverage the Guardian agent UI directly to perform day-to-day standard storage management operations, handle data security operations, and even perform storage tasks for which an under-experienced user has limited or no knowledge of the appropriate commands. Scality also enables more sophisticated IT shops that have already begun to embrace agentic AI to construct their own storage management UI via MCP calls.

Sure, I Could Master Storage Needs If I Just RTFM’d. But Which FM Do I Need To R?

But the brief demo of Guardian that Scality performed at AIFD8 helped me understand how I could leverage Scality ADI to perform storage management without knowing what specific commands to issue.

Whichever method is chosen to implement it, Scality ADI encapsulates 15+ years of storage management experience into a single AI-driven toolset that lets even the least experienced user invoke the proper commands for complex operations without the microscopic knowledge often required. And regardless of the implementation chosen, Scality ADI keeps a human in the processing loop at all times to make sure nothing incredibly foolish accidentally gets executed.

Selector AI: Bespoke Network Monitoring Tailored for AI

Any company that presents their first whiteboard image for their product at an event like AI Field Day #8 is sure to capture my immediate attention.

This was the first time I’d heard from Selector AI at a Tech Field Day event. I found them to be gutsy and independent while offering unique solutions to figuring out network issues in modern AI system architectures.

A Man’s Got To Know His Limits

Some background: I’m a reasonably experienced Oracle DBA, so I know how important a reliable network is for applications to connect to my databases.

I also know my limits.

I know I’d never be able to fill in as a backup network administrator during an irritating recurrence of a periodic latency, and definitely not during an unexpected crisis. I see the value of Selector AI’s platform as insurance against me doing something stupid after I guessed at a solution based on five minutes’ research on Stack Overflow.

It’s All Fun & Games … Until Somebody Loses a Packet

Finding the true root cause of a network performance latency often becomes a multi-vendor finger-pointing exercise, especially when up to that point in time everything appeared to be working nominally.

When you factor in the obvious – no two IT shops’ network infrastructure is truly identical, even if they used identical hardware – it can be nearly impossible to assess whether sudden latency is expected because of normal business work schedules, versus the failure of a critical hardware component or because a newbie engineer deployed an untested DNS configuration.

That’s why Selector AI built their monitoring platform with the ability to capture the specific context of their client’s network performance during what would be considered normal and acceptable operation.

Selector AI’s platform can then deploy bespoke monitoring via proprietary AI models they’ve developed to filter out the noise from millions of log posts produced from many thousands of network devices to isolate root causes effectively.

When A Single-Shot Solution … Isn’t.

Joby Rudolph, Selector AI’s distinguished engineer, demonstrated in detail how the most recent version of their platform came into being.

Their first version used a single-shot AI approach to solving network issues; it enabled an unskilled human to ask simple questions about the network’s state through a proprietary NLP chatbot. But as more robust AI infrastructure has matured in the past 18 months – especially the capabilities of Model Context Protocol (MCP) tools – Selector built the latest version of their platform around that orientation.

They realized the models already developed for detecting network issues were still valuable, but they migrated their solution from single-shot inferencing to leverage an agentic approach instead.

This latest version deploys three different types of agents to diagnose and solve a non-performant network:

  • An orchestrator agent coordinates the activities of all other agents in the stack.
  • A series of domain-specific agents tackle tasks across the network – for example, querying the health of an individual switch – and then report results cohesively to the orchestrator.
  • Domain-specific agents then leverage one or more MCP agents to obtain the results from the desired network component.
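
The three layers above can be sketched in a few lines of plain Python. This is only my own illustration of the pattern, not Selector AI’s implementation: the class names, the packet-loss threshold, and the canned telemetry are all hypothetical, and an ordinary function stands in for what would be an MCP tool call in a real deployment.

```python
from typing import Callable, List


def get_switch_health(device: str) -> dict:
    """Stand-in for an MCP tool call; returns made-up telemetry."""
    canned = {"switch-1": 0.2, "switch-2": 7.5}
    return {"device": device, "packet_loss_pct": canned.get(device, 0.0)}


class DomainAgent:
    """Handles one kind of task by calling a tool and interpreting the result."""

    def __init__(self, name: str, tool: Callable[[str], dict],
                 loss_threshold_pct: float):
        self.name = name
        self.tool = tool
        self.loss_threshold_pct = loss_threshold_pct

    def check(self, device: str) -> dict:
        reading = self.tool(device)
        healthy = reading["packet_loss_pct"] <= self.loss_threshold_pct
        return {"agent": self.name, "healthy": healthy, **reading}


class Orchestrator:
    """Fans a request out to domain agents and merges their reports."""

    def __init__(self, agents: List[DomainAgent]):
        self.agents = agents

    def diagnose(self, devices: List[str]) -> List[dict]:
        reports = [agent.check(d) for agent in self.agents for d in devices]
        # Surface only the findings a human should review.
        return [r for r in reports if not r["healthy"]]


orchestrator = Orchestrator(
    [DomainAgent("switch-health", get_switch_health, loss_threshold_pct=1.0)]
)
findings = orchestrator.diagnose(["switch-1", "switch-2"])
print(findings)
```

The point of the layering is that the orchestrator never needs to know how a switch is queried; swapping in a different tool behind a domain agent leaves the rest of the stack untouched.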

During an (all-too-brief!) demo, Selector AI showed how a detected problem (shown as a non-green hexagon in the main monitoring UI) could be drilled into and then queried in natural language to provide advice on how to fix a particular issue.

Need Tools? Great. Got Tools? Perfect!

The three-level agentic model lets Selector AI offer bespoke solutions to each client in their portfolio, meeting each client where they are right now in their network problem resolution methodology:

  • As some clients already have a well-defined toolset for solving network issues, the Selector AI platform acts as an orchestrator to apply those tools directly.
  • Alternatively, their platform also lets a client who already has their own orchestration and solution toolsets leverage the Selector AI models to figure out what’s malfunctioning and – always with a human in the loop where necessary – solve the problem with more precise intelligence than if the client had built their own detection and severity ranking infrastructure.

Selector AI offers a brief demo of its platform’s offerings that was roughly analogous to what they showed us, so grab a look so you can see it in action.

Hammerspace: An AI DatApocalypse, Forestalled

The AI continuum lately feels as if it’s being warped by the sudden appearance of a hidden and super-massive black hole. Spiraling development costs have given rise to feverish discussion of tokenomics as IT organizations struggle to limit their DevOps teams from depleting a year’s worth of tokens in just a few weeks. But runaway token spends are just the most visible part of today’s AI challenges.

Again With the Data.

Our final presenter at AI Field Day #8 – Hammerspace – targeted a key dimension that IT shops must focus on when building robust AI solutions that hopefully will yield meaningful results for their applications’ end users: AI still requires humongous volumes of data to produce accurate intelligence.

Running Out Of Everything Everywhere, All At Once

Hammerspace succinctly summarized the multiple uncomfortable realities about AI resource availability being discussed in just about every C-level boardroom discussion these days:

  • Datacenter power resources are crimped because utilities are unable to increase capacity quickly enough.
  • Even if you could buy more processors, CPUs continue to be expensive.
  • Since storage is mainly SSD- and NVMe-focused, the availability of reasonably priced storage is limited, too.
  • Finally, if a shop cannot secure additional power, compute, or storage within their own confines, cloud capacity is no longer a guaranteed off-ramp because neoscalers – and even the largest hyperscalers – are nearing the maximum limits of their commodity hardware, just to support the existing workloads of their other customers.

Unifying, Rather Than Just Consolidating, Data

Hammerspace’s AI Data Platform solution thus implements a data unification strategy, rather than merely applying typical data consolidation approaches.

When a request is made for a specific set of data – for example, several thousand documents needed for additional AI training, or a vector-driven similarity search during an intense inference operation – the Hammerspace AI platform gathers those resources into the Tier 0 layer so operations can complete as quickly as possible. And when operations cease to require speedy access to resources, they can be moved intelligently to other (s)lower storage tiers.

Their solution essentially builds a global namespace that appears as a single storage viewpoint. Their platform stands in front of an organization’s present set of storage clusters and can access data from anywhere, regardless if it’s retained in local SSDs or NVMes – what we typically call Tier 0 – or an on-premises private cloud, or even a public cloud.

The advantage of this strategy? It treats data, wherever it might exist within multiple physical storage layers, as if it were kept within a single massive cluster.
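
Here’s how I picture that promote-and-demote behavior, as a toy sketch only. Everything in it is my own assumption rather than Hammerspace’s design: the `GlobalNamespace` class, the tier names, the logical clock, and the simplification that idle data always lands in a single slower tier.

```python
class GlobalNamespace:
    """One logical path space over several storage tiers (illustrative only)."""

    def __init__(self, idle_limit: int):
        self.location = {}     # path -> tier where the data currently lives
        self.last_access = {}  # path -> logical time of the most recent read
        self.clock = 0         # logical clock; advances by one per read
        self.idle_limit = idle_limit

    def add(self, path: str, tier: str) -> None:
        """Register data that already lives on some tier."""
        self.location[path] = tier
        self.last_access[path] = self.clock

    def read(self, path: str) -> str:
        """Callers use one path; data being read is promoted to Tier 0."""
        self.clock += 1
        self.last_access[path] = self.clock
        self.location[path] = "tier0"
        return self.location[path]

    def demote_idle(self, cold_tier: str = "object-store") -> list:
        """Move Tier 0 data not read within idle_limit ticks to a slower tier."""
        demoted = []
        for path, tier in self.location.items():
            idle_for = self.clock - self.last_access[path]
            if tier == "tier0" and idle_for > self.idle_limit:
                self.location[path] = cold_tier
                demoted.append(path)
        return demoted


ns = GlobalNamespace(idle_limit=2)
ns.add("/train/docs-a", "public-cloud")
ns.add("/train/docs-b", "on-prem")
ns.read("/train/docs-a")       # promoted to Tier 0, then never touched again
for _ in range(3):
    ns.read("/train/docs-b")   # stays hot
demoted = ns.demote_idle()
print(demoted, ns.location)
```

Note that the caller only ever deals with a path; which physical tier holds the bytes at any moment is the namespace’s business, which is the whole appeal of unifying rather than consolidating.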

Storage Vendor Partnerships = Even More Interesting Use Cases

Hammerspace’s customer base illustrates the attractiveness of their platform – they’ve implemented it within customer HPE solutions as well as neoclouds. And Hammerspace also described several quite divergent use cases, including an AI application development environment supporting data retrieval demands spanning multiple petabytes and a user base of thousands of data scientists.

Hammerspace also co-presented on how their global storage model helps Hitachi Vantara expand the capabilities of its VSP One platform for several of its AI-focused customers.

In concert with Vantara, Hammerspace’s solution was deployed across intriguing use cases, including a complex AI-based fraud detection and risk assessment application, as well as an AI “incubation hub” where the Hammerspace solution sat above VSP One block storage and communicated with an NVIDIA-powered two-node GPU cluster.

(Full disclosure: I’d actually helped launch the first iteration of the VSP in 2010 with extensive experimental research on how best to use Hitachi storage solutions to handle extreme Oracle database workloads as they consumed SSD storage bandwidth. I remember one workload was so intense that we actually melted a few SSDs in their cases.)

Storage Has Always Just “Been There”

If I were back in my role as a senior guy “in the trenches” just trying to keep a team of data engineers and data scientists productive and frustration-free, I can see exactly how valuable Hammerspace’s unified storage approach would be to us. At every IT shop I’ve worked at, the expectation is that the storage layer is transparently ubiquitous, works all the time, and never ever runs out of space.

Of course, that’s a fantasy, and with the onset of increasing AI workload demands like training and inference that are often in direct conflict with each other, anything that helps homogenize and simplify managing storage is worth a closer look.

Our team didn’t get to see how Hammerspace’s AI Data Platform worked in concert with Hitachi VSP One because our final session ran out of time. That’s not a bad sign – it just means our team of delegates found lots of great questions to ask; check out the introductory video to see what we all found interesting.

Solidigm: Anatomy of a (Murderous) Prompt

I’ve seen Solidigm present at Tech Field Day events several times since I’ve been a delegate. They always have interesting perspectives on how storage deeply influences the efficiency and flexibility of an IT shop’s capability to scale their computing resources to maximum potential.

Wait … How Many Tokens?

My most recent encounter with Solidigm at #AIFD8 prompted me to change my perspectives on AI. Kapil Karkra, Senior Principal Engineer for AI Solutions and Software, kicked off their presentations with an in-depth explanation of how a typical LLM processes a prompt to return meaningful value to a user. The novel way he explained its inner workings helped me crystallize some new realizations about generative AI inferencing:

  • Because LLMs are naturally non-deterministic, it’s nearly impossible to predict what an AI prompt will actually demand in terms of resources to answer a query.
  • It’s therefore nearly impossible to determine which resources will be needed for inference ahead of time, which also implies it’s difficult to predict how many tokens will need to be spent to retrieve those resources.
  • It’s also possible that I could issue precisely the same prompt and the LLM will issue a completely different query. I’ve seen this happen with Oracle 26ai’s SELECT AI tools, which typically use ollama LLM variants to construct SQL queries based on available metadata and instructions.
  • And unless a vector database is in use and precisely the same vector embeddings have been retrieved and then cached within database memory from a prior prompt, a new prompt can’t leverage those previously-retrieved embeddings.

The end result? A reasonably carefully prepared prompt – regardless of its simplicity! – may need to request tens of thousands of tokens to provide an answer … even though the original prompt may only consume a handful of tokens.

DBA Means Don’t Bother Asking

I have an unusual background; I’m a long-time DBA with deep background in storage technology. (I worked as a subject matter expert for Hitachi Data Systems in the early 2010s and helped launch the initial VSP.) Leveraging that perspective, I compared how a similar query would be handled by most modern databases today – for example, an Oracle 26ai database.

Bear with me as I explain the memory strategies built over the last few decades:

  • The 26ai query optimizer constructs an optimal execution plan against every table and index needed to satisfy the query using metadata about what’s stored within those objects. (In fact, it might even find that no valid answer exists based on the range of values requested, and then simply return an empty result immediately.)
  • The database then accesses the database objects and returns only the database blocks containing the rows needed from storage.
  • Even better, if the needed blocks were already in the database buffer cache memory in the proper state, the database would just use those blocks.
  • If another user had earlier issued a similar query, the optimizer wouldn’t waste any time building a new execution plan – it would just use the one already in the database’s library cache.
  • Finally, the database blocks and the query plan itself would be cached until no longer needed; they’d be aged out of memory as more recent requests demanded memory be freed.
  • Best of all, my DevOps team can tune queries before they’re ever run. The optimizer can accurately approximate the execution plan and thus catch foolish mistakes like queries that will run nearly forever because of improper joins or lack of indexes on columns that are most often used for selection criteria.
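
For readers who haven’t lived inside a database buffer cache, the caching behavior in the list above boils down to something like the following least-recently-used (LRU) sketch. It is a deliberately simplified illustration under my own assumptions: Oracle’s actual cache management is considerably more sophisticated, and the `LRUCache` class and `read_block_from_disk` function are hypothetical names invented for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = load(key)                     # the expensive path
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # age out the least recently used
        return value


physical_reads = []


def read_block_from_disk(block_id):
    """Pretend physical I/O; records every trip to storage."""
    physical_reads.append(block_id)
    return f"rows-in-block-{block_id}"


buffer_cache = LRUCache(capacity=2)
for block_id in [1, 2, 1, 3, 2]:
    buffer_cache.get(block_id, read_block_from_disk)

print(physical_reads, buffer_cache.hits, buffer_cache.misses)
```

The same structure works for cached execution plans if the key is the query text and the loader is the optimizer. Every hit is a round trip to storage (or a plan build) that never happens.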

The whole point of this optimization strategy is to minimize round trips to physical storage. That also limits expensive physical I/O and helps increase application throughput. And it’s practically the exact opposite strategy that AI inference requests use to locate needed data.

Non-Deterministic Prompts Span Different Storage Workload Patterns

Solidigm’s analysis of what their customer base is focused on for AI workloads is particularly pertinent for wringing performance from underlying cloud storage systems. Their research shows at least two key workload patterns that must be accommodated for effective inference throughput.

The first workload pattern is focused on RAG or grounding activity that’s dominated by relatively small random reads against resources typically stored within either local NVMe drives or independent objects / files.

The second AI pattern resembles a typical data warehouse workload with large-block reads and writes dominating in totally different storage stacks, starting at the key-value (KV) cache and extending into other caches depending on how recently resources were propagated for reuse into those caches.

Solidigm’s storage tiering methodology allows them to simulate what a modern database does intrinsically within its carefully-controlled memory cache structures: in essence, systematically maximizing KV cache for higher availability of resources needed to answer AI prompts over longer periods of time while still offering up sufficiently-responsive storage for RAG and grounding. It’s not a perfect solution, of course, but it takes into account how their customers’ AI workloads are evolving from earlier emphasis on model training to leveraging powerful LLMs for inference demands.

An example of the Deep Field photo from Hubble and JWST

When All Seems Dark, Look Backward, Forward, and Up

You know what I dearly love about attending Tech Field Day events? They aren’t just about humdrum, boots-on-the-ground problems we deal with every day. I found myself re-energized about what IT technology can accomplish for humanity at large after hearing more about the history of what we call The Cloud; tireless efforts to preserve trillions of web pages since the Internet’s inception; and just how tiny and precious our pale blue dot is when compared to the vastness of the universe.

The Cloud: A Brief History of Irrational Exuberance

Tom Lyon, an historian of computer science who has been programming since the early 1960s and co-founded DriveScale, stepped up to give us a totally different perspective from how most of us delegates have probably thought about the current state of all things Cloud – public, private, hyper-scaled, neo-cloud, you name it. He posited that what we call The Cloud these days actually has its genesis nearly a century ago.

The interesting part? We’re really just ending yet another cycle of what Tom and other keen observers call irrational exuberance as the famed AI hype cycle plays out.

Tom noted hyperscalers like Amazon, Google, Oracle, and Microsoft have built a house of cards around unbelievably rosy projections of AI growth possibilities, furthered by complex accounting tricks like special purpose vehicles to hide debt on their balance sheets. I commented that I’m just waiting for some 70-something investor to pop up at a shareholders’ meeting to shout at Larry Ellison, Satya Nadella, or Sundar Pichai, “Excuse me, but are you on crack?”

The Real Ministry of Truth. Apologies to Winston Smith.

The continuing degradation of objective truth in the modern world has severe implications for policy-making, corporate planning, and history at large. It’s impossible to ignore the erasure of large swaths of scientific data about everything from climate change to basic public health that continues in real time.

Without the Wayback Machine, much of the content from US government websites that’s been effectively erased by the current administration in Washington would be gone for good. That’s why it’s so crucial to preserve the Internet’s billions of pages of content created over the decades since the World Wide Web was born.

We heard from Joy Chesbrough, the Internet Archive’s Chief Philanthropy Officer (now that’s a cool job title!), on how their organization is able to preserve the past digitally on a relatively tiny budget of $30M per year. It was inspiring to hear how this non-profit is able to do the sometimes-thankless work of digital preservation and what those efforts conserve for future generations.

We’ve all seen how governments, politicians, and oligarchs have attempted recently to discard uncomfortable truths – everything from the denigration of the 1619 Project’s attempts to document how slavery played a key part in the founding of the American Experiment to recommended standards for nutrition, vaccine scheduling, and disease prevention. The Wayback Machine is a crucial tool for preserving that treasure trove of human knowledge against the creep of religious conservatism, willful ignorance, and authoritarian governments.

There’s No Intelligent Life Here. Do Look Up!

It’s easy to get discouraged about the current state of humanity, but I found my faith restored during our visit to the Search for Extraterrestrial Intelligence (SETI) Institute in nearby Mountain View.

We took a tour of the SETI offices after we chatted with Dr. Christina Ricci about the humongous scope of the universe and the media they’re searching for evidence of industrialized and advanced civilizations. She explained the famous Drake Equation, which estimates how many communicating civilizations might exist in our galaxy, and thus how likely we tiny humans are to find evidence of extraterrestrial intelligence if we just look hard enough. It might even happen in our lifetimes … and it would utterly transform the way humanity thinks about itself once we know we are truly not alone in the universe.
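For anyone who hasn’t seen it written out, the Drake Equation is simply a product of seven factors. Here’s a minimal Python sketch of it. The input values below are my own illustrative placeholders, not figures presented at SETI, and the function and parameter names are just my choices.

```python
def drake_equation(r_star, f_p, n_e, f_l, f_i, f_c, lifetime):
    """Estimate N, the number of detectable civilizations in our galaxy.

    r_star   - average rate of star formation (stars per year)
    f_p      - fraction of stars with planets
    n_e      - habitable planets per star that has planets
    f_l      - fraction of those planets where life appears
    f_i      - fraction of those where intelligence evolves
    f_c      - fraction of those that emit detectable signals
    lifetime - years such a civilization keeps emitting signals
    """
    return r_star * f_p * n_e * f_l * f_i * f_c * lifetime


# Placeholder inputs chosen only to show the arithmetic (assumptions).
n = drake_equation(r_star=1.5, f_p=1.0, n_e=0.2, f_l=0.5,
                   f_i=0.1, f_c=0.1, lifetime=10_000)
print(f"Estimated detectable civilizations: {n:.1f}")
```

Nudge any one of the fractions by a factor of ten and the answer swings by the same factor, which is why estimates of N vary so wildly.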

Dr. Vishal Gajjar explained how his work at SETI applies artificial intelligence – really, complex machine learning algorithms trained across exabytes of radio and visual astronomy data – to determine if there really are advanced civilizations in our galactic backyard. One of the most fascinating ideas involves looking for evidence of construction of mind-bogglingly huge structures like Dyson spheres surrounding exoplanets and stars across years of collected data in both EM and visible light spectrums.

Of course, I had to buy a couple of t-shirts to support their efforts. 🤯

VMware VCF 9.0: When All Prop Blades Work, OSS Database Performance Is Optimal

Our final day at #CFD25 dove deeply into the feature sets of the recently-released VMware Cloud Foundation 9.0 (VCF 9.0). It was refreshing to discuss the intricacies of cloud computing that didn’t immediately turn towards how many hundreds or thousands of Docker or Kubernetes containers we can deploy at scale. Instead we focused on where a lot of the real work still happens out of sight and mind: the rugged families of databases capturing the exabytes of data eventually used to create documents everyone wants their generative AI workloads to consume.

As a long-time beta tester, user, and aficionado of Oracle Cloud Infrastructure (OCI), I understood the challenges laid at VMware’s doorstep to transform their offerings: Modern IT organizations must effectively operationalize their computing, storage, and networking infrastructure. I equate these facets to an aircraft’s three-bladed propeller: If just one blade is under-performing, the power and effectiveness of the other two will be compromised as well.

Blade #1: Managing Memory To Forestall the Impending DRAMpocalypse

As our VMware presenter acknowledged, we’re currently in the throes of a “DRAMpocalypse,” so it’s never been more crucial for IT shops to manage their existing servers’ memory resources effectively. (I recently purchased a new Framework laptop – hopefully the last one I’ll ever need to buy – and the recent spike in DRAM prices was a wallet-shocker.)

Answering this DRAMpocalypse, VCF 9.0 offers advanced memory tiering features to exchange the least-active pages from DRAM to NVMe. While most modern databases provide this tiering capability via software, in VCF 9.0 it actually happens within the VMware configuration itself. It’s a hypervisor-native tiering mechanism that leverages what VMware terms a Logical Memory Unit, composed of DRAM at the top of the tier and the slower NVMe storage tiered below.

The tiering mechanism’s goal is to keep CPUs from waiting to process pages in memory. As database workloads proceed, VMs consume logical memory and the tiering software dynamically relocates the hottest pages to DRAM and switches out the colder pages to NVMe storage. The tiering algorithm takes into account the I/O access method – read-only vs. read-write – needed for operations, too.
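To make the hot/cold idea concrete, here’s a minimal Python sketch of a two-tier placement decision. To be clear, this is not VMware’s algorithm: ranking pages purely by access count, the `place_pages` function, and the tiny page trace are all my own simplifying assumptions, and the real mechanism also weighs things like read-only vs. read-write access.

```python
from collections import Counter


def place_pages(access_counts, dram_capacity):
    """Split pages into (dram, nvme) sets, hottest pages first.

    access_counts - mapping of page id -> number of recent accesses
    dram_capacity - how many pages fit in the DRAM tier (hypothetical unit)
    """
    # Rank by descending access count; break ties by page id for determinism.
    ranked = sorted(access_counts, key=lambda p: (-access_counts[p], p))
    dram = set(ranked[:dram_capacity])
    nvme = set(ranked[dram_capacity:])
    return dram, nvme


# A made-up trace of page touches from a database workload.
trace = ["p1", "p2", "p1", "p3", "p1", "p2", "p4"]
counts = Counter(trace)

dram, nvme = place_pages(counts, dram_capacity=2)
print("DRAM tier:", sorted(dram))
print("NVMe tier:", sorted(nvme))
```

Re-running the placement as the trace grows is what lets a page migrate between tiers when its temperature changes.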

VMware claims this tiering method at least doubles the effective use of memory and returns a corresponding 40% reduction in TCO because the hottest pages are placed essentially closer to the CPU. The tiering algorithm is configured automatically so it doesn’t need constant monitoring for effectiveness.

Again, this isn’t a revolutionary concept – Oracle Database 12c implemented this feature 10+ years ago – but since the memory management is native to the hypervisor itself, less sophisticated or open source databases like MySQL or Postgres can take advantage of these performance enhancements.

And since this strategy ensures that the hottest pages aren’t being constantly exchanged between DRAM and NVMe, there’s also a side benefit: the potential to extend NVMe useful life by preventing extensive read/write operations over time.

Finally, several data encryption security features are supported, and it can be deployed at either the host or VM level. Check out this detailed video demonstrating these features, and here are the deeper details from VMware: https://blogs.vmware.com/cloud-foundation/vcf-advanced-memory-tiering/

Blade #2: Managing MySQL, Postgres, and SQL Server Databases with Data Services Manager 9 (DSM)

I thought it was pretty gutsy for VMware to show up in a room populated with several experienced DBAs from at least three database families – SQL Server, Oracle, and MySQL – to talk about the second prop blade: Data Services Manager 9 (DSM).

DSM offers full support for MySQL and Postgres – two of the most popular open-source databases these days – as well as SQL Server. DSM gives vSphere administrators a central management portal to manage and control related resources via specific data service policies and infrastructure policies that limit access to database resources to specific users.

Infrastructure policies make it simple to grant privileges to qualified users – perhaps a trusted DevOps resource, or a junior DBA – to deploy clones of existing production databases, even permitting deployment of prior versions of database engines (releases) for researching issues related to prior releases.

Our VMware presenter also demonstrated how to deploy resources to support a MySQL database through DSM, including the ability to quickly deploy a clustered MySQL environment – a non-trivial exercise – with just a few mouse clicks.

Finally, as an experienced DBA, let me assure you if your DBAs aren’t constantly fretting about backing up your organization’s crucial databases – which should include development and staging databases! – then you haven’t got the right people on staff. VMware showed how DSM 9 made it simple to enable backup strategies, including selection of the appropriate storage targets for backup files.

I did probe our presenters about preserving Transparent Data Encryption (TDE) for MySQL databases. TDE is a particularly valuable feature for MySQL environments; it ensures data is truly encrypted within the database itself. That should mean any backups taken of TDE-encrypted database files remain encrypted too: database blocks within encrypted tablespaces stay encrypted when they’re backed up, which eliminates a potential vector for discovering or accessing data.

Here’s a detailed look at this set of features; you can watch our CFD25 delegates’ spirited questioning too.

Blade #3: Tying It All Together Within Virtual Private Clouds (VPCs)

I’ll admit that the final blade on the prop – networking – tends to be the least interesting (and thus most often ignored) feature for data engineers and experienced DBAs.

Deploying the network infrastructure to support a modern database with proper restrictions is crucial to keeping data secured properly within any application and database environment. (I’ve recently struggled to set up relatively complex networking within OCI environments, so trust me: if networking isn’t in your regular wheelhouse, this operation is potentially error-prone.)

VMware demonstrated how VCF Networking NSX services made short work of building out robust public / private network infrastructure within a Virtual Private Cloud (VPC) in a matter of minutes without having to worry about choosing exactly the right IPv4 addresses to make everything work. A particularly useful feature: VCF will not allow subnets to be deployed within overlapping CIDR address blocks accidentally, thus ensuring network communication isn’t compromised by face-palm-level mistakes.

What wasn’t readily apparent was how to ensure that particular ports within IP addresses are blocked or opened. To their credit, our VMware colleagues explained the best way to guarantee that protection level was to deploy their vDefend toolset to control port-level permissions.
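If you’ve never been bitten by overlapping CIDR blocks, here’s a small Python sketch of the kind of check involved, using the standard library’s `ipaddress` module. This only illustrates the concept and says nothing about how VCF implements it; the `find_overlaps` helper and the sample subnets are my own.

```python
import ipaddress


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in `cidrs` that overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    conflicts = []
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            # overlaps() is True when the two ranges share any address.
            if a.overlaps(b):
                conflicts.append((str(a), str(b)))
    return conflicts


# The third subnet sits inside the first one: a classic face-palm.
subnets = ["10.0.0.0/24", "10.0.1.0/24", "10.0.0.128/25"]
conflicts = find_overlaps(subnets)
print(conflicts)
```

A provisioning tool that runs a check like this before creating a subnet can refuse the request outright instead of letting you discover the routing problem later.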

Here’s a detailed look at this offering from VMware’s perspective, and here’s our delegates’ in-depth discussions and questions.

Conclusion: When All Blades Work …

Overall, VCF 9.0 looks to me like a full-featured yet evolving toolset valuable for open-source database management and corresponding support for application development and production deployments. Its self-service features mean IT shops can delegate complex performance monitoring, database environment management, and networking to reasonably qualified or junior team members without incurring significant risks of self-harm.

Hammerspace: Metadata Will Be Assimilated, and Their Uniqueness Added to AI’s Demands

Cloud Field Day 25 #CFD25 was my first introduction to Hammerspace’s unique technology offerings, which are specifically aimed at managing workloads in today’s cloud environments.

As a long-time Oracle DBA with a deep background in storage technology, I found their analysis of what their customer base is focused on to be particularly pertinent for wringing performance from underlying cloud storage systems. Meeting the demands beyond modern OLTP applications and business intelligence dashboards – specifically, the impact of generative and agentic AI requirements – requires new ways of handling diverse I/O patterns and compute demand cycles, especially for AI training and inference operations.

Leaving Unstructured Data Where It Already Lives. Well, Most of It.

First, I was pleasantly surprised to see Hammerspace lead with an example of their AI Data Platform technology assisting an environment I know quite well: Oracle Cloud Infrastructure (OCI). I quite literally had to rub my eyes in relief, since most Tech Field Day presenters usually avoid Oracle technology like the plague. But OCI plays extremely well with advanced storage systems, whether a database is accessing flat files like tables in ORGANIZATION EXTERNAL mode or when chunking and transforming documents and images into its 26ai VECTOR datatype for generative AI purposes.

Of course, it’s no secret that AI workloads often demand huge compute and GPU resources to be brought to bear to tackle both training and inference operations against equally ginormous unstructured data sources. Instead of migrating terabytes or petabytes of these data to cloud storage, Hammerspace’s Global File System (GFS) enables extremely efficient access to unstructured data for massive, parallel cloud-based compute because it decides exactly which files should be moved to the fastest intra-cloud storage for handling while leaving the majority of unneeded files in their current location.

Those Metadata Matter.

The trick, of course, is selecting only the files most likely to benefit – and that means looking closely at each file’s metadata to determine which file(s) should be gathered from on-premises data and brought to the public cloud to take advantage of VMs, NVMEs, and GPUs.

Hammerspace’s solution involves first deploying an Anvil server that interrogates unstructured files’ metadata to assimilate their key attributes as a cone file system. One big advantage of this strategy is that files can span multiple environments, mount points, and file systems to capture an holistic view of all possible targets.

Once all mount points, file systems, directories, and individual files are assimilated, that information is captured so its Data Services Server (DSX) can then use it to intelligently gather the files that would most benefit when processed for a particular workload’s demand.

And before you ask: While the extensive demos we saw at #CFD25 were all GUI-based, Hammerspace also offers access to all these operations via API calls through their Hammerspace development toolkit (HSTK) or their HSCLI command line interface.
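Here’s a minimal Python sketch of what metadata-driven selection can look like in principle. None of this is Hammerspace’s actual logic or API: the `FileMeta` fields, the `select_for_tier0` function, the selection rules (file kind, recency, a capacity budget), and the sample catalog are all hypothetical, invented purely to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    path: str
    size_gb: float
    days_since_access: int
    kind: str  # e.g. "image", "text", "log"


def select_for_tier0(catalog, wanted_kinds, max_age_days, budget_gb):
    """Pick files worth moving next to the GPUs, using metadata only."""
    candidates = [
        f for f in catalog
        if f.kind in wanted_kinds and f.days_since_access <= max_age_days
    ]
    # Most recently accessed files get first claim on the fast tier.
    candidates.sort(key=lambda f: f.days_since_access)
    chosen, used = [], 0.0
    for f in candidates:
        if used + f.size_gb <= budget_gb:
            chosen.append(f)
            used += f.size_gb
    return chosen


catalog = [
    FileMeta("/onprem/a/scan1.tiff", 40.0, 2, "image"),
    FileMeta("/onprem/b/notes.txt", 1.0, 10, "text"),
    FileMeta("/nas/c/old.tiff", 30.0, 400, "image"),
    FileMeta("/nas/d/scan2.tiff", 50.0, 5, "image"),
    FileMeta("/nas/e/app.log", 5.0, 1, "log"),
]

chosen = select_for_tier0(catalog, {"image", "text"},
                          max_age_days=30, budget_gb=60.0)
print([f.path for f in chosen])
```

The point is that no file contents are read to make the decision. Everything else stays where it already lives.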

Result: A Logical Tier 0 NVMe-Based Storage Layer, With File Resiliency Built-In

Once assimilation is complete, the Hammerspace AI Data Platform solution enables instantiation of sufficient VMs with direct access to GPU processors. Each VM also accesses and gathers the files needed to complete the training or inference tasks at hand. Files are placed within Hammerspace’s proprietary NVMe-based logical Tier 0 storage layer. I/O operations then proceed apace in parallel and without the limitations of some files being stored on slower, less responsive storage systems.

Obviously, file replication can be a time- and resource-consuming process. To guard against the possible loss of data after the potentially lengthy process of replication, Hammerspace’s solution also ensures files are spread across Availability Zones (AZ) to avoid placing a file image onto two NFS file buckets within the same risk zone.
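The placement rule itself is easy to sketch in Python. Again, this is my own illustration under assumptions, not Hammerspace’s implementation: the `place_copies` function and the bucket and AZ names are made up.

```python
def place_copies(targets, copies=2):
    """Choose `copies` buckets so that no two share an Availability Zone.

    targets - list of (bucket_name, az_name) pairs, in preference order
    """
    chosen, used_azs = [], set()
    for bucket, az in targets:
        if az in used_azs:
            continue  # a copy already lives in this risk zone
        chosen.append(bucket)
        used_azs.add(az)
        if len(chosen) == copies:
            return chosen
    raise ValueError("not enough distinct Availability Zones for all copies")


targets = [("nfs-1", "az-a"), ("nfs-2", "az-a"), ("nfs-3", "az-b")]
placement = place_copies(targets, copies=2)
print(placement)
```

Note that nfs-2 gets skipped even though it’s next in line, because losing az-a would otherwise take out both copies at once.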

Even Meta Needs Meta(Data)

As someone who has manually built storage arrays for fastest database I/O processing and worked with high-speed and high-capacity systems for storage access like Hitachi’s Universal Storage Platform and Oracle’s Exadata Database Machine, this strikes me as an elegant solution to the varying demands of generative AI and agentic AI workloads to get compute and GPU resources closest to the right data needed to solve crucial business problems at scale.

And Hammerspace reports that one of their largest customers – Meta (yes, that Meta, as in Facebook, Instagram, et al.) – uses their solution to manage access to 40PB of unstructured data in production, with plans to expand that footprint to 100PB in the future. Meta’s use case also leverages several thousand servers accessing tens of thousands of GPUs.

Glean Insights & Value from Unstructured Data With Qlik Answers

I’ve found that every once in a while, it doesn’t hurt to see what everyone else is doing in the same technology space I’m currently focused on. For the past 18 months, that’s been the Generative AI space and the impending implementation of Agentic AI across diverse industries and applications.

Getting to Generative AI: Like Learning a New Foreign Language

Full disclosure: I’m an Oracle DBA with 25 years of experience in data engineering and 45 years of experience in application programming. Lately I’ve focused on building out simple Generative AI and Retrieval Augmented Generation (RAG) chatbot applications with Oracle Database 23ai technology and Oracle Application Express (APEX) within the Oracle Cloud Infrastructure (OCI) public cloud. That meant learning how to use LLMs to chunk and create embeddings for a corpus of documents, how to perform cosine similarity searches against vectorized content, and how to prepare appropriate system prompts within a chatbot framework to return cogent answers from that corpus based on questions asked.
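If cosine similarity search is new to you, here’s the core of it as a minimal, dependency-free Python sketch. This isn’t the code from my Oracle demos: the three-dimensional vectors are toy stand-ins for real embeddings (which typically have hundreds of dimensions), and the `cosine_similarity` and `top_k` helpers are my own names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the text of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy (text, embedding) pairs standing in for a vectorized corpus.
chunks = [
    ("Backups run nightly.", [0.9, 0.1, 0.0]),
    ("GPUs need fast storage.", [0.1, 0.9, 0.2]),
    ("Restore tests happen monthly.", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # pretend embedding of a question about backups

best = top_k(query, chunks, k=2)
print(best)
```

In a RAG chatbot, the chunks returned here are what get pasted into the prompt as grounding before the LLM answers the question.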

This was a decidedly non-trivial task – it took me several weeks to master these concepts and then build demos that yielded relatively hallucination-free answers, and it was at least two months before I felt I could comfortably present my work to colleagues at user group conferences. I came away with a new respect for the depth of knowledge required to deliver qualified answers from LLMs and Generative AI applications.

Qlik Connect 2025: Qlik Answers

Back in May 2025 I had a chance to take a close look at the latest version of Qlik Answers for developing Generative AI solutions. While at Qlik Connect, I spoke with executives and developers about their vision for capturing valuable business insights into their customers’ data, especially if it was unstructured information strewn across thousands of pages of documents.

The folks at Qlik granted me a trial account and I dove into what Qlik Answers could achieve. I was pleasantly surprised that it was a relatively straightforward path to construct a chatbot that could search through several hundred pages of documents from multiple sources – scholarly papers, digital news reports, blog posts by reputable authors – to return cogent answers to business questions.

What impressed me was how quickly this all came together: Importing my corpus of nearly 30 documents, indexing them for use, and constructing a basic chatbot that could chew on the corpus to provide answers took less than 15 minutes.

Handling Outliers Is What Matters

I’m not some starry-eyed dreamer about AI capabilities; indeed, my earlier work with generative AI had yielded some surprising confabulations depending on what questions I asked of my chatbot.

Thus my evaluations included some tricks I’d learned during my prior experiences – things like prompt injection attempts and even hiding system prompt overrides within a source document. I discovered that Qlik Answers was able to handle the twists I threw at it quite well without any additional fine-tuning.

Over to You: Have An In-Depth Look At Qlik Answers

Obviously, this brief blog post isn’t going to convince anyone of how well Qlik Answers performed during this process, so please have a look at my complete evaluation of this tool. It contains detailed screenshots and explanations of every step I took, including references to each document I used as a source for my corpus so you can quickly run similar evaluations. Of course, please feel free to post comments to let me know what you have discovered.