Starfish helps companies tap the data value buried in their file systems
Partner content Starfish Storage might not be a household name among enterprise storage practitioners, but in high performance computing circles it is regarded as the most scalable and versatile file management platform. You will find Starfish running in the world's leading supercomputing centers, as well as in the R&D departments of major corporations, the research computing facilities of top universities, EDA simulation farms, hedge funds, and animation studios.
Starfish tackles a mix of traditional storage management use cases, such as archiving, backup, migration, cost accounting, and aging analytics. It also handles data management use cases including AI/ML workflows, data curation, data preservation, and content classification.
All of this works at enormous scale. Starfish's largest customers have thousands of storage volumes, hundreds of petabytes, and tens of billions of files. As a prime example, Starfish was recently deployed at Lawrence Livermore National Laboratory on El Capitan, the world's most powerful supercomputer.
Founded in 2011, Starfish was one of the first commercial products to adapt the concept of the data catalog to the navigation and management of files that live out in the wild. By "in the wild," we mean files stored on devices accessed via NFS, SMB, native clients, POSIX, and S3. These files are in active use. Users, applications, data acquisition devices, and scientific and biomedical instruments are constantly adding, deleting, and updating them.
Such files are not housed behind portals such as Microsoft SharePoint and are not part of content management systems, records management systems, archives, or data lakes. They are just like the files on your personal computer. They are subject to renaming, deletion, duplicates, and version mismatches.
Data catalogs are software platforms that associate metadata with data resources such as databases, data lakes, and data warehouses. They enable business users to find and access their institutions' data assets.
Starfish's founders saw that the leading data catalog providers lacked the technical means to assign metadata to the unstructured data hiding behind complex directory trees and user permissions. Corporate data catalogs were limited to curating structured and semi-structured data while files in the wild remained opaque and untamed. Starfish's original product, now called the Unstructured Data Catalog or UDC, addressed this gap in the market by creating an index across all of the organization's file storage devices that associated metadata with files and directories. The UDC enables the business to understand how file contents relate to projects, intellectual property, workflows, and cost centers even when spread across multiple storage devices.
The UDC helps to solve the age-old problem of linking data storage to data value. It also provides insights into how best to manage storage over time, including what administrators can archive or delete, what they must preserve, and who pays for what. An embedded reporting dashboard shows capacity and aging analytics with fine-grained insights enabled by the metadata system.
Fast-forward to 2025. Organizations of all shapes and sizes are scrambling to become AI-ready by identifying and gaining access to data resources that could be relevant for AI workloads.
This AI/ML frenzy is highlighting the need for data cataloging, especially for the mountains of valuable information buried in an organization's file stores. AI data quality and security depend upon differentiating file versions, accounting for permissions (particularly in retrieval-augmented generation (RAG) scenarios), and integrating outputs from AI/ML workflows back into the catalog's metadata.
As one might expect, new players are entering the field of unstructured data cataloging. Some are startups, while others are traditional storage vendors adding data cataloging features to their file storage products. This raises the question: what makes a great file-based data catalog?
One of the essential design criteria for Starfish was to be storage vendor-agnostic. It works with virtually all file and object storage devices. This allows Starfish to have a universal map of all file content stored in the organization. By contrast, a data catalog from a storage vendor is likely to work well with its own storage but not extend itself to content stored in devices from other vendors. The result is simply a new form of vendor lock-in. Starfish bypasses this problem, offering an unobstructed view across all storage devices.
Many data management systems are in-line, meaning they operate directly on the storage infrastructure. This risks bottlenecks and vulnerabilities at scale. Starfish, on the other hand, was designed from the ground up to operate out-of-band, interacting with the storage system from a separate process. This offers advantages like non-disruptive operation and easier scalability.
Starfish has a feature called Storage Zones that groups related content together and presents it to the relevant users, like researchers, lab managers, librarians, and others. It gives them tools for searching and tagging within the boundary of their zone. This enables storage users to manage their file collections, even if they are spread across multiple systems including NAS, HPC file systems, and S3 buckets. This is yet another advantage of being storage-agnostic: the feature lets those who best understand the value of their data engage in data management practices. In the long term, the results add up, as organizations can store data in ways that better reflect its value while freeing primary storage.
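Conceptually, a zone can be thought of as a named view over the global index: a set of path prefixes that may span a NAS mount and an S3 bucket, with searches confined to that boundary. The sketch below is a simplified illustration under that assumption; the names and the path-prefix model are hypothetical, not Starfish's schema.

```python
class StorageZone:
    """Hypothetical zone: a named set of path prefixes spanning storage systems."""
    def __init__(self, name, prefixes):
        self.name = name
        self.prefixes = tuple(prefixes)

    def contains(self, path):
        # str.startswith accepts a tuple, so one call covers all prefixes
        return path.startswith(self.prefixes)

    def search(self, index, tag):
        """Tag search scoped to this zone's boundary."""
        return sorted(p for p, tags in index.items()
                      if self.contains(p) and tag in tags)


# A toy global index (path -> tags) spanning a NAS mount and an S3-style prefix
index = {
    "/nas/lab7/run1.dat":        {"instrument:cryoEM"},
    "s3://bucket/lab7/run2.dat": {"instrument:cryoEM"},
    "/nas/lab9/notes.txt":       {"instrument:cryoEM"},
}

lab7 = StorageZone("lab7", ["/nas/lab7/", "s3://bucket/lab7/"])
hits = lab7.search(index, "instrument:cryoEM")
# hits includes only the two lab7 paths; the lab9 file is outside the zone
```

The point of the scoping is that a lab manager searching "their" data never sees, or has to reason about, content belonging to other groups, regardless of which physical systems hold the files.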
A data catalog's metadata and discovery capabilities are only half the picture, whether you seek to be AI ready or to tackle some other aspect of unstructured data management. There must also be a mechanism to enable files of interest to be accessed and processed in secure ways.
Starfish incorporates a jobs engine called the Starfish Automation Engine that can process and move files based on insights from the catalog. In turn, the jobs engine writes metadata back to the catalog based on the discoveries it makes and the actions it takes.
The catalog might identify files that should be used to train a model. The jobs engine could then submit the files to the training pipeline and record back into the metadata catalog which versions of which files were used to train the model. Over time, this feedback loop gives you a deeper understanding of how you are using and managing your data sets.
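That feedback loop can be sketched as a simple select-process-record cycle. Everything below is illustrative: the catalog structure, the tag names, and the `submit_to_training` stand-in are assumptions, not Starfish's actual schema or API.

```python
# Hypothetical catalog: path -> metadata dict with tags and a file version
catalog = {
    "/data/train/a.csv": {"tags": {"training-candidate"}, "version": "v3"},
    "/data/train/b.csv": {"tags": {"training-candidate"}, "version": "v7"},
    "/data/raw/c.csv":   {"tags": set(),                  "version": "v1"},
}


def submit_to_training(path):
    """Stand-in for handing a file to a real training pipeline."""
    return True  # pretend the pipeline accepted the file


def run_training_job(catalog, model_name):
    """Select candidates from the catalog, process them, write results back."""
    used = []
    for path, meta in catalog.items():
        if "training-candidate" in meta["tags"] and submit_to_training(path):
            # Feedback step: record which file version trained which model
            meta["tags"].add(f"trained:{model_name}")
            used.append((path, meta["version"]))
    return used


used = run_training_job(catalog, "demo-model")
# used lists the candidate files and their versions; the catalog now
# carries a "trained:demo-model" tag on each of them
```

Each pass through the loop enriches the catalog, so later questions like "which file versions went into this model?" become metadata queries rather than forensic exercises.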
In summary
These are exciting times for Starfish Storage. The niche we've occupied for over a decade is going mainstream and we have a unique, mature solution that works at the top end of scale.
Alongside a unified file indexing system that spans multiple vendors' storage devices, we have a flexible metadata system that makes it easy to classify, move, and process file collections.
Visit starfishstorage.com to learn more. If you are attending ISC in Hamburg, Germany, come see us at Booth A22, June 10-13. Qualified customers are welcome to trial Starfish in their own environment free of charge. You will learn a lot about your file storage in such a trial.