Nowadays, most projects that utilize Artificial Intelligence (AI) models demand significant computational resources. Almost every time a new model comes out and outperforms previous ones, it seems to require more computational resources to run efficiently. A lot of people will say that there are exceptions, such as the DeepSeek model, but that is not actually true: models like DeepSeek are competitive with larger models, not better than them. At least at this point, size seems to be directly correlated with the power of a model.
Traditionally, deploying AI at scale meant managing very complex infrastructure, from provisioning servers or clusters to writing deployment scripts and managing cloud-specific services. This overhead has not only become a major pain point for many ML teams but also a limiting factor, stopping them from trying out new models and constraining their creativity. To avoid these limiting factors, we need to adapt our approach, and this is exactly what Modal enables us to do as a unified cloud platform for running code for data and AI tasks.
Modal (launched by Modal Labs in 2023) is a platform for running AI workloads without manual infrastructure setup. It allows developers to define workflows entirely in Python, with code executed on cloud-managed compute resources. The goal is to simplify deployment by abstracting away server and cluster configuration.
How Does Modal Work?
Modal is a cloud platform for running code in the cloud without needing to focus on infrastructure. Developers interact with Modal through a Python SDK (Software Development Kit), defining so-called "apps" and "functions" that Modal runs on-demand on its infrastructure. This relatively novel approach, which might as well be called a "Functions-as-a-Service" model, means that developers can take a Python function and execute it remotely with a simple decorator or API call. If you're familiar with cloud computing, this might remind you of services like AWS Lambda or Google Cloud Functions. But while they share some surface similarities, Modal works quite differently.
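As a minimal, hypothetical illustration of that idea (the app and function names are our own, not from Modal's documentation):

```python
import modal

app = modal.App("hello-modal")

@app.function()
def square(x: int) -> int:
    # This body runs in a container on Modal's infrastructure, not locally.
    return x * x

@app.local_entrypoint()
def main():
    # .remote() ships the call to the cloud and returns the result.
    print(square.remote(7))
```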
Unlike conventional approaches where a developer might work with Docker or Kubernetes to prepare everything they need to execute code, Modal takes it a step further and allows developers to specify everything in Python code. To be more precise, in Modal we define containers. Containers are kind of like mini virtual machines that run just what you need, without the extra baggage, and they are managed by container engines that use a variety of tricks to isolate programs from each other. Modal runs its containers using the gVisor container runtime, developed by Google out of the need for a sandboxed container that provides a secure isolation boundary between the host's OS and the application running in the container.
These containers are built by Modal based on instructions that live in the Python code, not in a YAML file or something similar. Essentially, when trying to run something on Modal, the first thing we will do is define an image in the code, specifying the version of Python we want to run our code on and the libraries required for running it. Take a look at an example of how to define one such image for running the Flux model from HuggingFace:
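A minimal sketch of such a definition might look like this (the app name and the exact dependency list are illustrative):

```python
import modal

# The App object that all of our functions and classes will be attached to.
app = modal.App("flux-image-generation")

# The container image is defined entirely in Python: pick a base image and
# Python version, then list the libraries the Flux pipeline needs. Nothing
# here is installed locally; Modal builds and caches the image remotely.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch",
    "diffusers",
    "transformers",
    "accelerate",
    "sentencepiece",
    "pillow",
)
```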
As you can see in the code above, everything is handled in Python without the need for any external files. The user defines the dependencies in the Python code; they will NOT be installed locally but only in the remote environment on Modal.
As you can see at the top, before we define the actual image, we create an instance of the modal.App class. We use this object to represent an application running on Modal. We'll attach all the functions and classes we create to this object, which keeps everything organized and easy to manage.
An ephemeral App is created when you run your script using app.run() or the modal run CLI command. It's a temporary App that exists only while your script is running. On the other hand, a deployed App will exist indefinitely, or until you delete it via the web UI. Based on what you want to achieve with your app, you need to pick one of the two and go with it. Here, how you plan on scaling is a very important factor, so understanding how we scale with Modal is of the utmost importance.
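A rough sketch of the two modes, reusing the App defined above (the file name is illustrative):

```python
# Ephemeral App: exists only while the script is running, e.g.
#   modal run flux_demo.py
# or, programmatically:
with app.run():
    ...  # call Modal functions here; everything is torn down when the block exits

# Deployed App: persists until you stop it from the web UI, e.g.
#   modal deploy flux_demo.py
```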
Serverless GPU Acceleration and Scaling
Most serverless platforms are limited to CPU-bound tasks or provide limited support for GPUs. Modal, on the other hand, allows users to attach a GPU to any function using a single parameter. That is not always necessary (a CPU-bound library like Pillow, for instance, doesn't benefit from a GPU), but AI workloads in general are only effective if they run on GPUs; running the code on a CPU would be extremely slow. For instance, to attach an H100 GPU from NVIDIA to a function, making it run on that GPU, we simply declare that we wish to do so when defining the function:
Under the hood, Modal will provision an instance with an H100 and execute the container there. The platform supports a range of GPU types, from more economical options such as the Nvidia T4 and L4 all the way up to SOTA (State-of-the-Art) accelerators such as the A100 and H100.
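A sketch of what this could look like, reusing the App and image from above (the model ID and generation settings are illustrative):

```python
# Attaching an H100 is a single parameter on the function decorator.
@app.function(gpu="H100", image=image)
def generate(prompt: str) -> bytes:
    import io

    import torch
    from diffusers import FluxPipeline

    # Load the pipeline inside the remote container. In a real deployment the
    # weights would be cached (for example in a Volume) rather than re-downloaded.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    ).to("cuda")

    result = pipe(prompt, num_inference_steps=4).images[0]
    buf = io.BytesIO()
    result.save(buf, format="PNG")
    return buf.getvalue()
```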
This lets users pick whichever GPU best suits their needs, a flexibility that is key for AI use cases. We can use weaker GPUs for smaller models or for testing, and switch to more powerful GPUs for inference or training, all by changing one value in our code. The only difference is, of course, going to be price. Compute is priced per second, with the cheapest Nvidia T4 costing $0.000164/sec ($0.59/hr) and the most expensive Nvidia H100 costing $0.001097/sec ($3.95/hr).
Modal abstracts away how these GPUs are provisioned, meaning that the user is not exposed to whether they come from AWS, GCP, or another provider. This is what makes Modal cloud agnostic at the resource level, as the user only needs to specify which GPU they want to use and Modal handles the rest.
Beyond just offering GPUs, Modal emphasizes speed and scale in provisioning them. The company wrote its own Rust-based container runtime that starts containers in well under a second, enabling an application to scale out to hundreds of GPU-backed workers within a few seconds; spinning up that many GPU instances via a cloud API or a Kubernetes cluster can take considerably longer. This ability to scale to hundreds of GPU-backed workers nearly instantaneously is not only important when we want to train models in a distributed manner, but is also instrumental in AI inference workloads, as we often run into sudden spikes in requests that can be hard to handle using standard approaches.
Handling Large Quantities of Data
Most AI workflows need to be able to handle large volumes of data. Modal provides a built-in solution for that called Volumes, a distributed file system for persisting and sharing data across function runs. Volumes allow developers to mount a storage volume into any function's container at runtime, and the function can read and write files on it as it would on a local filesystem. The key difference is that the volume persists beyond the life of a single function execution, meaning that other functions can access that same volume and interact with it at a later time.
For example, a user can download and store a large pre-trained model checkpoint into one of these Volumes. This allows multiple inference functions across multiple containers to read the weights of the model without having to download or transfer the model from an external source. In essence, it functions similarly to caching data in a particular Modal environment.
While this is the preferred way of interacting with data in Modal, the platform also supports other data access patterns, allowing users to mount external cloud storage, such as S3 buckets or Google Cloud Storage, directly into functions. This is useful if your data already lives in such a bucket; however, Volumes are still the recommended approach, as they are a more performant solution.
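A sketch of that caching pattern, continuing the script from above (the Volume name, mount path, and model are illustrative):

```python
# A persistent Volume, shared across function runs and containers.
model_cache = modal.Volume.from_name("flux-weights", create_if_missing=True)

@app.function(image=image, volumes={"/models": model_cache})
def download_weights():
    from huggingface_hub import snapshot_download

    # Download the checkpoint once and persist it in the Volume.
    snapshot_download("black-forest-labs/FLUX.1-schnell", local_dir="/models/flux")
    model_cache.commit()  # make the new files visible to other containers

@app.function(gpu="H100", image=image, volumes={"/models": model_cache})
def infer(prompt: str):
    # Every container running this function sees the same /models directory,
    # so the weights never have to be fetched from an external source again.
    ...
```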
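For example, mounting an existing S3 bucket could look roughly like this (the bucket name and the Modal secret holding the AWS credentials are hypothetical):

```python
# Mount an S3 bucket read-only into the function's filesystem.
s3_data = modal.CloudBucketMount(
    bucket_name="my-training-data",               # hypothetical bucket
    secret=modal.Secret.from_name("aws-secret"),  # AWS credentials stored in Modal
    read_only=True,
)

@app.function(image=image, volumes={"/data": s3_data})
def list_dataset():
    import os

    # Objects in the bucket appear as ordinary files under the mount path.
    print(os.listdir("/data"))
```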
Strategic Implications for AI Development and Cloud Adoption
In AI, there is an increasing demand for higher-level abstractions that simplify the deployment of complex workloads. While many ML engineers are extremely knowledgeable in their field, not all of them are adept at setting up the infrastructure needed to deploy the models they have designed. By providing a cloud-agnostic, serverless platform tailored for AI and data tasks, Modal is positioning itself as one of the easiest ways to bring AI to a variety of different industries. This has several strategic implications, both for practitioners and for the cloud industry at large.
For AI developers, Modal can significantly increase the speed at which we move from idea to production. It allows developers to avoid running into a standard bottleneck in their AI projects: the engineering work required to serve models to users or integrate them into products. In a lot of cases, this means that teams don't need to be scared of scaling a new ML feature, as the infrastructure needed to do so won't be a limiting factor.
Modal's cloud-agnostic approach also taps into the desire of some companies to avoid being deeply tied to a single cloud provider. By provisioning GPUs from multiple providers, Modal makes outages far less likely to affect its users. However, this also means that if Modal and similar platforms become extremely prominent in the space, we could see a shift in power away from the big cloud providers; they might become commodified back-ends rather than the interface developers directly engage with. This power shift is, however, not that likely to happen, as adopting a platform such as Modal can also be considered a form of vendor lock-in. Only time will tell how the landscape is going to look in a few years, as Modal is already seeing competitors in the form of start-ups and open-source projects, with major cloud providers surely working to simplify their own offerings.
Real-World Use Cases
Modal's versatility has made it the platform of choice for companies working in a variety of different fields. Let's take a look at two interesting use cases: generative AI inference at scale and computational biology.
Suno, a startup that offers services for generating music and speech, runs its production inference on Modal. This allows Suno to scale to thousands of concurrent users without needing to build out its own GPU farms. Modal allocates just as many resources as are needed: during spikes, it spins up new instances to handle demand, while during off-peak times it dynamically scales down to reduce costs. This demonstrates how even very complex and powerful models can be spun up quickly and adjusted dynamically based on demand.
The case of Sphinx Bio illustrates how Modal is being used in computational biology. Sphinx Bio runs protein folding models, similar to Google's AlphaFold, on behalf of researchers. Protein folding is a very computationally intensive process, requiring many GPUs to run efficiently. By using Modal, Sphinx Bio can scale up for big experiments without maintaining its own clusters and can scale down when it doesn't need as much computational power. Also, because Modal allows for scheduling, Sphinx Bio can easily schedule and queue many independent computations (i.e., folding many proteins concurrently) and let Modal handle the distribution of computational resources. While Sphinx Bio represents one such use case, other companies in the fields of genomics, physics simulations, and even financial modeling are sure to follow.
The above are just two example use cases; many more can be found on Modal's official website if you are interested in checking which companies currently use Modal.
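As a hypothetical sketch of that fan-out pattern (the folding function and inputs are placeholders, not Sphinx Bio's actual code):

```python
import modal

app = modal.App("protein-folding-demo")
image = modal.Image.debian_slim(python_version="3.11")  # plus the folding dependencies in practice

@app.function(gpu="A100", image=image, timeout=60 * 60)
def fold_protein(sequence: str) -> dict:
    # Placeholder for an expensive, independent folding computation.
    return {"sequence": sequence, "structure": None}

@app.local_entrypoint()
def main():
    sequences = ["MKTAYIAKQR", "GAVLIPFYW", "MSILVTRPSP"]  # many independent inputs
    # .map() fans the calls out across containers; Modal provisions the GPU
    # workers, schedules the work, and collects the results.
    results = list(fold_protein.map(sequences))
    print(f"Folded {len(results)} proteins")
```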
Conclusion
Modal represents a new type of cloud platform. Instead of requiring users to manage infrastructure on their own, Modal offers a function-centric approach, abstracting away many of the complexities of launching AI applications at scale. By addressing two main pain points in releasing AI applications, long deployment cycles and fragmented tooling, Modal is betting that in most cases users will opt for simplicity, speed, and cloud-agnosticism over low-level control.
Even though this serverless approach has effectively lowered the barrier to entry for building sophisticated AI services, in certain situations users might still decide to roll their own infrastructure, especially for latency-sensitive systems or those requiring custom hardware. That is completely fine, as there is no "best" solution for all use cases. That being said, Modal has undeniably pushed the conversation about what an "ideal" cloud platform should look like, at least for those developing AI applications, in a new direction. As Modal grows and proves its model, a wave of similar solutions will likely appear, prompting tighter integration of serverless AI capabilities into mainstream cloud offerings. At the very least, Modal's success hints that the landscape of AI infrastructure will shift to emphasize not only raw power but also ease of use.