Join the DZone community and get the full member experience.
Join For Free
In the early days of computing, applications handled tasks sequentially. As the scale grew with millions of users, this approach became impractical. Asynchronous processing allowed handling multiple tasks concurrently, but managing threads/processes on a single machine led to resource constraints and complexity.
This is where distributed parallel processing comes in. By spreading the workload across multiple machines, each dedicated to a portion of the task, it offers a scalable and efficient solution. If you have a function to process a large batch of files, you can divide the workload across multiple machines to process files concurrently instead of handling them sequentially on one machine. Additionally, it improves performance by leveraging combined resources and provides scalability and fault tolerance. As the demands increase, you can add more machines to increase available resources.
It is challenging to build and run distributed applications on scale, but there are several frameworks and tools to help you out. In this blog post, we'll examine one such open-source distributed computing framework: Ray. We'll also look at KubeRay, a Kubernetes operator that enables seamless Ray integration with Kubernetes clusters for distributed computing in cloud-native environments. But first, let's understand where distributed parallelism helps.
Where Does Distributed Parallel Processing Help?
Any task that benefits from splitting its workload across multiple machines can utilize distributed parallel processing. This approach is particularly useful for scenarios such as web crawling, large-scale data analytics, machine learning model training, real-time stream processing, genomic data analysis, and video rendering. By distributing tasks across multiple nodes, distributed parallel processing significantly enhances performance, reduces processing time, and optimizes resource utilization, making it essential for applications that require high throughput and rapid data handling.
When Distributed Parallel Processing Is Not Needed
How Ray Helps With Distributed Parallel Processing
Ray is a distributed parallel processing framework that encapsulates all the benefits of distributed computing and solutions to the challenges we discussed, such as fault tolerance, scalability, context management, communication, and so on. It is a Pythonic framework, allowing the use of existing libraries and systems to work with it. With Ray's help, a programmer doesn't need to handle the pieces of the parallel processing compute layer. Ray will take care of scheduling and autoscaling based on the specified resource requirements.
Ray provides a universal API of tasks, actors, and objects for building distributed applications.
(Image Source)
Ray provides a set of libraries built on the core primitives, i.e., Tasks, Actors, Objects, Drivers, and Jobs. These provide a versatile API to help build distributed applications. Let's take a look at the core primitives, a.k.a., Ray Core.
Ray Core Primitives
For information about primitives, you can go through the Ray Core documentation.
Ray Core Key Methods
Below are some of the key methods within Ray Core that are commonly used:
Here is an example of using most of the basic key methods:
How Does Ray Work?
Ray Cluster is like a team of computers that share the work of running a program. It consists of a head node and multiple worker nodes. The head node manages the cluster state and scheduling, while worker nodes execute tasks and manage actor
You can check out the Ray v2 Architecture doc for more detailed information.
Working with existing Python applications doesn't require a lot of changes. The changes required would mainly be around the function or class that needs to be distributed naturally. You can add a decorator and convert it into tasks or actors. Let's see an example of this.
To learn more about its concept, head over to Ray Core Key Concept docs.
Ray vs Traditional Approach of Distributed Parallel Processing
Below is a comparative analysis between the traditional (without Ray) approach vs Ray on Kubernetes to enable distributed parallel processing.
Kubernetes provides an ideal platform for running distributed applications like Ray due to its robust orchestration capabilities. Below are the key pointers that set the value on running Ray on Kubernetes:
KubeRay Operator makes it possible to run Ray on Kubernetes.
What Is KubeRay?
The KubeRay Operator simplifies managing Ray clusters on Kubernetes by automating tasks such as deployment, scaling, and maintenance. It uses Kubernetes Custom Resource Definitions (CRDs) to manage Ray-specific resources.
RayService allows you to deploy models on-demand in a Kubernetes environment. This can be particularly useful for applications like image generation or text extraction, where models are deployed only when needed.
Here is an example of stable diffusion. Once it is applied in Kubernetes, it will create RayCluster and also run a RayService, which will serve the model until you delete this resource. It allows users to take control of resources.
RayService serves different requirements to the user, where it keeps the model or application deployed until it is deleted manually. In contrast, RayJob allows one-time jobs for use cases like training a model, preprocessing data, or inference for a fixed number of given prompts.
Generally, we run our application in Deployments, which maintains the rolling updates without downtime. Similarly, in KubeRay, this can be achieved using RayService, which deploys the model or application and handles the rolling updates.
However, there could be cases where you just want to do batch inference instead of running the inference servers or applications for a long time. This is where you can leverage RayJob, which is similar to the Kubernetes Job resource.
Image Classification Batch Inference with Huggingface Vision Transformer is an example of RayJob, which does Batch Inferencing.
These are the use cases of KubeRay, enabling you to do more with the Kubernetes cluster. With the help of KubeRay, you can run mixed workloads on the same Kubernetes cluster and offload GPU-based workload scheduling to Ray.
Conclusion
Distributed parallel processing offers a scalable solution for handling large-scale, resource-intensive tasks. Ray simplifies the complexities of building distributed applications, while KubeRay integrates Ray with Kubernetes for seamless deployment and scaling. This combination enhances performance, scalability, and fault tolerance, making it ideal for web crawling, data analytics, and machine learning tasks. By leveraging Ray and KubeRay, you can efficiently manage distributed computing, meeting the demands of today's data-driven world with ease.
Not only that but as our compute resource types are changing from CPU to GPU-based, it becomes important to have efficient and scalable cloud infrastructure for all sorts of applications, whether it be AI or large data processing.
If you found this post informative and engaging. I'd love to hear your thoughts on this post, so do start a conversation on LinkedIn.