We increasingly rely on AI as assistants for coding, video production, and photography. While AI's capabilities in areas like code, image, and video generation are well-explored, we may be underestimating the potential of popular large language models (LLMs) to enhance the quality of the products we develop in other domains. Specifically, utilizing LLMs to model production data represents a significant, often overlooked, area of opportunity.
Let's first look at why we need to model production data at all. We are all familiar with production data: customer data such as names, addresses, emails, billing information, and so on. For any business, production data is a high-value asset, and a great deal of care goes into building the software that touches it.
A lot of the time, the environments that hold production data are change-controlled, meaning we are not at liberty to execute scripts and tooling without prior approval and authorization. These environments may contain personally identifiable information, which is protected by privacy regulations such as HIPAA and GDPR. No business can test its software and tooling directly on change-controlled or production environments, because doing so could negatively affect the customer experience.
A good alternative is to create a sandbox environment. With modern tooling and cloud providers like Azure, AWS, or GCP, sandbox environments can usually be created quickly. These sandboxes are typically the dev, test, or stage environments in the deployment pipeline. Although the new environment can closely simulate the production environment's configuration and settings, it will lack any data.
There are also cases where we need to run analytics and monitoring tooling against production data. By definition, production data is much larger than the data in our sandbox environment, so in this situation we need to model data at a volume that matches production.
Together, these scenarios show a need to model not only high-quality production-like data but also a large quantity of it.
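One workable pattern for the volume problem is to ask the LLM for a small, high-quality sample and then expand it to production-like scale with a script. The sketch below is a hypothetical illustration: the column names, value pools, and output filename are assumptions, not real data.

```python
import csv
import random
import uuid

# Illustrative value pools; in practice these could be sampled from
# the small seed file the LLM produced.
FIRST_NAMES = ["Aisha", "Ben", "Carla", "Deepak", "Elena"]
LAST_NAMES = ["Nguyen", "Okafor", "Patel", "Smith", "Torres"]

def generate_rows(count):
    """Yield `count` synthetic customer rows, each keyed by a fresh UUID."""
    for _ in range(count):
        first = random.choice(FIRST_NAMES)
        last = random.choice(LAST_NAMES)
        yield {
            "Patient_ID": str(uuid.uuid4()),
            "First_Name": first,
            "Last_Name": last,
            "Email": f"{first.lower()}.{last.lower()}@example.com",
        }

# Write a larger file; raise the count to match production scale.
with open("synthetic_customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["Patient_ID", "First_Name", "Last_Name", "Email"]
    )
    writer.writeheader()
    writer.writerows(generate_rows(10_000))
```

Because the generator is ordinary code, the row count is a single parameter rather than another round of prompting.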
Now that we have explored the need for modeling data, let's look at an example where we generate simulated customer data for a hospital. In the following exercise, I gave Gemini this prompt:
"Create an example csv file of data with 10 rows with basic customer information for a hospital"
Based on the previous prompt I received the following output:
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
As we can see in the example above, this data is a very good first draft of what we might need in a sandbox environment.
I then refined the first draft with follow-up prompts to remove the medical information and replace the patient ID with a UUID. The refined data is an example of a customer information table in a database:
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
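As an alternative to iterating through prompts, this refinement step can also be done deterministically with a short script. Here is a minimal sketch, assuming a first-draft row shaped like the original CSV header; all values below are made up for illustration.

```python
import uuid

# Columns to keep from the LLM's first draft; medical fields
# (Blood_Type, Last_Visit_Date, Insurance_Provider) are dropped.
KEEP = ["First_Name", "Last_Name", "DOB", "Gender", "Phone_Number", "Email"]

def refine(row):
    """Return a copy of `row` keyed by a fresh UUID, medical columns removed."""
    refined = {"Patient_ID": str(uuid.uuid4())}
    refined.update({col: row[col] for col in KEEP})
    return refined

# A single illustrative first-draft row (values are made up).
draft = {
    "Patient_ID": "P001", "First_Name": "Jane", "Last_Name": "Doe",
    "DOB": "1985-03-12", "Gender": "F", "Phone_Number": "555-0100",
    "Email": "jane.doe@example.com", "Blood_Type": "O+",
    "Last_Visit_Date": "2024-01-05", "Insurance_Provider": "Cigna",
}
print(refine(draft))
```

A deterministic pass like this is also easier to audit than a prompt when the goal is stripping sensitive-looking fields.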
Building on the previously generated data, how can we generate the additional data we need? We will now use the following prompt to generate a second set of data containing each patient's hospital visit information:
"Using the data in final_hospital_data.csv generate a single csv file that will contain their last medical visit information like patient_id, visit date, visit time, insurance provider, visit reason, doctor visited."
And we received the following output from the LLM:
Patient_ID,Visit_Date,Visit_Time,Insurance_Provider,Visit_Reason,Doctor_Visited
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,9:15,Humana,Hypertension Follow-up,Dr. Aris
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,8:45,BlueCross,Post-Op Review,Dr. Foster
f5d50c7b-e4ad-4763-af2a-9913abcc80ad,2023-06-04,10:30,Cigna,Hypertension Follow-up,Dr. Dhillon
a241c4c0-c071-49d2-976d-0f251a4e0abe,2023-07-18,8:15,BlueCross,Diabetes Consultation,Dr. Evans
Together, the generated files give us an example of a customer information table and a customer visit table for a hospital.
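Because the visit table references the customer table by Patient_ID, it is worth checking that the two generated files agree before loading them anywhere. A minimal referential-integrity check, sketched here with small inline samples standing in for the two CSV files:

```python
import csv
import io

# Small inline samples stand in for the two generated CSV files.
customers_csv = """Patient_ID,First_Name,Last_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,Jane,Doe
c7d84b81-6d39-4bc6-b695-b0854e2201da,John,Roe
"""

visits_csv = """Patient_ID,Visit_Date,Doctor_Visited
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,Dr. Aris
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,Dr. Foster
"""

def orphan_visits(customers, visits):
    """Return visit rows whose Patient_ID has no matching customer row."""
    known = {row["Patient_ID"] for row in csv.DictReader(io.StringIO(customers))}
    return [row for row in csv.DictReader(io.StringIO(visits))
            if row["Patient_ID"] not in known]

print(orphan_visits(customers_csv, visits_csv))  # [] means the join is consistent
```

LLMs occasionally drift on identifiers across separate generations, so a quick check like this catches broken joins before the data reaches a database.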
The collected data can now be readily integrated into either a SQL or NoSQL database. LLMs can be employed to generate the necessary insert queries for a SQL database, allowing this data to be fed into a local, development, or test environment. For the sake of clarity, I limited the example above to ten rows, but generating a larger dataset is straightforward.
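As an alternative to asking the LLM for the insert queries, a few lines of code can turn each CSV row into a parameterized INSERT statement. In the sketch below, the table name is a placeholder and the `%s` placeholder style is an assumption; placeholder syntax varies by database driver.

```python
import csv
import io

def to_insert(table, row):
    """Build a parameterized INSERT for one CSV row (placeholder style: %s)."""
    cols = ", ".join(row.keys())
    placeholders = ", ".join(["%s"] * len(row))
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", tuple(row.values())

# A small inline sample standing in for the generated CSV file.
sample = io.StringIO("Patient_ID,First_Name,Email\nabc-123,Jane,jane@example.com\n")
for row in csv.DictReader(sample):
    sql, params = to_insert("customers", row)
    print(sql, params)
```

Keeping the values as bound parameters rather than interpolating them into the SQL string avoids quoting problems in the generated data.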
Furthermore, it might be necessary to convert the data from its current CSV format into other structures, such as JSON or XML, to facilitate easier integration. While LLMs are capable of handling all these conversion tasks, we will omit them from this article to keep the focus narrow.
LLMs are a great tool for simulating data for our local or sandbox environments. By using this generated data, we can test our software more thoroughly and have more confidence in the solutions we build.