As data engineers, we've all encountered those recurring requests from business stakeholders: "Can you summarize all this text into something executives can read quickly?", "Can we translate customer reviews into English so everyone can analyze them?", or "Can we measure customer sentiment at scale without building a new pipeline?". Traditionally, delivering these capabilities required a lot of heavy lifting. You'd have to export raw data from the warehouse into a Python notebook, clean and preprocess it, connect to an external NLP API or host your own machine learning model, handle retries, manage costs, and then write another job to push the results back into a Delta table. The process was brittle, required multiple moving parts, and -- most importantly -- took the analysis out of the governed environment, creating compliance and reproducibility risks.
With the introduction of AI functions in Databricks SQL, that complexity is abstracted away. Summarization, translation, sentiment detection, document parsing, masking, and even semantic search can now be expressed in one-line SQL functions, running directly against governed data. There's no need for additional infrastructure, no external services to maintain, and no custom ML deployments to babysit. Just SQL, governed and scalable, inside the Lakehouse.
In this article, I will walk you through five such functions using the familiar Bakehouse sample dataset. We will see how tasks that once demanded custom pipelines and weeks of engineering effort are now reduced to simple queries, transforming AI from a specialized project into an everyday tool for data engineers.
1. Summarization With ai_summarize()
If you wanted to summarize Bakehouse customer reviews in the past, the workflow was anything but simple. Reviews are often long, unstructured, and written in free-form text -- which means they contain everything from slang and typos to emojis, mixed languages, and incomplete sentences. Extracting the raw reviews from Delta tables was only the beginning. The real challenge was making that text usable for downstream analysis.
First, you had to clean and normalize the data: removing non-standard characters, fixing casing inconsistencies, stripping out emojis or special symbols, and sometimes even detecting and filtering different languages. Only after preprocessing could you feed the cleaned text into a Python-based summarization model (like Pegasus, BART, or T5). Running those models at scale introduced its own operational overhead: managing GPUs, batching requests, handling long input sequences, and storing the generated summaries back into Delta tables. Finally, you had to write additional logic to extract useful signals -- often reducing a verbose, messy review into a short two-sentence takeaway. The entire pipeline was brittle, resource-intensive, and required constant maintenance.
With the new ai_summarize() function in Databricks SQL, this entire process collapses into a single line of code. You simply pass the raw review text into the function, and it returns a concise summary directly as part of your query results. No separate preprocessing, no external APIs, no ML pipeline maintenance -- just SQL. The function is smart enough to handle free-form text, cut through the noise, and surface the main point of a customer's feedback.
Before we look at the summaries themselves, let's first explore the raw complexity of the review text in the Bakehouse dataset with a simple query:
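A minimal sketch of such a query follows. I'm assuming the Bakehouse sample data lives in the samples.bakehouse catalog with a media_customer_reviews table containing a free-text review column; adjust the table and column names to match your workspace.

```sql
-- Inspect a handful of raw reviews to see how messy the free-form text is
SELECT review
FROM samples.bakehouse.media_customer_reviews
LIMIT 5;
```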
Now, let's use ai_summarize() to summarize the reviews:
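A hedged example, again assuming the samples.bakehouse.media_customer_reviews table and its review column; the optional second argument to ai_summarize() caps the summary length in words:

```sql
-- One-line summarization: no preprocessing, no external model to host
SELECT
  review,
  ai_summarize(review, 20) AS review_summary  -- keep summaries to ~20 words
FROM samples.bakehouse.media_customer_reviews
LIMIT 5;
```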
2. Translation With ai_translate()
Consider this scenario: the Bakehouse management team in Japan wants to analyze customer reviews, but most of the feedback is stored in English. For the Japan team, reading reviews in English creates a barrier; not only does it slow down analysis, but it also introduces the risk of misinterpretation or missed cultural nuances. As data engineers, we've all dealt with these kinds of requests: "Can you make this dataset available in our local language?"
Traditionally, this meant exporting the reviews out of Delta tables, wiring them into a third-party translation API, managing authentication and quotas, handling errors and retries, and then loading the translated text back into the warehouse. This was a multi-step process that required maintaining fragile ETL pipelines and often raised compliance questions since sensitive customer data had to leave the governed environment.
With ai_translate(), this entire workflow collapses into a single SQL query. The function takes raw review text as input and outputs the same content in the target language, in this case Japanese. For the Bakehouse dataset, this means the Japan team can instantly access reviews in their local language without any additional infrastructure.
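The query below is a sketch under the same table and column assumptions as before; ai_translate() takes the text and a target language code ('ja' for Japanese here):

```sql
-- Translate each review into Japanese for the local team
SELECT
  review,
  ai_translate(review, 'ja') AS review_ja
FROM samples.bakehouse.media_customer_reviews
LIMIT 5;
```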
3. Sentiment Analysis With ai_analyze_sentiment()
Traditionally, data engineers or data scientists had to build or fine-tune a sentiment analysis model in Python, often starting with frameworks like TensorFlow, PyTorch, or Hugging Face. The process involved collecting labeled data, training or fine-tuning a classifier, validating the model, and then packaging it into a deployable service. Once deployed, the service had to be hosted on a GPU or CPU endpoint, monitored for uptime, and maintained with scaling logic for production loads. On top of that, engineers had to write pipeline jobs to send raw review text to the endpoint, collect the predictions, and store results back into Delta tables. All this work just to answer a seemingly simple question: "Are our customers happy or not?"
With Databricks' ai_analyze_sentiment() function, that entire workflow is reduced to a single line of SQL. There's no need to train models, deploy endpoints, or manage infrastructure. You can feed raw review text directly into the function, and it automatically returns a sentiment label such as positive, negative, or neutral.
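A minimal sketch, again assuming the samples.bakehouse.media_customer_reviews table with a review column:

```sql
-- Classify each review's sentiment with a single function call
SELECT
  review,
  ai_analyze_sentiment(review) AS sentiment
FROM samples.bakehouse.media_customer_reviews
LIMIT 5;
```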
4. Mask PII Data With ai_mask()
Protecting personally identifiable information (PII) is one of the most challenging tasks for data engineers. I have previously written a detailed DZone article on building scalable data pipelines with security. The ai_mask() function automatically detects and masks PII based on the entity labels you pass as input. With it, the Bakehouse analytics team can safely analyze reviews without exposing sensitive customer data, all directly in SQL, with no custom regex required.
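A hedged example under the same table and column assumptions; ai_mask() takes the text plus an array of entity labels to redact, and the labels shown here (person, email, phone) are illustrative:

```sql
-- Mask person names, email addresses, and phone numbers in review text
SELECT
  ai_mask(review, ARRAY('person', 'email', 'phone')) AS masked_review
FROM samples.bakehouse.media_customer_reviews
LIMIT 5;
```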
Conclusion
The examples we explored, ai_summarize(), ai_translate(), ai_analyze_sentiment(), and ai_mask(), show how AI functions simplify workloads for engineering teams and enable quick analytics. Tasks that once required complex pipelines, custom Python scripts, or external APIs are now reduced to simple, one-line SQL functions. These AI SQL functions are in Public Preview at the time of writing and may evolve as Databricks expands their capabilities.