2 Sources
[1]
Apple tests if AI assistants can anticipate consequences of app use - 9to5Mac
As AI agents come closer to taking real actions on our behalf (messaging someone, buying something, toggling account settings, etc.), a new study co-authored by Apple examines how well these systems really understand the consequences of their actions. Here's what they found.

Presented recently at the ACM Conference on Intelligent User Interfaces in Italy, the paper "From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts" introduces a detailed framework for understanding what can happen when an AI agent interacts with a mobile UI. What makes this study interesting is that it doesn't just explore whether agents can tap the right button, but whether they can anticipate what may happen after they tap it, and whether they should proceed.

From the researchers: "While prior research has studied the mechanics of how AI agents might navigate UIs and understand UI structure, the effects of agents and their autonomous actions -- particularly those that may be risky or irreversible -- remain under-explored. In this work, we investigate the real-world impacts and consequences of mobile UI actions taken by AI agents."

The premise of the study is that most datasets for training UI agents today are composed of relatively harmless tasks: browsing a feed, opening an app, scrolling through options. So the study set out to go a few steps further.

Recruited participants were tasked with using real mobile apps and recording actions that would make them feel uncomfortable if triggered by an AI without their permission: things like sending messages, changing passwords, editing profile details, or making financial transactions. These actions were then labeled using a newly developed framework that considers not just the immediate impact on the interface, but also factors such as reversibility, whether the action affects other people, and whether it carries privacy or financial consequences.

The result is a framework that helps researchers evaluate whether models consider questions like "Can this be undone in one tap?", "Does it alert someone else?", and "Does it leave a trace?", and take that into account before acting on the user's behalf.

Once the dataset was built, the team ran it through five large language models, including GPT-4, Google Gemini, and Apple's own Ferret-UI, to see how well they could classify the impact of each action. The result? Google Gemini performed better in so-called zero-shot tests (56% accuracy), which measure how well an AI can handle tasks it wasn't explicitly trained on. Meanwhile, GPT-4's multimodal version led the pack (58% accuracy) in evaluating impact when prompted to reason step by step using chain-of-thought techniques.

As voice assistants and agents get better at following natural language commands ("Book me a flight," "Cancel that subscription," etc.), the real safety challenge is having an agent that knows when to ask for confirmation, or even when not to act at all. This study doesn't solve that yet, but it proposes a measurable benchmark for testing how well models understand the stakes of their actions.

And while there's plenty of research on alignment, the broader field of AI safety concerned with making sure agents do what humans actually want, Apple's research adds a new dimension: it asks how good AI agents are at anticipating the results of their actions, and what they do with that information before they act.
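The paper does not publish its prompts, but the zero-shot versus chain-of-thought comparison described above can be sketched in a few lines of Python. In the minimal sketch below, call_model, the IMPACT_LABELS set, and the prompt wording are illustrative assumptions rather than the study's actual setup; the chain-of-thought variant simply asks the model to walk through the reversibility, other-user, and trace questions before committing to a label.

# Minimal sketch (not the paper's actual prompts or label set): comparing a
# zero-shot prompt with a chain-of-thought prompt for classifying the impact
# of a single UI action. call_model is a placeholder for whatever LLM API is
# in use; the label names are illustrative assumptions.
from typing import Callable

IMPACT_LABELS = ["low", "medium", "high"]  # assumed coarse impact levels

def zero_shot_prompt(action: str, screen: str) -> str:
    return (
        f"A mobile UI agent is about to perform: '{action}' on the screen "
        f"'{screen}'.\nClassify the impact of this action as one of "
        f"{IMPACT_LABELS}. Answer with the label only."
    )

def chain_of_thought_prompt(action: str, screen: str) -> str:
    return (
        f"A mobile UI agent is about to perform: '{action}' on the screen "
        f"'{screen}'.\n"
        "Reason step by step before answering:\n"
        "1. Can this be undone in one tap, or is it irreversible?\n"
        "2. Does it alert or affect anyone other than the user?\n"
        "3. Does it leave a lasting trace (message sent, money moved, data changed)?\n"
        f"Then classify the impact as one of {IMPACT_LABELS} on the final line."
    )

def classify_impact(call_model: Callable[[str], str], action: str, screen: str,
                    use_cot: bool = True) -> str:
    """Ask the model for an impact label; default to 'high' if unparseable."""
    prompt = chain_of_thought_prompt(action, screen) if use_cot else zero_shot_prompt(action, screen)
    reply = call_model(prompt)
    last_line = reply.strip().lower().splitlines()[-1] if reply.strip() else ""
    for label in IMPACT_LABELS:
        if label in last_line:
            return label
    return "high"  # conservative default when the answer can't be parsed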
[2]
Apple researchers work to stop AI from taking actions you didn't approve
AI agents are learning to tap through your iPhone on your behalf, but Apple researchers want them to know when to pause. A recent paper from Apple and the University of Washington explores exactly that problem: training AI to understand the consequences of its actions on a smartphone.

Artificial intelligence agents are getting better at handling everyday tasks. These systems can navigate apps, fill in forms, make purchases, or change settings, often without needing our direct input. Autonomous actions will be part of the upcoming Big Siri Upgrade that may appear in 2026. Apple showed its idea of where it wants Siri to go during the WWDC 2024 keynote: the company wants Siri to perform tasks on your behalf, such as ordering tickets for an event online.

That kind of automation sounds convenient, but it also raises a serious question: what happens if an AI clicks "Delete Account" instead of "Log Out"? Mobile devices are personal. They hold our banking apps, health records, photos, and private messages. An AI agent acting on our behalf needs to know which actions are harmless and which could have lasting or risky consequences, and people need systems that know when to stop and ask for confirmation.

Most AI research has focused on getting agents to work at all: recognizing buttons, navigating screens, and following instructions. Less attention has gone to what those actions mean for the user after they are taken. Not all actions carry the same level of risk. Tapping "Refresh Feed" is low risk; tapping "Transfer Funds" is high risk.

The study started with workshops involving experts in AI safety and user interface design. The goal was to create a "taxonomy," a structured list of the different kinds of impacts a UI action can have. The team looked at questions like: Can the agent's action be undone? Does it affect only the user or others? Does it change privacy settings or cost money?

The paper shows how the researchers built a way to label any mobile app action along multiple dimensions. For example, deleting a message might be reversible within two minutes but not after that, while sending money is usually irreversible without outside help. The taxonomy matters because it gives AI a framework to reason about human intentions: a checklist of what could go wrong, or why an action might need extra confirmation.

The researchers gathered real-world examples by asking participants to record them in a simulated mobile environment. Instead of easy, low-stakes tasks like browsing or searching, they focused on high-stakes actions such as changing account passwords, sending messages, or updating payment details. The team combined the new data with existing datasets that mostly cover safe, routine interactions, then annotated all of it using their taxonomy.

Finally, they tested five large language models, including versions of OpenAI's GPT-4, to see whether the models could predict the impact level of an action or classify its properties. Adding the taxonomy to the AI's prompts helped, improving accuracy at judging when an action was risky. But even the best-performing model, GPT-4 Multimodal, only got it right around 58% of the time.

The study found that AI models often overestimated risk, flagging harmless actions as high risk, like clearing an empty calculator history. That kind of cautious bias might seem safer, but it can make AI assistants annoying or unhelpful if they constantly ask for confirmation when it is not needed. More worryingly (and unsurprisingly), the models struggled with nuanced judgments; they found it hard to decide when something was reversible or how it might affect another person.

Users want automation that is both helpful and safe. An AI agent that deletes an account without asking can be a disaster; an agent that refuses to change the volume without permission is useless. The researchers argue their taxonomy can help design better AI policies. For example, users could set their own preferences about when they want to be asked for approval. The approach supports transparency and customization, and it helps AI designers identify where current models fail, especially when handling real-world, high-stakes tasks.

Mobile UI automation will grow as AI becomes more integrated into our daily lives. Research shows that teaching AI to see buttons is not enough; it must also understand the human meaning behind the click. And that's a tall task for artificial intelligence. Human behavior is messy and context-dependent. Pretending that a machine can resolve that complexity without error is wishful thinking at best, negligence at worst.
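The user-set approval preferences the researchers describe could sit on top of taxonomy-style labels like the ones discussed above. The sketch below illustrates that idea only; the ActionImpact fields, the UserPolicy defaults, and the decision rule are assumptions chosen to mirror the dimensions mentioned in the paper (reversibility, who is affected, privacy, cost), not its implementation.

# Illustrative sketch only: user-configurable approval preferences layered on
# taxonomy-style impact labels. Field names and defaults are assumptions.
from dataclasses import dataclass

@dataclass
class ActionImpact:
    description: str        # e.g. "Tap 'Transfer Funds'"
    reversible: bool        # can the user easily undo it?
    affects_others: bool    # does it notify or change something for other people?
    costs_money: bool       # does it move money or commit to a purchase?
    changes_privacy: bool   # does it alter who can see the user's data?

@dataclass
class UserPolicy:
    confirm_irreversible: bool = True
    confirm_affects_others: bool = True
    confirm_money: bool = True
    confirm_privacy: bool = True

def needs_confirmation(impact: ActionImpact, policy: UserPolicy) -> bool:
    """Return True if the agent should pause and ask before acting."""
    return (
        (policy.confirm_irreversible and not impact.reversible)
        or (policy.confirm_affects_others and impact.affects_others)
        or (policy.confirm_money and impact.costs_money)
        or (policy.confirm_privacy and impact.changes_privacy)
    )

# Usage: refreshing a feed sails through, transferring funds gets flagged.
refresh = ActionImpact("Tap 'Refresh Feed'", True, False, False, False)
transfer = ActionImpact("Tap 'Transfer Funds'", False, True, True, False)
policy = UserPolicy()
assert not needs_confirmation(refresh, policy)
assert needs_confirmation(transfer, policy)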
Apple and University of Washington researchers investigate how well AI assistants can anticipate the consequences of their actions in mobile app interfaces, aiming to enhance safety and user trust in AI-driven interactions.
In a groundbreaking study, researchers from Apple and the University of Washington have delved into the critical question of how well AI agents understand the consequences of their actions when interacting with mobile user interfaces (UIs). The research, titled "From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts," was presented at the ACM Conference on Intelligent User Interfaces in Italy [1].
As AI assistants become more integrated into our daily lives, there's a growing concern about their ability to make informed decisions when performing tasks on our behalf. The study highlights the importance of AI agents not just recognizing UI elements, but also anticipating the potential outcomes of their actions [2].
The researchers created a detailed taxonomy to classify the impacts of mobile UI actions. This framework considers factors such as whether an action can be undone, whether it affects only the user or other people as well, whether it changes privacy settings, and whether it costs money.
This approach aims to provide AI with a structured way to reason about human intentions and potential risks associated with different actions [1].
To build a relevant dataset, the study recruited participants to record actions in real mobile apps that they would feel uncomfortable with an AI performing without permission. These included high-stakes actions like sending messages, changing passwords, and making financial transactions [1].
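To make the dataset description concrete, one annotated entry might look roughly like the record below. The schema and values are illustrative assumptions based on the taxonomy dimensions described in the sources, not the study's published format.

# Hypothetical example of what one annotated action in such a dataset might
# look like; the schema is an assumption, not the paper's actual format.
example_record = {
    "app": "Banking app",                      # where the action was recorded
    "action": "Tap 'Confirm transfer'",        # the UI operation itself
    "user_intent": "Send rent to landlord",    # why the participant did it
    "impact": {
        "reversible": False,                   # money usually can't be clawed back
        "affects_others": True,                # the recipient is notified and paid
        "financial_cost": True,
        "privacy_change": False,
        "impact_level": "high",                # overall judgment used as the label
    },
}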
The researchers then tested five large language models, including GPT-4, Google Gemini, and Apple's Ferret-UI, to evaluate their ability to classify the impact of various actions. The results showed that while AI models have made progress, there's still significant room for improvement: Google Gemini reached 56% accuracy in zero-shot tests, while GPT-4's multimodal version led with 58% accuracy when prompted to reason step by step using chain-of-thought techniques.
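The reported 56% and 58% figures are simply classification accuracy over impact labels like the ones in the example record above. A minimal sketch of that evaluation loop, assuming records shaped like that example and some predict_impact function standing in for a model call, might look like this:

# Minimal evaluation sketch: accuracy of predicted impact levels against the
# annotated labels. predict_impact is a stand-in for any model call that
# returns an impact level; the record shape is assumed, not the paper's.
from typing import Callable, Iterable

def impact_accuracy(records: Iterable[dict],
                    predict_impact: Callable[[dict], str]) -> float:
    """Fraction of records where the predicted impact level matches the label."""
    records = list(records)
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if predict_impact(r) == r["impact"]["impact_level"]
    )
    return correct / len(records)

# A model scoring about 0.58 here corresponds to the roughly 58% accuracy
# reported for GPT-4's multimodal variant in the study.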
The study revealed that current AI models often struggle with nuanced judgments and tend to overestimate risks. This cautious approach, while potentially safer, could lead to frustrating user experiences if AI assistants constantly seek confirmation for low-risk actions [2].
This research is particularly relevant as companies like Apple plan to expand AI capabilities in virtual assistants. The upcoming "Big Siri Upgrade," potentially slated for 2026, aims to enable Siri to perform more complex tasks autonomously [2].
The findings underscore the importance of developing AI systems that can anticipate the consequences of their actions, distinguish low-risk from high-risk operations, and know when to pause and ask the user for confirmation before acting.
By addressing these challenges, researchers hope to create AI assistants that are not only more capable but also more trustworthy and aligned with user intentions.