2 Sources
[1]
How AI support can go wrong in safety-critical settings
When it comes to adopting artificial intelligence in high-stakes settings like hospitals and airplanes, good AI performance and brief worker training on the technology are not sufficient to ensure systems will run smoothly and patients and passengers will be safe, a new study suggests. Instead, algorithms and the people who use them in the most safety-critical organizations must be evaluated simultaneously to get an accurate view of AI's effects on human decision making, researchers say. The team also contends these evaluations should assess how people respond to good, mediocre and poor technology performance, both to put the AI-human interaction to a meaningful test and to expose the level of risk linked to mistakes.

Participants in the study, led by engineering researchers at The Ohio State University, were 450 Ohio State nursing students, mostly undergraduates with varying amounts of clinical training, and 12 licensed nurses. They used AI-assisted technologies in a remote patient-monitoring scenario to determine how likely it was that urgent care would be needed in a range of patient cases.

Results showed that more accurate AI predictions about whether or not a patient was trending toward a medical emergency improved participant performance by between 50% and 60%. But when the algorithm produced an inaccurate prediction, even when accompanied by explanatory data that didn't support that outcome, human performance collapsed, with proper decision making degrading by more than 100% when the algorithm was the most wrong.

"An AI algorithm can never be perfect. So if you want an AI algorithm that's ready for safety-critical systems, that means something about the team, about the people and AI together, has to be able to cope with a poor-performing AI algorithm," said first author Dane Morey, a research scientist in the Department of Integrated Systems Engineering at Ohio State. "The point is this is not about making really good safety-critical system technology. It's the joint human-machine capabilities that matter in a safety-critical system."

Morey completed the study with Mike Rayo, associate professor, and David Woods, faculty emeritus, both in integrated systems engineering at Ohio State. The research was published recently in npj Digital Medicine.

The authors, all members of the Cognitive Systems Engineering Lab directed by Rayo, developed the Joint Activity Testing research program in 2020 to address what they see as a gap in responsible AI deployment in risky environments, especially medical and defense settings. The team is also refining a set of evidence-based guiding principles for machine design with joint activity in mind, principles meant to smooth the AI-human performance evaluation process and, after that, actually improve system outcomes. According to their preliminary list, a machine first and foremost should convey to people the ways in which it is misaligned to the world, even when it is unaware of that misalignment.

"Even if a technology does well on those heuristics, it probably still isn't quite ready," Rayo said. "We need to do some form of empirical evaluation because those are risk-mitigation steps, and our safety-critical industries deserve at least those two steps of measuring performance of people and AI together and examining a range of challenging cases."

The Cognitive Systems Engineering Lab has been running studies for five years on real technologies to arrive at best-practice evaluation methods, mostly on projects with 20 to 30 participants.
Having 462 participants in this project, especially a target population for AI-infused technologies whose study enrollment was connected to a course-based educational activity, gives the researchers high confidence in their findings and recommendations, Rayo said.

Each participant analyzed a sequence of 10 patient cases under differing experimental conditions: no AI help, an AI percentage prediction of imminent need for emergency care, AI annotations of data relevant to the patient's condition, and both AI predictions and annotations. All examples included a data visualization showing demographics, vital signs and lab results intended to help users anticipate changes to, or stability in, a patient's status. Participants were instructed to report their concern for each patient on a scale from 0 to 10; higher concern for emergency patients and lower concern for non-emergency patients were the indicators deemed to show better performance.

"We found neither the nurses nor the AI algorithm were universally superior to the other in all cases," the authors wrote. The analysis accounted for differences in participants' clinical experience.

While the overall results provided evidence that there is a need for this type of evaluation, the researchers said they were surprised that the explanations included in some experimental conditions had very little sway over participant concern; instead, the algorithm recommendation, presented in a solid red bar, overruled everything else. "Whatever effect that those annotations had was roundly overwhelmed by the presence of that indicator that swept everything else away," Rayo said.

The team considered the study methods, including custom-built technologies representative of health care applications currently in use, as a template for why their recommendations are needed and how industries could put the suggested practices in place. The coding data for the experimental technologies is publicly available, and Morey, Rayo and Woods further explained their work in an article published at AI-frontiers.org.

"What we're advocating for is a way to help people better understand the variety of effects that may come about from technologies," Morey said. "Basically, the goal is not the best AI performance. It's the best team performance."
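To make the performance indicator concrete (higher concern for emergency patients, lower concern for non-emergency patients), here is a minimal sketch of one plausible scoring rule. It is a hypothetical illustration written for this summary, not the metric the authors report in npj Digital Medicine; the function names, the 0-to-1 score scale, and the averaging over cases are all assumptions.

```python
# Hypothetical illustration only: scores a participant's 0-10 concern ratings so that
# high concern for a true emergency and low concern for a non-emergency count as better.
# This is NOT the study's published analysis, just a sketch of the idea in the article.

def case_score(concern: int, is_emergency: bool) -> float:
    """Return a 0-1 score for one case; 1.0 is the best possible rating."""
    if not 0 <= concern <= 10:
        raise ValueError("concern must be on the 0-10 scale described in the study")
    # Reward high concern when the case is an emergency, low concern otherwise.
    return concern / 10 if is_emergency else 1 - concern / 10

def participant_score(ratings: list[tuple[int, bool]]) -> float:
    """Average case score over a participant's (concern, is_emergency) ratings."""
    return sum(case_score(c, e) for c, e in ratings) / len(ratings)

# Toy example: strong concern for an emergency case, low concern for a stable case.
print(participant_score([(9, True), (1, False)]))  # -> 0.9
```

Under a rule like this, comparisons across the no-AI, prediction-only, annotation-only, and combined conditions would amount to comparing average scores between groups, which is the spirit of the joint human-machine evaluation the researchers advocate.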
[2]
How AI Support Can Go Wrong in Safety-Critical Settings | Newswise
A new study from Ohio State University highlights the complexities of integrating AI in high-stakes environments like hospitals and airplanes, emphasizing the need for joint human-AI evaluation to ensure safety and effectiveness.
A groundbreaking study led by engineering researchers at The Ohio State University has shed light on the complexities of integrating artificial intelligence (AI) in high-stakes settings such as hospitals and airplanes. The research, published in npj Digital Medicine, emphasizes that good AI performance and brief worker training are insufficient to ensure smooth system operation and safety in these critical environments [1].
The study's key finding is the necessity for simultaneous evaluation of algorithms and their human users in safety-critical organizations. This approach provides a more accurate view of AI's effects on human decision-making. Dane Morey, the study's first author, stressed, "It's the joint human-machine capabilities that matter in a safety-critical system" [2].
The research involved 462 participants, including 450 Ohio State nursing students and 12 licensed nurses. They used AI-assisted technologies in a remote patient-monitoring scenario to assess the likelihood of urgent care needs in various patient cases. This large sample size, particularly involving a target population for AI-infused technologies, lends high confidence to the findings [1].
Results revealed that accurate AI predictions improved participant performance by 50-60%. However, when the algorithm produced inaccurate predictions, human performance significantly degraded, with over 100% deterioration in proper decision-making when the AI was most incorrect. Surprisingly, explanatory data accompanying inaccurate predictions had little impact on participant decisions [2].
The research team, part of the Cognitive Systems Engineering Lab at Ohio State, is developing evidence-based guiding principles for machine design. These principles aim to improve the AI-human performance evaluation process and enhance system outcomes. A key recommendation is that machines should convey their misalignment with the world, even when unaware of such misalignment [1].
The study underscores the need for empirical evaluation and risk mitigation steps in safety-critical industries. The researchers advocate for measuring the performance of people and AI together and examining a range of challenging cases. They have made their coding data for experimental technologies publicly available to promote further research and implementation of their recommendations [2].
The research team emphasizes that the ultimate goal is not to achieve the best AI performance but to optimize team performance. This shift in focus highlights the importance of understanding the varied effects of technologies on human-AI interactions in critical settings [1].
Summarized by Navi