Intelligent documentation support systems will take over many routine clinical documentation tasks
Many documentation tasks can in principle be automated, either with present-day technologies or with emerging AI methods. Digital scribes are intelligent documentation support systems. This emerging class of technology, still loosely defined, is designed to support humans in documentation tasks (Fig. 1). Such systems are well known in other sectors, such as the software industry, where they have been used in some form for over 30 years to assist with software documentation, but they remain in their infancy in healthcare. There is a continuum of possible levels of such automation, commencing with humans carrying out all critical functions and ending with tasks being entirely delegated to technology. In the middle, humans and computers work in tandem, each carrying out the tasks best suited to their capabilities.
As the technologies needed to support rich human-computer interaction, such as SR and summarisation, mature, we will likely see clinical documentation support evolve through three broad stages, each characterised by increasingly autonomous functional capabilities (Table 1):
In today's state-of-the-art systems, humans are still the ones tasked with creating clinical documentation, but they are provided with tools to make the task simpler or more effective. Dictation technologies are widely used to support documentation in settings such as radiology, where letters and reports are a major element of the workflow. SR technologies can create verbatim transcripts of human speech, or can be used to invoke templates and standard paragraphs to simplify the burden of data entry. SR appears to be beneficial for transcription tasks, reducing report turnaround time, but compared with human transcriptionists it has a higher error rate, and documents take longer to edit.
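To make the template mechanism concrete, the sketch below maps recognised voice commands onto standard paragraphs. It is purely illustrative: the command phrases and boilerplate text are hypothetical, and a production system would sit downstream of a real SR engine rather than operate on a ready-made transcript.

```python
# Minimal sketch of SR-driven template insertion. The command phrases
# and standard paragraphs below are hypothetical examples.
STANDARD_PARAGRAPHS = {
    "insert normal chest exam":
        "Chest: clear to auscultation bilaterally, no wheezes or crackles.",
    "insert normal abdominal exam":
        "Abdomen: soft, non-tender, non-distended, normal bowel sounds.",
}

def expand_dictation(transcript: str) -> str:
    """Replace recognised voice commands in a transcript with their
    standard paragraphs, leaving other dictated text verbatim."""
    text = transcript
    for command, paragraph in STANDARD_PARAGRAPHS.items():
        text = text.replace(command, paragraph)
    return text

print(expand_dictation(
    "Patient comfortable at rest. insert normal chest exam Plan: review in two weeks."
))
```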
SR is also now increasingly used not just for note dictation, but as the primary mechanism for interacting with the EHR. SR can be used to navigate within the EHR, select items in drop-down lists, and enter text into different fields of the record. Despite the wide availability of such technology, it has received surprisingly little formal evaluation. The evaluations that have been performed suggest that SR leads to higher error rates than keyboard and mouse (KBM), and significantly increases documentation times. This suggests that, while SR is useful for dictating long notes, it may not be an ideal mechanism for interacting with complex environments like the EHR.
This may simply be because SR has to date been "bolted on" to systems primarily designed with KBM in mind. There also appear to be fundamental cognitive limits to the use of SR. Speech processing requires access to human short-term and working memory, which are also required for problem solving. In contrast, when hand-eye coordination is used for pointing and clicking, more cognitive resources remain available for problem solving and recall. This means that experienced keyboard users have a greater capacity to problem solve in parallel while performing data entry than those using SR.
Some EHRs now incorporate decision support that automatically proofs dictated text for obvious linguistic and clinical errors, offers next-word auto-completion, or suggests items commonly associated with information already entered, for example proposing additional investigations or diagnoses consistent with a note's content. Assistive features can also predict the likely content of notes, from patient attributes such as gender and age through to overall note structure, using inferred templates.
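As a rough illustration of how next-word suggestion can work, the sketch below builds a toy bigram model over an invented corpus of prior notes and proposes the most frequent continuations; real systems use far richer language models and clinical context.

```python
from collections import Counter, defaultdict

# Toy bigram model for next-word suggestion. The training notes are
# invented; a real system would learn from a large corpus of prior notes.
prior_notes = [
    "chest pain radiating to left arm",
    "chest pain on exertion",
    "chest clear to auscultation",
]

bigrams = defaultdict(Counter)
for note in prior_notes:
    words = note.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def suggest_next(prev_word: str, k: int = 3) -> list:
    """Return the k words most frequently observed after prev_word."""
    return [word for word, _ in bigrams[prev_word].most_common(k)]

print(suggest_next("chest"))  # ['pain', 'clear'] on this toy corpus
```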
Whilst in principle such prompting can be helpful, it needs to arrive at a point in the clinical process where it can be of value. Suggesting tests or diagnoses at the end of a consultation when an assessment is complete and tests have been ordered might just be too late to make a clinical difference.
This emerging class of documentation support models itself more on human scribes and is delegated part of the documentation task. Human and computer each take the initiative for some parts of the process, and the record emerges out of the partnership.
Automated documentation systems in this class of digital scribe must automatically detect speech within the clinical encounter and use advanced SR to translate the discussions and data associated with the encounter into a formal record. Clinicians might interact with digital scribes using voice commands or hand gestures (much as we do with home assistant systems), or may use augmented reality technologies, such as smart glasses. Documentation context, stage or content can be signalled by human interaction with the documentation system using predefined gestures, commands or conversational structures.
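One way to picture such a system is as a pipeline from raw encounter audio to a draft note. The sketch below names the usual stages (speech detection, speaker diarisation, SR, summarisation), but every function body here is a placeholder standing in for a real component, not a working implementation.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str  # e.g. "clinician" or "patient"
    text: str     # recognised speech

# Each stage below is a stub standing in for a real component
# (voice activity detection, speaker diarisation, SR, summarisation).
def detect_speech(audio: bytes) -> list:
    """Split raw encounter audio into speech segments (stub)."""
    return [audio]

def diarise_and_recognise(segment: bytes) -> Utterance:
    """Attribute a segment to a speaker and transcribe it (stub)."""
    return Utterance(speaker="patient", text="the pain started two days ago")

def summarise(utterances: list) -> str:
    """Reduce the conversation to note-ready content (stub)."""
    return "; ".join(u.text for u in utterances)

def digital_scribe(audio: bytes) -> str:
    """End-to-end pipeline: encounter audio in, draft note text out."""
    utterances = [diarise_and_recognise(s) for s in detect_speech(audio)]
    return summarise(utterances)

print(digital_scribe(b"\x00\x01"))  # placeholder audio bytes
```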
Key to understanding the technical leap required to develop such systems is the distinction between present-day transcription systems and still-emergent summarisation technologies. Today's speech systems are designed to detect and then literally transcribe each word that is spoken. Automated documentation systems are also tasked with recognising speech, but must then create a summary or précis of its content, suitable for documentation. By analogy, a transcriber is like a gene sequencer, literally creating a read of all the 'words' in a DNA sequence, without addition or deletion. A summariser, in contrast, must identify only what is salient to the encounter, like sifting junk DNA from coding sequences. It must then communicate its meaning, just as we are ultimately interested in the functional role of a gene rather than its constituent base pairs. How that might best happen is still the subject of research.
Text summarisation methods are traditionally broken down into extractive methods, which identify and extract salient phrases and sentences unchanged, and abstractive methods, which produce a shortened reinterpretation of text based on inference about its meaning. When a summary is generated from human speech instead of a set of documents, additional tasks emerge, such as speaker identification and SR, alongside more classic natural language processing tasks. These include mapping recognised words and phrases to a common language reference model, and the use of hybrid methods, such as rules that populate pre-defined templates, e.g. for well-defined sections of a clinical note such as medications or allergies. Deep learning methods can be used in tandem with such approaches, or on their own. Once a machine-readable summary is created, methods for the automated generation of text from such structured representations can create a human-readable version of the information. Whilst much effort is currently focussed on automating the summarisation process, it should not be forgotten that humans are a ready source of context cues. Many difficult problems in natural language processing may be solved by good human-computer interaction design.
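To ground the extractive case, the sketch below scores sentences by simple word frequency and keeps the top-scoring ones verbatim. The transcript is invented, and the method is deliberately simplistic; production systems would add stop-word handling, clinical concept mapping and much stronger salience models.

```python
from collections import Counter

def extractive_summary(sentences: list, k: int = 2) -> list:
    """Score each sentence by the total corpus frequency of its words
    and return the k highest-scoring sentences, unchanged and in order.
    Deliberately naive: no stop-word removal or length normalisation."""
    freqs = Counter(w for s in sentences for w in s.lower().split())
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freqs[w] for w in sentences[i].lower().split()),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]

transcript = [
    "I've had chest pain for two days.",
    "The pain gets worse when I climb stairs.",
    "The weather has been terrible lately.",
    "The pain sometimes spreads to my left arm.",
]
# Keeps the two pain-related sentences; the small talk is dropped.
print(extractive_summary(transcript))
```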
A third class, the "autopilot" digital scribe, will emerge when computers can lead in the documentation process. Human interaction would only occur to assist the machine in resolving specific ambiguities in the clinical encounter, perhaps to clarify goals and intentions, request missing details, or resolve contradictions. For highly structured and well-bounded encounters, for example routine clinic visits to monitor patients for chronic illness or post-operative recovery, the entire documentation process might be delegated to automation, and humans only invoked when exceptions to the expected process occur.
Achieving this class of documentation system will require major advances in AI, as well as much experience with the less autonomous versions of digital scribes described earlier. Not only will autonomous documentation systems need to be expert in the form and content of clinical encounters and the encounter record; they will also need to exploit rich models of the knowledge base underpinning specific clinical domains.