Figure 3 shows students' average level of agreement with four statements about their perceptions of learning, broken down by group (in-class active learning vs. AI tutor). Students rated their level of agreement on a 5-point Likert scale, with 1 representing "strongly disagree" and 5 representing "strongly agree." In responding to the first statement, relating to engagement, students in the AI group agreed more strongly (Mean = 4.1, SD = 0.98) than those in the in-class active learning group (Mean = 3.6, SD = 0.92), t(311) = -4.5, p < 0.0001. Likewise, in responding to the second statement, relating to motivation, students in the AI group agreed more strongly (Mean = 3.4, SD = 1.0) than those in the in-class active learning group (Mean = 3.1, SD = 0.86), t(311) = -3.4, p < 0.001. Students' average levels of agreement with the remaining two statements (relating to enjoyment and growth mindset) did not differ significantly between the two groups. To summarize, Fig. 3 shows that, on average, students in the AI group felt significantly more engaged and more motivated during the AI class session than students in the in-class active learning group, while the degree to which both groups enjoyed the lesson and reported a growth mindset was comparable.
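For readers who wish to reproduce this style of comparison, the following is a minimal sketch of such an independent-samples t-test in Python with SciPy. The ratings arrays are hypothetical placeholders, not our data; we assume a standard two-sample Student's t-test, for which the reported df of 311 would correspond to 313 total responses.

```python
# Sketch of the group comparison reported above. The Likert ratings here
# are hypothetical placeholders, NOT the study data.
import numpy as np
from scipy import stats

in_class = np.array([4, 3, 4, 3, 3, 4, 2, 4])  # hypothetical 1-5 ratings
ai_group = np.array([5, 4, 4, 5, 3, 4, 5, 4])  # hypothetical 1-5 ratings

# Standard two-sample Student's t-test (equal_var=True), for which
# df = n1 + n2 - 2. With the in-class group as the first argument, a
# higher AI-group mean yields a negative t, matching the sign convention
# used in the text.
t_stat, p_value = stats.ttest_ind(in_class, ai_group, equal_var=True)

print(f"Mean (AI) = {ai_group.mean():.2f}, SD = {ai_group.std(ddof=1):.2f}")
print(f"Mean (in-class) = {in_class.mean():.2f}, SD = {in_class.std(ddof=1):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```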
We have found that when students interact with our AI tutor at home, on their own, they learn significantly more than when they engage with the same content during an in-class active learning lesson, while spending less time on task. This finding underscores the transformative potential of AI tutors in authentic educational settings. To realize this potential for improving STEM outcomes, however, student-AI interactions must be carefully designed to follow research-based best practices.
The extensive pedagogical literature supports a set of best practices that foster students' learning, applicable to both human instructors and digital learning platforms. Key practices include (i) facilitating active learning, (ii) managing cognitive load, (iii) promoting a growth mindset, (iv) scaffolding content, (v) ensuring accuracy of information and feedback, (vi) delivering such feedback and information in a targeted and timely fashion, and (vii) allowing for self-pacing. We aimed to design an AI system that conforms to these practices to the fullest extent current technology allows, thus establishing a model for future educational AI applications.
A subset of the best practices (i-iii) was incorporated into the AI pedagogy through careful engineering of the AI tutor's system prompt, whose guidelines (Supplementary Material 1) are designed to facilitate active engagement, manage cognitive load, and promote a growth mindset. However, we found that a system prompt alone could not reliably provide enough structure to scaffold problems with multiple parts (iv): the AI tutor would occasionally discuss parts out of sequence or parts that were not immediately relevant. For this reason, the AI platform was designed to guide students sequentially through each part of each problem in the lesson, mirroring the approach taken by the instructor during the in-class active learning lesson (see screenshot of the AI tutor platform in Figure S1).
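To make this two-layer design concrete, the sketch below illustrates one way such platform-level scaffolding could be implemented. It is not our production code: the guideline text, lesson content, and function names are hypothetical, and we assume access to GPT-4 via the OpenAI Python SDK.

```python
# Hypothetical sketch of platform-side scaffolding (iv): the lesson is an
# ordered list of problem parts, and the tutor only ever sees the part the
# student is currently working on, so it cannot jump ahead or digress.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

PEDAGOGY_GUIDELINES = (
    "You are a tutor. Ask guiding questions rather than giving answers away, "  # active learning (i)
    "introduce one idea at a time, "                                            # cognitive load (ii)
    "and praise effort and strategy, not innate ability."                       # growth mindset (iii)
)

lesson_parts = [  # hypothetical lesson content
    "Part (a): Draw the free-body diagram for the block on the incline.",
    "Part (b): Write Newton's second law along the incline.",
]

def tutor_reply(part_index: int, student_message: str) -> str:
    """Answer within the scope of the current problem part only."""
    system_prompt = (
        f"{PEDAGOGY_GUIDELINES}\n\n"
        "The student is working ONLY on the following part. Do not discuss "
        f"other parts of the problem.\n\n{lesson_parts[part_index]}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": student_message},
        ],
    )
    return response.choices[0].message.content
```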
The occurrence of inaccurate "hallucinations" by the current generation of large language models (LLMs) poses a significant challenge for their use in education. Thus, we avoided relying solely on GPT-4 to generate solutions for these activities. Given that LLMs proceed by next-token prediction, their accuracy on complex math or science problems is enhanced when the system generates, or is provided with, detailed step-by-step solutions. Therefore, we enriched our prompts with comprehensive, step-by-step answers, guiding the AI tutor to deliver accurate, high-quality explanations (v) to students. As a result, 83% of students reported that the AI tutor's explanations were as good as, or better than, those from human instructors in the class.
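Continuing in the same hypothetical vein, the sketch below shows how an instructor-written, step-by-step solution might be folded into the system prompt so that the tutor checks student work against a known-correct reference rather than deriving the answer itself. The solution text and helper function are illustrative, not the actual prompts from Supplementary Material 1.

```python
# Hypothetical sketch of best practice (v): grounding the tutor's feedback
# in an instructor-written, step-by-step solution rather than letting the
# model derive the answer on its own.
INSTRUCTOR_SOLUTION = """\
Step 1: Identify the forces on the block: gravity, normal force, friction.
Step 2: Apply Newton's second law along the incline: ma = mg sin(theta) - f.
Step 3: Substitute f = mu N and N = mg cos(theta) to solve for a.
"""  # illustrative content, not an actual solution from our lessons

def solution_grounded_prompt(guidelines: str, part_text: str, solution: str) -> str:
    """Build a system prompt that anchors feedback to a known-correct solution."""
    return (
        f"{guidelines}\n\n"
        f"Current problem part:\n{part_text}\n\n"
        "A correct, instructor-written step-by-step solution follows. "
        "Check the student's reasoning against it and never contradict it:\n"
        f"{solution}"
    )
```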
While best practices (i-v) can be readily adhered to in a classroom setting, the remaining two (vi-vii) cannot: feedback that is timely and targeted to the specific needs of individual students (vi), and self-pacing (vii), are difficult to achieve and impossible to maintain in a typical classroom. We believe that the increased learning from structured AI tutoring is largely due to its ability to offer personalized feedback on demand (vi) -- just as one-on-one tutoring from a (human) expert is superior to classroom instruction. In addition, interactions with the AI tutor are self-paced (vii), as indicated by the distribution of times in Fig. 2. Students who need more time to build conceptual understanding or to fill gaps in their knowledge can take that time, instead of having to follow the pace of the in-class lesson synchronously. Conversely, students who are already familiar with the material or have strong underlying skills can move through the activities in less time than the in-class lesson requires. We measured students' perception of pace during the control condition (in-class active learning) on the days the experiment took place. Notably, the 3.8% of students who found the pace of class "too fast" all spent more than the median time (49 minutes) on the AI lesson, while the 2.2% who found the pace of the in-class lesson "too slow" all spent less than the median time on the AI lesson.
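This pace analysis reduces to a simple cross-tabulation; the sketch below shows how it might be computed with pandas, with hypothetical column names and entries standing in for the survey and timing data.

```python
# Hypothetical sketch of the pace analysis: cross-tabulating perceived
# in-class pace against time spent on the AI lesson relative to the median.
# Column names and entries are illustrative, not the study data.
import pandas as pd

df = pd.DataFrame({
    "pace_perception": ["too fast", "about right", "too slow", "about right"],
    "ai_minutes": [68.0, 47.5, 31.0, 52.0],
})

median_minutes = df["ai_minutes"].median()  # 49 minutes in the study
df["above_median_time"] = df["ai_minutes"] > median_minutes
print(pd.crosstab(df["pace_perception"], df["above_median_time"]))
```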
Our results contrast with previous studies that have shown limitations of AI-powered instruction. Krupp et al. (2023) observed limited reflection among students using ChatGPT without guidance, while Forero (2023) reported a decline in student performance when AI interactions lacked structure and did not encourage critical thinking. These previous approaches did not adhere to the same research-based best practices that informed our approach. Our success suggests that thoughtful implementation of AI-based tutoring could lead to significant improvements to current pedagogy and enhanced learning gains for a broad range of subjects in a format that is accessible in any environment with an internet connection.
How might an AI tutoring system, such as the one we have deployed, integrate into current pedagogical best practices, given its effectiveness in terms of learning gains and student perceptions?
Existing pedagogies often fail to meet students' individual needs, especially in classrooms where students have a wide range of prior knowledge. Here, we have shown the advantage of using asynchronous AI tutoring as students' first substantial engagement with challenging material. AI could be used to effectively teach introductory material to students before class, which allows precious class time to be spent developing higher-order skills such as advanced problem solving, project-based learning, and group work. Instructors can assess these skills in person, which avoids the problematic use of AI as a shortcut on assessments such as homework, papers, and projects. As in a "flipped classroom" approach, an AI tutor should not replace in-person teaching -- rather, it should be used to bring all students up to a level where they can achieve the maximum benefit from their time in class.
That said, beyond the initial introduction of material, AI tutors like the one employed here could serve an extremely wide range of purposes, such as assisting with homework, offering study guidance, and providing remedial lessons for underprepared students. Yet our results show that, with today's GAI technology, pedagogical best practices must be explicitly and carefully built into each such application. As seen in previous studies, instructors should avoid using AI in situations where students are likely to use it as a crutch to circumvent critical thinking. We also caution against the notion that AI should entirely supplant in-class instruction simply because of its efficacy in enhancing teaching and learning. Our demonstration illustrates how AI can bolster student learning beyond the confines of the classroom; we advocate harnessing this capability to free instructors to use in-class sessions for activities and projects that foster advanced cognitive skills such as critical thinking and content synthesis.
Our AI tutoring approach was applied in a setting where students were engaging substantially with material in particular subject areas for the first time. Our lessons comprised activities focused on learning objectives categorized at the understanding, applying, and analyzing levels of Bloom's Taxonomy -- as were the associated pre- and post-test questions. This stage of learning, characterized by a meaningful degree of information delivery, appears to be particularly well suited to current generative AI tutors. The significant gains and positive affect observed in this study may also depend on several factors: a heterogeneous student population requiring varying instructional paces; integration of high-quality instructional videos; a large language model capable of closely following complex prompts (e.g., GPT-4); expert-crafted, question-specific prompts written by instructors experienced with the content; a carefully structured framework designed to scaffold and guide student interactions; and content that lends itself to such a format. While the advantages of the experimental condition are widely generalizable and our findings have broad implications, we do not presume that structured AI tutoring will outperform in-class active learning in all contexts -- for example, those requiring complex synthesis of multiple concepts and higher-order critical thinking.
Compelling directions for future work include exploring other contexts throughout the learning process where AI tutoring may be successfully implemented, such as homework, recitations, exam preparation, pre-class assignments, and laboratory work. Valuable follow-up studies could also explicitly examine the details of such combinations over an entire course. This would also allow for systematic integration of well-established retention-enhancing strategies (e.g., spacing) and could provide insights into other novel phenomena that may arise from prolonged and varied use of AI in education, such as potential impacts on collaboration skills. Given that the current AI tutor implementation mirrors well-established in-class active learning pedagogies and generates comparable affect -- its primary difference, besides personalization, being the medium of delivery, which typically does not impact learning on its own -- it is reasonable to expect findings from in-class active learning approaches to carry over. Nonetheless, studies that explicitly replicate known in-class active learning results would be valuable for confirming and refining the details of this transferability. Such research could include exploration of the qualities that constitute effective system prompts and behaviors for AI tutors in various situations (e.g., determining when the AI tutor should openly provide answers versus guiding students to reflect on their own responses).
Generative AI technology is developing very rapidly, allowing for expansion of the capabilities and applications of AI tutoring. While the accuracy of our AI tutor relied on pre-written solutions, as generative AI models improve in scientific reasoning, future studies could explore whether such efficacy can be achieved without a provided solution. Moreover, in our approach feedback was provided in response to student input, but multimodal models would allow AI systems to interpret images (or audio) of a student's work and provide feedback more proactively. Investigations could explore whether such holistic monitoring of a student's process could address issues in thinking that are not reached by current pedagogies (in or out of the classroom), in which students typically receive targeted feedback only when they ask a question.
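As a purely exploratory illustration of this direction (not something evaluated in this study), the sketch below shows how an image of a student's written work might be submitted to a vision-capable chat model for proactive feedback; the model name and prompt are assumptions.

```python
# Exploratory sketch (not part of this study): sending an image of a
# student's handwritten work to a vision-capable model for proactive
# feedback. The message format follows the OpenAI chat completions API.
import base64
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

def feedback_on_work(image_path: str) -> str:
    """Ask a vision-capable model to review a photo of student work."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Here is a photo of a student's work in progress. "
                         "Identify any errors in reasoning and suggest one "
                         "guiding question, without giving the answer away."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```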