Based on the analysis of existing literature, the novelty of this study lies in bridging the identified gap by designing a multidimensional, multi-method methodology for assessing AI-based tools in language education. Unlike previous studies, which rely largely on statistical analysis, qualitative methods, or machine learning models alone, as displayed in Table 2, this research combines all three approaches into a single pipeline. The Fuzzy Delphi Technique is used to identify high-importance main criteria (AI usage, usability, and analytical quality), an approach rarely employed in educational technology research. Additionally, the proposed system employs deep sequence models trained on real classroom outputs together with interpretable AI techniques, ensuring both predictive performance and interpretability. Finally, the addition of a teacher-focused engagement score provides an expert-weighted benchmark and interpretive lens for AI-based pedagogical tools.
The proposed research methodology, shown in Fig. 2, details the steps carried out to accomplish the study. The present study employed a mixed-method research design combining qualitative and quantitative methodologies to explore the pedagogical effects of integrating AI tools.
Objectives and methods
The objective of this study is to design and validate an AI-driven tools assessment framework for English pedagogy that integrates expert consensus, predictive modeling, and interpretability. To achieve this, the methods combine: the FDT, to capture expert consensus and generate weighted evaluation criteria, providing a structured decision-making foundation; statistical analyses, to validate the significance of relationships among student engagement, clarity, and AI tool usage; deep learning models (LSTM and Bi-LSTM), adopted to handle sequential text data and deliver high accuracy in assessing pedagogical effectiveness; and explainable AI techniques (LIME and SHAP), integrated to interpret the deep learning models, ensuring transparency and providing actionable insights for educators. Collectively, this combination of methods forms a coherent hybrid framework in which each component contributes uniquely to addressing the complexity of AI-driven assessment in education.
To improve the reliability and predictability of the proposed model, we initially utilized FDT to determine consensus-based weights for the evaluative criteria. FDT differs from classical descriptive statistics in that it offers a structured approach to agreement modeling, ensuring that the motivational and cognitive dimensions of engagement, clarity, and analytical quality are supported by a robust consensus-forming process. These weighted features were passed on to the deep learning models, the LSTM and the Bi-LSTM, which are well suited to capturing sequential and contextual features of student responses, written outputs, and temporally ordered interactions. The rationale for this hybridization is twofold: (i) the abstraction imposed by FDT constrains the features used as input to the predictive models so that they embody validated pedagogical factors, which improves interpretability; and (ii) deep learning exploits these structured features to deliver high-quality predictive performance in pedagogical effectiveness modeling. In this manner, the framework integrates consensus-driven feature validation with data-driven predictive modeling, forming a principled approach to AI-aided assessment in English education.
Data collection
Data were collected from a diverse sample of 200 undergraduate and postgraduate students reading English Literature and Linguistics at different universities; the variables are described in Table 3. Participants were randomly assigned to use one of three AI tools, ensuring balanced exposure and enabling comparative analyses. Students were recruited using a purposive sampling strategy to obtain a representative sample of students currently enrolled in course sessions in which AI tools were used as part of learning activities. The recruitment criteria specified that participants must have previously used AI-supported writing or feedback systems, to ensure that responses reflected genuine pedagogical interactions.
The observational period extended over 20 days, during which students' survey responses, written artifacts, interview documents, and learning interactions were collected and documented in temporal sequence. The mix of quantitative indicators and qualitative components yielded a rich, multimodal dataset. Structured questionnaires collected demographic data, and a set of standardized measures was recorded, including engagement scores, writing quality, analytical quality, and effective use of AI-generated insights.
Data preprocessing
The dataset underwent inspection to remove missing, duplicated, and inconsistent records, or to impute mean values (numerical variables) or mode values (categorical variables) for unreliable entries. Textual fields, including students' written responses, interview transcripts, and sequential learning interactions, were cleansed by removing stop words, punctuation, and non-informative tokens, and lemmatized to standardize lexical variants. Categorical attributes were label encoded or one-hot encoded according to their cardinality, and numerical variables were standardized for feature comparability. Consecutive messages in learning interactions were ordered by time and converted into structured token sequences to represent temporal dependence. Finally, the data were divided into training, validation, and test samples to guarantee the robustness of the predictive models.
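A minimal sketch of this tabular preprocessing pipeline is shown below; the records and column names are illustrative stand-ins for the dataset described in Table 3, not the actual data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Illustrative records standing in for the survey data described in Table 3
df = pd.DataFrame({
    "age": [21, 23, None, 25, 22, 24],
    "gender": ["F", "M", "F", None, "M", "F"],
    "ai_tool": ["ChatGPT", "Grammarly", "Voyant", "ChatGPT", "Grammarly", "Voyant"],
    "engagement_score": [4.1, 3.8, 4.5, None, 3.9, 4.2],
}).drop_duplicates()

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Impute unreliable values: mean for numeric, mode for categorical variables
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for c in cat_cols:
    df[c] = df[c].fillna(df[c].mode().iloc[0])

# Encode categoricals (one-hot here, given low cardinality) and standardize numerics
df = pd.get_dummies(df, columns=list(cat_cols))
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Hold-out splits for training, validation, and testing
train, temp = train_test_split(df, test_size=0.30, random_state=42)
val, test = train_test_split(temp, test_size=0.50, random_state=42)
```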
Data analysis
Quantitative data were analyzed using descriptive and inferential statistics. Differences among the three AI tool groups were assessed using mean scores, standard deviations, and ANOVA tests to determine statistical significance.
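A minimal sketch of this group comparison is shown below, using illustrative scores and assumed column names (`ai_tool`, `engagement_score`) rather than the study's data.

```python
import pandas as pd
from scipy import stats

# Illustrative engagement scores per tool group; values and column names are assumptions
df = pd.DataFrame({
    "ai_tool": ["ChatGPT"] * 5 + ["Grammarly"] * 5 + ["Voyant Tools"] * 5,
    "engagement_score": [4.1, 3.9, 4.4, 4.0, 4.2,
                         3.8, 4.0, 4.1, 3.9, 4.3,
                         4.5, 4.2, 4.4, 4.6, 4.1],
})

# Descriptive statistics (mean, standard deviation) per AI tool group
print(df.groupby("ai_tool")["engagement_score"].agg(["mean", "std", "count"]))

# One-way ANOVA across the three AI tool groups
groups = [g["engagement_score"].to_numpy() for _, g in df.groupby("ai_tool")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")
```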
In-depth analysis is essential to understanding how students engage with AI-supported English literature and linguistics learning. There is no clear trend of age affecting engagement: across age groups, the engagement scores are scattered, as shown in Fig. 3. Moreover, the box plot in Fig. 4, which compares the medians and spread of students' engagement and clarity-cohesion scores, shows closely related distributions, suggesting a close relationship between students' engagement and the clarity of their writing. Together, these visualizations indicate that the AI tools elicit similar engagement across demographic and academic groups and motivate further study of which personal factors are associated with AI-enhanced learning.
Furthermore, in Fig. 5, both male and female participants showed variability around the mean engagement line, indicating that age was not a strong predictor of engagement level. The box plot of engagement score by field of study shows slightly higher mean engagement for students in linguistics and a wider interquartile range, implying greater variation in responses among linguistics students. Comparing the average engagement scores by AI tool in the bar chart in Fig. 6, ChatGPT and Grammarly performed quite similarly on average, while Voyant Tools performed slightly better.
The three critical pedagogical dimensions of Engagement, Clarity, and Analytical Quality are represented in Fig. 7, which shows their probability distributions over the student dataset. Each curve is a smoothed estimate of the score distribution on a continuous scale, which helps to visualize concentration patterns. Figure 8 offers a finer-grained view of the same pattern: regardless of sex, all tools show similar distributions of student performance across the score range of roughly 3.0 to 5.0, indicating little gender disparity in performance. The pair plot in Fig. 9 gives a granular view of the relationships among the key learning variables, also distinguishing patterns by gender. The scatter plots of Engagement versus Clarity and Clarity versus Analytical Quality show visibly dense, overlapping points, indicating some clustering and weak-to-moderate correlations under the Pearson test.
Research procedures
All procedures performed in this study involving human participants, including data preparation, were carried out in accordance with the required ethical guidelines and regulations. The research protocol and dataset preparation were reviewed and approved by the Institutional Ethics Review Board, Xinyang Vocational College of Art, China.
Classroom case studies
The classroom case studies consisted of an 8-week structured integration of the selected AI tools into regular coursework activities. During this time, students worked with the AI applications on tasks designed to invoke the specific strengths of each tool. As mentioned, ChatGPT supported interactive literary analysis and critical thinking through real-time discussion, while Grammarly improved writing accuracy and clarity. With its robust linguistic analysis capabilities, Voyant Tools allowed students to visualize complex textual patterns and to examine stylistic features more deeply. Engagement and academic progress were measured using observational checklists and periodic assessments, and statistically significant improvements were recorded across all three tool categories. According to the participants, these AI-enhanced methods improved motivation and comprehension of literature and linguistics more generally.
In-depth interviews
To complement the quantitative assessments, semi-structured in-depth interviews were conducted with thirty educators experienced in deploying these AI tools. The interviews, lasting approximately 45 minutes each, addressed educators' views on how AI reoriented their teaching methods and the classroom atmosphere. AI integration was reported to increase teachers' recognition of more engaged students, to improve critical and analytical thinking, and to support individualization of instruction. Nevertheless, educators reported notable challenges: the risk of students becoming dependent on AI-generated content, and concerns over the accuracy and potential biases of AI outputs. Over 90% of the interviewed educators supported structured professional development and clear institutional policies so that the tools can be used ethically and effectively across all educational settings.
Analysis of course materials and student outputs
The third methodological approach involved direct evaluation of the impact of AI integration on educational outputs, combining analysis of course materials with student-generated outputs. Student work was first evaluated on clarity, analytical depth, engagement, and how effectively students incorporated AI-generated insights into critical statements. Quantitative analysis of student responses showed marked improvement post-intervention in all assessed areas, with analytical quality and writing clarity improving most markedly. Descriptive statistics showed that mean scores increased from 3.1 to 4.3 (p < 0.001) for analytical quality and from 3.0 to 4.5 (p < 0.001) for writing clarity. These outcomes were corroborated by qualitative analysis of student reflections on the benefits of AI tools for improving analysis, strengthening writing skills, and increasing overall academic engagement and performance.
Fuzzy Delphi technique
Fuzzy Delphi Technique (FDT) is a hybrid decision making method which improves the traditional Delphi process by introducing the fuzzy set theory to express uncertainty and vagueness of the opinion, as steps defined in Fig. 10. The method is particularly well suited for educational research where the system of analysis must consist of qualitative judgments (such as usability, engagement and improvement of learning).
FDT is a series of mathematically processed evaluations and consensus on the key criteria working. Initially, a panel of Triangular Fuzzy Numbers (TFNs) are based on linguistic ratings on each criterion using a 5-point Likert scale, as defined using Eq. 1.
Where, \(\:\stackrel{\sim}{{A}_{ij}}\) represents the TFN given by \(\:i\) on criterion \(\:j\), showing minimum values lower bond with \(\:{l}_{ij}\), most probable value mode showing with \(\:{m}_{ij}\), and maximum possible value with upper bond \(\:{u}_{ij}\). For each \(\:j\), the \(\:\stackrel{\sim}{{A}_{ij}}\:\)aggregated using the fuzzy averaging operator, resulting in an average TFN, computed using Eq. 2:
To convert these fuzzy numbers into a crisp value for comparison and ranking, defuzzification is applied using the Center of Gravity (COG) method, defined using Eq. 3:
Next, to evaluate consensus, the threshold distance \(d_{ij}\) between each expert's TFN and the group-average TFN is computed using the Euclidean distance formula for fuzzy numbers, defined in Eq. 4:
The average threshold value \(\bar{d}_{j}\) for each criterion is then calculated as in Eq. 5:
A criterion is accepted if its average threshold value satisfies \(\bar{d}_{j} \le 0.2\) and at least 75% of the expert panel reaches agreement (consensus). These steps allow FDT to filter and prioritize the most pedagogically meaningful criteria in a manner that combines mathematical objectivity with expert judgment. This structured process further improved the credibility of the evaluations and built a strong foundation on which data-driven, consensus-based frameworks can be designed in educational settings such as evaluating AI tools.
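A compact sketch of Eqs. 1-5 is given below, assuming an illustrative mapping from 5-point Likert ratings to triangular fuzzy numbers; the actual TFN scale and thresholds used in the study may differ.

```python
import numpy as np

# Assumed Likert-to-TFN mapping (l, m, u); illustrative, not the study's exact scale
LIKERT_TO_TFN = {1: (0.0, 0.0, 0.25), 2: (0.0, 0.25, 0.5),
                 3: (0.25, 0.5, 0.75), 4: (0.5, 0.75, 1.0), 5: (0.75, 1.0, 1.0)}

def fdt_criterion(ratings, d_threshold=0.2, agreement_level=0.75):
    """Evaluate one criterion from a list of expert Likert ratings (Eqs. 1-5)."""
    tfns = np.array([LIKERT_TO_TFN[r] for r in ratings])   # Eq. 1: expert TFNs
    avg = tfns.mean(axis=0)                                 # Eq. 2: fuzzy average TFN
    crisp = avg.mean()                                      # Eq. 3: COG defuzzification
    # Eq. 4: Euclidean distance of each expert's TFN from the group average
    d = np.sqrt(((tfns - avg) ** 2).sum(axis=1) / 3.0)
    d_bar = d.mean()                                        # Eq. 5: average threshold value
    agreement = (d <= d_threshold).mean()                   # share of experts in consensus
    accepted = (d_bar <= d_threshold) and (agreement >= agreement_level)
    return crisp, d_bar, agreement, accepted

score, d_bar, agree, ok = fdt_criterion([4, 5, 4, 3, 5, 4, 4])
print(f"defuzzified={score:.2f}, d_bar={d_bar:.2f}, agreement={agree:.0%}, accepted={ok}")
```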
Predictive modelling
In the modeling phase of this research, the prediction and classification of important pedagogical outcomes were performed using deep learning approaches, namely Long Short-Term Memory (LSTM) and Bidirectional LSTM (Bi-LSTM) networks, to discover relationships between students' encounters with the AI-driven education tools. Because the LSTM's memory-cell structure can retain long-term contextual information while mitigating the vanishing-gradient problem, it is particularly effective at capturing temporal dependencies in sequential data. The Bi-LSTM model extends the LSTM architecture by processing input sequences in both the forward and backward directions, so that past and future dependencies are considered simultaneously, as shown in Fig. 11. This bidirectional learning is particularly useful for modeling reflective and interpretive variables, such as engagement and linguistic analysis, which depend on both preceding and subsequent learning cues.
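For reference, the sketch below shows a minimal Bi-LSTM classifier in Keras, assuming tokenized response sequences padded to a fixed length and a binary effectiveness label; vocabulary size, layer widths, and hyperparameters are illustrative rather than the study's exact configuration.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 10_000, 128, 200    # illustrative sizes

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # The Bidirectional wrapper reads the sequence forward and backward (Fig. 11)
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),            # pedagogical-effectiveness label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training would then follow on the padded sequences, e.g.:
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, batch_size=32)
```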
XAI interpretation
To make the predictive models of this study more transparent and interpretable, two widely used XAI techniques were applied. These methods help explain how each feature impacts model outputs both globally and locally. SHAP is based on cooperative game theory, using Shapley values to fairly distribute a model's prediction across its input features. For a model \(f\) and input \(x\), the SHAP explanation model approximates the output as in Eq. 6.
Where \(\varphi_{0}\) denotes the base value (the average model output) across the dataset, and \(\varphi_{i}\) denotes the SHAP value of feature \(i\), representing its contribution to the deviation from the base value.
Each \(\varphi_{i}\) is computed as a weighted average of marginal contributions across all possible feature coalitions \(S \subseteq F \setminus \{i\}\), as shown in Eq. 7:
Furthermore, LIME explains individual predictions by locally approximating the complex model \(f\) with an interpretable linear model \(g\) around a specific input \(x\). It generates perturbations of \(x\), evaluates \(f\) on these samples, and fits \(g\) using weighted least squares, as in Eq. 8,
which ensures fair attribution of contributions to each feature while accounting for possible feature interactions.
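A hedged sketch of this explanation step is given below; it applies SHAP's KernelExplainer and LIME's tabular explainer to a surrogate regressor trained on synthetic stand-in features, since the study's fitted sequence models and data are not reproduced here. The feature names are assumptions for illustration.

```python
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["clarity_cohesion", "analytical_quality",
                 "ai_insights_usage", "age", "gender", "field_of_study"]

# Synthetic stand-in features and an outcome driven mostly by the first two features
X_train = rng.normal(size=(150, 6))
y_train = X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(0, 0.1, 150)
X_test = rng.normal(size=(20, 6))

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# SHAP (Eqs. 6-7): Shapley-value attributions, summarized globally over the test set
explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 50))
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# LIME (Eq. 8): local linear approximation around one student's record
lime_explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                      mode="regression")
explanation = lime_explainer.explain_instance(X_test[0], model.predict, num_features=6)
print(explanation.as_list())
```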
Performance measures
Standard classification metrics were used to evaluate the performance of the predictive models in this study: accuracy, precision, recall, and F1-score, displayed in Table 4. These measures provide a holistic view of model effectiveness, especially in educational settings where class balance matters, such as classifying varying levels of student engagement or learning effectiveness.
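For reference, the sketch below computes these metrics with scikit-learn on illustrative labels; weighted averaging is one reasonable choice for multi-class engagement levels, though other averaging schemes could equally be used.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative engagement-level labels (0 = low, 1 = medium, 2 = high)
y_true = [2, 1, 0, 2, 1, 0, 2, 2]
y_pred = [2, 1, 0, 1, 1, 0, 2, 2]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted"))
print("Recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-score :", f1_score(y_true, y_pred, average="weighted"))
```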
Classroom case studies
The classroom case studies provided substantial insight into how integrating AI tools (ChatGPT, Grammarly, and Voyant Tools) into literature and linguistics courses affects education. All platforms significantly improved student engagement as well as analytical and conceptual understanding, as the synthesized quantitative results in Table 5 display. In particular, ChatGPT users had the highest analytical skills scores (M = 4.4), most likely because the platform enabled interaction with the course materials, including the literary texts, which encouraged students to engage critically with them. Thematic interpretation was actively worked through and discussed by the students, a type of critical thinking that is particularly useful in reading postcolonial literature.
Grammarly, on the other hand, was more effective in polishing students' outputs in terms of clarity and cohesiveness. Grammarly's structured linguistic feedback was especially valuable, with understanding of concepts scoring M = 4.5 on average, as students were empowered to produce clearer and more articulate written assignments. Voyant Tools, well known for its interactive visual analyses, contributed to student engagement (M = 4.1) with linguistic corpus analysis by strengthening students' ability to recognize linguistic patterns and stylistic devices. Together, these results point to the educational opportunity these AI tools offer to foster deep learning and trigger student involvement. Grammarly had a notable impact on understanding grammatical concepts, while Voyant Tools consistently improved student engagement by making linguistic analysis interactive and accessible.
In-depth interviews
The in-depth interviews also brought to light qualitative insights into how educators perceive the pedagogical changes following AI integration. Educators' perceptions were overwhelmingly positive: 87% of educators said that AI integration led to increased student engagement, more dynamic instructional interactions, and improved student-teacher dialogue, as shown in Table 6.
Nevertheless, educators had significant apprehensions. A considerable number of them (68%) pointed out that students' reliance on AI tools could negatively affect independent analytical thinking while diminishing the originality of their work. Ethical concerns were also prominent: 65% of educators were worried about the accuracy and potentially biased content of AI-generated material, as well as equity in access to digital tools. These findings reflect educators' cautious optimism about AI's potential and encourage critical reflection and careful implementation strategies.
Analysis of course materials and student outputs
The detailed analysis of course materials and student-generated outputs provided compelling quantitative evidence of the pedagogical effectiveness of AI tools; the outcomes are shown in Table 7. Improvement across all assessed dimensions was notable after AI integration. With analytical quality rising from a baseline mean of 3.2 to 4.4, AI-enabled assignments promoted more analytical ways of thinking and allowed students to make more sophisticated and nuanced arguments. Similarly, writing clarity scores increased (3.1 → 4.5), aided by Grammarly's precision feedback mechanisms, which strengthened students' self-editing skills and sentence-level coherence.
The change in student engagement scores, from 3.0 to 4.6, indicated a substantial positive shift in motivation and active participation in learning tasks. Qualitative student feedback supported these quantitative findings and identified both benefits and risks associated with using AI. While over 83% of students reported pedagogical benefits from AI, a considerable number (41%) were uncertain whether they could interpret literary texts independently without AI help. For pedagogy, this statistic illustrates a central balance: AI adds to learning outcomes, yet there remains a risk of students becoming dependent on automated analytical tools, as shown in Figs. 12 and 13.
Comparative analysis of three methodological approaches
Three methodological approaches were used to explore the integration of these AI tools in the teaching and learning of English literature and linguistics, as displayed in Table 8. The statistical analysis indicated that, in terms of engagement and analytical quality, no significant differences were found between demographic groups (t-test, p = 0.543) or among AI tools (ChatGPT, Grammarly, and Voyant; ANOVA, p = 0.58). Pearson's correlation between interest and analytical quality (r = -0.009, p = 0.921) was also not significant. Although the absence of group differences does not prove that tool type and demographic variables have no individual effect, this similarity across the three groups can be interpreted as a positive result: it implies that the type of tool selected did not by itself affect learning outcomes, which supports the generalizability of students' experiences with AI-based tools. At the same time, the Z-test was significant (Z = 13.03, p < 0.001), showing that clarity and cohesion scores deviated significantly from population means. This result underscores the importance of cohesive writing in academic work assisted by AI tools. The Friedman test for engagement, clarity, and analytical quality was not significant (χ² = 1.13, p = 0.57), echoing the finding that scores are relatively stable across levels of interest. Combined, these findings suggest that while the framework does not reveal marked differences between groups or tools in traditional statistical terms, it does highlight the stability and consistency of AI support across learners. Most importantly, the hybrid architecture of the proposed framework, in which the methods are stacked, allows the approach to generate actionable insights, coherent explanations, and accurate predictions for pedagogical enhancement even when statistical group-level differences are limited.
The Pearson correlation heatmap in Fig. 14 shows negligible to weak correlations between the core variables. The highest positive correlation is between clarity and analytical quality, and although it is weak, it suggests that better analytical performance may loosely relate to improved clarity of writing.
All three methods converged on several key findings, with each method suggesting that AI integration improved student engagement and academic skills. Nevertheless, the methodological differences proved complementary. The comparative strength of this research lies in its methodological triangulation, namely the combination of quantitative rigor and qualitative depth, which enables comprehensive consideration of literary and linguistic education in light of the impact of AI.
Fuzzy approach
The application of the Fuzzy Delphi Technique (FDT) to determine the pedagogical effectiveness of AI tools in English language and literature education proved useful, as the overall analysis in Table 9 displays. AI Usage scored the highest consensus (93.74%) and a very strong defuzzified score of 4.26, indicating that it is the most important and most unanimously agreed-upon dimension among the seven evaluated criteria. This underscores that educators increasingly recognize the centrality of AI usage. Usability also proved to be a significant construct, with a high agreement level of 86.67% and a defuzzified score of 3.93. Its prioritization reflects teachers' pragmatic view that even the most innovative AI tool can become pedagogically irrelevant if it is not accessible or intuitively designed. In line with these key indicators, Learning Enhancement (80.00% agreement, score 4.00) and Analytical Quality (80.29% agreement, score 3.40) were also validated. The analysis of Clarity Cohesion reflects a further, more ambiguous dimension concerning the interpretive or qualitative side of students' output when mediated by AI.
At the same time, FDT's structured thresholding mechanism (i.e., agreement % and d ≤ 0.2) prevents the consensus indicators from being distorted by bias or by undue influence from individual criteria, as shown in Fig. 15. As a result, the model is robust, scalable, and pedagogically meaningful, a key advantage when constructing AI evaluation rubrics for educational, administrative, or policy purposes. This research also demonstrates that the Fuzzy Delphi Technique is methodologically appropriate for analyzing complex interpretive domains and gives actionable insights into the perceived effectiveness of AI tools in English education. The high-priority dimensions of AI Usage, Usability, Learning Enhancement, and Analytical Quality can therefore act as the cornerstones of an AI-based assessment framework grounded in the views of English educators.
The FDT therefore emerges not merely as a consensus tool but as an instrument for designing future AI-integrated learning environments. The linguistic fuzzy rules were derived and applied using membership functions for variables such as Clarity, Engagement, Usability, and Effectiveness, each modeled by triangular membership functions to allow flexible interpretation of linguistic input, as displayed in Table 10. Fuzzy logic allows for the construction of a fuzzy system with overlapping 'low', 'medium', and 'high' regions, reflecting the real-world ambiguity of human decision making. In summary, the linguistic fuzzy rules give a solid structure for assessing the impact of AI tools employed in English language pedagogy.
To illustrate with a practical, real-life scenario, we evaluate student performance on a test (out of 100 marks) using fuzzy logic. We define three fuzzy sets for performance level: Low \(L\), Medium \(M\), and High \(H\), with triangular membership functions:
* Low \(L\): \(\mu_L(x) = (100-x)/50\), for \(0 \le x \le 50\).
* Medium \(M\): \(\mu_M(x) = (x-30)/20\), for \(30 \le x \le 50\), and \((70-x)/20\), for \(50 \le x \le 70\).
* High \(H\): \(\mu_H(x) = (x-50)/50\), for \(50 \le x \le 100\).
For a student who scores 45 marks, the membership values are:
* \(\mu_L(45) = (100-45)/50 = 55/50 = 1.10 \to\) clipped to 1.0 (full membership in Low).
* \(\mu_M(45) = (45-30)/20 = 15/20 = 0.75 \to\) partial membership in Medium.
* \(\mu_H(45) = (45-50)/50 < 0 \to\) 0 membership in High.
Thus, the student has strong membership in Low (1.0), partial membership in Medium (0.75), and no membership in High. Instead of labeling the student as strictly "Low" or "Medium", fuzzy logic shows that the performance lies between Low and Medium, providing a nuanced evaluation, as shown in Fig. 16.
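The sketch below reproduces this worked example in Python, with membership values clipped to [0, 1] as in the calculation above.

```python
def mu_low(x):
    # Low: (100 - x) / 50 on [0, 50], clipped to the unit interval
    return min(max((100 - x) / 50, 0.0), 1.0) if 0 <= x <= 50 else 0.0

def mu_medium(x):
    # Medium: rising on [30, 50], falling on [50, 70]
    if 30 <= x <= 50:
        return (x - 30) / 20
    if 50 < x <= 70:
        return (70 - x) / 20
    return 0.0

def mu_high(x):
    # High: (x - 50) / 50 on [50, 100]
    return min(max((x - 50) / 50, 0.0), 1.0) if 50 <= x <= 100 else 0.0

score = 45
print(mu_low(score), mu_medium(score), mu_high(score))   # 1.0, 0.75, 0.0
```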
Predictive modelling outcomes
Applied to the different pedagogical features, the LSTM model yields differing results in terms of how each criterion contributes to the overall performance of AI-driven English language education. The results are tabulated to provide a precise quantitative comparison using standard classification metrics (Accuracy, Precision, Recall, F1-score), while the supporting figures visualize how the model internalizes and distinguishes each feature over the training epochs. The Learning Enhancement category achieved 82% accuracy, 83% precision, 82% recall, and an 80% F1-score, the highest performance. This consistency across all metrics suggests that the LSTM is particularly strong at modeling features related to the use of AI tools for improving comprehension and retention. The high score confirms that this feature plays a fundamental role in predicting learning effectiveness and that learning enhancement acts as a trustworthy predictor of pedagogical success when AI tools are employed.
The classification results for Usability were also strong, reaching 80% accuracy and a 79% F1-score, displayed in Table 11. These results reinforce that AI tools perceived as easily usable and accessible have a stronger impact on learning outcomes, consistent with the earlier FDT results, which also showed strong consensus on Usability.
This result corroborates that interface design and interaction experience matter for educational AI, and indicates that the LSTM learns to distinguish usability effectively, with high precision and recall. Some features remain pedagogically important even though the model struggled to find strong patterns for them, which may point to the need for more granular input features such as rubric-based assessments or skill-based task completions. Accuracy for Clarity, Cohesion, and Analytical Quality was 73%, 78%, and 81%, respectively, with F1-scores of 71% and 77% reflecting a balanced trade-off between precision and recall. The trend graphs, in particular the scatter-and-trend-line plot of total validation accuracy across epochs in Fig. 17, show increasing performance over time. Validation accuracy started at about 40% and rose steadily to over 80% by the 100th epoch, showing how well the LSTM model learns and converges. This consistent trend, visualized through the color gradient and bubble sizes, validates both the model's generalization power and its robustness across features. The further breakdown of epochs in the four-panel visualization in Fig. 18 reinforces the stability of LSTM training.
In evaluating the pedagogical elements within the AI-enhanced English language learning feature set, the Bi-LSTM model produced very encouraging results. Overall model performance was robust across all evaluated features, achieving 90% accuracy, 92% precision, 93% recall, and a 90% F1-score. Because the Bi-LSTM is highly effective at capturing bidirectional dependencies and long-range contextual information, these strong metrics indicate that it models sequential learning patterns in educational data very successfully, as shown in Table 12. Among the individual features, Skills Development and Analytical Quality performed best, with 90% accuracy or higher and excellent F1-scores of 97%.
This corroborates that the Bi-LSTM can deeply learn complex, structured tasks involving task-based outcomes (skill acquisition) as well as logical reasoning (analytical performance). The visualization of training and validation accuracy across epoch segments (1-25, 26-50, 51-75, and 76-100) offers insight into how the network learns, as shown in Fig. 19. Notably, training accuracy remains near 99%, while validation accuracy fluctuates between 81% and 83%.
This implies a highly stable training process with limited overfitting, reflected in the proximity between the training and validation trends. The bubble plots within each quadrant reinforce this stability: no major drops or irregularities appear, and the validation bubbles align well with the trend lines, showing that performance remained consistent across epoch intervals. The model's validation accuracy across all 100 epochs is shown in Fig. 20 as a line-and-bubble scatter plot that illustrates how the model learns: starting at around 8-10%, accuracy rises to 90-100% following an approximately linear upward trend. This gradual rise shows that the Bi-LSTM developed well over time without premature saturation or plateau, and the trend line, colored along a purple-to-yellow gradient, reflects this strong positive slope.
The training and validation accuracy curves closely overlap in Fig. 21, peaking above 95% accuracy, which indicates that the model generalizes well. The nearly identical trajectories of the blue (training) and orange (validation) lines indicate that the model did not overfit and performed consistently on training as well as unseen data. The proposed Bi-LSTM is thus robust to variance, learns efficiently, and is suitable for contexts in which the sequence of contextual data needs to be retained, such as modeling textual engagement and clarity.
Explainable artificial intelligence (XAI) interpretation
The multidimensional model adds interpretability by quantifying the impact of each feature on predicted student engagement using XAI techniques. The SHAP summary plots (Fig. 22a and b) show that the clarity-cohesion score has the most impactful effect on the model predictions globally, followed closely by the analytical quality score, while the AI insights usage score has mild influence and age has minimal influence. This result is consistent with writing pedagogy principles that highlight cohesion and clarity as key aspects of written expression in academic settings, and with constructivist views on learning that emphasize analytical thinking for knowledge representation. The limited impact of demographic factors such as age also indicates that AI-based tools work reliably regardless of learner background, which makes the framework more generalizable.
The LIME explanations shown in Fig. 23 serve as the local interpretable counterpart to the SHAP results for individual predictions. The analytical quality score, clarity-cohesion score, and AI insights usage score contributed positively to a higher probability of pedagogical effectiveness, while demographic variables including gender, age, and field of study contributed very little. This shows that the model's decisions were based primarily on pedagogically relevant factors rather than demographic biases, supporting the trustworthiness of the framework.
A joint interpretation of the SHAP and LIME results is shown in Table 13, which relates the importance of features to underlying educational theories. For example, the importance attributed to clarity and cohesion is theoretically supported by studies in writing pedagogy that emphasize coherence in written texts, and the contribution of AI insight usage is grounded in theories of self-regulated learning, in terms of the learning that occurs through iterative feedback. Together, these findings demonstrate that the proposed framework not only delivers accurate predictions but also provides theoretically grounded and actionable explanations, enhancing its pedagogical relevance.
Discussion
Evaluation of the AI-Driven Tools Assessment Framework for English Teachers showed that it provides a multidimensional evaluation based on statistical analysis, fuzzy logic, predictive modeling, and explainable AI to assess the pedagogical effectiveness of digital tools, following the procedure defined in the algorithm. T-tests, ANOVA, chi-square, and z-tests showed no gender-based or tool-based bias but a statistically significant effect of clarity and AI usage on learning outcomes. Together, these findings validate the strength and interpretability of the proposed assessment framework as a reliable, data-driven guide for English teachers in the use of AI tools for pedagogical purposes.
Figure 24 shows the complementary cumulative distribution function (CCDF) of the dataset on log-log scales with a fitted theoretical reference power-law slope of approximately -1. The solid line lies within a factor of 2 of the power-law values over most of the data range, which shows that the data are heavy-tailed. This suggests that most features and samples have only small effects, while a small number have larger effects, in line with the observed impact of clarity and analytical quality in predictive modeling. Such a distribution has positive implications for the adequacy of the sample size (N = 200), as statistical power is concentrated on high-impact features while the long-tail attributes provide generalization and stability. This power-law character also empirically justifies the methodological decision to concentrate on pedagogically relevant cues identified by SHAP, LIME, and the Fuzzy Delphi analysis.
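The sketch below illustrates, on synthetic heavy-tailed data, how such a CCDF and the slope ≈ -1 reference line can be produced on log-log axes; it is a stand-in for the diagnostic in Fig. 24, not the study's plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = np.sort(rng.pareto(1.0, 200) + 1.0)          # synthetic heavy-tailed magnitudes (N = 200)
ccdf = 1.0 - np.arange(len(values)) / len(values)     # P(X > x), from 1 down to 1/N

plt.loglog(values, ccdf, ".", label="empirical CCDF")
ref_x = np.array([values.min(), values.max()])
plt.loglog(ref_x, ccdf[0] * (ref_x / ref_x[0]) ** -1, "--", label="reference slope -1")
plt.xlabel("effect magnitude")
plt.ylabel("P(X > x)")
plt.legend()
plt.show()
```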
Table 14 presents a comparative study between the proposed Bi-LSTM model and previous works. Previous models such as a CNN-LSTM used for predicting students' feedback achieved 87% accuracy and an 85% F1-score, and a CNN-GRU on multimodal speech datasets achieved 82% accuracy and an 81% F1-score. Likewise, when used on course review text data, a Bi-LSTM achieved 86% accuracy and an 85% F1-score.
By contrast, the proposed Bi-LSTM model, exploiting both numeric and text features of AI tools in education, maintains better overall performance with 90% accuracy, 92% precision, 93% recall, and an F1-score of 92%. These improvements demonstrate that combining textual and numeric attributes in the hybrid framework is effective, and that the proposed model outperforms previous methods and possesses strong predictive ability for educational applications.