In conclusion, while progress has been made in the field of nanozymes, the combination of AI with nanozyme research offers an unprecedented opportunity to overcome these gaps. AI can streamline data curation, standardize experimental parameters, and offer predictive insights that will significantly enhance the speed and accuracy of nanozyme design and application, ultimately enabling broader real-world implementation.
We initially retrieved over 6,000 nanozyme-related publications from reputable databases, including Google Scholar, ACS Publications, Elsevier, and Web of Science. This body of literature was subsequently filtered on the basis of three stringent criteria: (1) a primary emphasis on enzyme-like catalytic activities of nanozymes, specifically peroxidase (POD), oxidase (OXD), catalase (CAT), superoxide dismutase (SOD), and glutathione peroxidase (GPx); (2) inclusion of morphological characterizations of the nanozymes within the publications; and (3) comprehensive documentation of catalytic types and steady-state kinetic parameters. Through these criteria, we refined the dataset to 366 highly relevant publications, encompassing 12 disciplines across 97 academic journals. Of these, 56 publications were from the past two years (2022-2024), and 175 were published within the last five years (2019-2024). The average impact factor (IF) of these 366 publications is 8.4, with 251 articles having an IF above 5.

For the selected publications, we employed a rigorous data extraction protocol. Key experimental parameters, such as chemical composition, nonmetal doping, metal ratios, metal types, metal oxidation states, morphologies, particle sizes, surface modifications, and synthesis pathways, were meticulously extracted to characterize the physical properties of the nanozymes. Additionally, data on steady-state kinetic conditions, including dispersion media, buffer pH, temperature, substrate types, and substrate concentrations, were collected, and the enzyme-mimicking catalytic type (POD, OXD, CAT, GPx, or SOD) was recorded for each entry. Jiang et al.'s research further emphasizes the importance of standardized methodologies for the accurate evaluation of the catalytic activity and kinetic properties of peroxidase-like nanozymes, ensuring their precise and reproducible application in biological detection and diagnostics. The catalytic activity metrics we collected are primarily inspired by this literature and systematically document key catalytic parameters such as Km, Vmax, kcat, and IC50. Each entry in the database is accompanied by a Digital Object Identifier (DOI) and detailed citation information from the original source, allowing researchers to trace back to the primary literature for verification or further data analysis.
In our curated collection of 366 publications, we observed significant data dispersion concerning the catalytic efficiency of nanozymes across the literature. Notably, 129 of these papers include relevant catalytic efficiency data within supplementary files presented in various formats (PDF and Word documents). Additionally, there is considerable inconsistency in the units employed to report the catalytic efficiency parameters. For example, the Michaelis constant (Km) appears in four unit formats (M, mM, µM, and nM); the maximum velocity (Vmax) is reported in seven formats (M/min, M/s, mM/min, mM/s, nM/s, µM/min, and µM/s); and the turnover number (kcat) is expressed in either s⁻¹ or min⁻¹.
To address these inconsistencies, we systematically compiled and organized existing nanozyme research data to establish a standardized repository. By aggregating data from diverse publications, we constructed a comprehensive and cohesive database. Given that nanozyme research is an emerging field, standardized terminology for nanozymes is limited, with only one formal set of definitions provided by the China Science and Technology Terminology Standardization Committee. This standard outlines foundational terminology for nanozyme characterization, specific catalytic activity types, and nanozyme classification. We aligned the nanozyme nomenclature with these guidelines, adopting the format "Material Composition/Structure" + "Enzyme-like Activity" + "Nanozyme", for example, "FeO - Peroxidase Nanozyme". Although most reported nanozymes have yet to be named according to this criterion, a broader definition would be beneficial to the field.
Currently, no uniform standard exists for reporting Km, Vmax, and kcat values in nanozyme research. Similarly, there is no consensus regarding the description of nanozyme morphologies or reaction dispersion systems. To standardize these aspects, we extracted all supplementary data from the 129 relevant papers, as well as the main-text data from the remaining publications, into a unified Excel spreadsheet. Units for Km, Vmax, and kcat were standardized to mM, µM/s, and s⁻¹, respectively. Nanozyme morphologies were categorized into seven distinct shapes: nanocubes, nanodots, nanoparticles, nanorods, nanosheets, nanowires, and polyhedrons. The dispersion systems were classified into eight standard categories: Britton-Robinson, citrate, MES buffer, NaAc-HAc, PBS, phosphate, Tris-borate, and water.
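As a minimal sketch of this unit harmonization, the following Python snippet converts reported Km, Vmax, and kcat values to the standardized units (mM, µM/s, and s⁻¹). The conversion tables, function, and unit spellings are illustrative assumptions rather than the exact schema of the curated spreadsheet.

```python
# Illustrative unit normalization; unit spellings are assumptions, not the
# exact vocabulary used in the curated spreadsheet.
KM_TO_MM = {"M": 1e3, "mM": 1.0, "uM": 1e-3, "nM": 1e-6}            # target: mM
VMAX_TO_UM_PER_S = {"M/s": 1e6, "M/min": 1e6 / 60, "mM/s": 1e3,
                    "mM/min": 1e3 / 60, "uM/s": 1.0, "uM/min": 1 / 60,
                    "nM/s": 1e-3}                                    # target: uM/s
KCAT_TO_PER_S = {"s-1": 1.0, "min-1": 1 / 60}                        # target: s^-1

def normalize(value, unit, table):
    """Convert a reported value to the standardized unit using a lookup table."""
    return value * table[unit]

# Example: a Km reported as 120 uM becomes 0.12 mM.
print(normalize(120, "uM", KM_TO_MM))  # 0.12
```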
During data organization, we identified instances where nanozyme sizes were reported as ranges. For example, in one study, the nanozyme size was 2.9-3.4 nm. In such cases, we employed an averaging approach to derive a single standardized value, recording the size as 3.15 nm rather than preserving the range.
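A minimal sketch of this averaging step is shown below; the helper function and its handling of the size string are hypothetical, intended only to illustrate how a reported range collapses to its midpoint.

```python
# Hypothetical helper for collapsing a reported size range (e.g. "2.9-3.4 nm")
# into the single averaged value stored in the database.
def average_size(size_text: str) -> float:
    """Return the midpoint of a 'low-high' size range, or the value itself."""
    numbers = [float(x) for x in size_text.replace("nm", "").split("-")]
    return sum(numbers) / len(numbers)

print(average_size("2.9-3.4 nm"))  # ~3.15
```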
On the basis of our stringent data extraction criteria, all 1,085 entries in our collection display some degree of incompleteness owing to the exhaustive nature of our data requirements. For example, entries were required to specify the oxidation state of a fourth metal element; however, for nanozymes such as Pt, which consist of a single elemental component, a fourth metal oxidation state is inherently absent. Similarly, for NiO nanozymes, details on surface modifications are not reported in the literature, resulting in data gaps.
To address missing values and ensure data completeness for machine learning applications, we applied the K-nearest neighbor (K-NN) algorithm for imputation. The K-NN algorithm identifies the closest "neighbors" by calculating the similarity between samples with missing values and other entries via Euclidean or Manhattan distance metrics. To maintain accuracy, similarity calculations were based exclusively on features without missing values. This imputation process effectively filled data gaps, enhancing the integrity of our dataset for subsequent model training.
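The snippet below sketches this imputation step with scikit-learn's KNNImputer, which fills each gap from the most similar complete rows using a nan-aware Euclidean distance. The toy feature table, column names, and the choice of k are illustrative assumptions, not the exact configuration used in the study.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric feature table with gaps (column names are illustrative).
df = pd.DataFrame({
    "size_nm":       [3.15, 10.0, None, 25.0],
    "ph":            [4.0,  None, 7.0,  5.5],
    "temperature_c": [25.0, 37.0, 25.0, None],
})

# KNNImputer fills each missing value from the k most similar rows,
# judged by a nan-aware Euclidean distance over the observed features.
imputer = KNNImputer(n_neighbors=2)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)
```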
In this study, we employ a machine learning approach to predict the enzyme-mimicking activity of nanozymes on the basis of their names. The following steps outline the data processing and model training procedures:
The dataset contains information on nanozyme names and their corresponding enzyme mimic activities. The data are loaded into a DataFrame, with the primary column renamed Nanozyme Name for clarity and consistency in feature selection. This column serves as the input feature X, while the target variable y represents the enzyme activity categories.
Given the text-based nature of the nanozyme names, CountVectorizer is employed to transform the raw text into numerical feature vectors. This process tokenizes the text data, captures the frequency of word occurrences, and converts the nanozyme names into a sparse matrix suitable for machine learning applications.
The processed data are split into training and testing sets, using an 80/20 split ratio. The training set, comprising 80% of the data, is used to train the machine learning model, while the remaining 20% forms the test set. A random seed of 42 ensures reproducibility.
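The following sketch illustrates the loading, bag-of-words vectorization, and 80/20 split described above; the CSV file name and column labels are placeholders for the curated dataset, not its actual schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# File name and column labels are placeholders for the curated dataset.
df = pd.read_csv("nanozyme_dataset.csv")
df = df.rename(columns={df.columns[0]: "Nanozyme Name"})

X_text = df["Nanozyme Name"]           # input feature: the nanozyme name
y = df["Enzyme Mimic Activity"]        # target: activity class (POD/OXD/CAT/SOD/GPx)

# Bag-of-words encoding of the names, as described in the text.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_text)   # sparse token-count matrix

# 80/20 split with the fixed seed reported above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```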
An AdaBoost ensemble classifier is selected because of its ability to improve accuracy by reducing bias and variance. The AdaBoost classifier uses a decision tree as its base estimator, with the following configuration:
A maximum depth (max_depth) of 70 is used to control tree growth, ensuring adequate learning in complex feature spaces. The minimum number of samples required to split a node (min_samples_split) is set to 2, and the minimum number of samples per leaf (min_samples_leaf) is set to 1. The maximum number of features (max_features) is limited to 10, constraining each split to a subset of features. Gini impurity (criterion) is used to assess node purity, and the best-split strategy (splitter) selects the optimal split at each node. A random seed (random_state) of 824 stabilizes the model learning process.
The AdaBoost classifier is configured with 19 estimators and a learning rate of 0.01, optimizing performance by gradually reducing the contribution of individual estimators. The model is then trained on the X_train and y_train datasets to identify patterns in nanozyme names that correlate with enzyme mimic activity.
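A sketch of this configuration is shown below, continuing from the previous snippet (X_train and y_train); it uses the hyperparameters listed above, and the variable names are placeholders. Note that scikit-learn >= 1.2 passes the base tree via the estimator argument, whereas older releases use base_estimator.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Base decision tree with the hyperparameters listed above.
base_tree = DecisionTreeClassifier(
    max_depth=70,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features=10,
    criterion="gini",
    splitter="best",
    random_state=824,
)

# AdaBoost ensemble with 19 estimators and a 0.01 learning rate
# (scikit-learn >= 1.2 uses `estimator=`; older releases use `base_estimator=`).
clf = AdaBoostClassifier(
    estimator=base_tree,
    n_estimators=19,
    learning_rate=0.01,
)
clf.fit(X_train, y_train)
```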
After model training, a predict_nanoenzyme_type function is defined. This function receives a single nanozyme name, converts it to a feature vector via CountVectorizer, and applies the trained AdaBoost model for prediction. The numerical prediction is mapped to specific enzyme types as follows:
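A minimal sketch of such a prediction function is given below, continuing from the fitted vectorizer and classifier above. Because the exact integer-to-label mapping is not stated in the text, the LABEL_MAP shown is purely illustrative, and it assumes the activity classes were encoded as integer codes during training.

```python
# The exact integer-to-label mapping is not given in the text; this one is
# illustrative only, covering the five activity classes studied.
LABEL_MAP = {0: "POD", 1: "OXD", 2: "CAT", 3: "SOD", 4: "GPx"}

def predict_nanoenzyme_type(name: str) -> str:
    """Vectorize a single nanozyme name and return the predicted activity label."""
    features = vectorizer.transform([name])   # reuse the fitted CountVectorizer
    code = int(clf.predict(features)[0])      # numerical class prediction
    return LABEL_MAP[code]

# Hypothetical example call.
print(predict_nanoenzyme_type("Fe3O4 - Peroxidase Nanozyme"))
```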
Finally, the predicted enzyme type is returned, offering a streamlined approach to determining the mimic activity type of novel nanozymes solely on the basis of name-based features.
This approach provides an efficient framework for nanozyme activity prediction, enhancing nanozyme research by facilitating rapid classification and enabling further insights into nanozyme functionality.
In this study, we designed a prediction model using a machine learning approach to estimate key catalytic properties (Km, Vmax, and kcat) for nanozymes on the basis of a variety of physicochemical attributes. The methodology involves the following detailed steps:
The dataset, containing nanozyme attributes and target catalytic properties, is loaded into a DataFrame. The attributes include the elemental composition (e.g., N, P, S), chemical formula, particle shape, size, surface modifications, pH of the buffer, temperature, and substrate concentrations. These features are selected to represent structural and environmental factors influencing enzyme mimicry in nanozymes.
Since some chemical formulas might slightly differ or contain minor errors, the function find_similar_formula is implemented to identify and map chemically similar formulas. Using a preestablished mapping dictionary loaded from a JSON file, this function uses string similarity matching to locate close matches, ensuring compatibility in formula interpretation.
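The following sketch illustrates one way such fuzzy matching can be implemented with difflib.get_close_matches; the JSON file name, its contents, and the similarity cutoff are placeholders rather than the actual mapping dictionary used.

```python
import difflib
import json

# A sketch of fuzzy formula matching; the JSON file name and its contents
# are placeholders for the pre-established mapping dictionary.
with open("formula_mapping.json", "r", encoding="utf-8") as fh:
    FORMULA_MAP = json.load(fh)   # e.g. {"Fe3O4": 0, "CeO2": 1, ...}

def find_similar_formula(formula: str, cutoff: float = 0.8):
    """Return the closest known formula by string similarity, or None."""
    matches = difflib.get_close_matches(formula, FORMULA_MAP.keys(), n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A slightly misspelled formula is mapped to its closest known counterpart.
print(find_similar_formula("Fe3O44"))  # -> "Fe3O4", given the example mapping
```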
In the prediction function predict_property, the target catalytic properties (Km, Vmax, and kcat) are iteratively set as target columns. Each target property represents a distinct output of the regression model, whereas the remaining features form the input matrix X.
The data are divided into training and testing sets, with an 80/20 split ratio. This ensures that the model is trained on a representative portion of the dataset while retaining a portion for independent evaluation. The random seed for splitting is set to 7 for reproducibility.
To normalize the feature space, MinMaxScaler is used to scale the input features to a range of [0, 1]. Owing to their skewed distributions, the target variables are log-transformed to approximate a normal distribution. This log transformation is reversed at the prediction stage to provide interpretable outputs.
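The snippet below sketches this preprocessing on toy data; log1p/expm1 is used here as one common, invertible log transform, since the exact transform applied in the study is not specified, and the feature and target values are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix and skewed target (placeholders for the real regression data).
X_raw = np.array([[3.15, 4.0, 25.0],
                  [10.0, 7.0, 37.0],
                  [25.0, 5.5, 25.0]])
y_raw = np.array([0.12, 4.8, 310.0])   # e.g. Km values spanning orders of magnitude

# Scale features to [0, 1] as described above.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_raw)

# Log-transform the skewed target; log1p is one common, invertible choice.
y_log = np.log1p(y_raw)

# At prediction time the transform is reversed with expm1 to recover original units.
y_back = np.expm1(y_log)
```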
A GradientBoostingRegressor is employed as the regression model. Owing to its high accuracy and ability to handle complex interactions, gradient boosting is suitable for this application. The key parameters include:
n_estimators is set to 197, controlling the number of boosting stages.
A learning_rate of 0.01 adjusts each tree's contribution to the final prediction.
max_depth is set to 54, allowing the model to capture deep relationships.
min_samples_split and min_samples_leaf values (5 and 2, respectively) control the minimum data required for splits and leaves.
max_features is capped at 22 to prevent overfitting by limiting the number of features considered at each split.
The loss function is set to squared_error for regression, and the criterion is set to friedman_mse to improve the split quality.
The model is trained on the X_train and y_train datasets, fitting it to predict catalytic property values from the input features, as sketched below.
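The sketch below assembles a GradientBoostingRegressor with the hyperparameters listed above and fits it on synthetic stand-in data (the real feature matrix must contain at least 22 columns for the max_features setting to be valid); the data, the example prediction, and the expm1 back-transform are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: 200 samples x 25 features (the real feature matrix
# built from composition, morphology, and reaction conditions is wider than 22).
rng = np.random.default_rng(7)
X_scaled = rng.random((200, 25))       # features already scaled to [0, 1]
y_log = rng.normal(size=200)           # log-transformed target (e.g. log Km)

# Regressor configured with the hyperparameters listed above.
reg = GradientBoostingRegressor(
    n_estimators=197,
    learning_rate=0.01,
    max_depth=54,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features=22,
    loss="squared_error",
    criterion="friedman_mse",
)
reg.fit(X_scaled, y_log)               # one such model is fitted per target (Km, Vmax, kcat)

# Predict for a new (already scaled) sample and undo the log transform.
new_sample = rng.random((1, 25))
pred = np.expm1(reg.predict(new_sample))   # back to the original unit scale
print(pred)
```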
In predict_advanced, the user provides nanozyme properties as inputs. A new sample is constructed as a DataFrame, scaled with the previously fitted scaler, and passed to the trained regressor. The model outputs predictions for each target property, which are then transformed back from the log scale to their original scale for interpretability.
The model outputs for Km, Vmax, and kcat are reported as distinct predicted values. These predictions provide insights into the catalytic efficiency and potential applications of novel nanozymes on the basis of their physicochemical properties, thereby enhancing our understanding and design of enzyme-mimicking nanomaterials.
This machine learning model demonstrates a robust approach for predicting nanozyme catalytic properties, facilitating nanozyme development by providing predictive insights into key reaction characteristics on the basis solely of compositional and structural parameters.
Because manually searching for nanozyme-related data is inefficient, ChatGPT, an advanced natural language processing tool, has gradually been applied to various scientific research tasks. In nanozyme research, it plays the role of a "copilot," helping researchers process and analyze vast amounts of literature data more efficiently. ChatGPT is used to automatically extract key information from large volumes of scientific literature, including identifying and extracting experimental parameters, research results, and relevant nanozyme functional data. Through its pretrained model, ChatGPT can understand complex scientific terminology and context, accurately extracting the targeted information. Compared with traditional manual extraction methods, ChatGPT offers significant advantages in terms of data extraction efficiency and accuracy. According to the test results, ChatGPT achieves an accuracy rate of 67.55% in extracting data from the literature, which can significantly reduce the manual labor of researchers while also minimizing the possibility of human error.
ChatGPT is not only capable of extracting information but can also perform preliminary semantic analysis and organization of the extracted data. For example, it can correlate and integrate data from different studies on the same research topic, providing researchers with a more comprehensive perspective. ChatGPT can also identify data trends and patterns in the literature and classify and annotate data when necessary, making complex data easier to understand and use.
During the research design phase, ChatGPT can offer real-time suggestions and support. For example, in nanozyme function prediction and experimental design, researchers can interact with ChatGPT to receive advice on experimental methods, parameter selection, and data analysis. These suggestions are based on ChatGPT's ability to learn and understand extensive literature data, which can help researchers optimize experimental design, thereby increasing the success rate of experiments and the accuracy of the data.
ChatGPT can also serve as a bridge for communication between different disciplines. In nanozyme research, which involves multiple fields, such as chemistry, materials science, and biology, researchers may face differences in terms and concepts across different domains. ChatGPT can assist researchers in understanding and translating these terms, thus promoting interdisciplinary collaboration and improving research efficiency.
As a "copilot," ChatGPT plays a crucial role in nanozyme research, significantly enhancing research efficiency and accuracy. It is not just a tool but an intelligent assistant capable of providing support and suggestions throughout various stages of the research process. The use of ChatGPT for data integration and standardization lays the foundation for machine learning algorithms.