Evaluation of ChatGPT’s Answering Capabilities in Natural Science and Engineering Domains: A Study at Delft University of Technology
In our recent study, we examined the capabilities of ChatGPT in the natural science and engineering domains. Participants came from different faculties at Delft University of Technology and included assistant, associate, and full professors, lecturers, Ph.D. students, postdoctoral researchers, and others.
Our evaluation assessed ChatGPT’s answering capabilities across various skill categories and educational levels. The results, depicted in Figure 1, highlight several key findings. First, ChatGPT received higher scores for basic and scientific skills than for skills beyond scientific knowledge. Participants rated the relatedness of the answers to the questions and the level of English highly. However, the model’s critical attitude received the lowest score among the assessment criteria, suggesting that its results require further verification.
Moreover, the assessment of scientific correctness revealed that ChatGPT provides mostly correct answers to Bachelor-level questions but only partly correct answers to Master- and Ph.D.-level questions. Notably, participants mentioned a wide range of potential impacts of the generated answers, from environmental to safety concerns.
Further analysis of the study variables, including skill categories and educational levels, showed that they significantly influence the assessment scores: scientific skills were rated higher than skills beyond scientific knowledge, and answers to questions at lower educational levels received better ratings. Faculty, by contrast, had no significant influence on the assessment ratings.
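For readers curious about what such an analysis might look like in practice, the sketch below shows one possible way to test whether skill category, educational level, and faculty influence the ratings. It is purely illustrative: the file name, column names, and the use of an ordinary least squares model are assumptions, not the study’s actual method or data.

```python
# Hypothetical sketch only: the study's actual statistical procedure may differ.
# Assumes a CSV "ratings.csv" with columns: score, skill_category, level, faculty
# (all names are illustrative, not taken from the study).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ratings.csv")

# Fit a linear model with the three categorical predictors and inspect
# which factors show a statistically significant influence on the score.
model = smf.ols("score ~ C(skill_category) + C(level) + C(faculty)", data=df).fit()
print(model.summary())
```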
The study also collected free-text comments from participants, providing additional insight into the perceived quality of ChatGPT’s answers. Comments ranged from critiques of a lack of detail to comparisons with student answers. Some participants raised concerns about the sources of ChatGPT’s training data and their implications for the generated answers. Emotional reactions were also observed, with a mix of neutral, positive, and negative sentiments expressed in the comments.
Overall, our study sheds light on the strengths and weaknesses of ChatGPT in answering questions in natural science and engineering. While the model demonstrates competence in certain areas, further improvements are needed, particularly in critical thinking and scientific correctness. As AI continues to shape the future of education and research, studies like ours provide valuable insights for enhancing the capabilities of AI-powered tools in academic settings.