Can AI Pass an Exam or Test? / ITech content

Over the past six months, there has been a steady stream of news about ChatGPT and similar bots built on large language models passing difficult online exams, including tests in medicine and business management. At first glance, this looks like a triumph of artificial intelligence over human capabilities: if AI passes such tests, it may seem to truly possess skill and understanding.

Jeremy Rochelle, Executive Director of Learning Sciences Research at Digital Promise and a member of the International Society of the Learning Sciences, disagrees that the abilities of artificial intelligence can be compared with human ones. In his column for the Association for Computing Machinery blog, he argues that such comparisons are inappropriate, especially in the context of exams. AI and humans approach learning and assessment in fundamentally different ways: artificial intelligence operates on algorithms and data, while human cognition also involves emotional and social aspects. Comparing AI and human performance in an educational context therefore does not reflect the real picture.

Jeremy Rochelle is Executive Director of Learning Sciences Research at Digital Promise and a member of the International Society of the Learning Sciences. He has significant experience in educational technology and research. His work focuses on improving the quality of education through innovative methods and approaches, and he is actively involved in developing and implementing educational practices that advance the science of learning. His membership in the society underscores his commitment to international collaboration and knowledge sharing in this field.

Why You Shouldn't Trust ChatGPT Test Results

Jeremy Rochelle suggests recalling how exams are created, especially in the American education system, whose tests ChatGPT sometimes passes so successfully. Understanding this process shows what a test score is actually designed to measure, and therefore how much it can say about artificial intelligence.

Testing is based on psychometrics and modern assessment methods. These methods allow us to determine the probability of respondents' correct answers to tasks of varying difficulty. The test development process begins with the creation of an extensive bank of exam items. These items are then tested on a group of real students, not machines. Based on the results, specialists evaluate how effectively the test can distinguish the level of knowledge and abilities of participants in a particular area. It is important that test items truly reflect the abilities of test takers. Therefore, items that do not provide information about differences in knowledge are excluded from the exam, and those that effectively accomplish this task are retained. This approach ensures high accuracy and reliability of testing.
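The item-selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure any particular exam developer uses: the function names, the toy `pilot` data, and the 0.2 cutoff are all hypothetical, and discrimination is approximated here by the correlation between an item and the score on the remaining items (a classical item-analysis statistic, simpler than the probabilistic models used in modern test theory).

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)

def item_analysis(responses):
    """responses[s][i] is 1 if student s answered item i correctly, else 0."""
    report = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        # Score on all *other* items, so the item is not correlated with itself.
        rest = [sum(row) - row[i] for row in responses]
        report.append({
            "item": i,
            "difficulty": mean(item),               # share of correct answers
            "discrimination": pearson(item, rest),  # does it separate strong from weak?
        })
    return report

def select_items(responses, min_discrimination=0.2):
    """Keep only items that distinguish between test takers (hypothetical cutoff)."""
    return [r["item"] for r in item_analysis(responses)
            if r["discrimination"] >= min_discrimination]

# Made-up pilot results: 6 human students (rows) x 4 items (columns).
pilot = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

print(select_items(pilot))
```

With this toy data, item 0 is dropped because every student answers it correctly, so it carries no information about differences, and item 3 is dropped because stronger students tend to miss it. Note that every statistic here is computed from human responses, which is exactly why the resulting test is calibrated for humans only.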

The validity of the test as a measure of human ability is assessed based on empirical data. It is important to note that modern test theory does not guarantee that this validity applies to non-human subjects, such as artificial intelligence algorithms or hypothetical aliens. Since AI models respond to test tasks differently than humans, it cannot be assumed that a high test score indicates a high level of intelligence in an AI model. Modern test theory does not have the necessary data to accurately distinguish between highly intelligent and less intelligent AI models.

The researcher points to another feature of tests that complicates comparisons between the abilities of "robots" and humans: conclusions that developers draw from a limited number of tasks and formats require confirmation. The results must be compared with other, independent measures of the same ability, and only if those measures agree can the conclusions be considered sound. However, those other measures are themselves tied to human abilities, knowledge, and skills rather than to artificial intelligence. A full analysis must therefore take into account the context and diversity of human experience, which makes comparisons with AI even harder.

Still: the film "Robot and Frank" / Dog Run Pictures / Park Pictures

Jeremy Rochelle emphasizes that there is no guarantee that conclusions drawn from a specific set of tasks, such as a legal exam, will hold true for non-human subjects. A high score on such an exam therefore does not show that an AI model has the knowledge, skills, and abilities necessary for a successful career in law.

Why Is AI an Amateur, Not an Expert?

If tests are not designed for artificial intelligence, why do AI-powered chatbots pass them? The answer is that many tests are standardized. They share a similar form, structure, and content, which makes the task much easier for artificial intelligence: it can recognize recurring patterns and answer more accurately. To truly assess AI capabilities, tests would have to be adapted to more complex and varied scenarios.

The expert notes that he is more impressed by ChatGPT's ability to interact with people in unstructured dialogues than by its performance on standardized tests. Standardized tests have clear frameworks and a predictable structure, making them less indicative of real-world skills. The question is why someone would consider an AI model that successfully performs on standardized tests more effective than one that demonstrates high performance in complex and non-standard situations. This highlights the importance of AI flexibility and adaptability in real-world interactions.

Jeremy Rochelle illustrates why artificial intelligence cannot be considered an expert with a story about a fellow traveler he once met, a house painter with an interest in physics. The painter had studied physics on his own from encyclopedias and tried to connect various topics, but his knowledge lacked a systematic approach. He possessed a certain erudition in the field, yet the main difference between him and a true physics expert was obvious: he couldn't organize his knowledge into a coherent framework. The depth of scientific understanding is determined precisely by a professional's ability to grasp the logic and interrelationships of phenomena and link them to the fundamental principles of physics. In Rochelle's view, artificial intelligence is in the same position: despite its information processing capabilities, it does not reach the level of understanding of a qualified specialist.

Photo: film "Mind Games" / Fastnet Films / Icon Entertainment

Modern large language models resemble amateur physicists: they can answer questions, but lack a deep understanding of the subject. Neural networks handle sequences of words in sentences with confidence, but they lack true competence in the topics under discussion. According to Jeremy Rochelle, these systems are still a long way from genuine understanding.

The ability of an artificial intelligence model or algorithm to pass a "human" test is not a reliable indicator that its knowledge is comparable to an expert's. Currently, generative AI is more like the house painter than we are willing to admit. Reports of AI passing tests are misleading because they oversimplify what expert knowledge in a field actually is. Deep understanding and experience cannot be replaced by the superficial level of skill that artificial intelligence demonstrates.

Rochelle is convinced that this problem must be actively addressed. Scientists must communicate to a wider audience that exams are not a reliable tool for assessing the strengths of artificial intelligence, and that comparing its results to human ones is often inappropriate. Education specialists also have a key role to play in this process: their task is to develop new exam formats and methods for assessing skills and knowledge that will more accurately reflect real-world competencies.
