EducAItion: 1/3 of the way through a Master’s Thesis?

What does the future hold for university assessments in the face of generative AI?

1 Context

In the face of generative AI and LLMs – most notably ChatGPT – students and professors are questioning the purpose and efficacy of traditional assessment methods. Before COVID-19, exams were mostly done the standardised way: closed book, in person. During the pandemic, universities grappled with the shift to 'virtual' and 'e-learning' delivery [19]. This required rethinking exam structure not only to prevent cheating, but also in an attempt to retain the accuracy with which exams communicate student comprehension to teachers and fairly reflect student achievement to various other stakeholders – shown in Figure 1. Just as universities were reining students back in for face-to-face delivery, ChatGPT made itself readily available. Given this context, the brief literature review for the project thus far is split into four sections in an attempt to cover all implications of my proposal: The Philosophy of Education, Oral Assessments, AI and Education, and some of the Technical Stuff. After this review, and more broadly after contemplating the purpose of exams, the value of university education, and the human skills and abilities that will be invaluable in the future, I propose a return to the lost art of the oral exam, or 'viva voce' [17]... though not without the significant help of LLMs.

1.1 The Philosophy of Education

The demographics of who has been educated, and who has been in charge of educating, have changed considerably throughout history. After the Socratic method of questioning and asserting knowledge through debate and dialogue came philosophers such as Dewey and Locke, who championed critical thinking, reflection, and a personalised approach to teaching, and who opposed rote memorisation [8]. These same themes are present today, with educators promoting active learning, moving away from 'schoolification', and focusing on better preparing students for life [11]. These same educators are harsh critics of the standardised exam, which has been shown to have negative impacts on teaching and learning. It is widely agreed that the nuances of student ability cannot be captured solely through quantitative measures, and that students inherently perform better and engage more thoroughly with teachings when they can connect not only with the content but with the professor delivering it [23]. When inspecting education systems around the world and how they differ, there is privilege and prestige attached to smaller teacher-to-student ratios, private tutoring, and access to personalised learning. Oxford is famous for its 'tutorials', in which students of all disciplines sit in groups of one or two pupils with their professor to discuss papers, problem sheets, theorems, and the like [15]. By enlisting AI and LLMs, there is potential to make this personalisation more accessible and to empower professors everywhere to engage more with their students.

1.2 Oral Assessments

Until the 1600s, oral assessments were publicly conducted in Latin. Institutions in the UK kept up this practice until the question of how to deliver education to larger populations, and assess those populations more efficiently, eclipsed it and the standardised written exam emerged. This was unfortunate. When done right, oral exams can be more humane than written assessments. They make it possible to test a student's 'intellectual agility' and their ability to synthesise in a way that isn't possible with written exams [30]. Numerous papers, opinion pieces, forums, and universities vouch for oral exams – citing their more personal and dynamic nature, the on-the-spot feedback they provide to students, their being 'kinder but more thorough', and their role in developing speaking skills and in handling the anxieties that arise, which is crucial for the workplace and interview experience post-graduation [9, 6, 7, 27, 1]. Research has been done on how to structure these exams to reduce bias, variability, and student stress – some of the main reasons, apart from professors simply not having the time, that oral exams aren't more widely used today [14, 29, 20, 26, 12].

1.3 AI and Education

Technology in general can be seen as a 'means to an end' in improving the quality and deliverability of university education – and, for this project, its assessment [22]. Looking to AI specifically, many leaders in industry, Ed Tech, and research agree that the biggest risk "[with AI in education] is doing nothing", and "teaching [and assessing] the exact same subjects the exact same way" moving forward [16]. Integrating AI into university education isn't important just for delivering an enhanced experience, but for preparing humans to live and work with AI in the future [2, 13]. Professors are split over how to handle AI in the classroom, and research is being done to establish exactly what GPT and other LLMs can and can't do [21], as well as how to adapt assessments [10, 3]. Nearly everyone agrees teachers shouldn't be displaced by machines; we should be looking to develop AI tools that support adaptive learning processes and embrace the positive changes AI brings to higher education without ignoring its dangers [24, 25].

Figure 1: Stakeholders in a university exam

1.4 The Technical Stuff

Finally, literature pertaining to the feasibility of this project was reviewed. OpenAI's Assistants API will be used in the development of the prototype, as its capabilities, thread lengths, and file upload limits are all satisfactory [4]. Apart from GPT, open-source models have been developed using methods such as Retrieval-Augmented Generation (RAG) to test conversational ability and understanding in specific contexts [18]. Competitions have been run to develop dialogue tools using AI, benchmarking larger models against smaller fine-tuned counterparts. In general, the literature concludes that the field lacks evaluation metrics specific to educational contexts [28]. The most relevant findings for this project come from Sherpa, a startup born out of Stanford's Piech Lab and run by two undergraduate students [5]. With this tool, users upload files and questions are generated; students record their responses, which are marked and returned with feedback. I spoke with one of the founders, who offered to be of help during my project. He confirmed they run the tool with GPT-4, and that it is used formatively at the secondary school level.
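To make the RAG idea mentioned above concrete, a very small illustrative sketch is included below. Everything in it is a toy placeholder – the course notes, the keyword-overlap retrieval, and the final prompt – and a real system would use embedding-based search and an actual LLM call rather than a print statement.

# Illustrative-only sketch of retrieval-augmented generation (RAG):
# retrieve the most relevant course snippet, then condition the model's answer on it.
course_notes = {
    "gradient descent": "Gradient descent updates parameters in the direction of the negative gradient.",
    "overfitting": "Overfitting occurs when a model fits noise in the training data rather than the signal.",
}

def retrieve(question: str) -> str:
    """Naive keyword retrieval: return the note whose key words overlap the question most."""
    best_key = max(course_notes, key=lambda k: sum(w in question.lower() for w in k.split()))
    return course_notes[best_key]

question = "What is overfitting and how can we avoid it?"
context = retrieve(question)

# The retrieved context would be prepended to the prompt sent to the LLM.
prompt = f"Using the course notes below, answer the student.\n\nNotes: {context}\n\nQuestion: {question}"
print(prompt)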

2 Aims and Objectives

The project's aim stems from epistemological questions. For a student: 'how is my learning preparing me for the next step in life?' and 'what/how am I learning?'. For professors: 'am I doing my job?' and, more recently, 'what does my job mean in this world of generative AI?'. This project will focus on university STEM examinations – directly relevant to my own degree. I aim for my tool to help professors create appropriate questions for oral exams, then record, analyse, mark, and generate feedback for their students. The project should address the concerns with question generation, subjectivity, and time constraints that have hindered oral assessments to date. I will explore whether these LLM-assisted oral exams not only resolve those issues, but facilitate greater metacognition and personalisation than written exams, while relieving the workload and stress of professors, allowing them to focus on what is most important: connecting with their students. I also aim to report whether students feel more positively about their knowledge retention and the demonstration of their learning after taking an oral exam and receiving marks with the use of the tool. Validation will come from student trials. Using a working prototype, students from the same module at the Dyson School will be asked to participate. Based on this first trial, analysis will be done: marks awarded by the tool will be compared with the students' actual marks from the module lead, and professors will be consulted to critique the tool's marking as well as to give feedback on the process. Following this, adjustments will be made to the tool and, time and circumstance allowing, students will be re-trialled or the tool will be shared with a wider audience for further validation.

3 Work Plan

The project will be divided into several phases:
1. Development of the logic and back end, using the Whisper and GPT APIs.
2. Creation of a user-friendly front end.
3. Design of data storage and security for exam transcripts and generated marks.

4. Design and execution of the experimental study.
5. Data analysis and interpretation of results.
6. Iterative improvements based on feedback and analysis.
7. External testing and feedback collection.
8. Final report writing and conclusion.

Each of these steps is elaborated below.

3.1 Development

This portion of the project will be split into smaller tasks such as further refining the system architecture, seen in Figure 2, programming the initial logic, refining the feedback of GPT-4 in the context of different exam situations, and setting up the live transcription with Whisper. I have tested GPT's capabilities as an 'oral examiner' in marking and giving personalised feedback, as well as using Whisper to write transcriptions to text files. The results of that trial can be viewed here: GPT 'Oral Exam' Trial. These very preliminary results were encouraging and showed that the technologies should scale for the rest of the project. The logic will be written in Python in VS Code, and I will use GPT-4 via OpenAI's Assistants API alongside OpenAI's Whisper transcription model.

Figure 2: v0.0001 of the system architecture diagram
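As a very preliminary sketch of the transcribe-then-mark flow described above – written against the openai Python package (v1+), using the plain Chat Completions endpoint for brevity rather than the Assistants API the prototype will actually use, and with the audio file name, question, and marking instructions as placeholders – the core pipeline could look something like this:

# Minimal sketch: transcribe a recorded answer with Whisper, then mark it with GPT-4.
# Assumes OPENAI_API_KEY is set; file name, question, and rubric are placeholders.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the student's recorded answer.
with open("student_answer.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. Ask GPT-4 to mark the transcript and generate feedback.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": (
            "You are an oral examiner for a university STEM module. "
            "Mark the answer out of 10 and give short, constructive feedback."
        )},
        {"role": "user", "content": (
            "Question: Explain how gradient descent works.\n\n"
            f"Student's answer:\n{transcript.text}"
        )},
    ],
)

print(response.choices[0].message.content)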

3.2 UI/UX

The design principles learned so far in my degree will be put to use in the UI and UX of this tool. It will be a simple and clean interface. I plan to do mock-ups and intermediate user testing to see how professors prefer to interact with the tool, and how best to package results for students and professors. Appropriate time has been built into the Gantt chart in Figure 3 below to account for prototyping and iterations on the front end.

3.3 Data Security

Even if this tool isn't fully functional and ready for use by the end of the project timeline, it is still important to plan and factor in the architecture for data storage and transfer, especially when handling sensitive material such as exam transcripts and student marks. Login will be verified against Imperial credentials for security. The professors' interface will include question generation and checking, and a view for starting the transcription. At the conclusion of the exam, recordings and transcripts will be stored in a folder along with the marks and feedback. Students' interfaces will also store their own recording and, once released by the professor, their marks.
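A rough sketch of how one exam record might be stored per student is given below; the folder layout, field names, and JSON format are all hypothetical placeholders, and the final schema will need to respect Imperial's data-handling requirements.

# Hypothetical sketch of per-student exam record storage
# (exam_data/<module>/<student_id>/record.json).
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class ExamRecord:
    student_id: str
    module: str
    transcript: str
    mark: float
    feedback: str
    released: bool = False  # hidden from the student until the professor releases it

def save_record(record: ExamRecord, root: Path = Path("exam_data")) -> Path:
    """Write one exam record to disk and return its path."""
    folder = root / record.module / record.student_id
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "record.json"
    path.write_text(json.dumps(asdict(record), indent=2))
    return path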

3.4 Testing

As mentioned in Section 2 Aims and Objectives, validation from students is central to this project. As the tool is first developed, it will be tested with the Machine Learning (ML) and/or Design Analytics for the Sharing Economy (DSAE) modules. Trialling either of these modules is crucial for the first round of development because Bob Shorten already uses the oral assessment method, so comparisons can be drawn with typical oral exams as well as written ones. These students are also further along in their degree and would likely offer more reflective critiques. Bob Shorten would also be able to review the feedback and provide input on student responses given his years of experience conducting oral exams, as well as comment on the actual value added to his workflow. After the initial development, I plan to trial with the DE1 Engineering Mathematics course. Apart from my supervisor instructing the course, it represents validation that the tool could truly enable oral exams to find application in modules that are traditionally written. The data from a Maths trial would be doubly interesting, as DE1 students have spent nearly all of their university experience with LLMs like ChatGPT available and most likely haven't yet had an oral exam at university. Arguably, this would also add the most value and novelty for professors if successful. Depending on how specialised the tool becomes during development, all three modules – and perhaps others – could be accounted for. At least ten students will be asked to participate and given context on the project; this is to ensure the results are as close to a 'live examination' environment as possible. They will be asked to voluntarily share their actual marks from module exams for comparison, and will be interviewed afterwards about their experience.

3.5 Analysis

Transcripts from the exams will be stored and, with permission from the participants, the marks and feedback from the tool will be shared with the module professor to check for variance against how the responses would actually be marked in a live setting. This data will be interpreted, and the professor's input will be taken into consideration alongside student reflections when making adjustments to the tool – namely to the feedback it provides.
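A minimal sketch of the planned mark-agreement check follows; the numbers are made-up placeholders, and the real comparison will use the marks participants volunteer alongside the module lead's marking.

# Compare marks awarded by the tool with the professor's marks (placeholder data).
from statistics import mean, correlation  # statistics.correlation needs Python 3.10+

tool_marks = [6.5, 8.0, 7.0, 5.5, 9.0]       # awarded by the tool
professor_marks = [7.0, 8.0, 6.5, 6.0, 8.5]  # awarded by the module lead

mean_abs_diff = mean(abs(t - p) for t, p in zip(tool_marks, professor_marks))
pearson_r = correlation(tool_marks, professor_marks)

print(f"Mean absolute difference: {mean_abs_diff:.2f} marks")
print(f"Pearson correlation:      {pearson_r:.2f}")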

3.6 Packaging and Report Writing

The timeline for this project is quite important, especially considering the element of student participation. The first prototype should be fully functional by the end of Term 2 so that the trials can be carried out before the Easter holidays. While away, I will carry out the analysis and make improvements to the tool. This leaves a buffer of roughly one month in the summer term to account for setbacks, wrap up the report, and ideally share the work with a wider audience for further input.

3.7 Considerations

Besides the general concerns of things going awry during the project – mostly the actual technical implementation – there are a few other things to consider, some having been mentioned already.

  • The £200 budget for the project should be more than enough to cover the API calls, and to entice a few students to be a part of the trial, if need be.

  • Time will be an issue if I procrastinate too much, but otherwise it should be quite feasible to produce at least a working prototype, with the time this term being most important for building and development.

  • Not really knowing how to code at all, I am a bit nervous and conscious that my work will not be optimal or efficient. However, I will learn a lot in the process, and I am surrounded by friends who are all better at coding than I am, so I am not poor in resources.

  • This project also relies heavily on a professor (or more than one) to be available to devote some time to looking over the results, and ideally conducting some mock exams towards the end. I recognise the value of professors’ time and will factor this in as well.

  • Finally, it is always appropriate to have a 'back-up' or contingency plan if things fail. With this project, I am confident that at the very least I will have gathered plenty of interesting insight from students, researchers, and mentors in the field, and will have spent much more time with ChatGPT than I would have thought. This should make for a final year project that is, at the very least, reflective and hopefully humorous.

Figure 3: Project Gantt Chart

References

[1] Active learning increases student performance in science, engineering, and mathematics.

[2] Beijing Consensus on Artificial Intelligence and Education. UNESCO Digital Library.

[3] Is education ready for artificial intelligence? Cambridge Assessment Insights.

[4] OpenAI Platform.

[5] Sherpa.

[6] STEM Exams Can Be Inspiring and Challenging With Less Anxiety: A Multi-Institution Research Study Into the 'Public Exam System'.

[7] Students' views of oral performance assessment in mathematics: straddling the 'assessment of' and 'assessment for' learning divide.

[8] A Brief History of Education (& Educational Technology). Technology for Learners, May 2018.

[9] Oral Assessments: Benefits, Drawbacks, and Considerations, November 2022.

[10] New principles on use of AI in education, November 2023.

[11] No grades for undergraduate students?, March 2023.

[12] Resist AI by rethinking assessment, March 2023.

[13] Why is education more important today than ever? Innovation, February 2023.

[14] Ayesha Ahmed, Alastair Pollitt, and Leslie Rose. Assessing Thinking and Understanding: Can Oral Assessment Provide a Clearer Perspective?

[15] John H. Alschuler. The historical roots of educational innovation. PhD thesis, University of Massachusetts Amherst, 1973.

[16] Code.org. AI 101 For Teachers: Fireside Chat with Sal Khan and Hadi Partovi, August 2023.

[17] Stephen Dobson. Why universities should return to oral exams in the AI and ChatGPT era, April 2023.

[18] Andy Extance. ChatGPT has entered the classroom: how LLMs could transform education. Nature, 623(7987):474–477, November 2023.

[19] Anna Hurajova, Daniela Kollarova, and Ladislav Huraj. Trends in education during the pandemic: modern online technologies as a tool for the sustainability of university education in the field of media and communication studies. Heliyon, 8(5):e09367, May 2022.

[20] Dredge Kang, Sara Goico, Sheena Ghanbari, Kathleen Bennallack, Taciana Pontes, Dylan O'Brien, and Jace Hargis. Providing an Oral Examination as an Authentic Assessment in a Large Section, Undergraduate Diversity Class. International Journal for the Scholarship of Teaching and Learning, 13(2), May 2019.

[21] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, October 2023. arXiv:2308.03688 [cs].

[22] George R. McMeen. The Impact of Technological Change on Education. Educational Technology, 26(2):42–45, 1986.

[23] Jerry Z. Muller. The Tyranny of Metrics. Princeton University Press, April 2019.

[24] Stefania Palma, Yuan Yang, and Anna Gross. Chinese AI scientists call for stronger regulation ahead of landmark summit. Financial Times, November 2023.

[25] Ben Platt. Now the Humanities Can Disrupt "AI", February 2023.

[26] Laura Roberts and Joanne Berry. Should open-book, open-web exams replace traditional closed-book exams in STEM? An evaluation of their effectiveness in different disciplines. Journal of Learning Development in Higher Education, (28), September 2023.

[27] Brian K. Sato, Cynthia F. C. Hill, and Stanley M. Lo. Testing the test: Are exams measuring understanding? Biochemistry and Molecular Biology Education, 47(3):296–302, 2019.

[28] Anaïs Tack, Ekaterina Kochmar, Zheng Yuan, Serge Bibauw, and Chris Piech. The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues. In Ekaterina Kochmar, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Nitin Madnani, Anaïs Tack, Victoria Yaneva, Zheng Yuan, and Torsten Zesch, editors, Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 785–795, Toronto, Canada, July 2023. Association for Computational Linguistics.

[29] UCL. Oral assessment, August 2019.

[30] Molly Worthen. Opinion | If It Was Good Enough for Socrates, It's Good Enough for Sophomores. The New York Times, December 2022.
