AI for Korean SAT test-taking, aiming for top scores
This project aims to achieve a perfect score on the Korean Language section of the CSAT (College Scholastic Ability Test), and is aimed at students and researchers interested in LLM performance on standardized tests. It provides a framework for evaluating various LLMs and prompting strategies on past CSAT exams, offering insight into their capabilities and limitations.
How It Works
The project leverages the AutoRAG framework to benchmark LLMs against Korean CSAT exams. It involves data preparation, where PDF CSAT papers are parsed into question, passage, and answer formats. Various prompt engineering techniques, including zero-shot CoT, Plan-and-Solve, and custom "Format Specification" prompts, are applied. The output from each LLM is then manually scored against the official answers, with detailed performance metrics reported for each model and prompt combination.
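To make the prompting-and-scoring loop concrete, here is a minimal sketch of the zero-shot CoT stage. The function names, prompt wording, and answer-extraction heuristic are illustrative assumptions, not the project's actual code; the repository's real prompts and its AutoRAG integration are more involved.

```python
import re

def build_zero_shot_cot_prompt(passage: str, question: str, choices: list[str]) -> str:
    # Standard zero-shot CoT framing applied to a passage/question/choices
    # triple; the project's actual prompt wording may differ.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(choices))
    return (
        f"Passage:\n{passage}\n\n"
        f"Question: {question}\n"
        f"Choices:\n{numbered}\n\n"
        "Let's think step by step, then state the answer as a single choice number."
    )

def extract_choice(model_output: str) -> int | None:
    # Heuristic: take the last digit 1-5 in the output as the chosen answer.
    matches = re.findall(r"[1-5]", model_output)
    return int(matches[-1]) if matches else None

def score(predictions: list[int | None], answer_key: list[int]) -> float:
    # Fraction of items where the extracted choice matches the official key.
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return correct / len(answer_key)
```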
Quick Start & Requirements
Install dependencies with pip install AutoRAG and pip install -r requirements.txt. API keys are configured in a .env file. Run the full benchmark with:

python run_all.py --dir_name=<save_path> --model_name=<model> --start_year=<year> --end_year=<year>

Supported models include gpt-4, gpt-3.5-turbo-16k, and synatra (7B).
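A minimal driver sketch, assuming the common python-dotenv pattern for the .env file; the OPENAI_API_KEY variable name, the results/ save path, and the year range are placeholders, not values from the repository:

```python
import subprocess
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # export keys from .env (e.g. OPENAI_API_KEY, an assumed name)

# Invoke the documented run_all.py CLI once per model.
for model in ["gpt-3.5-turbo-16k", "gpt-4"]:
    subprocess.run(
        [
            "python", "run_all.py",
            f"--dir_name=results/{model}",  # placeholder save path
            f"--model_name={model}",
            "--start_year=2015",            # placeholder year range
            "--end_year=2023",
        ],
        check=True,
    )
```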
Highlighted Details
Maintenance & Community
This project is developed by NomaDamas, a team from the POSTECH gifted education center's AI program. Contact information for the supervising professor and team members is provided.
Licensing & Compatibility
The project states that copyright for CSAT questions belongs to the Korea Institute for Curriculum Evaluation (KICE) and that the public availability of the 10-year dataset is under inquiry.
Limitations & Caveats
The project's benchmark results are subject to variation due to the inherent stochastic nature of LLMs, even with fixed parameters. Some models, such as GPT-4-32k and Claude, were not fully tested due to access limitations at the time of development. The performance of Synatra-7B on Korean tasks is significantly lower than that of the GPT models, suggesting limited effectiveness for this use case without further fine-tuning.
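As context for the stochasticity caveat: even greedy-style settings do not make OpenAI chat completions fully deterministic. A sketch of the usual mitigations, using the OpenAI Python client's temperature and best-effort seed parameters:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# temperature=0 and a fixed seed reduce, but do not eliminate,
# run-to-run variation; OpenAI documents seed as best-effort only.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Answer with a choice number: ..."}],
    temperature=0,
    seed=42,
)
print(response.choices[0].message.content)
```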
Last updated 11 months ago; the repository is marked inactive.