KICE_slayer_AI_Korean by NomaDamas

AI for Korean SAT test-taking, aiming for top scores

Created 2 years ago
533 stars

Top 59.6% on SourcePulse

Project Summary

This project aims to achieve a perfect score on the Korean Language section of the Korean CSAT (College Scholastic Ability Test). Aimed at students and researchers interested in LLM performance on standardized tests, it provides a framework for evaluating LLM models and prompting strategies on past CSAT exams, offering insight into their capabilities and limitations.

How It Works

The project leverages the AutoRAG framework to benchmark LLMs against Korean CSAT exams. It involves data preparation, where PDF CSAT papers are parsed into question, passage, and answer formats. Various prompt engineering techniques, including zero-shot CoT, Plan-and-Solve, and custom "Format Specification" prompts, are applied. The output from each LLM is then manually scored against the official answers, with detailed performance metrics reported for each model and prompt combination.
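
As a concrete illustration of the per-question step, the minimal sketch below builds a zero-shot-CoT-style prompt from one parsed passage/question/choices item and sends it to the OpenAI chat API. The helper name ask_csat and the prompt wording are illustrative assumptions, not the repository's actual templates (several of which are in Korean):

    # Minimal sketch, not the repo's code: one CSAT item -> one model answer.
    # Assumes openai>=1.0 and OPENAI_API_KEY set in the environment (e.g. via .env).
    from openai import OpenAI

    client = OpenAI()

    def ask_csat(passage: str, question: str, choices: list[str], model: str = "gpt-4") -> str:
        # Number the answer choices 1..5 the way the exam presents them.
        numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(choices, start=1))
        prompt = (
            f"Passage:\n{passage}\n\n"
            f"Question: {question}\n{numbered}\n\n"
            "Let's think step by step, then give the number of the correct choice."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # fixed parameters still leave some run-to-run variation
        )
        return response.choices[0].message.content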

Quick Start & Requirements

  • Install using pip install AutoRAG and pip install -r requirements.txt.
  • Requires an OpenAI API key set in a .env file.
  • Run benchmarks using python run_all.py --dir_name=<save_path> --model_name=<model> --start_year=<year> --end_year=<year>; a combined setup example follows this list.
  • Supported models include gpt-4, gpt-3.5-turbo-16k, and synatra (7b).
  • Data covers CSAT exams from 2015 to 2024.
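
Putting the steps together, a typical session might look like the following; the .env variable name OPENAI_API_KEY and the output directory ./results are assumptions, not documented values:

    # .env (assumed variable name)
    OPENAI_API_KEY=sk-...

    pip install AutoRAG
    pip install -r requirements.txt
    python run_all.py --dir_name=./results --model_name=gpt-4 --start_year=2015 --end_year=2024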

Highlighted Details

  • GPT-4 with a "Format Specification" prompt achieved an average score of 71.9 on the 2024 CSAT, while GPT-3.5-16K scored 35.1.
  • English prompts ("zero-shot-CoT-English") showed superior performance for GPT-4 on Korean language tasks, whereas Korean prompts were better for GPT-3.5-16K.
  • The "Format Specification" prompt, custom-designed for the CSAT, consistently performed well across models, outperforming the general-purpose techniques (a sketch of the idea follows this list).
  • Synatra-7B, a Mistral-7B fine-tune, showed limited Korean comprehension; prompting it in English elicited noticeably better reasoning than prompting it in Korean.
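
The "Format Specification" idea is essentially an output contract: the prompt spells out the exact shape of the expected answer, which keeps the model on task and makes scoring against the official answer key straightforward. A hedged sketch of such an instruction (illustrative wording only; the repository's actual template may differ and may be in Korean):

    # Illustrative "Format Specification"-style instruction, appended after the
    # passage and question. Not the repository's exact template.
    FORMAT_SPEC = (
        "Read the passage, reason about each of the five choices, and reply "
        "in exactly this format:\n"
        "Reasoning: <step-by-step reasoning>\n"
        "Final answer: <one number from 1 to 5>"
    )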

Maintenance & Community

This project is developed by NomaDamas, a team from the POSTECH gifted education center's AI program. Contact information for the supervising professor and team members is provided in the repository.

Licensing & Compatibility

The project states that copyright for the CSAT questions belongs to the Korea Institute for Curriculum and Evaluation (KICE), and that whether the 10-year question dataset may be made publicly available is still under inquiry.

Limitations & Caveats

The project's benchmark results are subject to variation due to the inherent stochastic nature of LLMs, even with fixed parameters. Some models, like GPT-4-32k and Claude, were not fully tested due to access limitations at the time of development. The performance of Synatra-7B on Korean tasks is significantly lower than GPT models, suggesting limited effectiveness for this specific use case without further fine-tuning.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (research scientist at Meta Superintelligence Lab).

AGIEval by ruixiangcui

Top 0.1% on SourcePulse
763 stars
Benchmark for evaluating foundation models on human-centric tasks
Created 2 years ago, updated 1 year ago