The Laude Institute dropped some spicy results from the inaugural K Prize, a tough-as-nails AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner? A Brazilian prompt engineer named Eduardo Rocha de Andrade, who took home $50,000 in prize money.
The plot twist? He won by getting just 7.5% of the problems right.
Yes, you read that correctly. Not 75%. Seven. Point. Five.
“We’re glad we built a benchmark that is actually hard,” said Konwinski, sounding like someone who just handed out a test that made everyone cry and felt great about it.
The challenge, designed to separate the AI wheat from the overhyped chaff, is called the K Prize, and it’s no cakewalk. Unlike earlier benchmarks such as SWE-Bench (which many models have already been trained on), this test is what Konwinski calls “contamination-free.” Meaning: No cheating, no pre-game peeking, and no fine-tuning on known answers. Just raw AI problem-solving power, measured in blood, sweat, and GitHub commits.
To keep it fresh, the first round only used GitHub issues reported after March 12th, ensuring that models hadn’t already seen the problems in the wild. That makes this test a better proxy for real-world programming, not just leaderboard-chasing.
“Scores would be different if the big labs had entered with their biggest models,” Konwinski admitted, but noted that the K Prize intentionally runs offline with limited compute to favor smaller, open-source models.
Translation: No GPT-4 with turbo boosters allowed. Just you, your model, and the edge of your wits.
In case you are wondering: the current top scores on SWE-Bench look way higher. The easier ‘Verified’ subset has top scores around 75%, while the tougher ‘Full’ test maxes out at around 34%. That makes Andrade’s 7.5% look dismal, until you remember how much harder this benchmark is meant to be.
The K Prize is not just about crushing spirits; it is about raising the standard. Konwinski has even pledged $1 million to the first open-source model that can break the 90% mark. That’s not just a carrot. It’s a whole field of golden carrots.
AI researcher Sayash Kapoor from Princeton is here for it. “Without experiments like this,” he says, “we can’t tell if current benchmarks are just contaminated, or if we’ve been gaming the system with humans in the loop.”
Translation: All those AI coding leaderboard scores? Might be inflated by shortcuts, not skill. So while 7.5% might sound like a fail, it’s a starting point, a reminder that we’re still figuring out how smart these machines are, especially when we strip away their cheat sheets.
As Konwinski puts it, “If you listen to the hype, it is like we should be seeing AI doctors and AI lawyers and AI software engineers. And that’s just not true.” AI might be good at writing tweets, picking stock photos, or hallucinating recipes, but building real software from scratch? That is still a very human job.
The question now is: Can open-source AI ever pass the test without a cheat code?