AAAI Publications, Second AAAI Conference on Human Computation and Crowdsourcing

STEP: A Scalable Testing and Evaluation Platform
Maria Christoforaki, Panagiotis Ipeirotis

Last modified: 2014-09-05


The emergence of online crowdsourcing sites, online work platforms, and even Massive Open Online Courses (MOOCs) has created an increasing need to evaluate the skills of participating users reliably and at scale. Many platforms already allow users to take online tests and verify their skills, but the existing approaches face several problems. First, cheating is very common in unsupervised online testing, as the test questions often "leak" and become easily available online together with the answers. Second, tests of technical skills, such as programming, must be updated frequently to reflect the current state of the art. Third, there is very limited evaluation of the tests themselves and of how effectively they measure the skill that the users are tested for. In this paper, we present a Scalable Testing and Evaluation Platform (STEP) that allows continuous generation and evaluation of test questions. STEP leverages content already available on Question Answering sites such as StackOverflow and re-purposes these questions to generate tests. The system uses a crowdsourcing component to edit the questions, and automated techniques to identify promising QA threads that can be successfully re-purposed for testing. This continuous question generation decreases the impact of cheating and also creates questions that are closer to the real problems that the skill holder is expected to solve in practice. STEP also uses Item Response Theory to evaluate the quality of the questions. In addition, we use external signals about the quality of the workers to identify the questions that have the strongest predictive ability in distinguishing workers who have the potential to succeed in the online job marketplaces. In contrast, existing approaches use only internal consistency metrics to evaluate the questions.
Finally, our system employs an automatic "leakage detector" that queries the Internet to identify leaked versions of our questions. We then mark these questions as "practice only," effectively removing them from the pool of questions used for evaluation. Our experimental evaluation shows that our system generates questions whose quality is comparable to or higher than that of existing tests, at a cost of approximately 3-5 dollars per question, which is lower than the cost of licensing questions from existing test banks.
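
The abstract does not say which Item Response Theory model STEP uses. As a minimal sketch, assuming the common two-parameter logistic (2PL) model, the quality of a question can be summarized by its discrimination a and difficulty b, with item information showing how sharply the question separates test takers near a given ability level. The function names and the example parameter values below are hypothetical and are not taken from the paper.

```python
import math


def p_correct(theta, a, b):
    """2PL model (assumed here): probability that a test taker with
    ability theta answers a question with discrimination a and
    difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta.
    Higher values mean the question distinguishes better between
    test takers whose ability is close to theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical questions: (discrimination a, difficulty b).
questions = {
    "sharp_question": (2.0, 0.0),
    "flat_question": (0.3, 0.0),
}

if __name__ == "__main__":
    for name, (a, b) in questions.items():
        info = item_information(0.0, a, b)
        print(f"{name}: information at theta=0 is {info:.4f}")
```

Under this assumed model, a question with low discrimination contributes little information at any ability level, which is one way a platform could flag weak questions for removal or re-editing.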
