Facebook today introduced Dynabench, a platform for AI data collection and benchmarking that uses humans and models “in the loop” to create challenging test data sets. Leveraging a technique called dynamic adversarial data collection, Dynabench measures how easily humans can fool AI, which Facebook believes is a better indicator of a model’s quality than current benchmarks provide.
A number of studies imply that commonly used benchmarks do a poor job of estimating real-world AI performance. One recent report found that 60%-70% of answers given by natural language processing (NLP) models were embedded somewhere in the benchmark training sets, indicating that the models were often simply memorizing answers. Another study, a meta-analysis of over 3,000 AI papers, found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.