I feel there is a better way, as exemplified by a recent paper by Jia and Liang. In this paper the authors corrupt the SQuAD dataset with distractor sentences which have no effect on human performance, but which radically degrade the performance of the systems on the leaderboard. This reminds me of work by Paperno et al. on a paragraph completion task which humans perform with high skill and on which all state-of-the-art NLP approaches fail miserably. Both of these works clearly indicate that our current automatic systems bear only a superficial (albeit economically valuable) resemblance to humans.
This approach to honest self-assessment of our capabilities is not only more scholarly, but also more productive, as it provides concrete tasks to consider. At minimum, this will result in improved technological artifacts. Furthermore, iterating this kind of goal-setting-and-goal-solving procedure many, many times might eventually lead to something worthy of the moniker Artificial Intelligence.
(You might argue that the Yoav Goldberg strategy is more entertaining, but the high from the Yoav Goldberg way is a “quick hit”, whereas having a hard task to think about has a lot of “replay value”.)