This is such an important distinction. The fact that AI's performance can be skewed by its training data highlights a key limitation of how we currently measure its capabilities.
It’s like giving a student a test they’ve already seen: the results don’t necessarily reflect true reasoning or intelligence.
The idea of needing "offline" tests for AI opens up a whole new area of thought on how we ensure fair comparisons between human and machine cognition.
Thanks for sharing!