Comparability and reproducibility
We intuitively feel that we should be able to compare the performance of an agent or an algorithm in a particular task to the performance of another agent or algorithm in the same task. For example, if an agent gets a score of 1,000 on average in the Atari game of Space Invaders, we should be able to tell that this agent is performing worse than an agent that scores 5000 on average in the Space Invaders game in the same amount of training time. But what happens if the scoring system for the game is slightly changed? Or if the environment interface was modified to include additional information about the game states that will provide an advantage to the second agent? This would make the score-to-score comparison unfair, right?
To handle such changes in the environment, OpenAI Gym uses strict versioning for environments. The toolkit guarantees that if there is any change to an environment, it will be accompanied by a different version number. Therefore, if the original version of the Atari Space Invaders game environment was named SpaceInvaders-v0 and there were some changes made to the environment to provide more information about the game states, then the environment's name would be changed to SpaceInvaders-v1. This simple versioning system makes sure we are always comparing performance measured on the exact same environment setup. This way, the results obtained are comparable and reproducible.