RewardHackWatch is an open-source reward-hacking detection tool for LLM agents. It detects when AI agents learn to game their reward signals and tracks whether these behaviors generalize to broader ...