The important fragment is always in the center of the field of view. A separate mechanism is responsible for turning the head and focusing on moving, bright, loud, important things. This mechanism can be implemented later.
Films made for cinema or YouTube are of little use, because the viewer's gaze moves across the screen instead of staying at the center. They could become usable later if combined with data on the direction of the pupils.
The easiest way is to shoot a short clip yourself so that the camera follows your eyes and the important thing is always in the center of the screen.
Only a narrow zone in the center needs to be processed, as if looking through a gun sight. Reducing the resolution can help with recognition and with comparison against what was seen previously.
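The central-zone idea can be sketched roughly as follows. This is only a minimal illustration, assuming a frame is a grayscale image stored as a list of rows of numbers. The function names center_crop, downsample and mean_abs_difference are hypothetical, and block averaging plus mean absolute difference are just one simple choice for reducing resolution and comparing with an earlier view.

```python
def center_crop(frame, crop_h, crop_w):
    """Return the central crop_h x crop_w window of a 2D grayscale frame."""
    h, w = len(frame), len(frame[0])
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return [row[left:left + crop_w] for row in frame[top:top + crop_h]]


def downsample(frame, factor):
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h, w = len(frame), len(frame[0])
    out = []
    # Rows and columns that do not fill a whole block are dropped.
    for y in range(0, h - h % factor, factor):
        out_row = []
        for x in range(0, w - w % factor, factor):
            block = [frame[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def mean_abs_difference(a, b):
    """Mean absolute pixel difference between two equally sized frames.

    Assumption: a small value means the current view resembles an earlier one.
    """
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))
```

For example, downsample(center_crop(frame, 64, 64), 4) would turn the central 64x64 zone into a 16x16 thumbnail that can be stored and later compared with mean_abs_difference. The sizes here are arbitrary.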
When all pixels move in the same direction, the eyes are moving. These moments should not be captured. The direction and time of the movement can be recorded for relative positioning.
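A rough sketch of this check is given below. It assumes that some optical-flow step, not shown here, already provides a list of (dx, dy) motion vectors for each frame. The names classify_motion and log_eye_movements and the thresholds min_speed and agreement are hypothetical and would need tuning on real footage.

```python
import math


def classify_motion(vectors, min_speed=0.5, agreement=0.9):
    """Return the mean (dx, dy) if nearly all vectors share one direction, else None."""
    if not vectors:
        return None
    moving = [(dx, dy) for dx, dy in vectors if math.hypot(dx, dy) >= min_speed]
    if len(moving) < agreement * len(vectors):
        return None  # too many static pixels: not a whole-frame shift
    mean_dx = sum(dx for dx, _ in moving) / len(moving)
    mean_dy = sum(dy for _, dy in moving) / len(moving)
    norm = math.hypot(mean_dx, mean_dy)
    if norm < min_speed:
        return None  # vectors cancel out, so there is no common direction
    # Count vectors whose cosine with the mean direction exceeds 0.9
    # (within roughly 26 degrees of it).
    aligned = sum(
        1 for dx, dy in moving
        if (dx * mean_dx + dy * mean_dy) / (math.hypot(dx, dy) * norm) > 0.9
    )
    if aligned < agreement * len(moving):
        return None
    return (mean_dx, mean_dy)


def log_eye_movements(flow_per_frame, fps):
    """Split frames into usable indices and a log of (time_s, dx, dy) eye movements."""
    usable, movements = [], []
    for index, vectors in enumerate(flow_per_frame):
        shift = classify_motion(vectors)
        if shift is None:
            usable.append(index)  # frame can be captured and processed
        else:
            movements.append((index / fps, shift[0], shift[1]))
    return usable, movements
```

Frames listed in usable would be kept for recognition, while the movements log keeps the time and direction of each skipped shift, which could later be summed to estimate where one view lies relative to another.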