Generative Augmentation Driven Prediction of Diverse Visual Scanpaths in Images

Ashish Verma, Debashis Sen

Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology Kharagpur, India


 Abstract

Visual scanpaths of multiple humans on an image represent the process by which they capture the information in it. State-of-the-art models for predicting visual scanpaths on images learn directly from recorded human visual scanpaths. However, generating multiple visual scanpaths on an image that are as diverse as human visual scanpaths has not been explicitly considered. In this paper, we propose a deep network for predicting multiple diverse visual scanpaths on an image. Image-specific hidden Markov model (HMM) based generative data augmentation is performed first to increase the number of image-visual scanpath training pairs. Considering a similarity between our generative data augmentation process and the use of long short-term memory (LSTM) for prediction, we propose an LSTM-based visual scanpath predictor. A network to predict a single visual scanpath on an image is designed first. The network is then modified to predict multiple diverse visual scanpaths representing different viewer varieties by using a parameter indicating the uniqueness of a viewer. A random vector is also employed to produce subtle variations among scanpaths of the same viewer variety. Our models are evaluated on three standard datasets using multiple performance measures, which demonstrate the superiority of the proposed approach over the state-of-the-art. Empirical studies are also presented that indicate the significance of our generative data augmentation method and of our multiple scanpath prediction strategy in producing diverse visual scanpaths.


 Highlights

  1. A visual scanpath predictor network for images, which is trained end-to-end, driven by generative data augmentation.
  2. An HMM-based generative data augmentation procedure to obtain image-specific training pairs of images and visual scanpaths.
  3. A training strategy based on the uniqueness of a viewer to generate multiple and diverse visual scanpaths of different viewer varieties for an image.

 Proposed Architecture

The architecture of the proposed model for visual scanpath prediction comprises two main novel components: augmentation of image-visual scanpath pairs using an HMM, and the LSTM-based scanpath predictor.
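
The augmentation component can be illustrated with a minimal sketch of how a per-image HMM generates extra scanpaths. This page does not specify the HMM, so everything below is an assumption made for illustration: the three hidden states, the Gaussian emissions, the hand-set parameter values (in the actual method they would be estimated from the recorded scanpaths of that image), the fixed scanpath length, and coordinates normalised to [0, 1]. The names `sample_scanpath` and `augment` are hypothetical and not taken from the released code.

```python
import random

# Hypothetical, hand-set HMM for ONE image (not fitted to real data).
# Each hidden state stands in for a region of interest; fixation
# coordinates are assumed to be normalised to [0, 1] x [0, 1].
START_PROB = [0.6, 0.3, 0.1]            # P(first state)
TRANS_PROB = [                          # P(next state | current state)
    [0.2, 0.6, 0.2],
    [0.3, 0.2, 0.5],
    [0.5, 0.3, 0.2],
]
MEANS = [(0.25, 0.30), (0.70, 0.35), (0.50, 0.75)]  # emission centres (x, y)
STDS = [(0.05, 0.05), (0.06, 0.04), (0.08, 0.06)]   # emission spreads (x, y)


def clip01(value):
    """Keep a sampled coordinate inside the normalised image."""
    return min(max(value, 0.0), 1.0)


def sample_scanpath(length, rng):
    """Draw one synthetic scanpath: a list of (x, y) fixations."""
    states = list(range(len(START_PROB)))
    state = rng.choices(states, weights=START_PROB)[0]
    path = []
    for _ in range(length):
        (mx, my), (sx, sy) = MEANS[state], STDS[state]
        # Emit a fixation near the centre of the current hidden state.
        path.append((clip01(rng.gauss(mx, sx)), clip01(rng.gauss(my, sy))))
        # Move to the next hidden state according to the transition row.
        state = rng.choices(states, weights=TRANS_PROB[state])[0]
    return path


def augment(n_paths, length, seed=0):
    """Generate n_paths synthetic scanpaths to pair with the same image."""
    rng = random.Random(seed)
    return [sample_scanpath(length, rng) for _ in range(n_paths)]


synthetic = augment(n_paths=5, length=6)
for path in synthetic:
    print([(round(x, 2), round(y, 2)) for x, y in path])
```

Each sampled scanpath would be paired with the same image to form an additional training pair. A fixed seed keeps the sketch reproducible.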

 Sample Results (Full results will be released soon)

Table 1: Performance comparison of the various multiple scanpath prediction models on the OSIE test set. The top-3 ranked models for each MultiMatch score are indicated by subscripts. The top-3 models with the closest Intra-SS to that of Human-GT (G) are also indicated by subscripts to tick marks denoting that they are within G±0.25.

 

Table 2: Cross-dataset evaluation of the various multiple scanpath prediction models. The top-3 ranked models for each MultiMatch score are indicated by subscripts. The top-3 models with the closest Intra-SS to that of Human-GT (G) are also indicated by subscripts to tick marks denoting that they are within G±0.25.
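
One possible reading of the tick-mark rule in the captions of Tables 1 and 2 is sketched below: a model is eligible for a tick mark only if its Intra-SS lies within G±0.25, and the eligible models are then ranked by how close they are to G. The function name `closest_intra_ss`, the model names, and all numeric values are made-up placeholders for illustration and are not results from the tables.

```python
def closest_intra_ss(model_scores, human_score, tol=0.25, top_k=3):
    """Return up to top_k model names whose Intra-SS is within
    human_score +/- tol, ordered from closest to farthest."""
    gaps = {
        name: abs(score - human_score)
        for name, score in model_scores.items()
        if abs(score - human_score) <= tol
    }
    return sorted(gaps, key=gaps.get)[:top_k]


# Placeholder Intra-SS values (NOT taken from Table 1 or Table 2).
placeholder_scores = {
    "model_a": 3.10,
    "model_b": 2.40,
    "model_c": 2.95,
    "model_d": 3.40,
    "model_e": 2.80,
}
ranked = closest_intra_ss(placeholder_scores, human_score=3.0)
print(ranked)  # ['model_c', 'model_a', 'model_e']
```

Under this reading, a model whose Intra-SS is far above G is penalised just like one far below it, since the goal is to match the diversity of human scanpaths rather than to maximise or minimise the score.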

 Visual Comparison

Qualitative comparison of our proposed DiviScan with the other models: visual scanpaths (with the highest SS) predicted by the various models, overlaid on images with different numbers of objects of interest.

 Download

Code
Training & Testing Datasets

 References

  • [7] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE T-PAMI, vol. 20, no. 11, pp. 1254–1259, Nov. 1998.
  • [10] O. Le Meur and Z. Liu, “Saccadic model of eye movements for free-viewing condition,” Vis. Res., vol. 116, pp. 152–164, Nov. 2015.
  • [11] C. Wloka, I. Kotseruba, and J. K. Tsotsos, “Active fixation control to predict saccade sequences,” in CVPR, Jun. 2018, pp. 3184–3193.
  • [13] W. Sun, Z. Chen, and F. Wu, “Visual scanpath prediction using IOR-ROI recurrent mixture density network,” IEEE T-PAMI, vol. 43, no. 6, pp. 2101–2118, Dec. 2019.
  • [22] G. Boccignone and M. Ferraro, “Modelling gaze shift as a constrained random walk,” Phys. A, Statist. Mech. Appl., vol. 331, no. 1-2, pp. 207–218, Jan. 2004.
  • [28] M. Assens Reina, X. Giro-i Nieto, K. McGuinness, and N. E. O’Connor, “Saltinet: Scan-path prediction on 360 degree images using saliency volumes,” in ICCVW, Oct. 2017, pp. 2331–2338.
  • [29] M. Assens, X. Giro-i Nieto, K. McGuinness, and N. E. O’Connor, “Path-GAN: Visual scanpath prediction with generative adversarial networks,” in ECCVW, Sep. 2018.
  • [33] X. Chen, M. Jiang, and Q. Zhao, “Predicting human scanpaths in visual question answering,” in CVPR, Jun. 2021, pp. 10 876–10 885.