Introduction

Schizophrenia (SCHZ) is typically associated with deficits in domains related to information processing, such as perception, attention, working memory, and learning1. All these domains likely have one common denominator: impaired salience, the property by which something stands out from surrounding context. Salience is typically regarded as having two components: physical and cognitive salience. Physical salience refers to the aspects of a stimulus that automatically capture attention or direct gaze in a stimulus-driven, goal-independent, or bottom-up manner2. In contrast, cognitive salience is task-oriented, influenced by tasks assigned by external sources or driven by one’s current internal goals3. Disruption of physical salience, which is based on sensory sensitivity to external stimuli, may impede the formation of cognitive salience-related associations. This means that it can affect our ability to attribute meaning to individual stimuli from the external environment4. Kapur proposed that dysregulated, hyperdopaminergic states at the cellular level may lead to the attribution of aberrant salience to individual experiences at the psychological experiential level5. However, salience formation is a complex, long-term process that reflects our internal model of the world, which may not be stable in SCHZ due to distortions and instability of sensory signals6.

Vision is our most developed sense7,8 and unsurprisingly a substantial amount of brain processing is devoted to it, with over half the primate brain being involved in vision-related processing9. Due to the limited computational capacity of the visual cortex10, it is critical to correctly cluster visual percepts according to a hierarchy of importance. The internal model of the world is derived from the combination of neural filters and cognitive signals that gradually calibrate them. This mechanism allows the brain to process visual signals efficiently and to focus its limited computational capacity and attention only on those parts of the scene that are subconsciously assessed as important11,12. Computational capacity limits are mainly related to the physiological aspects of the neurons themselves and the functional circuits sensitive to the different elements of the visual scene13,14. The brain solves this limited capacity for attention allocation through prediction mechanisms15. The perceptual onset is preceded by a quick subliminal observation of the scene (bottom-up), which is based on its physical saliency (contrast, brightness, and low spatial frequencies). This observation helps us quickly orient ourselves and focus our attention in the next step, in which higher (top-down) cognitive processes come into play. These processes are related to the cognitive saliency formed by our internal model of the world6,16. Low spatial frequency (LSF) information is swiftly extracted from visual stimuli and conveys general details about the shape and orientation of objects within a scene. This LSF information subsequently contributes to the formation of top-down predictions, influencing visual attention and higher-level cognitive processes related to visual perception16,17,18,19. A primary outcome resulting from the disruption of this process is a disorder of attentional capacity and the inability to rapidly incorporate salient percepts into the stream of consciousness20,21.

In SCHZ, previous findings indicated a disruption in both types of processing: basal visual perception based on incorrect processing of visual stimuli (bottom-up)22,23,24,25, and impairment of higher visual cognition based on the processing of visual stimuli influenced and orchestrated by previous experience (top-down/feedforward sweep)26,27,28,29,30,31,32,33,34. The stimuli used in these experiments are typically designed based on the research question being addressed. Bottom-up experiments predominantly work with elementary stimuli, such as basic line figures35, Gabor patterns29,36, and pop-out structures37, while top-down experiments use different types of visual illusions33,38 or faces39. However, this approach falls short in providing a comprehensive mapping of the interplay between bottom-up and top-down processes during complex visual processing in everyday environments. It also lacks the capability to conclusively ascertain how deficits in bottom-up processing influence the perception, cognition and formation of aberrant saliency of complex real-life scenes in SCHZ population.

To address this knowledge gap, we attempt to identify differences between both groups by using recent saliency “bottom-up” and “top-down” predictive models40,41, with the former relying solely on physical visual properties and the latter additionally incorporating object recognition. Attention allocation has been intensively investigated through saliency models using “saliency maps”42,43,44, a computational concept that predicts graded saliency for each location of an image based on its low-level visual features, and thus predicts bottom-up attention45. It includes three components: (1) feature maps that represent fundamental visual characteristics such as color, orientation, luminance, and motion; (2) saliency maps resulting from combining normalized feature maps that highlight the visually significant areas in an image, solely based on their physical attributes, without taking into account any semantic features of the stimulus; (3) the “ground truth maps” representing the saliency maps derived from the real eye-tracking data capturing viewer attention allocation to specific regions of the image. The efficacy of saliency model predictions is then evaluated through its comparison with ground truth maps. In previous studies, saliency models have even been employed to analyze brain activity in response to visual stimuli, with distinct brain areas linked to the ‘saliency map’ generated by a saliency model46,47.

Recent technological advances in the field of machine learning have enabled the incorporation of additional convolutional neural network (CNN) layers to original bottom-up models. These added CNN layers reflect top-down cognition, which is involved in analysis and categorization of specific semantic content of a scene (e.g., objects, faces, emotions)48,49,50,51. However, it is important to emphasize that such models are not solely based on top-down cognition; they still incorporate the bottom-up layer within their computations. In this paper, for the sake of simplicity, we refer to such models as “top-down” because, unlike bottom-up models, they have the capability to suppress the bottom-up component in favour of top-down processing52,53.

We utilized these two models to determine the likelihood of an observer directing their attention to specific areas within the scene. We expect that analyzing ground truth maps derived from eye-tracking data of individuals with schizophrenia (SCHZ) and healthy controls (HCs), and comparing these with mathematically predicted saliency, will provide deeper insights into the similarities and differences in bottom-up and top-down visual processing between these two groups. We hypothesized that SCHZ patients’ attention is influenced more by the physical properties of the image than HC’s attention. This suggests a tendency to prioritize highly physically salient percepts in the scene more than HC54,55,56,57, likely reflecting the disruption of higher cortical processes consistently found across studies and resulting in the expected lower predictive ability of the top-down model in SCHZ patients58,59,60. In this paper, we employ the term “bottom-up bias” to denote a tendency to prioritize bottom-up signal over top-down processing61.

To investigate the ‘bottom-up bias’ in schizophrenia (SCHZ), our approach involved a multi-faceted comparison using saliency models across both SCHZ patients and HCs. Initially, we compared the overall results of these models between the two groups. Furthermore, our analysis extended to assessing the performance of the saliency models across five specific content-based categories, each inherently linked to either bottom-up or top-down processing. This nuanced categorization allowed us to parse the visual processing mechanisms more precisely and understand how each model interprets different types of visual stimuli in SCHZ and HCs. Subsequently, we integrated a stepwise analysis of two consecutive time periods: the first encompassing up to five fixations, and the second starting from the sixth fixation. This sequential analysis aimed to unravel the dynamics of visual perception in SCHZ. By examining these two distinct phases, we sought to identify and contrast the engagement of bottom-up and top-down components in the visual perception processing of both groups. Finally, to reveal confounding factors that might influence the results of the two saliency models, we tested the relationship of oculomotor movements with psychological metrics (Continuous Performance Test (CPT) and Positive and Negative Syndrome Scale (PANSS)), medication, disease duration, and the duration of untreated psychosis (DUP).

Results

Differences in the Performance of Saliency Models

Comparing the saliency maps calculated for each participant (ground truth maps) with the saliency predictions yielded 13,436 normalized scanpath saliency (NSS) values from 53 subjects (28 SCHZ, 25 HC). A direct non-statistical comparison of the NSS scores for the two saliency models showed that the bottom-up (GBVS) model predicted oculomotor behavior better in the SCHZ population (M = 1.43, SD = 0.58) than in HC (M = 1.35, SD = 0.51). In contrast, the top-down (EML-Net) model better predicted the distribution of fixations in HC (M = 2.16, SD = 1.13) than in SCHZ (M = 2.08, SD = 1.29). However, when we employed linear mixed effects (LME) models for statistical comparison, the analysis did not corroborate the differences observed in the direct, non-statistical comparison of NSS scores between groups and across models.
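For readers unfamiliar with the metric, NSS is commonly defined as the mean of the z-scored predicted saliency map sampled at the observed fixation locations. The following is a minimal sketch of that common definition in dependency-free Python. The function name, the toy map, and the fixation coordinate format are illustrative assumptions, not the authors' pipeline.

```python
from statistics import mean, pstdev


def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels.

    saliency_map: 2D list of floats (rows of pixel values).
    fixations: list of (row, col) pixel coordinates (hypothetical format).
    """
    values = [v for row in saliency_map for v in row]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("saliency map is constant; NSS is undefined")
    if not fixations:
        raise ValueError("no fixations to score")
    # Average the standardized saliency over all fixated locations.
    return mean((saliency_map[r][c] - mu) / sigma for r, c in fixations)


# Toy 3 x 3 map with a single salient pixel in the centre (made-up data).
toy_map = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
score_on_peak = nss(toy_map, [(1, 1)])           # fixation on the salient pixel
score_off_peak = nss(toy_map, [(0, 0), (2, 2)])  # fixations on non-salient pixels
```

Under this definition a positive score means that fixations landed on regions the model rated above its own average, a score near zero corresponds to chance, and a negative score means fixations fell on regions the model rated below average.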

Evaluation of NSS scores for the bottom-up (GBVS) model did not show significant overall differences between groups but indicated significantly higher model performance for SCHZ patients in the physically salient image category (Table 1). The top-down (EML-Net) model also showed no overall between-group effect but performed significantly worse for patients on images depicting social interactions (Table 1).

Table 1 Results of LME comparison for the top-down and bottom-up models.

At the whole-group level, including both SCHZ and HC, the bottom-up (GBVS) model showed no differences between image categories. On the other hand, the top-down (EML-Net) model showed lower prediction capability in the physically salient image category, and higher capability in the social interaction and social landscape image categories (Table 1).

Between-group differences in bottom-up and top-down predictions in time

To identify the inter-group differences in the involvement of bottom-up and top-down processes over time, we calculated the NSS score for each model in two different time periods: up to the fifth fixation and from the sixth fixation onward (Fig. 1). The decision to split the dataset into two periods was based on previous research showing that prediction accuracy for bottom-up models is lost around the fifth fixation62. A further reason for the split is the peak in fixation duration, which occurs around the fifth fixation in both groups (Fig. 2). We applied LME models to both periods and both saliency models.
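The split into two periods can be sketched as follows. This is a minimal illustration that assumes each trial is stored as a list of fixations ordered by onset; the function name, the parameter name, and the example coordinates are hypothetical.

```python
def split_by_fixation_index(fixations, last_early_index=5):
    """Split an ordered fixation list into an early and a late period.

    fixations: list ordered by onset, so the first element is fixation 1.
    last_early_index: last fixation (1-based) assigned to the early period;
    the default of 5 follows the split described in the text.
    """
    early = fixations[:last_early_index]
    late = fixations[last_early_index:]
    return early, late


# Made-up trial with seven fixations given as (x, y) pixel coordinates.
trial = [(120, 340), (400, 310), (640, 505), (655, 498), (900, 200),
         (910, 220), (300, 700)]
early, late = split_by_fixation_index(trial)
```

Each period can then be scored separately against a model's saliency map. Trials with five or fewer fixations contribute only to the early period.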

Fig. 1: The difference between the models’ performance over time.

Difference in NSS scores of the top-down and bottom-up models between groups over time. The top-down (EML-Net) model performs better for HCs in both time periods. The bottom-up model, on the other hand, predicts saliency better in the SCHZ population only in the second period, from the sixth fixation onward. In the first period, its prediction is more accurate for HCs than for SCHZ patients.

Fig. 2: Inter-group differences in the duration of individual fixations (group mean, standard error of the mean).

Vertical red dotted lines show the mean number of fixations in each group. ***p < 0.001; **p < 0.01; *p < 0.05; ns = not significant. A sequential testing procedure was applied to control the false positive rate, stopping at the first fixation with a non-significant result.

Sequential analysis of bottom-up (GBVS) model

The LME model revealed no significant differences in NSS scores between the SCHZ and HC groups for either observed period. However, in the context of physically salient images, the model consistently showed a better prediction of oculomotor behavior for SCHZ patients compared to HCs, in both periods (Table 2).

Table 2 Differences in NSS scores between SCHZ and HC groups for the bottom-up (GBVS) model in two different time periods.

Furthermore, an analysis of the second period revealed differential performance across image categories at the whole-group level. Specifically, the bottom-up model indicated better performance for physically salient images, while it showed reduced effectiveness in accurately predicting oculomotor movements for stimuli depicting social interactions and social landscapes (Table 2).

Sequential analysis of top-down (EML-Net) model

LME results showed a difference in NSS score between groups during the first time period (Table 3). In the first period, the model also predicted patients’ oculomotor behavior significantly better in the physically salient image category and worse in the social landscape image category. Stimuli depicting social interactions had significantly lower NSS scores in SCHZ patients in both periods (Table 3). In contrast, at the whole-group level, which includes both SCHZ and HC groups, no differences were observed between image categories in either of the two periods (Table 3).

Table 3 Differences in NSS scores between SCHZ and HC groups for the top-down (EML-Net) model in two different time periods.

Group Differences in Fixation and Explored Area of the Image

The SCHZ group showed a significantly lower mean number of fixations per image than the HC group (SCHZ: M = 8.92, SD = 1.28; HC: M = 9.22, SD = 0.75; t(54) = 5.26, p < 0.001), and the overall mean fixation duration was longer in SCHZ than in HC (SCHZ: M = 326.12 ms, SD = 22.97; HC: M = 254.83 ms, SD = 24.15; t(54) = −4.44, p < 0.001). We also observed a statistically significant difference between the groups in the total area of the image that received fixations. This ‘total fixed image area’ refers to the cumulative portion of the image that was the focus of gaze fixations across all participants within each group. The standard deviation (SD) test revealed that the SCHZ group had a significantly reduced spread of fixations over the image area (SCHZ: SD Mean = 678.28, SD = 76.3; HC: SD Mean = 727.56, SD = 83.82; t(54) = 6.87, p < 0.001).
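The text does not specify how the spread of fixations was computed. One plausible operationalization, shown here purely as an assumption, is the standard distance of the fixation points, that is, the square root of the summed variances of the x and y coordinates. The function name and the example coordinates are hypothetical.

```python
from math import hypot
from statistics import pstdev


def fixation_spread(fixations):
    """Standard distance of fixation points: sqrt(SD_x^2 + SD_y^2), in pixels.

    This is an illustrative spread measure, not necessarily the one used
    in the study.
    """
    if len(fixations) < 2:
        raise ValueError("need at least two fixations to measure spread")
    xs = [x for x, _ in fixations]
    ys = [y for _, y in fixations]
    return hypot(pstdev(xs), pstdev(ys))


# Made-up fixation sets: one clustered near the centre, one covering the scene.
narrow = [(1900, 1000), (1950, 1050), (2000, 1100)]
wide = [(500, 300), (1950, 1050), (3400, 1800)]
narrow_spread = fixation_spread(narrow)
wide_spread = fixation_spread(wide)
```

A smaller value indicates that gaze stayed within a more restricted part of the image, which is the direction of the effect reported for the SCHZ group.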

In addition, we identified between-group differences in the temporal dynamics of fixation duration. In SCHZ, the average fixation duration stabilized after an initial increase. Around the fifteenth fixation, fixation durations in SCHZ became comparable to those in HC. The fifth fixation was reached in 99% of all trials in HC and in 96% of all trials in SCHZ. The tenth fixation was reached in 96% of all trials in HC and in 82% of all trials in SCHZ. The fifteenth fixation was reached in 79% of all trials in HC and in 45% of all trials in SCHZ. A sequential testing procedure was used to test the significance of this difference. The first fourteen fixations showed a statistically significant difference in fixation durations (t(54) = −2.55, p = 0.013). The fifteenth and subsequent fixation durations did not differ between groups (t(54) = −1.67, p = 0.098) (Fig. 2).
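The stopping rule of such a sequential procedure can be sketched as follows: tests are evaluated in fixation order, and interpretation stops at the first non-significant result. The sketch takes precomputed p-values as input. The function name, the significance threshold of 0.05, and the p-values are assumptions for illustration only.

```python
def sequential_stop(p_values, alpha=0.05):
    """Count the leading significant tests before the first non-significant one.

    p_values: p-values ordered by fixation index (1st, 2nd, ...).
    Results after the first non-significant test are not interpreted.
    """
    n_significant = 0
    for p in p_values:
        if p >= alpha:
            break  # stop at the first non-significant fixation
        n_significant += 1
    return n_significant


# Hypothetical p-values; the final 0.030 is never reached because the
# procedure stops at 0.098.
hypothetical_p = [0.001, 0.004, 0.020, 0.013, 0.098, 0.030]
n_leading = sequential_stop(hypothetical_p)
```

Stopping at the first failure means that a later, isolated significant result cannot be reported, which is what limits the false positive rate across the ordered sequence of tests.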

In the SCHZ group, we also investigated the relationship between oculomotor movements (including the duration and number of fixations) and various factors: the antipsychotic medication dosage, responses on the PANSS questionnaire, the duration of illness, and the period of untreated illness. However, our analysis revealed no statistically significant correlations between these variables and oculomotor movements. Additionally, we examined the relationship between oculomotor movements and CPT test results in both SCHZ and HC groups. We found a negative correlation between CPT Commissions and the mean number of fixations in the HC group, but no other significant correlations for the remaining variables or groups. Detailed results can be found in Table 4.

Table 4 Results of psychological measurements.

Discussion

The main finding of our study is that the bottom-up model was able to better predict the oculomotor behavior of the SCHZ population and in contrast the top-down model better predicted the oculomotor behavior of HCs. While the LME model did not statistically confirm differences for either the bottom-up or top-down models overall, it identified significant variations upon examining specific image categories. These findings indicate that the bottom-up model better predicted oculomotor behavior in SCHZ patients compared to HC when viewing physically salient images. This observation supports a ‘bottom-up’ bias in SCHZ patients and the assumption of a delayed integration of visual signals initially processed by bottom-up mechanisms into the subsequent top-down processing26,55,56.

On the other hand, the top-down model was more effective in predicting the gaze patterns of SCHZ patients compared to HCs when they viewed incongruent scenes. This observation suggests that although the model is capable of predicting gaze patterns in relation to the objects within a scene, it falls short in recognizing the incongruity of these objects, that is, in understanding how the objects relate contextually. This behavior is likely because the top-down model, which inherently lacks the ability to assess the semantic context of objects, does not factor in the presence of incongruent objects within its predictive framework. In essence, the model’s limited capacity to evaluate semantic contexts aligns with the similar cognitive limitation observed in SCHZ patients63. Therefore, the enhanced predictive accuracy of the top-down model for SCHZ patients may stem from this shared deficiency in correctly interpreting the semantic context of objects, resulting in more accurate oculomotor predictions for this group. Our findings also indicate that the top-down model more accurately predicted the oculomotor behavior of HCs compared to SCHZ patients for images of social interactions. This is consistent with earlier research highlighting the impaired ability of SCHZ patients to process more complex visual scenes such as social interactions and emotions64,65,66. This outcome is linked to negative symptoms of emotional blunting67 and a deficit in processing the low spatial frequency (LSF) of images68,69.

Category-specific stimuli analyses showed better performance in SCHZ group for the top-down model in categories of social interaction and social landscape. This finding is in agreement with previous reports on the properties of saliency models70,71. This enhanced prediction accuracy suggests that this model excels in accounting for higher cognitive processes associated with the interpretation of individuals and objects within the scene and their interactions. Conversely, the performance of the top-down model was less effective in predicting the oculomotor behavior of HCs in response to physically salient stimuli. The top-down model’s reduced capacity to predict oculomotor behavior for physically salient stimuli reaffirms its overall lower sensitivity to the bottom-up component within the predicted saliency map.

As expected, the temporal analysis of the models allowed us to reveal how top-down and bottom-up processes are involved in cognition and its formation in the groups we studied. The bottom-up (GBVS) model indicated no significant differences between the groups across both periods. However, this trend changed when we focused on specific stimulus categories. Notably, for physically salient images, the GBVS model consistently showed better performance in SCHZ patients than in HCs during both periods. This confirms the previously reported tendency of SCHZ patients to focus their attention on physically salient stimuli72,73. The second analysis shows a difference in performance of the top-down (EML-Net) model between groups. Especially in the first period, the nuanced differences in how SCHZ and HC groups process visual information are highlighted. This distinction, particularly evident in the early period, underscores a potential divergence in cognitive processing strategies between the two groups. As the model’s ability to differentiate between SCHZ and HC partly diminishes in the second period, it suggests a partial convergence in visual processing strategies over time, or possibly an adaptation in the SCHZ group’s visual attention mechanisms. Differences persist for images depicting social interaction and emerge in the incongruent image category.

Furthermore, these observations are in agreement with results from the CPT, where SCHZ patients exhibited higher rates of omission and perseveration errors compared to HCs. These CPT findings imply a greater tendency of SCHZ patients to overall inattentiveness (as indicated by higher omission scores) and to the use of more automatic responses (as evidenced by higher perseveration scores). Together, these elements suggest an impaired ability of SCHZ patients to direct their focus towards visual stimuli74. This impairment may also contribute to the delayed scene orientation observed in SCHZ patients, thereby affecting the efficiency of bottom-up signal processing. In the HC population, after the initiation phase, bottom-up saliency is suppressed by the top-down saliency of higher cognitive processes16,75,76,77, but as seen in the results it appears that this onset is delayed in the SCHZ population.

The delayed emergence of top-down cognitive processes is likely attributable to dysfunctions in LSF processing. LSF processing is essential for swift scene orientation, laying the groundwork for top-down predictive mechanisms and focused attention distribution within the visual scene16. The absence of notable differences between-groups in the second period of top-down model predictions implies that the slower initiation of top-down cognition might be linked to LSF processing abnormalities repeatedly reported in SCHZ population61,78,79,80. Previous studies mainly focus on the reduced ability of the SCHZ population to process LSFs, which has been attributed to dysfunction of the magnocellular optical pathways. However, recent findings indicate that LSFs may not be processed only by the magnocellular pathways but are likely processed in parallel in the koniocellular pathways81,82. Consequently, the research focus has shifted toward the retina itself in recent years83,84,85. One possible reason for the slower bottom-up signal processing in SCHZ is the inflammatory processes of retinal microvascularity, which are associated with commonly reported atrophy of retinal nerve fibers86,87. The outcome of this process is a low signal-to-noise ratio88, particularly resulting in an increased level of vagueness related to the nature of a percept/signal, ultimately leading to a disruption of the decision-making process89. However, inflammatory processes and associated atrophy would not explain why, in early-stage and untreated first-episode patients, hypersensitivity is often encountered55,57. Retinal atrophy can only explain the later stages of the illness when hypersensitivity eventually progresses to hyposensitivity, which also extends to other frequencies of the visual scene55,90,91. An alternative explanation that would also include hypersensitivity to LSFs would be instability in retinal dopamine levels6. 
Dopamine influences the size of receptive fields, thereby affecting the sensitivity to individual frequencies of the perceived image92. Increased dopamine levels reduce the size of receptive fields, leading to increased sensitivity to high spatial frequencies and vice versa93,94. Therefore, the instability of the receptive fields may contribute significantly to the formation of the aberrant salience that is typical for schizophrenia6.

In our study, the SCHZ patient group exhibited fewer yet longer fixations compared to the HC group, corroborating findings from existing literature95,96,97. While previous studies have suggested a link between these oculomotor differences and the severity of both negative and positive SCHZ symptoms, the nature of this association remains a subject of debate98. In contrast to these studies, our results did not establish a connection between the severity of SCHZ symptoms (whether negative or positive) and oculomotor behavior. This absence of correlation extended to the outcomes of the PANSS as well as to medication effects. Furthermore, we observed no significant relationship between fixation patterns and CPT performance within the SCHZ group. These findings imply that the overall ability of SCHZ patients to sustain attention does not significantly impact the results of predictive models. This raises the possibility that these specific differences in saliency and its predictive model might be considered as trait markers of SCHZ itself.

Temporal analysis of fixation duration revealed a diminishing difference between the HC and SCHZ groups over time. Initially, the SCHZ group exhibited prolonged fixations, likely indicative of extended time needed for scene orientation and LSF signal processing. However, fixation durations gradually decreased, suggesting the engagement of advanced top-down cognitive processes. This pattern aligns with the documented reduction in fixation duration and count in SCHZ during top-down cognitive tasks, such as object search or fixation within a scene99. This “unknown compensatory mechanism”, as the authors of the original study called it, might relate to altered receptive field sensitivity, potentially due to dopamine fluctuations in the retina and variations in retinal morphology, affecting receptive field distribution and size. However, a precise answer to this question would require more in-depth research.

In this study, we explored the application of salience models in schizophrenia (SCHZ) research, an area with limited prior investigation100,101. Our findings indicate that predictive models of visual saliency are potent tools for identifying errors in visual information processing and the development of aberrant saliency in SCHZ patients. Emphasis should be placed on incongruent stimuli, stimuli that are physically salient, and complex stimuli depicting social interactions. These types of stimuli effectively illustrate the limitations of the models and the specific abnormalities in visual processing among the SCHZ population. Our study also reveals that the previously documented bias in SCHZ patients towards bottom-up signals31,55,57,61,102,103 is variable over time, possibly originating from disruptions in early-stage visual processing. This disruption might further impede the onset of top-down visual cognition. The altered and prolonged processing of bottom-up signals likely leads to flawed and unstable internal representations of the world, impacting higher cognitive functions6. Our study highlights the complex interaction between bottom-up and top-down processes in the visual signal processing of SCHZ patients, marked by a progressive decrease in fixation duration. However, to fully comprehend these intricate dynamics, further research is essential.

Limitations

The first limitation of the presented study arises from the above-mentioned question: to what extent do the presented saliency models reflect purely “bottom-up” and “top-down” processing? Although this is still a matter of debate, the proportion of these two components differs largely between the applied models, and thus the presented methodology can describe the differences between HC and SCHZ bottom-up and top-down processing. Also, the top-down EML-Net model, having been trained on data from individuals without neurological conditions, presents a challenge in interpretation: it is unclear whether the improved model fit observed in the control group is due to differences in the type of top-down information prioritized by patients and controls, or whether it simply reflects variances in the degree to which they prioritize such information. This ambiguity raises questions about the model’s ability to accurately capture the nuances of top-down information processing in populations with neurological conditions like SCHZ. Another limitation pertains to the antipsychotic treatment of SCHZ participants. The relationship between antipsychotic medication and oculomotor movement is a controversial topic which has been questioned before104,105,106, and our results support these concerns.

Methods

Participants

This study involved 62 subjects (37 SCHZ and 25 HC) (Table 5), matched in age, sex, and years of education (within ± 2 years). Some HCs were matched to a larger number of SCHZ patients due to the lower availability of HCs with fewer years of education, resulting in this imbalance. The number of participants was estimated by a power analysis (Appendix A). Nine participants (9 SCHZ, 0 HC) were excluded due to incorrect eye-tracking measurements (calibration deviation exceeding 0.5° during the measurement, high blink rate, fatigue, or concentration problems). Participants were recruited into the study as part of the Early-Stage Schizophrenia Outcome (ESO) Study107,108,109 and through the National Institute of Mental Health clinic, Czech Republic (NIMH CZ). The diagnostic procedure was standardized with the structured Mini-International Neuropsychiatric Interview110, and patients were diagnosed according to ICD-10111. Only patients diagnosed with schizophrenia spectrum disorder were included in the analyses (i.e., F20, F23 and F25)111. Additional inclusion criteria were age between 18 and 60 years, the absence of severe neurological illness or organic brain problems, and normal color vision as determined by the Ishihara test112. All the patients took medication at the time of participation. HCs were recruited via an advertisement from a similar socio-demographic background to the SCHZ participants. HCs were excluded if they had a history of psychiatric disorders (evaluated with a modified version of the M.I.N.I.) or if such a history was present in their first- or second-degree family members (assessed by an anamnestic questionnaire). Both groups were recruited between 2018 and 2021. The ethics committee of the NIMH CZ approved the study. All the experiments were performed in accordance with the relevant guidelines and regulations. Written, informed consent was obtained from all the subjects after receiving a complete study description. 
Participation in the research was voluntary, with a financial compensation of 500 CZK. In the SCHZ group, the current clinical condition and medication dose were also taken into consideration.

Table 5 Demographic and clinical characteristics of the experimental groups.

Visual stimuli selection and pre-processing

A total of 250 color images of everyday naturalistic scenes were used in the study. All the photographs were downloaded from public databases (Flickr, World Images, and Vecteezy) or taken by the study’s authors. The stimuli were divided into five categories (50 images each), based on their content (congruent, incongruent, physically salient, social landscape, social interaction) (Fig. 3). (1) Everyday Scenes (Congruent): This category includes images of typical, everyday environments where all elements are contextually appropriate and consistent. Such congruent scenes are expected to align well with top-down models’ predictions, as they match usual expectations of everyday environments. (2) Incongruent images: These scenes contain everyday settings but with objects that are contextually out of place or unusual. The incongruence of these objects is anticipated to challenge top-down models, which rely on contextual appropriateness, and these scenes could be more accurately predicted for individuals with SCHZ than for HC due to the expected bottom-up bias in SCHZ56. (3) Natural Scenes with Physically Salient Elements: Scenes in this category are natural environments that include elements with notable physical salience, such as unusual color, contrast, or orientation. These elements are expected to be more effectively predicted by bottom-up models, and thus potentially better predicted for individuals in the SCHZ group. (4) Scenes Depicting Social Interactions: This category comprises scenes focused on social interactions. These types of stimuli are expected to be more accurately predicted by the top-down model for the HC group, as they involve understanding social cues and contexts. (5) Social Landscapes: These are natural scenes that include elements of nature and feature humans. Termed “social landscapes,” these scenes are anticipated to align better with top-down model predictions for the HC group, as they combine elements of nature with social interactions.

Fig. 3: Examples of stimuli utilized in the experiment.

The photographs were categorized into five groups based on their content. (1) Everyday Scenes (Congruent) include images of typical, everyday environments where all elements are contextually appropriate and consistent. (2) Incongruent images contain everyday scenes but with objects that are contextually out of place or unusual. (3) Natural Scenes with Physically Salient Elements show natural environments that contain elements with notable physical salience. (4) Scenes Depicting Social Interactions comprise scenes focused on social interactions. (5) Social Landscapes are natural scenes that include elements of nature but also feature humans.

The SHINE toolbox113 for MATLAB was used to normalize the color and luminance of all stimuli. Then two saliency models, Expandable Multi-Layer NETwork (EML-Net) and Graph-Based Visual Saliency Model (GBVS) (see the Saliency Models section below), were applied to each photograph, producing one saliency map per image and model. Subsequently, a black border was added to each image to reach a resolution of 3840 × 2160 pixels. The original mean image area was M = 6,029,277.12 pix, SD = 818,762.31. The mean area of the added black borders was M = 1,487,522.88 pix, SD = 818,762.31. The image area therefore occupied approximately 80% of the monitor area. The experiment was created and presented using SR Research Experiment Builder (version 2.3.1)114.

Eye-tracking data acquisition

Eye movements were recorded using the EyeLink 1000 Plus eye tracker (SR Research Ltd., Ottawa, Ontario, Canada). The eye tracker samples raw gaze data at 1000 Hz, from which fixations and saccades are derived. Stimulus images were presented on a 27-inch 4K IPS screen (3840 × 2160, 163 PPI, 60 Hz refresh rate) covering 100% of the sRGB color space. The screen was color- and luminance-calibrated with X-Rite i1 Display Pro probes, which remained connected during the whole rating session to adjust the screen for ambient light. The eye-tracking and rating session took place in a quiet, windowless eye-tracking lab under standardized conditions for all raters. Raters were seated with their heads on a chin and forehead rest (SR Research Head Support) 70 cm from the screen. Every participant saw the images in a randomized order and was instructed to freely observe each image on the computer screen.

We determined the dominant eye of each participant using a variation of the Porta test115. Although vision is binocular, we tracked only the dominant eye. The eye tracker was calibrated with a standard nine-point routine. Calibration was validated by the EyeLink software and repeated as necessary until the optimal calibration criterion was reached.

Each trial began with a drift correction. A fixation cross on an 18% grey background appeared on the screen in one of eight possible positions, and participants were instructed to focus their gaze on it. The centers of the corner crosses were 1275 pix from the image center at angles of 155°, −155°, 25°, and −25°. The centers of the crosses above and below the image center were 542 pix away at angles of 90° and −90°. The centers of the crosses to the right and left of the image center were 1150 pix away at angles of 0° and 180°. The cross size was 183 pix with a stroke thickness of 7 pix. The semi-random, off-center position of the cross was chosen to avoid a visual bias towards the center of the image. Once the participant fixated the cross, the stimulus was presented for five seconds.
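As a worked illustration of this geometry, the polar coordinates given above can be converted to screen pixels. This is only a minimal sketch: it assumes that the image center coincides with the center of the 3840 × 2160 screen, that angles are measured counter-clockwise, and that the screen y-axis grows downward. The function and variable names are hypothetical and are not taken from the experiment code.

```python
import math

SCREEN_W, SCREEN_H = 3840, 2160          # monitor resolution from the text
CENTER = (SCREEN_W / 2, SCREEN_H / 2)    # assumed origin of the polar coordinates

# (distance in pixels, angle in degrees) for the eight cross positions in the text
CROSS_POLAR = [
    (1275, 155), (1275, -155), (1275, 25), (1275, -25),  # corners
    (542, 90), (542, -90),                                # above / below
    (1150, 0), (1150, 180),                               # right / left
]

def polar_to_screen(distance, angle_deg, center=CENTER):
    """Convert a polar offset from the center to screen pixels.

    Assumes counter-clockwise angles and a screen y-axis that grows downward.
    """
    theta = math.radians(angle_deg)
    x = center[0] + distance * math.cos(theta)
    y = center[1] - distance * math.sin(theta)
    return round(x), round(y)

cross_positions = [polar_to_screen(d, a) for d, a in CROSS_POLAR]
```

Under these assumptions the corner crosses fall at horizontal and vertical offsets close to those of the side, top, and bottom crosses, so the eight positions form an approximate rectangle around the image center.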

Symptom rating and cognitive testing

After conducting the eye-tracking measurements, we utilized the Positive and Negative Syndrome Scale (PANSS)116 to assess the severity of positive and negative symptoms in SCHZ patients. Additionally, we employed Conners’ Continuous Performance Test III (CPT)117 to evaluate attention. We hypothesized that diminished attention, as indicated by the CPT, would influence perception processing, given that visual attention is crucial for acquiring information visually117. These assessments were conducted at the National Institute of Mental Health (NIMH CZ) in a quiet, dedicated room. The entire assessment process, led by a trained psychologist, lasted approximately 2 hours. The primary objective of this psychological testing was to investigate any potential causal links between the illness, the performance of the saliency models, and the oculomotor behavior observed in the patients.

Data pre-processing and statistics

Primary pre-processing (differentiation between saccades and fixations) was performed in the EyeLink Data Viewer. The data were then exported to CSV format for further processing. In the first step, all eye-tracking (ET) data were cleaned of off-monitor fixations and saccades. The first fixation, which overlapped with the fixation cross shown between stimuli, was removed from further analysis. Further pre-processing and the statistical analysis of all tabular data (including PANSS, CPT, saliency prediction scores, and demographic data) were performed in R118 using the tidyverse package119.
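The cleaning step can be illustrated with a short sketch. The study performed it in R; Python is used here only for illustration. The record layout (keys 'trial', 'index', 'x', 'y') is hypothetical, and the sketch assumes that the first fixation of every trial is the one overlapping the fixation cross.

```python
SCREEN_W, SCREEN_H = 3840, 2160  # monitor resolution from the text

def clean_fixations(fixations):
    """Drop off-monitor fixations and the first fixation of every trial.

    `fixations` is a list of dicts with hypothetical keys:
    'trial', 'index' (1-based order within the trial), 'x', 'y'.
    """
    cleaned = []
    for fix in fixations:
        on_screen = 0 <= fix["x"] < SCREEN_W and 0 <= fix["y"] < SCREEN_H
        is_cross_fixation = fix["index"] == 1  # assumed to overlap the cross
        if on_screen and not is_cross_fixation:
            cleaned.append(fix)
    return cleaned

# Made-up records for illustration only (not study data).
example = [
    {"trial": 1, "index": 1, "x": 770, "y": 1080},    # on the fixation cross
    {"trial": 1, "index": 2, "x": 1900, "y": 1000},   # valid fixation
    {"trial": 1, "index": 3, "x": 4000, "y": 500},    # off-monitor
]
kept = clean_fixations(example)
```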

Ground truth fixation matrices were calculated from the cleaned fixation data for each participant and image in Python using the GazePointHeatMap package120. Each matrix contains the fixation averages for each image area over time. The ground truth fixation maps had the full resolution of the presented stimuli (3840 × 2160). Two separate ground truth maps were computed, one from the first five fixations and one from the sixth fixation onward, to examine whether the bottom-up signal bias in the SCHZ group persists over time. Python was also used to run both saliency models, whose code is published on GitHub (GBVS121; EML-Net122). The final performance of each saliency model was evaluated using the MIT saliency benchmark toolbox40 in MATLAB (Fig. 4).
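The split into early and late ground truth maps can be sketched as follows. This is a simplified illustration and not the GazePointHeatMap implementation: it only counts fixations per pixel, without any smoothing or duration weighting, and the function name and toy grid size are assumptions.

```python
def split_fixation_maps(fixations, width, height, split_at=5):
    """Return (early, late) fixation-count maps as nested lists [row][col].

    `fixations` is an ordered list of (x, y) pixel coordinates for one
    participant and image; fixations 1..split_at go to the early map,
    all later fixations go to the late map.
    """
    early = [[0] * width for _ in range(height)]
    late = [[0] * width for _ in range(height)]
    for order, (x, y) in enumerate(fixations, start=1):
        target = early if order <= split_at else late
        target[y][x] += 1
    return early, late

# Toy example on an 8 x 6 grid instead of the 3840 x 2160 stimulus resolution.
fixes = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 4), (7, 5)]
early_map, late_map = split_fixation_maps(fixes, width=8, height=6)
```

Each of the two maps can then be scored against a model's saliency map separately, which is what allows early and late viewing behavior to be compared.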

Fig. 4: Diagram of the data processing and analysis steps used in the study.

Pink arrows mark the processing path of the ground truth map. Green arrows mark the processing path of the saliency models. Black arrows mark the processing path of tabular data for statistical comparison. CSV comma-separated values, EDF standardized European data format for storage of medical time series, NSS normalized scanpath saliency, PANSS Positive and Negative Syndrome Scale, CPT Conners’ Continuous Performance Test III.

The between-group difference in the total examined image area was calculated using the standard distance deviation (SDD) formula in R with the mapTool package123. The relationships between the oculomotor behavior of SCHZ patients and key clinical factors (the duration of untreated psychosis and the chlorpromazine equivalent54,124,125) were also investigated in R.
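For reference, the standard distance deviation summarizes how widely a set of points is dispersed around its mean center. The sketch below uses the population form of the formula, in which the summed squared distances are divided by the number of points. Whether the R package used in the study applies the same denominator is not stated in the text, so the exact values may differ slightly.

```python
import math

def standard_distance_deviation(points):
    """Standard distance deviation of (x, y) points around their mean center."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sq_dist = sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in points)
    # Population form; other implementations may use a different denominator.
    return math.sqrt(sq_dist / n)

# Four fixations on the corners of a 2 x 2 square (toy data).
sdd = standard_distance_deviation([(0, 0), (2, 0), (2, 2), (0, 2)])
```

A larger SDD for a participant's fixations on an image indicates that a larger portion of the image area was explored.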

Finally, between-group differences in the metrics were evaluated using linear mixed-effects models (R lme4 package)126. The models used the NSS value as the dependent variable and included fixed effects for the interaction between group (patients vs. controls) and image category, along with crossed random intercepts for each individual (participant ID) and each image category. Estimating random intercepts for individual images was not feasible due to the extensive number of parameters required. Prior to modelling, the NSS score was square-root transformed to reduce the skewness of its distribution. Input and resulting distributions, as well as model residuals, were checked using density and Q-Q plots. Significance tests on fixed effects were performed using Satterthwaite’s method (R lmerTest package)127.

The Wilcoxon signed-rank test was applied to assess saccadic eye movements, which had a non-normal distribution. Pearson’s correlation was used to assess the association of medication, psychological test outcomes, and the duration of untreated psychosis with the oculomotor findings. For all these tests, the significance level was set at alpha = 0.001 to account for multiple comparisons.

For the between-group comparison of fixation duration, we used a sequential testing procedure: starting from fixation 1, the between-group differences were compared using the t-test at a significance level of alpha = 0.05. A subsequent fixation was considered significant if, and only if, the current and all preceding tests rejected the null hypothesis. This approach conforms to the closed testing procedure128 and thus controls the overall significance level at alpha = 0.05.
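The decision rule of this sequential procedure can be written down compactly. The sketch below takes already computed p-values, ordered by fixation number, and flags a fixation as significant only while no earlier test has failed to reject the null hypothesis. The function name and the example p-values are hypothetical and are not results from the study.

```python
def sequential_significance(p_values, alpha=0.05):
    """Flag each ordered test as significant only while no earlier test has failed."""
    flags = []
    still_rejecting = True
    for p in p_values:
        # Once one test fails to reject, all later tests are non-significant.
        still_rejecting = still_rejecting and p < alpha
        flags.append(still_rejecting)
    return flags

# Hypothetical p-values for fixations 1..5 (not study results).
flags = sequential_significance([0.010, 0.030, 0.200, 0.004, 0.020])
```

In this example the third test fails to reject, so the fourth and fifth fixations are not considered significant even though their individual p-values are below alpha.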

Saliency Models

The most recent top-down and bottom-up saliency models used in our study were selected based on their overall success in their category as measured by the MIT Saliency Benchmark (saliency.mit.edu)40. We selected the best-performing models from the top-down and bottom-up categories based on the NSS metric129,130,131,132, which was set as a mandatory performance indicator at the 14th European Conference on Computer Vision40. The second criterion was the availability of source code. We used the results for the MIT300 dataset131, which by its nature better reflects the stimuli used in our study than the CAT2000 dataset133, which contains only natural scenery.

As the bottom-up model, we selected the pre-trained GBVS134, which constructs a graph representation of the image in which each node corresponds to a small region of the image. The computation consists of two steps. First, it creates numerical activation maps from feature channels extracted at locations in the image (e.g., by linear filtering followed by elementary nonlinear filtering). Second, it normalizes the activation maps in a way that emphasizes conspicuity and allows combination with other maps135. The model takes a Markovian approach in both steps: Markov chains are defined over the image maps, and the equilibrium distribution over map locations is treated as the activation and saliency values. The edges between the nodes are weighted by the dissimilarity of the regions they connect and by their spatial proximity. The model thus computes a saliency value for each node based on its contrast with neighboring regions. The nodes with high saliency values are considered the most visually salient regions of the image and are likely to attract human attention.
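The activation step can be illustrated on a toy feature map. The sketch below is not the GBVS code used in the study. It follows the original GBVS formulation, in which edge weights are the product of feature dissimilarity (absolute log-ratio) and a Gaussian of spatial distance, and it covers only a single feature channel without the subsequent normalization step. The value of `sigma`, the lazy (half-step) power iteration used to guarantee convergence, and all names are assumptions made for illustration. Feature values must be positive.

```python
import math

def gbvs_activation(feature_map, sigma=1.5, iterations=200):
    """Equilibrium distribution of a GBVS-style Markov chain on a small map."""
    cells = [(r, c, v) for r, row in enumerate(feature_map) for c, v in enumerate(row)]
    n = len(cells)
    # Edge weight = feature dissimilarity x Gaussian spatial proximity.
    weights = [[0.0] * n for _ in range(n)]
    for i, (ri, ci, vi) in enumerate(cells):
        for j, (rj, cj, vj) in enumerate(cells):
            dissimilarity = abs(math.log(vi / vj))
            proximity = math.exp(-((ri - rj) ** 2 + (ci - cj) ** 2) / (2 * sigma ** 2))
            weights[i][j] = dissimilarity * proximity
    # Normalize outgoing weights so each row is a transition distribution.
    transition = []
    for i, row in enumerate(weights):
        total = sum(row)
        if total == 0:  # node identical to all others: stay in place
            transition.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            transition.append([w / total for w in row])
    # Lazy power iteration (half-step averaging) so periodic chains also converge.
    p = [1.0 / n] * n
    for _ in range(iterations):
        stepped = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
        p = [0.5 * a + 0.5 * b for a, b in zip(p, stepped)]
    width = len(feature_map[0])
    return [p[k * width:(k + 1) * width] for k in range(len(feature_map))]

# Uniform toy map with one cell that differs from its surroundings.
toy_map = [[1.0, 1.0, 1.0],
           [1.0, 5.0, 1.0],
           [1.0, 1.0, 1.0]]
activation = gbvs_activation(toy_map)
```

In this toy example the probability mass of the chain concentrates on the central cell, the only location that differs from its neighbors, which is the sense in which the equilibrium distribution highlights conspicuous regions.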

As the top-down model, we selected the pre-trained EML-Net136, a deep-learning model for image saliency prediction. EML-Net uses convolutional neural network (CNN) layers to extract features from the image and then passes these features through multiple fully connected layers to predict saliency. Specifically, the encoder combines NasNet pre-trained on ImageNet and DenseNet pre-trained on Places365136, both of which were originally trained for image classification. During training, the model learns to predict the saliency map for a given input image by adjusting the weights of the neurons in the network to minimize the difference between the predicted saliency map and the ground truth map.

To enable a meaningful comparison between the two distinct prediction models, the NSS metric was selected to evaluate their performance40. Specifically, NSS measures accuracy by comparing the saliency map predicted by the model with the fixation density map from eye-tracking data (ground truth map)129. The fixation density map shows where viewers look in an image. NSS is the mean value of the normalized (zero mean, unit standard deviation) saliency map at the fixated locations, which are given by a binary fixation map where ‘ones’ represent fixations and ‘zeros’ represent other areas137. A higher NSS value indicates a better prediction of viewer attention, while a value of zero indicates chance-level prediction. NSS is widely used for comparing saliency models because it provides a straightforward and standardized way to assess their performance.
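The NSS computation can be sketched in a few lines. This is an illustrative pure-Python version and not the MIT saliency benchmark code used in the study. It normalizes with the population standard deviation, whereas benchmark implementations may use the sample standard deviation, which changes the values slightly for small maps. The toy maps are made up.

```python
import math

def nss(saliency_map, fixation_map):
    """Normalized scanpath saliency: mean z-scored saliency at fixated pixels."""
    values = [v for row in saliency_map for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    fixated = [
        (s - mean) / std
        for sal_row, fix_row in zip(saliency_map, fixation_map)
        for s, f in zip(sal_row, fix_row)
        if f  # keep only locations marked as fixated in the binary map
    ]
    return sum(fixated) / len(fixated)

# Toy 3 x 3 saliency map with a single salient pixel in the center.
saliency = [[0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0]]
fixations_hit = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # fixation on the salient pixel
fixations_miss = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]  # fixation on a non-salient pixel
score_hit = nss(saliency, fixations_hit)
score_miss = nss(saliency, fixations_miss)
```

A fixation on the salient pixel yields a positive score, while a fixation on a non-salient pixel yields a negative one, so the score reflects how well the fixated locations coincide with the regions the model predicts to be salient.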