Cancer Management and Research, Volume 18
Multimodal Fusion of 3D CT and Pathological Images for Gastric Cancer Recurrence Prediction
Authors Cao L, Tian M, Li J, Chen Z, Wang X, Zhong S, Wu G, Tang Z, Yu J
Received 20 October 2025
Accepted for publication 25 February 2026
Published 27 April 2026 Volume 2026:18 563640
DOI https://doi.org/10.2147/CMAR.S563640
Checked for plagiarism Yes
Review by Single anonymous peer review
Peer reviewer comments 2
Editor who approved publication: Professor Kattesh Katti
Longjun Cao,1 Mengxin Tian,2–4,* Jia Li,5,* Zhongtao Chen,1 Xuefei Wang,2–4,6 Su Zhong,1 Guoqing Wu,1 Zhaoqing Tang,2–4 Jinhua Yu1
1School of Biomedical Engineering and Technological Innovation, Fudan University, Shanghai, People’s Republic of China; 2Department of Gastrointestinal Surgery, Zhongshan Hospital, Fudan University, Shanghai, People’s Republic of China; 3Gastric Cancer Center, Zhongshan Hospital, Fudan University, Shanghai, People’s Republic of China; 4Cancer Center, Zhongshan Hospital, Fudan University, Shanghai, People’s Republic of China; 5Department of General Surgery, Taicang TCM Hospital Affiliated to Nanjing University of Chinese Medicine, Nanjing, Jiangsu, People’s Republic of China; 6Department of General Surgery, Zhongshan Hospital (Xiamen), Fudan University, Xiamen, People’s Republic of China
*These authors contributed equally to this work
Correspondence: Jinhua Yu, School of Biomedical Engineering and Technological Innovation, Fudan University, Shanghai, People’s Republic of China, Email [email protected]; Zhaoqing Tang, Department of Gastrointestinal Surgery, Zhongshan Hospital, Fudan University, Shanghai, People’s Republic of China, Email [email protected]
Background: Gastric cancer recurrence severely impacts postoperative outcomes, and accurate prediction is crucial for personalized management. 3D CT images (macroscopic lesion context) and Whole Slide Images (WSIs, microscopic histopathological details) offer complementary information, but effective fusion is hindered by feature dimensionality disparity and lack of robust integration strategies. Existing models show suboptimal performance due to inadequate multimodal fusion, resulting in unreliable risk assessments that cannot be clinically applied for personalized therapy.
Aim: To preoperatively identify high-risk patients and enable personalized postoperative follow-up and treatment stratification to assist clinicians, this study adopts a fusion strategy combining multi-stage attention and co-attention mechanisms to achieve efficient integration of Whole Slide Images (WSIs) and CT images, thereby providing an accurate, robust and generalizable solution for gastric cancer recurrence prediction.
Methods: This retrospective multi-center study included three datasets: a primary cohort (646 patients, Zhongshan Hospital) and two independent test sets (160 patients, Zhongshan Hospital Xiamen Branch; 140 patients, Taicang TCM Hospital). CT features were extracted using DSMAGNet integrated with the iSAFF module and GateNetwork, while Whole Slide Image (WSI) features were derived via multi-stage attention-based dimensionality reduction (MSAT). Finally, multimodal fusion of the two types of features was accomplished through a co-attention mechanism.
Results: The framework achieved an AUC of 83.4% on the primary dataset, outperforming 11 comparative methods by up to 4.2%. On external test sets, it showed superior performance with AUC improvements of 4.63% and 3.91% vs. the next-best methods. Ablation studies confirmed the effectiveness of DSMAGNet and MSAT.
Conclusion: The multimodal framework enables accurate, interpretable, and generalizable gastric cancer recurrence prediction by integrating WSI and CT images. It aids preoperative identification of high-risk patients, supporting personalized postoperative follow-up and treatment stratification to improve long-term outcomes.
Keywords: gastric cancer recurrence prediction, multimodal learning, multi-stage attention mechanism, whole slide image, CT image
Introduction
In clinical practice, physicians increasingly rely on multimodal data integration, including pathological slides, medical imaging (eg, CT, MRI), and clinical biomarkers, to improve diagnostic accuracy, prognostic stratification, and personalized therapeutic strategies for gastric cancer (GC) patients.1–3 Notably, postoperative recurrence remains a critical concern in GC management, with reported recurrence rates exceeding 30% in advanced-stage disease.4
Traditional prognostic approaches in gastric cancer rely on clinicopathological factors such as tumor size, depth of invasion, histological grade, and lymph node involvement.5–8 While these factors provide valuable clinical insights, they often fail to accurately predict recurrence—primarily because they cannot capture the inherent molecular and biological complexity of gastric cancer, nor can they reliably assess critical tumor features (eg., microvascular invasion, perineural invasion, and micrometastases) that significantly modulate recurrence risk.9–12 Against this backdrop, an improved recurrence prediction model is therefore essential for enhancing prediction accuracy, delivering more personalized assessments of recurrence risk, and ultimately guiding tailored postoperative surveillance and adjuvant therapeutic strategies to optimize patient outcomes.
With the advancement of deep neural networks, many state-of-the-art multimodal learning methods have been proposed for Whole Slide Image (WSI) analysis.1,13–16 For example, Li et al17 introduced a multimodal method based on Transformers, using attention mechanisms to guide the fusion of pathological whole slide images and medical imaging for survival analysis. Wang et al18 proposed a method that aggregates WSI information into fine-grained feature representations, followed by learning shared and modality-specific features for multimodal fusion. However, critical challenges persist in multimodal pathological image analysis that existing methods have not adequately addressed: 1) Due to the large size of WSIs (eg, approximately 150,000 × 60,000 pixels in our case), they contain much higher-dimensional information than other modalities such as CT. Effectively reducing the high-dimensional WSI features to lower-dimensional representations for further multimodal information fusion remains a challenging task; 2) Information from different modalities is often complementary and interrelated, making it a significant challenge to distill the most relevant multimodal information; 3) Existing multimodal learning methods typically employ post-hoc fusion techniques (eg, average pooling,19 concatenation,1,15 bilinear pooling20,21), which limits the model’s ability to learn relationships between features from different modalities and restricts model flexibility. Notably, these technical limitations of inadequate multimodal fusion strategies directly restrict the clinical applicability of existing prediction models, which consequently fail to address the core clinical need of identifying high-risk patients preoperatively and guiding personalized postoperative follow-up and treatment stratification for GC patients. Designing a more effective modality fusion strategy to address these issues remains a challenge.
In this paper, we propose an effective and accurate multimodal learning framework for predicting gastric cancer recurrence based on whole slide images (WSI) and CT images. To address the challenges mentioned above, we introduce a novel framework combining ResNet22 and Transformer23 architectures, designed to facilitate multi-stage feature fusion during the dimensionality reduction process of WSI features. Specifically, we apply a multi-stage attention mechanism to reduce the patch-level features of WSI into attention-level representations, aligning the feature dimensions of both modalities. To fully leverage the interaction between modalities, we implement a co-attention mechanism at an early fusion stage when WSI features contain the most informative content. In this stage, CT features are used as queries to identify WSI patch-level features most relevant to the CT features. Finally, we combine the initially fused features with the attention-level features to predict gastric cancer recurrence. Extensive experiments were conducted on three gastric cancer recurrence datasets to comprehensively validate the effectiveness and robustness of our framework. Experimental results demonstrate that our method outperforms state-of-the-art WSI prediction models and multimodal WSI analysis approaches.
Materials and Methods
Dataset and Preprocessing
This study was approved by the Ethics Committees of relevant multi-center institutions. The research included three independent datasets. The first dataset comprised 1,437 patients with a pathologically confirmed diagnosis of GC who underwent surgical resection at Zhongshan Hospital of Fudan University (Shanghai China), between April 2006 and August 2020. Among these, 646 GC patients were selected as the primary cohort. The second dataset was collected from Zhongshan Hospital Xiamen Branch, Fudan University, and included 182 patients treated between January 2019 and March 2022, with 160 gastric cancer cases selected as Independent Test Set 1. The third dataset was obtained from Taicang Traditional Chinese Medicine Hospital, Jiangsu Province, and included 146 patients treated between December 2013 and January 2024, with 140 gastric cancer cases selected as Independent Test Set 2. According to the National Comprehensive Cancer Network (NCCN) Guidelines for Gastric Cancer (version 2.2019), recurrence patterns of gastric cancer (GC) include locoregional recurrence (LR) and metastatic disease. Metastatic disease can be further subdivided into peritoneal dissemination and distant metastases. In the present study, patients were classified by their recurrence patterns into three categories: LR, peritoneal metastasis, and distant metastasis. LR encompassed recurrence in the gastric bed, the gastric remnant at the anastomosis, the duodenal stump, and/or regional gastric lymph node recurrence. Peritoneal metastasis included lesions involving the peritoneum, omentum, and mesentery. Distant metastases were defined as those occurring in other solid organs and non-gastric regional lymph nodes. Recurrence in the gastric bed and the gastric remnant at the anastomosis was confirmed by gastroscopic biopsy. Recurrence in regional gastric lymph nodes and the duodenal stump was mainly assessed by dynamic postoperative contrast-enhanced computed tomography (CT). 
On dynamic contrast-enhanced CT follow-up, regional lymph node recurrence was diagnosed if gastric regional lymph nodes showed enlargement with necrosis or progressive enlargement on serial imaging, after excluding tuberculosis and other benign etiologies. Peritoneal metastasis was diagnosed in the following scenarios: postoperative CT demonstrated nodular or mass-like thickening of the peritoneum, omentum, or mesentery; the number or size of lesions increased on dynamic follow-up; or malignant cells were detected in ascitic fluid. Distant metastasis was confirmed via dynamic postoperative CT monitoring. The same NCCN guidelines specify that most postoperative GC recurrences develop within 2 years of surgery. In the present study, all non-recurrent patients were followed up for a minimum of 2 years. Follow-up assessments were performed every 3–6 months during the first 2 years, every 6–12 months for the subsequent 3 years, and annually thereafter. The core surveillance modalities included abdominal contrast-enhanced CT, gastroscopy, and serum tumor marker testing. Consistent with the high early recurrence rate noted above, recurrence in this study was defined as any locoregional or distant tumor relapse detected within 24 months of surgical resection. The median follow-up duration across all patients was 44 months. The data screening process for the three datasets is shown in Figure 1. All patients provided verbal informed consent. Demographic and clinicopathological characteristics of the enrolled patients are detailed in Table 1.
Table 1 Clinical Information for Main Cohort and Two Independent Testing Sets
Figure 1 Process of the patient enrollment for main cohort and two independent testing sets.
Figure 2a illustrates a schematic diagram of our proposed framework. We applied instance-level feature extractors to transform paired histological and radiological data into representative features, where histological data were processed using the pre-trained TransPath24 to extract patch-level pathological features, and radiological data were encoded by DSMAGNet (Dual stream mask attention gate network) to obtain image-level representations. This ensured that input data from both modalities were consistently mapped into a high-dimensional representation space, providing a foundation for feature fusions. To extract critical information from the patch-level features of histological data, MSAT (multi-stage attention) employed a multi-stage attention mechanism, utilizing two stages of attention modules to effectively aggregate salient regional information and generate more abstract, globally aware attention-level features that significantly enhance the expressive power of pathological features. The framework further incorporated a Co-Attention layer to explore complex interactions between radiological and pathological features, learning a joint attention mapping between CT and WSI (Whole Slide Image) patches to capture synergistic patterns across modalities and identify interaction features critical for recurrence prediction. The attention mappings generated by this framework could also be visualized as WSI-level heatmaps, revealing key regions of interest identified by the model. To conclude, a fusion approach combining Fully Connected (FC) layers and Global Average Pooling (GAP) was employed to concatenate the preliminarily fused multimodal features with attention-level features, generating the final feature representations for recurrence prediction. 
The FC layers enabled nonlinear combinations of features, while GAP reduced dimensionality effectively while preserving global semantic information, thus improving the model’s predictive performance and generalization capability.
CT Embedding
Figure 2b illustrates our proposed new network architecture, DSMAGNet, which integrates the classification task into a unified end-to-end model, significantly enhancing performance. First, we preprocess the original 3D abdominal CT images using segvol25 and nnunetV2,26 extracting 3D abdominal CT images containing only the stomach and their corresponding stomach tumor mask images. These preprocessed images and masks are then input into the DSMAGNet model, which generates multimodal fusion features through feature extraction and fusion. The DSMAGNet architecture consists of two main components: the feature extraction module and the feature fusion module. The feature extraction module uses ResNet34 as the backbone network, which efficiently extracts multi-level features from the abdominal CT images. The feature fusion module aggregates multi-scale features from layers 2, 3, and 4 and combines them into CT features for input. Considering the characteristics of our dataset (the volume data of stomach and tumor are stored separately), we optimized the dual-stream feature processing pipeline. Due to the high memory requirements of multi-stream computations, the feature extraction process omits refinement operations for layers 0 and 1, retaining only higher-level feature information for processing. These coarse multi-scale features are then input into our designed iSAFF module. The iSAFF module efficiently fuses multimodal features with different volumes and semantic information, generating preliminary CT feature embeddings. To further address feature redundancy or misleading information, we propose a collaborative design of the iSAFF module and GateNetwork in DSMAGNet.
The goal of the iSAFF module is to focus on the most important information from two different data streams and fuse it through an attention-guided method. Inspired by the attention feature fusion theory proposed by Dai et al,27 we introduced the iSAFF module in DSMAGNet. Unlike the approach of Dai et al, which uses a multi-scale channel attention module to fuse semantically and scale-inconsistent features, we add an additional layer of attention. To better preserve informational features when fusing data from the two different data streams, we chose soft pooling for exponentially weighted activation downsampling. Therefore, our method embeds two distinct data stream fusion strategies, greatly improving the efficiency of feature extraction.
Figure 2c shows the designed multi-level attention-guided feature fusion module. The core idea is to leverage channel attention across multiple scales by changing the spatial pooling size in two stages. In the first stage, given the input feature $X$, global channel context $g(X)$, and local channel context $L(X)$, the refined feature $X'$ is obtained through MSS-CAM as follows:
$$X' = X \otimes M(X) = X \otimes \sigma\big(L(X) \oplus g(X)\big)$$
where $\sigma$ is the sigmoid function, $\otimes$ denotes element-wise multiplication, and $\oplus$ denotes broadcasted addition. The local channel context $L(X)$ is calculated as:
$$L(X) = \mathcal{B}\big(\delta(\mathcal{B}(X))\big)$$
where $\delta$ represents the rectified linear unit (ReLU), and $\mathcal{B}$ represents a sequential operation that first conducts batch normalization (BN) to standardize input features across the batch dimension, and then applies pointwise convolution to modify the number of channels or linearly combine feature maps. The global channel context $g(X)$ is computed as:
$$g(X) = \mathcal{B}\big(\delta(\mathcal{B}(\mathrm{SoftPool}(X)))\big)$$
The output of the soft pooling operation can be expressed as:
$$\tilde{a} = \sum_{i \in R} \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}}\, a_i$$
where $a$ represents the activation map, and $R$ denotes the set of indices corresponding to activations in the 3D spatial region. Then, based on the multi-scale soft pooling channel attention module, iSAFF can be represented as:
$$Z = M(X \uplus Y) \otimes X + \big(1 - M(X \uplus Y)\big) \otimes Y$$
where $Z$ is the fused feature, $X$ and $Y$ are the features from the two data streams in the architecture, and $\uplus$ denotes the feature fusion operation. $M(\cdot)$ represents the feature fusion module from the first stage as described earlier.
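The soft pooling and attention-guided fusion described above can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the function names (`soft_pool`, `fuse_streams`) are hypothetical, the BN and pointwise-convolution layers are omitted, and the gate is assumed to be a sigmoid of the soft-pooled sum of the two streams.

```python
import math


def soft_pool(region):
    """Exponentially weighted average of the activations in one pooling region."""
    peak = max(region)  # subtract the max so exp() cannot overflow
    weights = [math.exp(a - peak) for a in region]
    total = sum(weights)
    return sum(w * a for w, a in zip(weights, region)) / total


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def fuse_streams(stream_x, stream_y):
    """Blend two streams channel by channel with a gate in (0, 1).

    Each stream is a list of channels; each channel is a flat list of activations.
    Assumption: the gate is sigmoid(soft_pool(x + y)); learned layers are left out.
    """
    fused = []
    for x_channel, y_channel in zip(stream_x, stream_y):
        summed = [x + y for x, y in zip(x_channel, y_channel)]
        gate = sigmoid(soft_pool(summed))
        fused.append([gate * x + (1.0 - gate) * y
                      for x, y in zip(x_channel, y_channel)])
    return fused


region = [0.1, 0.2, 3.0]
pooled = soft_pool(region)              # lies between the mean and the max
ct_stream = [[0.5, 1.0], [0.0, 0.2]]    # toy CT features, 2 channels
mask_stream = [[1.0, 1.0], [0.0, 1.0]]  # toy tumor-mask features, 2 channels
fused = fuse_streams(ct_stream, mask_stream)
```

Because the weights grow exponentially with the activation, soft pooling sits between average pooling and max pooling: strong activations dominate, but weaker ones still contribute.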
To comprehensively integrate multi-scale abdominal CT with gastric cancer mask features, we introduced a gating network mechanism. As shown in Figure 2d, this mechanism is different from traditional feature fusion methods in that it dynamically assigns feature weights through an attention mechanism, thereby enhancing key region features and effectively suppressing redundant information. The ultimately generated high-quality fusion features not only retain key information from multi-scale features, but also significantly improve the efficiency and accuracy of feature expression.
Using features from layers 2, 3, and 4 as input, we first model the global context for each feature layer through an attention module:
$$A_i = G(F_i), \quad i \in \{2, 3, 4\}$$
where the corresponding attention matrix $A_i$ is generated for each input feature $F_i$ using the attention generator $G$. After global context modeling, we perform weighted adjustments to each input feature, generating new feature representations:
$$F_i' = A_i \otimes F_i$$
where $\otimes$ denotes element-wise multiplication. Subsequently, all adjusted features are mapped to a unified dimension through channel mapping and then fused. The channel mapping is achieved using a 1×1×1 convolution, and the fusion formula is as follows:
$$F_{\mathrm{CT}} = \mathrm{Fuse}\big(\phi(F_2'), \phi(F_3'), \phi(F_4')\big)$$
where $\phi$ represents the channel mapping function.
WSI Data Embedding
To address the significant dimensional gap between the WSI features and the CT image embeddings for each case, which poses a major challenge for multimodal information fusion, we propose a multi-stage attention-based feature dimensionality reduction (MSAT) paradigm, which reduces the dimensionality of WSI patch features in two stages to align them with the CT features. As shown in Figure 2a, we first crop the 20× magnified slides into non-overlapping 256×256 patches, removing any patches that lack tissue cells. Then, we use the pre-trained TransPath24 as the feature extractor, mapping each patch into a 768-dimensional feature vector. We apply SAT to group the patch-level feature vectors into K categories and further design a cross-dimensional attention module to aggregate the patch-level features of each category into attention-level feature representations. Unlike conventional attention pooling schemes, the proposed multi-stage attention module sets up K branches to aggregate each type of patch-level feature more effectively. In the first stage, we design two loss functions for feature selection and information compression. By imposing constraints on the high-dimensional features of WSI patches, we gradually reduce them to a low-dimensional representation consistent with the CT embeddings. This process not only retains crucial local information but also suppresses interference from redundant features. In the second stage, we introduce a Cross-Attention mechanism to align the reduced-dimensional WSI features with the CT features in a cross-modal manner. By dynamically assigning weights during feature alignment, this mechanism can capture key correlations between multimodal data, enabling more efficient and consistent multimodal fusion. In summary, our method addresses the dimensionality discrepancy between WSI and CT features through staged dimensionality reduction and alignment, providing an efficient and general solution for multimodal analysis.
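To give a sense of the scale of the dimensional gap, the short Python sketch below counts the non-overlapping 256×256 patches for a slide of roughly 150,000 × 60,000 pixels (the approximate size quoted in the Introduction) and the resulting number of feature values at 768 dimensions per patch. The helper name `patch_grid` is hypothetical, partial patches at the slide border are assumed to be dropped, and the count is an upper bound because patches without tissue are removed in practice.

```python
def patch_grid(width_px, height_px, patch_px=256):
    """Count non-overlapping patches that fit fully inside the slide."""
    cols = width_px // patch_px
    rows = height_px // patch_px
    return cols, rows, cols * rows


cols, rows, n_patches = patch_grid(150_000, 60_000)
feature_values = n_patches * 768  # one 768-dimensional vector per patch
print(cols, rows, n_patches, feature_values)
```

Under these assumptions a single slide yields on the order of 10^5 patch vectors before tissue filtering, which is why the patch-level features must be reduced before they can be aligned with a single CT embedding.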
As shown in Figure 2e, our dimensionality reduction method consists of two stages. In the first stage, Gate Attention is used for feature dimensionality reduction. Given the initial features $H \in \mathbb{R}^{N \times d}$, we perform dimensionality reduction by first computing two intermediate matrices:
$$A = \tanh(H W_a), \qquad B = \mathrm{sigmoid}(H W_b)$$
where $A \in \mathbb{R}^{N \times m}$ and $B \in \mathbb{R}^{N \times m}$. The element-wise product of matrices $A$ and $B$, denoted as $G$, is calculated as:
$$G = A \odot B$$
This result is then mapped to the classification space using $W_c$:
$$S = G W_c$$
Finally, we apply a softmax operation and use the resulting weights to refine the features, producing the first-stage dimensionality reduction output:
$$F^{(1)} = \mathrm{softmax}(S)^{\top} H$$
In the second stage, we employ the OvO (One-vs-Others) mechanism to enhance feature interaction after dimension reduction.
First, for the input feature $f_i$, we construct the One-vs-Others scoring function:
$$s_i = f_i^{\top} W_i \left(\frac{1}{n-1} \sum_{j \neq i} f_j\right)$$
where $W_i$ is a learnable weight matrix related to feature $f_i$, and $n$ denotes the total number of features. Next, we normalize the scores using the Softmax function and generate the interaction-enhanced features:
$$\alpha_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}}, \qquad z = \sum_{i=1}^{n} \alpha_i f_i$$
To further enhance the feature representation, we perform multi-head One-vs-Others feature interaction. The calculation process of the multi-head mechanism is as follows:
$$\mathrm{MultiHead}\big(F^{(1)}\big) = \mathrm{concat}\big(z^{(1)}, \dots, z^{(h)}\big), \qquad z^{(k)} = \mathrm{OvO}\big(F^{(1)} W^{(k)}\big)$$
where concat() represents the concatenation of multi-head feature embeddings, and $W^{(k)}$ is the linear transformation matrix of the k-th head. Finally, the interaction-enhanced feature representation is defined as $F_{\mathrm{OvO}}$:
$$F_{\mathrm{OvO}} = \mathrm{MultiHead}\big(F^{(1)}\big)$$
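The two-stage reduction described above can be sketched in plain Python as follows. This is a minimal single-head illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the gate is assumed to use the common tanh/sigmoid pairing, one weight matrix is shared across features in the One-vs-Others score, and the toy weights are fixed rather than learned.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention_reduce(patches, w_tanh, w_sigm, w_class):
    """Stage 1: reduce N patch vectors to K attention-level vectors."""
    gated = []
    for h in patches:
        a = [math.tanh(t) for t in matvec(w_tanh, h)]
        b = [1.0 / (1.0 + math.exp(-t)) for t in matvec(w_sigm, h)]
        gated.append([x * y for x, y in zip(a, b)])      # element-wise gate
    dim = len(patches[0])
    reduced = []
    for class_row in w_class:                             # one branch per category
        scores = [sum(c * g for c, g in zip(class_row, gate)) for gate in gated]
        weights = softmax(scores)                         # normalise over patches
        reduced.append([sum(w * h[d] for w, h in zip(weights, patches))
                        for d in range(dim)])
    return reduced


def one_vs_others(features, w_score):
    """Stage 2: score each feature against the mean of all the others."""
    n, dim = len(features), len(features[0])
    scores = []
    for i, f in enumerate(features):
        others = [sum(g[d] for j, g in enumerate(features) if j != i) / (n - 1)
                  for d in range(dim)]
        scores.append(sum(x * y for x, y in zip(f, matvec(w_score, others))))
    weights = softmax(scores)
    enhanced = [sum(w * f[d] for w, f in zip(weights, features))
                for d in range(dim)]
    return enhanced, weights


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # N=4 toy patches, d=2
identity = [[1.0, 0.0], [0.0, 1.0]]
w_class = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # K=3 categories
stage1 = gated_attention_reduce(patches, identity, identity, w_class)
enhanced, ovo_weights = one_vs_others(stage1, identity)
```

In this toy run four patch vectors are reduced to three attention-level vectors, which are then combined into one interaction-enhanced vector. Scoring each feature against the mean of the others keeps the cost linear in the number of features, unlike pairwise attention.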
Feature Fusion
Current histology-radiology fusion methods typically use post-fusion-based strategies such as vector concatenation and bilinear pooling due to the large data heterogeneity gap between gigapixel WSI and CT images. Strategies based only on late fusion limit the interaction between histological and radiological data.
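First, let $F_{\mathrm{CT}}$ represent CT features and $H$ represent WSI patch-level features, with $H \in \mathbb{R}^{N \times d}$, where $N$ is the number of Hist feature vectors. $F_{\mathrm{OvO}}$ denotes the interaction-enhanced feature representation from the previous stage. To achieve efficient feature fusion, we adopt a Co-Attention mechanism, where $F_{\mathrm{CT}}$ acts as the Query (Q) and $H$ serves as the Key (K) and Value (V). The specific calculation is as follows:
$$F_{\mathrm{co}} = \mathrm{softmax}\!\left(\frac{(F_{\mathrm{CT}} W_Q)(H W_K)^{\top}}{\sqrt{d_k}}\right) H W_V$$
where $W_Q$, $W_K$, and $W_V$ are learnable projection matrices for the Query, Key, and Value, respectively, and $d_k$ represents the embedding dimension. Next, $F_{\mathrm{co}}$ is concatenated with $F_{\mathrm{OvO}}$ for feature integration:
$$F_{\mathrm{fused}} = \mathrm{concat}\big(F_{\mathrm{co}}, F_{\mathrm{OvO}}\big)$$
The final fused feature $F_{\mathrm{fused}}$ combines complementary information between CT and Hist features. Simultaneously, $F_{\mathrm{OvO}}$ embeds high-order relationships among Hist features, providing richer feature representations for downstream tasks.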
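The co-attention step described above, with CT features as the query and WSI patch features as keys and values, can be sketched in plain Python. This is an illustrative sketch assuming standard scaled dot-product attention with a single CT query vector; the function name `co_attention`, the identity projection matrices, and the placeholder `enhanced` vector are hypothetical stand-ins for learned quantities.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(ct_feature, patch_features, w_q, w_k, w_v):
    """Attend over WSI patches using the CT feature as the query."""
    query = matvec(w_q, ct_feature)
    keys = [matvec(w_k, h) for h in patch_features]
    values = [matvec(w_v, h) for h in patch_features]
    scale = math.sqrt(len(query))                       # sqrt of embedding dim
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)                           # one weight per patch
    attended = [sum(w * v[d] for w, v in zip(weights, values))
                for d in range(len(values[0]))]
    return attended, weights


identity = [[1.0, 0.0], [0.0, 1.0]]                     # toy projections
ct_feature = [1.0, 0.0]
patch_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
attended, weights = co_attention(ct_feature, patch_features,
                                 identity, identity, identity)
enhanced = [0.3, 0.7]        # placeholder for the interaction-enhanced WSI feature
fused = attended + enhanced  # list concatenation, mirroring the concat step
```

The per-patch `weights` are the quantities that can be mapped back to patch locations to draw a WSI-level heatmap, since they indicate which patches the CT query attends to most.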
Training Procedure
We used the PyTorch library to train “DSMAGNet,” “Co-attention-based Feature Grouping,” and “MSAT” on two A100 (40 GB) GPUs. During training, we employed the Adam optimizer with a momentum of 0.9, a weight decay of 5e-4, and a batch size of 1. To address the class imbalance issue in the slide classification stage, we adopted class-weighted cross-entropy loss28 to prevent the network from overlooking the less frequent WSI classes during training.
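As an illustration of class-weighted cross-entropy, the following Python sketch computes inverse-frequency class weights and the weighted loss for a single sample. The weighting scheme (total count divided by the number of classes times the class count) is an assumption chosen for illustration, since the exact weights used in the study are not stated, and the function names are hypothetical.

```python
import math


def class_weights(labels, num_classes):
    """Inverse-frequency weights; assumes every class occurs at least once."""
    total = len(labels)
    counts = [labels.count(c) for c in range(num_classes)]
    return [total / (num_classes * count) for count in counts]


def weighted_cross_entropy(logits, label, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return weights[label] * (log_sum_exp - logits[label])


train_labels = [0, 0, 0, 1]            # toy data: class 1 (recurrence) is rare
weights = class_weights(train_labels, num_classes=2)
loss_rare = weighted_cross_entropy([0.0, 0.0], label=1, weights=weights)
loss_common = weighted_cross_entropy([0.0, 0.0], label=0, weights=weights)
```

With identical, uninformative logits, the rare class contributes three times the loss of the common class in this toy example, which is the mechanism that discourages the network from ignoring it.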
Results
Experimental Setup and Evaluation Metrics
We conducted five-fold cross-validation five times with different seeds (ie, five-times repeated five-fold cross-validation) to evaluate our prediction method, and we report the mean performance with standard deviation across the five runs, where each run is scored on its combined test folds. In our experiments, the area under the curve (AUC) together with the accuracy, precision, recall, and F1 score were used to evaluate the performance of our proposed method and the state-of-the-art methods. Among these, AUC and F1 score are the more comprehensive metrics when comparing the performance of different methods.
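The evaluation protocol above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the splits are unstratified, the seeds are arbitrary, `predict_fold` is a hypothetical callback standing in for training a model on the training indices and scoring the test indices, and the AUC is computed with the pairwise (Mann-Whitney) formulation.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed):
    """Shuffle indices with a seed and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def auc(labels, scores):
    """Probability that a positive case outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def repeated_cv(labels, predict_fold, k=5, seeds=(0, 1, 2, 3, 4)):
    """Return mean and standard deviation of the AUC over repeated k-fold runs."""
    run_aucs = []
    for seed in seeds:
        folds = k_fold_indices(len(labels), k, seed)
        scores = [0.0] * len(labels)
        for test_idx in folds:
            train_idx = [i for fold in folds if fold is not test_idx for i in fold]
            for i, s in zip(test_idx, predict_fold(train_idx, test_idx)):
                scores[i] = s                  # combine the test folds of this run
        run_aucs.append(auc(labels, scores))
    return statistics.mean(run_aucs), statistics.stdev(run_aucs)


labels = [0, 1] * 10                           # toy labels


def oracle(train_idx, test_idx):
    """Toy predictor that returns the true label as its score."""
    return [float(labels[i]) for i in test_idx]


mean_auc, std_auc = repeated_cv(labels, oracle)
```

Scoring each run on the pooled out-of-fold predictions gives one AUC per seed, so the reported standard deviation reflects variation across the five repetitions rather than across individual folds.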
Comparison with State-of-the-Art Methods
Table 2 compares the proposed method with several state-of-the-art approaches, including Transformer,23 LN_MSDA,29 DSMANet,30 TransMIL,31 DTFD-MIL,32 MCAT,15 and HMCAT.17 As a unimodal baseline, a Transformer model based on the architecture proposed by Vaswani et al23 was trained using clinical text data, employing a single Transformer encoder for the prediction of gastric cancer recurrence. LN_MSDA29 utilized a Feature Dynamic Transfer (FDT) mechanism to extract multiscale features from high-resolution 3D CT images, enhancing the model’s representation capability. DSMANet employed a 3D mask-guided attention network that leveraged masks to guide the extraction of key regional features while integrating global features, thereby effectively capturing information from lesion areas. TransMIL31 adopted Transformer modules to model morphological and spatial relationships among instances, improving the modeling of interactions between instances. DTFD-MIL32 introduced a dual-layer feature distillation mechanism, wherein WSIs were divided into pseudo-bags, feature vectors were distilled from each pseudo-bag, and a two-layer MIL model was constructed to achieve precise instance-level predictions. MCAT, as proposed by Chen et al,15 is an attention-based multimodal MIL approach. It employs a co-attention module to enable gene-informed learning of WSI feature representations, utilizes a Transformer encoder to capture intra-modality relationships among instances, and aggregates features via attention for the final prediction. HMCAT leverages hierarchical feature extractors to capture multilevel information in WSIs and incorporates a hierarchical radiology-guided co-attention module (HRCA) to model multimodal interactions between histological visual concepts and radiological features, learning hierarchical co-attention mappings across the two modalities.
For a fair comparison with our method, we extended the competing approaches using two widely adopted late-fusion mechanisms, namely concatenation33 and bilinear pooling,14,34 to integrate histological and radiological features. All models were trained under identical conditions, including five-fold cross-validation on the dataset from Zhongshan Hospital, Fudan University. The same pre-trained TransPath24 was used as the instance-level feature extractor for WSIs, and a ResNet-3422 pre-trained on ImageNet was employed as the backbone for 3D CT. Consistent training hyperparameters and loss functions were applied across all experiments, and a total of 160 models were trained for both comparative and ablation studies.
Table 2 Comparisons Between Our Proposed Method and Other State-of-the-Art Approaches
We reimplemented all existing methods based on the literature and publicly available codebases, ensuring the use of identical feature extractors to guarantee a fair comparison. Compared to unimodal methods based solely on CT data, such as ResNet-34 and LN_MSDA, our approach significantly improved classification performance. In terms of AUC, our method achieved a notable increase from 0.559 and 0.672 to 0.834, demonstrating the enhanced representational capacity of CT features through multimodal data integration. Notably, compared with the best-performing CT-based method, our approach improved ACC by 7.8 percentage points and AUC by 11.2 percentage points. For the WSI modality, our method also demonstrated strong competitiveness. Compared to TransMIL, which relies solely on WSI data, our approach improved ACC from 0.734 to 0.786, AUC from 0.758 to 0.834, and achieved a significant 12.1% increase in F1 score. Additionally, we conducted an in-depth comparison of late-fusion methods used in the WSI modality. The results indicated that neither concatenation nor bilinear pooling surpassed our method. For instance, DTFD-MIL, which employed bilinear pooling, achieved an AUC of 0.752, while our method reached 0.834, representing an absolute improvement of 8.2 percentage points. Notably, existing multimodal learning methods, such as MCAT and HMCAT, although incorporating attention mechanisms for inter-modal information exchange, exhibit limitations in retaining and effectively fusing unimodal features. For example, MCAT and HMCAT achieved AUCs of 0.767 and 0.792, respectively, both falling short of our method. This suggests that current methods struggle to fully capture the deep semantic relationships between CT and WSI, whereas our approach, through the introduction of refined inter-modal coordination strategies, not only enhances the effective transfer of cross-modal information but also preserves the core features of each modality. 
Clinical data, included as a supplementary modality in our experiments, demonstrated limited standalone performance. In summary, our method exhibited superior performance on both CT and WSI unimodal datasets and further improved classification robustness and accuracy through an effective multimodal fusion strategy. As shown in the tabulated results and ROC curves in Figure 3, our approach outperformed existing methods across nearly all evaluation metrics, underscoring its potential and practical value in multimodal medical data analysis.
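The AUC comparisons in this section are differences between two AUC values, which can be expressed either in percentage points (absolute) or as a relative change. The short Python sketch below shows both calculations using AUCs reported above; the function names are hypothetical and the rounding to one decimal place is an assumption made for display.

```python
def absolute_gain_points(baseline, ours):
    """Difference between two AUCs, in percentage points."""
    return round((ours - baseline) * 100, 1)


def relative_gain_percent(baseline, ours):
    """Difference between two AUCs, relative to the baseline, in percent."""
    return round((ours - baseline) / baseline * 100, 1)


ours = 0.834
reported_aucs = {"DTFD-MIL (bilinear pooling)": 0.752, "MCAT": 0.767, "HMCAT": 0.792}
for name, baseline in reported_aucs.items():
    print(name,
          absolute_gain_points(baseline, ours),
          relative_gain_percent(baseline, ours))
```

For the DTFD-MIL comparison, the gap of 0.082 AUC corresponds to 8.2 percentage points, or about 10.9% relative to the baseline of 0.752, so it matters which of the two conventions a reported gain follows.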
Figure 3 ROC curves for different methods of predicting recurrence based on unimodal or multimodal information.
Ablation Study
To verify the effectiveness of our proposed core model framework, we conducted a series of ablation experiments by replacing DSMAGNet with ResNet34 for feature extraction and substituting MSAT with ViT in the multimodal interaction module. The detailed experimental results are presented in Table 3. As shown in the table, DSMAGNet significantly outperformed ResNet34 in CT feature extraction, achieving an ACC of 0.708 and an AUC of 0.722, compared to ResNet34’s ACC of 0.559 and AUC of 0.593. Similarly, MSAT demonstrated superior performance over ViT in multimodal interaction, with ACC and AUC values of 0.762 and 0.791, respectively, while ViT achieved ACC of 0.743 and AUC of 0.771. When combined in the full framework, our method achieved optimal results, with an ACC of 0.786 and an AUC of 0.834. These findings confirm that DSMAGNet provides superior feature extraction for CT data, and MSAT is highly effective in optimizing multimodal fusion, thus validating the overall design of our framework.
Table 3 Ablation Studies of Different Variants of Our Method. Mean ± Standard Deviation of Reported Five-Fold Cross-Validation
To validate the generalizability of our proposed method, we conducted experiments on two independent test sets and compared our model against eleven different methods. On Test Set 1, our method achieved an AUC improvement of 3.19% over the second-best method. On Test Set 2, the gain was larger, with an AUC increase of 4.63%. These results suggest that our approach effectively captured the latent associations between multimodal data, thereby enhancing classification performance. Table 4 presents a comprehensive comparison of the AUC results, and Figure 4 shows the ROC curves for Test Set 1 and Test Set 2, further illustrating the consistent performance gains achieved by our model relative to other methods.
Table 4 Comparison Between the Baseline Model and Our Method on Two Test Sets
Figure 4 ROC curves comparing the baseline model and our method on two test sets.
In multimodal medical image analysis tasks, it is essential to aggregate features from whole slide images (WSIs) before the final fusion step to better align the information across modalities. To validate the effectiveness of the proposed MSAT method, we re-implemented several mainstream multimodal fusion methods from the literature and open-source code for feature aggregation in the WSI modality, including Attention Pooling (AP), Concat MI-FCN,35 Merged Attention,36 and the Concat-based Attention Feature Grouping (SAG). To ensure a fair comparison, we used the same feature aggregation module and the same CT embedding module across all methods. Yao et al35 used the Concat MI-FCN, which aggregates all patch-level features within the same cluster into a cluster-level feature through fully convolutional layers. However, this method lacks the flexibility of attention mechanisms, which can assign appropriate weights to different patches, resulting in inferior performance compared to Attention Pooling and MSAT; its F1 score was only 61.4%. SAG, by contrast, first performs K-means clustering on patch-level features and then aggregates the features of each cluster using the Concat Attention module.
Merged Attention and AP are common modules for multimodal fusion tasks. Merged Attention does not distinguish between modality sources during the alignment stage and directly merges keys and values from all modalities for feature interaction. AP performs a weighted summation of modality features based on learned weights, but this simple fusion approach struggles to fully capture the complex interactions between modalities. In contrast to these methods, the core innovation of MSAT lies in its hierarchical attention module, which progressively filters and dynamically fuses multimodal features. The multi-stage attention mechanism allows MSAT to gradually focus on more discriminative features, starting from the initial large-scale feature space. Through this progressive aggregation design, MSAT enhances the modeling of complex relationships within multimodal data. Moreover, MSAT’s dynamic feature fusion strategy adapts at each stage according to the importance of the modality features, which is intended to maximize the complementarity of multimodal information and minimize information redundancy. As shown in Table 5, all the multimodal fusion methods performed well on our dataset, with MSAT achieving the best results. For example, in terms of the F1 score, MSAT improved by 2.7% over SAG and by 3.9% over Merged Attention, demonstrating its advantages in complex modality interaction tasks.
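To make the description of Merged Attention above more concrete, the following is a minimal sketch in plain Python, not the implementation used in the study. It assumes unprojected scaled dot-product attention (no learned query, key, or value matrices), and the token values and the names `attend` and `merged_attention` are hypothetical. The point it illustrates is that keys and values from both modalities are concatenated, so each token attends over the merged set without regard to its source modality.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

def merged_attention(ct_tokens, wsi_tokens):
    """Every token attends over the concatenated CT + WSI token set."""
    merged = ct_tokens + wsi_tokens  # modality identity is discarded here
    return [attend(token, merged, merged) for token in merged]

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
ct_tokens = [[1.0, 0.0], [0.0, 1.0]]
wsi_tokens = [[1.0, 1.0]]
results = merged_attention(ct_tokens, wsi_tokens)
```

Because the merged key set carries no modality label, a CT token and a WSI token with similar embeddings receive similar weights; this is the property the text contrasts with MSAT's staged, modality-aware aggregation.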
Table 5 Model Performance Using Different Aggregation Methods
To further investigate the impact of the parameter K on model performance in the multi-stage attention module (MSAT), we conducted an ablation study, as shown in Table 6. In this experiment, we set K to 0, 1, 5, 10, 15, 20, 50, and 100 and observed the resulting trend in AUC. The results reveal that the choice of K markedly affects the model’s performance, with AUC first increasing and then decreasing. Setting K to 0 corresponds to removing the multi-stage attention mechanism (MSAT).
Table 6 The Performance of the Proposed Method in the MSAT with Different Values of the Parameter K
In this case, the model is unable to effectively extract multi-level, rich features from the WSI modality, leading to poor performance and highlighting the importance of the multi-stage attention mechanism for the model’s feature representation capability. As K gradually increases (eg, 1, 5, 10), the model’s performance improves markedly. With a larger K, MSAT can cluster the patches in WSIs more finely, so that each cluster generates more representative features; this helps the module extract key features more effectively and improves the overall performance of the model. However, when K continues to increase (eg, 15, 20, 50, 100), model performance starts to decline. We hypothesize that, although a larger K lets the model capture more fine-grained features, it also introduces an excessive number of redundant, uninformative features. A larger K also increases the number of attention branches, which raises the computational burden of generating the final WSI modality features and may reduce the network’s inference capability, negatively affecting model performance. Based on the experimental results and performance analysis, we selected K = 10 as the parameter setting for MSAT. Under this setting, the model achieved its best performance, indicating that this configuration strikes a good balance between extracting representative features and avoiding redundant information. These results further support the importance of MSAT in multimodal feature aggregation.
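The role of K can be illustrated with a short sketch that groups patch-level features into K clusters and reduces each cluster to one representative feature. This is an illustration under stated assumptions, not the MSAT module itself: it assumes K-means (which the text names only for SAG), uses mean pooling as a stand-in for a per-cluster attention branch, and all function names and toy features are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic start (evenly spaced points)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of patches")
    step = len(points) / k
    centers = [list(points[int(i * step)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster empties out
                centers[c] = mean_vec(members)
    return labels, centers

def cluster_level_features(points, k):
    """One representative feature per cluster (mean pooling as a stand-in)."""
    labels, centers = kmeans(points, k)
    feats = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        feats.append(mean_vec(members) if members else centers[c])
    return labels, feats

# Six toy patch features forming two obvious groups (hypothetical values).
patches = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, feats = cluster_level_features(patches, k=2)
```

With a small K each cluster-level feature averages over many heterogeneous patches, while a very large K produces many near-duplicate cluster features, which matches the rise-then-fall trend the authors report for AUC.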
Multimodal Interpretability
In addition to achieving superior performance in terms of AUC and ACC compared to existing models, our approach demonstrates a high degree of interpretability, enabling a deeper understanding of how each histological patch contributes to the construction of its corresponding radiological embedding. As illustrated in Figure 5, we overlay the normalized co-attention weights onto their respective spatial locations within the original WSI, forming a radiologically guided visual representation—the attention heatmap. This visualization highlights the contribution of specific tissue regions in the context of multi-modal recurrence prediction. For radiological data, we adapted the Grad-CAM technique to better suit the characteristics of medical imaging, allowing for the generation of precise heatmaps on CT images.
Figure 5 Visualization results of our method. (a) Patients with recurrence. (b) Patients without recurrence.
These cross-modal attention maps reveal distinct patterns between recurrent and non-recurrent patients, suggesting a strong semantic association between histopathological morphology and radiological features. In CT images, the multi-stage attention fusion mechanism enables the model to focus not only on the tumor core but also on the surrounding anatomical context. In recurrent cases, regions of compressed and deformed gastric tissue adjacent to the tumor receive heightened attention, potentially reflecting local invasion or early signs of recurrence. In WSIs, the attention maps further distinguish recurrent from non-recurrent cases. For non-recurrent patients, high-attention patches are predominantly located in well-structured stroma and tumor regions, indicating preserved tissue organization. In contrast, recurrent patients exhibit increased attention to tumor-infiltrated stroma, especially regions enriched with tumor-infiltrating lymphocytes (TILs) and lymphocyte clusters, features frequently associated with immune response and poor prognosis.
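The heatmap construction described above (normalizing co-attention weights and placing them at the spatial location of each patch) can be sketched as follows. This is a simplified illustration: it assumes min-max normalization, non-overlapping square patches identified by their top-left pixel coordinates, and a coarse grid with one cell per patch. The names `normalize` and `attention_heatmap` and all values are hypothetical.

```python
def normalize(weights):
    """Min-max normalize attention weights to the range [0, 1]."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]

def attention_heatmap(coords, weights, patch_size, slide_w, slide_h):
    """Place one normalized weight per patch on a rows x cols grid.

    coords holds the top-left (x, y) pixel of each patch. Cells that
    contain no tissue patch stay None so they can be rendered as background.
    """
    cols = -(-slide_w // patch_size)  # ceiling division
    rows = -(-slide_h // patch_size)
    grid = [[None] * cols for _ in range(rows)]
    for (x, y), w in zip(coords, normalize(weights)):
        grid[y // patch_size][x // patch_size] = w
    return grid

# Three tissue patches on a 512 x 512 slide region (hypothetical values).
coords = [(0, 0), (256, 0), (0, 256)]
weights = [0.2, 0.6, 1.0]
heatmap = attention_heatmap(coords, weights, patch_size=256, slide_w=512, slide_h=512)
```

In practice the grid would be upsampled and blended with a slide thumbnail through a color map; that rendering step is omitted here.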
To investigate the ability of our method to improve histology- and radiology-based patient stratification, we plotted the Kaplan-Meier (KM) curves of our method against DSMANet (CT only), DTFD-MIL (WSI only), and HMCAT in Figure 6. The log-rank test was used to assess statistical significance between the survival curves of low- and high-risk patients, where risk groups were defined by the 50th percentile of risk predictions. The results show that, compared to competing methods, our approach achieves the best discrimination between risk groups of gastric cancer patients, demonstrating its superior performance in separating patients with distinct survival outcomes. We also evaluated the prognostic value of our method alongside established clinical variables by performing a multivariate Cox regression analysis, with hazard ratios (HR) and 95% confidence intervals (CI) plotted in Figure 7. Our method yielded the highest HR of 6.97 (95% CI: 2.15–22.58, P=0.001), indicating that patients classified as high-risk by our model had a nearly 7-fold higher risk of mortality compared to those in the low-risk group. This finding indicates that our method is the strongest independent prognostic factor among all variables examined, exceeding established clinicopathological factors such as pM stage (HR=4.82, P<0.001), lymphovascular invasion (LVI, HR=2.60, P<0.001), and perineural invasion (PNI, HR=2.52, P=0.003). Other significant prognostic factors included pT stage (HR=2.08, P<0.001), pN stage (HR=1.61, P<0.001), tumor size (HR=1.16, P<0.001), and age (HR=1.03, P=0.008). In contrast, Lauren type (HR=1.08, P=0.531) and gender (HR=0.58, P=0.055) did not show significant prognostic value in this cohort. Collectively, these results highlight the superior performance of our method in stratifying patients and its potential to complement traditional clinical risk assessment tools.
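The stratification procedure described above (a split at the 50th percentile of predicted risk followed by a log-rank test) can be sketched with the standard library alone. This is a textbook two-group log-rank statistic, not the authors' analysis code. The toy follow-up data are hypothetical, and assigning predictions tied with the median to the low-risk group is an assumption.

```python
import math
from statistics import median

def split_by_median(risks):
    """Label each patient 1 (high risk) or 0 (low risk) at the 50th percentile."""
    cut = median(risks)
    return [1 if r > cut else 0 for r in risks]  # ties go to low risk (assumption)

def logrank_test(times, events, groups):
    """Two-group log-rank test; events: 1 = event observed, 0 = censored."""
    obs_minus_exp, var = 0.0, 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = [(g, ti, e) for g, ti, e in zip(groups, times, events) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for g, _, _ in at_risk if g == 1)
        d = sum(1 for _, ti, e in at_risk if ti == t and e == 1)
        d1 = sum(1 for g, ti, e in at_risk if ti == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n  # observed minus expected in high-risk group
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var if var > 0 else 0.0
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 degree of freedom
    return chi2, p_value

# Hypothetical toy cohort: predicted risk, follow-up time (months), event flag.
risks = [0.9, 0.8, 0.7, 0.85, 0.1, 0.2, 0.3, 0.15]
times = [2, 3, 4, 5, 10, 11, 12, 13]
events = [1, 1, 1, 1, 1, 1, 1, 1]
groups = split_by_median(risks)
chi2, p_value = logrank_test(times, events, groups)
```

A median split guarantees balanced groups but discards the ordering of patients within each group, which is one reason a continuous-risk Cox model is reported alongside the KM curves.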
Figure 7 Hazard ratio.
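As a reading aid for the hazard ratios, the sketch below shows how a reported HR, its 95% CI, and its P value relate to one another under the usual Wald approximation, in which the interval is symmetric on the log scale. Applying that approximation to the reported values is an assumption, since the text does not state how the interval was computed. For HR = 6.97 with CI 2.15–22.58, the implied two-sided P value is roughly 0.001, in line with the reported figure.

```python
import math
from statistics import NormalDist

def wald_check(hr, ci_low, ci_high, level=0.95):
    """Recover the standard error, z statistic, and two-sided P from an HR and its CI.

    Assumes a Wald interval: log(HR) +/- z_crit * SE.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = log_hr / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    midpoint = math.sqrt(ci_low * ci_high)  # geometric midpoint; should be close to the HR
    return {"se": se, "z": z, "p": p_two_sided, "midpoint": midpoint}

# Values reported in the text for the proposed method's risk score.
check = wald_check(hr=6.97, ci_low=2.15, ci_high=22.58)
```

The wide interval reflects a large standard error on the log scale (about 0.6), so the point estimate of a nearly 7-fold risk should be read together with its lower bound of about 2.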
To further validate the clinical reliability of our model, we assessed its probability calibration performance using a calibration curve (Figure 8). The calibration curve plots the fraction of positive recurrence events against the mean predicted probability of recurrence, with the dashed diagonal representing a perfectly calibrated model. Our method’s calibration curve (red squares) closely follows the ideal diagonal across the entire range of predicted probabilities, demonstrating that the predicted recurrence probabilities are well-aligned with the actual observed recurrence rates. Specifically, at low predicted probabilities (0.0–0.4), our model shows minor deviations but remains within a clinically acceptable range, while at higher probabilities (0.6–1.0), the curve converges almost exactly to the perfect calibration line. This strong calibration performance ensures that the risk scores generated by our model can be reliably interpreted as actionable probabilities in clinical decision-making, reducing the risk of overconfidence or underconfidence in high-stakes patient stratification scenarios. Together with the superior AUC, ACC, interpretability, and survival stratification, the excellent calibration further underscores the robustness and clinical utility of our multi-modal approach.
Figure 8 Calibration curve for recurrence probability prediction.
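The calibration curve described above can be computed by binning the predicted probabilities and comparing, within each bin, the mean prediction to the observed fraction of recurrence events. The sketch below assumes equal-width bins and drops empty bins; the bin count, function name, and toy labels are hypothetical, and the binning used for Figure 8 is not stated in the text.

```python
def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed event fraction, count) per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:  # empty bins are skipped
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, frac_pos, len(members)))
    return curve

# Hypothetical labels (1 = recurrence) and predicted recurrence probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.1, 0.9, 0.9]
curve = calibration_curve(y_true, y_prob)
```

Points lying on the diagonal (mean prediction equal to observed fraction) indicate good calibration; bins with few patients give noisy estimates, which is worth keeping in mind for small external test sets.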
Discussions
Related Works
Despite advancements in digital microscopy and computational pathology, traditional prognostic models cannot be directly applied to WSI due to their giga-pixel nature. To address this, we typically segment WSI into multiple patches and aggregate patch-level analysis results into slide-level outcomes. This “patch-to-slide” analysis method aligns with the concept of Multi-Instance Learning (MIL), where instances are aggregated into a bag. Initially, fixed-pooling-based bag embedding methods, such as max pooling and average pooling, were proposed.37 However, these methods rely on untrained embedding processes, making the model less flexible. With the growing use of attention mechanisms in deep learning, more attention-based MIL approaches31,35,38,39 have been proposed, yielding promising results. Chen et al extended attention-based MIL (AMIL)40 for weakly-supervised survival prediction using WSIs. They later introduced the Hierarchical Image Pyramid Transformer (HIPT)41 to leverage the hierarchical information within WSIs. Unlike untrained or partially trainable embedding methods, fully trainable attention mechanisms can learn the relationships between patches and flexibly assign appropriate weights to each patch. Building on this direction, this study introduces a multi-stage attention mechanism for dimensionality reduction, enabling effective embedding of WSI into low-dimensional feature representations.
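The difference between fixed pooling and attention-based MIL pooling discussed above can be shown in a few lines. The attention form follows the commonly used formulation in which each instance receives a weight proportional to exp(w · tanh(V h)); in a trained model V and w are learned, whereas here they are fixed by hand purely for illustration, and the bag values are hypothetical.

```python
import math

def max_pool(bag):
    """Fixed pooling: element-wise maximum over instances."""
    return [max(col) for col in zip(*bag)]

def mean_pool(bag):
    """Fixed pooling: element-wise mean over instances."""
    return [sum(col) / len(bag) for col in zip(*bag)]

def attention_pool(bag, V, w):
    """Attention MIL pooling: weights a_i = softmax_i(w . tanh(V h_i))."""
    scores = []
    for h in bag:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(a * h[d] for a, h in zip(weights, bag)) for d in range(len(bag[0]))]
    return pooled, weights

# A bag of three patch embeddings (hypothetical values).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned projection
w = [1.0, 1.0]                # stand-in for a learned scoring vector
bag_max = max_pool(bag)
bag_mean = mean_pool(bag)
bag_att, att_weights = attention_pool(bag, V, w)
```

Max and mean pooling treat every bag identically, while the attention weights depend on the instance content and can also be inspected per patch, which is what makes attention-based aggregation suitable for heatmap visualization.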
In recent years, there has been a growing body of research utilizing radiological data, such as CT images, for cancer prediction,42,43 leading to the development of radiomics-based methods combined with deep learning techniques.44 Traditional radiomics methods transform radiological images into predefined high-dimensional features, typically encompassing metrics based on size, shape, texture, and frequency domain analysis. Although radiomics has demonstrated its potential in revealing tumor characteristics,45 these methods often suffer from issues related to reproducibility and interpretability.46,47 With the advancement of deep convolutional neural networks (CNNs), deep learning-based prediction methods have shown promising performance on radiological data.48 For instance, Zhang et al29 proposed a feature pyramid network to extract multiscale features from high-resolution 3D CT images for lymph node metastasis prediction in gastric cancer patients. Ding et al49 introduced a feature attention module with learnable adaptive weights to fully leverage deep image feature extraction and high-level non-imaging factors through a feedforward neural network. However, as radiological data primarily provide macroscopic tumor information, we argue for the integration of histopathological data to complement the microscopic tumor characteristics for recurrence prediction. In this context, we propose an end-to-end deep learning framework to effectively integrate features across various scales, facilitating a more robust fusion of tumor and gastric volume characteristics.
Learning joint representations through multimodal deep learning presents a significant challenge due to the varying statistical properties and noise levels across different modalities.50,51 Existing modality fusion approaches primarily include vector concatenation, element-wise addition, bilinear pooling,52 and attention mechanisms. These methods have been successfully applied to language-vision tasks such as sentiment analysis21,53,54 and visual question answering (VQA).20,55 In the field of pathology, pathologists often spend substantial time analyzing pathological images alongside medical reports to reach a final diagnosis. Consequently, various multimodal approaches have been proposed for medical image analysis tasks11,56,57. For instance, Mobadersany et al33 utilized vector concatenation to integrate histological and genomic features through convolutional networks for survival outcome prediction. However, these post-fusion approaches provide limited interpretability of multimodal interactions. Addressing this limitation, Chen et al15 proposed an attention-based multimodal approach, MCAT, which aggregates whole-slide pathology images under the guidance of genomic features for survival analysis. Similarly, Wang et al18 introduced a novel transformer-guided intermediate fusion approach, BSFT, which leverages shared and modality-specific features to maximize the complementarity of multimodal information. Building on this line of research, we propose a multi-stage attention-based fusion method that systematically integrates the global context of WSIs, captures the inter-modality correlations between WSI and CT data, and effectively leverages the high-dimensional features within WSIs.
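The basic fusion operators named above differ mainly in how the fused dimensionality grows and in which cross-modal interactions they can express. The sketch below applies vector concatenation, element-wise addition, and bilinear (outer-product) fusion to two toy embeddings. The function names and values are hypothetical, and published bilinear methods typically add projections or compact approximations that are omitted here.

```python
def concat_fusion(a, b):
    """Concatenation: output size is len(a) + len(b); no explicit interaction terms."""
    return list(a) + list(b)

def additive_fusion(a, b):
    """Element-wise addition: requires equal sizes; output size is unchanged."""
    if len(a) != len(b):
        raise ValueError("element-wise addition needs embeddings of equal length")
    return [x + y for x, y in zip(a, b)]

def bilinear_fusion(a, b):
    """Flattened outer product: output size is len(a) * len(b); all pairwise products."""
    return [x * y for x in a for y in b]

# Toy CT and WSI embeddings (hypothetical values).
ct_embedding = [1.0, 2.0]
wsi_embedding = [3.0, 4.0]
fused_concat = concat_fusion(ct_embedding, wsi_embedding)
fused_add = additive_fusion(ct_embedding, wsi_embedding)
fused_bilinear = bilinear_fusion(ct_embedding, wsi_embedding)
```

The quadratic growth of the outer product is what becomes impractical when one modality contributes thousands of patch-level features, which motivates attention-based alternatives for WSI data.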
Motivation and Proposed Framework
Cancer diagnosis and prognosis often rely on multiple heterogeneous data sources, of which radiological images (eg, CT) and histological sections (WSI) are the two main types of medical images. However, most current multimodal approaches focus mainly on histological sections combined with clinical data or genomic information, while few approaches combine radiological images and histological sections to systematically explore the potential relationship and intrinsic causal mechanisms between them. To address this problem, we propose an interpretable, multi-stage attention-driven multimodal learning framework, called MSAT. MSAT captures deep interactions in CT and WSI data through a hierarchical attention mechanism. In MSAT, each patch of the WSI and each region of the CT image is represented as an embedding, and multimodal interactions between layered features are computed through a multi-stage attention module. This mechanism enables progressive local-to-global modelling of complementary features of histological and radiological data. In addition, the WSI-level multimodal interaction heatmaps generated by MSAT visualise potential connections between CT image regions and key lesions in the WSI, providing an intuitive way to understand the association between histological and radiological data. Compared with traditional multimodal learning methods, MSAT is not only optimised in terms of computational efficiency (by reducing redundant computation of feature interactions through multi-stage attention), but is also able to mine complementary information in radiological and histological images more comprehensively. In our experiments, we validate the effectiveness of MSAT by demonstrating its ability to capture causal mechanisms among data through visual analyses, further demonstrating the potential of the model for application in diagnostic and prognostic tasks.
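One possible reading of the progressive, CT-guided aggregation described above is sketched below. It is a hypothetical two-stage simplification and should not be taken as the MSAT architecture: a single CT embedding acts as the query, first over the patches within each cluster and then over the resulting cluster-level features. Learned projections are omitted, and the names `attend` and `two_stage_coattention` and all values are assumptions made for illustration.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query; returns (output, weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

def two_stage_coattention(ct_query, clusters):
    """Stage 1: CT query over patches in each cluster. Stage 2: over cluster features."""
    stage1 = [attend(ct_query, patches, patches) for patches in clusters]
    cluster_feats = [feat for feat, _ in stage1]
    fused, cluster_weights = attend(ct_query, cluster_feats, cluster_feats)
    # Per-patch contribution = cluster weight x within-cluster weight.
    patch_weights = [
        [cw * pw for pw in within]
        for cw, (_, within) in zip(cluster_weights, stage1)
    ]
    return fused, patch_weights

# One CT embedding and two clusters of WSI patch embeddings (hypothetical values).
ct_query = [1.0, 0.5]
clusters = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9], [0.2, 0.7]],
]
fused, patch_weights = two_stage_coattention(ct_query, clusters)
```

Because the per-patch contributions sum to one across the whole slide, they can be mapped back to patch coordinates to form a WSI-level heatmap, and the second stage attends over K cluster features rather than every patch, which is where the computational saving of a staged design would come from.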
Limitations
Despite the promising results, our study has several limitations. First, while the model was validated on two independent external datasets, all data were collected retrospectively from three institutions within specific geographic regions. This retrospective design inherently introduces potential selection bias, and the relatively limited sample size of the external validation sets (160 and 140 cases) may restrict the statistical power and generalizability of our findings to broader patient populations or diverse healthcare systems. Second, imaging protocols lacked standardization across institutions and time periods: CT scans varied in parameters such as slice thickness, contrast agent dosage, and scanning equipment, while WSI digitization differed in resolution and staining standards. These inconsistencies may introduce modality-specific noise, affecting feature extraction stability and the model’s transferability across institutional settings. Third, the proposed framework does not incorporate dedicated domain adaptation modules, and its adaptability to unseen domains (eg, institutions with different imaging equipment or pathological staining protocols) remains unvalidated, which may hinder real-world deployment. Fourth, the current model focuses on binary recurrence prediction and does not provide quantitative estimates of recurrence-free survival or overall survival probabilities, which are critical for personalized postoperative surveillance planning.
Future research should aim to: (1) conduct prospective multi-center validation with standardized imaging protocols and more diverse patient cohorts to improve generalizability and mitigate retrospective biases; (2) integrate domain adaptation techniques to enhance the model’s robustness across varying institutional settings and imaging workflows; (3) strengthen clinical interpretability by validating attention heatmaps with pathologists and radiologists, and quantifying the correlation between model-derived features and known prognostic factors (eg, tumor-infiltrating lymphocytes, microvascular invasion); and (4) incorporate additional clinically relevant data, such as TNM staging, tumor markers, and genomic information, to further enhance predictive accuracy.
Conclusion
In this paper, we focus on two main modalities for cancer recurrence prediction: histological and radiological data. To handle the high-dimensional information in WSIs, MSAT performs dimensionality reduction through a multi-stage attention mechanism. We then applied CT image-guided co-attention to model the multimodal interaction between histological patch features and CT images and visualized it as a WSI-level co-attention heatmap. We conducted experiments on three datasets comprising WSI and CT images. Our results show that the proposed method outperforms unimodal methods trained only on histological data, radiological data, or clinical text data. This advancement enables clinicians to preoperatively identify high-risk gastric cancer patients, support personalized postoperative follow-up and treatment stratification, and ultimately optimize patient outcomes by facilitating early intervention for recurrence-prone individuals.
Ethics Approval and Informed Consent
This study was conducted in accordance with the principles of the Declaration of Helsinki. Ethical approval for this study was obtained from the Institutional Review Boards of three institutions: Zhongshan Hospital of Fudan University (Approval No.: B2021-785R), Xiamen Branch of Fudan University Zhongshan Hospital (Approval No.: B2021-785R-2), and Taicang TCM Hospital Affiliated to Nanjing University of Chinese Medicine (Approval No.: B2021-785R-1). Verbal informed consent was obtained from all participants prior to their inclusion in the study. This procedure was approved by the ethics committees. All patient data were anonymized and handled with strict confidentiality throughout the study.
Funding
This work was supported by the Shanghai Science and Technology Commission Explorer Project (24TS1410600), the National Natural Science Foundation of China (82372096), the AI for Science Foundation of Fudan University (FudanX24AI038), and the National Key R&D Program of China, MOST (2023YFC2510000).
Disclosure
The authors report no conflicts of interest in this work.
References
1. Li H, Yang F, Xing X, et al. Multi-modal multi-instance learning using weakly correlated histopathological images and tabular clinical information. In:
2. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372(9):793–795. doi:10.1056/NEJMp1500523
3. Kosorok MR, Laber EB. Precision medicine. Annu Rev Statist. 2019;6:263–286. doi:10.1146/annurev-statistics-030718-105251
4. Shao W, Wang T, Sun L, et al. Multi-task multi-modal learning for joint diagnosis and prognosis of human cancers. Med Image Anal. 2020;65:101795. doi:10.1016/j.media.2020.101795
5. Horeweg N, de Bruyn M, Nout RA, et al. Prognostic integrated image-based immune and molecular profiling in early-stage endometrial cancer. Cancer Immunol Res. 2020;8(12):1508–1519. doi:10.1158/2326-6066.CIR-20-0149
6. Fremond S, Andani S, Barkey Wolf J, et al. Interpretable deep learning model to predict the molecular classification of endometrial cancer from haematoxylin and eosin-stained whole-slide images: a combined analysis of the PORTEC randomised trials and clinical cohorts. Lancet Digital Health. 2023;5(2):e71–e82. doi:10.1016/S2589-7500(22)00210-2
7. Lafarge MW, Koelzer VH. Towards computationally efficient prediction of molecular signatures from routine histology images. Lancet Digital Health. 2021;3(12):e752–e753. doi:10.1016/S2589-7500(21)00232-6
8. Sirinukunwattana K, Domingo E, Richman SD, et al. Image-based consensus molecular subtype (imCMS) classification of colorectal cancer using deep learning. Gut. 2021;70(3):544–554. doi:10.1136/gutjnl-2019-319866
9. Graham S, Vu QD, Raza SEA, et al. Hover-net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med Image Anal. 2019;58:101563. doi:10.1016/j.media.2019.101563
10. Lee Y, Park JH, Oh S, et al. Derivation of prognostic contextual histopathological features from whole-slide images of tumours via graph deep learning. Nat Biomed Eng. 2022;6:1–15.
11. Chen RJ, Lu MY, Williamson DFK, et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell. 2022;40(8):865–878. doi:10.1016/j.ccell.2022.07.004
12. Wulczyn E, Steiner DF, Moran M, et al. Interpretable survival prediction for colorectal cancer using deep learning. Npj Digital Med. 2021;4(1):71. doi:10.1038/s41746-021-00427-2
13. Charalampakis N, Economopoulou P, Kotsantis I, et al. Medical management of gastric cancer: a 2017 update. Cancer Med. 2018;7(1):123–133. doi:10.1002/cam4.1274
14. Chen RJ, Lu MY, Wang J, et al. Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Transact Med Imag. 2020;41(4):757–770. doi:10.1109/TMI.2020.3021387
15. Chen RJ, Lu MY, Weng WH, et al. Multimodal co-attention transformer for survival prediction in gigapixel whole slide images. In:
16. Ning Z, Du D, Tu C, et al. Relation-aware shared representation learning for cancer prognosis analysis with auxiliary clinical variables and incomplete multi-modality data. IEEE Transact Med Imag. 2021;41(1):186–198. doi:10.1109/TMI.2021.3108802
17. Zhe L, Jiang Y, Lu M, et al. Survival prediction via hierarchical multimodal co-attention transformer: a computational histology-radiology solution. IEEE Transact Med Imag. 2023;42(9):2678–2689. doi:10.1109/TMI.2023.3263010
18. Wang Z, Yu L, Ding X, et al. Shared-specific feature learning with bottleneck fusion transformer for multi-modal whole slide image analysis. IEEE Transact Med Imag. 2023;42(11):3374–3383. doi:10.1109/TMI.2023.3287256
19. Nagrani A, Yang S, Arnab A, et al. Attention bottlenecks for multimodal fusion. Advanc Neural Informat Process Syst. 2021;34:14200–14213.
20. Fukui A, Park DH, Yang D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847. 2016. doi:10.48550/arXiv.1606.01847.
21. Zadeh A, Chen M, Poria S, Cambria E, Morency LP. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250. 2017. doi:10.48550/arXiv.1707.07250.
22. He K, Zhang X, Ren S, Sun J, et al. Deep residual learning for image recognition. In:
23. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advanc Neural Informat Process Syst. 2017;30.
24. Wang X, Yang S, Zhang J, et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med Image Anal. 2022;81:102559. doi:10.1016/j.media.2022.102559
25. Du Y, Bai F, Huang T, Zhao B, et al. Segvol: universal and interactive volumetric medical image segmentation. Advanc Neural Informat Process Syst. 2024;37:110746–110783.
26. Isensee F, Wald T, Ulrich C, et al. nnu-net revisited: a call for rigorous validation in 3d medical image segmentation. In:
27. Dai Y, Gieseke F, Oehmcke S, Wu Y, Barnard K. Attentional feature fusion. In:
28. Panchapagesan S, Sun M, Khare A, et al. Multi-task learning and weighted cross-entropy for DNN-based keyword spotting. 2016. doi:10.21437/Interspeech.2016-1485
29. Zhang Y, Yuan N, Zhang Z, et al. Unsupervised domain selective graph convolutional network for preoperative prediction of lymph node metastasis in gastric cancer. Med Image Anal. 2022;79:102467. doi:10.1016/j.media.2022.102467
30. Chen Z, Tian M, Tang Z, Wang X, Yu J, et al. Dual stream mask attention network to predict the LN metastasis for gastric cancer. In:
31. Shao Z, Bian H, Chen Y, et al. Transmil: transformer based correlated multiple instance learning for whole slide image classification. Advanc Neural Informat Proc Syst. 2021;34:2136–2147.
32. Zhang H, Meng Y, Zhao Y, et al. Dtfd-mil: double-tier feature distillation multiple instance learning for histopathology whole slide image classification. In:
33. Mobadersany P, Yousefi S, Amgad M, et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proceed National Acad Sci. 2018;115:E2970–E2979. doi:10.1073/pnas.1717139115.
34. Weng W-H, Cai Y, Lin A, et al. Multimodal multitask representation learning for pathology biobank metadata prediction. arXiv preprint arXiv:1909.07846. 2019. doi:10.48550/arXiv.1909.07846.
35. Yao J, Zhu X, Jonnagaddala J, et al. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Med Image Anal. 2020;65:101789. doi:10.1016/j.media.2020.101789
36. Hendricks LA, Mellor J, Schneider R, et al. Decoupling the role of data, attention, and losses in multimodal transformers. Transact Associat Computational Linguistics. 2021;9:570–585. doi:10.1162/tacl_a_00385
37. Wang X, Yan Y, Tang P, et al. Revisiting multiple instance neural networks. Pattern Recogn. 2018;74:15–24. doi:10.1016/j.patcog.2017.08.026
38. Ilse M, Tomczak J, Welling M. Attention-based deep multiple instance learning. In:
39. Li H, Yang F, Zhao Y, et al. DT-MIL: deformable transformer for multi-instance learning on histopathological image. In:
40. Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM, et al. Medical transformer: gated axial-attention for medical image segmentation. In:
41. Gao Y, Zhou M, Metaxas DN. UTNet: a hybrid transformer architecture for medical image segmentation. In:
42. Zhang L, Dong D, Liu Z, Zhou J, Tian J. Joint multi-task learning for survival prediction of gastric cancer patients using CT images. In:
43. Chaddad A, Desrosiers C, Abdulkarim B, Niazi T. Predicting the gene status and survival outcome of lower grade glioma patients with multimodal MRI features. In:
44. Yao J, Shi Y, Cao K, et al. DeepPrognosis: preoperative prediction of pancreatic cancer survival and surgical margin via comprehensive understanding of dynamic contrast-enhanced CT imaging and tumor-vascular contact parsing. Med Image Anal. 2021;73:102150. doi:10.1016/j.media.2021.102150
45. Gillies RJ, Kinahan PE, Hricak H. Radiomics: images are more than pictures, they are data. Radiology. 2016;278(2):563–577. doi:10.1148/radiol.2015151169
46. Jia W, Li C, Gensheimer M, et al. Radiological tumour classification across imaging modality and histology. Nature Mach Intell. 2021;3(9):787–798. doi:10.1038/s42256-021-00377-0
47. Traverso A, Wee L, Dekker A, et al. Repeatability and reproducibility of radiomic features: a systematic review. Int J Radiat Oncol Biol Phys. 2018;102(4):1143–1158. doi:10.1016/j.ijrobp.2018.05.053
48. Feng B, Shi J, Huang L, et al. Robustly federated learning model for identifying high-risk patients with postoperative gastric cancer recurrence. Nat Commun. 2024;15(1):742. doi:10.1038/s41467-024-44946-4
49. Ding M, Cui H, Li B, et al. Integrating preoperative CT and clinical factors for lymph node metastasis prediction in esophageal cancer by Feature-wise Attentional Graph Neural Network (FAGNN). Int J Radiat Oncol Biol Phys. 2021;111(3):e123–e124. doi:10.1016/j.ijrobp.2021.07.545
50. Ngiam J, Khosla A, Kim M, et al. Multimodal deep learning. ICML. 2011;11.
51. Baltrušaitis T, Ahuja C, Morency L-P. Multimodal machine learning: a survey and taxonomy. IEEE Transact Pattern Analys Machine Intelligence. 2018;41:423–443. doi:10.1109/TPAMI.2018.2798607
52. Lin T-Y, RoyChowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition. In:
53. Truong Q-T, Lauw HW. Vistanet: visual aspect attention network for multimodal sentiment analysis. In:
54. Ju X, Zhang D, Li J, Zhou G. Transformer-based label set generation for multi-modal multi-label emotion detection. In:
55. Kim J-H, Jun J, Zhang B-T. Bilinear attention networks. Advan Neural Informa Proc Syst. 2018;31.
56. Boehm KM, Aherne EA, Ellenson L, et al. Multimodal data integration using machine learning improves risk stratification of high-grade serous ovarian cancer. Nat Cancer. 2022;3(6):723–733. doi:10.1038/s43018-022-00388-9
57. Sammut S-J, Crispin-Ortuzar M, Chin S-F, et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature. 2022;601(7894):623–629. doi:10.1038/s41586-021-04278-5
© 2026 The Author(s). This work is published and licensed by Dove Medical Press Limited. The full terms of this license are available at https://www.dovepress.com/terms and incorporate the Creative Commons Attribution - Non Commercial (unported, 4.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.