The rapid rise of data-driven weather forecasting has prompted growing interest in how such models perform relative to traditional numerical weather prediction systems. While recent studies have highlighted the former's superior skill on standard forecast metrics, questions remain about their ability to forecast physically complex, derived variables, particularly in the context of extreme events. In this study, we assess the performance of two leading operational data-driven models (GraphCast and Pangu-Weather) in forecasting integrated vapour transport (IVT) and atmospheric rivers (ARs), using ECMWF's IFS-HRES as a reference physics-based forecast. Forecasts are evaluated against ERA5 reanalysis over one year of global data, using three AR detection algorithms and lead times ranging from 1 to 10 days. The data-driven models achieved root-mean-square errors for IVT comparable to or slightly lower than those of IFS-HRES, particularly in the tropics and at shorter lead times, but represented the higher quantiles of the IVT distribution less well. A case study of a high-impact AR event showed that all models could forecast the main event characteristics up to five days in advance. AR characteristics and detection performance nevertheless varied substantially across detection algorithms. Notably, the geometrically strictest detection method revealed a clearer advantage for IFS-HRES, especially in the midlatitudes and at shorter lead times. Overall, while no model systematically outperformed the others across all AR detection algorithms, the results suggest that physics-based models may retain advantages in forecasting geometrically and physically consistent derived features such as ARs, particularly under strict detection criteria. These findings underscore the importance of targeted evaluation frameworks for derived and extreme phenomena as data-driven models become more central to operational forecasting.
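To make concrete why IVT is a derived variable, the sketch below computes its magnitude from specific humidity and horizontal wind on pressure levels, using the standard definition (the vertical integral of moisture flux divided by gravitational acceleration). It is an illustration only, not the processing used in this study: the function name `ivt_magnitude`, the trapezoidal integration, and the choice of levels are assumptions.

```python
import numpy as np

G = 9.80665  # standard gravitational acceleration, m s^-2


def ivt_magnitude(q, u, v, p_pa):
    """Return IVT magnitude (kg m^-1 s^-1) from pressure-level profiles.

    q     : specific humidity (kg kg^-1)
    u, v  : zonal and meridional wind (m s^-1)
    p_pa  : pressure levels in Pa, monotonic (assumed level set)
    The level axis must be the last axis of q, u and v.
    """
    q, u, v, p = (np.asarray(a, dtype=float) for a in (q, u, v, p_pa))
    dp = np.abs(np.diff(p))  # layer thicknesses in Pa

    def integrate(f):
        # Trapezoidal rule along the level axis
        return np.sum(0.5 * (f[..., 1:] + f[..., :-1]) * dp, axis=-1)

    ivt_u = integrate(q * u) / G  # zonal moisture transport
    ivt_v = integrate(q * v) / G  # meridional moisture transport
    return np.hypot(ivt_u, ivt_v)
```

Because IVT combines humidity and both wind components across many levels, errors in any of these fields, or in their covariance, propagate into the derived quantity. This is one reason skill on individual variables need not carry over to IVT or to ARs.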