Abstract Large language models (LLMs) represent the forefront of artificial intelligence and exhibit substantial potential for advancing understanding and reasoning in complex Earth system sciences. Hydrological modeling, essential for water resource management and Earth system understanding, still relies predominantly on traditional calibration paradigms and has yet to leverage recent advances in LLMs. Here, five advanced LLMs (GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-4-Maverick, and Llama-70B) were systematically evaluated for hydrological parameter calibration and compared against two benchmark optimization algorithms (SCE-UA and NSGA-III). Results reveal substantial variability in performance across models: DeepSeek-R1 achieved stable, near-optimal convergence within 200 iterations, substantially faster than SCE-UA (>1,200 iterations) or NSGA-III (>2,200 iterations), delivered slightly higher accuracy, and yielded parameters with greater physical interpretability, consistent with expert reasoning. In contrast, the other LLMs performed less favorably than the two benchmarks. These findings demonstrate both the promise and the limitations of applying LLMs to hydrological parameter calibration and provide an initial step toward exploring the broader potential of LLMs in Earth system modeling.
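
As an illustrative sketch only (not the paper's actual implementation), the LLM-in-the-loop calibration concept can be summarized as an iterative propose-evaluate loop: the LLM is shown the parameter bounds and the history of tried parameter sets with their objective scores (e.g., NSE) and asked to propose the next candidate, which is then evaluated by running the hydrological model. All names below (propose_parameters, run_hydrological_model, the parameter bounds, and the toy reservoir model) are hypothetical placeholders, and the LLM call is mocked with random sampling so the script runs without an API key.

```python
# Hypothetical sketch of LLM-in-the-loop parameter calibration (illustrative only).
import random

PARAM_BOUNDS = {                       # assumed calibration ranges, not from the paper
    "soil_capacity": (50.0, 500.0),    # mm
    "recession_coeff": (0.01, 0.99),   # 1/day
}

def run_hydrological_model(params, precip):
    """Toy linear-reservoir model standing in for the real hydrological model."""
    storage, flows = 0.0, []
    for p in precip:
        storage = min(storage + p, params["soil_capacity"])
        q = params["recession_coeff"] * storage
        storage -= q
        flows.append(q)
    return flows

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 indicates a perfect fit."""
    mean_obs = sum(obs) / len(obs)
    num = sum((s - o) ** 2 for s, o in zip(sim, obs))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def propose_parameters(history):
    """Placeholder for the LLM call: in the calibration setting described in the
    abstract, the prompt would contain PARAM_BOUNDS plus the (parameters, NSE)
    history, and the model would return the next candidate set. Here we simply
    sample at random so the example is self-contained and runnable."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_BOUNDS.items()}

# Synthetic forcing and "observed" flows generated from known parameters,
# so the loop has a well-defined optimum to recover.
random.seed(0)
precip = [random.uniform(0, 20) for _ in range(365)]
truth = {"soil_capacity": 200.0, "recession_coeff": 0.3}
obs = run_hydrological_model(truth, precip)

history, best = [], (None, float("-inf"))
for iteration in range(200):           # iteration budget matching the abstract
    candidate = propose_parameters(history)
    score = nse(run_hydrological_model(candidate, precip), obs)
    history.append((candidate, score))
    if score > best[1]:
        best = (candidate, score)

print(f"Best NSE after 200 iterations: {best[1]:.3f}, parameters: {best[0]}")
```

In this framing, the only difference between the LLM-based calibrator and classical optimizers such as SCE-UA or NSGA-III is the proposal step: the benchmarks generate candidates via evolutionary or complex-shuffling rules, whereas the LLM reasons over the accumulated history in natural language, which is where the reported gains in convergence speed and parameter interpretability would arise.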