Unraveling Transformer Optimization: A Hessian-Based Explanation for Adam’s Superiority over SGD
Large Language Models (LLMs) built on the Transformer architecture have revolutionized AI development, yet their training process remains poorly understood. A significant challenge in this domain is the…