Your guide to generative AI and ML at AWS re:Invent 2024

AWS Machine Learning Blog

The excitement is building for the fourteenth edition of AWS re:Invent, and as always, Las Vegas is set to host this spectacular event. Third, we’ll explore the robust infrastructure services from AWS powering AI innovation, featuring Amazon SageMaker, AWS Trainium, and AWS Inferentia under AI/ML, as well as Compute topics.

Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

AWS Machine Learning Blog

In the post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). eks-5e0fdde Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.

AWS at NVIDIA GTC 2024: Accelerate innovation with generative AI on AWS

AWS Machine Learning Blog

AWS was delighted to present to and connect with over 18,000 in-person and 267,000 virtual attendees at NVIDIA GTC, a global artificial intelligence (AI) conference that took place in March 2024 in San Jose, California, returning to a hybrid, in-person experience for the first time since 2019.

Customize DeepSeek-R1 distilled models using Amazon SageMaker HyperPod recipes – Part 1

AWS Machine Learning Blog

Amazon SageMaker HyperPod recipes: At re:Invent 2024, we announced the general availability of Amazon SageMaker HyperPod recipes. The launcher interfaces with underlying cluster management systems such as SageMaker HyperPod (Slurm or Kubernetes) or training jobs, which handle resource allocation and scheduling. recipes=recipe-name.

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

AWS Machine Learning Blog

The compute clusters used in these scenarios are composed of thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

AWS Machine Learning Blog

Historically, natural language processing (NLP) would be a primary research and development expense. In 2024, however, organizations are using large language models (LLMs), which require relatively little focus on NLP, shifting research and development from modeling to the infrastructure needed to support LLM workflows.

Benchmarking Amazon Nova and GPT-4o models with FloTorch

AWS Machine Learning Blog

OpenAI launched GPT-4o in May 2024, and Amazon introduced Amazon Nova models at AWS re:Invent in December 2024. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub.