Remove 2024 Remove AWS Remove Clustering
article thumbnail

Your guide to generative AI and ML at AWS re:Invent 2024

AWS Machine Learning Blog

The excitement is building for the fourteenth edition of AWS re:Invent, and as always, Las Vegas is set to host this spectacular event. Third, we’ll explore the robust infrastructure services from AWS powering AI innovation, featuring Amazon SageMaker , AWS Trainium , and AWS Inferentia under AI/ML, as well as Compute topics.

AWS 110
article thumbnail

Speed up your cluster procurement time with Amazon SageMaker HyperPod training plans

AWS Machine Learning Blog

In this post, we demonstrate how you can address this requirement by using Amazon SageMaker HyperPod training plans , which can bring down your training cluster procurement wait time. We further guide you through using the training plan to submit SageMaker training jobs or create SageMaker HyperPod clusters.

professionals

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

How climate tech startups are building foundation models with Amazon SageMaker HyperPod

Flipboard

In 2024, climate disasters caused more than $417B in damages globally, and theres no slowing down in 2025 with LA wildfires that destroyed more than $135B in the first month of the year alone. To offer a more concrete look at these trends, the following is a deep dive into how climate tech startups are building FMs on AWS.

AWS 130
article thumbnail

Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

AWS Machine Learning Blog

In the post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). eks-5e0fdde Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.

article thumbnail

AWS at NVIDIA GTC 2024: Accelerate innovation with generative AI on AWS

AWS Machine Learning Blog

AWS was delighted to present to and connect with over 18,000 in-person and 267,000 virtual attendees at NVIDIA GTC, a global artificial intelligence (AI) conference that took place March 2024 in San Jose, California, returning to a hybrid, in-person experience for the first time since 2019.

AWS 135
article thumbnail

Accelerating Mixtral MoE fine-tuning on Amazon SageMaker with QLoRA

AWS Machine Learning Blog

Although QLoRA helps optimize memory during fine-tuning, we will use Amazon SageMaker Training to spin up a resilient training cluster, manage orchestration, and monitor the cluster for failures. To take complete advantage of this multi-GPU cluster, we use the recent support of QLoRA and PyTorch FSDP. 24xlarge compute instance.

article thumbnail

Customize DeepSeek-R1 distilled models using Amazon SageMaker HyperPod recipes – Part 1

AWS Machine Learning Blog

Amazon SageMaker HyperPod recipes At re:Invent 2024, we announced the general availability of Amazon SageMaker HyperPod recipes. The launcher interfaces with underlying cluster management systems such as SageMaker HyperPod (Slurm or Kubernetes) or training jobs, which handle resource allocation and scheduling. recipes=recipe-name.