The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking and distributed computing. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.
The excitement is building for the fourteenth edition of AWS re:Invent, and as always, Las Vegas is set to host this spectacular event. We'll explore the robust infrastructure services from AWS powering AI innovation, featuring Amazon SageMaker, AWS Trainium, and AWS Inferentia under AI/ML, as well as Compute topics.
Llama 3.1 8B and 70B models now have inference support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. Trainium and Inferentia, enabled by the AWS Neuron software development kit (SDK), offer high performance and lower the cost of deploying Meta Llama 3.1. You need an AWS Identity and Access Management (IAM) role to access SageMaker.
Starting with the AWS Neuron 2.18 release, you can now launch Neuron DLAMIs (AWS Deep Learning AMIs) and Neuron DLCs (AWS Deep Learning Containers) with the latest released Neuron packages on the same day as the Neuron SDK release. Neuron 2.18 also adds AWS Systems Manager Parameter Store support.
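As a sketch of what the Parameter Store support enables, the snippet below resolves the latest Neuron DLAMI ID with boto3. The parameter path shown is an assumption for illustration; check the Neuron documentation for the exact paths published for your framework and OS.

```python
# A minimal sketch: resolve the latest Neuron DLAMI ID through AWS Systems
# Manager Parameter Store instead of hardcoding an AMI ID.
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

response = ssm.get_parameter(
    # Assumed parameter path for illustration; verify against Neuron docs.
    Name="/aws/service/neuron/dlami/multi-framework/ubuntu-22.04/latest/image_id"
)
ami_id = response["Parameter"]["Value"]
print(f"Latest Neuron DLAMI: {ami_id}")
```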
Although setting up a processing cluster is an alternative, it introduces its own set of complexities, from data distribution to infrastructure management. We use the purpose-built geospatial container with SageMaker Processing jobs for a simplified, managed experience to create and run a cluster.
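A minimal sketch of that pattern follows, using the SageMaker Python SDK's ScriptProcessor. The image URI, role, script name, and S3 paths are placeholders, not values from the post; substitute the geospatial container URI for your Region.

```python
# A minimal sketch of a managed geospatial processing job with SageMaker
# Processing. All identifiers below are illustrative placeholders.
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

processor = ScriptProcessor(
    image_uri="<geospatial-container-image-uri>",  # placeholder
    command=["python3"],
    role="<execution-role-arn>",                   # placeholder
    instance_count=4,
    instance_type="ml.m5.4xlarge",
)

processor.run(
    code="process_tiles.py",  # hypothetical processing script
    inputs=[ProcessingInput(source="s3://my-bucket/tiles/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/results/")],
)
```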
Mixed precision training with FP8: FP8, a data type supported by NVIDIA's H100 and H200 GPUs, enables efficient deep learning workloads. More details about FP8 can be found in FP8 Formats for Deep Learning. The example supports Llama 3.1 and requires a service quota for one p4d.24xlarge instance.
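The snippet below is a generic sketch of FP8 execution using NVIDIA's Transformer Engine, not the post's training stack; the layer sizes and recipe settings are illustrative assumptions.

```python
# A minimal sketch of FP8 mixed precision on an H100/H200 GPU using
# NVIDIA Transformer Engine. Sizes and recipe settings are illustrative.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID recipe: E4M3 format in the forward pass, E5M2 in the backward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

model = te.Linear(4096, 4096, bias=True).cuda()
inp = torch.randn(8, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)  # matmul runs in FP8 with delayed scaling

out.sum().backward()
```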
For reference, GPT-3, an earlier-generation LLM, has 175 billion parameters and requires months of non-stop training on a cluster of thousands of accelerated processors. The Carbontracker study estimates that training GPT-3 from scratch may emit up to 85 metric tons of CO2 equivalent, using clusters of specialized hardware accelerators.
The compute clusters used in these scenarios are composed of thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.
Llama 2 by Meta is an example of an LLM offered by AWS. To learn more about Llama 2 on AWS, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart. It is available in the US East (N. Virginia) and US West (Oregon) AWS Regions, and most recently reached general availability in the US East (Ohio) Region.
With containers, scaling on a cluster becomes much easier. In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium accelerators, which are purpose-built for high-performance deep learning training. Therefore, we have two different options.
Prime Air (our drones) and the computer vision technology in Amazon Go (our physical retail experience that lets consumers select items off a shelf and leave the store without having to formally check out) use deep learning. AWS has the broadest and deepest portfolio of AI and ML services at all three layers of the stack.
In this post, we walk through how to fine-tune Llama 2 on AWS Trainium , a purpose-built accelerator for LLM training, to reduce training times and costs. We review the fine-tuning scripts provided by the AWS Neuron SDK (using NeMo Megatron-LM), the various configurations we used, and the throughput results we saw.
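The post's fine-tuning scripts use the Neuron SDK with NeMo Megatron-LM; the sketch below shows only the underlying PyTorch/XLA pattern those scripts build on, with a toy model and data standing in for the real workload.

```python
# A minimal sketch of one training step on a Trainium device via
# PyTorch/XLA (the foundation of Neuron's PyTorch support). The model
# and data here are toy placeholders, not the post's Llama 2 setup.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()          # maps to a NeuronCore on Trainium
model = torch.nn.Linear(128, 2).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(32, 128).to(device)
y = torch.randint(0, 2, (32,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
xm.optimizer_step(optimizer)      # executes the lazily traced XLA graph
```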
For AWS and Outerbounds customers, the goal is to build a differentiated machine learning and artificial intelligence (ML/AI) system and reliably improve it over time. First, the AWS Trainium accelerator provides a high-performance, cost-effective, and readily available solution for training and fine-tuning large models.
AWS (Amazon Web Services), the comprehensive and evolving cloud computing platform provided by Amazon, comprises infrastructure as a service (IaaS), platform as a service (PaaS), and packaged software as a service (SaaS). With its wide array of tools and convenience, AWS has already become a popular choice for many SaaS companies.
Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Many enterprise customers choose to deploy their deep learning workloads using Kubernetes, the de facto standard for container orchestration in the cloud.
In this post, we summarize the training procedure of GPT NeoX on AWS Trainium, a purpose-built machine learning (ML) accelerator optimized for deep learning training, and outline how we cost-effectively (3.2M tokens/$) trained such models with AWS Trainium without losing any model quality.
Amazon Web Services is excited to announce the launch of the AWS Neuron Monitor container , an innovative tool designed to enhance the monitoring capabilities of AWS Inferentia and AWS Trainium chips on Amazon Elastic Kubernetes Service (Amazon EKS). The Container Insights dashboard also shows cluster status and alarms.
In this post, we describe how we built our cutting-edge productivity agent NinjaLLM, the backbone of MyNinja.ai, using AWS Trainium chips. For training, we chose to use a cluster of trn1.32xlarge instances to take advantage of Trainium chips. We used a cluster of 32 instances in order to efficiently parallelize the training.
Modern model pre-training often calls for larger cluster deployment to reduce time and cost. In October 2022, we launched Amazon EC2 Trn1 instances, powered by AWS Trainium, the second-generation machine learning accelerator designed by AWS. The following diagram shows an example.
In this post, we describe the scale of our AI offerings, the challenges with diverse AI workloads, and how we optimized mixed AI workload inference performance with AWS Graviton3 based c7g instances and achieved 20% throughput improvement, 30% latency reduction, and reduced our cost by 25–30%.
Today, we’re excited to announce the availability of Llama 2 inference and fine-tuning support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. In this post, we demonstrate how to deploy and fine-tune Llama 2 on Trainium and AWS Inferentia instances in SageMaker JumpStart.
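A minimal sketch of the deployment path follows. The model ID and instance type are assumptions for illustration; list the available Neuron model IDs in JumpStart to confirm the exact identifier for your Region.

```python
# A minimal sketch of deploying Llama 2 on an Inferentia2-backed SageMaker
# endpoint through JumpStart. model_id and instance_type are assumed values.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgenerationneuron-llama-2-7b",  # assumed ID
    instance_type="ml.inf2.xlarge",                   # assumed instance type
)
predictor = model.deploy(accept_eula=True)  # Llama models require EULA acceptance

response = predictor.predict({"inputs": "What is AWS Trainium?"})
print(response)
```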
Zeta's AI innovations over the past few years span 30 pending and issued patents, primarily related to the application of deep learning and generative AI to marketing technology. Zeta's AI innovation is powered by a proprietary machine learning operations (MLOps) system, developed in-house.
The number of companies launching generative AI applications on AWS is substantial and building quickly, including adidas, Booking.com, Bridgewater Associates, Clariant, Cox Automotive, GoDaddy, and LexisNexis Legal & Professional, to name just a few. Innovative startups like Perplexity AI are going all in on AWS for generative AI.
In this blog post and open-source project, we show you how you can pre-train a genomics language model, HyenaDNA, using your genomic data in the AWS Cloud. Amazon SageMaker is a fully managed ML service offered by AWS, designed to reduce the time and cost associated with training and tuning ML models at scale.
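As a rough sketch of launching such a pre-training run as a SageMaker training job: the entry point, hyperparameters, role, and instance type below are illustrative assumptions, not the project's real configuration (see the open-source repo for that).

```python
# A minimal sketch of a SageMaker training job for a genomics LM.
# All names and values are illustrative placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # hypothetical training script
    source_dir="src",
    role="<execution-role-arn>",     # placeholder
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    framework_version="2.0",
    py_version="py310",
    hyperparameters={"model_name": "hyenadna-small", "epochs": 2},
)
estimator.fit({"training": "s3://my-bucket/genomic-data/"})
```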
Machine learning (ML) research has proven that large language models (LLMs) trained on significantly larger datasets result in better model quality. Distributed model training requires a cluster of worker nodes that can scale. The following figure shows how FSDP works for two data-parallel processes.
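To make the two-process FSDP picture concrete, here is a minimal runnable sketch (launched with `torchrun --nproc_per_node=2 fsdp_demo.py`); the toy model is an assumption, not the post's workload.

```python
# A minimal FSDP sketch across two data-parallel processes. Each rank holds
# only a shard of the parameters and gathers full weights per layer on demand.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).cuda()

model = FSDP(model)  # shards parameters, gradients, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(16, 1024, device="cuda")
loss = model(x).sum()
loss.backward()      # gradients are reduce-scattered back to the shards
optimizer.step()
dist.destroy_process_group()
```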
Libraries such as DeepSpeed (an open-source deep learning optimization library for PyTorch) address some of these challenges, and can help accelerate model development and training. Training setup: we provisioned a managed compute cluster comprising 16 dl1.24xlarge instances using AWS Batch for pre-training of a 1.5-billion-parameter model.
Since then, this feature has been integrated into many of our managed Amazon Machine Images (AMIs), such as the Deep Learning AMI and the AWS ParallelCluster AMI. Prerequisites: to simplify reproducing the entire stack from this post, we use a container that has all the required tooling (AWS CLI, eksctl, Helm, and so on).
Similar to the rest of the industry, advancements in accelerated hardware have allowed Amazon teams to pursue model architectures using neural networks and deep learning (DL). Last year, AWS launched its AWS Trainium accelerators, which optimize performance per cost for developing and building next-generation DL models.
Our deep learning models have non-trivial requirements: they are gigabytes in size, are numerous and heterogeneous, and require GPUs for fast inference and fine-tuning. The architecture deploys a simple service in a Kubernetes pod within an EKS cluster. The following diagram illustrates the solution architecture.
As a result, machine learning practitioners must spend weeks of preparation to scale their LLM workloads to large clusters of GPUs. This release of SMP also expands PyTorch FSDP's capabilities to include tensor parallelism techniques, enabling training on massive clusters.
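The toy sketch below illustrates the core idea of tensor parallelism, splitting one layer's weight matrix across two workers; it is a pure-math illustration, not SMP's implementation, which adds communication collectives and compute/communication overlap.

```python
# A toy illustration of tensor parallelism: a linear layer's weight matrix
# is split column-wise across two "devices", each computes a slice of the
# output, and concatenation recovers the full result.
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)            # batch of activations
w = torch.randn(8, 6)            # full weight matrix

w0, w1 = w.chunk(2, dim=1)       # column shards for worker 0 and worker 1
y0 = x @ w0                      # partial output on worker 0
y1 = x @ w1                      # partial output on worker 1

y = torch.cat([y0, y1], dim=1)   # "all-gather" along the feature dimension
assert torch.allclose(y, x @ w)  # matches the unsharded computation
```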
By distributing experts across workers, expert parallelism addresses the high memory requirements of loading all experts on a single device and enables MoE training on a larger cluster. The following figure offers a simplified look at how expert parallelism works on a multi-GPU cluster.
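The sketch below is a single-process toy of that idea: a router assigns each token to its top-scoring expert, and each "worker" processes only its own expert's tokens. Real MoE training adds all-to-all token exchange, capacity limits, and load-balancing losses, all omitted here.

```python
# A toy sketch of expert parallelism with top-1 routing. In a real cluster,
# each expert lives on a different worker and tokens move via all-to-all.
import torch

num_experts, d_model = 4, 16
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
router = torch.nn.Linear(d_model, num_experts)

tokens = torch.randn(8, d_model)
expert_ids = router(tokens).argmax(dim=-1)   # top-1 routing decision per token

output = torch.zeros_like(tokens)
for eid in range(num_experts):               # each worker handles its expert
    mask = expert_ids == eid
    if mask.any():
        output[mask] = experts[eid](tokens[mask])
```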
To mitigate these challenges, we propose a federated learning (FL) framework, based on open-source FedML on AWS, which enables analyzing sensitive healthcare and life sciences (HCLS) data. It involves training a global machine learning (ML) model from distributed health data held locally at different sites.
SnapLogic uses Amazon Bedrock to build its platform, capitalizing on the proximity to data already stored in Amazon Web Services (AWS). To address customers’ requirements about data privacy and sovereignty, SnapLogic deploys the data plane within the customer’s VPC on AWS.
The MoE architecture allows activation of 37 billion parameters, enabling efficient inference by routing queries to the most relevant expert clusters. By integrating this model with Amazon SageMaker AI, you can benefit from scalable AWS infrastructure while maintaining high-quality language model capabilities.
In this post, we review the technical requirements and application design considerations for fine-tuning and serving hyper-personalized AI models at scale on AWS. SageMaker also supports unique GPU-enabled hosting options for deploying deep learning models at scale.
To add to our guidance for optimizing deep learning workloads for sustainability on AWS, this post provides recommendations that are specific to generative AI workloads. Adopt an efficient inference infrastructure: you can deploy your models on an AWS Inferentia2 accelerator.
AWS published Guidance for Optimizing MLOps for Sustainability on AWS to help customers maximize utilization and minimize waste in their ML workloads. In this blog post, you will learn how to optimize MLOps for sustainability. Start with the AWS Region you choose for your workload.
Webex’s focus on delivering inclusive collaboration experiences fuels their innovation, which uses artificial intelligence (AI) and machine learning (ML), to remove the barriers of geography, language, personality, and familiarity with technology. Webex works with the world’s leading business and productivity apps—including AWS.
Hyperparameter optimization is highly computationally demanding for deep learning models. In our solution, we implement a hyperparameter grid search on an EKS cluster for tuning a bert-base-cased model for classifying positive or negative sentiment for stock market data headlines. The code can be found in the GitHub repo.
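A minimal sketch of the grid-search loop follows: enumerate hyperparameter combinations and submit one job per combination. The grid values and the job-submission helper are hypothetical placeholders; the post's repo drives this via the Kubernetes API on EKS.

```python
# A minimal grid-search sketch: one training job per hyperparameter combo.
import itertools

grid = {
    "learning_rate": [2e-5, 3e-5, 5e-5],  # illustrative values
    "batch_size": [16, 32],
    "epochs": [3, 4],
}

def submit_training_job(params: dict) -> None:
    """Hypothetical helper: creates one Kubernetes Job on the EKS cluster."""
    print(f"submitting bert-base-cased fine-tune with {params}")

keys = list(grid)
for values in itertools.product(*grid.values()):
    submit_training_job(dict(zip(keys, values)))
```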
These factors require training an LLM over large clusters of accelerated machine learning (ML) instances. In the past few years, numerous customers have been using the AWS Cloud for LLM training. We recommend working with your AWS account team or contacting AWS Sales to determine the appropriate Region for your LLM workload.
We are excited to announce a new version (v1.2.9) of the Amazon SageMaker Operators for Kubernetes using the AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. They are also supported by AWS CloudFormation.
In this post, we share AWS guidance, learned and developed through real-world projects, as practical guides oriented toward the AWS Well-Architected Framework, which is used to build production infrastructure and applications on AWS. We focus on the operational excellence pillar.
AWS-optimized AllGather uses the following techniques to achieve better performance on AWS infrastructure compared to NCCL: we move data between instances over the Elastic Fabric Adapter (EFA) network with an all-to-all communication pattern. The post benchmarks this on p4d.24xlarge nodes (512 NVIDIA A100 GPUs) with PyTorch FSDP.
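For reference, the sketch below shows the AllGather collective this optimization targets, runnable with `torchrun --nproc_per_node=2 allgather_demo.py`; the optimized version swaps the transport underneath, so calling code like this is unchanged.

```python
# A toy AllGather demo: each rank contributes its shard and receives every
# other rank's shard. Uses "gloo" for CPU; use "nccl" on GPU clusters.
import torch
import torch.distributed as dist

dist.init_process_group("gloo")
rank, world = dist.get_rank(), dist.get_world_size()

shard = torch.full((4,), float(rank))            # this rank's parameter shard
gathered = [torch.empty(4) for _ in range(world)]
dist.all_gather(gathered, shard)                 # every rank gets every shard

full = torch.cat(gathered)                       # reassembled full tensor
print(f"rank {rank}: {full.tolist()}")
dist.destroy_process_group()
```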
In this post, we explore the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted large language models (LLMs) using Amazon SageMaker HyperPod, an Amazon Web Services (AWS) feature focused on providing purpose-built infrastructure for distributed training at scale.