Artificial Intelligence, AWS and Clustering

PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

AWS Machine Learning Blog

DECEMBER 24, 2024

The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking and distributed computing. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.

AWS

AWS Clustering Deep Learning Deep Learning

Racing into the future: How AWS DeepRacer fueled my AI and ML journey

AWS Machine Learning Blog

NOVEMBER 19, 2024

In 2018, I sat in the audience at AWS re:Invent as Andy Jassy announced AWS DeepRacer —a fully autonomous 1/18th scale race car driven by reinforcement learning. But AWS DeepRacer instantly captured my interest with its promise that even inexperienced developers could get involved in AI and ML.

AWS

AWS ML ML AI

Your guide to generative AI and ML at AWS re:Invent 2024

AWS Machine Learning Blog

NOVEMBER 19, 2024

The excitement is building for the fourteenth edition of AWS re:Invent, and as always, Las Vegas is set to host this spectacular event. Third, we’ll explore the robust infrastructure services from AWS powering AI innovation, featuring Amazon SageMaker , AWS Trainium , and AWS Inferentia under AI/ML, as well as Compute topics.

AWS

AWS ML ML AI

Webinars

Mastering Apache Airflow® 3.0: What’s New (and What’s Next) for Data Orchestration

MORE WEBINARS

ByteDance processes billions of daily videos using their multimodal video understanding models on AWS Inferentia2

AWS Machine Learning Blog

FEBRUARY 26, 2025

At ByteDance, we collaborated with Amazon Web Services (AWS) to deploy multimodal large language models (LLMs) for video understanding using AWS Inferentia2 across multiple AWS Regions around the world. Solution overview Weve collaborated with AWS since the first generation of Inferentia chips.

AWS

AWS ML ML Clustering

Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

AWS Machine Learning Blog

JULY 25, 2024

In the post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). eks-5e0fdde Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.

Clustering

Clustering AWS ML ML

How Rocket Companies modernized their data science solution on AWS

AWS Machine Learning Blog

FEBRUARY 21, 2025

Communication between the two systems was established through Kerberized Apache Livy (HTTPS) connections over AWS PrivateLink. Responsibility for maintenance and troubleshooting: Rockets DevOps/Technology team was responsible for all upgrades, scaling, and troubleshooting of the Hadoop cluster, which was installed on bare EC2 instances.

Data Science

Data Science AWS Hadoop Data Scientist

Get started quickly with AWS Trainium and AWS Inferentia using AWS Neuron DLAMI and AWS Neuron DLC

AWS Machine Learning Blog

JUNE 11, 2024

Starting with the AWS Neuron 2.18 release , you can now launch Neuron DLAMIs (AWS Deep Learning AMIs) and Neuron DLCs (AWS Deep Learning Containers) with the latest released Neuron packages on the same day as the Neuron SDK release. AWS DLCs provide a set of Docker images that are pre-installed with deep learning frameworks.

AWS

AWS Deep Learning Deep Learning ML

AWS at NVIDIA GTC 2024: Accelerate innovation with generative AI on AWS

AWS Machine Learning Blog

APRIL 11, 2024

AWS was delighted to present to and connect with over 18,000 in-person and 267,000 virtual attendees at NVIDIA GTC, a global artificial intelligence (AI) conference that took place March 2024 in San Jose, California, returning to a hybrid, in-person experience for the first time since 2019.

AWS

AWS AI AI Clustering

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

AWS Machine Learning Blog

SEPTEMBER 18, 2024

The compute clusters used in these scenarios are composed of more than thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia , custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

Clustering

Clustering AWS ML ML

Use language embeddings for zero-shot classification and semantic search with Amazon Bedrock

AWS Machine Learning Blog

FEBRUARY 13, 2025

Amazon Bedrock offers a serverless experience, so you can get started quickly, privately customize FMs with your own data, and integrate and deploy them into your applications using Amazon Web Services (AWS) services without having to manage infrastructure. AWS Lambda The API is a Fastify application written in TypeScript.

AWS

AWS K-nearest Neighbors Clustering Algorithm

Efficiently train models with large sequence lengths using Amazon SageMaker model parallel

AWS Machine Learning Blog

NOVEMBER 27, 2024

Launching a machine learning (ML) training cluster with Amazon SageMaker training jobs is a seamless process that begins with a straightforward API call, AWS Command Line Interface (AWS CLI) command, or AWS SDK interaction. Surya Kari is a Senior Generative AI Data Scientist at AWS.

AWS

AWS Clustering ML ML

Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Development Support Program

AWS Machine Learning Blog

JULY 31, 2024

Amazon Web Services (AWS) is committed to supporting the development of cutting-edge generative artificial intelligence (AI) technologies by companies and organizations across the globe. Let’s dive in and explore how these organizations are transforming what’s possible with generative AI on AWS.

AWS

AWS AI AI Clustering

Announcing New Tools for Building with Generative AI on AWS

Flipboard

APRIL 13, 2023

At AWS, we have played a key role in democratizing ML and making it accessible to anyone who wants to use it, including more than 100,000 customers of all sizes and industries. AWS has the broadest and deepest portfolio of AI and ML services at all three layers of the stack.

AWS

AWS AI AI ML

10 Things AWS Can Do for Your SaaS Company

Smart Data Collective

FEBRUARY 20, 2022

AWS (Amazon Web Services), the comprehensive and evolving cloud computing platform provided by Amazon, is comprised of infrastructure as a service (IaaS), platform as a service (PaaS) and packaged software as a service (SaaS). With its wide array of tools and convenience, AWS has already become a popular choice for many SaaS companies.

AWS

AWS Cloud Computing Data Lakes Database

Real value, real time: Production AI with Amazon SageMaker and Tecton

AWS Machine Learning Blog

DECEMBER 4, 2024

Orchestrate with Tecton-managed EMR clusters – After features are deployed, Tecton automatically creates the scheduling, provisioning, and orchestration needed for pipelines that can run on Amazon EMR compute engines. You can view and create EMR clusters directly through the SageMaker notebook.

ML

ML ML AWS AI

Boost your forecast accuracy with time series clustering

AWS Machine Learning Blog

APRIL 4, 2023

AWS provides various services catered to time series data that are low code/no code, which both machine learning (ML) and non-ML practitioners can use for building ML solutions. In this post, we seek to separate a time series dataset into individual clusters that exhibit a higher degree of similarity between its data points and reduce noise.

Clustering

Clustering ML ML AWS

Customize DeepSeek-R1 671b model using Amazon SageMaker HyperPod recipes – Part 2

AWS Machine Learning Blog

MAY 14, 2025

With HyperPod, users can begin the process by connecting to the login/head node of the Slurm cluster. You can execute each step in the training pipeline by initiating the process through the SageMaker control plane using APIs, AWS Command Line Interface (AWS CLI), or the SageMaker ModelTrainer SDK.

Clustering

Clustering AWS ML ML

Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container

AWS Machine Learning Blog

JUNE 25, 2024

Amazon Web Services is excited to announce the launch of the AWS Neuron Monitor container , an innovative tool designed to enhance the monitoring capabilities of AWS Inferentia and AWS Trainium chips on Amazon Elastic Kubernetes Service (Amazon EKS). The Container Insights dashboard also shows cluster status and alarms.

AWS

AWS ML ML Clustering

The future of productivity agents with NinjaTech AI and AWS Trainium

AWS Machine Learning Blog

JUNE 27, 2024

NinjaTech AI’s mission is to make everyone more productive by taking care of time-consuming complex tasks with fast and affordable artificial intelligence (AI) agents. In this post, we describe how we built our cutting-edge productivity agent NinjaLLM, the backbone of MyNinja.ai, using AWS Trainium chips. MyNinja.ai

AWS

AWS AI AI Clustering

Centralize model governance with SageMaker Model Registry Resource Access Manager sharing

AWS Machine Learning Blog

NOVEMBER 14, 2024

We recently announced the general availability of cross-account sharing of Amazon SageMaker Model Registry using AWS Resource Access Manager (AWS RAM) , making it easier to securely share and discover machine learning (ML) models across your AWS accounts.

AWS

AWS ML ML Machine Learning

Simple guide to training Llama 2 with AWS Trainium on Amazon SageMaker

AWS Machine Learning Blog

MAY 1, 2024

Large language models (LLMs) are making a significant impact in the realm of artificial intelligence (AI). Llama2 by Meta is an example of an LLM offered by AWS. To learn more about Llama 2 on AWS, refer to Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart.

AWS

AWS ML ML Clustering

Revolutionizing large language model training with Arcee and AWS Trainium

AWS Machine Learning Blog

APRIL 29, 2024

Close collaboration with AWS Trainium has also played a major role in making the Arcee platform extremely performant, not only accelerating model training but also reducing overall costs and enforcing compliance and data integrity in the secure AWS environment. Our cluster consisted of 16 nodes, each equipped with a trn1n.32xlarge

AWS

AWS Clustering ML ML

Sprinklr improves performance by 20% and reduces cost by 25% for machine learning inference on AWS Graviton3

AWS Machine Learning Blog

JUNE 11, 2024

Each of these products are infused with artificial intelligence (AI) capabilities to deliver exceptional customer experience. During this journey, we collaborated with our AWS technical account manager and the Graviton software engineering teams.

Machine Learning

Machine Learning Machine Learning AWS Natural Language Processing

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

AWS Machine Learning Blog

APRIL 29, 2024

For AWS and Outerbounds customers, the goal is to build a differentiated machine learning and artificial intelligence (ML/AI) system and reliably improve it over time. First, the AWS Trainium accelerator provides a high-performance, cost-effective, and readily available solution for training and fine-tuning large models.

AWS

AWS ML ML Python

Bring legacy machine learning code into Amazon SageMaker using AWS Step Functions

AWS Machine Learning Blog

MARCH 15, 2023

Tens of thousands of AWS customers use AWS machine learning (ML) services to accelerate their ML development with fully managed infrastructure and tools. The data scientist is responsible for moving the code into SageMaker, either manually or by cloning it from a code repository such as AWS CodeCommit.

AWS

AWS Machine Learning Machine Learning Data Scientist

Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium

AWS Machine Learning Blog

OCTOBER 5, 2023

In this post, we walk through how to fine-tune Llama 2 on AWS Trainium , a purpose-built accelerator for LLM training, to reduce training times and costs. We review the fine-tuning scripts provided by the AWS Neuron SDK (using NeMo Megatron-LM), the various configurations we used, and the throughput results we saw.

AWS

AWS Machine Learning Machine Learning Deep Learning

Detect hallucinations for RAG-based systems

Flipboard

MAY 16, 2025

Prerequisites To use the methods presented in this post, you need an AWS account with access to Amazon SageMaker , Amazon Bedrock , and Amazon Simple Storage Service (Amazon S3). Statement: 'AWS is Amazon subsidiary that provides cloud computing services.' Finally, we compare approaches in terms of their performance and latency.

AWS

AWS Cloud Computing Natural Language Processing AI

Creating an artificial intelligence 101

Dataconomy

MARCH 13, 2023

How to create an artificial intelligence? The creation of artificial intelligence (AI) has long been a dream of scientists, engineers, and innovators. Understanding artificial intelligence Before diving into the process of creating AI, it is important to understand the key concepts and types of AI.

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Natural Language Processing Algorithm

Boost your MLOps efficiency with these 6 must-have tools and platforms

Data Science Dojo

FEBRUARY 20, 2023

It is used by businesses across industries for a wide range of applications, including fraud prevention, marketing automation, customer service, artificial intelligence (AI), chatbots, virtual assistants, and recommendations. It provides a large cluster of clusters on a single machine.

Machine Learning

Machine Learning Machine Learning AWS Azure

How Lumi streamlines loan approvals with Amazon SageMaker AI

AWS Machine Learning Blog

APRIL 4, 2025

Integration with existing systems on AWS: Lumi seamlessly integrated SageMaker Asynchronous Inference endpoints with their existing loan processing pipeline. Using Databricks on AWS for model training, they built a pipeline to host the model in SageMaker AI, optimizing data flow and results retrieval. Follow him on LinkedIn.

AI

AI AI Machine Learning Machine Learning

CBRE and AWS perform natural language queries of structured data using Amazon Bedrock

AWS Machine Learning Blog

MAY 30, 2024

CBRE is unlocking the potential of artificial intelligence (AI) to realize value across the entire commercial real estate lifecycle—from guiding investment decisions to managing buildings. AWS Prototyping developed an AWS Cloud Development Kit (AWS CDK) stack for deployment following AWS best practices.

AWS

AWS SQL Database AI

Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium

AWS Machine Learning Blog

DECEMBER 12, 2023

In this post, we’ll summarize training procedure of GPT NeoX on AWS Trainium , a purpose-built machine learning (ML) accelerator optimized for deep learning training. M tokens/$) trained such models with AWS Trainium without losing any model quality. We’ll outline how we cost-effectively (3.2 billion in Pythia.

AWS

AWS Machine Learning Machine Learning Deep Learning

Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate

AWS Machine Learning Blog

SEPTEMBER 18, 2024

The ZMP analyzes billions of structured and unstructured data points to predict consumer intent by using sophisticated artificial intelligence (AI) to personalize experiences at scale. IAM roles : Assign appropriate AWS Identity and Access Management (IAM) roles to the tasks for accessing other AWS resources securely.

AWS

AWS Machine Learning Machine Learning ML

Introducing Amazon SageMaker HyperPod to train foundation models at scale

AWS Machine Learning Blog

NOVEMBER 30, 2023

Building foundation models (FMs) requires building, maintaining, and optimizing large clusters to train models with tens to hundreds of billions of parameters on vast amounts of data. SageMaker HyperPod integrates the Slurm Workload Manager for cluster and training job orchestration.

Clustering

Clustering AWS Machine Learning Machine Learning

Retrieval-Augmented Generation with LangChain, Amazon SageMaker JumpStart, and MongoDB Atlas semantic search

Flipboard

NOVEMBER 17, 2023

Set up a MongoDB cluster To create a free tier MongoDB Atlas cluster, follow the instructions in Create a Cluster. This type of data is often used in ML and artificial intelligence applications. Specify the AWS Lambda function that will interact with MongoDB Atlas and the LLM to provide responses.

K-nearest Neighbors

K-nearest Neighbors AWS Clustering Database

Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

AWS Machine Learning Blog

JANUARY 17, 2024

Today, we’re excited to announce the availability of Llama 2 inference and fine-tuning support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. In this post, we demonstrate how to deploy and fine-tune Llama 2 on Trainium and AWS Inferentia instances in SageMaker JumpStart.

AWS

AWS Python Machine Learning Machine Learning

Welcome to a New Era of Building in the Cloud with Generative AI on AWS

AWS Machine Learning Blog

NOVEMBER 30, 2023

The number of companies launching generative AI applications on AWS is substantial and building quickly, including adidas, Booking.com, Bridgewater Associates, Clariant, Cox Automotive, GoDaddy, and LexisNexis Legal & Professional, to name just a few. Innovative startups like Perplexity AI are going all in on AWS for generative AI.

AWS

AWS AI AI ML

Dynamic video content moderation and policy evaluation using AWS generative AI services

AWS Machine Learning Blog

MAY 30, 2024

Generative artificial intelligence (AI) has unlocked fresh opportunities for these use cases. In this post, we introduce the Media Analysis and Policy Evaluation solution, which uses AWS AI and generative AI services to provide a framework to streamline video extraction and evaluation processes.

AWS

AWS AI AI ML

Deploy generative AI models from Amazon SageMaker JumpStart using the AWS CDK

AWS Machine Learning Blog

MAY 23, 2023

In April 2023, AWS unveiled Amazon Bedrock , which provides a way to build generative AI-powered apps via pre-trained models from startups including AI21 Labs , Anthropic , and Stability AI. Amazon Bedrock also offers access to Titan foundation models, a family of models trained in-house by AWS. Deploy the AWS CDK application.

AWS

AWS AI AI ML

Scaling Large Language Model (LLM) training with Amazon EC2 Trn1 UltraClusters

Flipboard

FEBRUARY 16, 2023

Modern model pre-training often calls for larger cluster deployment to reduce time and cost. In October 2022, we launched Amazon EC2 Trn1 Instances , powered by AWS Trainium , which is the second generation machine learning accelerator designed by AWS. The following diagram shows an example.

Clustering

Clustering AWS Deep Learning Deep Learning

Deploy pre-trained models on AWS Wavelength with 5G edge using Amazon SageMaker JumpStart

AWS Machine Learning Blog

APRIL 7, 2023

To reduce the barrier to entry of ML at the edge, we wanted to demonstrate an example of deploying a pre-trained model from Amazon SageMaker to AWS Wavelength , all in less than 100 lines of code. In this post, we demonstrate how to deploy a SageMaker model to AWS Wavelength to reduce model inference latency for 5G network-based applications.

AWS

AWS Clustering ML ML

Build a Search Engine: Semantic Search System Using OpenSearch

PyImageSearch

MAY 19, 2025

Each word or sentence is mapped to a high-dimensional vector space, where similar meanings cluster together. run_opensearch.sh Running OpenSearch Locally A script to start OpenSearch using Docker for local testing before deploying to AWS. Figure 3: What Is Semantic Search? These can be used for evaluation and comparison.

K-nearest Neighbors

K-nearest Neighbors AWS Deep Learning Deep Learning

How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod

AWS Machine Learning Blog

MAY 15, 2025

Amazon SageMaker HyperPod offers an effective solution for provisioning resilient clusters to run ML workloads and develop state-of-the-art models. About the Authors Tony Wong is a Solutions Architect at AWS based in Hong Kong, specializing in financial services.

AWS

AWS ML ML Machine Learning

Cloud Data Science News Beta #1

Data Science 101

NOVEMBER 11, 2019

Call for Research Proposals Amazon is seeking proposals impact research in the Artificial Intelligence and Machine Learning areas. If you are at a University or non-profit, you can ask for cash and/or AWS credits. AWS Parallel Cluster for Machine Learning AWS Parallel Cluster is an open-source cluster management tool.

Cloud Data

Cloud Data Data Science Azure Clustering

PEFT fine tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

Racing into the future: How AWS DeepRacer fueled my AI and ML journey

Webinars

Trending Sources

Your guide to generative AI and ML at AWS re:Invent 2024

Webinars

ByteDance processes billions of daily videos using their multimodal video understanding models on AWS Inferentia2

Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

How Rocket Companies modernized their data science solution on AWS

Get started quickly with AWS Trainium and AWS Inferentia using AWS Neuron DLAMI and AWS Neuron DLC

AWS at NVIDIA GTC 2024: Accelerate innovation with generative AI on AWS

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

Use language embeddings for zero-shot classification and semantic search with Amazon Bedrock

Efficiently train models with large sequence lengths using Amazon SageMaker model parallel

Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Development Support Program

Announcing New Tools for Building with Generative AI on AWS

10 Things AWS Can Do for Your SaaS Company

Real value, real time: Production AI with Amazon SageMaker and Tecton

Boost your forecast accuracy with time series clustering

Customize DeepSeek-R1 671b model using Amazon SageMaker HyperPod recipes – Part 2

Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container

The future of productivity agents with NinjaTech AI and AWS Trainium

Centralize model governance with SageMaker Model Registry Resource Access Manager sharing

Simple guide to training Llama 2 with AWS Trainium on Amazon SageMaker

Revolutionizing large language model training with Arcee and AWS Trainium

Sprinklr improves performance by 20% and reduces cost by 25% for machine learning inference on AWS Graviton3

Develop and train large models cost-efficiently with Metaflow and AWS Trainium

Bring legacy machine learning code into Amazon SageMaker using AWS Step Functions

Fast and cost-effective LLaMA 2 fine-tuning with AWS Trainium

Detect hallucinations for RAG-based systems

Creating an artificial intelligence 101

Boost your MLOps efficiency with these 6 must-have tools and platforms

How Lumi streamlines loan approvals with Amazon SageMaker AI

CBRE and AWS perform natural language queries of structured data using Amazon Bedrock

Frugality meets Accuracy: Cost-efficient training of GPT NeoX and Pythia models with AWS Trainium

Building an efficient MLOps platform with OSS tools on Amazon ECS with AWS Fargate

Introducing Amazon SageMaker HyperPod to train foundation models at scale

Retrieval-Augmented Generation with LangChain, Amazon SageMaker JumpStart, and MongoDB Atlas semantic search

Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

Welcome to a New Era of Building in the Cloud with Generative AI on AWS

Dynamic video content moderation and policy evaluation using AWS generative AI services

Deploy generative AI models from Amazon SageMaker JumpStart using the AWS CDK

Scaling Large Language Model (LLM) training with Amazon EC2 Trn1 UltraClusters

Deploy pre-trained models on AWS Wavelength with 5G edge using Amazon SageMaker JumpStart

Build a Search Engine: Semantic Search System Using OpenSearch

How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod

Cloud Data Science News Beta #1

Stay Connected