The process of setting up and configuring a distributed training environment can be complex, requiring expertise in server management, cluster configuration, networking and distributed computing. To simplify infrastructure setup and accelerate distributed training, AWS introduced Amazon SageMaker HyperPod in late 2023.
You can streamline the process of feature engineering and data preparation with SageMaker Data Wrangler and finish each stage of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) within a single visual interface.
In this blog post and open source project, we show you how you can pre-train a genomics language model, HyenaDNA, using your genomic data in the AWS Cloud. Amazon SageMaker: Amazon SageMaker is a fully managed ML service offered by AWS, designed to reduce the time and cost associated with training and tuning ML models at scale.
With the introduction of EMR Serverless support for Apache Livy endpoints, SageMaker Studio users can now seamlessly integrate their Jupyter notebooks running sparkmagic kernels with the powerful data processing capabilities of EMR Serverless. This same interface is also used for provisioning EMR clusters.
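To make the Livy integration concrete, here is a rough sketch of what a client does against an Apache Livy endpoint using its public REST API; the endpoint URL is a placeholder, and in practice the sparkmagic kernel in SageMaker Studio performs this handshake for you.

    import json
    import time

    import requests

    LIVY_URL = "https://<your-livy-endpoint>:8998"  # placeholder endpoint
    HEADERS = {"Content-Type": "application/json"}

    # Ask Livy to start a PySpark session.
    session = requests.post(f"{LIVY_URL}/sessions",
                            data=json.dumps({"kind": "pyspark"}),
                            headers=HEADERS).json()
    session_url = f"{LIVY_URL}/sessions/{session['id']}"

    # Poll until the session is idle, then submit a Spark statement.
    while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
        time.sleep(5)
    statement = requests.post(f"{session_url}/statements",
                              data=json.dumps({"code": "spark.range(10).count()"}),
                              headers=HEADERS).json()
    print(statement)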
Conventional ML development cycles take weeks to many months and require scarce data science expertise and ML development skills. Business analysts' ideas for using ML models often sit in prolonged backlogs because of data engineering and data science teams' limited bandwidth and data preparation workloads.
AWS published Guidance for Optimizing MLOps for Sustainability on AWS to help customers maximize utilization and minimize waste in their ML workloads. The lifecycle begins with data preparation, followed by model training and tuning, and then model deployment and management, and each of these stages consumes substantial resources.
Data scientists and data engineers use Apache Spark, Apache Hive, and Presto running on Amazon EMR for large-scale data processing. This blog post shows how data professionals can use SageMaker Data Wrangler's visual interface to locate and connect to existing Amazon EMR clusters with Hive endpoints.
These factors require training an LLM over large clusters of accelerated machine learning (ML) instances. In the past few years, numerous customers have been using the AWS Cloud for LLM training. We recommend working with your AWS account team or contacting AWS Sales to determine the appropriate Region for your LLM workload.
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and ML to deliver the best price-performance at any scale. To do this, we provide an AWS CloudFormation template to create a stack that contains the resources.
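If you prefer to launch such a stack programmatically rather than through the console, a minimal boto3 sketch looks like the following; the stack name and template URL are placeholders, not the template shipped with the post.

    import boto3

    cfn = boto3.client("cloudformation")

    # Create the stack and block until CloudFormation reports success.
    cfn.create_stack(
        StackName="redshift-demo-stack",                                # placeholder
        TemplateURL="https://<bucket>.s3.amazonaws.com/template.yaml",  # placeholder
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    cfn.get_waiter("stack_create_complete").wait(StackName="redshift-demo-stack")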
This helps with data preparation and feature engineering tasks as well as model training and deployment automation. Moreover, conventional topic models require a pre-determined number of topics, which was hard to determine for our dataset. The approach instead uses three sequential BERTopic models to generate the final clustering hierarchically.
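For orientation, here is a minimal single-model BERTopic sketch (the post chains three such models hierarchically); it follows the library's standard quickstart, with the 20 Newsgroups corpus standing in for the real dataset.

    # pip install bertopic
    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups

    # Stand-in corpus; replace with your own documents.
    docs = fetch_20newsgroups(subset="train",
                              remove=("headers", "footers", "quotes")).data[:1000]

    # Unlike LDA, BERTopic infers the number of topics from the data.
    topic_model = BERTopic(min_topic_size=20)
    topics, probs = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info().head())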
With Ray and AIR, the same Python code can scale seamlessly from a laptop to a large cluster. Amazon SageMaker Pipelines allows orchestrating the end-to-end ML lifecycle from data preparation and training to model deployment as automated workflows. The full code can be found on the aws-samples-for-ray GitHub repository.
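The laptop-to-cluster claim comes from Ray's task API: the sketch below runs as-is on a single machine, and the same code fans out across a cluster when ray.init() is pointed at one.

    import ray

    ray.init()  # local machine; pass address="auto" when running on a cluster

    @ray.remote
    def square(x: int) -> int:
        return x * x

    # The same code uses however many cores (or nodes) Ray can see.
    print(ray.get([square.remote(i) for i in range(8)]))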
IAM role – SageMaker requires an AWS Identity and Access Management (IAM) role to be assigned to a SageMaker Studio domain or user profile to manage permissions effectively. An execution role update may be required to enable data browsing and the SQL run feature. You also need to create AWS Glue connections with specific connection types.
This is a joint blog with AWS and Philips. Since 2014, the company has been offering customers its Philips HealthSuite Platform, which orchestrates dozens of AWS services that healthcare and life sciences companies use to improve patient care.
This includes gathering, exploring, and understanding the business and technical aspects of the data, along with evaluating any manipulations that may be needed for the model building process. One aspect of this data preparation is feature engineering.
Fine-tuning embedding models using SageMaker: SageMaker is a fully managed machine learning service that simplifies the entire machine learning workflow, from data preparation and model training to deployment and monitoring. Prerequisites: For this walkthrough, you should have an AWS account set up.
In other words, companies need to move from a model-centric approach to a data-centric approach.” – Andrew Ng. A data-centric AI approach involves building AI systems with quality data, which in turn involves data preparation and feature engineering. Custom transforms can be written as separate steps within Data Wrangler.
Many ML algorithms train over large datasets, generalizing the patterns they find in the data and inferring results from those patterns as new, unseen records are processed. With SageMaker, data scientists and developers can quickly build and train ML models, and then deploy them into a production-ready hosted environment.
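A minimal build-train-deploy sketch with the SageMaker Python SDK follows; the IAM role, container image, and S3 path are placeholders you would replace with your own.

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<training-image-uri>",                        # placeholder
        role="arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )
    estimator.fit({"train": "s3://<bucket>/train/"})             # placeholder path

    # Deploy the trained model to a production-ready hosted endpoint.
    predictor = estimator.deploy(initial_instance_count=1,
                                 instance_type="ml.m5.xlarge")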
Given this mission, Talent.com and AWS joined forces to create a job recommendation engine using state-of-the-art natural language processing (NLP) and deep learning model training techniques with Amazon SageMaker to provide an unrivaled experience for job seekers. The recommendation system has driven an 8.6%
In 2021, we launched AWS Support Proactive Services as part of the AWS Enterprise Support offering. In Part 1, we showed how to get started using AWS Cost Explorer to identify cost optimization opportunities in SageMaker. You can build custom queries to look up AWS CUR data using standard SQL.
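As an illustration of querying CUR data with standard SQL, here is a hypothetical Athena query via boto3 that totals SageMaker cost by usage type; the database, table, and output location are placeholders.

    import boto3

    athena = boto3.client("athena")

    athena.start_query_execution(
        QueryString="""
            SELECT line_item_usage_type,
                   SUM(line_item_unblended_cost) AS cost
            FROM cur_table                        -- placeholder CUR table
            WHERE line_item_product_code = 'AmazonSageMaker'
            GROUP BY line_item_usage_type
            ORDER BY cost DESC
        """,
        QueryExecutionContext={"Database": "cur_database"},  # placeholder
        ResultConfiguration={"OutputLocation": "s3://<bucket>/athena-results/"},
    )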
Reference architecture: In this post, we use Amazon SageMaker Data Wrangler to ask a uniform set of visual questions for thousands of photos in the dataset. SageMaker Data Wrangler is purpose-built to simplify the process of data preparation and feature engineering. Charles holds an M.S. in Data Science.
An AutoML tool applies a combination of different algorithms and various preprocessing techniques to your data. For example, it can scale the data, perform univariate feature selection, conduct PCA at different variance threshold levels, and apply clustering. This solution will incur costs in your AWS account.
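To make that concrete, here is one candidate combination an AutoML tool might evaluate, sketched as a scikit-learn pipeline over synthetic data: scaling, univariate feature selection, PCA at a variance threshold, and clustering.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=10)),   # univariate selection
        ("pca", PCA(n_components=0.95)),            # keep 95% of the variance
        ("cluster", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ])
    labels = pipe.fit_predict(X, y)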
Amazon SageMaker distributed training jobs let you, with one click (or one API call), set up a distributed compute cluster, train a model, save the result to Amazon Simple Storage Service (Amazon S3), and shut down the cluster when complete. Finally, launching clusters yourself can introduce operational overhead due to longer startup times.
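A hedged sketch of that one API call with the SageMaker Python SDK is below, using the PyTorch estimator with SageMaker's data-parallel library enabled; the script, role, and S3 path are placeholders, and the framework versions and distribution options available to you depend on your account and Region.

    from sagemaker.pytorch import PyTorch

    # One call provisions the cluster, runs training, saves to S3,
    # and tears the cluster down when the job finishes.
    estimator = PyTorch(
        entry_point="train.py",                                  # your script
        role="arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder
        framework_version="2.1",
        py_version="py310",
        instance_count=4,
        instance_type="ml.p4d.24xlarge",
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit({"train": "s3://<bucket>/train/"})             # placeholder path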
Data Preparation for AI Projects: Data preparation is critical in any AI project, laying the foundation for accurate and reliable model outcomes. This section explores the essential steps in preparing data for AI applications, emphasising data quality's active role in achieving successful AI models.
Prerequisites: To try out this solution using SageMaker JumpStart, you need an AWS account that will contain all of your AWS resources, and an AWS Identity and Access Management (IAM) role to access SageMaker. In this section, we cover how to discover these models in SageMaker Studio.
Thirty seconds is a good default for human users; if you find that queries are regularly queueing, consider making your warehouse a multi-cluster warehouse that scales on demand. Cluster Count: If your warehouse has to serve many concurrent requests, you may need to increase the cluster count to meet demand.
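In Snowflake, the cluster count is a warehouse property you can change with a single statement; a sketch via the Python connector follows, with connection details and the warehouse name as placeholders.

    # pip install snowflake-connector-python
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",  # placeholders
    )

    # Allow the warehouse to scale out to three clusters under load
    # and shrink back to one when the query queue drains.
    conn.cursor().execute("""
        ALTER WAREHOUSE my_wh SET
            MIN_CLUSTER_COUNT = 1
            MAX_CLUSTER_COUNT = 3
            SCALING_POLICY = 'STANDARD'
    """)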
Table of Contents: Introduction to PyCaret, Benefits of PyCaret, Installation and Setup, Data Preparation, Model Training and Selection, Hyperparameter Tuning, Model Evaluation and Analysis, Model Deployment and MLOps, Working with Time Series Data, Conclusion. PyCaret requires a recent version of Python and a stable internet connection for the installation process.
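The workflow in that table of contents compresses to a few calls in PyCaret's functional API; the sketch below uses one of the library's bundled demo datasets as a stand-in for your own data.

    # pip install pycaret
    from pycaret.classification import compare_models, predict_model, setup
    from pycaret.datasets import get_data

    data = get_data("juice")  # bundled demo dataset

    # setup() covers the data preparation stage of the workflow.
    setup(data, target="Purchase", session_id=42)

    # Train and rank candidate models, then score the best one.
    best = compare_models()
    predict_model(best)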
The two most common types of unsupervised learning are clustering, where the algorithm groups similar data points together, and dimensionality reduction, where the algorithm reduces the number of features in the data. It is highly configurable and can integrate with other tools like Git, Docker, and AWS.
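The two techniques side by side, in a minimal scikit-learn sketch with the Iris dataset as a stand-in:

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X = load_iris().data

    # Clustering: group similar points without using any labels.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # Dimensionality reduction: compress 4 features down to 2.
    X_2d = PCA(n_components=2).fit_transform(X)
    print(labels[:10], X_2d.shape)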
The excitement is building for the fourteenth edition of AWS re:Invent, and as always, Las Vegas is set to host this spectacular event. The sessions showcase how Amazon Q can help you streamline coding, testing, and troubleshooting, as well as enable you to make the most of your data to optimize business operations.
Unsupervised Learning: Unsupervised learning involves training models on data without labels, where the system tries to find hidden patterns or structures. This type of learning is used when labelled data is scarce or unavailable. Data Transformation: Transforming data prepares it for Machine Learning models.
Key steps involve problem definition, data preparation, and algorithm selection. Data quality significantly impacts model performance. Unsupervised Learning: Unlike Supervised Learning, unsupervised learning works with unlabeled data. The algorithm tries to find hidden patterns or groupings in the data.
By implementing efficient data pipelines, organisations can enhance their data processing capabilities, reduce time spent on data preparation, and improve overall data accessibility. Data Storage Solutions: Data storage solutions are critical in determining how data is organised, accessed, and managed.
A traditional machine learning (ML) pipeline is a collection of various stages that include data collection, data preparation, model training and evaluation, hyperparameter tuning (if needed), model deployment and scaling, monitoring, security and compliance, and CI/CD.
These outputs, stored in vector databases like Weaviate, allow prompt engineers to directly access these embeddings for tasks like semantic search, similarity analysis, or clustering. You may be expected to use other cloud platforms like AWS, GCP, and others, so don't neglect them and at least be vaguely familiar with how they work.
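Under the hood, most of those tasks reduce to nearest-neighbor search over embedding vectors; here is a library-agnostic NumPy sketch, with random vectors standing in for embeddings fetched from a vector database.

    import numpy as np

    def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
        """Cosine similarity between one query vector and a matrix of embeddings."""
        return (matrix @ query) / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-9
        )

    # Toy stand-ins for stored document embeddings and a query embedding.
    corpus = np.random.rand(1000, 384)
    query = np.random.rand(384)

    scores = cosine_sim(query, corpus)
    top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 nearest documents
    print(top_k, scores[top_k])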
For example, if you use AWS, you may prefer Amazon SageMaker as an MLOps platform that integrates with other AWS services. SageMaker Studio offers built-in algorithms, automated model tuning, and seamless integration with AWS services, making it a powerful platform for developing and deploying machine learning solutions at scale.
Training an LLM is a compute-intensive and complex process, which is why Fastweb, as a first step in their AI journey, used AWS generative AI and machine learning (ML) services such as Amazon SageMaker HyperPod. The team opted for fine-tuning on AWS.
In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from datapreparation to training and deployment.
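Whatever the launcher (NeMo scripts, torchrun, or a Kubernetes operator), each worker pod typically runs a bootstrap like the following; it assumes the standard RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables are injected by the job launcher.

    import os

    import torch
    import torch.distributed as dist

    # Reads RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")

    # Pin this worker to its assigned GPU on the node.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")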
One of several challenges faced was adapting the existing on-premises pipeline solution for use on AWS. The solution involved two key components: Modifying and extending existing code – the first part of our solution involved modifying and extending our existing code to make it compatible with AWS infrastructure.
We recently announced the general availability of cross-account sharing of Amazon SageMaker Model Registry using AWS Resource Access Manager (AWS RAM), making it easier to securely share and discover machine learning (ML) models across your AWS accounts.
Learning means identifying and capturing historical patterns from the data, and inference means mapping a current value to the historical pattern. The following figure illustrates the idea of a large cluster of GPUs being used for learning, followed by a smaller number for inference.
Quickly build and deploy an end-to-end ML pipeline with Kubeflow Pipelines on AWS. Again, what goes on in this component depends on the data scientist's initial (manual) data preparation process, the problem, and the data used. Prerequisites: In this demo, you will use MiniKF to set up Kubeflow on AWS.
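As a flavor of how such a component fits into a pipeline, here is a hypothetical two-step sketch using the Kubeflow Pipelines v2 SDK; the step bodies are placeholders for the data scientist's actual preparation and training code.

    # pip install kfp  (v2 SDK)
    from kfp import dsl

    @dsl.component
    def prepare_data() -> str:
        # Placeholder for the manual data preparation, ported to a step.
        return "s3://<bucket>/prepared/"

    @dsl.component
    def train(data_path: str):
        print(f"training on {data_path}")

    @dsl.pipeline(name="demo-pipeline")
    def pipeline():
        data_step = prepare_data()
        train(data_path=data_step.output)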
The number of companies launching generative AI applications on AWS is substantial and building quickly, including adidas, Booking.com, Bridgewater Associates, Clariant, Cox Automotive, GoDaddy, and LexisNexis Legal & Professional, to name just a few. Innovative startups like Perplexity AI are going all in on AWS for generative AI.
In addition to its groundbreaking AI innovations, Zeta Global has harnessed Amazon Elastic Container Service (Amazon ECS) with AWS Fargate to deploy a multitude of smaller models efficiently. It simplifies feature access for model training and inference, significantly reducing the time and complexity involved in managing data pipelines.
You need data engineering expertise and time to develop the proper scripts and pipelines to wrangle, clean, and transform data. Afterward, you need to manage complex clusters to process and train your ML models over these large-scale datasets. These features can find temporal patterns in the data that can influence the baseFare.
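As a small illustration of such temporal features, the pandas sketch below derives day-of-week and weekend flags from a search date; the column names (searchDate, baseFare) are taken from the flight-fare example, and the values are made up.

    import pandas as pd

    # Hypothetical fare data: a search date and the baseFare target.
    df = pd.DataFrame({
        "searchDate": pd.to_datetime(["2024-06-01", "2024-06-02", "2024-06-08"]),
        "baseFare": [120.0, 135.5, 98.0],
    })

    # Temporal features that can expose weekly or seasonal fare patterns.
    df["day_of_week"] = df["searchDate"].dt.dayofweek
    df["month"] = df["searchDate"].dt.month
    df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
    print(df)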
Data Management Costs: Data Collection involves sourcing diverse datasets, including multilingual and domain-specific corpora, from various digital sources, essential for developing a robust LLM. While the use of pre-trained models is free, fine-tuning them for specific tasks can lead to costs related to computing and data handling.