Clustering, ML and System Architecture

Real value, real time: Production AI with Amazon SageMaker and Tecton

AWS Machine Learning Blog

DECEMBER 4, 2024

Businesses are under pressure to show return on investment (ROI) from AI use cases, whether predictive machine learning (ML) or generative AI. Only 54% of ML prototypes make it to production, and only 5% of generative AI use cases make it to production. Using SageMaker, you can build, train and deploy ML models.

Customize DeepSeek-R1 671b model using Amazon SageMaker HyperPod recipes – Part 2

AWS Machine Learning Blog

MAY 14, 2025

The following diagram illustrates the solution architecture for training using SageMaker HyperPod. With HyperPod, users can begin the process by connecting to the login/head node of the Slurm cluster. Alternatively, you can use AWS Systems Manager and run a command such as the following to start the session.

Transforming financial analysis with CreditAI on Amazon Bedrock: Octus’s journey with AWS

AWS Machine Learning Blog

MARCH 10, 2025

Solution overview: The following figure illustrates our system architecture for CreditAI on AWS, with two key paths: the document ingestion and content extraction workflow, and the Q&A workflow for live user query response. He specializes in generative AI, machine learning, and system design.

Reduce ML training costs with Amazon SageMaker HyperPod

AWS Machine Learning Blog

APRIL 10, 2025

As cluster sizes grow, the likelihood of failure increases due to the number of hardware components involved. Each hardware failure can result in wasted GPU hours and requires valuable engineering time to identify and resolve the issue, making the system prone to downtime that can disrupt progress and delay completion.
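
The scaling effect described above can be illustrated with a short calculation. This is a minimal sketch, assuming every hardware component fails independently with the same probability during a training run; the function name `prob_any_failure` and the per-component probability of 0.001 are illustrative assumptions, not figures from the post.

```python
def prob_any_failure(num_components: int, p_fail: float) -> float:
    """Probability that at least one component fails.

    Assumes independent, identically distributed failures (a simplification;
    real hardware failures are often correlated).
    """
    if num_components < 0:
        raise ValueError("num_components must be non-negative")
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    # All components survive with probability (1 - p)^n, so at least one
    # fails with the complement of that.
    return 1.0 - (1.0 - p_fail) ** num_components


if __name__ == "__main__":
    # Hypothetical per-component failure probability over one training run.
    p = 0.001
    for n in (8, 64, 512, 4096):
        print(f"{n:>5} components -> P(at least one failure) = {prob_any_failure(n, p):.3f}")
```

Under these assumptions, the chance of at least one failure rises quickly with the number of components even when each component is individually reliable, which is why large clusters need automated fault detection and recovery.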

Ask HN: Who wants to be hired? (July 2025)

Hacker News

JULY 1, 2025

I originally wanted to program numerical libraries for such systems, but I ended up doing AI/ML instead. I have about 3 YoE training PyTorch models on HPC clusters and 1 YoE optimizing PyTorch models, including with custom CUDA kernels. Some: React, IoT, bit o elm, ML, LLM ops and automation.

Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distributed AI

AWS Machine Learning Blog

APRIL 2, 2025

At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application.

Ask HN: Who is hiring? (July 2025)

Hacker News

JULY 1, 2025

Good at Go, Kubernetes (understanding how to manage stateful services in a multi-cloud environment). We have a Python service in our Recommendation pipeline, so some ML/Data Science knowledge would be good. You must be independent and self-organized. I wonder if we can move away from representation purely on where you live.

Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

AWS Machine Learning Blog

SEPTEMBER 18, 2024

The compute clusters used in these scenarios are composed of thousands of AI accelerators, such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

Multi-account support for Amazon SageMaker HyperPod task governance

AWS Machine Learning Blog

JUNE 6, 2025

In this post, we discuss how an enterprise with multiple accounts can access a shared Amazon SageMaker HyperPod cluster for running their heterogeneous workloads. Account A hosts the SageMaker HyperPod cluster. To access Account A’s EKS cluster as a user in Account B, you will need to assume a cluster access role in Account A.

Meeting customer needs with our ML platform redesign

Snorkel AI

MAY 3, 2023

In this article, we share our journey and hope that it helps you design better machine learning systems. Table of contents. Why we needed to redesign our interactive ML system: In this section, we’ll go over the market forces and technological shifts that compelled us to re-architect our ML system.

Accelerate disaster response with computer vision for satellite imagery using Amazon SageMaker and Amazon Augmented AI

AWS Machine Learning Blog

FEBRUARY 24, 2023

AWS recently released Amazon SageMaker geospatial capabilities to provide you with satellite imagery and geospatial state-of-the-art machine learning (ML) models, reducing barriers for these types of use cases. For more information, refer to Preview: Use Amazon SageMaker to Build, Train, and Deploy ML Models Using Geospatial Data.

Redesigning Snorkel’s interactive machine learning systems

Snorkel AI

MAY 3, 2023

In this article, we share our journey and hope that it helps you design better machine learning systems. Table of contents. Why we needed to redesign our interactive ML system: In this section, we’ll go over the market forces and technological shifts that compelled us to re-architect our ML system.

10 industries that use distributed computing

IBM Journey to AI blog

JULY 18, 2024

Computing: Computing is being dominated by major revolutions in artificial intelligence (AI) and machine learning (ML). The algorithms that empower AI and ML require large volumes of training data, in addition to strong and steady amounts of processing power. Distributed computing supplies both.
