Two of the more popular methods, extract, transform, load (ETL) and extract, load, transform (ELT), are both highly performant and scalable. Data engineers build data pipelines, also called data integration tasks or jobs, as incremental steps that perform data operations, and orchestrate these pipelines into an overall workflow.
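To make the ETL/ELT distinction concrete, here is a minimal sketch in plain Python; the extract/transform/load functions and the in-memory "warehouse" are illustrative stand-ins, not any particular tool's API.

```python
# Minimal ETL vs. ELT sketch; all names and data are illustrative stand-ins.
def extract():
    # Pull raw records from a source system (hard-coded here).
    return [{"user_id": 1, "raw_amount": "42.50"}]

def transform(rows):
    # Clean and type-cast the raw records.
    return [{"user_id": r["user_id"], "amount": float(r["raw_amount"])} for r in rows]

def load(rows, target):
    # Write records to the target store (a plain list here).
    target.extend(rows)

# ETL: transform the data before it lands in the warehouse.
warehouse = []
load(transform(extract()), warehouse)

# ELT: land the raw data first, then transform it inside the warehouse.
raw_zone = []
load(extract(), raw_zone)
raw_zone[:] = transform(raw_zone)
```

Each function is one incremental step; an orchestrator strings steps like these into the overall workflow described above.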
From data processing to quick insights, robust pipelines are a must for any ML system. Often the Data Team, comprising Data and ML Engineers, needs to build this infrastructure, and the experience can be painful. However, efficient use of ETL pipelines in ML can make their lives much easier.
Machine learning (ML) is the technology that automates tasks and provides insights. It allows data scientists to build models that can automate specific tasks. ML comes in many forms, with a range of tools and platforms designed to make working with it more efficient, and many of these platforms have ML algorithms built in.
AI credits from Confluent can be used to implement real-time data pipelines, monitor data flows, and run stream-based ML applications. Amazon Web Services (AWS): AWS offers one of the most extensive AI and ML infrastructures in the world. Modal: Modal offers serverless compute tailored for data-intensive workloads.
Statistical methods and machine learning (ML) methods are actively developed and adopted to maximize customer lifetime value (LTV). In this post, we share how Kakao Games and the Amazon Machine Learning Solutions Lab teamed up to build a scalable and reliable LTV prediction solution by using AWS data and ML services such as AWS Glue and Amazon SageMaker.
Summary: This article explores the significance of ETL data in Data Management. It highlights key components of the ETL process, best practices for efficiency, and future trends like AI integration and real-time processing, ensuring organisations can leverage their data effectively for strategic decision-making.
Summary: This blog explains how to build efficient data pipelines, detailing each step from data collection to final delivery. Introduction: Data pipelines play a pivotal role in modern data architecture by seamlessly transporting and transforming raw data into valuable insights.
Machine learning (ML) has become a critical component of many organizations’ digital transformation strategy. From predicting customer behavior to optimizing business processes, ML algorithms are increasingly being used to make decisions that impact business outcomes.
Automation: Automating data pipelines and models ➡️ 6. The Data Engineer: Not everyone working on a data science project is a data scientist. Data engineers are the glue that binds the products of data scientists into a coherent and robust data pipeline.
In the previous article, you were introduced to the intricacies of data pipelines, including the two major types of existing data pipelines. You might be curious how a simple tool like Apache Airflow can be powerful for managing complex data pipelines.
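As a taste of what that looks like, here is a minimal sketch of an Airflow DAG using the TaskFlow API (assumes Airflow 2.4+); the task bodies are placeholders rather than real source and warehouse logic.

```python
# Minimal Airflow DAG sketch (TaskFlow API, Airflow 2.4+); task bodies are placeholders.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def simple_etl():
    @task
    def extract():
        return [{"id": 1, "amount": 42.0}]  # stand-in for reading a real source

    @task
    def transform(rows):
        return [r for r in rows if r["amount"] > 0]  # stand-in for real cleaning logic

    @task
    def load(rows):
        print(f"loading {len(rows)} rows")  # stand-in for a warehouse write

    # Chaining the calls is enough: Airflow infers the dependency graph.
    load(transform(extract()))

simple_etl()
```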
Iris was designed to use machine learning (ML) algorithms to predict the next steps in building a data pipeline. Let's combine these suggestions to improve upon our original prompt: Human: Your job is to act as an expert on ETL pipelines.
The ZMP analyzes billions of structured and unstructured data points to predict consumer intent by using sophisticated artificial intelligence (AI) to personalize experiences at scale. Hosted on Amazon ECS with tasks run on Fargate, this platform streamlines the end-to-end ML workflow, from data ingestion to model deployment.
This situation is no different in the ML world. Data Scientists and ML Engineers typically write lots and lots of code. Building a mental model for ETL components: Learn the art of constructing a mental representation of the components within an ETL process.
Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines. He specializes in designing, building, and optimizing large-scale data solutions.
Despite the challenges, Afri-SET, with limited resources, envisions a comprehensive data management solution for stakeholders seeking sensor hosting on their platform, aiming to deliver accurate data from low-cost sensors. With AWS Glue custom connectors, it’s effortless to transfer data between Amazon S3 and other applications.
In this post, you will learn about the 10 best data pipeline tools, their pros, cons, and pricing. A typical data pipeline involves the following steps or processes through which the data passes before being consumed by a downstream process, such as an ML model training process.
And we at deployr worked alongside them to find the best possible answers for everyone involved and build their data and ML pipelines. Building data and ML pipelines: from the ground to the cloud. It was the beginning of 2022, and things were looking bright after the lockdown's end.
However, one consistent challenge customers face is efficiently integrating and moving data between on-premises systems, cloud environments, and other data sources. Datavolo is more than just an ETL tool; it provides functionality for Reverse ETL as well, enabling organizations to push data from Snowflake into other systems.
Dolt, LakeFS, Delta Lake, and Pachyderm are examples of data version control tools in ML, compared on criteria such as Git-like versioning, database tooling, data lake support, data pipelines, experiment tracking, integration with cloud platforms, and integrations with ML tools. DVC (Data Version Control) is a version control system for data and machine learning teams.
Luckily, we have tried and trusted tools and architectural patterns that provide a blueprint for reliable ML systems. In this article, I'll introduce you to a unified architecture for ML systems built around the idea of FTI pipelines and a feature store as the central component. But what is an ML pipeline?
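A minimal sketch of that feature/training/inference (FTI) separation, with hypothetical column names and scikit-learn standing in for the model layer; in a real system each pipeline would read from and write to a feature store and model registry rather than pass objects in memory.

```python
# Sketch of feature/training/inference (FTI) pipeline separation.
# Column names ("visits", "total_spend", "churned") are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def feature_pipeline(raw: pd.DataFrame) -> pd.DataFrame:
    # Turn raw records into model-ready features; in production these
    # would land in a feature store rather than be returned in memory.
    feats = raw.copy()
    feats["spend_per_visit"] = feats["total_spend"] / feats["visits"].clip(lower=1)
    return feats[["visits", "spend_per_visit", "churned"]]

def training_pipeline(features: pd.DataFrame) -> LogisticRegression:
    # Read features, fit a model; in production, push it to a registry.
    X, y = features.drop(columns=["churned"]), features["churned"]
    return LogisticRegression().fit(X, y)

def inference_pipeline(model: LogisticRegression, features: pd.DataFrame):
    # Serve predictions from the same feature definitions used in training.
    return model.predict(features.drop(columns=["churned"]))

raw = pd.DataFrame({"visits": [3, 10, 1],
                    "total_spend": [30.0, 250.0, 5.0],
                    "churned": [1, 0, 1]})
model = training_pipeline(feature_pipeline(raw))
print(inference_pipeline(model, feature_pipeline(raw)))
```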
This includes the tools and techniques we used to streamline the ML model development and deployment processes, as well as the measures taken to monitor and maintain models in a production environment. Costs: Oftentimes, cost is the most important aspect of any ML model deployment. This includes data quality, privacy, and compliance.
Data scientists and machine learning engineers need to collaborate to make sure that, together with the model, they develop robust data pipelines. These pipelines cover the entire lifecycle of an ML project, from data ingestion and preprocessing to model training, evaluation, and deployment.
The story is all too common – a business user requests some data, the data team creates/prioritizes a ticket, and said ticket is completed after some number of months (or weeks if you’re lucky) – just to have the data be wrong, and the whole process starts again. Those are scary for data teams to change.
As companies strive to leverage AI/ML, location intelligence, and cloud analytics in their portfolio of tools, siloed mainframe data often stands in the way of forward momentum. The right data integration technology can vastly simplify things. Streaming data pipelines help to make data available and accessible in real time.
On the client side, Snowpark consists of libraries, including the DataFrame API and native Snowpark machine learning (ML) APIs for model development (public preview) and deployment (private preview). Machine Learning: Training machine learning (ML) models can sometimes be resource-intensive.
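For flavor, here is a short sketch of the client-side Snowpark DataFrame API in Python; the connection parameters and the ORDERS table are placeholders, so treat this as an illustration rather than a complete example from Snowflake's documentation.

```python
# Snowpark DataFrame API sketch; connection details and the table are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

connection_params = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_params).create()

# Lazily build a query that executes inside Snowflake, not on the client.
summary = (
    session.table("ORDERS")                      # hypothetical table
    .filter(col("STATUS") == "SHIPPED")
    .group_by("REGION")
    .agg(avg(col("AMOUNT")).alias("AVG_AMOUNT"))
)
summary.show()
```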
Managing unstructured data is essential for the success of machine learning (ML) projects. Without structure, data is difficult to analyze and extracting meaningful insights and patterns is challenging. This article will discuss managing unstructured data for AI and ML projects. What is Unstructured Data?
Data movements lead to high costs of ETL and rising data management TCO. The inability to access and onboard new datasets prolongs the data pipeline's creation and time to market. Data co-location enables teams to access, join, query, and analyze internal and external vendor data with minimal to no ETL.
Data mesh: Another approach to data democratization uses a data mesh, a decentralized architecture that organizes data by a specific business domain. It uses knowledge graphs, semantics and AI/ML technology to discover patterns in various types of metadata.
IBM watsonx.ai is our enterprise-ready next-generation studio for AI builders, bringing together traditional machine learning (ML) and new generative AI capabilities powered by foundation models. Watsonx.data allows customers to augment data warehouses such as Db2 Warehouse and Netezza and optimize workloads for performance and cost.
However, the race to the cloud has also created challenges for data users everywhere: cloud migration is expensive, migrating sensitive data is risky, and navigating between on-prem sources is often confusing. To build effective data pipelines, users need context (or metadata) on every source.
Why Migrate to a Modern Data Stack? Slow Response to New Information: Legacy data systems often lack the computational power necessary to run efficiently and can be cost-inefficient to scale. This typically results in long-running ETL pipelines that cause decisions to be made on stale or old data.
Data scientists use data-driven approaches to enable AI systems to make better predictions, optimize decision-making, and uncover hidden patterns that ultimately drive innovation and improve performance across various domains. This often involves skills in databases, distributed systems, and ETL (Extract, Transform, Load) processes.
Organizations run millions of Apache Spark applications each month to prepare, move, and process their data for analytics and machine learning (ML). During development, data engineers often spend hours sifting through log files, analyzing execution plans, and making configuration changes to resolve issues.
Last week, the Alation team had the privilege of joining IT professionals, business leaders, and data analysts and scientists for the Modern Data Stack Conference in San Francisco. So, how can a data catalog support the critical project of building data pipelines? What did attendees take away from the event?
If the event log is your customer’s diary, think of persistent staging as their scrapbook – a place where raw customer data is collected, organized, and kept for future reference. In traditional ETL (Extract, Transform, Load) processes in CDPs, staging areas were often temporary holding pens for data.
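One way to read that distinction in code: a minimal sketch using Python's built-in sqlite3, where the staging table is append-only and keyed by event ID instead of being truncated after each load. The table and column names are illustrative, not from the article.

```python
# Persistent staging sketch: raw events are kept, not truncated after each load.
# Uses sqlite3 as a stand-in warehouse; names are illustrative.
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS stg_events (
           event_id  TEXT PRIMARY KEY,
           payload   TEXT,
           loaded_at TEXT
       )"""
)

def stage(event: dict) -> None:
    # INSERT OR IGNORE keeps the load idempotent: replays don't duplicate rows.
    conn.execute(
        "INSERT OR IGNORE INTO stg_events VALUES (?, ?, ?)",
        (event["id"], json.dumps(event), datetime.now(timezone.utc).isoformat()),
    )

stage({"id": "evt-1", "type": "page_view"})
stage({"id": "evt-1", "type": "page_view"})  # replayed event, stored once
print(conn.execute("SELECT COUNT(*) FROM stg_events").fetchone()[0])  # 1
```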
In this post, we explore how you can use Amazon Bedrock to generate high-quality categorical ground truth data, which is crucial for training machine learning (ML) models in a cost-sensitive environment. This use case, solvable through ML, can enable support teams to better understand customer needs and optimize response strategies.
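A minimal sketch of what such a labeling call might look like with the Bedrock Converse API via boto3; the model ID, category list, and prompt wording are assumptions, not the post's actual setup.

```python
# Hedged sketch: generating a categorical label with Amazon Bedrock's Converse API.
# Model ID, categories, and prompt wording are assumptions, not the post's setup.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
categories = ["billing", "technical_issue", "account_access", "other"]  # hypothetical

ticket = "I was charged twice for my subscription this month."
prompt = (
    f"Classify this support ticket into exactly one of {categories}. "
    f"Reply with only the category name.\n\nTicket: {ticket}"
)

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"temperature": 0.0},  # deterministic labels for ground truth
)
label = response["output"]["message"]["content"][0]["text"].strip()
print(label)
```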
There are various technologies that help operationalize and optimize the process of field trials, including data management and analytics, IoT, remote sensing, robotics, machine learning (ML), and now generative AI. The first step in developing and deploying generative AI use cases is having a well-defined data strategy.
2021–2024: Interest declined as deep learning and pre-trained models took over, automating many tasks previously handled by classical ML techniques. This shift suggests that while traditional ML is still relevant, its role is now more supportive than cutting-edge.