Aspiring and experienced Data Engineers alike can benefit from a curated list of books covering essential concepts and practical techniques. These 10 Best Data Engineering Books for beginners encompass a range of topics, from foundational principles to advanced data processing methods. What is Data Engineering?
Additionally, imagine being a practitioner, such as a data scientist, data engineer, or machine learning engineer, who faces the daunting task of learning how to use a multitude of different tools. (Source: IBM Cloud Pak for Data.) MLOps teams often struggle when it comes to integrating into CI/CD pipelines.
In essence, DataOps is a practice that helps organizations manage and govern data more effectively. However, there is a lot more to know about DataOps: it has its own definition, principles, benefits, and applications in real-life companies today, all of which we will cover in this article. One core principle is automated testing to ensure data quality, sketched below.
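As an illustration of what that principle can look like in practice, here is a minimal automated quality check written with pandas; the orders table and its columns (order_id, customer_id, amount) are hypothetical.

```python
import pandas as pd

def quality_violations(df: pd.DataFrame) -> list[str]:
    """Collect data quality violations for a hypothetical orders table."""
    violations = []
    if df["order_id"].duplicated().any():
        violations.append("duplicate order_id values")
    if df["customer_id"].isna().any():
        violations.append("orders missing customer_id")
    if df["amount"].lt(0).any():
        violations.append("negative order amounts")
    return violations

# A DataOps pipeline would run checks like these on every load and fail fast.
sample = pd.DataFrame({
    "order_id": [1, 2, 2],
    "customer_id": ["a", None, "b"],
    "amount": [10.0, -5.0, 3.5],
})
assert quality_violations(sample) != []  # all three checks fire on this sample
```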
Engineering teams, in particular, can quickly get overwhelmed by the abundance of information pertaining to competition data, new product and service releases, market developments, and industry trends, resulting in information anxiety. Explosive data growth can be too much to handle. Data pipeline maintenance.
That’s why many organizations invest in technology to improve data processes, such as a machine learning data pipeline. However, data needs to be easily accessible, usable, and secure to be useful — yet the opposite is too often the case. How can data engineers address these challenges directly?
To get a better grip on those changes we reviewed over 25,000 data scientist job descriptions from the past year to find out what employers are looking for in 2023. Much of what we found was to be expected, though there were definitely a few surprises. You’ll see specific tools in the next section.
This blog will cover creating customized nodes in Coalesce, what new advanced features can already be used as nodes, and how to create them as part of your data pipeline. To create a UDN, we’ll need a node definition that defines how the node should function and templates for how the object will be created and run.
Snowflake AI Data Cloud is one of the most powerful platforms available, including storage services that support complex data. Integrating Snowflake with dbt adds another layer of automation and control to the data pipeline. Snowflake stored procedures and dbt hooks are essential to modern data engineering and analytics workflows.
This is incredibly useful for both Data Engineers and Data Scientists. During the development phase, data engineers can quickly use INFER_SCHEMA to scan staged files and generate DDLs. Once the table is created, the data load is as simple as using the COPY command.
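For illustration, a minimal sketch of that flow using the Snowflake Python connector; the stage (@raw_stage), file format (parquet_ff, assumed to be Parquet here), table name, and connection parameters are all hypothetical.

```python
import snowflake.connector  # assumes snowflake-connector-python is installed

# Hypothetical connection parameters, stage, and file format.
conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

# Inspect the schema Snowflake infers from the staged files.
cur.execute("""
    SELECT COLUMN_NAME, TYPE
    FROM TABLE(INFER_SCHEMA(LOCATION => '@raw_stage/events/',
                            FILE_FORMAT => 'parquet_ff'))
""")
print(cur.fetchall())

# Create the table straight from the inferred schema, then load it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(INFER_SCHEMA(LOCATION => '@raw_stage/events/',
                                FILE_FORMAT => 'parquet_ff'))
    )
""")
cur.execute("""
    COPY INTO events
    FROM '@raw_stage/events/'
    FILE_FORMAT = (FORMAT_NAME = 'parquet_ff')
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
```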
For any data user in an enterprise today, data profiling is a key tool for resolving data quality issues and building new data solutions. In this blog, we’ll define data profiling, walk through top use cases, and share important techniques and best practices for data profiling today.
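One simple technique, sketched under the assumption that the data fits in a pandas DataFrame: a per-column profile of types, null rates, and cardinality.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, null rate, distinct count, and an example value."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_pct": df.isna().mean().round(3),
        "distinct": df.nunique(),
        "example": df.apply(lambda s: s.dropna().iloc[0] if s.notna().any() else None),
    })

# Tiny demo frame; real profiling would read from the warehouse.
df = pd.DataFrame({"city": ["Oslo", None, "Oslo"], "temp_c": [3.1, 7.4, None]})
print(profile(df))
```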
Most data warehouses hold terabytes of data, so data quality monitoring is often challenging and cost-intensive due to dependencies on multiple tools, and it eventually gets ignored. Over time this erodes credibility and data consistency, leading businesses to mistrust their data pipelines and processes.
ETL is a process for moving and managing data from various sources to a central data warehouse, ensuring that data is accurate, consistent, and usable for analysis and reporting. At its core, it is a data integration method that combines data from multiple sources.
Well, according to Brij Kishore Pandey, it stands for Extract, Transform, Load: a fundamental process in data engineering that ensures data moves efficiently from raw sources to structured storage for analysis. The steps include: Extraction: data is collected from multiple sources (databases, APIs, flat files).
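A toy end-to-end sketch of those steps in Python: an inline frame stands in for the extract, SQLite for the warehouse, and all table and column names are illustrative.

```python
import sqlite3
import pandas as pd

# Extract: in practice this would pull from databases, APIs, or flat files;
# an inline frame stands in for the raw source here.
raw = pd.DataFrame({
    "order_id": [1, 2, None],
    "order_date": ["2024-01-05", "2024-01-06", "bad-date"],
    "quantity": [2, 1, 3],
    "unit_price": [9.99, 24.50, 5.00],
})

# Transform: coerce types, drop rows that fail basic validity, derive a column.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_id", "order_date"]).copy()
clean["revenue"] = clean["quantity"] * clean["unit_price"]

# Load: write the curated table into structured storage (SQLite standing in
# for a warehouse).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```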
With proper unstructured data management, you can write validation checks to detect multiple entries of the same data. Continuous learning: In a properly managed unstructured data pipeline, you can use new entries to train a production ML model, keeping the model up-to-date.
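One way such a validation check might look, sketched with Python's standard library: a content fingerprint per record, so repeated entries can be flagged. The record shape is hypothetical.

```python
import hashlib
import json

def record_fingerprint(record: dict) -> str:
    """Stable hash of a record's content, used to detect repeated entries."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()

def find_duplicates(records: list[dict]) -> list[int]:
    """Return indexes of records whose content was already seen."""
    seen, dupes = set(), []
    for i, rec in enumerate(records):
        fp = record_fingerprint(rec)
        if fp in seen:
            dupes.append(i)
        seen.add(fp)
    return dupes

docs = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}, {"id": 1, "text": "hello"}]
print(find_duplicates(docs))  # [2]
```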
American Family Insurance: Governance by Design – Not as an Afterthought Who: Anil Kumar Kunden, Information Standards, Governance and Quality Specialist at AmFam Group When: Wednesday, June 7, at 2:45 PM Why attend: Learn how to automate and accelerate data pipeline creation and maintenance with data governance, AKA metadata normalization.
This blog provides an overview of applying software engineering best practices to build a test validation and monitoring suite for a non-deterministic generative AI application. Validating the Data Engineering Strategy: there is no one-size-fits-all approach to chunking unstructured data.
So, in those projects, more than 70% of the engineering development resources are tied to data engineering activities. That is a mix of data engineering work, feature engineering work, and data transformation work writ large. It is at the level of data quality and joining tasks.
To provide an example, traditional structured data such as a user’s demographic information can be provided to an AI application to create a more personalized experience. Our data engineering blog in this series explores the concept of data engineering and data stores for Gen AI applications in more detail.
While the loss of certain DAX functions is definitely a shortcoming that we hope Microsoft will address in the near future, the impact is not necessarily as big as you would expect. To work around losing the Time Intelligence functions, we suggest referencing a robust calendar table for time-based metrics.
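Outside of DAX, one way to produce such a calendar table is to generate it and import it into the model; here is a hedged sketch with pandas, with typical but purely illustrative columns.

```python
import pandas as pd

# One row per day, with the attributes time-based measures typically join on.
dates = pd.date_range("2020-01-01", "2030-12-31", freq="D")
calendar = pd.DataFrame({
    "date": dates,
    "year": dates.year,
    "quarter": dates.quarter,
    "month": dates.month,
    "month_name": dates.strftime("%B"),
    "day_of_week": dates.dayofweek,   # Monday = 0
    "is_weekend": dates.dayofweek >= 5,
})
calendar.to_csv("calendar.csv", index=False)  # import into the model as a table
```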
Without partitioning, daily data activities will cost your company a fortune, and a moment will come when the cost advantage of GCP BigQuery becomes questionable. I’m personally a fan of mandatory partitioning (require partition filter), which prevents you from running a query against a table without specifying a condition on the partitioning column.
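A sketch of enforcing that with the google-cloud-bigquery client, assuming configured credentials; the project, dataset, and column names are hypothetical.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.events",  # hypothetical fully qualified table id
    schema=[
        bigquery.SchemaField("event_id", "STRING"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="event_date")
table.require_partition_filter = True  # queries must filter on event_date
client.create_table(table)

# A bare `SELECT COUNT(*) FROM analytics.events` is now rejected until the
# query adds a predicate such as `WHERE event_date = "2024-06-01"`.
```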
All this raw data goes into your persistent stage. Then, if you later refine your definition of what constitutes an “engaged” customer, having the raw data in persistent staging allows for easy reprocessing of historical data with the new logic. Your customer data game will never be the same.
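A hedged sketch of that reprocessing pattern, assuming raw events with hypothetical customer_id, session_id, and event_ts columns in a staged Parquet file.

```python
import pandas as pd

# Raw, unmodified events kept in persistent staging (hypothetical path/columns).
raw = pd.read_parquet("staging/customer_events.parquet")

# New definition of "engaged": 3+ distinct sessions in the trailing 30 days.
cutoff = raw["event_ts"].max() - pd.Timedelta(days=30)
sessions = (raw[raw["event_ts"] >= cutoff]
            .groupby("customer_id")["session_id"]
            .nunique())
engaged_customers = sessions[sessions >= 3].index.tolist()

# Because the raw history is intact, changing the rule only means re-running
# this derivation, not re-ingesting the source data.
```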
The most critical and impactful step you can take towards enterprise AI today is ensuring you have a solid data foundation built on the modern data stack with mature operational pipelines, including all your most critical operational data. Data Engineer: Data Engineers are responsible for the data infrastructure.
A modern data stack can streamline IT bottlenecks, accelerating access for the various teams that require data: data analysts, data scientists, software engineers, cloud engineers, and data engineers. Basically, a modern data stack can be adopted by any company that wants to improve its data management.
However, in scenarios where dataset versioning solutions are leveraged, there can still be various challenges experienced by ML/AI/data teams. Data aggregation: data sources could increase as more data points are required to train ML models. Existing data pipelines will have to be modified to accommodate new data sources.
Our activities mostly revolved around:
1. Identifying data sources
2. Collecting & integrating data
3. Developing analytical/ML models
4. Integrating the above into a cloud environment
5. Leveraging the cloud to automate the above processes
6. Making the deployment robust & scalable
Who was involved in the project?
In the rapidly evolving landscape of data engineering, Snowflake Data Cloud has emerged as a leading cloud-based data warehousing solution, providing powerful capabilities for storing, processing, and analyzing vast amounts of data. What are Orchestration Tools?
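As one common answer, here is a minimal Apache Airflow 2.x DAG sketch that orchestrates a hypothetical load-then-transform sequence into Snowflake; the DAG id and task bodies are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_raw():
    """Placeholder: e.g., run COPY INTO via the Snowflake connector."""

def transform():
    """Placeholder: e.g., call a stored procedure or trigger a dbt job."""

with DAG(
    dag_id="snowflake_daily_load",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # Airflow 2.4+ argument name
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_raw", python_callable=load_raw)
    xform = PythonOperator(task_id="transform", python_callable=transform)
    load >> xform                    # transform runs only after the load succeeds
```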
This section delves into the common stages in most ML pipelines, regardless of industry or business function:
1. Data Ingestion (e.g., Apache Kafka, Amazon Kinesis)
2. Data Preprocessing (e.g., pandas, NumPy)
3. Feature Engineering and Selection (e.g., Scikit-learn, Feature Tools)
4. Model Training (e.g.,
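A compressed sketch of the preprocessing and training stages with scikit-learn, using synthetic data so it runs standalone; the stage names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for ingested data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),      # preprocessing stage
    ("model", LogisticRegression()),  # training stage
])
pipe.fit(X_train, y_train)
print(f"holdout accuracy: {pipe.score(X_test, y_test):.2f}")
```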
Other users: some other users you may encounter include data engineers, if the data platform is not particularly separate from the ML platform, and analytics engineers and data analysts, if you need to integrate third-party business intelligence tools and the data platform is not separate.
It simplifies feature access for model training and inference, significantly reducing the time and complexity involved in managing data pipelines. Additionally, Feast promotes feature reuse, so the time spent on data preparation is greatly reduced. [Figure: a schema definition and the model that references it.]
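In the spirit of that figure, a minimal sketch of such a definition using a recent Feast API; the entity, source path, and feature names are hypothetical.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

driver = Entity(name="driver", join_keys=["driver_id"])

stats_source = FileSource(
    path="data/driver_stats.parquet",  # hypothetical offline source
    timestamp_field="event_timestamp",
)

# Schema definition: the feature view that models reference at train/serve time.
driver_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="trips_today", dtype=Int64),
    ],
    source=stats_source,
)
```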
Transition to the Data Cloud With multiple ways to interact with your company’s data, Snowflake has built a common access point that unifies data lake access, data warehouse access, and data sharing access in one protocol. What Kinds of Workloads Does Snowflake Handle?
Reichental describes data governance as the overarching layer that empowers people to manage data well; as such, it is focused on roles & responsibilities, policies, definitions, metrics, and the lifecycle of the data. In this way, data governance is the business or process side. Communication is essential.
GPT-4 Data Pipelines: Transform JSON to SQL Schema Instantly. Blockstream’s public Bitcoin API exposes data that would be interesting to analyze. From Data Engineering to Prompt Engineering: in the BI/data-analysis world, people usually need to query data (small or large) to generate reports.
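The article leans on GPT-4 for the JSON-to-SQL transformation; as a deterministic stand-in, here is a naive sketch that maps one JSON record's fields to a CREATE TABLE statement. The type mapping, field names, and table name are illustrative, not Blockstream's actual schema.

```python
import json

SQL_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE PRECISION", str: "TEXT"}

def json_to_ddl(table: str, record: dict) -> str:
    """Naive one-record mapping from JSON fields to a CREATE TABLE statement."""
    cols = []
    for key, value in record.items():
        sql_type = SQL_TYPES.get(type(value), "JSONB")  # nested objects/arrays -> JSONB
        cols.append(f"    {key} {sql_type}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"

sample = json.loads('{"txid": "ab12", "fee": 1210, "confirmed": true, "vin": []}')
print(json_to_ddl("transactions", sample))
```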
Key Advantages of Governance. Simplified Change Management: the complexity of the underlying systems is abstracted away from the user, allowing them to simply and declaratively build and change data pipelines. Testing: data engineering should be treated as a form of software engineering.
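In that spirit, a hedged sketch of a unit test for a hypothetical deduplication step, written in the pytest style (run with `pytest`; the transformation and column names are illustrative).

```python
import pandas as pd

def dedupe_latest(df: pd.DataFrame) -> pd.DataFrame:
    """Pipeline step under test: keep the newest row per id."""
    return (df.sort_values("updated_at")
              .drop_duplicates("id", keep="last")
              .reset_index(drop=True))

def test_dedupe_latest_keeps_newest_row():
    df = pd.DataFrame({
        "id": [1, 1, 2],
        "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
        "value": ["old", "new", "only"],
    })
    out = dedupe_latest(df)
    assert len(out) == 2
    assert out.loc[out["id"] == 1, "value"].item() == "new"
```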
Data science is an interdisciplinary field that utilizes advanced analytics techniques to extract meaningful insights from vast amounts of data. This helps facilitate data-driven decision-making for businesses, enabling them to operate more efficiently and identify new opportunities.