Blog, Data Lakes and Data Pipeline - Data Science Current

CI/CD for Data Pipelines: A Game-Changer with AnalyticsCreator

Data Science Blog

MAY 20, 2024

Continuous Integration and Continuous Delivery (CI/CD) for Data Pipelines: It is a Game-Changer with AnalyticsCreator! The need for efficient and reliable data pipelines is paramount in data science and data engineering. They transform data into a consistent format for users to consume.

Data Pipeline

Data Pipeline Data Warehouse Azure Data Lakes

Differentiating Between Data Lakes and Data Warehouses

Smart Data Collective

SEPTEMBER 23, 2020

While there is a lot of discussion about the merits of data warehouses, not enough discussion centers around data lakes. We talked about enterprise data warehouses in the past, so let’s contrast them with data lakes. Both data warehouses and data lakes are used when storing big data.

Data Lakes

Data Lakes Data Warehouse Big Data Big Data

Build Data Pipelines: Comprehensive Step-by-Step Guide

Pickl AI

JULY 8, 2024

Summary: This blog explains how to build efficient data pipelines, detailing each step from data collection to final delivery. Introduction: Data pipelines play a pivotal role in modern data architecture by seamlessly transporting and transforming raw data into valuable insights.

Data Pipeline

Data Pipeline Data Quality Database Apache Kafka

How to Build ETL Data Pipeline in ML

The MLOps Blog

MAY 17, 2023

We also discuss different types of ETL pipelines for ML use cases and provide real-world examples of their use to help data engineers choose the right one. What is an ETL data pipeline in ML? It is common to use the terms ETL data pipeline and data pipeline interchangeably.

ETL

ETL Data Pipeline ML ML

How Twilio generated SQL using Looker Modeling Language data with Amazon Bedrock

AWS Machine Learning Blog

AUGUST 8, 2024

Managing and retrieving the right information can be complex, especially for data analysts working with large data lakes and complex SQL queries. This post highlights how Twilio enabled natural language-driven data exploration of business intelligence (BI) data with RAG and Amazon Bedrock.

SQL

SQL Data Lakes Data Analyst AWS

Best 8 Data Version Control Tools for Machine Learning 2024

DagsHub

DECEMBER 11, 2023

Best 8 data version control tools for 2023 (Source: DagsHub) Introduction: With business needs changing constantly and the growing size and structure of datasets, it becomes challenging to efficiently keep track of the changes made to the data, which leads to unfortunate scenarios such as inconsistencies and errors in data.

Machine Learning

Machine Learning Machine Learning Data Lakes Big Data

Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

AWS Machine Learning Blog

FEBRUARY 21, 2025

He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazon's operations. Rajesh Nedunuri is a Senior Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team.

AWS

AWS Natural Language Processing Machine Learning Machine Learning

Are Data Warehouses Still Relevant?

Dataversity

JANUARY 25, 2023

Over the past few years, enterprise data architectures have evolved significantly to accommodate the changing data requirements of modern businesses. Data warehouses were first introduced in the […]

Data Warehouse

Data Warehouse Data Lakes Cloud Computing Data Pipeline

Comparing Tools For Data Processing Pipelines

The MLOps Blog

MARCH 15, 2023

In this post, you will learn about the 10 best data pipeline tools, their pros, cons, and pricing. A typical data pipeline involves the following steps or processes through which the data passes before being consumed by a downstream process, such as an ML model training process.

Data Pipeline

Data Pipeline ETL SQL Data Quality

Improving air quality with generative AI

AWS Machine Learning Blog

JUNE 18, 2024

The solution addressed in this blog solves Afri-SET’s challenge and was ranked among the top 3 winning solutions. This post presents a solution that uses generative artificial intelligence (AI) to standardize air quality data from low-cost sensors in Africa, specifically addressing the air quality data integration problem of low-cost sensors.

AWS

AWS AI AI Python

MLOps Landscape in 2023: Top Tools and Platforms

The MLOps Blog

JUNE 27, 2023

LakeFS: LakeFS is an open-source platform that provides data lake versioning and management capabilities. It sits between the data lake and cloud object storage, allowing you to version and control changes to data lakes at scale. Flyte: Flyte is a platform for orchestrating ML pipelines at scale.

Machine Learning

Machine Learning Machine Learning ML ML

Top 5 Fivetran Connectors for Healthcare

phData

APRIL 29, 2024

In our previous blog, Top 5 Fivetran Connectors for Financial Services, we explored Fivetran’s capabilities that address the data integration needs of the finance industry. Now, let’s cover the healthcare industry, which also has a surging demand for data and analytics, along with the underlying processes to make it happen.

SQL

SQL Data Warehouse Azure Cloud Data

Data science vs data analytics: Unpacking the differences

IBM Journey to AI blog

SEPTEMBER 19, 2023

By analyzing datasets, data scientists can better understand their potential use in an algorithm or machine learning model. The data science lifecycle: Data science is iterative, meaning data scientists form hypotheses and experiment to see if a desired outcome can be achieved using available data.

Data Science

Data Science Analytics Analytics Data Scientist

Build ML features at scale with Amazon SageMaker Feature Store using data from Amazon Redshift

Flipboard

AUGUST 17, 2023

Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and ML to deliver the best price-performance at any scale. Check out the AWS Blog for more practices about building ML features from a modern data warehouse.

ML

ML ML AWS Data Warehouse

How to Version Control Data in ML for Various Data Sources

The MLOps Blog

JANUARY 23, 2023

These tools may have their own versioning system, which can be difficult to integrate with a broader data version control system. For instance, our data lake could contain a variety of relational and non-relational databases, files in different formats, and data stored using different cloud providers. DVC Git LFS neptune.ai

ML

ML ML Data Lakes Machine Learning

Identify cybersecurity anomalies in your Amazon Security Lake data using Amazon SageMaker

AWS Machine Learning Blog

DECEMBER 20, 2023

A novel approach to solve this complex security analytics scenario combines the ingestion and storage of security data using Amazon Security Lake and analyzing the security data with machine learning (ML) using Amazon SageMaker. The dataset used during development of this blog was small.

AWS

AWS ML ML Algorithm

How to use foundation models and trusted governance to manage AI workflow risk

IBM Journey to AI blog

OCTOBER 16, 2023

How to scale AI and ML with built-in governance: A fit-for-purpose data store built on an open lakehouse architecture allows you to scale AI and ML while providing built-in governance tools. A data store lets a business connect existing data with new data and discover new insights with real-time analytics and business intelligence.

AI

AI AI Data Warehouse ML

Turnkey Cloud DataOps: Solution from Alation and Accenture

Alation

MARCH 22, 2022

They created each capability as modules, which can either be used independently or together to build automated data pipelines. The table details are extracted from the IDF pipeline information, which then syncs details like column, table, business, and technical metadata. How the IDF Supports a Smarter Data Pipeline.

DataOps

DataOps Data Pipeline Data Engineering Data Engineering

Data architecture strategy for data quality

IBM Journey to AI blog

JANUARY 5, 2023

The first generation of data architectures represented by enterprise data warehouse and business intelligence platforms were characterized by thousands of ETL jobs, tables, and reports that only a small group of specialized data engineers understood, resulting in an under-realized positive impact on the business.

Data Quality

Data Quality Data Lakes Data Warehouse Big Data

Introduction to Apache NiFi and Its Architecture

Pickl AI

JULY 30, 2024

With its user-friendly interface and robust architecture, NiFi simplifies the complexities of data integration, making it an essential component for modern data-driven enterprises. This blog delves into the fundamentals of Apache NiFi, its architecture, and how it can be leveraged for effective data flow management.

ETL

ETL Data Lakes Big Data Big Data

How data stores and governance impact your AI initiatives

IBM Journey to AI blog

OCTOBER 12, 2023

Securing AI models and their access to data: While AI models need flexibility to access data across a hybrid infrastructure, they also need safeguarding from tampering (unintentional or otherwise) and, especially, protected access to data.

AI

AI AI Data Scientist Data Governance

How to Shift from Data Science to Data Engineering

ODSC - Open Data Science

JANUARY 18, 2024

If you are a data scientist, you may be wondering if you can transition into data engineering. The good news is that there are many skills that data scientists already have that are transferable to data engineering. In this blog post, we will discuss how you can become a data engineer if you are a data scientist.

Data Engineering

Data Engineering Data Engineering Data Engineer Data Engineering

How to Effectively Version Control Your Machine Learning Pipeline

phData

AUGUST 20, 2024

Data Versioning: Data is often considered the lifeblood that fuels the algorithms in an ML pipeline. Tracking changes and lineage ensures traceability for downstream components of the ML pipeline ingesting the data. Refer to this LakeFS blog post for a more detailed description. This Neptune.AI

Machine Learning

Machine Learning Machine Learning ML ML

Exploring the AI and data capabilities of watsonx

IBM Journey to AI blog

JULY 17, 2023

In this blog, I will cover: What is watsonx.ai? sales conversation summaries, insurance coverage, meeting transcripts, contract information) Generate: Generate text content for a specific purpose, such as marketing campaigns, job descriptions, blogs or articles, and email drafting support. What capabilities are included in watsonx.ai?

AI

AI AI Machine Learning Machine Learning

What is Data Ingestion? Understanding the Basics

Pickl AI

JULY 25, 2024

From extracting information from databases and spreadsheets to ingesting streaming data from IoT devices and social media platforms, it’s the foundation upon which data-driven initiatives are built. Batch Processing: In this method, data is collected over a period and then processed in groups or batches.
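
As a rough illustration of the batch processing idea in the excerpt above, here is a minimal Python sketch. It is not taken from the Pickl AI article: the `batched` helper, the batch size, and the sample `readings` are all hypothetical, chosen only to show records being collected first and then processed in groups.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists holding at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # Pull the next group of records; an empty list means we are done.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical readings collected over a period, then processed in batches of 3.
readings = [3, 8, 1, 9, 4, 7, 2]
batch_totals = [sum(batch) for batch in batched(readings, 3)]
print(batch_totals)  # [12, 20, 2]
```

The last batch may be smaller than the others; whether to process or hold back a partial batch is a design choice that depends on the pipeline.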

Apache Kafka

Apache Kafka Data Lakes Data Warehouse Data Quality

Amazon SageMaker Feature Store now supports cross-account sharing, discovery, and access

AWS Machine Learning Blog

FEBRUARY 13, 2024

Let’s demystify this using the following personas and a real-world analogy: Data and ML engineers (owners and producers) – They lay the groundwork by feeding data into the feature store. Data scientists (consumers) – They extract and utilize this data to craft their models. Data engineers serve as architects sketching the initial blueprint.

AWS

AWS ML ML Machine Learning

AI-Powered Bots in Ocean Predictoor Get a UX Upgrade: CLI & YAML

Ocean Protocol

JANUARY 17, 2024

That’s what this blog post describes. We wanted to professionalize and operationalize the data pipeline, for use by simulation, the bots, and the analytics app. We wanted to extend simulation, into a flow that supported experiments on realtime data and with the possibility of live trading. We’ve evolved it a lot lately!

Data Pipeline

Data Pipeline AI AI Analytics

Data Profiling: What It Is and How to Perfect It

Alation

APRIL 18, 2023

For any data user in an enterprise today, data profiling is a key tool for resolving data quality issues and building new data solutions. In this blog, we’ll cover the definition of data profiling, top use cases, and share important techniques and best practices for data profiling today.

Data Profiling

Data Profiling Data Quality Data Governance Data Pipeline

Getting Started With Snowflake: Best Practices For Launching

phData

DECEMBER 4, 2023

This blog was originally written by Erik Hyrkas and updated for 2024 by Justin Delisi. This isn’t meant to be a technical how-to guide (most of those details are readily available via a quick Google search) but rather an opinionated review of key processes and potential approaches. Use with caution, and test before committing to using them.

Clustering

Clustering Database SQL Data Pipeline

How HR Tech Company Sense Scaled their ML Operations using Iguazio

Iguazio

JANUARY 16, 2024

The system’s architecture ensures the data flows through the different systems effectively. First, the data lake is fed from a number of data sources. These include conversational data, ATS Data and more.

ML

ML ML DataOps Data Scientist

What is Snowflake Horizon?

phData

AUGUST 5, 2024

All of these questions describe a concept known as data governance. The Snowflake AI Data Cloud has built an entire blanket of features called Horizon, which tackles all of these questions and more. In this blog, we will explain what Horizon is, what features it includes, how you can use it, and how phData can help along the way.

Data Governance

Data Governance Data Quality Data Lakes ML

What are the Biggest Challenges with Migrating to Snowflake?

phData

FEBRUARY 5, 2024

In this blog, we’re going to answer these questions and more, walking you through the biggest challenges we have found when migrating our customers’ data from a legacy system to Snowflake. You’re in luck because this blog is for anyone ready to move or thinking about moving to Snowflake who wants to know what’s in store for them.

SQL

SQL Database Data Quality Data Warehouse

How Sense Uses Iguazio as a Key Component of Their ML Stack

Iguazio

JANUARY 16, 2024

The system’s architecture ensures the data flows through the different systems effectively. First, the data lake is fed from a number of data sources. These include conversational data, ATS data, and more.

ML

ML ML DataOps Data Scientist

How Alteryx & Snowflake Accelerates Analytics

phData

FEBRUARY 24, 2023

Data must be available at the right moment for consumption and it might not be the easiest task to develop a strategy around the continuous pipelines and the integrated applications to set up your stack. Alteryx and the Snowflake Data Cloud offer a potential solution to this issue and can speed up your path to Analytics.

Analytics

Analytics Analytics Database Python

Five benefits of a data catalog

IBM Journey to AI blog

DECEMBER 16, 2022

For example, data catalogs have evolved to deliver governance capabilities like managing data quality and data privacy and compliance. It uses metadata and data management tools to organize all data assets within your organization. Everybody wins with a data catalog.

Data Quality

Data Quality Data Governance Data Wrangling Data Scientist

Why Lean Data Management Is Vital for Agile Companies

Pickl AI

DECEMBER 11, 2024

Companies must adapt quickly to changing demands, and lean data management empowers them by enabling faster decisions, seamless collaboration, and improved scalability. This blog explores why lean data management is essential for agile organisations, its principles, and how to implement it effectively.

Data Silos

Data Silos Data Pipeline Artificial Intelligence Artificial Intelligence

Fine-tune your data lineage tracking with descriptive lineage

IBM Journey to AI blog

JULY 1, 2024

Whenever anyone talks about data lineage and how to achieve it, the spotlight tends to shine on automation. This is expected, as automating the process of calculating and establishing lineage is crucial to understanding and maintaining a trustworthy system of data pipelines. Contact your IBM representative for more information.

ETL

ETL Data Lakes Database Data Pipeline

Deploy a predictive maintenance solution for airport baggage handling systems with Amazon Lookout for Equipment

AWS Machine Learning Blog

APRIL 12, 2023

With this service, industrial sensors, smart meters, and OPC UA servers can be connected to an AWS data lake with just a few clicks. From now on, we will launch a retraining every 3 months and, as soon as possible, will use up to 1 year of data to account for the environmental condition seasonality.

AWS

AWS ML ML Machine Learning

Mastering ML Model Performance: Best Practices for Optimal Results

Iguazio

JUNE 25, 2023

In this blog post, we dive into all aspects of ML model performance: which metrics to use to measure performance, best practices that can help and where MLOps fits in. ML model evaluation is an essential part of the MLOps pipeline. Data Ingestion and Processing - MLOps enables data pipeline management and data quality monitoring.

ML

ML ML Clustering Cross Validation

Star Schema vs. Snowflake Schema: Comparing Dimensional Modeling Techniques

Pickl AI

JULY 25, 2024

Introduction: Dimensional modelling is crucial for organising data to enhance query performance and reporting efficiency. Effective schema design is essential for optimising data retrieval and analysis in data warehousing. Must Read Blogs: Exploring the Power of Data Warehouse Functionality.

Data Warehouse

Data Warehouse Business Intelligence Business Intelligence Database

What Are The Best Third-Party Data Ingestion Tools For Snowflake?

phData

FEBRUARY 14, 2023

Source data formats can only be Parquet, JSON, or Delimited Text (CSV, TSV, etc.). StreamSets Data Collector: StreamSets Data Collector Engine is an easy-to-use data pipeline engine for streaming, CDC, and batch ingestion from any source to any destination.

Data Warehouse

Data Warehouse Azure AWS Database

A Look Inside the Modern Analytics Stack

Dataversity

APRIL 1, 2021

In the data-driven world we live in today, the field of analytics has become increasingly important to remain competitive in business. In fact, a study by McKinsey Global Institute shows that data-driven organizations are 23 times more likely to outperform competitors in customer acquisition and nine times […].

Analytics

Analytics Analytics Data Silos Data Lakes

The Cloud Connection: How Governance Supports Security

Alation

APRIL 14, 2022

Data pipeline orchestration. Moving/integrating data in the cloud/data exploration and quality assessment. For example, data science always consumes “historical” data, and there is no guarantee that the semantics of older datasets are the same, even if their names are unchanged.

Data Governance

Data Governance ML ML Cloud Data

5 Ways Data Engineers Can Support Data Governance

Alation

JANUARY 26, 2023

That’s why many organizations invest in technology to improve data processes, such as a machine learning data pipeline. However, data needs to be easily accessible, usable, and secure to be useful — yet the opposite is too often the case. Pohan Lin also published articles for domains such as PingPlotter and IT Chronicles.

Data Governance

Data Governance Data Engineering Data Engineering Data Engineer
