This post is part of an ongoing series about governing the machine learning (ML) lifecycle at scale. This post dives deep into how to set up data governance at scale using Amazon DataZone for the data mesh. The data mesh is a modern approach to data management that decentralizes data ownership and treats data as a product.
A data lake becomes a data swamp in the absence of comprehensive data quality validation, and it offers no clear link to value creation. Organizations are rapidly adopting the cloud data lake as the data lake of choice, and the need for validating data in real time has become critical.
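To make this concrete, here is a minimal sketch of the kind of validation gate that keeps bad records out of a lake, assuming a pandas DataFrame and illustrative column names (event_id, event_time, amount):

```python
import pandas as pd

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Reject records that fail basic quality checks before they land in the lake."""
    clean = df.dropna(subset=["event_id", "event_time", "amount"])  # completeness
    clean = clean.drop_duplicates(subset="event_id")                # uniqueness
    clean = clean[clean["amount"] >= 0].copy()                      # validity: no negative amounts
    clean["event_time"] = pd.to_datetime(clean["event_time"], errors="coerce")
    clean = clean.dropna(subset=["event_time"])                     # validity: parseable timestamps
    rejected = len(df) - len(clean)
    if rejected:
        print(f"Rejected {rejected} of {len(df)} records")
    return clean
```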
Amazon DataZone is a data management service that makes it quick and convenient to catalog, discover, share, and govern data stored in AWS, on-premises, and third-party sources. Enterprises can use no-code ML solutions to streamline their operations and optimize their decision-making without extensive administrative overhead.
Alignment with other tools in the organization’s tech stack: consider how well the MLOps tool integrates with your existing tools and workflows, such as data sources, data engineering platforms, code repositories, CI/CD pipelines, and monitoring systems, as well as data structures like Pandas or Apache Spark DataFrames.
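One quick interoperability check is whether a tool accepts the DataFrame flavors your team already uses. A minimal sketch, assuming a local PySpark installation, of moving data between Pandas and Spark:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interop-check").getOrCreate()

# A feature table produced by an upstream pandas step
pdf = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.9, 0.4, 0.7]})

# Hand it to Spark for distributed processing...
sdf = spark.createDataFrame(pdf)
filtered = sdf.filter(sdf.score > 0.5)

# ...and bring the result back to pandas for local model work
print(filtered.toPandas())
```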
This data is then integrated into centralized databases for further processing and analysis. Data cleaning and preprocessing: IoT data can be noisy, incomplete, and inconsistent, so data engineers employ cleaning and preprocessing techniques to ensure data quality, making it ready for analysis and decision-making.
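As a sketch of what that cleaning step can look like in practice, assuming hypothetical sensor readings with device_id, ts, and temperature columns:

```python
import pandas as pd

# Hypothetical raw sensor readings: gaps, error codes, and a missing timestamp
raw = pd.DataFrame({
    "device_id": ["a1", "a1", "a1", "a2", "a2", "a2"],
    "ts": ["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:10",
           "2024-01-01 00:00", None, "2024-01-01 00:10"],
    "temperature": [21.4, None, 22.0, 19.8, 20.0, -999.0],  # -999 is a sensor error code
})

clean = (
    raw.dropna(subset=["ts"])                        # a reading without a timestamp is unusable
       .assign(ts=lambda d: pd.to_datetime(d["ts"]))
       .sort_values(["device_id", "ts"])
       .reset_index(drop=True)
)

# Treat physically implausible values (like -999 error codes) as missing
clean.loc[~clean["temperature"].between(-40, 85), "temperature"] = None

# Fill short gaps per device by interpolation, then drop whatever remains unusable
clean["temperature"] = clean.groupby("device_id")["temperature"].transform(
    lambda s: s.interpolate(limit=2)
)
clean = clean.dropna(subset=["temperature"])
print(clean)
```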
If you are a returning SageMaker Studio user, upgrade to the latest Jupyter and SageMaker Data Wrangler kernels to ensure Salesforce Data Cloud is enabled. This completes the setup for accessing data from Salesforce Data Cloud in SageMaker Studio to build AI and machine learning (ML) models.
Evaluating ML model performance is essential for ensuring the reliability, quality, accuracy, and effectiveness of your ML models. In this blog post, we dive into all aspects of ML model performance: which metrics to use to measure it, best practices that can help, and where MLOps fits in.
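As a starting point, a minimal sketch of computing several complementary metrics with scikit-learn on a synthetic classification task:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

# Accuracy alone can mislead on imbalanced data; report complementary metrics
print(f"accuracy: {accuracy_score(y_test, pred):.3f}")
print(f"f1:       {f1_score(y_test, pred):.3f}")
print(f"roc_auc:  {roc_auc_score(y_test, proba):.3f}")
```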
From data processing to quick insights, robust pipelines are a must for any ML system. Often the data team, comprising data and ML engineers, needs to build this infrastructure, and the experience can be painful. However, efficient use of ETL pipelines in ML can make their lives much easier.
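The basic shape is simple, even if production pipelines are not. A minimal sketch of an extract-transform-load flow feeding ML training, with hypothetical column names (user_id, signup_date) and file paths:

```python
import pandas as pd

def extract(src: str) -> pd.DataFrame:
    # E: pull raw records from a source (a CSV here; a warehouse or API in practice)
    return pd.read_csv(src)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # T: clean the data and derive model-ready features
    df = df.dropna(subset=["user_id"])
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["tenure_days"] = (pd.Timestamp.now() - df["signup_date"]).dt.days
    return df

def load(df: pd.DataFrame, dst: str) -> None:
    # L: persist features where training jobs can read them
    df.to_parquet(dst, index=False)

def run_pipeline(src: str, dst: str) -> None:
    load(transform(extract(src)), dst)
```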
Powered by machine learning (ML), Einstein Discovery provides predictions and recommendations within Tableau workflows for accelerated and smarter decision-making. Here’s how: Einstein dashboard extension: Enjoy on-demand, interpretable ML predictions from Einstein Discovery natively integrated in Tableau dashboards.
Amazon SageMaker Data Wrangler reduces the time it takes to collect and prepare data for machine learning (ML) from weeks to minutes. Data is frequently kept in data lakes that can be managed by AWS Lake Formation, giving you the ability to implement fine-grained access control using a straightforward grant or revoke procedure.
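A minimal sketch of that grant/revoke flow using boto3’s Lake Formation client; the role ARN, database, and table names below are placeholders:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a single table to an analyst role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)

# Revoking is the mirror-image call
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
)
```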
Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines. He specializes in designing, building, and optimizing large-scale data solutions.
Its goal is to help with a quick analysis of target characteristics, training vs testing data, and other such data characterization tasks. Apache Superset GitHub | Website Apache Superset is a must-try project for any ML engineer, data scientist, or data analyst.
Therefore, when the Principal team started tackling this project, they knew that ensuring the highest standard of data security, covering regulatory compliance, data privacy, and data quality, would be a non-negotiable, key requirement.
The group kicked off the session by exchanging ideas about what it means to have a modern data architecture. Atif Salam noted that as recently as a year ago, the primary focus in many organizations was on ingesting data and building data lakes.
Over time, we called the “thing” a data catalog, blending the Google-style, AI/ML-based relevancy with more Yahoo-style manual curation and wikis. Thus was born the data catalog. In our early days, “people” largely meant data analysts and business analysts; today, that audience has expanded to include ML and DataOps teams.
With an open data lakehouse architecture, you can now optimize your data warehouse workloads for price performance and modernize traditional data lakes with better performance and governance for AI. Effective data quality management is crucial to mitigating these risks.
For example, data catalogs have evolved to deliver governance capabilities like managing data quality, data privacy, and compliance. A data catalog uses metadata and data management tools to organize all data assets within your organization. Ensuring data quality is made easier as a result.
Building an Open, Governed Lakehouse with Apache Iceberg and Apache Polaris (Incubating) Yufei Gu | Senior Software Engineer | Snowflake In this session, you’ll explore how open-source table formats are revolutionizing data architectures by enabling the power and efficiency of data warehouses within data lakes.
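For a flavor of what that looks like in practice, a minimal sketch of warehouse-style DDL on data lake storage via Iceberg’s Spark integration; the catalog name, warehouse path, and runtime package version are assumptions that must match your environment:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Pull the Iceberg runtime; pick the build matching your Spark/Scala versions
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Warehouse-style DDL and DML directly on files in the lake
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'signup'), (2, 'purchase')")

# ACID snapshots make the table's history queryable
spark.sql("SELECT * FROM demo.db.events.snapshots").show()
```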
Key Takeaways Data Fabric is a modern data architecture that facilitates seamless data access, sharing, and management across an organization. Data management recommendations and data products emerge dynamically from the fabric through automation, activation, and AI/ML analysis of metadata.
Managing unstructured data is essential for the success of machine learning (ML) projects. Without structure, data is difficult to analyze and extracting meaningful insights and patterns is challenging. This article will discuss managing unstructured data for AI and ML projects. What is Unstructured Data?
In general, this data has no clear structure because it may manifest real-world complexity, such as the subtlety of language or the details in a picture. Advanced methods are needed to process unstructured data, but its unstructured nature comes from how easily it is made and shared in today's digital world.
You’ll use MLRun, Langchain, and Milvus for this exercise and cover topics like the integration of AI/ML applications, leveraging Python SDKs, as well as building, testing, and tuning your work. In this session, we’ll demonstrate how you can fine-tune a Gen AI model, build a Gen AI application, and deploy it in 20 minutes.
Businesses face significant hurdles when preparing data for artificial intelligence (AI) applications. The existence of data silos and duplication, alongside apprehensions regarding data quality, presents a multifaceted environment for organizations to manage.
Horizon addresses key aspects of data governance, including compliance, security, access, privacy, and interoperability. Throughout the remainder of this blog, we will dive deeper into each of these components and take a look at the ways in which Horizon can help. We will begin with compliance.
Common options include: Relational Databases: Structured storage supporting ACID transactions, suitable for structured data. NoSQL Databases: Flexible, scalable solutions for unstructured or semi-structured data. Data Warehouses : Centralised repositories optimised for analytics and reporting.
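The first property on that list, ACID transactions, is easy to demonstrate. A minimal sketch using Python’s built-in sqlite3, where a failed statement rolls back the whole transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

# Atomicity: the block commits entirely or not at all
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO orders (amount) VALUES (?)", (49.99,))
        conn.execute("INSERT INTO orders (amount) VALUES (?)", (None,))  # violates NOT NULL
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0: nothing committed
```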
Skills like effective verbal and written communication will help back up the numbers, while data visualization (specific frameworks in the next section) can help you tell a complete story. Data Wrangling: Data Quality, ETL, Databases, Big Data. The modern data analyst is expected to be able to source and retrieve their own data for analysis.
There are many other features and functions, but to verify that a data catalog functions as a platform, look for these five features: Intelligence – a data catalog should leverage AI/ML-driven pattern detection, including popularity, pattern matching, and provenance/impact analysis. Curation can power key insights.
Data pipeline stages
But before delving deeper into the technical aspects of these tools, let’s quickly understand the core components of a data pipeline, succinctly captured in the image below.
[Image: Data pipeline stages | Source: Author]
What does a good data pipeline look like?
Having been in business for over 50 years, ARC had accumulated a massive amount of data that was stored in siloed, on-premises servers across its 7 business domains. Using Alation, ARC automated the data curation and cataloging process.
Typically, flashy new algorithms or state-of-the-art models capture both the public imagination and the interest of data scientists, but messy data can undermine even the most sophisticated model. For instance, bad data is reported to cost the US $3 trillion per year, and poor-quality data costs organizations an average of $12.9 million annually.
Data Quality Management: Persistent staging provides a clear demarcation between raw and processed customer data. This makes it easier to implement and manage data quality processes, ensuring your marketing efforts are based on clean, reliable data. New user sign-up? Workout completed? Each such event lands in the staging layer first, giving downstream quality checks a stable raw record to work from.
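A minimal sketch of that raw-versus-processed demarcation, with hypothetical event data and column names:

```python
import pandas as pd

def promote_from_staging(raw: pd.DataFrame) -> pd.DataFrame:
    """Quality gate applied when promoting events from the raw staging layer
    to the processed layer; the raw frame itself is never mutated."""
    processed = raw.copy()
    processed = processed.dropna(subset=["customer_id", "event_type"])
    processed = processed.drop_duplicates(subset="event_id")
    processed["event_time"] = pd.to_datetime(processed["event_time"], errors="coerce")
    return processed.dropna(subset=["event_time"])

# Hypothetical staged events: sign-ups, workouts, etc., kept verbatim for replay
raw_events = pd.DataFrame({
    "event_id": [1, 2, 2, 3],
    "customer_id": ["c9", "c4", "c4", None],
    "event_type": ["signup", "workout_completed", "workout_completed", "signup"],
    "event_time": ["2024-05-01 08:00", "2024-05-01 09:30", "2024-05-01 09:30", "bad"],
})
clean_events = promote_from_staging(raw_events)
print(clean_events)  # raw_events stays untouched, available for audit and replay
```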
One of the most prevalent complaints we hear from ML engineers in the community is how costly and error-prone it is to manually go through the ML workflow of building and deploying models. Building end-to-end machine learning pipelines lets ML engineers build once, then rerun and reuse many times. If all goes well, of course.
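A minimal illustration of the build-once, reuse-many-times idea using a scikit-learn Pipeline persisted with joblib:

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Build once: preprocessing and model are captured as a single artifact
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

def train(X, y, path="model.joblib"):
    pipeline.fit(X, y)
    joblib.dump(pipeline, path)  # persist the whole pipeline, not just the model

def rerun(X, path="model.joblib"):
    # Reuse many times: reload and score without re-implementing preprocessing
    return joblib.load(path).predict(X)
```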
The goal of this post is to empower AI and machine learning (ML) engineers, data scientists, solutions architects, security teams, and other stakeholders to have a common mental model and framework to apply security best practices, allowing AI/ML teams to move fast without trading off security for speed.
Starting today, you can interactively prepare large datasets, create end-to-end data flows, and invoke automated machine learning (AutoML) experiments on petabytes of data—a substantial leap from the previous 5 GB limit. Organizations often struggle to extract meaningful insights and value from their ever-growing volume of data.
From gathering and processing data to building models through experiments, deploying the best ones, and managing them at scale for continuous value in production—it’s a lot. As the number of ML-powered apps and services grows, it gets overwhelming for data scientists and ML engineers to build and deploy models at scale.
Amazon AppFlow was used to facilitate the smooth and secure transfer of data from various sources into ODAP. Additionally, Amazon Simple Storage Service (Amazon S3) served as the central data lake, providing a scalable and cost-effective storage solution for the diverse data types collected from different systems.
And where data was available, the ability to access and interpret it proved problematic. Big data can grow too big, fast. Left unchecked, data lakes became data swamps. Some data lake implementations required expensive ‘cleansing pumps’ to make them navigable again.
Self-Service Analytics User-friendly interfaces and self-service analytics tools empower business users to explore data independently without relying on IT departments. Best Practices for Maximizing Data Warehouse Functionality A data warehouse, brimming with historical data, holds immense potential for unlocking valuable insights.
Data quality strongly impacts the quality and usefulness of content produced by an AI model, underscoring the significance of addressing data challenges. It provides the combination of data lake flexibility and data warehouse performance to help scale AI.
What Are the Top Data Challenges to Analytics? The proliferation of data sources means there is an increase in data volume that must be analyzed. Large volumes of data have led to the development of data lakes, data warehouses, and data management systems. Establishes Trust in Data.
In that sense, data modernization is synonymous with cloud migration. Modern data architectures, like cloud data warehouses and cloud data lakes, empower more people to leverage analytics for insights more efficiently. Consolidating all data across your organization builds trust in the data.
By leveraging data services and APIs, a data fabric can also pull together data from legacy systems, data lakes, data warehouses, and SQL databases, providing a holistic view into business performance. It uses knowledge graphs, semantics, and AI/ML technology to discover patterns in various types of metadata.
There are various technologies that help operationalize and optimize the process of field trials, including data management and analytics, IoT, remote sensing, robotics, machine learning (ML), and now generative AI. Multi-source data is initially received and stored in an Amazon Simple Storage Service (Amazon S3) data lake.