Data Lakes, Data Preparation and Machine Learning

Data lakes vs. data warehouses: Decoding the data storage debate

Data Science Dojo

JANUARY 12, 2023

When it comes to data storage, there are two main options: data lakes and data warehouses. What is a data lake? A data lake stores an enormous amount of raw data in its original format until it is required for analytics applications. Which one is right for your business?

Data Lakes

Data Lakes Data Warehouse Hadoop Machine Learning

Unlock the power of data governance and no-code machine learning with Amazon SageMaker Canvas and Amazon DataZone

AWS Machine Learning Blog

AUGUST 21, 2024

Amazon DataZone is a data management service that makes it quick and convenient to catalog, discover, share, and govern data stored in AWS, on-premises, and third-party sources. The data lake environment is required to configure an AWS Glue database table, which is used to publish an asset in the Amazon DataZone catalog.

Machine Learning

Machine Learning Machine Learning Data Governance ML

The Ultimate Guide to Data Preparation for Machine Learning

DagsHub

FEBRUARY 29, 2024

Introduction: Machine learning models learn patterns from data and leverage the learning, captured in the model weights, to make predictions on new, unseen data. Data is therefore essential to the quality and performance of machine learning models.

Data Preparation

Data Preparation Machine Learning Machine Learning Data Governance

Data mining

Dataconomy

MARCH 4, 2025

Data mining is a fascinating field that blends statistical techniques, machine learning, and database systems to reveal insights hidden within vast amounts of data. Businesses across various sectors are leveraging data mining to gain a competitive edge, improve decision-making, and optimize operations.

Data Mining

Data Mining Data Mining Data Mining Decision Trees

How Marubeni is optimizing market decisions using AWS machine learning and analytics

AWS Machine Learning Blog

MARCH 8, 2023

MPII is using a machine learning (ML) bid optimization engine to inform upstream decision-making processes in power asset management and trading. This solution helps market analysts design and perform data-driven bidding strategies optimized for power asset profitability. Data comes from disparate sources in a number of formats.

AWS

AWS Machine Learning Machine Learning Analytics

How Northpower used computer vision with AWS to automate safety inspection risk assessments

AWS Machine Learning Blog

SEPTEMBER 27, 2024

Solution overview: Amazon SageMaker is a fully managed service that helps developers and data scientists build, train, and deploy machine learning (ML) models. Data preparation: SageMaker Ground Truth employs a human workforce made up of Northpower volunteers to annotate a set of 10,000 images.

AWS

AWS Data Lakes ML ML

Perform generative AI-powered data prep and no-code ML over any size of data using Amazon SageMaker Canvas

AWS Machine Learning Blog

AUGUST 15, 2024

Amazon SageMaker Canvas now empowers enterprises to harness the full potential of their data by enabling support of petabyte-scale datasets. Importing data from the SageMaker Data Wrangler flow allows you to interact with a sample of the data before scaling the data preparation flow to the full dataset.

ML

ML ML Data Preparation AWS

Apply fine-grained data access controls with AWS Lake Formation in Amazon SageMaker Data Wrangler

AWS Machine Learning Blog

AUGUST 21, 2023

Amazon SageMaker Data Wrangler reduces the time it takes to collect and prepare data for machine learning (ML) from weeks to minutes. SageMaker Data Wrangler supports fine-grained data access control with Lake Formation and Amazon Athena connections.

AWS

AWS Data Lakes Clustering Data Preparation

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

Flipboard

DECEMBER 11, 2024

Many of these applications are complex to build because they require collaboration across teams and the integration of data, tools, and services. Data engineers use data warehouses, data lakes, and analytics tools to load, transform, clean, and aggregate data.

SQL

SQL AWS Data Lakes AI

MAS AI/ML Modernization Accelerator: Air Compressor Use Case

IBM Data Science in Practice

JANUARY 9, 2024

Each of these accelerators leverages state-of-the-art algorithms and machine learning techniques to identify anomalies accurately and in real time. Solution 2: Migrate 3rd party models to MAS (Custom Model). This data science solution predicts anomalies in air compressor assets using an isolation forest model.

ML

ML ML AI AI

MLOps Landscape in 2023: Top Tools and Platforms

The MLOps Blog

JUNE 27, 2023

How to evaluate MLOps tools and platforms: As with any software solution, evaluating MLOps (Machine Learning Operations) tools and platforms can be a complex task, as it requires weighing many factors. Pay-as-you-go pricing makes it easy to scale when needed.

Machine Learning

Machine Learning Machine Learning ML ML

Simplify continuous learning of Amazon Comprehend custom models using Comprehend flywheel

AWS Machine Learning Blog

MARCH 1, 2023

Flywheel creates a data lake (in Amazon S3) in your account where all the training and test data for all versions of the model are managed and stored. Periodically, the new labeled data (to retrain the model) can be made available to flywheel by creating datasets. The data can be accessed from AWS Open Data Registry.

Data Lakes

Data Lakes AWS ML ML

FMOps/LLMOps: Operationalize generative AI and differences with MLOps

AWS Machine Learning Blog

SEPTEMBER 1, 2023

ML operationalization summary: As defined in the post MLOps foundation roadmap for enterprises with Amazon SageMaker, ML and operations (MLOps) is the combination of people, processes, and technology to productionize machine learning (ML) solutions efficiently. For them, the end-to-end MLOps lifecycle and infrastructure are necessary.

AI

AI AI ML ML

Your guide to generative AI and ML at AWS re:Invent 2023

AWS Machine Learning Blog

NOVEMBER 22, 2023

Now all you need is some guidance on generative AI and machine learning (ML) sessions to attend at this twelfth edition of re:Invent. In this chalk talk, learn how to select and use your preferred environment to perform end-to-end ML development steps, from preparing data to building, training, and deploying your ML models.

AWS

AWS ML ML AI

Accelerating AI/ML development at BMW Group with Amazon SageMaker Studio

Flipboard

NOVEMBER 24, 2023

In an increasingly digital and rapidly changing world, BMW Group’s business and product development strategies rely heavily on data-driven decision-making. With that, the need for data scientists and machine learning (ML) engineers has grown significantly.

ML

ML ML AWS AI

Introducing watsonx: The future of AI for business

IBM Journey to AI blog

MAY 9, 2023

After some impressive advances over the past decade, largely thanks to the techniques of Machine Learning (ML) and Deep Learning, the technology seems to have taken a sudden leap forward. It helps facilitate the entire data and AI lifecycle, from data preparation to model development, deployment and monitoring.

AI

AI AI Data Warehouse Machine Learning

Improving air quality with generative AI

AWS Machine Learning Blog

JUNE 18, 2024

More than 170 tech teams used the latest cloud, machine learning and artificial intelligence technologies to build 33 solutions. The output data is transformed to a standardized format and stored in a single location in Amazon S3 in Parquet format, a columnar and efficient storage format.

AWS

AWS AI AI Python

Implementing Knowledge Bases for Amazon Bedrock in support of GDPR (right to be forgotten) requests

AWS Machine Learning Blog

MAY 31, 2024

Data preparation: Before creating a knowledge base using Knowledge Bases for Amazon Bedrock, it’s essential to prepare the data to augment the FM in a RAG implementation. He is passionate about cloud and machine learning.

AWS

AWS Machine Learning Machine Learning Database

What is Data Mining?

Pickl AI

FEBRUARY 21, 2023

It involves using statistical and computational techniques to identify patterns and trends in the data that are not readily apparent. Data mining is often used in conjunction with other data analytics techniques, such as machine learning and predictive analytics, to build models that can be used to make predictions and inform decision-making.

Data Mining

Data Mining Data Mining Data Mining Data Scientist

How OLAP and AI can enable better business

IBM Journey to AI blog

DECEMBER 7, 2023

Increased operational efficiency benefits: Reduced data preparation time: OLAP data preparation capabilities streamline data analysis processes, saving time and resources.

Data Preparation

Data Preparation Database Data Analysis Data Analysis

How Light & Wonder built a predictive maintenance solution for gaming machines on AWS

AWS Machine Learning Blog

JUNE 22, 2023

Using data streamed through LnW Connect, L&W aims to create a better gaming experience for its end users and bring more value to its casino customers. With predictive maintenance, L&W can get advance warning of machine breakdowns and proactively dispatch a service team to inspect the issue.

AWS

AWS ML ML Machine Learning

Build ML features at scale with Amazon SageMaker Feature Store using data from Amazon Redshift

Flipboard

AUGUST 17, 2023

Amazon Redshift is the most popular cloud data warehouse that is used by tens of thousands of customers to analyze exabytes of data every day. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development.

ML

ML ML AWS Data Warehouse

How and When to Use Dataflows in Power BI

phData

SEPTEMBER 28, 2023

Dataflows are a cloud-based technology designed for data preparation and transformation. Dataflows offer connectors for retrieving data from databases, Excel files, APIs, and similar sources, and data manipulations are performed using the online Power Query Editor.

Power BI

Power BI Data Preparation Machine Learning Machine Learning

Exploring the AI and data capabilities of watsonx

IBM Journey to AI blog

JULY 17, 2023

is our enterprise-ready next-generation studio for AI builders, bringing together traditional machine learning (ML) and new generative AI capabilities powered by foundation models. Automated development: Automates data preparation, model development, feature engineering and hyperparameter optimization using AutoAI.

AI

AI AI Machine Learning Machine Learning

What Is a Data Catalog?

Alation

FEBRUARY 13, 2020

Figure 1 illustrates the typical metadata subjects contained in a data catalog. Figure 1 – Data Catalog Metadata Subjects. Datasets are the files and tables that data workers need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource.

Data Lakes

Data Lakes Data Analysis Data Analysis Big Data

3 Major Trends at Strata New York 2017

DataRobot Blog

OCTOBER 3, 2017

This highlights the two companies’ shared vision on self-service data discovery with an emphasis on collaboration and data governance. 2) When data becomes information, many (incremental) use cases surface. Standard Chartered Bank (SCB), a customer of Paxata, spoke about data democratization at SCB. DataRobot Data Prep.

Data Lakes

Data Lakes Azure Data Pipeline Hadoop

Modern Data Management Essentials: Exploring Data Fabric

Precisely

JULY 18, 2024

Ensures consistent, high-quality data is readily available to foster innovation and enable you to drive competitive advantage in your markets through advanced analytics and machine learning. You must be able to continuously catalog, profile, and identify the most frequently used data. Increase metadata maturity.

Data Lakes

Data Lakes Data Warehouse Data Governance Machine Learning

Your Complete Roadmap to Become an Azure Data Scientist

Pickl AI

SEPTEMBER 5, 2024

As businesses increasingly turn to cloud solutions, Azure stands out as a leading platform for Data Science, offering powerful tools and services for advanced analytics and Machine Learning. This roadmap aims to guide aspiring Azure Data Scientists through the essential steps to build a successful career.

Azure

Azure Data Scientist Data Science Machine Learning

Use the Amazon SageMaker and Salesforce Data Cloud integration to power your Salesforce apps with AI/ML

AWS Machine Learning Blog

AUGUST 4, 2023

Train a recommendation model in SageMaker Studio using training data that was prepared using SageMaker Data Wrangler. The real-time inference call data is first passed to the SageMaker Data Wrangler container in the inference pipeline, where it is preprocessed and passed to the trained model for product recommendation.

ML

ML ML AWS AI

Tackling AI’s data challenges with IBM databases on AWS

IBM Journey to AI blog

MARCH 14, 2024

Netezza SaaS on AWS: IBM® Netezza® Performance Server is a cloud-native data warehouse designed to operationalize deep analytics, data mining and BI by unifying, accessing and scaling all types of data across the hybrid cloud.

AWS

AWS Database ETL AI

The Top AI Slides from ODSC West 2024

ODSC - Open Data Science

NOVEMBER 19, 2024

ODSC West 2024 showcased a wide range of talks and workshops from leading data science, AI, and machine learning experts. This blog highlights some of the most impactful AI slides from the world’s best data science instructors, focusing on cutting-edge advancements in AI, data modeling, and deployment strategies.

Deep Learning

Deep Learning Deep Learning Data Science AI

Popular Data Transformation Tools: Importance and Best Practices

Pickl AI

OCTOBER 10, 2024

Summary: Data transformation tools streamline data processing by automating the conversion of raw data into usable formats. These tools enhance efficiency, improve data quality, and support advanced analytics such as machine learning. Aggregation: Combining multiple data points into a single summary (e.g.,

Data Quality

Data Quality AWS Machine Learning Machine Learning

Discover the Most Important Fundamentals of Data Engineering

Pickl AI

NOVEMBER 4, 2024

Role of Data Engineers in the Data Ecosystem: Data Engineers play a crucial role in the data ecosystem by bridging the gap between raw data and actionable insights. They are responsible for building and maintaining data architectures, which include databases, data warehouses, and data lakes.

Data Engineering

Data Engineering Data Engineering Data Engineering Data Engineer

Snowflake Snowpark: cloud SQL and Python ML pipelines

Snorkel AI

MAY 26, 2023

Ahmad Khan, head of artificial intelligence and machine learning strategy at Snowflake, gave a presentation entitled “Scalable SQL + Python ML Pipelines in the Cloud” about his company’s Snowpark service at Snorkel AI’s Future of Data-Centric AI virtual conference in August 2022.

SQL

SQL ML ML Python

10 Best Data Engineering Books [Beginners to Advanced]

Pickl AI

AUGUST 1, 2023

Key Components of Data Engineering: Data Ingestion: Gathering data from various sources, such as databases, APIs, files, and streaming platforms, and bringing it into the data infrastructure. Data Processing: Performing computations, aggregations, and other data operations to generate valuable insights from the data.

Data Engineering

Data Engineering Data Engineering Data Engineering Data Engineer

Google’s Dr. Arsanjani on Enterprise Foundation Model Challenges

Snorkel AI

MARCH 2, 2023

From a software engineering perspective, machine-learning models, if you look at it in terms of the number of parameters and in terms of size, started out from the transformer models. So the application started to go from the pure software-engineering/machine-learning domain to industry and the sciences, essentially.

Machine Learning

Machine Learning Machine Learning Data Preparation AI

Driving Data Catalog Adoption

Alation

FEBRUARY 13, 2020

Data Literacy—Many line-of-business people have responsibilities that depend on data analysis but have not been trained to work with data. Their tendency is to do just enough data work to get by, and to do that work primarily in Excel spreadsheets. Who needs data literacy training? Who can provide the training?

Data Governance

Data Governance Data Analysis Data Analysis Data Preparation

How to Use Exploratory Notebooks [Best Practices]

The MLOps Blog

OCTOBER 20, 2023

Placing functions for plotting, data loading, data preparation, and implementations of evaluation metrics in plain Python modules keeps a Jupyter notebook focused on the exploratory analysis (Source: Author). Using SQL directly in Jupyter cells: There are some cases in which data is not in memory (e.g.,

SQL

SQL Database Data Scientist Python

Getting Started With Snowflake: Best Practices For Launching

phData

DECEMBER 4, 2023

If you answer “yes” to any of these questions, you will need cloud storage, such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. Knowing this, you want to prepare your data in a way that optimizes your load. It might be tempting to have massive files and let the system sort it out.

Clustering

Clustering Database SQL Data Pipeline

Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs

AWS Machine Learning Blog

JANUARY 26, 2024

The goal of this post is to empower AI and machine learning (ML) engineers, data scientists, solutions architects, security teams, and other stakeholders to have a common mental model and framework to apply security best practices, allowing AI/ML teams to move fast without trading off security for speed.

AWS

AWS ML ML AI

How to Build an End-To-End ML Pipeline

The MLOps Blog

MAY 9, 2023

They run scripts manually to preprocess their training data, rerun the deployment scripts, manually tune their models, and spend their working hours keeping previously developed models up to date. Building end-to-end machine learning pipelines lets ML engineers build once, rerun, and reuse many times.

ML

ML ML Machine Learning Machine Learning
