When it comes to data storage, there are two main architectures: data lakes and data warehouses. What is a data lake? A data lake stores enormous amounts of raw data in its original format until it is needed for analytics applications. Which one is right for your business?
Data mining is a fascinating field that blends statistical techniques, machine learning, and database systems to reveal insights hidden within vast amounts of data. Businesses across various sectors are leveraging data mining to gain a competitive edge, improve decision-making, and optimize operations.
Data silos and duplication, along with concerns about data quality, create a complex environment for organizations to manage. Traditional database management tasks, including backups, upgrades, and routine maintenance, also drain valuable time and resources, hindering innovation.
Many of these applications are complex to build because they require collaboration across teams and the integration of data, tools, and services. Data engineers use data warehouses, data lakes, and analytics tools to load, transform, clean, and aggregate data. Expand your database starting from glue_db_.
Amazon DataZone is a data management service that makes it quick and convenient to catalog, discover, share, and govern data stored in AWS, on-premises, and third-party sources. The sample dataset: upload the dataset to Amazon S3 and crawl the data to create an AWS Glue database and tables.
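To make that last step concrete, here is a minimal sketch using boto3; the bucket, crawler name, IAM role, and database name are hypothetical placeholders, not values from the source.

    import boto3

    # Upload the sample dataset to S3 (bucket and key are hypothetical).
    s3 = boto3.client("s3")
    s3.upload_file("sample_dataset.csv", "my-datazone-bucket", "raw/sample_dataset.csv")

    # Create and start an AWS Glue crawler that builds a Glue database and tables.
    glue = boto3.client("glue")
    glue.create_crawler(
        Name="sample-dataset-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical IAM role
        DatabaseName="glue_db_sample",  # hypothetical database name
        Targets={"S3Targets": [{"Path": "s3://my-datazone-bucket/raw/"}]},
    )
    glue.start_crawler(Name="sample-dataset-crawler")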
You can streamline feature engineering and data preparation with SageMaker Data Wrangler and complete each stage of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) within a single visual interface.
Online analytical processing (OLAP) database systems and artificial intelligence (AI) complement each other and can help enhance data analysis and decision-making when used in tandem. Defining OLAP today: OLAP database systems have evolved significantly since their inception in the early 1990s.
JuMa is tightly integrated with a range of BMW Central IT services, including identity and access management, roles and rights management, BMW Cloud Data Hub (BMW’s data lake on AWS), and on-premises databases.
The solution harnesses the capabilities of generative AI, specifically large language models (LLMs), to address the challenges posed by diverse sensor data and automatically generate Python functions based on various data formats. This allows data to be aggregated for further manufacturer-agnostic analysis.
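To illustrate, here is a hedged sketch of the kind of normalization function such a pipeline might generate for one sensor format; the vendor column names, units, and target schema are assumptions for illustration only.

    import pandas as pd

    def parse_vendor_a(raw_csv_path: str) -> pd.DataFrame:
        """Normalize a hypothetical vendor-specific sensor CSV into a
        manufacturer-agnostic schema (timestamp, sensor_id, value, unit)."""
        df = pd.read_csv(raw_csv_path, sep=";")
        # Column names below are assumed for this fictional vendor format.
        return pd.DataFrame({
            "timestamp": pd.to_datetime(df["Zeit"], utc=True),
            "sensor_id": df["SensorNr"].astype(str),
            "value": df["Messwert"].astype(float),
            "unit": df["Einheit"],
        })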
They all agree that a data mart is a subject-oriented subset of a data warehouse focused on a particular business unit, department, subject area, or business function. A data mart's data is usually stored in databases that hold only the moving window of data required for analysis, not the full history.
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and ML to deliver the best price-performance at any scale. If you want to do the process in a low-code/no-code way, you can follow option C.
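As one programmatic route (a sketch, not the article's option C), you could query Redshift from Python via the Redshift Data API; the cluster, database, user, and table names are hypothetical.

    import boto3

    client = boto3.client("redshift-data")

    # Submit a query asynchronously; all identifiers below are hypothetical.
    response = client.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="analyst",
        Sql="SELECT region, SUM(sales) AS total_sales FROM orders GROUP BY region;",
    )

    # In practice, poll describe_statement(Id=...) until the status is FINISHED.
    result = client.get_statement_result(Id=response["Id"])
    for row in result["Records"]:
        print([col.get("stringValue") or col.get("longValue") for col in row])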
These teams are as follows: Advanced analytics team (data lake and data mesh) – Data engineers are responsible for preparing and ingesting data from multiple sources, building ETL (extract, transform, and load) pipelines to curate and catalog the data, and preparing the necessary historical data for the ML use cases.
Dataflows are a cloud-based technology designed for data preparation and transformation. Dataflows offer various connectors to retrieve data, including databases, Excel files, APIs, and similar sources, and data manipulations are performed using the Online Power Query Editor.
Despite the rise of big data technologies and cloud computing, the principles of dimensional modeling remain relevant. This session delved into how these traditional techniques have adapted to data lakes and real-time analytics, emphasizing their enduring importance for building scalable, efficient data systems.
Challenges at these stages include not knowing every touchpoint where data is persisted, maintaining a pre-processing pipeline for document chunking, choosing a chunking strategy, vector database, and indexing strategy, generating embeddings, and handling the manual steps needed to purge data from vector stores and keep them in sync with source data.
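For the chunking step specifically, a minimal sketch of fixed-size chunking with overlap; the chunk size and overlap are illustrative defaults, not recommendations from the source.

    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        """Split a document into overlapping fixed-size chunks for embedding."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        step = chunk_size - overlap
        chunks = []
        for start in range(0, len(text), step):
            chunk = text[start:start + chunk_size]
            if chunk:
                chunks.append(chunk)
        return chunks

    # A 1,200-character document yields chunks starting at offsets 0, 450, 900.
    print(len(chunk_text("x" * 1200)))  # 3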
See also Thoughtworks’s guide to Evaluating MLOps Platforms. End-to-end MLOps platforms: End-to-end MLOps platforms provide a unified ecosystem that streamlines the entire ML workflow, from data preparation and model development to deployment and monitoring. Dolt: Dolt is an open-source relational database that brings Git-style version control to SQL tables.
Without access to all critical and relevant data, the data that emerges from a data fabric will have gaps that delay business insights required to innovate, mitigate risk, or improve operational efficiencies. You must be able to continuously catalog, profile, and identify the most frequently used data.
Role of Data Engineers in the Data Ecosystem: Data engineers play a crucial role in the data ecosystem by bridging the gap between raw data and actionable insights. They are responsible for building and maintaining data architectures, which include databases, data warehouses, and data lakes.
Visual modeling: Delivers easy-to-use workflows for data scientists to build data preparation and predictive machine learning pipelines that include text analytics, visualizations, and a variety of modeling methods. Foundation models: Help users discover, augment, and enrich data with natural language.
Alteryx provides organizations with an opportunity to automate access to data, analytics, data science, and process automation all in one end-to-end platform. Its capabilities can be split into the following topics: automating inputs and outputs, data preparation, data enrichment, and data science.
The primary goal of Data Engineering is to transform raw data into a structured and usable format that can be easily accessed, analyzed, and interpreted by data scientists, analysts, and other stakeholders. Future of Data Engineering: The Data Engineering market will expand from $18.2
Mai-Lan Tomsen Bukovec, Vice President, Technology | AIM250-INT | Putting your data to work with generative AI | Thursday, November 30 | 12:30 PM – 1:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B. How can you turn your data lake into a business advantage with generative AI?
More on this topic later; for now, keep in mind that the simplest method is to create a naming convention for database objects that allows you to identify the owner and associated budget. The extended retention period will allow you to perform Time Travel activities, such as undropping tables or comparing new data against historical values.
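For illustration, a hedged sketch of both ideas in Snowflake SQL run from Python; the table name encodes a hypothetical owning team, and the 30-day retention period is an example, not a recommendation.

    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(
        account="myorg-myaccount",  # hypothetical account and credentials
        user="etl_user",
        password="...",
    )
    cur = conn.cursor()

    # Naming convention: prefix objects with the owning team for cost attribution.
    cur.execute("CREATE TABLE IF NOT EXISTS mkt_team__campaign_stats (id INT, spend FLOAT)")

    # Extend the Time Travel retention window (days) for undrop/compare activities.
    cur.execute("ALTER TABLE mkt_team__campaign_stats SET DATA_RETENTION_TIME_IN_DAYS = 30")

    # Within that window, an accidentally dropped table can be recovered:
    # cur.execute("UNDROP TABLE mkt_team__campaign_stats")  # only valid after a DROP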
Talend: Talend is a leading data integration platform known for its extensive tools for transforming, cleansing, and integrating data across multiple sources. It integrates well with cloud services, databases, and big data platforms like Hadoop, making it suitable for various data environments.
Placing functions for plotting, data loading, data preparation, and implementations of evaluation metrics in plain Python modules keeps a Jupyter notebook focused on the exploratory analysis (Source: Author). Using SQL directly in Jupyter cells: there are some cases in which data is not in memory (e.g., it lives in Redshift).
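A minimal sketch of that pattern with pandas and SQLAlchemy, pushing the aggregation into the database so only the small result set reaches the notebook; the connection string and table are hypothetical.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection string; Redshift speaks the PostgreSQL wire protocol.
    engine = create_engine("postgresql+psycopg2://user:pass@redshift-host:5439/dev")

    # The heavy lifting happens in the database, not in notebook memory.
    df = pd.read_sql("SELECT region, COUNT(*) AS n_orders FROM orders GROUP BY region", engine)
    df.head()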
Data preparation, train and tune, deploy and monitor. We have data pipelines and data preparation. A database of prompt examples may be required for each of these phases. In the data pipeline phase, I’m just going to call out things that I think are more important than the obvious.
Storage Solutions: Secure and scalable storage options like Azure Blob Storage and Azure Data Lake Storage. Key features and benefits of Azure for Data Science include: Scalability: Easily scale resources up or down based on demand, ideal for handling large datasets and complex computations.
And that’s really key for taking data science experiments into production. Data scientists might be leveraging one compute service and an extracted CSV for their experimentation. And there, instead of materializing those features in your database, you can just compute them on the fly.
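A hedged sketch of that on-the-fly pattern; the feature definitions and column names are illustrative.

    import pandas as pd

    def compute_features(orders: pd.DataFrame) -> pd.DataFrame:
        """Derive features at read time instead of materializing them in a table."""
        out = orders.copy()
        out["order_value"] = out["quantity"] * out["unit_price"]
        out["is_large_order"] = out["order_value"] > 1000
        return out

    # The same function works on an extracted CSV during experimentation...
    features = compute_features(pd.read_csv("orders_extract.csv"))  # hypothetical file
    # ...and on a fresh query result in production, avoiding train/serve drift.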
Also consider using Amazon Security Lake to automatically centralize security data from AWS environments, SaaS providers, on-premises environments, and cloud sources into a purpose-built data lake stored in your account.
KDD provides a structured framework to convert raw data into actionable knowledge. The KDD process: data gathering, data preparation, data mining, and data analysis and interpretation. Data mining process components: understanding the components of the data mining process is essential for effective implementation.
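Purely as a schematic, the KDD stages sketched as a minimal Python pipeline; the function bodies and input file are placeholders, not prescribed implementations.

    import pandas as pd

    def gather(path: str) -> pd.DataFrame:           # data gathering
        return pd.read_csv(path)

    def prepare(df: pd.DataFrame) -> pd.DataFrame:   # data preparation
        return df.dropna().drop_duplicates()

    def mine(df: pd.DataFrame) -> pd.DataFrame:      # data mining (placeholder: frequency counts)
        return df.groupby("category").size().rename("count").reset_index()

    def interpret(patterns: pd.DataFrame) -> None:   # analysis and interpretation
        print(patterns.sort_values("count", ascending=False).head())

    interpret(mine(prepare(gather("transactions.csv"))))  # hypothetical input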