When it comes to data, there are two main types: data lakes and data warehouses. What is a data lake? An enormous amount of raw data is stored in its original format in a data lake until it is required for analytics applications. Which one is right for your business?
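The distinction above can be sketched in a few lines of plain Python. This is a minimal illustration, not either product's API: the file paths, event payload, and field names are all hypothetical. A lake keeps the raw payload untouched; a warehouse stores a cleaned, schema-conforming row.

```python
import json
from pathlib import Path

# A hypothetical raw event as it arrives from an upstream system.
raw_event = '{"user": "u42", "ts": "2024-01-15T09:30:00Z", "clicks": "7"}'

# Data lake: write the payload in its original format, untouched.
lake = Path("lake/events")
lake.mkdir(parents=True, exist_ok=True)
(lake / "event-0001.json").write_text(raw_event)

# Data warehouse: enforce a schema (typed columns) before storing.
record = json.loads(raw_event)
warehouse_row = {"user": record["user"], "ts": record["ts"], "clicks": int(record["clicks"])}
print(warehouse_row)
```

Note that `clicks` is a string in the lake copy but an `int` in the warehouse row: the lake defers interpretation until analysis time, while the warehouse pays that cost up front.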
Data is, therefore, essential to the quality and performance of machine learning models. This makes data preparation for machine learning all the more critical, so that the models generate reliable and accurate predictions and drive business value for the organization. Why do you need Data Preparation for Machine Learning?
Many of these applications are complex to build because they require collaboration across teams and the integration of data, tools, and services. Data engineers use data warehouses, data lakes, and analytics tools to load, transform, clean, and aggregate data.
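The load-then-aggregate step the excerpt describes can be sketched with Python's standard library as a stand-in for a Spark-style CSV read (in the source article the data comes from a CSV on S3; the column names and rows below are hypothetical):

```python
import csv
import io

# Hypothetical venue data; in practice this would be read from object storage.
raw = (
    "venueid,venuename,venuecity\n"
    "1,Toyota Park,Bridgeview\n"
    "2,Columbus Crew Stadium,Columbus\n"
)

# Load: parse the comma-separated data into one dict per row.
rows = list(csv.DictReader(io.StringIO(raw)))

# Aggregate: count venues per city, a tiny example of the clean/aggregate step.
per_city = {}
for row in rows:
    per_city[row["venuecity"]] = per_city.get(row["venuecity"], 0) + 1
print(per_city)
```

The same shape scales up: in Spark the load would return a distributed DataFrame instead of a list, but the load → clean → aggregate flow is identical.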
Data preparation: SageMaker Ground Truth employs a human workforce made up of Northpower volunteers to annotate a set of 10,000 images. The model was then fine-tuned with training data from the data preparation stage. The sunburst graph below is a visualization of this classification.
Amazon DataZone is a data management service that makes it quick and convenient to catalog, discover, share, and govern data stored in AWS, on-premises, and third-party sources. The data lake environment is required to configure an AWS Glue database table, which is used to publish an asset in the Amazon DataZone catalog.
You can streamline the process of feature engineering and data preparation with SageMaker Data Wrangler and finish each stage of the data preparation workflow (including data selection, purification, exploration, visualization, and processing at scale) within a single visual interface.
Flywheel creates a data lake (in Amazon S3) in your account where all the training and test data for all versions of the model are managed and stored. Periodically, new labeled data (to retrain the model) can be made available to the flywheel by creating datasets. One location serves as the data lake for the Comprehend flywheel.
In this blog, we delve into 4 different “on-ramps” we created in a MAS Accelerator to offer a straightforward path to harnessing the power of AI in MAS, wherever you may be on your MAS AI/ML modernization journey. In our scenario, the data is stored in the Cloud Object Storage in Watson Studio.
Increased operational efficiency benefits. Reduced data preparation time: OLAP data preparation capabilities streamline data analysis processes, saving time and resources. IBM watsonx.data is the next-generation OLAP system that can help you make the most of your data.
These teams are as follows: Advanced analytics team (data lake and data mesh) – Data engineers are responsible for preparing and ingesting data from multiple sources, building ETL (extract, transform, and load) pipelines to curate and catalog the data, and preparing the necessary historical data for the ML use cases.
Data collection and ingestion: The data collection and ingestion layer connects to all upstream data sources and loads the data into the data lake. Therefore, the ingestion components need to be able to manage authentication, data sourcing in pull mode, data preprocessing, and data storage.
The solution addressed in this blog solves Afri-SET’s challenge and was ranked among the top 3 winning solutions. This post presents a solution that uses generative artificial intelligence (AI) to standardize air quality data from low-cost sensors in Africa, specifically addressing the air quality data integration problem of low-cost sensors.
Figure 1 illustrates the typical metadata subjects contained in a data catalog. Figure 1 – Data Catalog Metadata Subjects. Datasets are the files and tables that data workers need to find and access. They may reside in a data lake, warehouse, master data repository, or any other shared data resource.
Businesses require Data Scientists to perform Data Mining processes and invoke valuable data insights using different software and tools. What is Data Mining, and how is it related to Data Science? Let’s learn from the following blog! What is Data Mining? are the various data mining tools.
It offers its users advanced machine learning, data management, and generative AI capabilities to train, validate, tune, and deploy AI systems across the business with speed, trusted data, and governance. It helps facilitate the entire data and AI lifecycle, from data preparation to model development, deployment, and monitoring.
Whether it’s for ad hoc analytics, data transformation, data sharing, data lake modernization, or ML and gen AI, you have the flexibility to choose. Integrated solutions for zero-ETL data preparation: IBM databases on AWS offer integrated solutions that eliminate the need for ETL processes in data preparation for AI.
Data Catalogs for Data Science & Engineering – Data catalogs that are primarily used for data science and engineering are typically used by very experienced data practitioners. Such a catalog also covers datasets and operations, including data preparation features and functions.
Dataflows allow users to establish source connections and retrieve data, and subsequent data transformations can be conducted using the online Power Query Editor. In this blog, we will provide insights into the process of creating Dataflows and offer guidance on when to choose them to address real-world use cases effectively.
In this blog, I will cover: What is watsonx.ai? What capabilities are included in watsonx.ai? Summarize: Generate summaries of text (e.g., sales conversation summaries, insurance coverage, meeting transcripts, contract information). Generate: Generate text content for a specific purpose, such as marketing campaigns, job descriptions, blogs or articles, and email drafting support.
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and ML to deliver the best price-performance at any scale. Check out the AWS Blog for more practices for building ML features from a modern data warehouse.
Data preparation: Before creating a knowledge base using Knowledge Bases for Amazon Bedrock, it’s essential to prepare the data to augment the FM in a RAG implementation. Data discovery and findability: Findability is an important step of the process.
Even something like gamification may emerge as a way to fully engage data shoppers as a community. Behind the scenes, “backroom services” will power the storefront, performing such tasks as data acquisition, data preparation, data curation and cataloging, and tracking. Building the EDM.
ODSC West 2024 showcased a wide range of talks and workshops from leading data science, AI, and machine learning experts. This blog highlights some of the most impactful AI slides from the world’s best data science instructors, focusing on cutting-edge advancements in AI, data modeling, and deployment strategies.
In LnW Connect, an encryption process was designed to provide a secure and reliable mechanism for the data to be brought into an AWS data lake for predictive modeling. Data preprocessing and feature engineering: In this section, we discuss our methods for data preparation and feature engineering.
Train a recommendation model in SageMaker Studio using training data that was prepared using SageMaker Data Wrangler. The real-time inference call data is first passed to the SageMaker Data Wrangler container in the inference pipeline, where it is preprocessed and passed to the trained model for product recommendation.
Mai-Lan Tomsen Bukovec, Vice President, Technology | AIM250-INT | Putting your data to work with generative AI Thursday November 30 | 12:30 PM – 1:30 PM (PST) | Venetian | Level 5 | Palazzo Ballroom B How can you turn your data lake into a business advantage with generative AI? You must bring your laptop to participate.
In a recent blog, titled Collaboration and Crowdsourcing with Data Cataloging, I discussed the importance of participation by all data stakeholders as a key to getting maximum value from your data catalog. Their tendency is to do just enough data work to get by, and to do that work primarily in Excel spreadsheets.
See also Thoughtworks’s guide to Evaluating MLOps Platforms End-to-end MLOps platforms End-to-end MLOps platforms provide a unified ecosystem that streamlines the entire ML workflow, from data preparation and model development to deployment and monitoring.
Data must be available at the right moment for consumption and it might not be the easiest task to develop a strategy around the continuous pipelines and the integrated applications to set up your stack. Alteryx and the Snowflake Data Cloud offer a potential solution to this issue and can speed up your path to Analytics.
Summary: This blog provides a comprehensive roadmap for aspiring Azure Data Scientists, outlining the essential skills, certifications, and steps to build a successful career in Data Science using Microsoft Azure. Storage Solutions: Secure and scalable storage options like Azure Blob Storage and Azure Data Lake Storage.
In media and gaming: designing game storylines, scripts, auto-generated blogs, articles and tweets, and grammar corrections and text formatting. Data preparation, train and tune, deploy and monitor. We have data pipelines and data preparation. It can cover the gamut.
This blog was originally written by Erik Hyrkas and updated for 2024 by Justin Delisi This isn’t meant to be a technical how-to guide — most of those details are readily available via a quick Google search — but rather an opinionated review of key processes and potential approaches.
Placing functions for plotting, data loading, data preparation, and implementations of evaluation metrics in plain Python modules keeps a Jupyter notebook focused on the exploratory analysis (Source: Author). Using SQL directly in Jupyter cells: There are some cases in which data is not in memory (e.g.,
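The keep-helpers-in-modules pattern above can be sketched as follows. To stay self-contained, this example writes the module file from code; in practice a module such as `prep_utils.py` (a hypothetical name) would simply live next to the notebook and be imported.

```python
import importlib
import sys
from pathlib import Path

# Write a small helper module to disk (only so this sketch is self-contained).
Path("prep_utils.py").write_text(
    "def drop_missing(rows, key):\n"
    '    """Data-preparation helper: keep rows where `key` is present."""\n'
    "    return [r for r in rows if r.get(key) is not None]\n"
)

# Import it the way a notebook would: `import prep_utils`.
sys.path.insert(0, str(Path.cwd()))
prep_utils = importlib.import_module("prep_utils")

# The notebook cell itself now stays focused on the analysis:
rows = [{"price": 10}, {"price": None}, {"price": 25}]
clean = prep_utils.drop_missing(rows, "price")
print(len(clean))
```

The payoff is that the helper can be unit-tested and reused across notebooks, while each notebook cell reads as analysis rather than plumbing.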
Also consider using Amazon Security Lake to automatically centralize security data from AWS environments, SaaS providers, on premises, and cloud sources into a purpose-built data lake stored in your account.
The pipelines are interoperable to build a working system: Data (input) pipeline (data acquisition and feature management steps) This pipeline transports raw data from one location to another. Model/training pipeline This pipeline trains one or more models on the training data with preset hyperparameters.
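The two pipelines described above can be sketched in miniature (all names and hyperparameters here are hypothetical): a data pipeline that acquires raw records and manages features, feeding a training pipeline that fits a model with preset hyperparameters.

```python
def data_pipeline(raw_records):
    """Data (input) pipeline: extract clean (x, y) feature pairs from raw records."""
    return [(float(r["x"]), float(r["y"])) for r in raw_records if "x" in r and "y" in r]

def training_pipeline(examples, lr=0.1, epochs=200):
    """Model/training pipeline: fit y ~ w * x by gradient descent."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
        w -= lr * grad
    return w

# The pipelines interoperate: the data pipeline's output is the training input.
raw = [{"x": "1", "y": "2"}, {"x": "2", "y": "4"}, {"junk": True}]
model_w = training_pipeline(data_pipeline(raw))
print(round(model_w, 2))
```

Because the stages are separate functions with a clear interface (a list of examples), either pipeline can be swapped out or rerun independently, which is the point of the decomposition.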
Importing data from the SageMaker Data Wrangler flow allows you to interact with a sample of the data before scaling the data preparation flow to the full dataset. This improves time and performance because you don’t need to work with the entirety of the data during preparation.
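The sample-first idea is tool-agnostic and easy to show in plain Python (the dataset and the preparation step below are hypothetical stand-ins, not Data Wrangler's API): design the preparation logic against a small sample for fast feedback, then apply the same function to the full dataset.

```python
import random

random.seed(7)
full_dataset = [{"amount": i, "currency": "usd" if i % 3 else "USD"} for i in range(10_000)]

def prepare(rows):
    """Example preparation step: normalize the currency column."""
    return [{**r, "currency": r["currency"].upper()} for r in rows]

sample = random.sample(full_dataset, k=100)   # fast feedback while designing the flow
prepared_sample = prepare(sample)             # validate the logic on the sample...
prepared_full = prepare(full_dataset)         # ...then scale it to the entire dataset
print(len(prepared_sample), len(prepared_full))
```

Because `prepare` is the same function in both calls, anything validated on the sample transfers directly to the full run.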
Data lakes, while useful in helping you to capture all of your data, are only the first step in extracting the value of that data. We recently announced an integration with Trifacta to seamlessly integrate the Alation Data Catalog with self-service data prep applications to help you solve this issue.
This highlights the two companies’ shared vision on self-service data discovery with an emphasis on collaboration and data governance. 2) When data becomes information, many (incremental) use cases surface.