Data Lakes, Database and ML - Data Science Current

What Is a Lakebase?

databricks

JUNE 11, 2025

Published: June 11, 2025 Announcements 5 min read by Ali Ghodsi , Stas Kelvich , Heikki Linnakangas , Nikita Shamgunov , Arsalan Tavakoli-Shiraji , Patrick Wendell , Reynold Xin and Matei Zaharia Share this post Keep up with us Subscribe Summary Operational databases were not designed for today’s AI-driven applications.

Database

Database Data Lakes ETL Analytics

Governing ML lifecycle at scale: Best practices to set up cost and usage visibility of ML workloads in multi-account environments

AWS Machine Learning Blog

NOVEMBER 14, 2024

By setting up automated policy enforcement and checks, you can achieve cost optimization across your machine learning (ML) environment. The following table provides examples of a tagging dictionary used for tagging ML resources. A reference architecture for the ML platform with various AWS services is shown in the following diagram.

ML

ML ML AWS Machine Learning

Streaming Machine Learning Without a Data Lake

ODSC - Open Data Science

MAY 31, 2023

Be sure to check out his talk, “ Apache Kafka for Real-Time Machine Learning Without a Data Lake ,” there! The combination of data streaming and machine learning (ML) enables you to build one scalable, reliable, but also simple infrastructure for all machine learning tasks using the Apache Kafka ecosystem.

Data Lakes

Data Lakes Machine Learning Machine Learning Apache Kafka

Webinars

What’s New in Apache Airflow® 3.0—And How Will It Reshape Your Data Workflows?

MORE WEBINARS

Build a financial research assistant using Amazon Q Business and Amazon QuickSight for generative AI–powered insights

Flipboard

MAY 14, 2025

Their information is split between two types of data: unstructured data (such as PDFs, HTML pages, and documents) and structured data (such as databases, data lakes, and real-time reports). Different types of data typically require different tools to access them.

AWS

AWS AI AI Database

Search enterprise data assets using LLMs backed by knowledge graphs

Flipboard

NOVEMBER 27, 2024

The ingestion pipeline (3) ingests metadata (1) from services (2), including Amazon DataZone, AWS Glue, and Amazon Athena , to a Neptune database after converting the JSON response from the service APIs into an RDF triple format. For more details about RDF data format, refer to the W3C documentation. raw_customer". account } WHERE { ?asset

AWS

AWS Database ML ML

Unstructured data management and governance using AWS AI/ML and analytics services

Flipboard

OCTOBER 25, 2023

After decades of digitizing everything in your enterprise, you may have an enormous amount of data, but with dormant value. However, with the help of AI and machine learning (ML), new software tools are now available to unearth the value of unstructured data. These services write the output to a data lake.

AWS

AWS ML ML Analytics

Accelerating AI/ML development at BMW Group with Amazon SageMaker Studio

Flipboard

NOVEMBER 24, 2023

With that, the need for data scientists and machine learning (ML) engineers has grown significantly. Data scientists and ML engineers require capable tooling and sufficient compute for their work. Data scientists and ML engineers require capable tooling and sufficient compute for their work.

ML

ML ML AWS AI

ML architecture

Dataconomy

MAY 6, 2025

ML architecture forms the backbone of any effective machine learning system, shaping how it processes data and learns from it. Understanding the various components of ML architecture can empower organizations to design better systems that can adapt to evolving needs. What is ML architecture?

ML

ML ML Machine Learning Machine Learning

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

Flipboard

DECEMBER 11, 2024

Many of these applications are complex to build because they require collaboration across teams and the integration of data, tools, and services. Data engineers use data warehouses, data lakes, and analytics tools to load, transform, clean, and aggregate data. Choose Continue.

SQL

SQL AWS Data Lakes AI

Your guide to generative AI and ML at AWS re:Invent 2023

AWS Machine Learning Blog

NOVEMBER 22, 2023

Now all you need is some guidance on generative AI and machine learning (ML) sessions to attend at this twelfth edition of re:Invent. In addition to several exciting announcements during keynotes, most of the sessions in our track will feature generative AI in one form or another, so we can truly call our track “Generative AI and ML.”

AWS

AWS ML ML AI

Build ML features at scale with Amazon SageMaker Feature Store using data from Amazon Redshift

Flipboard

AUGUST 17, 2023

Amazon Redshift is the most popular cloud data warehouse that is used by tens of thousands of customers to analyze exabytes of data every day. SageMaker Studio is the first fully integrated development environment (IDE) for ML. The next step is to build ML models using features selected from one or multiple feature groups.

ML

ML ML AWS Data Warehouse

Architect a mature generative AI foundation on AWS

Flipboard

MAY 30, 2025

A generative AI foundation can provide primitives such as models, vector databases, and guardrails as a service and higher-level services for defining AI workflows, agents and multi-agents, tools, and also a catalog to encourage reuse. Considerations here are choice of vector database, optimizing indexing pipelines, and retrieval strategies.

AWS

AWS AI AI Database

Query structured data from Amazon Q Business using Amazon QuickSight integration

AWS Machine Learning Blog

DECEMBER 3, 2024

Although generative AI is fueling transformative innovations, enterprises may still experience sharply divided data silos when it comes to enterprise knowledge, in particular between unstructured content (such as PDFs, Word documents, and HTML pages), and structured data (real-time data and reports stored in databases or data lakes).

AWS

AWS Database Data Silos Data Lakes

Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view

Flipboard

JUNE 26, 2023

Companies are faced with the daunting task of ingesting all this data, cleansing it, and using it to provide outstanding customer experience. Typically, companies ingest data from multiple sources into their data lake to derive valuable insights from the data. Run the AWS Glue ML transform job.

AWS

AWS ML ML ETL

How Twilio generated SQL using Looker Modeling Language data with Amazon Bedrock

AWS Machine Learning Blog

AUGUST 8, 2024

As one of the largest AWS customers, Twilio engages with data, artificial intelligence (AI), and machine learning (ML) services to run their daily workloads. Data is the foundational layer for all generative AI and ML applications. The following diagram illustrates the solution architecture.

SQL

SQL Data Lakes Data Analyst AWS

Unlock the power of data governance and no-code machine learning with Amazon SageMaker Canvas and Amazon DataZone

AWS Machine Learning Blog

AUGUST 21, 2024

Amazon DataZone is a data management service that makes it quick and convenient to catalog, discover, share, and govern data stored in AWS, on-premises, and third-party sources. Enterprises can use no-code ML solutions to streamline their operations and optimize their decision-making without extensive administrative overhead.

Machine Learning

Machine Learning Machine Learning Data Governance ML

Retrieval-Augmented Generation with LangChain, Amazon SageMaker JumpStart, and MongoDB Atlas semantic search

Flipboard

NOVEMBER 17, 2023

Generative AI models have the potential to revolutionize enterprise operations, but businesses must carefully consider how to harness their power while overcoming challenges such as safeguarding data and ensuring the quality of AI-generated content. Amazon SageMaker enables enterprises to build, train, and deploy machine learning (ML) models.

K-nearest Neighbors

K-nearest Neighbors AWS Clustering Database

Tackling AI’s data challenges with IBM databases on AWS

IBM Journey to AI blog

MARCH 14, 2024

The existence of data silos and duplication, alongside apprehensions regarding data quality, presents a multifaceted environment for organizations to manage. Also, traditional database management tasks, including backups, upgrades and routine maintenance drain valuable time and resources, hindering innovation.

AWS

AWS Database ETL AI

Use Amazon SageMaker Canvas to build machine learning models using Parquet data from Amazon Athena and AWS Lake Formation

AWS Machine Learning Blog

JUNE 5, 2023

Data is the foundation for machine learning (ML) algorithms. One of the most common formats for storing large amounts of data is Apache Parquet due to its compact and highly efficient format. Athena allows applications to use standard SQL to query massive amounts of data on an S3 data lake.

Machine Learning

Machine Learning Machine Learning AWS Data Lakes

AWS re:Invent 2023 Amazon Redshift Sessions Recap

Flipboard

DECEMBER 18, 2023

Customers use Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning (ML), data sharing and monetization, and more. Discover how you can use Amazon Redshift to build a data mesh architecture to analyze your data.

AWS

AWS Data Warehouse ETL SQL

How Thomson Reuters built an AI platform using Amazon SageMaker to accelerate delivery of ML projects

AWS Machine Learning Blog

JANUARY 13, 2023

Since then, TR has achieved many more milestones as its AI products and services are continuously growing in number and variety, supporting legal, tax, accounting, compliance, and news service professionals worldwide, with billions of machine learning (ML) insights generated every year. The challenges. Solution overview.

ML

ML ML AWS Data Scientist

Data Engineering for IoT Applications: Unleashing the Power of the Internet of Things

Data Science Connect

JULY 28, 2023

Data Collection and Integration Data engineers are responsible for designing robust data collection systems that gather information from various IoT devices and sensors. This data is then integrated into centralized databases for further processing and analysis.

Internet of Things

Internet of Things Data Engineering Data Engineer Data Engineering

MLOps and DevOps: Why Data Makes It Different

O'Reilly Media

OCTOBER 19, 2021

As with many burgeoning fields and disciplines, we don’t yet have a shared canonical infrastructure stack or best practices for developing and deploying data-intensive applications. What does a modern technology stack for streamlined ML processes look like? Why: Data Makes It Different. All ML projects are software projects.

ML

ML ML Data Scientist AWS

MLOps Landscape in 2023: Top Tools and Platforms

The MLOps Blog

JUNE 27, 2023

Alignment to other tools in the organization’s tech stack Consider how well the MLOps tool integrates with your existing tools and workflows, such as data sources, data engineering platforms, code repositories, CI/CD pipelines, monitoring systems, etc. and Pandas or Apache Spark DataFrames.

Machine Learning

Machine Learning Machine Learning ML ML

Best 8 Data Version Control Tools for Machine Learning 2024

DagsHub

DECEMBER 11, 2023

DVC Released in 2017, Data Version Control ( DVC for short) is an open-source tool created by iterative. DVC can be used for versioning data and models, to track experiments and compare any data, code, parameters models and graphical plots of performance. DVC can efficiently handle large files and machine learning models.

Machine Learning

Machine Learning Machine Learning Data Lakes Data Science

Generative AI operating models in enterprise organizations with Amazon Bedrock

AWS Machine Learning Blog

JANUARY 29, 2025

However, even in a decentralized model, often LOBs must align with central governance controls and obtain approvals from the CCoE team for production deployment, adhering to global enterprise standards for areas such as access policies, model risk management, data privacy, and compliance posture, which can introduce governance complexities.

AWS

AWS AI AI Database

Why companies need to accelerate data warehousing solution modernization

IBM Journey to AI blog

APRIL 24, 2023

Data warehouse is the base architecture for artificial intelligence and machine learning (AI/ML) solutions as well. Benefits of new data warehousing technology Everything is data, regardless of whether it’s structured, semi-structured, or unstructured.

Data Warehouse

Data Warehouse Data Lakes Database Big Data

10 Things AWS Can Do for Your SaaS Company

Smart Data Collective

FEBRUARY 20, 2022

Data storage databases. Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (S3), which is ideal for data lakes, cloud-native applications, and mobile apps. This article finally gets to the core question we started with: what can AWS do for your SaaS business?

AWS

AWS Cloud Computing Data Lakes Database

Generating value from enterprise data: Best practices for Text2SQL and generative AI

AWS Machine Learning Blog

JANUARY 4, 2024

To do this, the text input is transformed into a structured representation, and from this representation, a SQL query that can be used to access a database is created. The primary goal of Text2SQL is to make querying databases more accessible to non-technical users, who can provide their queries in natural language. gymnast_id = t2.

SQL

SQL Database AI AI

Accelerating time-to-insight with MongoDB time series collections and Amazon SageMaker Canvas

AWS Machine Learning Blog

DECEMBER 18, 2023

In this post, we will explore the potential of using MongoDB’s time series data and SageMaker Canvas as a comprehensive solution. MongoDB Atlas MongoDB Atlas is a fully managed developer data platform that simplifies the deployment and scaling of MongoDB databases in the cloud. Setup the Database access and Network access.

Clustering

Clustering AWS Database ML

Generate financial industry-specific insights using generative AI and in-context fine-tuning

AWS Machine Learning Blog

NOVEMBER 12, 2024

The following question requires complex industry knowledge-based analysis of data from multiple columns in the ETF database. In entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.

SQL

SQL AWS AI AI

FMOps/LLMOps: Operationalize generative AI and differences with MLOps

AWS Machine Learning Blog

SEPTEMBER 1, 2023

ML operationalization summary As defined in the post MLOps foundation roadmap for enterprises with Amazon SageMaker , ML and operations (MLOps) is the combination of people, processes, and technology to productionize machine learning (ML) solutions efficiently.

AI

AI AI ML ML

Top 5 Tools for Building an Interactive Analytics App

Smart Data Collective

OCTOBER 27, 2021

Accessing data stored on Google BigQuery is secured with default and customer-managed encryption keys, and you can easily share any business intelligence insight derived from such data with teams and members of your organization with a few clicks. Druid is a real-time analytics database from Apache. Conclusion.

Analytics

Analytics Analytics Data Warehouse Business Intelligence

Azure Machine Learning – Empowering Your Data Science Journey

How to Learn Machine Learning

MAY 2, 2025

Azure Machine Learning is Microsoft’s enterprise-grade service that provides a comprehensive environment for data scientists and ML engineers to build, train, deploy, and manage machine learning models at scale. You can explore its capabilities through the official Azure ML Studio documentation. Awesome, right?

Azure

Azure Machine Learning Machine Learning Data Science

11 Open Source Data Exploration Tools You Need to Know in 2023

ODSC - Open Data Science

FEBRUARY 24, 2023

There are many well-known libraries and platforms for data analysis such as Pandas and Tableau, in addition to analytical databases like ClickHouse, MariaDB, Apache Druid, Apache Pinot, Google BigQuery, Amazon RedShift, etc. This tool automatically detects problems in an ML dataset.

Exploratory Data Analysis

Exploratory Data Analysis Data Visualization Data Analysis Data Analysis

Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis

AWS Machine Learning Blog

APRIL 18, 2023

Amazon Monitron is an end-to-end condition monitoring solution that enables you to start monitoring equipment health with the aid of machine learning (ML) in minutes, so you can implement predictive maintenance and reduce unplanned downtime. For the detailed Amazon Monitron installation guide, refer to Getting started with Amazon Monitron.

AWS

AWS ML ML Database

How to Version Control Data in ML for Various Data Sources

The MLOps Blog

JANUARY 23, 2023

However, there are some key differences that we need to consider: Size and complexity of the data In machine learning, we are often working with much larger data. Basically, every machine learning project needs data. First of all, machine learning engineers and data scientists often use data from different data vendors.

ML

ML ML Data Lakes Machine Learning

Improving air quality with generative AI

AWS Machine Learning Blog

JUNE 18, 2024

The solution harnesses the capabilities of generative AI, specifically Large Language Models (LLMs), to address the challenges posed by diverse sensor data and automatically generate Python functions based on various data formats. This allows for data to be aggregated for further manufacturer-agnostic analysis.

AWS

AWS Python AI AI

Apply fine-grained data access controls with AWS Lake Formation in Amazon SageMaker Data Wrangler

AWS Machine Learning Blog

AUGUST 21, 2023

Amazon SageMaker Data Wrangler reduces the time it takes to collect and prepare data for machine learning (ML) from weeks to minutes. Data is frequently kept in data lakes that can be managed by AWS Lake Formation , giving you the ability to implement fine-grained access control using a straightforward grant or revoke procedure.

AWS

AWS Data Lakes Clustering Data Preparation

Reinventing the data experience: Use generative AI and modern data architecture to unlock insights

AWS Machine Learning Blog

JUNE 13, 2023

The combination of large language models (LLMs), including the ease of integration that Amazon Bedrock offers, and a scalable, domain-oriented data infrastructure positions this as an intelligent method of tapping into the abundant information held in various analytics databases and data lakes.

Database

Database SQL AWS AI

Anomaly detection in streaming time series data with online learning using Amazon Managed Service for Apache Flink

AWS Machine Learning Blog

SEPTEMBER 11, 2024

In this post, we demonstrate how to build a robust real-time anomaly detection solution for streaming time series data using Amazon Managed Service for Apache Flink and other AWS managed services. This solution employs machine learning (ML) for anomaly detection, and doesn’t require users to have prior AI expertise.

AWS

AWS ML ML Apache Kafka

How to Build ETL Data Pipeline in ML

The MLOps Blog

MAY 17, 2023

From data processing to quick insights, robust pipelines are a must for any ML system. Often the Data Team, comprising Data and ML Engineers , needs to build this infrastructure, and this experience can be painful. However, efficient use of ETL pipelines in ML can help make their life much easier.

ETL

ETL Data Pipeline ML ML

Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

AWS Machine Learning Blog

FEBRUARY 28, 2024

Structured Query Language (SQL) is a complex language that requires an understanding of databases and metadata. The solution in this post aims to bring enterprise analytics operations to the next level by shortening the path to your data using natural language. This table is used for finding the correct table, database, and attributes.

SQL

SQL AWS Database ML

Cloud Data Science News – Beta 6

Data Science 101

DECEMBER 16, 2019

Azure Data Factory Preserves Metadata during File Copy When performing a File copy between Amazon S3, Azure Blob, and Azure Data Lake Gen 2, the metadata will be copied as well. Azure Database for MySQL now supports MySQL 8.0 Data Labeling in Azure ML Studio. Not a huge update but still a nice feature.

Cloud Data

Cloud Data Data Science Azure Natural Language Processing

What Is a Lakebase?

Governing ML lifecycle at scale: Best practices to set up cost and usage visibility of ML workloads in multi-account environments

Webinars

Trending Sources

Streaming Machine Learning Without a Data Lake

Webinars

Build a financial research assistant using Amazon Q Business and Amazon QuickSight for generative AI–powered insights

Search enterprise data assets using LLMs backed by knowledge graphs

Unstructured data management and governance using AWS AI/ML and analytics services

Accelerating AI/ML development at BMW Group with Amazon SageMaker Studio

ML architecture

An integrated experience for all your data and AI with Amazon SageMaker Unified Studio (preview)

Your guide to generative AI and ML at AWS re:Invent 2023

Build ML features at scale with Amazon SageMaker Feature Store using data from Amazon Redshift

Architect a mature generative AI foundation on AWS

Query structured data from Amazon Q Business using Amazon QuickSight integration

Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view

How Twilio generated SQL using Looker Modeling Language data with Amazon Bedrock

Unlock the power of data governance and no-code machine learning with Amazon SageMaker Canvas and Amazon DataZone

Retrieval-Augmented Generation with LangChain, Amazon SageMaker JumpStart, and MongoDB Atlas semantic search

Tackling AI’s data challenges with IBM databases on AWS

Use Amazon SageMaker Canvas to build machine learning models using Parquet data from Amazon Athena and AWS Lake Formation

AWS re:Invent 2023 Amazon Redshift Sessions Recap

How Thomson Reuters built an AI platform using Amazon SageMaker to accelerate delivery of ML projects

Data Engineering for IoT Applications: Unleashing the Power of the Internet of Things

MLOps and DevOps: Why Data Makes It Different

MLOps Landscape in 2023: Top Tools and Platforms

Best 8 Data Version Control Tools for Machine Learning 2024

Generative AI operating models in enterprise organizations with Amazon Bedrock

Why companies need to accelerate data warehousing solution modernization

10 Things AWS Can Do for Your SaaS Company

Generating value from enterprise data: Best practices for Text2SQL and generative AI

Accelerating time-to-insight with MongoDB time series collections and Amazon SageMaker Canvas

Generate financial industry-specific insights using generative AI and in-context fine-tuning

FMOps/LLMOps: Operationalize generative AI and differences with MLOps

Top 5 Tools for Building an Interactive Analytics App

Azure Machine Learning – Empowering Your Data Science Journey

11 Open Source Data Exploration Tools You Need to Know in 2023

Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis

How to Version Control Data in ML for Various Data Sources

Improving air quality with generative AI

Apply fine-grained data access controls with AWS Lake Formation in Amazon SageMaker Data Wrangler

Reinventing the data experience: Use generative AI and modern data architecture to unlock insights

Anomaly detection in streaming time series data with online learning using Amazon Managed Service for Apache Flink

How to Build ETL Data Pipeline in ML

Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources

Cloud Data Science News – Beta 6

Stay Connected