Introduction: Data pipelines play a critical role in the processing and management of data in modern organizations. A well-designed data pipeline can help organizations extract valuable insights from their data, automate tedious manual processes, and ensure the accuracy of data processing.
As you delve into the landscape of MLOps in 2023, you will find a plethora of tools and platforms shaping the way models are developed, deployed, and monitored. Open-source tools in particular have gained significant traction due to their flexibility, community support, and adaptability to various workflows.
There are many well-known libraries and platforms for data analysis, such as Pandas and Tableau, in addition to analytical databases like ClickHouse, MariaDB, Apache Druid, Apache Pinot, Google BigQuery, Amazon Redshift, etc. VisiData works with CSV files, Excel spreadsheets, SQL databases, and many other data sources.
As today’s world keeps progressing towards data-driven decisions, organizations must have quality data created from efficient and effective data pipelines. For Snowflake customers, Snowpark is a powerful tool for building these effective and scalable data pipelines.
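To make the Snowpark idea concrete, here is a minimal sketch of a pipeline step written against the Snowpark Python API; the table names, column names, and connection parameters are placeholders, not details from the article.

```python
# A minimal Snowpark pipeline sketch; credentials and table/column names are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Connection parameters are placeholders; supply your own account details.
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Transform data where it lives: filter, aggregate, and write back to a table.
orders = session.table("RAW_ORDERS")
daily_revenue = (
    orders.filter(col("STATUS") == "COMPLETED")
          .group_by(col("ORDER_DATE"))
          .agg(sum_(col("AMOUNT")).alias("TOTAL_REVENUE"))
)
daily_revenue.write.save_as_table("DAILY_REVENUE", mode="overwrite")
```

The key point of the design is that the filtering and aggregation are pushed down to Snowflake's engine rather than pulling raw data into the client.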
Last Updated on March 21, 2023 by Editorial Team. Author(s): Data Science meets Cyber Security. Originally published on Towards AI. Navigating the World of Data Engineering: A Beginner’s Guide. [Image: A glimpse of data engineering, source: by author] Data or data? What are ETL and data pipelines?
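As a rough illustration of what ETL means in practice, here is a tiny, hypothetical extract-transform-load sketch in Python; the file, column, and table names are made up for the example.

```python
# Minimal ETL sketch: extract -> transform -> load. File/column names are hypothetical.
import sqlite3
import pandas as pd

# Extract: read raw records from a CSV file.
raw = pd.read_csv("sensor_readings.csv")

# Transform: drop incomplete rows and normalise the temperature unit.
clean = raw.dropna(subset=["temperature_f"])
clean["temperature_c"] = (clean["temperature_f"] - 32) * 5 / 9

# Load: write the cleaned records into a SQLite table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sensor_readings", conn, if_exists="replace", index=False)
```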
This orchestration process encompasses interactions with external APIs, retrieval of contextual data from vector databases, and maintaining memory across multiple LLM calls. This makes it easy to connect your data pipeline to the data sources that you need. It is known for its extensibility and modularity.
Hidden Technical Debt in Machine Learning Systems. More money, more problems: the rise of too many ML tools, 2012 vs 2023 (source: Matt Turck). People often believe that money is the solution to a problem. A feature platform should automatically process the data pipelines to calculate that feature (e.g., using Spark, Flink, etc.).
Image Source — Pixel Production Inc. In the previous article, you were introduced to the intricacies of data pipelines, including the two major types of existing data pipelines. You might be curious how a simple tool like Apache Airflow can be powerful for managing complex data pipelines.
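For a sense of how Airflow expresses a pipeline, here is a minimal DAG sketch (written for a recent Airflow 2.x release); the task logic, schedule, and IDs are placeholders rather than anything from the article.

```python
# A minimal Airflow DAG sketch; the task logic and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data from the source system")


def transform():
    print("cleaning and reshaping the extracted data")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Airflow expresses dependencies between tasks with the >> operator.
    extract_task >> transform_task
```

Each task is a node in the DAG, and the `>>` operator declares that `transform` should run only after `extract` succeeds.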
(Y Combinator Photo) Seattle-area startups that just graduated from Y Combinator’s summer 2023 batch are tackling a wide range of problems — with plenty of help from artificial intelligence. Neum AI at its core is an enabler for generative AI applications by helping connect data into vector databases and making it accessible for RAG.
In 2023 and beyond, we expect the open source trend to continue, with steady growth in the adoption of tools like Feilong, Tessla, Consolez, and Zowe. In 2023, expect to see broader adoption of streaming data pipelines that bring mainframe data to the cloud, offering a powerful tool for “modernizing in place.”
billion in 2023 to USD 1,266.4. Defining Cloud Computing in Data Science: Cloud computing provides on-demand access to computing resources such as servers, storage, databases, and software over the Internet. Key Features Tailored for Data Science: These platforms offer specialised features to enhance productivity.
Operational Risks: identify operational risks such as data loss or failures in the event of an unforeseen outage or disaster. Performance Optimization: identify and fix bottlenecks in your data pipelines so that you can get the most out of your Snowflake investment.
Last Updated on March 1, 2023 by Editorial Team. Author(s): Samuel Van Ackere. Originally published on Towards AI. This article shows how to effortlessly insert sensor data in the form of an LDES into a TimescaleDB database. First, a data flow must be configured to ingest a Linked Data Event Stream into PostgreSQL.
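As a rough sketch of that ingestion step, the snippet below inserts a sensor reading into a TimescaleDB hypertable with psycopg2; the connection string, table, and columns are assumptions for illustration, not the article's actual configuration.

```python
# Hedged sketch: write sensor readings into a TimescaleDB hypertable via psycopg2.
# Connection details, table name, and columns are assumptions, not from the article.
import psycopg2

conn = psycopg2.connect("dbname=sensors user=postgres password=postgres host=localhost")
with conn, conn.cursor() as cur:
    # Create a regular table, then promote it to a hypertable partitioned on time.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sensor_data (
            time        TIMESTAMPTZ NOT NULL,
            sensor_id   TEXT        NOT NULL,
            value       DOUBLE PRECISION
        );
    """)
    cur.execute("SELECT create_hypertable('sensor_data', 'time', if_not_exists => TRUE);")

    # Insert one reading; in practice this would run for each event in the stream.
    cur.execute(
        "INSERT INTO sensor_data (time, sensor_id, value) VALUES (now(), %s, %s);",
        ("sensor-42", 21.7),
    )
```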
Please spend a few minutes browsing the apps and tools available in the phData Toolkit today to set yourself up for success in 2023. The tool now runs on 8 threads, as opposed to the original single thread!
This allows you to perform tasks such as ensuring data quality against data sources (once or over time), comparing data metrics and metadata across environments, and creating/managing data pipelines for all your tables and views. Fixed an issue showing invalid timestamp/precision errors when scanning an Impala database.
ODSC West 2023 is just a couple of months away, and we couldn’t be more excited to be able to share our Preliminary Schedule with you! Day 1: Monday, October 30th (Bootcamp, VIP, Platinum). Day 1 of ODSC West 2023 will feature our hands-on training sessions, workshops, and tutorials and will be open to Platinum, Bootcamp, and VIP pass holders.
For the Data Source Tool, we’ve addressed the following: Fixed an issue where view filters wouldn’t be disabled when using enabled = false. Fixed an issue when filtering tables in a database where only the first table listed would be scanned.
This is commonly handled in code that pulls data from databases, but you can also do this within the SQL query itself. We encourage you to spend a few minutes browsing the apps and tools available in the phData Toolkit today to set yourself up for success in 2023.
Translate CATALOG_COLLATION in CREATE DATABASE. Add BOM-aware file reading so that files with a BOM are read with the specified encoding. We encourage you to spend a few minutes browsing the apps and tools available in the phData Toolkit today to set yourself up for success in 2023.
Project Structure · Creating Our Configuration File · Creating Our Data Pipeline · Preprocessing Faces: Detection and Cropping · Summary · Citation Information. Building a Dataset for Triplet Loss with Keras and TensorFlow: In today’s tutorial, we will take the first step toward building our real-time face recognition application. The dataset.py
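A simplified sketch of what such a data pipeline can look like with tf.data is shown below; the image paths, image size, and triplet lists are placeholders, and the tutorial's actual sampling logic is not reproduced here.

```python
# Simplified tf.data pipeline sketch for loading and preprocessing face-image triplets.
# Paths, image size, and the triplet lists are assumptions for illustration.
import tensorflow as tf

IMAGE_SIZE = (128, 128)


def load_image(path):
    # Read, decode, resize, and scale a single image to [0, 1].
    data = tf.io.read_file(path)
    image = tf.image.decode_jpeg(data, channels=3)
    image = tf.image.resize(image, IMAGE_SIZE)
    return image / 255.0


def load_triplet(anchor_path, positive_path, negative_path):
    return load_image(anchor_path), load_image(positive_path), load_image(negative_path)


# Anchor/positive/negative path lists would come from the dataset-building step.
anchors = ["faces/a_0.jpg"]
positives = ["faces/p_0.jpg"]
negatives = ["faces/n_0.jpg"]

dataset = (
    tf.data.Dataset.from_tensor_slices((anchors, positives, negatives))
    .map(load_triplet, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```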
Effective data governance enhances quality and security throughout the data lifecycle. What is Data Engineering? Data Engineering is the practice of designing, constructing, and managing systems that enable data collection, storage, and analysis. This section explores essential aspects of Data Engineering.
On December 6th-8th, 2023, the non-profit organization Tech to the Rescue, in collaboration with AWS, organized the world’s largest Air Quality Hackathon, aimed at tackling one of the world’s most pressing health and environmental challenges: air pollution. This allows for data to be aggregated for further manufacturer-agnostic analysis.
A complete overview revealing a diverse range of strengths and weaknesses for each data versioning tool. It does not support the ‘dvc repro’ command to reproduce its data pipeline. DVC: Released in 2017, Data Version Control (DVC for short) is an open-source tool created by Iterative.
In the previous tutorial of this series, we built the dataset and data pipeline for our Siamese Network-based Face Recognition application. Specifically, we looked at an overview of triplet loss and discussed what kind of data samples are required to train our model with the triplet loss. What's next? Raha, and A. Thanki, eds.,
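For reference, a compact version of the triplet loss that such samples are meant to optimize might look like the following; the margin value is an arbitrary choice for illustration.

```python
# Compact triplet loss sketch in TensorFlow: pull anchors toward positives and
# away from negatives by at least `margin`. The margin value is an arbitrary choice.
import tensorflow as tf


def triplet_loss(anchor, positive, negative, margin=0.5):
    # Squared Euclidean distances between embeddings.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Hinge on the distance gap; zero loss once the gap exceeds the margin.
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```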
More on this topic later, but for now, keep in mind that the simplest method is to create a naming convention for database objects that allows you to identify the owner and associated budget. The extended period will allow you to perform Time Travel activities, such as undropping tables or comparing new data against historical values.
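As a hedged illustration of those Time Travel activities, the snippet below runs the relevant SQL through the Snowflake Python connector; the table names, retention period, and credentials are placeholders.

```python
# Hedged sketch of Snowflake Time Travel via the Python connector.
# Table names, retention period, and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Extend the retention window so Time Travel can reach further back.
cur.execute("ALTER TABLE SALES SET DATA_RETENTION_TIME_IN_DAYS = 30")

# Compare current rows against the table as it looked one hour ago.
cur.execute("SELECT COUNT(*) FROM SALES AT(OFFSET => -3600)")

# Recover a table that was dropped within the retention period.
cur.execute("UNDROP TABLE SALES_STAGING")
```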
AI caught everyone’s attention in 2023 with Large Language Models (LLMs) that can be instructed to perform general tasks, such as translation or coding, just by prompting. AI applications have always required careful monitoring of both model outputs and data pipelines to run reliably.
Modin empowers practitioners to use pandas on data at scale, without requiring them to change a single line of code. Modin leverages our cutting-edge academic research on dataframes — the abstraction underlying pandas — to bring the best of databases and distributed systems to dataframes. Run operations in pandas - all in Snowflake!
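The drop-in nature of Modin can be illustrated with a tiny sketch; the CSV path and column names are hypothetical.

```python
# Modin's drop-in usage: swap the pandas import and keep the rest of the code unchanged.
# The CSV path and columns are hypothetical.
import modin.pandas as pd

df = pd.read_csv("large_dataset.csv")            # distributed read instead of single-core
summary = df.groupby("region")["revenue"].sum()  # same pandas API, scaled out
print(summary.head())
```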
Context: In early 2023, Zeta’s machine learning (ML) teams shifted from traditional vertical teams to a more dynamic horizontal structure, introducing the concept of pods comprising diverse skill sets. Additionally, Feast promotes feature reuse, so the time spent on data preparation is greatly reduced.
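As a small illustration of feature reuse with Feast, the sketch below fetches previously defined features at inference time; the repository path, feature view, feature names, and entity key are assumptions for the example.

```python
# Hedged Feast sketch: reuse registered features at inference time.
# Repo path, feature view, feature names, and entity key are assumptions.
from feast import FeatureStore

store = FeatureStore(repo_path=".")

features = store.get_online_features(
    features=[
        "user_stats:purchase_count_30d",
        "user_stats:avg_order_value",
    ],
    entity_rows=[{"user_id": 1001}],
).to_dict()

print(features)
```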
This article was co-written by Lawrence Liu & Safwan Islam. While the title ‘Machine Learning Engineer’ may sound more prestigious than ‘Data Engineer’ to some, the reality is that these roles share a significant overlap. Generative AI has unlocked the value of unstructured text-based data.
What is Apache Kafka, and How is it Used in Building Real-time Data Pipelines? Apache Kafka is an open-source event distribution platform that is highly scalable, highly available, and low-latency, and it is capable of handling high-volume and high-velocity data. Example: openssl rsa -in C:\tmp\new_rsa_key_v1.p8
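To illustrate the producer side of such a pipeline, here is a minimal sketch using the kafka-python client; the broker address, topic name, and message fields are placeholders.

```python
# Minimal kafka-python producer sketch; broker address and topic name are placeholders.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one event; downstream consumers read it from the "sensor-events" topic.
producer.send("sensor-events", {"sensor_id": "s-17", "temperature_c": 21.4})
producer.flush()
```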
Managing data pipelines efficiently is paramount for any organization. The Snowflake Data Cloud has introduced a groundbreaking feature that promises to simplify and supercharge this process: Snowflake Dynamic Tables. What are Snowflake Dynamic Tables?
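A hedged sketch of what defining a Dynamic Table can look like from Python is shown below; the source table, target lag, warehouse, and connection details are placeholders, not the article's example.

```python
# Hedged sketch: define a Snowflake Dynamic Table through Snowpark's session.sql().
# Source table, lag target, warehouse, and connection details are placeholders.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account_identifier>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# The table refreshes itself to stay within the declared TARGET_LAG of its sources.
session.sql("""
    CREATE OR REPLACE DYNAMIC TABLE DAILY_REVENUE
      TARGET_LAG = '5 minutes'
      WAREHOUSE = TRANSFORM_WH
    AS
      SELECT ORDER_DATE, SUM(AMOUNT) AS TOTAL_REVENUE
      FROM RAW_ORDERS
      GROUP BY ORDER_DATE
""").collect()
```

The declared TARGET_LAG tells Snowflake how stale the table is allowed to become relative to its sources, and refreshes are handled automatically.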
Data observability is a key element of data operations (DataOps). It enables a big-picture understanding of the health of your organization’s data through continuous AI/ML-enabled monitoring, detecting anomalies throughout the data pipeline and preventing data downtime.
In a survey conducted in 2023, over three-quarters of the executives surveyed believed that artificial intelligence would disrupt their business strategy. In theory, if two vector embeddings are close to one another in vector space, then the underlying data the vectors represent are semantically similar.
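To make "close in vector space" concrete, here is a tiny cosine-similarity example; the vectors are made up, and real embeddings would come from an embedding model.

```python
# Tiny illustration of vector closeness: cosine similarity between embeddings.
# The example vectors are made up; real embeddings come from an embedding model.
import numpy as np


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.85, 0.15, 0.35])
invoice = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(cat, kitten))   # high similarity -> semantically related
print(cosine_similarity(cat, invoice))  # lower similarity -> semantically distant
```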
In data modeling, dbt has gradually emerged as a powerful tool that greatly simplifies the process of building and handling data pipelines. dbt is an open-source command-line tool that allows data engineers to transform, test, and document data in a single hub, following software engineering best practices.
You can watch the full talk this blog post is based on, which took place at ODSC West 2023, here. Feedback: collect production data, metadata, and metrics to tune the model and application further, and to enable governance and explainability. The importance of data pipelines lies in the fact that they improve data quality.
However, there are some key differences that we need to consider. Size and complexity of the data: In machine learning, we are often working with much larger data. Basically, every machine learning project needs data. First of all, machine learning engineers and data scientists often use data from different data vendors.
To configure Salesforce and Snowflake using the Sync Out connector, follow these steps. Step 1: Create Snowflake Objects. To use Sync Out with Snowflake, you need to configure the following Snowflake objects appropriately in your Snowflake account: a database and schema that will be used for the Salesforce data.
We hope you’ve had a fantastic holiday season, filled up on delicious food, and are as excited as us to kick off the 2023 calendar year. Traditionally, database administrators (DBAs) would run scripts that were manually generated through each environment to make changes to the database. But what does this actually mean?
In this blog, we’ll explore how Matillion Jobs can simplify the data transformation process by allowing users to visualize the data flow of a job from start to finish. Step 1 - Source Inputs: For our example, our data will come from two table inputs for “cities” and “orders”. Database: Source database of the table.
A feature store is a data platform that supports the creation and use of feature data throughout the lifecycle of an ML model, from creating features that can be reused across many models to model training to model inference (making predictions). It can also transform incoming data on the fly.
According to the 2023 Data Integrity Trends and Insights Report, data quality is the #1 barrier to achieving data integrity. And poor address quality is the top challenge preventing business leaders from effectively using location data to add context and multidimensional value to their decision-making processes.
We launched Predictoor and its Data Farming incentives in September and November 2023, respectively. Flows: We released pdr-backend when we launched Predictoor in September 2023, and have been continually improving it since then: fixing bugs, reducing onboarding friction, and adding more capabilities (e.g., simulation flow).
Top Use Cases of Snowpark: With Snowpark, bringing business logic to data in the cloud couldn’t be easier. Transitioning work to Snowpark allows for faster ML deployment, easier scaling, and robust data pipeline development. ML Applications: For data scientists, models can be developed in Python with common machine learning tools.