This article was published as a part of the Data Science Blogathon. Introduction Big data refers to collections of data so vast that traditional tools cannot store or process them. The post Integration of Python with Hadoop and Spark appeared first on Analytics Vidhya.
This article was published as a part of the Data Science Blogathon. Introduction Every day the internet generates billions of bytes of data. Every time you put on a dog filter, watch cat videos or order food from your favourite restaurant, you generate data.
This article was published as a part of the Data Science Blogathon. Different components in the Hadoop Framework. Introduction Hadoop is an open-source framework for the distributed storage and processing of big data. The post HIVE – A DATA WAREHOUSE IN HADOOP FRAMEWORK appeared first on Analytics Vidhya.
This article was published as a part of the Data Science Blogathon. Introduction Apache Hadoop is an open-source framework designed to facilitate interaction with big data. Still, for those unfamiliar with this technology, one question arises, what is big data?
This article was published as a part of the Data Science Blogathon. Introduction Every Data Science enthusiast’s journey goes through one of the most classical data problems – Frequent Itemset Mining, also sometimes referred to as Association Rule Mining or Market Basket Analysis.
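As a toy illustration of the counting at the heart of Frequent Itemset Mining, here is a minimal, self-contained sketch (not any particular library's API) that finds item pairs meeting a minimum support threshold over a few made-up transactions:

from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (one list of items per purchase).
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

# Count every 2-item combination that appears in a transaction.
pair_counts = Counter(
    pair
    for basket in transactions
    for pair in combinations(sorted(set(basket)), 2)
)

# Keep pairs meeting a minimum support threshold (here: 2 transactions).
min_support = 2
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)  # all three pairs appear in 2 transactions here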
This article was published as a part of the Data Science Blogathon. Introduction Apache Sqoop is a big data engine for transferring data between Hadoop and relational database servers. Big Data Sqoop can also be […].
This article was published as a part of the Data Science Blogathon. Introduction MapReduce is part of the Apache Hadoop ecosystem, a framework for large-scale data processing. Other components of Apache Hadoop include the Hadoop Distributed File System (HDFS), YARN, and Apache Pig.
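To make the map/reduce split concrete, here is a hedged sketch of the classic word count written as Hadoop Streaming scripts in Python; the file names (mapper.py, reducer.py) are illustrative:

#!/usr/bin/env python3
# mapper.py: emit (word, 1) pairs, one per line, tab-separated.
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py: sum the counts per word (streaming delivers input sorted by key).
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

You would then submit these with something like hadoop jar /path/to/hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out, where the streaming-jar location and HDFS paths depend on your installation.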
This article was published as a part of the Data Science Blogathon. Introduction This article will discuss the Hadoop Distributed File System: its features, components, functions, and benefits. It also describes how HDFS works and its real-time applications. Both structured and complex data can […].
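For a feel of how HDFS is used day to day, here is a minimal sketch that drives the standard hdfs dfs commands from Python; it assumes the hdfs CLI is on PATH and a reachable cluster, and the file and directory names are hypothetical:

import subprocess

# Copy a local file into HDFS and list the target directory.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data.csv", "/user/demo/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)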
This article was published as a part of the Data Science Blogathon. Introduction Apache Flume, a part of the Hadoop ecosystem, was developed by Cloudera. Initially, it was designed solely to handle log data; later, it was extended to process event data. The post Get to Know Apache Flume from Scratch! appeared first on Analytics Vidhya.
This article was published as a part of the Data Science Blogathon. Introduction Hive is a popular data warehouse built on top of Hadoop that is used by companies like Walmart, TikTok, and AT&T. It is an important technology for data engineers to learn and master.
This article was published as a part of the Data Science Blogathon. Introduction Since the 1970s, relational database management systems have solved the problems of storing and maintaining large volumes of structured data.
This article was published as a part of the Data Science Blogathon. Introduction Hadoop facilitates the processing of large datasets in a distributed manner and provides the foundation on which other services and applications can be built. MapReduce and HDFS are the two main components of Hadoop.
This article was published as a part of the Data Science Blogathon. Introduction HBase is a column-oriented non-relational database management system that operates on Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant manner of storing sparse data sets, which are prevalent in several big data use cases.
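As a small illustration of that sparse, column-family storage model, here is a sketch using happybase, a third-party Python client that talks to HBase through its Thrift gateway; the host, table, and column names are hypothetical:

import happybase

# Connect through the HBase Thrift gateway (port 9090 is its usual default).
connection = happybase.Connection("thrift-host", port=9090)
table = connection.table("users")

# Store sparse columns under a column family, then read the row back.
table.put(b"row-1", {b"cf:name": b"Ada", b"cf:city": b"London"})
print(table.row(b"row-1"))
connection.close()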
This article was published as a part of the Data Science Blogathon. What is the need for Hive? The official description of Hive is: 'Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.'
This article was published as a part of the Data Science Blogathon. Introduction Most of you would know the different approaches for building a data and analytics platform. You would have already worked on systems that used traditional warehouses or Hadoop-based data lakes. Selecting one among […].
This article was published as a part of the Data Science Blogathon. Introduction Apache Oozie is a distributed workflow scheduler for performing and controlling Hadoop tasks. MapReduce, Sqoop, Pig, and Hive jobs can be easily scheduled with this tool. It […].
This article was published as a part of the Data Science Blogathon. Introduction Apache Hadoop is the most used open-source framework in the industry to store and process large data efficiently. Hive is built on top of Hadoop to provide data storage, query, and processing capabilities.
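One common way to issue HiveQL from Python is through a Hive-enabled SparkSession; the sketch below is illustrative, with a hypothetical sales table:

from pyspark.sql import SparkSession

# A Hive-enabled Spark session lets you run HiveQL directly.
spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)")
spark.sql("INSERT INTO sales VALUES ('book', 12.5), ('pen', 1.2)")
spark.sql("SELECT item, SUM(amount) AS total FROM sales GROUP BY item").show()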
This article was published as a part of the Data Science Blogathon. Introduction Impala is an open-source and native analytics database for Hadoop. Vendors such as Cloudera, Oracle, MapR, and Amazon have shipped Impala. If you want to learn all things Impala, you’ve come to the right place.
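Since Impala speaks a HiveServer2-compatible protocol, a quick way to query it from Python is the third-party impyla client; the host and table below are hypothetical, and 21050 is Impala's usual HS2 port:

from impala.dbapi import connect

# Connect to an impalad daemon and run a query via the DB-API interface.
conn = connect(host="impala-host", port=21050)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM web_logs")
print(cur.fetchall())
conn.close()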
This article was published as a part of the Data Science Blogathon. Introduction Apache Oozie is a Hadoop workflow scheduler. Users can design Directed Acyclic Graphs of workflows that can be run in parallel and sequentially in Hadoop. It is a system that manages the workflow of dependent tasks.
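Workflow definitions themselves are XML, but Oozie also exposes a web-services REST API you can poll; in this hedged sketch, the host is hypothetical, 11000 is Oozie's usual port, and the endpoint and JSON fields follow Oozie's documented v1 API:

import requests

# List the most recent workflow jobs and print their IDs and statuses.
resp = requests.get(
    "http://oozie-host:11000/oozie/v1/jobs",
    params={"jobtype": "wf", "len": 5},
)
for job in resp.json().get("workflows", []):
    print(job["id"], job["status"])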
This article was published as a part of the Data Science Blogathon. Introduction Zookeeper in Hadoop can be considered a centralized repository that distributed applications can put data into and retrieve data from. For clarity, Zookeeper can be […].
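That put/retrieve pattern maps directly onto znodes; here is a minimal sketch using kazoo, a third-party Python ZooKeeper client, with a hypothetical ensemble address and znode path:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")
zk.start()

# Put a small piece of data into a znode, then read it back.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=8")
value, stat = zk.get("/app/config")
print(value, stat.version)
zk.stop()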
This article was published as a part of the Data Science Blogathon. Apache Pig is built on top of Hadoop and provides data-flow processing for large data sets. Apache Pig offers a high-level language, Pig Latin, as an alternative to writing MapReduce (MR) jobs directly.
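To show what that high-level language looks like, here is a hedged sketch that writes a tiny Pig Latin script and runs it in Pig's local mode; the data file and field names are hypothetical:

import subprocess

# A tiny Pig Latin script: load a CSV, filter rows, and dump the result.
script = """
people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER people BY age >= 18;
DUMP adults;
"""
with open("demo.pig", "w") as f:
    f.write(script)
subprocess.run(["pig", "-x", "local", "demo.pig"], check=True)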
This article was published as a part of the Data Science Blogathon. Introduction Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark’s in-memory data processing can make it up to 100 times faster than Hadoop MapReduce for certain workloads. The most […].
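That in-memory speedup hinges on caching: once a dataset is materialized in executor memory, later actions skip recomputation. A minimal PySpark sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
df = spark.range(1_000_000)

# cache() keeps the dataset in executor memory, so the second action
# avoids recomputation -- the source of Spark's in-memory speedups.
df.cache()
print(df.count())                          # materializes and caches
print(df.filter(df.id % 2 == 0).count())   # served from memory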
This article was published as a part of the Data Science Blogathon. This article is focused on Apache Pig. It is a high-level platform for processing large data sets on top of Hadoop. The post An Introduction to Apache Pig For Absolute Beginners! appeared first on Analytics Vidhya.
This article was published as a part of the Data Science Blogathon. Introduction One of the sources of Big Data is the traditional application management system or the interaction of applications with relational databases using RDBMS. Big Data storage and analysis […].
This article was published as a part of the Data Science Blogathon. Introduction I’ve always wondered how big companies like Google process their information or how companies like Netflix can return search results so quickly.
This article was published as a part of the Data Science Blogathon. Introduction to Apache Flume Apache Flume is a data ingestion mechanism for gathering, aggregating, and transmitting huge amounts of streaming data from diverse sources, such as log files, events, and so on, to a centralized data storage.
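As a small end-to-end taste, Flume agents can be fed over HTTP: the sketch below posts two events in the JSON layout Flume's HTTP source handler expects, assuming a hypothetical agent configured with an HTTP source listening on port 44444:

import requests

# Each event carries optional headers plus a string body.
events = [
    {"headers": {"source": "web-01"}, "body": "GET /index.html 200"},
    {"headers": {"source": "web-02"}, "body": "GET /cart 500"},
]
requests.post("http://flume-host:44444", json=events)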
I hope that you have sufficient knowledge of big data and Hadoop concepts like map, reduce, transformations, actions, lazy evaluation, and many more topics in Hadoop and Spark. Extracting day, month, and year from a date column:

from pyspark.sql.functions import year, month, dayofmonth
# extract year, month, and day details from the data frame (assumes a column named "date")
df.select(year("date").alias("year"), month("date").alias("month"), dayofmonth("date").alias("day")).distinct().orderBy("year").show()
Data engineers play a crucial role in managing and processing big data. They are responsible for designing, building, and maintaining the infrastructure and tools needed to manage and process large volumes of data effectively. What is data engineering?
This article was published as a part of the Data Science Blogathon. Apache Pig is capable of working on any kind of data, just as a pig can eat almost anything. Introduction After reading the heading Apache Pig, the first question that comes to every mind is: why the word Pig? Pig is nothing but a […].
Introduction Apache Flume is a data ingestion tool/service for gathering, aggregating, and delivering huge amounts of streaming data from diverse sources, such as log files and events, to centralized data storage. Flume is highly dependable, distributed, and customizable.
This article was published as a part of the Data Science Blogathon. Introduction Have you ever wondered how Instagram recommends similar kinds of reels while you are scrolling through your feed or ad recommendations for similar products that you were browsing on Amazon?
This article was published as a part of the Data Science Blogathon. Introduction Modern applications and products deal with large amounts of data. The quantity of data being processed and utilised in modern times is enormous. So, the question arises: how do we manage such large files and data?
This article was published as a part of the Data Science Blogathon. Introduction Elasticsearch is a Lucene-based search engine developed in Java but supports clients in various languages such as Python, C#, Ruby, and PHP. It takes unstructured data from multiple sources as input and stores it […].
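Assuming, as the description suggests, the engine in question is Elasticsearch, here is a minimal sketch of that ingest-then-search flow using the official Python client (8.x-style API); the host, index, and document are hypothetical:

from elasticsearch import Elasticsearch

# Index an unstructured document and run a full-text match query.
es = Elasticsearch("http://localhost:9200")
es.index(index="articles", id="1", document={"title": "Intro to Hadoop", "body": "HDFS and MapReduce basics"})
es.indices.refresh(index="articles")
hits = es.search(index="articles", query={"match": {"title": "hadoop"}})
print(hits["hits"]["total"])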
This article was published as a part of the Data Science Blogathon. Introduction Hi everyone! In this guide, we will discuss Apache Sqoop in depth: the Sqoop import and export processes with different modes, as well as Sqoop-Hive integration, so that whenever you […].
This article was published as a part of the Data Science Blogathon. Introduction With huge increases in data velocity, value, and veracity, the volume of data is growing exponentially with time. This outgrows the storage limit of a single machine and raises the demand for storing data across a network of machines.
This article was published as a part of the Data Science Blogathon. Introduction Apache Sqoop is a tool designed to aid in the large-scale export and import of data into HDFS from structured data repositories. Relational databases, enterprise data warehouses, and NoSQL systems are all examples of such data stores.
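A typical Sqoop import is driven entirely by command-line flags; the sketch below shells out to the sqoop CLI from Python, with a hypothetical JDBC URL, credentials file, and paths:

import subprocess

# Import the "orders" table from MySQL into an HDFS directory.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host/shop",
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pw",
    "--table", "orders",
    "--target-dir", "/user/etl/orders",
    "--num-mappers", "1",
], check=True)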
This article was published as a part of the Data Science Blogathon. Introduction A data lake is a central data repository that allows us to store all of our structured and unstructured data on a large scale.
This article was published as a part of the Data Science Blogathon. Introduction Have you ever wondered how big IT giants store and process huge amounts of data? Storing the data […].
Summary: The fundamentals of Data Engineering encompass essential practices like data modelling, warehousing, pipelines, and integration. Understanding these concepts enables professionals to build robust systems that facilitate effective data management and insightful analysis. What is Data Engineering?
Unfolding the difference between a data engineer, a data scientist, and a data analyst. Data engineers are essential professionals responsible for designing, constructing, and maintaining an organization’s data infrastructure. Data Visualization: Matplotlib, Seaborn, Tableau, etc.
In the technology-driven world we inhabit, two skill sets have risen to prominence and are a hot topic: coding vs data science. Essential Skills for Data Science Data Science, while incorporating coding, demands a different skill set. Statistics helps data scientists estimate, predict, and test hypotheses.
Data engineering is a rapidly growing field that designs and develops systems that process and manage large amounts of data. There are various architectural design patterns in data engineering that are used to solve different data-related problems.
Enrich your data engineering skills by building problem-solving ability with real-world projects, teaming up with peers, participating in coding challenges, and more. Globally, several organizations are hiring data engineers to extract, process, and analyze the information available in vast volumes of data sets.
Over the past year, job openings for data scientists increased by 56%. People who pursue a career in data science can expect excellent job security and very competitive salaries. However, a background in data analytics, Hadoop technology, or related competencies doesn’t guarantee success in this field.