Seamless Integration with Big Data Eco-System
Created by Pallavi Singh, last modified on May 10, 2017
INTRODUCTION
CarbonData provides DataFrame and SQL compliance through its built-in Spark integration.
DESCRIPTION
CarbonData has built-in integration with Spark 1.6.2 and 2.1, providing interfaces for Spark SQL, the DataFrame API, and query optimization. It supports bulk data ingestion and allows Spark DataFrames to be saved as CarbonData files.
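A minimal sketch of the DataFrame save path (Spark 2.1 style; on Spark 1.6.2 the entry point differs). The session setup, table name, and columns are illustrative assumptions; the "carbondata" format name and "tableName" option follow the CarbonData data source:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CarbonDataWriteExample")
  .getOrCreate()

// An illustrative DataFrame to ingest.
val df = spark.range(0, 1000).selectExpr("id", "concat('name_', id) AS name")

// Save the DataFrame as a CarbonData table (bulk ingestion).
df.write
  .format("carbondata")
  .option("tableName", "sample_table")
  .save()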
CARBONDATA-SPARK INTEGRATION
Figure 1: CarbonData-Spark Integration
Apache CarbonData uses Spark for data management and query optimization.
CarbonData has its own reader and writer, while the files themselves are stored on HDFS. Apache CarbonData acts as a Spark SQL data source.
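Because it is a data source, a CarbonData table can be read back through the standard DataFrameReader. A minimal sketch, reusing the illustrative session and table name from above:

// Load an existing CarbonData table as a DataFrame.
val carbonDf = spark.read
  .format("carbondata")
  .option("tableName", "sample_table")
  .load()

// Filters pushed down to the source can be served by CarbonData's indexes.
carbonDf.filter("id < 10").show()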
CARBONDATA AS A SPARKSQL DATA SOURCE
Figure 2: Roles of Spark Components in CarbonData
i) Parser/Analyzer
Parser: the parser parses every incoming query, e.g. insert, update, and delete. It is hooked into the query that is fired and is used to parse the new SQL syntaxes that CarbonData adds, such as update/delete and compaction. Once the syntax has been parsed, further processing of the query begins (see the sketch below).
Resolve Relation: the Carbon data source relation enables buildScan and insert.
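A short sketch of the CarbonData-specific syntax the parser handles. The statement shapes follow CarbonData's documented DML; the table name and values are illustrative:

// Update and delete, parsed by the Carbon parser rather than stock Spark SQL.
spark.sql("UPDATE sample_table SET (name) = ('renamed') WHERE id = 1")
spark.sql("DELETE FROM sample_table WHERE id = 2")

// Compaction, another of the new syntaxes mentioned above.
spark.sql("ALTER TABLE sample_table COMPACT 'MINOR'")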
ii) Optimise and Physical Planning
The next step is to find what can be optimized. One simple optimization is lazy decoding.
Lazy decoding: decode dictionary-encoded values into their actual values only when the data must finally be fetched, while performing all intermediate work directly on the dictionary values.
Example query: lazy decoding leverages the global dictionary (Figure 3).
Figure 3: Lazy decoding
All search and filter operations can be applied to the dictionary values, which are converted to the actual values only when the results are handed back to Spark.
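A toy sketch of the idea in plain Scala (not CarbonData's implementation; the dictionary and rows are made up): aggregation runs entirely on the integer dictionary keys, and the string values are decoded only for the final result.

// Global dictionary: surrogate key -> actual value.
val dictionary = Map(1 -> "beijing", 2 -> "shanghai", 3 -> "shenzhen")

// Rows arrive dictionary-encoded: (cityKey, sales).
val encodedRows = Seq((1, 100.0), (2, 250.0), (1, 50.0), (3, 75.0))

// Group and aggregate on the encoded key; no string value is touched here.
val aggregated = encodedRows.groupBy(_._1).map { case (k, rows) => (k, rows.map(_._2).sum) }

// Lazy decode: translate keys to actual values only when emitting results.
val decoded = aggregated.map { case (key, total) => (dictionary(key), total) }
decoded.foreach(println)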
iii) Execution
Spark's Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects; each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
The features of RDDs (decomposing the name):
- Resilient, i.e. fault-tolerant with the help of the RDD lineage graph, and so able to recompute missing or damaged partitions after node failures.
- Distributed, with data residing on multiple nodes in a cluster.
- Dataset, a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects that represent records of the data you work with.
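A minimal RDD sketch showing the three properties above (reusing the SparkSession from the earlier sketch; the numbers are illustrative): the collection is split into partitions, transformations are recorded in the lineage graph, and a lost partition can be recomputed from that lineage.

val sc = spark.sparkContext

// A dataset distributed across 4 logical partitions.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// map is recorded in the lineage graph; a lost partition is recomputed from it.
val squared = rdd.map(x => x.toLong * x)

println(squared.sum())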
Figure 4: Sequence Diagram (Spark 1.6.2)
Class | Description | Package
CarbonScanRDD | Query execution RDD, leveraging the multi-level index for efficient filtering and scanning; also covers DML-related RDDs. | org.apache.carbondata.spark.rdd.CarbonScanRDD
DetailQueryExecutor | Carbon query interface. | org.apache.carbondata.core.scan.executor.QueryExecutor
DetailQueryResult | Internal query interface; executes the query and returns an iterator over the query result. | org.apache.carbondata.core.scan.result.iterator.AbstractDetailQueryResultIterator
DetailBlockIterator | Blocklet iterator to process the blocklets. | org.apache.carbondata.core.scan.processor.AbstractDataBlockIterator
FilterScanner | Interface for scanning a blocklet; there are two types of scanner, the non-filter scanner and the filter scanner. | org.apache.carbondata.core.scan.scanner.BlockletScanner
DictionaryBasedResultCollector | Prepares the query results from the scanned result. | org.apache.carbondata.core.scan.collector.ScannedResultCollector
FilterExecutor | Interface for executing the filter on the executor side. | org.apache.carbondata.core.scan.filter.executor.FilterExecuter
DimensionColumnChunkReader | Reader interface for reading and uncompressing the blocklet dimension column data. | org.apache.carbondata.core.datastore.chunk.reader.DimensionColumnChunkReader
MeasureColumnChunkReader | Reader interface for reading and uncompressing the blocklet measure column data. | org.apache.carbondata.core.datastore.chunk.reader.MeasureColumnChunkReader
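To see where these classes come into play, a hedged sketch: asking Spark for the query plans of a filter query. With the CarbonData source, the scan side of the physical plan is expected to be executed by CarbonScanRDD, which applies the multi-level index before the column chunk readers above touch any data.

// Print the logical and physical plans; the Carbon scan appears on the physical side.
spark.sql("SELECT name FROM sample_table WHERE id = 42").explain(true)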