
Stream Grids: Supported Transformations

sriharshaboda edited this page Sep 16, 2017 · 3 revisions

The following is the list of transformations supported in Stream Grids.

Filter

Filters the stream on one or more conditions. You can combine multiple conditions with AND or OR logic by clicking the + button for each new condition.

[image: trans-filter]
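As a sketch of the combination logic, plain Java predicates can stand in for the UI conditions (this is an illustration of the AND/OR semantics, not the StreamGrids API):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of how multiple filter conditions combine: each "+" in the UI adds
// a condition, and the conditions are joined with AND or OR logic. The
// record type and predicates here are illustrative only.
public class FilterSketch {

    public static <T> List<T> filter(List<T> batch, Predicate<T> condition) {
        return batch.stream().filter(condition).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Predicate<Integer> positive = n -> n > 0;   // condition 1
        Predicate<Integer> even = n -> n % 2 == 0;  // condition 2, added via "+"

        // Conditions combined with AND logic
        System.out.println(filter(List.of(-2, 1, 2, 3, 4), positive.and(even))); // [2, 4]
        // Conditions combined with OR logic
        System.out.println(filter(List.of(-2, 1, 3), positive.or(even))); // [-2, 1, 3]
    }
}
```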

Union

No further user input is required for this operator. The user can pass multiple streams through it, and the union of data across all streams is produced. If the schema differs across streams, an all-encompassing schema is generated, so no data is lost.
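The schema-widening behaviour can be sketched as follows. Records are modeled as maps for illustration (an assumption, not the StreamGrids representation); fields a record lacks come out as null under the all-encompassing schema:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of union semantics: records from streams with different schemas are
// merged under an all-encompassing schema so no data is lost.
public class UnionSketch {

    public static List<Map<String, Object>> union(List<Map<String, Object>> a,
                                                  List<Map<String, Object>> b) {
        // All-encompassing schema: the union of field names across both streams
        Set<String> schema = new LinkedHashSet<>();
        for (Map<String, Object> r : a) schema.addAll(r.keySet());
        for (Map<String, Object> r : b) schema.addAll(r.keySet());

        List<Map<String, Object>> all = new ArrayList<>(a);
        all.addAll(b);

        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> r : all) {
            Map<String, Object> widened = new LinkedHashMap<>();
            for (String field : schema) widened.put(field, r.get(field)); // null if absent
            out.add(widened);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> streamA = List.of(Map.<String, Object>of("id", 1));
        List<Map<String, Object>> streamB = List.of(Map.<String, Object>of("name", "x"));
        System.out.println(union(streamA, streamB)); // [{id=1, name=null}, {id=null, name=x}]
    }
}
```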

Take

This prints a given number of sample records from each batch. Provide the number of records to print as input.

[image: trans-take]

Persist

This is to change the persistence mechanism of DStreams.

[image: trans-persist]

Repartition

This is to change the number of partitions of the DStream; partitions are a way to configure parallelism.

[image: trans-repart]

Join

Users can join multiple streams with this operator. For each stream, provide the joining column and one or more output columns, as well as the type of join (inner, outer, etc.).

[image: join]
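The core idea of joining on a column can be sketched like this. Records are modeled as maps and every column of both sides is kept; this illustrates an inner join only, whereas StreamGrids also lets you choose output columns and outer join types:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of inner-join semantics on a joining column; illustrative only,
// not the StreamGrids implementation.
public class JoinSketch {

    public static List<Map<String, Object>> innerJoin(List<Map<String, Object>> left,
                                                      List<Map<String, Object>> right,
                                                      String key) {
        // Index the right-hand stream by the joining column
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> r : right)
            index.computeIfAbsent(r.get(key), k -> new ArrayList<>()).add(r);

        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> l : left)
            for (Map<String, Object> r : index.getOrDefault(l.get(key), List.of())) {
                Map<String, Object> joined = new LinkedHashMap<>(l);
                joined.putAll(r); // merge columns from both sides
                out.add(joined);
            }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> users = List.of(
                Map.<String, Object>of("id", 1, "name", "ann"));
        List<Map<String, Object>> orders = List.of(
                Map.<String, Object>of("id", 1, "item", "book"),
                Map.<String, Object>of("id", 2, "item", "pen"));
        // Only the id=1 order has a matching user, so one joined record comes out
        System.out.println(innerJoin(users, orders, "id").size()); // 1
    }
}
```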

Sort

Provide the column based on which the streaming data needs to be sorted and the corresponding order (asc/desc).

[image: trans-sort]

Aggregation

Using this operator you can perform multiple aggregations such as min, max, mean, and count, all in one transformation. You can add more aggregations by clicking the + button.

[image: trans-aggr]
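Computing several aggregations over one batch in a single pass can be sketched with the JDK's summary statistics (an illustration of the idea, not the StreamGrids API):

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;

// Sketch: min, max, mean, and count computed in one pass over a batch.
public class AggregationSketch {

    public static DoubleSummaryStatistics aggregate(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).summaryStatistics();
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics s = aggregate(List.of(1.0, 2.0, 3.0, 4.0));
        System.out.println("count=" + s.getCount() + " min=" + s.getMin()
                + " max=" + s.getMax() + " mean=" + s.getAverage());
        // count=4 min=1.0 max=4.0 mean=2.5
    }
}
```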

Intersection

Does what the name suggests: gives the common subset of data across multiple streams with the same schema. No user input needed.

Window

Converts a regular DStream to a windowed DStream. The user can choose between a fixed and a sliding window and has to provide the window and slide durations (each must be a multiple of the batch duration, in milliseconds).

[image: trans-window]
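The sliding-window semantics can be sketched over batches directly. Since both durations must be multiples of the batch duration, window and slide lengths are expressed here as batch counts (an illustration, not the StreamGrids implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of sliding-window semantics: each emitted window covers the last
// windowLen batches and advances by slide batches. A fixed window is the
// special case slide == windowLen.
public class WindowSketch {

    public static List<List<Integer>> slidingWindows(List<List<Integer>> batches,
                                                     int windowLen, int slide) {
        List<List<Integer>> windows = new ArrayList<>();
        for (int end = windowLen; end <= batches.size(); end += slide) {
            List<Integer> window = new ArrayList<>();
            for (List<Integer> batch : batches.subList(end - windowLen, end))
                window.addAll(batch);
            windows.add(window);
        }
        return windows;
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = List.of(List.of(1), List.of(2), List.of(3), List.of(4));
        // Window of 2 batches sliding by 1 batch
        System.out.println(slidingWindows(batches, 2, 1)); // [[1, 2], [2, 3], [3, 4]]
    }
}
```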

Count

Gives a count of records in a given batch for a particular stream. No user input needed.

Distinct

Filters out duplicate records in the stream and emits only unique records. No user input needed.

Map

Supports adding a custom map function. Executor Plugin takes the fully qualified class name (FQCN) of the custom Map class, and Upload jar takes the jar containing that class. For details on how to implement a custom Map function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)

[image: trans-map]
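A custom Map class might look roughly like the sketch below. The real 'transform' signature is defined in openbdre's transformations.Transformation interface; the Transformation interface declared here is an assumption made only so the example is self-contained, so check Transformation.java for the actual shape:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of a custom Map class. The interface below is an
// ASSUMED shape, not the real openbdre interface.
public class CustomMapSketch {

    // Assumed interface shape; the real one is transformations.Transformation
    public interface Transformation<I, O> {
        O transform(I input);
    }

    // Example custom map: upper-case every record
    public static class UpperCaseMap implements Transformation<String, String> {
        public String transform(String record) {
            return record.toUpperCase();
        }
    }

    public static void main(String[] args) {
        Transformation<String, String> map = new UpperCaseMap();
        List<String> out = List.of("a", "b").stream()
                .map(map::transform)
                .collect(Collectors.toList());
        System.out.println(out); // [A, B]
    }
}
```

The compiled class would be packaged into a jar, with its FQCN given to the Executor Plugin field as described above.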

MapToPair

This transformation changes a DStream to a DStream<K,V>. It accepts a custom function that returns a DStream<K,V>; alternatively, one of the columns of the current stream can be assigned as the key for the paired DStream.

[image: trans-map2pair]

FlatMap

Similar to the Map function above. It offers two options, FlatMap and FlatMapToPair, and accepts an FQCN and the corresponding jar as input. For details on how to implement a custom FlatMap function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)

[image: trans-flatmap]

Reduce/ReduceByKey

Similar to the above: the user has to provide the FQCN of the implementing class and the jar containing it. ReduceByWindow and ReduceByKeyAndWindow are also supported, in which case window/slide durations (in milliseconds) are accepted. For details on how to implement a custom Reduce function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)

[image: reduce]
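The reduceByKey idea can be sketched as folding all values that share a key with a user-supplied reduce function (here summation stands in for the custom class StreamGrids would load from the jar):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Sketch of reduceByKey semantics: values sharing a key are folded together
// with the supplied reduce function. Illustrative only.
public class ReduceSketch {

    public static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs,
                                               BinaryOperator<V> reducer) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs)
            out.merge(p.getKey(), p.getValue(), reducer);
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        System.out.println(reduceByKey(pairs, Integer::sum)); // {a=4, b=2}
    }
}
```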

Group/GroupByKey

Similar to the above; window support is also available.

Deduplication

Two types:

Window Deduplication:

Eliminates duplicates over a period of time (the window duration, in milliseconds). The user provides the column to check for duplicates and the duration for which to hold state for duplicate checking.

[image: trans_wi_dedup]
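The per-window deduplication logic can be sketched as follows. Map records and the column name are illustrative; in StreamGrids the seen-value state is held for the configured duration:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of window deduplication: within one window, remember the values of
// the dedup column already seen and drop records that repeat a value.
public class DedupSketch {

    public static List<Map<String, Object>> dedup(List<Map<String, Object>> window,
                                                  String column) {
        Set<Object> seen = new HashSet<>(); // state held for the window duration
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> record : window)
            if (seen.add(record.get(column))) // false when the value repeats
                out.add(record);
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> window = List.of(
                Map.<String, Object>of("id", 1, "v", "first"),
                Map.<String, Object>of("id", 1, "v", "dup"),
                Map.<String, Object>of("id", 2, "v", "other"));
        System.out.println(dedup(window, "id").size()); // 2
    }
}
```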

HBase Deduplication:

Eliminates duplicates by looking up an HBase table: if any records in the current batch of the stream are already present in the given HBase table, those records are filtered out. The user inputs the field in the source stream schema to check for duplicates, the HBase persistent store connection name, and the HBase table.

[image: trans_hbase_dedup]

Enricher

This transformation is used to look up external stores and enrich the fields present in the source stream. For this feature to work, some external data must already have been broadcast. For information on how to broadcast external data into a Stream Workflow, refer to: https://gitlab.com/bdre/documentation/wikis/stream-grids---how-to-lookup-from-external-data-inside-a-stream-workflow

The user needs to provide the field of the source stream and the corresponding broadcast identifier used to handle external lookups.

[image: enricher]
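The enrichment idea can be sketched as a lookup against broadcast data. The broadcast data is modeled here as an in-memory map from lookup-field value to extra columns; that representation is an assumption for illustration, not how StreamGrids stores broadcasts:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of enrichment: for each record, look up the value of the chosen
// field in the broadcast data and merge the extra columns into the record.
public class EnricherSketch {

    public static List<Map<String, Object>> enrich(List<Map<String, Object>> batch,
                                                   String field,
                                                   Map<Object, Map<String, Object>> broadcast) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> record : batch) {
            Map<String, Object> enriched = new LinkedHashMap<>(record);
            Map<String, Object> extra = broadcast.get(record.get(field));
            if (extra != null)
                enriched.putAll(extra); // add looked-up columns to the record
            out.add(enriched);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Object, Map<String, Object>> countries =
                Map.<Object, Map<String, Object>>of("IN", Map.<String, Object>of("country", "India"));
        List<Map<String, Object>> batch = List.of(Map.<String, Object>of("code", "IN"));
        System.out.println(enrich(batch, "code", countries).get(0).get("country")); // India
    }
}
```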

Custom

To handle business logic with your own custom transformation, provide the FQCN of the implementing class and the jar containing it. For details on how to implement a custom function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)

[image: custom]
