Stream Grids - Supported Transformations
The following is a list of the transformations supported in Stream Grids.
Filter
You can provide multiple conditions combined with AND or OR logic; click the + button to add each new condition.
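Conceptually, each condition is a predicate and the + button chains another one onto the combined filter. A minimal sketch of that chaining (a hypothetical helper, not the actual Stream Grids implementation):

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch: combining the conditions added via the + button with AND / OR logic.
class FilterSketch {
    // Combine per-condition predicates into one filter.
    // useAnd == true  -> all conditions must hold (AND)
    // useAnd == false -> any condition may hold   (OR)
    static <T> Predicate<T> combine(List<Predicate<T>> conditions, boolean useAnd) {
        Predicate<T> combined = useAnd ? t -> true : t -> false;
        for (Predicate<T> c : conditions) {
            combined = useAnd ? combined.and(c) : combined.or(c);
        }
        return combined;
    }
}
```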
Union
No further user input is required for this operator. Multiple streams can be passed through it, and the union of the data across all streams is produced. If the schemas differ across streams, an all-encompassing schema is generated, so no data is lost.
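The "all-encompassing schema" can be pictured as the union of the column sets of all input streams. A small sketch of that merge, assuming a schema is a column-name-to-type map (a hypothetical model; the real schema-merge logic lives inside Stream Grids):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: building an all-encompassing schema for Union.
class UnionSchemaSketch {
    // Merge column->type maps from several streams. Every column from every
    // stream is kept (first-seen type wins), so no data is lost.
    static Map<String, String> mergeSchemas(List<Map<String, String>> schemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> schema : schemas) {
            for (Map.Entry<String, String> col : schema.entrySet()) {
                merged.putIfAbsent(col.getKey(), col.getValue());
            }
        }
        return merged;
    }
}
```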
Take
This prints a sample of records from each batch. Provide the number of records to print.
Persist
This changes the persistence mechanism (storage level) of the DStream.
Repartition
This changes the number of partitions of the DStream; partitions are a way to configure parallelism.
Join
Multiple streams can be joined using this operator. For each stream, provide the joining column and one or more output columns, as well as the type of join (inner/outer, etc.).
Sort
Provide the column based on which the streaming data needs to be sorted and the corresponding order (asc/desc).
Aggregation
Using this operator you can apply multiple aggregations (min, max, mean, count, etc.) in a single transformation. Add more aggregations by clicking the + button.
Intersection
Does what the name suggests: returns the common subset of data across multiple streams with the same schema. No user input needed.
Window
Converts a regular DStream to a windowed DStream. Choose between a fixed or sliding window and provide the window and slide durations (each must be a multiple of the batch duration, in milliseconds).
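The "multiple of batch duration" constraint can be checked arithmetically: a window of 6000 ms over 2000 ms batches covers exactly 3 batches, while a 5000 ms window would split a batch and is invalid. A sketch of that check (a hypothetical helper, not part of Stream Grids itself):

```java
// Sketch: validating window/slide durations against the batch duration.
class WindowSketch {
    // A window or slide duration is valid only if it is a positive
    // multiple of the batch duration (all values in milliseconds).
    static boolean isValidDuration(long durationMs, long batchMs) {
        return durationMs > 0 && batchMs > 0 && durationMs % batchMs == 0;
    }

    // Number of batches each window covers (assumes a valid duration).
    static long batchesPerWindow(long windowMs, long batchMs) {
        return windowMs / batchMs;
    }
}
```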
Count
Gives a count of records in a given batch for a particular stream. No user input needed.
Distinct
Filters out duplicate records in the stream and emits only unique records. No user input needed.
Map
Adds support for a custom map function. Executor Plugin takes the FQCN of the custom Map class, and Upload Jar takes the custom jar containing that class. For details on how to implement a custom Map function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)
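To illustrate the shape of such a class, here is a sketch using a simplified stand-in interface. Note that SimpleTransformation below is hypothetical; the real 'transform' method in transformations.Transformation operates on DStreams, so check the source file referenced above for the actual signature.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical simplified stand-in for transformations.Transformation;
// the real interface works on DStreams, not plain lists.
interface SimpleTransformation {
    List<String> transform(List<String> records);
}

// Example custom Map: upper-cases every record. The FQCN of a class
// like this is what the Executor Plugin field would take.
class UpperCaseMap implements SimpleTransformation {
    @Override
    public List<String> transform(List<String> records) {
        return records.stream()
                      .map(String::toUpperCase)
                      .collect(Collectors.toList());
    }
}
```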
MapToPair
This transformation changes a DStream to a DStream<K,V>. It accepts a custom function that returns a DStream<K,V>; alternatively, one of the columns of the current stream can be assigned as the key for the paired DStream.
Flatmap
Similar to the Map function above. Contains two options, FlatMap and FlatMapToPair, and accepts the FQCN and corresponding jar as input. For details on how to implement a custom FlatMap function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)
Reduce/ReduceByKey
Similar to the above: provide the FQCN of the implementing class and the jar containing it. Also supports ReduceByWindow and ReduceByKeyAndWindow, and therefore accepts window/slide durations (in milliseconds). For details on how to implement a custom Reduce function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)
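The semantics of ReduceByKey can be sketched on an in-memory batch: all values sharing a key are folded together with the supplied reduce function. This is a hypothetical illustration of the semantics only, not the Spark Streaming API (the real operator works on DStream<K,V>):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Sketch: ReduceByKey semantics over one in-memory batch of (key, value) pairs.
class ReduceSketch {
    // Fold all values that share a key with the given reduce function.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs,
                                        BinaryOperator<V> reducer) {
        Map<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            result.merge(pair.getKey(), pair.getValue(), reducer);
        }
        return result;
    }
}
```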
Group/GroupByKey
Similar to the above; window support is also available.
Deduplication
Two types:
Window Deduplication:
Eliminates duplicates over a period of time (the window duration, in milliseconds). Provide the column to check for duplicates and the duration for which to hold state for duplicate checking.
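The state described above can be pictured as a map from key to the time it was last seen: repeats are dropped until the window duration has elapsed. A minimal in-memory sketch of that idea (hypothetical; the real operator manages this state on the stream):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: window deduplication via a key -> last-seen-time map.
class WindowDedupSketch {
    private final long windowMs;
    private final Map<String, Long> lastSeen = new HashMap<>();

    WindowDedupSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    // Returns true if the record should be kept: either the key has not
    // been seen yet, or the dedup window has already expired for it.
    boolean accept(String key, long nowMs) {
        Long seenAt = lastSeen.get(key);
        if (seenAt != null && nowMs - seenAt < windowMs) {
            return false; // duplicate inside the window -> drop
        }
        lastSeen.put(key, nowMs);
        return true;
    }
}
```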
HBase Deduplication:
Eliminates duplicates by looking up an HBase table: if any records in the current batch of the stream are already present in the given HBase table, those records are filtered out. Provide the field in the source stream schema to check for duplicates, the HBase persistent store connection name, and the HBase table.
Enricher
This transformation is used to look up external stores and enrich the fields present in the source stream. For this feature to work, some external data must already be broadcast. For information on how to broadcast external data into a Stream Workflow, refer to: https://gitlab.com/bdre/documentation/wikis/stream-grids---how-to-lookup-from-external-data-inside-a-stream-workflow
Provide the field of the source stream and the corresponding broadcast identifier used for the external lookup.
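Conceptually, the broadcast data acts as a lookup table keyed by the chosen source-stream field, and matching extra fields are merged into each record. A sketch with the broadcast dataset modelled as a plain in-memory map (hypothetical; the real broadcast mechanism is described in the wiki page linked above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: enriching a record from a broadcast lookup table.
class EnricherSketch {
    // Look up the record's value of 'lookupField' in the broadcast data
    // and merge in any extra fields found there.
    static Map<String, String> enrich(Map<String, String> record,
                                      String lookupField,
                                      Map<String, Map<String, String>> broadcastData) {
        Map<String, String> enriched = new LinkedHashMap<>(record);
        Map<String, String> extra = broadcastData.get(record.get(lookupField));
        if (extra != null) {
            enriched.putAll(extra);
        }
        return enriched;
    }
}
```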
Custom
To write your own custom transformation for business logic, provide the FQCN of the implementing class and the jar containing it. For details on how to implement a custom function, refer to the 'transform' method signature in the interface transformations.Transformation.java (source code location: openbdre/spark-streaming/src/main/java/transformations/Transformation.java)