API 

importDataFrame(data, src)

Imports a parallel collection from another worker. The number of partitions will be the same as in the original worker.

Parameters

data (IDataFrame(T)) – Parallel collection of source data.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.

Returns

A parallel collection with data elements.

Return type

IDataFrame(T)

textFile(path, minPartitions)

Creates a parallel collection by splitting a text file to create at least minPartitions partitions.

Parameters

path (String) – File path.
minPartitions (Integer) – Minimal number of partitions.

Returns

A parallel collection of strings.

Return type

IDataFrame(String)

Raises

IDriverException – An error is generated if the file does not exist or cannot be read.

plainFile(path, minPartitions, delim)

Creates a parallel collection by splitting a file using a custom delimiter to create at least minPartitions partitions.

Parameters

path (String) – File path.
minPartitions (Integer) – Minimal number of partitions.
delim (String) – A one-character string used as the delimiter.

Returns

A parallel collection of strings.

Return type

IDataFrame(String)

Raises

IDriverException – An error is generated if the file does not exist or cannot be read.

partitionObjectFile(path, src)

Creates a parallel collection from binary partition files. See IDataFrame.saveAsObjectFile()

Parameters

path (String) – File path without the .part* extension.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.

Returns

A parallel collection with type stored in the binary file.

Return type

IDataFrame(T)

Raises

IDriverException – An error is generated if any file does not exist or cannot be read.

partitionTextFile(path)

Creates a parallel collection from text partition files. See IDataFrame.saveAsTextFile()

Parameters: path (String) – File path without the .part* extension.
Returns: A parallel collection of strings.
Return type: IDataFrame(String)
Raises: IDriverException – An error is generated if any file does not exist or cannot be read.

partitionJsonFile(path, src, objectMapping)

Creates a parallel collection from JSON partition files. See IDataFrame.saveAsJsonFile()

Parameters

path (String) – File path without the .part* extension.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.
objectMapping (Boolean) – (Optional) If true, JSON objects are transformed into objects.

Returns

A parallel collection of mapped objects if objectMapping is true; otherwise, a parallel collection of a generic JSON type.

Return type

IDataFrame(Json) or IDataFrame(T)

Raises

IDriverException – An error is generated if any file does not exist or cannot be read.

loadLibrary(path)

Loads a library of functions in the executor processes. Functions may be invoked using only their name in any ISource. The library type depends on the programming language of the executor.

The library can be defined in two ways:

Path to a library file. The library must be compiled if the language requires it.
Source code in plain text. The executor will take care of compiling it if necessary. This allows you to create functions dynamically from the driver.

Parameters: path (String) – Library path or Source code.
Raises: IDriverException – An error is generated if the library does not exist or cannot be read.

execute(src)

Runs a function in the executors.

Parameters: src (IVoidFunction0 or ISource) – Function to be executed.

executeTo(src)

Runs a function in the executors and generates a parallel collection.

Parameters: src (IFunction0 or ISource) – Function to be executed.
Returns: A parallel collection created with the elements returned by the function.
Return type: IDataFrame(T)

call(src, data)

Runs a function that has been previously loaded by IWorker.loadLibrary(). Values returned by the function will generate a parallel collection. Note that this function is designed to execute functions given in name format; it does not accept the other formats.

Parameters

src (IFunction or IFunction0 or ISource) – Function name and its arguments. It must implement IFunction interface if data is supplied or IFunction0 otherwise.
data (IDataFrame(T)) – (Optional) A parallel collection of elements to be processed by the src function.

Returns

A parallel collection created with the elements returned by src function.

Return type

voidCall(src, data)

Runs a function that has been previously loaded by IWorker.loadLibrary(). Like IWorker.call() but with no return.

Parameters

src (IVoidFunction or IVoidFunction0 or ISource) – Function name and its arguments. It must implement the IVoidFunction interface if data is supplied or IVoidFunction0 otherwise. Note that this function is designed to execute functions given in name format; it does not accept the other formats.
data (IDataFrame(T)) – (Optional) A parallel collection of elements to be processed by the src function.

IDataFrame

The class IDataFrame represents a parallel collection of elements distributed among the worker executors. All functions defined within this class process the elements in a parallel and distributed way.

class IDataFrame(T)

class T: Represents the type associated with the parallel collection. Dynamic languages do not have to make it visible to the user, it is the input value type for most of the functions defined in IDataFrame.

setName(name)

Sets or changes the name associated with the IDataFrame. The new name will affect only this IDataFrame and future tasks created from it.

Parameters: name (String) – New name.

persist(cacheLevel)

Sets a cache level for the elements so that they only need to be computed once.

Parameters: cacheLevel (ICacheLevel) – level of cache.

cache(cacheLevel): Sets the cache level ICacheLevel.PRESERVE for the elements so that they only need to be computed once.

unpersist(): Disables the elements cache. Alias for IDataFrame.uncache.

uncache(): Disables the elements cache. Alias for IDataFrame.unpersist.

partitions()

Gets the number of partitions.

Returns: Number of partitions.
Return type: Integer

saveAsObjectFile(path, compression)

Saves elements as binary files.

Parameters

path (String) – path to store the data.
compression (Integer) – compression level (0-9).

Raises

IDriverException – An error is generated if the files already exist or cannot be written.

saveAsTextFile(path)

Saves elements as text files.

Parameters: path (String) – path to store the data.
Raises: IDriverException – An error is generated if the files already exist or cannot be written.

saveAsJsonFile(path, pretty)

Saves elements as JSON files.

Parameters

path (String) – path to store the data.
pretty (Boolean) – uses an indented format instead of a compact one.

Raises

IDriverException – An error is generated if the files already exist or cannot be written.

repartition(numPartitions, preserveOrdering, global)

Creates a new Dataframe with a fixed number of partitions.

Parameters

numPartitions (Integer) – number of partitions.
preserveOrdering (Boolean) – The order of the elements does not change.
global (Boolean) – Elements are balanced between different executors. If false, elements are only balanced within each executor.

Returns

A Dataframe with numPartitions.

Return type

IDataFrame(T)

partitionByRandom(numPartitions, seed)

Creates a new Dataframe with a fixed number of partitions. Elements are randomly distributed among the executors.

Parameters

numPartitions (Integer) – number of partitions.
seed (Integer) – Initializes the random number generator.
Returns: A Dataframe with numPartitions.
Return type: IDataFrame(T)

partitionByHash(numPartitions)

Creates a new Dataframe with a fixed number of partitions. Elements are distributed among the executors using a hash function.

Parameters: numPartitions (Integer) – number of partitions.
Returns: A Dataframe with numPartitions.
Return type: IDataFrame(T)
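
The distribution rule can be illustrated with a minimal plain-Python sketch. It does not use the Ignis API: partition_by_hash is a hypothetical helper, Python's built-in hash() stands in for whatever hash function the executors actually apply, and partitions are modeled as local lists.

```python
from typing import Hashable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def partition_by_hash(elements: List[T], num_partitions: int) -> List[List[T]]:
    """Hypothetical helper: place each element in the partition chosen by its hash."""
    partitions: List[List[T]] = [[] for _ in range(num_partitions)]
    for element in elements:
        # Python's % yields a non-negative index when the divisor is positive.
        partitions[hash(element) % num_partitions].append(element)
    return partitions


# Small integers hash to themselves in CPython, so the layout is reproducible.
parts = partition_by_hash([1, 2, 3, 4, 5, 6, 1], 3)
```

Equal elements always land in the same partition, which is the property later operations such as grouping can rely on.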

partitionBy(src, numPartitions)

Creates a new Dataframe with a fixed number of partitions. Elements are distributed among the executors using a custom function. Elements for which the function returns the same value are assigned to the same partition.

Parameters

src (IFunction(T, Integer) or ISource) – Function argument.
numPartitions (Integer) – number of partitions.

Returns

A Dataframe with numPartitions.

Return type

IDataFrame(T)

map(src)

Performs a map operation.

Parameters: src (IFunction(T, R) or ISource) – Function argument.
Returns: A Dataframe with result elements.
Return type: IDataFrame(R)

mapWithIndex(src)

Performs a map operation. Like IDataFrame.map, but the global index of the element is available as the first argument of the function.

Parameters: src (IFunction2(Integer, T, R) or ISource) – Function argument.
Returns: A Dataframe with result elements.
Return type: IDataFrame(R)
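
A minimal plain-Python sketch of the idea, not the Ignis API: map_with_index is a hypothetical helper, partitions are modeled as local lists, and the global index is assumed to start at 0 and follow partition order.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_with_index(partitions: List[List[T]],
                   func: Callable[[int, T], R]) -> List[List[R]]:
    """Hypothetical helper: call func(global_index, element) for every element."""
    result: List[List[R]] = []
    index = 0  # assumed to run across partition boundaries
    for partition in partitions:
        mapped: List[R] = []
        for element in partition:
            mapped.append(func(index, element))
            index += 1
        result.append(mapped)
    return result


labelled = map_with_index([["a", "b"], ["c"]], lambda i, x: f"{i}:{x}")
```

The partition layout is kept; only the element values change.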

filter(src)

Performs a filter operation. Only elements for which the function returns true are retained.

Parameters: src (IFunction(T, Boolean) or ISource) – Function argument.
Returns: A Dataframe with result elements.
Return type: IDataFrame(T)

flatmap(src)

Performs a flatmap operation. Like IDataFrame.map, but each element can generate any number of results.

Parameters: src (IFunction(T, Iterable(R)) or ISource) – Function argument.
Returns: A Dataframe with result elements.
Return type: IDataFrame(R)
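
The difference from a plain map can be sketched in local Python. The helper flatmap below is hypothetical and models partitions as lists; it is not the Ignis implementation.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def flatmap(partitions: List[List[T]],
            func: Callable[[T], Iterable[R]]) -> List[List[R]]:
    """Hypothetical helper: flatten the iterable returned for each element."""
    return [[result for element in partition for result in func(element)]
            for partition in partitions]


# str.split returns zero, one or many words per input line.
words = flatmap([["a b", ""], ["c d e"]], str.split)
```

An element that yields an empty iterable simply disappears from the result.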

keyBy(src)

Assigns each element a key, which is the value returned by the function.

Parameters: src (IFunction(T, R) or ISource) – Function argument.
Returns: A Dataframe of pairs with result elements.
Return type: IPairDataFrame(R, T)
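
As a rough local illustration (key_by is a hypothetical helper, and pairs are modeled as Python tuples rather than Ignis pairs):

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def key_by(elements: List[T], func: Callable[[T], R]) -> List[Tuple[R, T]]:
    """Hypothetical helper: pair every element with the key computed by func."""
    return [(func(element), element) for element in elements]


pairs = key_by(["apple", "kiwi", "fig"], len)
```

The key comes first and the original element second, matching the IPairDataFrame(R, T) return type.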

mapPartitions(src, preservesPartitioning)

Performs a specialized map that is called only once for each partition. Elements can be accessed using an iterator.

Parameters

src (IFunction(IReadIterator(T), Iterable(R)) or ISource) – Function argument.
preservesPartitioning (Boolean) – Preserves partitioning.

Returns

A Dataframe with result elements.

Return type

IDataFrame(R)

mapPartitionsWithIndex(src, preservesPartitioning)

Performs a specialized map that is called only once for each partition. Elements can be accessed using an iterator. Like IDataFrame.mapPartitions, but the global index of the partition is available as the first argument of the function.

Parameters

src (IFunction2(Integer, IReadIterator(T), Iterable(R)) or ISource) – Function argument.
preservesPartitioning (Boolean) – Preserves partitioning.

Returns

A Dataframe with result elements.

Return type

IDataFrame(R)
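
A minimal local sketch of the per-partition call pattern. The names map_partitions_with_index and tag_sum are hypothetical, a plain Python iterator stands in for IReadIterator, and partition indices are assumed to start at 0.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_partitions_with_index(
        partitions: List[List[T]],
        func: Callable[[int, Iterator[T]], Iterable[R]]) -> List[List[R]]:
    """Hypothetical helper: call func once per partition with its index."""
    return [list(func(index, iter(partition)))
            for index, partition in enumerate(partitions)]


def tag_sum(index: int, elements: Iterator[int]) -> Iterable[Tuple[int, int]]:
    # Emits a single result per partition: (partition index, sum of elements).
    yield (index, sum(elements))


sums = map_partitions_with_index([[1, 2], [3], []], tag_sum)
```

Because the function sees the whole partition at once, it can keep per-partition state or emit fewer results than it receives.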

mapExecutor(src)

Performs a specialized map that is called only once for each executor. Elements can be accessed using a list of lists, where the outer list has one entry per partition. The function argument can be modified to add or remove values; if you want to generate another value type, use IDataFrame.mapExecutorTo.

Parameters: src (IVoidFunction(List(List(T))) or ISource) – Function argument.
Returns: A Dataframe with result elements.
Return type: IDataFrame(T)

mapExecutorTo(src)

Performs a specialized map that is called only once for each executor. Elements can be accessed using a list of lists, where the outer list has one entry per partition. A new list of lists must be returned to generate new partitions.

Parameters: src (IFunction(List(List(T)), List(List(R))) or ISource) – Function argument.
Returns: A Dataframe with result elements.
Return type: IDataFrame(R)
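
The list-of-lists contract can be sketched locally. Both map_executor_to and merge_partitions are hypothetical names, and each executor is modeled as a plain list of partitions; the real executors run in separate processes.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_executor_to(
        executors: List[List[List[T]]],
        func: Callable[[List[List[T]]], List[List[R]]]) -> List[List[List[R]]]:
    """Hypothetical helper: call func once per executor with all its partitions."""
    return [func(partitions) for partitions in executors]


def merge_partitions(partitions: List[List[int]]) -> List[List[str]]:
    # Returns a new list of lists: one partition holding every value as a string.
    return [[str(value) for partition in partitions for value in partition]]


out = map_executor_to([[[1, 2], [3]], [[4]]], merge_partitions)
```

Here the returned structure changes both the element type and the number of partitions per executor.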

groupBy(src, numPartitions)

Groups elements that share the same key, which is obtained from the return of the function.

Parameters

src (IFunction(T, R) or ISource) – Function argument.
numPartitions (Integer) – (Optional) Number of resulting partitions.

Returns

A Dataframe of pairs with result elements.

Return type

IPairDataFrame(R, List(T))
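
A minimal local sketch of the grouping semantics, assuming tuples for pairs. The helper group_by is hypothetical, and the ordering of groups shown here is a property of this sketch, not a documented guarantee.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K", bound=Hashable)


def group_by(elements: List[T], func: Callable[[T], K]) -> List[Tuple[K, List[T]]]:
    """Hypothetical helper: collect elements under the key returned by func."""
    groups: Dict[K, List[T]] = defaultdict(list)
    for element in elements:
        groups[func(element)].append(element)
    return list(groups.items())


grouped = group_by([1, 2, 3, 4, 5], lambda x: x % 2)
```

Each key appears exactly once, paired with the list of all elements that produced it.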

sort(ascending, numPartitions)

Sorts the elements using their natural order.

Parameters

ascending (Boolean) – Allows you to choose between ascending and descending order.
numPartitions (Integer) – (Optional) Number of resulting partitions.

Returns

A Dataframe with result elements.

Return type

IDataFrame(T)

sortBy(src, ascending, numPartitions)

Sorts the elements using a custom function that checks whether the first argument is less than the second.

Parameters

src (IFunction2(T, T, Boolean) or ISource) – Function argument.
ascending (Boolean) – Allows you to choose between ascending and descending order.
numPartitions (Integer) – (Optional) Number of resulting partitions.

Returns

A Dataframe with result elements.

Return type

IDataFrame(T)

union(other, preserveOrder, src)

Merges elements of two dataframes.

Parameters

other (IDataFrame(T)) – other dataframe.
preserveOrder (Boolean) – If true, the second dataframe is concatenated to the first, otherwise they are mixed.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.

Returns

A Dataframe with result elements of the two dataframes.

Return type

IDataFrame(T)

distinct(numPartitions, src)

Duplicate elements are eliminated.

Parameters

numPartitions (Integer) – Number of resulting partitions.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.

Returns

A Dataframe with result elements.

Return type

IDataFrame(T)

reduce(src)

Accumulate the elements using a custom function, which must be associative and commutative. Like IDataFrame.treeReduce but final accumulation is performed in a single executor.

Parameters: src (IFunction2(T, T, T) or ISource) – Function argument.
Returns: Element resulting from accumulation.
Return type: T

treeReduce(src)

Accumulate the elements using a custom function, which must be associative and commutative. Like IDataFrame.reduce but final accumulation is performed in parallel using multiple executors.

Parameters: src (IFunction2(T, T, T) or ISource) – Function argument.
Returns: Element resulting from accumulation.
Return type: T
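
The contrast between a single-executor and a tree-shaped final accumulation can be sketched in plain Python. This is an illustration only: tree_reduce is a hypothetical helper, the pairwise tree shape is an assumption (the text does not say how executors are paired), and everything runs sequentially here.

```python
from functools import reduce
from operator import add
from typing import Callable, List, TypeVar

T = TypeVar("T")


def tree_reduce(partials: List[T], func: Callable[[T, T], T]) -> T:
    """Hypothetical helper: combine partial results pairwise, level by level.

    Requires a non-empty list.
    """
    level = list(partials)
    while len(level) > 1:
        merged = [func(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # the unpaired element moves up unchanged
        level = merged
    return level[0]


partitions = [[1, 2], [3, 4], [5], [6, 7, 8]]
partials = [reduce(add, p) for p in partitions]  # one partial per partition
sequential = reduce(add, partials)               # reduce-style final step
tree = tree_reduce(partials, add)                # treeReduce-style final step
```

Both strategies agree only because the function is associative and commutative, which is why the requirement is stated for both operations.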

collect()

Retrieve all the elements.

Returns: All the elements.
Return type: List(T)

aggregate(zero, seqOp, combOp)

Accumulate the elements using two functions, which must be associative and commutative. Like IDataFrame.treeAggregate but final accumulation is performed in a single executor.

Parameters

zero (IFunction0(R) or ISource) – Function argument to generate initial value of target type.
seqOp (IFunction2(T, R, R) or ISource) – Function argument to accumulate the elements of each partition.
combOp (IFunction2(R, R, R) or ISource) – Function argument to accumulate the results of all partitions.

Returns

Element resulting from accumulation.

Return type

R
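
How the three functions cooperate can be sketched locally. The helper aggregate below is hypothetical and sequential; seeding the final combination with zero() is an assumption of this sketch, and the argument order of seq_op follows the IFunction2(T, R, R) signature given above.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def aggregate(partitions: List[List[T]],
              zero: Callable[[], R],
              seq_op: Callable[[T, R], R],
              comb_op: Callable[[R, R], R]) -> R:
    """Hypothetical helper: fold each partition, then merge the partial results."""
    partials: List[R] = []
    for partition in partitions:
        acc = zero()                      # fresh accumulator per partition
        for element in partition:
            acc = seq_op(element, acc)    # (T, R) -> R
        partials.append(acc)
    result = zero()
    for partial in partials:
        result = comb_op(result, partial)  # (R, R) -> R
    return result


def add_element(x: int, acc: Tuple[int, int]) -> Tuple[int, int]:
    return (acc[0] + x, acc[1] + 1)


def merge(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    return (a[0] + b[0], a[1] + b[1])


# Computes (sum, count) in one pass; the result type differs from the element type.
total = aggregate([[1, 2], [3], [4, 5, 6]], lambda: (0, 0), add_element, merge)
```

Using a function for zero, rather than a value, gives every partition its own accumulator.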

treeAggregate(zero, seqOp, combOp)

Accumulate the elements using two functions, which must be associative and commutative. Like IDataFrame.aggregate but final accumulation is performed in parallel using multiple executors.

Parameters

zero (IFunction0(R) or ISource) – Function argument to generate initial value of target type.
seqOp (IFunction2(T, R, R) or ISource) – Function argument to accumulate the elements of each partition.
combOp (IFunction2(R, R, R) or ISource) – Function argument to accumulate the results of all partitions.

Returns

Element resulting from accumulation.

Return type

R

fold(zero, src)

Accumulate the elements using an initial value and a custom function, which must be associative and commutative. Like IDataFrame.treeFold but final accumulation is performed in a single executor.

Parameters

zero (IFunction0(R) or ISource) – Function argument to generate initial value of target type.
src (IFunction2(T, T, T) or ISource) – Function argument to accumulate.

Returns

Element resulting from accumulation.

Return type

T

treeFold(zero, src)

Accumulate the elements using an initial value and a custom function, which must be associative and commutative. Like IDataFrame.fold but final accumulation is performed in parallel using multiple executors.

Parameters

zero (IFunction0(R) or ISource) – Function argument to generate initial value of target type.
src (IFunction2(T, T, T) or ISource) – Function argument to accumulate.

Returns

Element resulting from accumulation.

Return type

T

take(num)

Retrieves the first num elements.

Parameters: num (Integer) – Number of elements.
Returns: First num elements.
Return type: List(T)

foreach(src)

Calls a custom function once for each element.

Parameters: src (IVoidFunction(T) or ISource) – Function argument.

foreachPartition(src)

Calls a custom function once for each partition. Elements can be accessed using an iterator.

Parameters: src (IVoidFunction(IReadIterator(T)) or ISource) – Function argument.

foreachExecutor(src)

Calls a custom function once for each executor. Elements can be accessed using a list of lists, where the outer list has one entry per partition.

Parameters: src (IVoidFunction(List(List(T))) or ISource) – Function argument.

top(num, cmp)

Retrieves the first num elements in descending order. A custom function can be used to check whether the first argument is less than the second.

Parameters

num (Integer) – Number of elements.
cmp (IFunction2(T, T, Boolean) or ISource) – (Optional) Comparator to be used instead of the natural order.

Returns

First num elements.

Return type

List(T)

takeOrdered(num, cmp)

Retrieves the first num elements in ascending order. A custom function can be used to check whether the first argument is less than the second.

Parameters

num (Integer) – Number of elements.
cmp (IFunction2(T, T, Boolean) or ISource) – (Optional) Comparator to be used instead of the natural order.

Returns

First num elements.

Return type

List(T)
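
A "less than" comparator of this kind can be turned into an ordering as in the following local sketch. The helpers take_ordered and top are hypothetical and sort the whole list in memory, which is not necessarily how the executors compute the result.

```python
from functools import cmp_to_key
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")
Less = Callable[[T, T], bool]


def _sort(elements: List[T], less: Optional[Less], reverse: bool) -> List[T]:
    if less is None:
        return sorted(elements, reverse=reverse)  # natural order

    def compare(a: T, b: T) -> int:
        # Derive a three-way comparison from the boolean "a < b" test.
        if less(a, b):
            return -1
        if less(b, a):
            return 1
        return 0

    return sorted(elements, key=cmp_to_key(compare), reverse=reverse)


def take_ordered(elements: List[T], num: int, less: Optional[Less] = None) -> List[T]:
    """Hypothetical helper: first num elements in ascending order."""
    return _sort(elements, less, reverse=False)[:num]


def top(elements: List[T], num: int, less: Optional[Less] = None) -> List[T]:
    """Hypothetical helper: first num elements in descending order."""
    return _sort(elements, less, reverse=True)[:num]


def shorter(a: str, b: str) -> bool:
    return len(a) < len(b)


smallest = take_ordered([5, 1, 4, 2], 3)
largest = top([5, 1, 4, 2], 2)
shortest = take_ordered(["pear", "fig", "banana", "kiwi"], 2, shorter)
```

The same comparator serves both directions; only the sort order is reversed for the descending case.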

sample(withReplacement, fraction, seed)

Generates a random sample of records from the original elements.

Parameters

withReplacement (Boolean) – An element can be selected more than once.
fraction (Float) – Percentage of the sample.
seed (Integer) – Initializes the random number generator.

Returns

A Dataframe with result elements.

Return type

IDataFrame(T)

takeSample(withReplacement, num, seed)

Generates and retrieves a random sample of num records from the original elements.

Parameters

withReplacement (Boolean) – An element can be selected more than once.
num (Integer) – Number of elements.
seed (Integer) – Initializes the random number generator.

Returns

A Dataframe with result elements.

Return type

count()

Count the elements.

Returns: Number of elements.
Return type: Integer

max(cmp)

Retrieves the element with the maximum value. A custom function can be used to check whether the first argument is less than the second. Like IDataFrame.top with num=1.

Parameters

cmp (IFunction2(T, T, Boolean) or ISource) – (Optional) Comparator to be used instead of the natural order.

Returns

Element with the maximum value.

Return type

T

min(cmp)

Retrieves the element with the minimal value. A custom function can be used to check whether the first argument is less than the second. Like IDataFrame.takeOrdered with num=1.

Parameters

cmp (IFunction2(T, T, Boolean) or ISource) – (Optional) Comparator to be used instead of the natural order.

Returns

Element with the minimal value.

Return type

T

toPair()

Converts IDataFrame to IPairDataFrame when IDataFrame.T is a Pair of IPairDataFrame.K and IPairDataFrame.V.

Returns: A Dataframe of pairs
Return type: IPairDataFrame(K, V)

class IPairDataFrame(K, V)

Extends IDataFrame functionality when IDataFrame.T is a Pair.

class K: Represents the key type associated with the parallel collection. Dynamic languages do not have to make it visible to the user. It is the key input type for most of the functions defined in IPairDataFrame.

class V: Represents the value type associated with the parallel collection. Dynamic languages do not have to make it visible to the user. It is the value input type for most of the functions defined in IPairDataFrame.

join(other, preserveOrder, numPartitions, src)

Joins an element of this collection with an element of other that share the same key.

Parameters

other (IPairDataFrame(K, V)) – other dataframe.
numPartitions (Integer) – Number of resulting partitions.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.

Returns

A Dataframe of pairs with result elements.

Return type

IPairDataFrame(K, Pair(V, V))
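
A minimal local sketch of the pairing rule, assuming inner-join semantics (keys present in only one collection are dropped), which the description suggests but does not state outright. The helper join is hypothetical and uses tuples for pairs.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def join(left: List[Tuple[K, V]],
         right: List[Tuple[K, V]]) -> List[Tuple[K, Tuple[V, V]]]:
    """Hypothetical helper: pair up values from both sides that share a key."""
    index: Dict[K, List[V]] = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Every left value is combined with every right value of the same key.
    return [(key, (value, other))
            for key, value in left
            for other in index.get(key, [])]


joined = join([("a", 1), ("b", 2), ("a", 3)],
              [("a", 10), ("c", 30), ("a", 11)])
```

A key that occurs m times on one side and n times on the other produces m × n result pairs in this sketch.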

flatMapValues(src)

Performs a map function only on the values while preserving the key. Like IPairDataFrame.mapValues, but each element can generate any number of results; the key is duplicated or deleted if necessary.

Parameters: src (IFunction(V, R) or ISource) – Function argument.
Returns: A Dataframe of pairs with result elements.
Return type: IPairDataFrame(K, R)
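
What "duplicated or deleted" means for the key can be seen in a small local sketch. The helper flat_map_values is hypothetical, pairs are modeled as tuples, and the function is assumed to return an iterable of results for each value.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")
R = TypeVar("R")


def flat_map_values(pairs: List[Tuple[K, V]],
                    func: Callable[[V], Iterable[R]]) -> List[Tuple[K, R]]:
    """Hypothetical helper: emit one (key, result) pair per result of func(value)."""
    return [(key, result) for key, value in pairs for result in func(value)]


expanded = flat_map_values([("a", "x y"), ("b", ""), ("c", "z")], str.split)
```

Key "a" is repeated for its two results, while key "b" vanishes because its value produced none.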

mapValues(src)

Performs a map function only on the values while preserving the key.

Parameters: src (IFunction(V, R) or ISource) – Function argument.
Returns: A Dataframe of pairs with result elements.
Return type: IPairDataFrame(K, R)

groupByKey(numPartitions, src)

Groups elements that share the same key.

Parameters

numPartitions (Integer) – Number of resulting partitions.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.

Returns

A Dataframe of pairs with result elements.

Return type

IPairDataFrame(K, List(V))

reduceByKey(src, numPartitions, localReduce)

Accumulate the values that share the same key using a custom function, which must be associative and commutative.

Parameters

src (IFunction2(V, V, V) or ISource) – Function argument.
numPartitions (Integer) – Number of resulting partitions.
localReduce (Boolean) – Accumulate the values that share the same key in an executor before global accumulation. Reduces the size of the exchange if there are duplicated keys in multiple partitions.

Returns

A Dataframe of pairs with result elements.

Return type

IPairDataFrame(K, V)

aggregateByKey(zero, seqOp, combOp, numPartitions)

Accumulate the values that share the same key using two functions, which must be associative and commutative.

Parameters

zero (IFunction0(R) or ISource) – Function argument to generate initial value of target type.
seqOp (IFunction2(V, R, R) or ISource) – Function argument to accumulate the values that share the same key of each partition.
combOp (IFunction2(R, R, R) or ISource) – Function argument to accumulate the results that share the same key of all partitions.
numPartitions (Integer) – Number of resulting partitions.

Returns

A Dataframe of pairs with result elements.

Return type

IPairDataFrame(K, R)

foldByKey(zero, src, numPartitions, localFold)

Accumulate the values that share the same key using an initial value and a custom function, which must be associative and commutative.

Parameters

zero (IFunction0(R) or ISource) – Function argument to generate initial value of target type.
src (IFunction2(V, V, V) or ISource) – Function argument to accumulate.
numPartitions (Integer) – Number of resulting partitions.
localFold (Boolean) – Accumulate the values that share the same key in an executor before global accumulation. Reduces the size of the exchange if there are duplicated keys in multiple partitions.

Returns

A Dataframe of pairs with result elements.

Return type

IPairDataFrame(K, V)

sortByKey(ascending, numPartitions, src)

Sort the keys using their natural order.

Parameters

ascending (Boolean) – Allows you to choose between ascending and descending order.
numPartitions (Integer) – Number of resulting partitions.
src (ISource) – (Optional) Auxiliary function to configure the executor; its use may vary between languages. Must implement at least the IBeforeFunction interface.

Returns

A Dataframe of pairs with result elements.

Return type

IPairDataFrame(K, V)

keys()

Retrieve unique keys.

Returns: The unique keys.
Return type: List(K)

values()

Retrieve unique values.

Returns: The unique values.
Return type: List(V)

sampleByKey(withReplacement, fractions, seed, native)

Generates a random sample of records from the values that share the same key.

Parameters

withReplacement (Boolean) – An element can be selected more than once.
fractions (Map(K, Float)) – Percentage of the sample by key. Absences are taken as zero.
seed (Integer) – Initializes the random number generator.
native (Boolean) – (Optional) sends fractions with native serialization.

Returns

A Dataframe with result elements.

Return type

IPairDataFrame(K, V)