In [1]:

```
:load KarpsDisplays KarpsDagDisplay
:extension OverloadedStrings
import Spark.Core.Dataset
import Spark.Core.Context
import Spark.Core.Functions
import Spark.Core.Column
import Spark.Core.ColumnFunctions
```

If you have worked with large Spark jobs, you know that workflows can become really large and convoluted. Karps provides some tools to decompose and organize workflows, but still retain the power of lazy execution and full-pipeline optimization.

As a simple example, we will compute the mean of a dataset containing integers. In practice, such an operation that involves computing the mean is just one step of a much larger pipeline. We will see how to build and vizualize such pipelines.

Like Spark, Karps comes with a built-in visualization tool to explore how data is being generated. For any `dataset`

or `observable`

, you can use the `showGraph`

command to see the graph of computations for this node. This graph is displayed thanks to the TensorBoard tool that comes with Google TensorFlow.

In [2]:

```
set = dataset ([1 ,2, 3, 4]::[Int]) @@ "initial_set"
myCount = count set
showGraph myCount
```

In [3]:

```
myMean :: Dataset Int -> LocalData Double
myMean ds' =
let s2 = asDouble (count ds') @@ "count"
c1 = asCol ds'
s1 = asDouble(sumCol c1) @@ "sum"
in (s2 / s1) @@ "mean"
```

Then we can use this function on a dataset:

In [4]:

```
let ds = dataset ([1,2,3] :: [Int]) @@ "data"
m = myMean ds
x = m + 1
```

In [5]:

```
showNameGraph x
```

In [6]:

```
let ds = dataset ([1 ,2, 3, 4]::[Int]) @@ "data"
let c = count ds
let c2 = (c + identity c) `logicalParents` [untyped ds] @@ "c2"
```

In [7]:

```
:t logicalParents
```

Here is the mean function, this time with some labels for logical parents.

In [8]:

```
myMean2 :: Dataset Int -> LocalData Int
myMean2 ds' =
let s2 = count ds' @@ "count"
c1 = asCol ds'
s1 = sumCol c1 @@ "sum"
-- Only one parent (ds')
-- Also, the name is NOT set. It will be set by the calling node.
in ((s1 + s2) `logicalParents` [untyped ds'])
```

In [9]:

```
ds = dataset ([1,2,3] :: [Int]) @@ "data"
m2 = (myMean2 ds) @@ "my_mean"
x2 = m2 + 1
```

In [10]:

```
showGraph x2
```

In [11]:

```
showGraph x2
```

Of course, this nesting can occur for arbitrary depths.

*Note* In this preview, the algorithm that performs the name resolution may fail for complicated cases.

Organizing computations is important for large pipelines in practice. Karps provides some basic tools to organize and compose pipelines of arbitrary complexity.

In [ ]:

```
```