Introduction to Rx
Kindle edition (2012)
Data is not always valuable in its raw form. Sometimes we need to consolidate, collate, combine or condense the mountains of data we receive into more consumable bite-sized chunks. Consider fast moving data from domains like instrumentation, finance, signal processing and operational intelligence. This kind of data can change at a rate of over ten values per second. Can a person actually consume this? For human consumption, aggregate values like averages, minimums and maximums are often of more use.
Continuing with the theme of reducing an observable sequence, we will look at the aggregation functions that are available to us in Rx. Our first set of methods continues on from our last chapter, as they take an observable sequence and reduce it to a sequence with a single value. We then move on to find operators that can transition a sequence back to a scalar value, a functional fold.
Just before we move on to introducing the new operators, we will quickly create our own extension method. We will use this 'Dump' extension method to help build our samples.
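A minimal sketch of what such a Dump helper might look like, assuming the System.Reactive package and C# top-level statements. The class name SampleExtensions and the exact output format are illustrative choices made here, not something the text prescribes.

```csharp
using System;
using System.Reactive.Linq;

// Usage: name the sequence and print every notification it produces.
Observable.Range(0, 3).Dump("numbers");

// Hypothetical container class; extension methods must live in a static class.
public static class SampleExtensions
{
    // Subscribes to the source and writes each notification, prefixed with a name.
    public static void Dump<T>(this IObservable<T> source, string name)
    {
        source.Subscribe(
            value => Console.WriteLine("{0}-->{1}", name, value),
            ex => Console.WriteLine("{0} failed-->{1}", name, ex.Message),
            () => Console.WriteLine("{0} completed", name));
    }
}
```

The samples that follow in this sketch style subscribe with Console.WriteLine directly, so each one can be run without this helper.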
Those who use LINQPad will recognize that this is the source of inspiration. For those who have not used LINQPad, I highly recommend it. It is perfect for whipping up quick samples to validate a snippet of code. LINQPad also fully supports the IObservable<T> type.
Count is a very familiar extension method for those who use LINQ on IEnumerable<T>. Like all good method names, it "does what it says on the tin". The Rx version deviates from the IEnumerable<T> version in that it returns an observable sequence, not a scalar value. The returned sequence will have a single value: the count of the values in the source sequence. Obviously we cannot provide the count until the source sequence completes.
If you are expecting your sequence to have more values than a 32 bit integer can hold, there is the option to use the LongCount extension method. This is just the same as Count except it returns an IObservable<long>.
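As a small illustration, the sketch below counts a three-value sequence. It assumes the System.Reactive package and uses Observable.Range as a stand-in source.

```csharp
using System;
using System.Reactive.Linq;

var numbers = Observable.Range(0, 3);

// Count yields a single int once the source has completed.
numbers.Count().Subscribe(
    count => Console.WriteLine("count-->{0}", count),
    () => Console.WriteLine("count completed"));

// LongCount behaves the same way, but the single value is a long.
IObservable<long> longCount = numbers.LongCount();
longCount.Subscribe(count => Console.WriteLine("long count-->{0}", count));
```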
Min, Max, Sum and Average
Other common aggregations are Min, Max, Sum and Average. Just like Count, these all return a sequence with a single value. Once the source completes, the result sequence will produce its value and then complete.
The Min and Max methods have overloads that allow you to provide a custom implementation of an IComparer<T> to sort your values in a custom way. The Average extension method specifically calculates the mean (as opposed to median or mode) of the sequence. For sequences of integers (int or long) the output of Average will be an IObservable<double>. If the source is of nullable integers then the output will be IObservable<double?>. All other numeric types (float, double, decimal and their nullable equivalents) will result in the output sequence being of the same type as the input sequence.
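The following sketch applies the four operators to a small sequence of integers, assuming the System.Reactive package. Note that the Average of an int sequence is explicitly typed as IObservable<double> to show the type change described above.

```csharp
using System;
using System.Reactive.Linq;

var numbers = Observable.Range(1, 4); // 1, 2, 3, 4

// Each operator produces one value when the source completes.
numbers.Min().Subscribe(x => Console.WriteLine("min-->{0}", x));
numbers.Max().Subscribe(x => Console.WriteLine("max-->{0}", x));
numbers.Sum().Subscribe(x => Console.WriteLine("sum-->{0}", x));

// The mean of an int sequence is a double.
IObservable<double> average = numbers.Average();
average.Subscribe(x => Console.WriteLine("average-->{0}", x));
```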
Finally we arrive at the set of methods in Rx that meet the functional description of catamorphism/fold. These methods will take an IObservable<T> and produce a T.
Caution should be exercised whenever using any of these fold methods on an observable sequence, as they are all blocking. The reason you need to be careful with blocking methods is that you are moving from an asynchronous paradigm to a synchronous one, and without care you can introduce concurrency problems such as locked UIs and deadlocks. We will take a deeper look into these problems in a later chapter when we look at concurrency.
It is worth noting that the soon to be released .NET 4.5 and Rx 2.0 will provide support for avoiding these concurrency problems. The new async/await keywords and related features in Rx 2.0 can help exit the monad in a safer way.
The First() extension method simply returns the first value from a sequence.
If the source sequence does not have any values (i.e. is an empty sequence) then the First method will throw an exception. You can cater for this in three ways:
- Use a try/catch block around the First() call
- Use Take(1) instead. However, this will be asynchronous, not blocking.
- Use FirstOrDefault extension method instead
The FirstOrDefault method will still block until the source produces any notification. If the notification is an OnError then it will be thrown. If the notification is an OnNext then that value will be returned; otherwise, if it is an OnCompleted, the default will be returned. As we have seen in earlier methods, we can either choose to use the parameterless method, in which case the default value will be default(T) (i.e. null for reference types or the zero value for value types), or alternatively we can provide our own default value to use.
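The options above can be sketched as follows, assuming the System.Reactive package. Only the parameterless FirstOrDefault is shown; newer Rx releases also offer non-blocking variants such as FirstAsync, which are generally the safer choice.

```csharp
using System;
using System.Reactive.Linq;

var numbers = Observable.Range(5, 3);   // 5, 6, 7
var empty = Observable.Empty<int>();

// Blocks until the first value arrives, then returns it as a scalar.
int first = numbers.First();
Console.WriteLine("first-->{0}", first);

// Option 1: an empty sequence makes First() throw, so guard the call.
try
{
    empty.First();
}
catch (InvalidOperationException ex)
{
    Console.WriteLine("empty.First() threw: {0}", ex.Message);
}

// Option 2: Take(1) stays asynchronous; an empty source simply completes.
empty.Take(1).Subscribe(
    x => Console.WriteLine("take-->{0}", x),
    () => Console.WriteLine("take completed"));

// Option 3: FirstOrDefault returns default(int) when the source is empty.
int orDefault = empty.FirstOrDefault();
Console.WriteLine("first or default-->{0}", orDefault);
```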
A special mention should be made of the unique relationship that BehaviorSubject and the First() extension method have. The reason behind this is that the BehaviorSubject is guaranteed to have a notification, be it a value, an error or a completion. This effectively removes the blocking nature of the First extension method when used with a BehaviorSubject. This can be used to make behavior subjects act like properties.
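A brief sketch of that property-like usage, assuming the System.Reactive package. The temperature subject and its values are made up for illustration.

```csharp
using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// A BehaviorSubject always holds a latest notification, starting with its seed.
var temperature = new BehaviorSubject<int>(20);

// First() returns immediately because a value is already available.
int initial = temperature.First();
Console.WriteLine("initial-->{0}", initial);

// After a new value is pushed, First() reads the updated "property".
temperature.OnNext(25);
int updated = temperature.First();
Console.WriteLine("updated-->{0}", updated);
```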
The Last and LastOrDefault will block until the source completes and then return the last value. Just like the First() method any OnError notifications will be thrown. If the sequence is empty then Last() will throw an InvalidOperationException, but you can use LastOrDefault to avoid this.
The Single extension method is for getting the single value from a sequence. The difference between this and First() or Last() is that it helps to assert your assumption that the sequence will only contain a single value. The method will block until the source produces a value and then completes. If the sequence produces any other combination of notifications then the method will throw. This method works especially well with AsyncSubject instances as they only produce single value sequences.
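The sketch below exercises Last, LastOrDefault and Single, assuming the System.Reactive package. The AsyncSubject stands in for any source known to produce exactly one value.

```csharp
using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;

// Last blocks until completion and returns the final value.
int last = Observable.Range(1, 3).Last();
Console.WriteLine("last-->{0}", last);

// LastOrDefault avoids the exception that Last would throw on an empty source.
int lastOrDefault = Observable.Empty<int>().LastOrDefault();
Console.WriteLine("last or default-->{0}", lastOrDefault);

// An AsyncSubject yields exactly one value, which suits Single.
var result = new AsyncSubject<string>();
result.OnNext("done");
result.OnCompleted();
string single = result.Single();
Console.WriteLine("single-->{0}", single);

// Single throws when the single-value assumption does not hold.
bool threw = false;
try
{
    Observable.Range(1, 3).Single();
}
catch (InvalidOperationException)
{
    threw = true;
}
Console.WriteLine("Single on three values threw: {0}", threw);
```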
Build your own aggregations
If the provided aggregations do not meet your needs, you can build your own. Rx provides two different ways to do this.
The Aggregate method allows you to apply an accumulator function to the sequence. For the basic overload, you need to provide a function that takes the current state of the accumulated value and the value that the sequence is pushing. The result of the function is the new accumulated value. In this overload the accumulator is a Func<TSource, TSource, TSource> applied to an IObservable<TSource>, and the result is again an IObservable<TSource>.
If you wanted to produce your own version of Sum for int values, you could do so by providing a function that just adds to the current state of the accumulator.
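A minimal sketch of such a Sum, assuming the System.Reactive package and a small Observable.Range source:

```csharp
using System;
using System.Reactive.Linq;

var values = Observable.Range(1, 4); // 1, 2, 3, 4

// The accumulator adds each pushed value to the running state.
IObservable<int> sum = values.Aggregate((acc, current) => acc + current);

sum.Subscribe(x => Console.WriteLine("sum-->{0}", x));
```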
This overload of Aggregate has several problems. First, it requires that the aggregated value be the same type as the sequence values. We have already seen in other aggregates like Average that this is not always the case. Secondly, this overload needs at least one value to be produced from the source or the output will error with an InvalidOperationException. It should be completely valid for us to use Aggregate to create our own Count or Sum on an empty sequence. To do this you need to use the other overload. This overload takes an extra parameter, which is the seed. The seed value provides an initial accumulated value. It also allows the aggregate type to be different from the value type.
Updating our Sum implementation to use this overload is easy. Just add the seed, which will be 0. This will now return 0 as the sum when the sequence is empty, which is just what we want. You can now also create your own version of Count.
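The seeded overload might be used like this; the sketch assumes the System.Reactive package and runs the same accumulators against both a populated and an empty source.

```csharp
using System;
using System.Reactive.Linq;

var values = Observable.Range(1, 4); // 1, 2, 3, 4
var empty = Observable.Empty<int>();

// Sum: start from 0 and add each value.
var sum = values.Aggregate(0, (acc, current) => acc + current);
var emptySum = empty.Aggregate(0, (acc, current) => acc + current);

// Count: start from 0 and add one per value, ignoring the value itself.
var count = values.Aggregate(0, (acc, _) => acc + 1);
var emptyCount = empty.Aggregate(0, (acc, _) => acc + 1);

sum.Subscribe(x => Console.WriteLine("sum-->{0}", x));
emptySum.Subscribe(x => Console.WriteLine("sum of empty-->{0}", x));
count.Subscribe(x => Console.WriteLine("count-->{0}", x));
emptyCount.Subscribe(x => Console.WriteLine("count of empty-->{0}", x));
```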
As an exercise, write your own Min and Max methods using Aggregate. You will probably find the IComparer<T> interface useful, and in particular the static Comparer<T>.Default property. When you have done the exercise, continue to the example implementations...
Examples of creating Min and Max from Aggregate:
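One possible solution is sketched below, assuming the System.Reactive package. MyMin and MyMax are hypothetical names, written here as local functions so the sample runs as a single file; they could equally be extension methods.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

// Keeps whichever of the accumulated value and the current value is smaller.
static IObservable<T> MyMin<T>(IObservable<T> source) =>
    source.Aggregate((min, current) =>
        Comparer<T>.Default.Compare(min, current) > 0 ? current : min);

// Keeps whichever of the accumulated value and the current value is larger.
static IObservable<T> MyMax<T>(IObservable<T> source) =>
    source.Aggregate((max, current) =>
        Comparer<T>.Default.Compare(max, current) < 0 ? current : max);

var values = new[] { 2, 1, 3, 5, 0 }.ToObservable();

MyMin(values).Subscribe(x => Console.WriteLine("min-->{0}", x));
MyMax(values).Subscribe(x => Console.WriteLine("max-->{0}", x));
```

Because these use the unseeded overload, they share its limitation: an empty source results in an error rather than a value.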
While Aggregate allows us to get a final value for sequences that will complete, sometimes this is not what we need. If we consider a use case that requires a running total as we receive values, then Aggregate is not a good fit. Aggregate is also not a good fit for infinite sequences. The Scan extension method, however, meets this requirement perfectly. The signatures for both Scan and Aggregate are the same; the difference is that Scan will push the result from every call to the accumulator function. So instead of being an aggregator that reduces a sequence to a single value sequence, it is an accumulator that returns an accumulated value for each value of the source sequence. In this example we produce a running total.
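A sketch of a running total, assuming the System.Reactive package. A Subject is used so values can be pushed by hand, and the totals list exists only to make the results easy to inspect.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Reactive.Subjects;

var numbers = new Subject<int>();
var totals = new List<int>();

// Scan pushes the accumulated value after every source value.
numbers.Scan(0, (acc, current) => acc + current)
       .Subscribe(total =>
       {
           totals.Add(total);
           Console.WriteLine("running total-->{0}", total);
       });

numbers.OnNext(1); // running total 1
numbers.OnNext(2); // running total 3
numbers.OnNext(3); // running total 6
numbers.OnCompleted();
```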
It is probably worth pointing out that you can combine Scan with TakeLast(1) to reproduce Aggregate for any source that produces at least one value.
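The equivalence can be sketched as follows, assuming the System.Reactive package. It holds for sources with at least one value; on an empty source the seeded Aggregate still yields its seed, while Scan followed by TakeLast(1) yields nothing.

```csharp
using System;
using System.Reactive.Linq;

var values = Observable.Range(1, 4); // 1, 2, 3, 4

// The single final value from Aggregate...
var aggregated = values.Aggregate(0, (acc, current) => acc + current);

// ...matches the last of the intermediate values produced by Scan.
var scanned = values.Scan(0, (acc, current) => acc + current).TakeLast(1);

aggregated.Subscribe(x => Console.WriteLine("aggregate-->{0}", x));
scanned.Subscribe(x => Console.WriteLine("scan + take last-->{0}", x));
```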
As another exercise, use the methods we have covered so far in the book to produce a sequence of running minimums and running maximums. The key here is that each time we receive a value that is less than (or more than for a Max operator) our current accumulator, we should push that value and update the accumulator value. We don't, however, want to push duplicate values. For example, given a sequence of [2, 1, 3, 5, 0] we should see output like [2, 1, 0] for the running minimum, and [2, 3, 5] for the running maximum. We don't want to see [2, 1, 1, 1, 0] or [2, 2, 3, 5, 5]. Continue to see an example implementation.
Example of a running minimum:
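A terse sketch of a running minimum, assuming the System.Reactive package and the sample sequence [2, 1, 3, 5, 0] from the exercise. The name minOf is illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

var values = new[] { 2, 1, 3, 5, 0 }.ToObservable();

var comparer = Comparer<int>.Default;
Func<int, int, int> minOf = (x, y) => comparer.Compare(x, y) < 0 ? x : y;

// Scan yields 2, 1, 1, 1, 0; DistinctUntilChanged drops the adjacent repeats.
var runningMin = values.Scan(minOf).DistinctUntilChanged();

runningMin.Subscribe(x => Console.WriteLine("min-->{0}", x));
```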
Example of a running maximum:
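A more verbose sketch of a running maximum, assuming the System.Reactive package. RunningMax and MaxOf are hypothetical names, written as local functions so the sample runs as a single file.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

// Returns the larger of two values according to the default comparer.
static T MaxOf<T>(T x, T y)
{
    var comparer = Comparer<T>.Default;
    if (comparer.Compare(x, y) < 0)
    {
        return y;
    }
    return x;
}

// Scan yields the maximum so far for every value; Distinct removes repeats.
static IObservable<T> RunningMax<T>(IObservable<T> source)
{
    return source.Scan(MaxOf<T>).Distinct();
}

var values = new[] { 2, 1, 3, 5, 0 }.ToObservable();

RunningMax(values).Subscribe(x => Console.WriteLine("max-->{0}", x));
```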
While the only functional difference between the two examples is checking greater than instead of less than, the examples show two different styles. Some people prefer the terseness of the first example; others like their curly braces and the verbosity of the second example. The key here was to compose the Scan method with the Distinct or DistinctUntilChanged methods. It is probably preferable to use DistinctUntilChanged so that we are not internally keeping a cache of all values.
Rx also gives you the ability to partition your sequence with features like the standard LINQ operator GroupBy. This can be useful for taking a single sequence and fanning out to many subscribers or perhaps taking aggregates on partitions.
MinBy and MaxBy
The MinBy and MaxBy operators allow you to partition your sequence based on a key selector function. Key selector functions are common in other LINQ operators like the IEnumerable<T> ToDictionary, GroupBy and Distinct methods. Each method will return you the values from the key that was the minimum or maximum respectively.
Take note that each Min and Max operator has an overload that takes a comparer. This allows for comparing custom types or custom sorting of standard types.
Consider a sequence from 0 to 9. If we apply a key selector that partitions the values into groups based on their modulus of 3, we will have 3 groups of values. The values and their keys will be as follows:
- 0, key: 0
- 1, key: 1
- 2, key: 2
- 3, key: 0
- 4, key: 1
- 5, key: 2
- 6, key: 0
- 7, key: 1
- 8, key: 2
- 9, key: 0
We can see here that the minimum key is 0 and the maximum key is 2. If, therefore, we applied the MinBy operator, our single value from the sequence would be the list [0,3,6,9]. Applying the MaxBy operator would produce the list [2,5,8]. The MinBy and MaxBy operators will only yield a single value (like an AsyncSubject) and that value will be an IList<T> with zero or more values.
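The walkthrough above can be sketched as follows, assuming a System.Reactive version that exposes MinBy and MaxBy on IObservable<T>:

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

var values = Observable.Range(0, 10); // 0 through 9

// All values whose key (value modulo 3) is the smallest key seen.
IObservable<IList<int>> minBy = values.MinBy(i => i % 3);

// All values whose key is the largest key seen.
IObservable<IList<int>> maxBy = values.MaxBy(i => i % 3);

minBy.Subscribe(list => Console.WriteLine("min by-->[{0}]", string.Join(",", list)));
maxBy.Subscribe(list => Console.WriteLine("max by-->[{0}]", string.Join(",", list)));
```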
If instead of the values for the minimum/maximum key, you wanted to get the minimum value for each key, then you would need to look at GroupBy.
The GroupBy operator allows you to partition your sequence just as IEnumerable<T>'s GroupBy operator does. In a similar fashion to how the IEnumerable<T> operator returns an IEnumerable<IGrouping<TKey, T>>, the IObservable<T> GroupBy operator returns an IObservable<IGroupedObservable<TKey, T>>.
I find the overloads that also take an element selector a little redundant, as we could easily just compose a Select operator to the query to get the same functionality.
In a similar fashion to how the IGrouping<TKey, T> type extends IEnumerable<T>, the IGroupedObservable<TKey, T> type just extends IObservable<T> by adding a Key property. The use of GroupBy effectively gives us a nested observable sequence.
To use the GroupBy operator to get the minimum/maximum value for each key, we can first partition the sequence and then Min/Max each partition.
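A sketch of that approach using nested Subscribe calls, assuming the System.Reactive package. The minByKey dictionary is only there to make the results easy to inspect.

```csharp
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

var values = Observable.Range(0, 10); // 0 through 9
var minByKey = new Dictionary<int, int>();

// Partition by value modulo 3, then take the Min of each partition.
values.GroupBy(i => i % 3).Subscribe(
    grp => grp.Min().Subscribe(minValue =>
    {
        minByKey[grp.Key] = minValue;
        Console.WriteLine("{0} min value = {1}", grp.Key, minValue);
    }),
    () => Console.WriteLine("Completed"));
```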
The code above would work, but it is not good practice to have these nested subscribe calls. We have lost control of the nested subscription, and it is hard to read. When you find yourself creating nested subscriptions, you should consider how to apply a better pattern. In this case we can use SelectMany which we will look at in the next chapter.
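As a preview, the SelectMany alternative might look like the sketch below, assuming the System.Reactive package. The tuple element names Key and Min are chosen here for illustration.

```csharp
using System;
using System.Reactive.Linq;

var values = Observable.Range(0, 10); // 0 through 9

// Flatten the per-group minimums into one sequence of (key, minimum) pairs.
var minPerKey = values
    .GroupBy(i => i % 3)
    .SelectMany(grp => grp.Min().Select(min => (Key: grp.Key, Min: min)));

// A single subscription now covers every partition.
minPerKey.Subscribe(
    x => Console.WriteLine("key {0} min value = {1}", x.Key, x.Min),
    () => Console.WriteLine("Completed"));
```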
The concept of a sequence of sequences can be somewhat overwhelming at first, especially if both sequence types are IObservable. While it is an advanced topic, we will touch on it here as it is a common occurrence with Rx. I find it easier if I can conceptualize a scenario or example to understand concepts better.
Examples of Observables of Observables:
- Partitions of Data
- You may partition data from a single source so that it can easily be filtered and shared to many sources. Partitioning data may also be useful for aggregates as we have seen. This is commonly done with the GroupBy operator.
- Online Game servers
- Consider a sequence of servers. New values represent a server coming online. The value itself is a sequence of latency values allowing the consumer to see real time information of quantity and quality of servers available. If a server went down then the inner sequence can signify that by completing.
- Financial data streams
- New markets or instruments may open and close during the day. These would then stream price information and could complete when the market closes.
- Chat Room
- Users can join a chat (outer sequence), leave messages (inner sequence) and leave a chat (completing the inner sequence).
- File watcher
- As files are added to a directory they could be watched for modifications (outer sequence). The inner sequence could represent changes to the file, and completing an inner sequence could represent deleting the file.
Considering these examples, you can see how useful it is to have the concept of nested observables. There is a suite of operators that work very well with nested observables, such as SelectMany, Merge and Switch, that we look at in future chapters.
When working with nested observables, it can be handy to adopt the convention that a new sequence represents a creation (e.g. a new partition is created, a new game host comes online, a market opens, a user joins a chat, a file is created in a watched directory). You can then adopt a convention for what a completed inner sequence represents (e.g. a game host goes offline, a market closes, a user leaves a chat, a watched file is deleted). The great thing with nested observables is that a completed inner sequence can effectively be restarted by creating a new inner sequence.
In this chapter we are starting to uncover the power of LINQ and how it applies to Rx. We chained methods together to recreate the effect that other methods already provide. While this is academically nice, it also allows us to start thinking in terms of functional composition. We have also seen that some methods work nicely with certain types: First() + BehaviorSubject<T>, Single() + AsyncSubject<T>, etc. We have covered the second of our three classifications of operators, catamorphism. Next we will discover more methods to add to our functional composition tool belt and also find how Rx deals with our third functional concept, bind.
Consolidating data into groups and aggregates enables sensible consumption of mass data. Fast moving data can be too overwhelming for batch processing systems and human consumption. Rx provides the ability to aggregate and partition on the fly, enabling real-time reporting without the need for expensive CEP or OLAP products.