Bart was a Visual C# MVP, now he works at Microsoft on the WPF dev team as Software Developer Engineer. Prior to this new challenge, Bart was active in the Belgian community on many Microsoft technologies, most of the time focusing on CLR, languages innovation and frameworks. In his role, he's been speaking at various events and attended several international conferences including TechEd Europe, IT Forum and the PDC. Bart is a DZone MVB and is not an employee of DZone and has posted 24 posts at DZone. You can read more from them at their website. View Full User Profile

SelectMany: Probably The Most Powerful LINQ Operator

08.20.2008
| 43792 views |
  • submit to reddit

Hi there back again. Hope everyone is already exploiting the power of LINQ on a fairly regular basis. Okay, everyone knows by now how simple LINQ queries with a where and select (and orderby, and Take and Skip and Sum, etc) are translated from a query comprehension into an equivalent expression for further translation:

from p in products where p.Price > 100 select p.Name

becomes

products.Where(p => p.Price > 100).Select(p => p.Name)

All blue syntax highlighting has gone; the compiler is happy with what remains and takes it from there in a left-to-right fashion (so, it depends on the signature of the found Where method whether or not we take the route of anonymous methods or, in case of an Expression<…> signature, the route of expression trees).

But let’s make things slightly more complicated and abstract:

from i in a
from j in b
where i > j
select i + j

It’s more complicated because we have two from clauses; it’s more abstract because we’re using names with no intrinsic meaning. Let’s assume a and b are IEnumerable<int> sequences in what follows. Actually what the above query means in abstract terms is:

(a X b).Where((i, j) => i > j).Select((i, j) => i + j)

where X is a hypothetical Cartesian product operator, i.e. given a = { 1, 4, 7 } and b = { 2, 5, 8 }, it produces { (1,2), (1,5), (1,8), (4,2), (4,5), (4,8), (7,2), (7,5), (7,8) }, or all the possible pairs with elements from the first sequence combined with an element from the second sequence. For the record, the generalized from of such a pair – having any number of elements – would be a tuple. If we would have this capability, Where would get a sequence of such tuples, and it could identify a tuple in its lambda expression as a set of parameters (i, j). Similarly, Select would do the same and everyone would be happy. You can verify the result would be { 6, 9, 12 }.

Back to reality now: we don’t have the direct equivalent of Cartesian product in a form that produces tuples. In addition to this, the Where operator in LINQ has a signature like this:

IEnumerable<T> Where<T>(this IEnumerable<T> source, Func<T, bool> predicate)

where the predicate parameter is a function of one – and only one – argument. The lambda (i, j) => i > j isn’t compatible with this since it has two arguments. A similar remark holds for Select. So, how can we get around this restriction? SelectMany is the answer.

Demystifying SelectMany

What’s the magic SelectMany all about? Where could we better start our investigation than by looking at one of its signatures?

IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
this IEnumerable<TSource> source,
Func<TSource, IEnumerable<TCollection>> collectionSelector,
Func<TSource, TCollection, TResult> resultSelector)

Wow, might be a little overwhelming at first. What does it do? Given a sequence of elements (called source) of type TSource, it asks every such element (using collectionSelector) for a sequence of – in some way related – elements of type TCollection. Next, it combines the currently selected TSource element with all of the TCollection elements in the returned sequence and feed it in to resultSelector to produce a TResult that’s returned. Still not clear? The implementation says it all and is barely three lines:

foreach (TSource item in source)
foreach (TCollection subItem in collectionSelector(item))
yield return resultSelector(item, subItem);

This already gives us a tremendous amount of power. Here’s a sample:

products.SelectMany(p => p.Categories, (p, c) => p.Name + “ has category “ + c.Name)

How can we use this construct to translate multiple from clauses you might wonder? Well, there’s no reason the function passed in as the first argument (really the second after rewriting the extension method, i.e. the collectionSelector) uses the TSource argument to determine the IEnumerable<TCollection> result. For example:

products.SelectMany(p => new int[] { 1, 2, 3 }, (p, i) => p.Name + “ with irrelevant number “ + i)

will produce a sequence of strings like “Chai with irrelevant number 1”, “Chai with irrelevant number 2”, “Chai with irrelevant number 3”, and similar for all subsequent products. This sample doesn’t make sense but it illustrates that SelectMany can be used to form a Cartesian product-like sequence. Let’s focus on our initial sample:

var a = new [] { 1, 4, 7 };
var b = new [] { 2, 5, 8 };

from i in a
from j in b
select i + j;

I’ve dropped the where clause for now to simplify things a bit. With our knowledge of SelectMany above we can now translate the LINQ query into:

a.SelectMany(i => b, …)

This means: for every i in a, “extract” the sequence b and feed it into …. What’s the …’s signature? Something from a (i.e. an int) and something from the result of the collectionSelector (i.e. an int from b), is mapped onto some result. Well, in this case we can combine those two values by summing them, therefore translating the select clause in one go:

a.SelectMany(i => b, (i, j) => i + j)

What happens when we introduce a seemingly innocent where clause in between?

from i in a
from j in b
where i > j
select i + j;

The first two lines again look like:

a.SelectMany(i => b, …)

However, going forward from there we’ll need to be able to reference i (from a) and j (from b) in both the where and select clause that follow but both the corresponding Where and Select methods only take in “single values”:

IEnumerable<TSource> Where<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate);
IEnumerable<TResult> Select<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, TResult> projection);

So what can we do to combine the value i and j into one single object? Right, use an anonymous type:

a.SelectMany(i => b, (i, j) => new { i = i, j = j })

This produces a sequence of objects that have two public properties “i” and “j” (since it’s anonymous we don’t care much about casing, and indeed the type never bubbles up to the surface in the query above, because of what follows:

a.SelectMany(i => b, (i, j) => new { i = i, j = j }).Where(anon => anon.i > anon.j).Select(anon => anon.i + anon.j)

In other words, all references to i and j in the where and select clauses in the original query expression have been replaced by references to the corresponding properties in the anonymous type spawned by SelectMany.

Lost in translation

This whole translation of this little query above puts quite some work on the shoulder of the compiler (assuming a and b are IEnumerable<int> and nothing more, i.e. no IQueryable<int>):

  • The lambda expression i => b captures variable b, hence a closure is needed.
  • That same lambda expression acts as a parameter to SelectMany, so an anonymous method will be created inside the closure class.
  • For new { i = i, j = j } an anonymous type needs to be generated.
  • SelectMany’s second argument, Where’s first argument and Select’s first argument are all lambda expressions that generate anonymous methods as well.

As a little hot summer evening exercise, I wrote all of this plumbing manually to show how much code would be needed in C# 2.0 minus closures and anonymous methods (more or less C# 1.0 plus generics). Here’s where we start from:

class Q
{
IEnumerable<int> GetData(IEnumerable<int> a, IEnumerable<int> b)
{
return from i in a
from j in b
where i > j
select i + j;
}
}

This translates into:

class Q
{
IEnumerable<int> GetData(IEnumerable<int> a, IEnumerable<int> b)
{
Closure0 __closure = new Closure0();
__closure.b = b;

return Enumerable.Select(
Enumerable.Where(
Enumerable.SelectMany(
a,
new Func<int, IEnumerable<int>>(__closure.__selectMany1),
new Func<int, int, Anon0<int, int>>(__selectMany2)
),
new Func<Anon0<int, int>, bool>(__where1)
),
new Func<Anon0<int, int>, int>(__select1)
);
}

private class Closure0
{
public IEnumerable<int> b;

public IEnumerable<int> __selectMany1(int i)
{
return b;
}
}

private static Anon0<int, int> __selectMany2(int i, int j)
{
return new Anon0<int, int>(i, j);
}

private static bool __where1(Anon0<int, int> anon)
{
return anon.i > anon.j;
}

private static int __select1(Anon0<int, int> anon)
{
return anon.i + anon.j;
}
}

private class Anon0<TI, TJ> // generics allow reuse of type for all anonymous types with 2 properties, hence the use of EqualityComparers in the implementation
{
private readonly TI _i;
private readonly TJ _j;

public Anon0(TI i, TJ t2) { _i = i; _j = j; }

public TI i { get { return _i; } }
public TJ j { get { return _j; } }

public override bool Equals(object o)
{
Anon0<TI, TJ> anonO = o as Anon0<TI, TJ>;
return anonO != null && EqualityComparer<TI>.Default.Equals(_i, anonO._i) && EqualityComparer<TJ>.Default.Equals(_j, anonO._j);
}

public override int GetHashCode()
{
return EqualityComparer<TI>.Default.GetHashCode(_i) ^ EqualityComparer<TJ>.Default.GetHashCode(_j); // lame quick-and-dirty hash code
}

public override string ToString()
{
return “( i = “ + i + “, j = ” + j + “ }”; // lame without StringBuilder
}
}

Just a little thought… Would you like to go through this burden to write a query? “Syntactical sugar” might have some bad connotation to some, but it can be oh so sweet baby!

Bind in disguise

Fans of “monads”, a term from category theory that has yielded great results in the domain of functional programming as a way to make side-effects explicit through the type system (e.g. the IO monad in Haskell), will recognize SelectMany’s (limited) signature to match the one of bind:

IEnumerable<TResult> SelectMany<TSource, TResult>(
this IEnumerable<TSource> source,
Func<TSource, IEnumerable<TResult>> collectionSelector)

corresponds to:

(>>=) :: M x –> (x –> M y) –> M y

Which is Haskell’s bind operator. For those familiar with Haskell, the “do” notation – that allows the visual illusion of embedding semi-colon curly brace style of “imperative programming” in Haskell code – is syntactical sugar on top of this operator, defined (recursively) as follows:

do { e }            = e
do { e; s } = e >>= \_ –> do { s }
do { x <- e; s } = e >>= (\x –> do { s })
do { let x = e; s } = let x = e in do { s }

Rename to SelectMany, replace M x by IEnumerable<x> and assume a non-curried form and you end up with:

SelectMany :: (IEnumerable<x>, x –> IEnumerable<y>) –> IEnumerable<y>

Identifying x with TSource, y with TResult and turning a –> b into Func<a, b> yields:

SelectMany :: Func<IEnumerable<TSource>, Func<TSource, IEnumerable<TResult>>, IEnumerable<TResult>>

and you got identically the same signature as the SelectMany we started from. For the curious, M in the original form acts as a type constructor, something the CLR doesn’t support since it lacks higher-order kinded polymorphism; it’s yet another abstraction one level higher than generics that math freaks love to use in category theory. The idea is that if you can prove laws to be true in some “structure” and you can map that structure onto an another “target structure” by means of some mapping function, corresponding laws will hold true in the “target structure” as well. For instance:

({ even, odd }, +)

and

({ pos, neg }, *)

can be mapped onto each other pairwise and recursively, making it possible to map laws from the first one to the second one, e.g.

even + odd –> odd
pos * neg –> neg

This is a largely simplified sample of course, I’d recommend everyone who’s interested to get a decent book on category theory to get into the gory details.

A word of caution

Now that you know how SelectMany works, can you think of a possible implication when selecting from multiple sources? Let me give you a tip: nested foreachs. This is an uninteresting sentence that acts as a placeholder in the time space while you’re thinking about the question. Got it? Indeed, order matters. Writing the following two lines of code produces a different query with a radically different execution pattern:

from i in a from j in b …
from j in b from i in a …

Those roughly correspond to:

foreach (var i in a)
foreach (var j in b)

versus

foreach (var j in b)
foreach (var i in a)

But isn’t this much ado about nothing? No, not really. What if iterating over b is much more costly than iterating over a? For example,

from p in localCollectionOfProducts
from c in sqlTableOfCategories

This means that for every product iterated locally, we’ll reach out to the database to iterate over the (retrieved) categories. If both were local, there wouldn’t be a problem of course; if both were remote, the (e.g.) SQL translation would take care of it to keep the heavy work on the remote machine. If you want to see the difference yourself, you can use the following simulation:


using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

class Q
{
static void Main()
{
Stopwatch sw = new Stopwatch();

Console.WriteLine("Slow first");
sw.Start();
foreach (var s in Perf<int,char>(Slow(), Fast()))
Console.WriteLine(s);
sw.Stop();
Console.WriteLine(sw.Elapsed);

sw.Reset();

Console.WriteLine("Fast first");
sw.Start();
foreach (var s in Perf<char,int>(Fast(), Slow()))
Console.WriteLine(s);
sw.Stop();
Console.WriteLine(sw.Elapsed);
}

static IEnumerable<string> Perf<S,T>(IEnumerable<S> a, IEnumerable<T> b)
{
return from i in a
from j in b
select i + "," + j;
}

static IEnumerable<int> Slow()
{
Console.Write("Connecting... ");
Thread.Sleep(2000); // mimic query overhead (e.g. remote server)
Console.WriteLine("Done!");
yield return 1;
yield return 2;
yield return 3;
}

static IEnumerable<char> Fast()
{
return new [] { 'a', 'b', 'c' };
}
}

This produces:

Obviously, it might be the case you’re constructing a query that can only execute by reaching out to the server multiple times, e.g. because order of the result matters (see screenshot above for an illustration of the ordering influence – but some local sorting operation might help too in order to satisfy such a requirement) or because the second query source depends on the first one (from i in a from j in b(i) …). There’s no silver bullet for a solution but knowing what happens underneath the covers certainly provides the necessary insights to come up with scenario-specific solutions.

Happy binding!

References
Published at DZone with permission of Bart De Smet, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)