Who Ever Said LINQ Predicates Need To Be Boolean-Valued?

Back in the wonderful world of LINQ. This time I’d like to highlight one of the most important properties of LINQ: its flexibility. On various occasions I’ve already stated that the potential of LINQ is infinity², the reason for it being the infinite fan-in (LINQ though C#, VB, F#, PowerShell, various transport mechanisms, etc) multiplied by the infinite fan-out (LINQ to SQL, Entities, XML, Objects, DataSets, SharePoint, Active Directory, Amazon, etc). In this post we forget about the former infinity factor and move on to the latter, where we have a whole spectrum of query provider implementations. The spectrum is bordered by two extremes:

  • Completely local, through IEnumerable<T> (some will debate whether this is a query provider). LINQ to Objects and XML fit in this bucket. The cost to implement this is virtually zero, you just need to make sure your API speaks in terms of IEnumerable<T> (e.g. in LINQ to XML we have methods like Descendants that return sequences of XML nodes). On the flip side though, you can’t delegate execution of the query to another execution engine, e.g. by “remoting” the query to SQL.
  • Completely remotable, through IQueryable<T>. LINQ to SQL, SharePoint, Active Directory, Amazon, etc fall under this umbrella. Implementation cost is higher for this type of providers as the expression tree produced by the IQueryable extension methods needs to be turned into whatever domain-specific language you want to target, e.g. SQL, CAML, LDAP, web service calls, etc. However, the potential to control and delegate execution is huge.

Notice the italics for the word remotable in the latter category: the provider shouldn’t necessarily remote (in the broadest sense of the word, from cross-process to cross-machine) the execution, instead it can be a transformer that rewrites expression trees to optimize or decompose them, etc. Similarly, the provider could be intelligent about distribution of the query execution: maybe it can divide-and-conquer, splitting the work across servers or doing part of the work locally. I’m not a huge fan of the latter “intelligence” (because it becomes too “smart”) as such providers tend to be dishonest about where executes what:

var res = (from p in products where p.Price > 100 select new { Name = p.ProductName }).Take(5);

Maybe the provider executes the where-predicate locally, meaning we’d suck in too many rows. Or it might not know how to do a projection remotely, causing us to suck in too many columns. Similarly, the Take operation might not translate well in the target domain causing similar data transmission overhead worries. Personally I prefer explicit transitioning markers using AsEnumerable to clearly delimit the borders of local versus remote execution. Obviously, when remote and local are relatively close to each other (e.g. cross-process) this type of “smartness” might be well-justified (e.g. LINQ to MSI queries a local file; or LINQ to SharePoint might run in an ASP.NET application on the same machine as WSS; or LINQ to SQL could potentially and hypothetically run inside the database relational engine itself). Back to where we were though...

LINQ as a language pattern

In between those two extremes there’s a whole range of possibilities because of the way LINQ really works: as a front-end language pattern, just like foreach, lock, using, etc are patterns on top of other stuff (IEnumerable, Monitor, IDisposable, etc). What does that imply? The compiler just knows about LINQ as a set of keywords that can be turned into specific method calls, which happen to reflect query operators. To provide some context, refer to my earlier post on “C# 3.0 Query Expression Translation Cheat Sheet” where you’ll see translation rules like:

from x in e where f...

becomes

from x in (e).Where(x => f)...

That’s just it, and nothing more. Keeping the title of this post in mind, do you see any types whatsoever being expressed here? No, the only requirement for the Where method (when invoked like an instance method on whatever e’s type is) is to take in something with a type compatible with “x => f” where f’s type can be inferred of course because f is an expression. Notice how I said “like an instance method”: the method call just follows the rules of method (overload) resolution, including – but not limited to – C# 3.0 extension methods when everything else fails. The only thing defined here is the type of f: the type of x can be anything (in most typical cases though it corresponds to the entity type from the fed in query expression in e) and the same holds for the return type of Where.

Similarly, translation rules like:

from x in e select v

becomes

(e).Select(x => v)

are equally flexible with regards to typing. From this point of view we could (and should) see LINQ as just another language pattern around certain methods, not necessarily related to a specific interface type (therefore establishing a sort of limited duck typing), just like foreach only needs a “compatible GetEnumerator method” (see §8.8.4 of the C# 3.0 language specification, second step of the determination process during foreach-compilation). Noteworthy is a counter-example where an interface type is required to use a specific language feature: the using statement requires an IDisposable object to operate on. The key takeaway though is that LINQ falls in the former category and just needs the right methods to translate away the query expression keywords.

A different view on predicates

Let’s zoom in and investigate just one query operator: where. As we saw before it maps onto a Where method that takes in a lambda of the form x => f. In here, x can be seen as an input element in the source sequence being processed, while f acts as the body of the predicate, establishing a filter condition. Although one could misuse the where operator to implement totally different semantics, we’ll assume we’re dealing with filtering capabilities, i.e. an operator that restricts the returned elements from a given sequence based on some condition also known as a predicate. As we discussed before, f shouldn’t necessarily be Boolean-typed, although the two common implementations – IEnumerable and IQueryable – define it as such:

IEnumerable<T> Where<T>(this IEnumerable<T> source, Func<T, bool> predicate)
IQueryable<T> Where<T>(this IQueryable<T> source, Expression<Func<T, bool>> predicate)

This means when using sources e of any of the above types (ignoring a little caveat I’ll discuss below), the translation for the where operator:

from x in (e).Where(x => f)...

will bind x to T and f to bool, which gets enforced at compile-time (remember co- and contravariance rules for delegates, Func<T, bool> is just one of those). What’s the caveat I was hinting at? Well, first of all (incredibly obvious, I admit) is that this will only work if the Where extension method is brought in scope by means of the using keyword. LINQ doesn’t care about extension methods at all, but when the front-end language compiler rewrites the query expressions into chains of methods call, method resolution kicks in where extension methods are the fallback plan. That brings us to the real caveat: extension methods are only “fallbacks” and instance methods take precedence. Therefore instance methods have the capacity to “hide” extension methods with the same name and signature:

class MyList<T> : IEnumerable<T>
{
public IEnumerable<T> Where(Func<T, bool> predicate)
{

}

public IEnumerator<T> GetEnumerator()
{

}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{

}
}

When using this class – even with System.Linq (and therefore the extension methods in System.Linq.Enumerable) in scope – you’ll see that our self-defined Where operator takes precedence:

while another overload is still available through the extension methods:

his simple fact is easy to overlook but it’s incredibly important to leverage the full spectrum of possible LINQ provider implementations. Maybe you want all of the LINQ to Objects stuff to “just work” except for a few operators you want to implement yourself. Equally you could override IQueryable behavior when you wish to do so. Needless to say, the above just works fine with the where keyword too. To prove this point, let’s use a simple breakpoint:

var res = from x in new MyList<string>() { "abba" } where x.StartsWith("a") select x;
foreach(var x in res);

As you can see, while iterating over the resulting query, the highlighted lambda is being hit while being called from the MyList<string>.Where method. Quiz: Would setting a breakpoint on “x” in Select reveal something in this particular case? Why (not)?

In contrast, the projection lambda in select will be called by an extension method on IEnumerable<T>. To illustrate this, I’ve tweaked the query a bit (why?) and disabled the “Just my code” feature in Visual Studio:

var res = from x in new MyList<string>() { "abba" } where x.StartsWith("a") select x.ToUpper();
foreach(var x in res);

 

So how can we do even more creative stuff using this technique? What about doing this:

class MyList<T> : IEnumerable<T>
{
public IEnumerable<T> Where(Func<T, Regex> filter)
{
foreach (T item in this)
if (filter(item).Match(item.ToString()).Success)
yield return item;
}

public IEnumerator<T> GetEnumerator()
{

}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{

}
}

Does this work? You bet :-). Actually we don’t even need to be IEnumerable<T>, but let’s keep that for later. Here’s how you can use it:

var res = from x in new MyList<string>() { "abba", "baab" } where new Regex("^a.*a$") select x;
foreach (var x in res);

This will only return abba in the foreach loop. Notice we’re actually doing nothing with the input of the “filter” function argument, but we could by referring to x in the where clause. In this case, most likely we only want to process strings (as a regular expression operates on strings) and the regular expression won’t be variant based on the input. To make it a bit more concrete, here’s a sample of a dictionary:

class Dictionary : IEnumerable<string>
{
private List<string> _lst = new List<string>();

public IEnumerable<string> Where(Func<object, Regex> filter)
{
Regex predicate = filter(null);
foreach (string word in _lst)
if (predicate.Match(word).Success)
yield return word;
}

public IEnumerator<string> GetEnumerator()
{
return _lst.GetEnumerator();
}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return _lst.GetEnumerator();
}

public void Add(string word)
{
_lst.Add(word);
}
}

Because we derive from IEnumerable<T>, we get all the LINQ to Objects operators for free, including the other Where overloads:

So now we can query our dictionary using a “regular” Boolean-valued predicate (with two distinct overloads: Func<string, bool> and Func<string, int, bool> where int represents the position in the original sequence), but also using a regular expression based predicate.

Taking it across the border: IQueryable (or something like it)

What' we’ve done before was just relying on local execution using IEnumerable’s LINQ to Objects extensions and some of our overrides and/or “query operator” overloads. In an equally simple manner we can make this work with remotable queries using IQueryable; the only difference would be in the signature of the Where operator. To scale things down a bit and avoid the burden of IQueryable’s “gang of three” properties, we’ll cook our own little query model that conforms with the LINQ query operators:

class DictionaryService
{
public DictionaryQuery Where(Expression<Func<object, Regex>> filter)
{
return new DictionaryQuery(filter.Body);
}
}

The DictionaryService class acts as an entry-point to our service that’s capable of retrieving words based on regular expressions. It only supports a filtering operator, which will make it work together with LINQ. Notice we’re using an Expression<...> for the predicate so that we’ll get expression trees we can parse at runtime in order to perform some kind of remote execution. We’ll cook our own Regex class for reasons to be explained further on:

class Regex
{
private string _regex;

public Regex(string regex)
{
_regex = regex;
}

public static Regex operator&(Regex regex1, Regex regex2)
{
//
// Just making the expression API happy to use Expression.And.
//
throw new NotSupportedException("Only for use in expression trees.");
}
}

Now we can turn our attention to the remaining DictionaryQuery class. The start is fairly simple:

class DictionaryQuery
{
private Expression _filter;

internal DictionaryQuery(Expression filter)
{
_filter = filter;
}

public DictionaryQuery Where(Expression<Func<object, Regex>> filter)
{
return new DictionaryQuery(Expression.And(_filter, filter.Body));
}

public IEnumerator<string> GetEnumerator()
{
string[] filters = Compile();
string[] results = GetWords(filters);
return results.AsEnumerable().GetEnumerator();
}

Basically, a query consists of a filter that represents the aggregation of where filter clauses. This is stored in _filter. To allow for the user to specify multiple conditions, we have a Where operator available on the query object itself as well. All this operator does in this case is creating a new filter clause that “ands” together the original filter and the new filter. To do this, we just grab the passed in new filter’s body and hook it up with the original filter in a BinaryExpression of node type And. Notice we’re a bit lazy here and ideally we’d walk the tree of the passed in filter’s body to make sure it’s lambda parameter free (left as an exercise, tip: use a visitor). Expression.And will (statically) verify whether there’s an “And” (& in C# speak) operator available for use with the two specified arguments, hence we need our operator overload on Regex:

public static Regex operator&(Regex regex1, Regex regex2)
{
//
// Just making the expression API happy to use Expression.And.
//
throw new NotSupportedException("Only for use in expression trees.");
}

Notice this one is solely meant to be used in expression trees and doesn’t do anything useful. As a little exercise for the reader, what would it take to implement the && operator (corresponds to AndAlso in expression trees)? What would be the implications concerning the semantics and the evaluation strategy on the target service? Can we map the semantics nicely and if so, how?

The more important thing here though is the fact we’re implementing a GetEnumerator method. This will allow the user to iterate over the query results, which means we must translate the _filter into some service call to fetch results. This happens in a couple of stages: first we compile _filter into something the service understands (in this case just an array of strings) and next we send the compiled result (which we could cache if we want, so that subsequent iterations won’t trigger compilation again) to a service to process the request, getting the results back to iterate over. Question for the reader: why don’t we implement IEnumerable<string> here? Would it be desirable? Explain the pros and cons.

Question remains how the compilation happens. Here it is:

private string[] Compile()
{
return Compile(_filter).ToArray();
}

private IEnumerable<string> Compile(Expression filter)
{
switch (filter.NodeType)
{
case ExpressionType.And:
{
BinaryExpression binEx = (BinaryExpression)filter;
foreach (string regex in Compile(binEx.Left).Concat(Compile(binEx.Right)))
yield return regex;
}
break;
case ExpressionType.New:
{
NewExpression newEx = (NewExpression)filter;
if (newEx.Type != typeof(Regex))
throw new NotSupportedException("Invalid leaf node type in query: " + newEx.Type);

//
// Only one constructor is public.
//
Expression arg = newEx.Arguments[0];
if (arg.NodeType != ExpressionType.Constant)
throw new NotSupportedException("Regex constructor uses in query should have a constant argument value.");

ConstantExpression constEx = (ConstantExpression)arg;
string value = (string)constEx.Value;
if (value == null)
yield break;
else
yield return value;
}
break;
default:
throw new NotSupportedException("Invalid expression type in query: " + filter.NodeType);
}
}

This is fairly straightforward as well. We just expect two types of nodes: BinaryExpression nodes of type And used as the query composition operator, and NewExpression nodes for the leaf nodes (although leaf isn’t strictly true, why?) representing the atoms of the query predicate. Most code is boilerplate error checking but the points of interest are:

  • The use of iterators to produce a stream of strings representing the individual regular expressions. Notice the use of recursion in the And case (unfortunately we don’t have a yield foreach construct yet).
  • How we get the the value of the regular expression on the leaf nodes, through the first argument on the constructor. Actually, we don’t even need the private field in Regex in this case as we’re purely using it for expression trees and we can get to it through the constructor’s first argument in the expression tree.

To show what the tree looks like so far, consider the following query:

var dict = new DictionaryService();
var res = from word in dict
where new Regex("^A.*a$")
where new Regex(".*m.*")
select word;

foreach (var x in res);

Setting a breakpoint on Compile allows us to inspect the filter:

Here we just see the ToString representation but when clicking the magnifier glass (assuming the Expression Tree Visualizer sample is installed), we get to see the entire tree:

Notice where our information of interest lives and how our expression tree compiler extracts it carefully from the tree. Another little quiz for the reader: is the query below equivalent to the one above (watch out for details)? Why (not)?

var dict = new DictionaryService();
var res = from word in dict
where new Regex("^A.*a$") && new Regex(".*m.*")
select word;

Last but not least we should take a look at our service itself. Obviously you could imagine any real service technology being used, such as WCF, but let’s stick with some simple fake local execution:

private string[] GetWords(string[] regexs)
{
//
// This acts as an imaginary remote service.
//
var conditions = new List<System.Text.RegularExpressions.Regex>();
foreach (string regex in regexs)
conditions.Add(new System.Text.RegularExpressions.Regex(regex));

var res = from a in AppDomain.CurrentDomain.GetAssemblies()
from t in a.GetTypes()
from m in t.GetMethods()
let word = m.Name
where conditions.All(r => r.Match(word).Success)
select word;

return res.ToArray();
}

The biggest source of in-memory data is without doubt reflection, so let’s use it :-). Here we’re using real regular expressions (Questions: Why did we wrap them in the first place? Could we use some serialization mechanism? Explain!) to do the matching as part of a LINQ query (regular LINQ to Objects this time – it would be interesting to use our “LINQ-by-Regex” implementation recursively :-)). Notice how we return an array which has implications for the laziness when consuming data. Depending on the transport protocol the implementation could be streaming data to the client, so that the client can terminate the request at any point in time (when breaking from the foreach loop for instance). Or we could implement some kind of paging to retrieve the results in batches.

On my computer the following query:

var dict = new DictionaryService();
var res = from word in dict
where new Regex("^A.*a$")
where new Regex(".*m.*")
select word;

returns

"AutoIncrementCannotSetIfHasData"
"AddSchema"

when running it in the context of a Windows Forms application created with Visual Studio 2008.

Exercise: Think of ways to simplify this regular-expression based (remote!) querying capability, especially with regards to the end-user writing queries targeting it. Can we get rid of the “Regex” word in there? Why (not)? How could we compile other Boolean operators to compose more complex queries and send them across to the server (tip: *fix).

References
0

Bart was a Visual C# MVP, now he works at Microsoft on the WPF dev team as Software Developer Engineer. Prior to this new challenge, Bart was active in the Belgian community on many Microsoft technologies, most of the time focusing on CLR, languages innovation and frameworks. In his role, he's been speaking at various events and attended several international conferences including TechEd Europe, IT Forum and the PDC. Bart is a DZone MVB and is not an employee of DZone and has posted 13 posts at DZone. You can read more from them at their website.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

chase.saunders replied on Fri, 2008/09/19 - 9:25am

First off, I want to thank you for this article because it sharpened my understanding of LINQ plumbing.  But I'm not sure what the payoff is of re-implementing Where semantics. 

As far as I can tell, nothing is gained by overlaoding Where to take a regex but more work.  After all, it's not like you've escaped the semantics of Where that require a relational expression to be evaluated as true or false... you're writing the same code you would write with the built-in Where clause (namely, calling the boolean values Success function of regex) AND reimplementing builtin logic of Where.  That's a lot of work, with some potential risks, in order to write code inside a class instead of writing similar code inside a lambda expression. 

 

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.