Introduction
I am solving the following problem:
- Given a C# type X with one or more publicly visible constants, what are all the C# types in the solution that depend on the constants in the type X?
Because constants are inlined at compile time, there is no point inspecting the binaries. We need to inspect the source code using the Roslyn API.
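To illustrate the inlining point: a const is emitted as a literal field, and each consumer gets the value copied into its own IL with no reference back to the declaring type, whereas a static readonly field is still referenced at run time. The following is a minimal sketch, not part of my tool; the type X and the helper ConstantLister are hypothetical names used only to show how the public constants of a type can be enumerated via reflection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for "type X" from the problem statement.
public static class X
{
    public const int MaxRetries = 3;          // const: value is copied into each consumer at compile time
    public const string Prefix = "cfg_";      // const: same
    public static readonly int Timeout = 30;  // static readonly: consumers reference the field at run time
}

// Hypothetical helper, named for illustration only.
public static class ConstantLister
{
    // Returns the names of the public constants declared directly on the given type.
    // A C# const compiles to a static literal field: IsLiteral is true, IsInitOnly is false.
    public static IReadOnlyList<string> GetPublicConstantNames(Type type) =>
        type.GetFields(BindingFlags.Public | BindingFlags.Static | BindingFlags.DeclaredOnly)
            .Where(f => f.IsLiteral && !f.IsInitOnly)
            .Select(f => f.Name)
            .OrderBy(n => n, StringComparer.Ordinal)
            .ToList();
}
```

Calling ConstantLister.GetPublicConstantNames(typeof(X)) returns the two const names and skips Timeout, since only the const fields are the ones whose usages disappear from the compiled consumers.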
I am not going to use Semantic Analysis, because it would be very expensive. Instead I will use regex to check if the given file seems to use the constants and then verify using the Syntax Tree. Not bulletproof, but good enough and relatively fast.
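The regex pre-filter can be pictured roughly as follows. This is only a sketch under assumptions: the real pattern is produced by GetRegex(...) and is not shown here, and the class ConstantUsagePrefilter and the "TypeName.ConstName" shape of the pattern are hypothetical. It also ignores unqualified usages (inside the type itself, or through using static), which is one reason the approach is not bulletproof.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class ConstantUsagePrefilter
{
    // Builds a regex that matches "TypeName.ConstName" for any of the given constant names.
    public static Regex Build(string typeName, IEnumerable<string> constantNames)
    {
        var alternatives = string.Join("|", constantNames.Select(n => Regex.Escape(n)));
        var pattern = $@"\b{Regex.Escape(typeName)}\s*\.\s*(?:{alternatives})\b";
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.CultureInvariant);
    }

    // Cheap textual check: true means "worth parsing", not "definitely uses a constant".
    public static bool MayUseConstants(Regex regex, string sourceText) => regex.IsMatch(sourceText);
}
```

A match inside a comment or a string literal still passes this check, which is why every hit is then verified against the Syntax Tree before a type is reported.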
Right now the stats are:
- Project Count: 333
- File Count: 45280
- SSD drive
My implementation
The overall scheme is:
- Produce a stream of the relevant Microsoft.Build.Evaluation.Project objects, which would give us the list of C# files
- From the C# files produce a stream of C# file content
- For each C# file content match the given regex and for every match determine if the relevant constant is used using the Syntax Tree of that C# file content. If affirmative - report the respective type.
I do not want to overload the question with details, but I need to provide enough of them to explain my difficulty. There is not much; please bear with me.
I use several small auxiliary types:
ProjectItem
private class ProjectItem
{
    public readonly string AssemblyName;
    public readonly CSharpParseOptions ParseOptions;
    public readonly IEnumerable<string> CSFilePaths;

    public ProjectItem(TypeMap typeMap, string asmName)
    {
        AssemblyName = asmName;
        var asmProps = typeMap.Assemblies[asmName];
        ParseOptions = asmProps.GetParseOptions();
        CSFilePaths = new Project(asmProps.ProjectPath).GetItems("Compile").Select(item => item.GetMetadataValue("FullPath"));
    }

    public IEnumerable<TextItem> YieldTextItems() => CSFilePaths.Select(csFilePath => new TextItem(this, csFilePath, File.ReadAllText(csFilePath)));
}
Where TypeMap is a registry of all the types and assemblies used in our solutions. Some other code has built it previously. Consider it an oracle that can answer some questions, like "Give me the parse options (or project path) for the given assembly". But it does not specify the list of C# files used by the project. For that we need to instantiate the respective Microsoft.Build.Evaluation.Project instance, which is expensive.
TextItem
private class TextItem
{
    public readonly string AssemblyName;
    public readonly CSharpParseOptions ParseOptions;
    public readonly string CSFilePath;
    public readonly string Text;

    public TextItem(ProjectItem item, string csFilePath, string text)
    {
        AssemblyName = item.AssemblyName;
        ParseOptions = item.ParseOptions;
        CSFilePath = csFilePath;
        Text = text;
    }

    public IEnumerable<TypeDefKey> YieldDependentTypes(TypeMap typeMap, TypeDefKey constTypeDefKey, Regex regex)
    {
        ...
        SyntaxTree syntaxTree = null;
        foreach (Match m in regex.Matches(Text))
        {
            if (syntaxTree == null)
            {
                syntaxTree = CSharpSyntaxTree.ParseText(Text, ParseOptions, CSFilePath);
                ...
            }
            ...
            if (IsTheRegexMatchIndeedCorrespondsToTheGivenConstantType(syntaxTree, ...))
            {
                var typeDefKey = GetTheType(syntaxTree, ...);
                yield return typeDefKey;
            }
        }
    }
}
Given the aforementioned types, I came up with this simple TPL Dataflow pipeline:
var regex = GetRegex(...);
var dependentAssemblies = GetDependentAssemblies(...);
var dependentTypes = new ConcurrentDictionary<TypeDefKey, object>();
var produceCSFilePaths = new TransformManyBlock<ICollection<string>, ProjectItem>(asmNames => asmNames.Select(asmName => new ProjectItem(typeMap, asmName)));
var produceCSFileText = new TransformManyBlock<ProjectItem, TextItem>(p => p.YieldTextItems());
var produceDependentTypes = new TransformManyBlock<TextItem, TypeDefKey>(t => t.YieldDependentTypes(typeMap, constTypeDefKey, regex));
var getDependentTypes = new ActionBlock<TypeDefKey>(typeDefKey => dependentTypes.TryAdd(typeDefKey, null));
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
produceCSFilePaths.LinkTo(produceCSFileText, linkOptions);
produceCSFileText.LinkTo(produceDependentTypes, linkOptions);
produceDependentTypes.LinkTo(getDependentTypes, linkOptions);
produceCSFilePaths.Post(dependentAssemblies);
produceCSFilePaths.Complete();
getDependentTypes.Completion.Wait();
The problems and the questions
- It is slow - takes about 50 seconds, CPU utilization is low. I realize there is a lot of IO here, but there is still CPU involved to apply the regex and parse the content into the respective Syntax Trees.
- I do not understand how I can use TransformManyBlock with async IO. The ProjectItem.YieldTextItems function could return an IObservable<TextItem> or an IAsyncEnumerable<TextItem>, but the TransformManyBlock would not recognize it. I am new to TPL Dataflow, so it is unclear to me how to work around this. That is why I am using the blocking File.ReadAllText instead of File.ReadAllTextAsync.
- I think my pipeline uses ThreadPool threads (through the default TaskScheduler), but shouldn't it use the real threads, like the ones created with Task.Factory.StartNew(..., TaskCreationOptions.LongRunning)? So, does it use the "proper" threads? And if not, how do I fix it? I have seen recommendations to implement a custom TaskScheduler, but I could not find an example. The existing ones seem to rely on internal implementation details, so it is not clear how to proceed.
- I tried to increase MaxDegreeOfParallelism for the ProjectItem and TextItem production, since these two are mostly IO and thus much slower than the last part, examining the C# file content. But it did not yield much improvement in performance. It is my understanding that the slower a piece of the pipeline is, the more parallelism it should get. On the other hand, I do not know how much parallelism there can be when reading from an SSD. It is not at all clear how to profile it.
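For reference, the async-IO and MaxDegreeOfParallelism points above can be made concrete with a small standalone sketch. It is not a fix for the pipeline in this question and has not been measured against it. It assumes the work is split into one message per file, so that a TransformBlock with an async delegate can call File.ReadAllTextAsync, and it sets the parallelism through ExecutionDataflowBlockOptions. The class AsyncReadPipeline, the method RunAsync and the ioParallelism parameter are hypothetical names, and the consumer merely records the text length where the real pipeline would apply the regex and the Syntax Tree check.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical sketch, named for illustration only.
public static class AsyncReadPipeline
{
    // Reads every file asynchronously and returns a map of file path -> character count.
    public static async Task<IDictionary<string, int>> RunAsync(IEnumerable<string> filePaths, int ioParallelism)
    {
        var results = new ConcurrentDictionary<string, int>();

        // Up to ioParallelism reads may be in flight at once.
        var ioOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = ioParallelism };

        // One message per file. The async delegate gives the thread back while the read is pending.
        var readFile = new TransformBlock<string, KeyValuePair<string, string>>(
            async path => new KeyValuePair<string, string>(path, await File.ReadAllTextAsync(path)),
            ioOptions);

        // Stand-in for the CPU-bound step (regex plus Syntax Tree verification in the real pipeline).
        var consume = new ActionBlock<KeyValuePair<string, string>>(
            item => { results[item.Key] = item.Value.Length; },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

        readFile.LinkTo(consume, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var path in filePaths)
        {
            await readFile.SendAsync(path);
        }
        readFile.Complete();

        await consume.Completion;
        return results;
    }
}
```

The same async shape is available on TransformManyBlock through its constructor overload that takes a delegate returning Task<IEnumerable<TOutput>>, although that variant still materializes all outputs of one input before passing them on.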