By: Pablo Díaz
My main intention with this article is to explain how I saw an opportunity to rely on Graph Oriented Databases (Graph DBs) to answer questions that kept coming up in our software development process, specifically about keeping track of the dependency graph across all the projects that make up the software solution of a customer we work with.
What is the need?
We currently do not know what dependency graph our whole software solution relies on. This is something developers and the OPS team keep in their heads, but it is never documented: either the initiative was done once and never updated afterwards (so you can consider it a dead document), or the graph was so tiny at the beginning that it was really easy to keep in our heads. Then things started to grow, maybe the initial monolith got broken into independent modules (aka microservices), and keeping track of these dependencies suddenly became a real challenge. This was especially true when generating the release packages: which project or library should be built first so that the dependent ones can continue? It also made it hard to manage the risk of updating a library, because we could not know in advance which other projects would be impacted by, or benefit from, the new library version. You can imagine other scenarios where this would help (as a matter of fact, I will talk about some of them at the end of this article); however, for the sake of simplicity, let’s say these were all of our concerns.
So, wouldn’t it be fantastic to have a graph that depicts all the dependencies between our projects and libraries, and that is also updated frequently, so it can be considered a live and reliable resource?
The solution: Graph DBs to the rescue
As a first approach, I researched whether a solution already existed that worked as a plugin for the tools our team relies on, so that it could magically integrate with both the code versioning tool and the deployment artifact repository.
I do not know if I researched enough, but I did not find anything that worked for me, so I decided to build this integration tool on my own. The most important outcome of this process, for me: it was really fun to do.
I first researched how I could interact with the API of the code versioning tool, which for this specific customer is Bitbucket (a Git repository hosting service, whose API documentation can be found here). I was interested in collecting information about each C# repository that composes the overall solution, so that I could then collect all the library dependencies each csproj file refers to. (I narrowed the research to C# projects because that is what we mostly work with; however, as you can guess, the same technique can be applied to Angular repositories as well as to other languages.) These dependencies are the ones retrieved from the deployment artifact repository when restoring a project, which is what I will talk about next.
Then, I continued by researching how I could interact with the deployment artifact repository, which for this specific customer is Nexus (whose API documentation can be found here). By querying its API, I was able to find out whether a specific artifact (aka library version) still existed in that repository and to learn about its NuGet package dependencies, specifically those that are libraries we created (i.e. not external ones, at least for the moment), so that I could discover the recursive dependency graph that this artifact repository stores.
Next, I researched which tool would be best to correlate the information gathered in the previous steps, and that is where I concluded this was a real scenario where Graph Oriented Databases come to the rescue. In the past, I had played with Neo4j and liked it a lot, especially its Cypher Query Language, which allows you to write queries in a very natural way (I love it!), so I decided to use it for this specific purpose.
Diving into the solution
Once I had the whole plan, I started executing it.
First, I created a C# console application that collected library dependency information for each C# project that our Bitbucket repository contains. This application stores that information in a CSV file (named “repos.csv”). You can find its source code here.
Then, this same console application iterates through each library version collected previously and queries the Nexus API in order to discover the first layer of dependencies. Next, it iterates through each NuGet dependency of each library and queries the Nexus API again for the second layer of dependencies, and so on and so forth. You will see that this algorithm uses a cache strategy for the sake of performance: the same libraries keep showing up frequently, so it was an obvious thing to do. In the end, all this information is saved in a CSV file too (named “libs.csv”).
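The layered lookup with a cache described above can be sketched roughly as follows. This is a minimal illustration in Python, not the console application's actual C# code: the `fetch_dependencies` callback stands in for the Nexus API call, and the library names and versions are made up.

```python
from typing import Callable, Dict, List, Tuple

# (library name, version), e.g. ("Acme.Core", "1.2.0"); all names are made up.
Artifact = Tuple[str, str]


def crawl_dependencies(
    roots: List[Artifact],
    fetch_dependencies: Callable[[Artifact], List[Artifact]],
) -> Dict[Artifact, List[Artifact]]:
    """Walk the dependency graph, asking the repository about each artifact once."""
    cache: Dict[Artifact, List[Artifact]] = {}
    pending = list(roots)
    while pending:
        artifact = pending.pop()
        if artifact in cache:
            continue  # already resolved: no second call to the repository
        dependencies = fetch_dependencies(artifact)
        cache[artifact] = dependencies
        pending.extend(dependencies)
    return cache


# Stand-in for the Nexus lookup (hypothetical data, no network call).
FAKE_REPOSITORY: Dict[Artifact, List[Artifact]] = {
    ("Acme.Api", "2.0.0"): [("Acme.Core", "1.2.0"), ("Acme.Data", "1.0.0")],
    ("Acme.Data", "1.0.0"): [("Acme.Core", "1.2.0")],
    ("Acme.Core", "1.2.0"): [],
}
calls: List[Artifact] = []


def fake_fetch(artifact: Artifact) -> List[Artifact]:
    calls.append(artifact)  # record every simulated repository call
    return FAKE_REPOSITORY.get(artifact, [])


graph = crawl_dependencies([("Acme.Api", "2.0.0")], fake_fetch)
print(len(calls), "lookups for", len(graph), "artifacts")
```

Here `Acme.Core` is referenced twice but looked up only once, which is the point of the cache.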
Later, I used these two files created in the previous process and imported them into a new Neo4j database, by performing a little ETL process with them. In the source repository for this article, you can find the source code of these import queries.
In the end, the overall graph schema looks like this:
Here you can see that the Library node has a recursive relationship with itself. That is the most important reason why relying on a graph database for storage and querying was the right choice: performing queries over such a recursive relationship in other types of databases is a nightmare you would not want to support.
The benefit of this solution: Questions this model can answer
You might come up with different suggestions of what this model can help answer (and I would be very interested in knowing those, so please write them in the comments area of this blog); however, for the sake of brevity, I have thought about a few of them:
1. What is the right order to go with when generating release packages?
When the time comes to create release packages, it is very important to know which libraries should be generated first, according to the overall dependency graph we have just built, so that we can coordinate which team goes first, considering that different teams might own different libraries. A great thing about the Neo4j visualization tool is that it allows you to see this whole dependency graph and analyze it. I find this more natural, closer to the way we connect nodes in our heads, than reading the same data from the flat files we have just created (which would be the way to go if we ever did the same thing with an RDBMS). Here is an example of what this dependency graph looks like for one specific C# project.
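The ordering question itself amounts to a topological sort of the dependency graph. The following is only a sketch on a tiny in-memory graph with made-up library names; it illustrates the logic and is not the Cypher query used for this article.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical libraries: each key maps to the libraries it depends on.
depends_on = {
    "Acme.Api": {"Acme.Data", "Acme.Core"},
    "Acme.Data": {"Acme.Core"},
    "Acme.Core": set(),
}

# TopologicalSorter expects node -> predecessors, which matches the
# "depends on" direction: a dependency is emitted before its dependents.
release_order = list(TopologicalSorter(depends_on).static_order())
print(release_order)
```

If the graph contained a circular dependency, `static_order()` would raise `graphlib.CycleError`, which is itself a useful signal before a release.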
2. Are there any projects that point to different release versions of the same library?
It is very important to know whether all projects rely on the very same library version, the newer the better (there could be specific exceptions to this). Pointing to different library versions can become the root cause of production issues in the future, so knowing this in advance is priceless for our customer’s success. Here is an example of what it looks like.
The previous diagram shows why we want a single version of that specific library (purple nodes) that all C# projects (orange nodes) point to. As you can see, it is really simple to spot the projects that are not pointing to the latest release version of that library; the goal is for all orange nodes to point to a single purple node.
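As a rough illustration of this check outside the graph database, the sketch below groups references by library and keeps only the libraries used in more than one version. The project names, library names, and the `(project, library, version)` row layout are assumptions made for the example, not the actual format of “repos.csv”.

```python
from collections import defaultdict

# Hypothetical (project, library, version) rows.
references = [
    ("Billing.Service", "Acme.Core", "1.2.0"),
    ("Orders.Service", "Acme.Core", "1.2.0"),
    ("Reports.Service", "Acme.Core", "1.1.0"),
    ("Orders.Service", "Acme.Data", "1.0.0"),
]


def find_version_conflicts(rows):
    """Return {library: {version: [projects]}} for libraries used in 2+ versions."""
    versions_by_library = defaultdict(lambda: defaultdict(set))
    for project, library, version in rows:
        versions_by_library[library][version].add(project)
    return {
        library: {version: sorted(projects) for version, projects in versions.items()}
        for library, versions in versions_by_library.items()
        if len(versions) > 1
    }


conflicts = find_version_conflicts(references)
print(conflicts)
```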
3. Is there a releasable package that depends on a pre-release library version?
The desired scenario when generating the release packages is that all libraries involved are in their release state. However, there could be times when a developer left a specific project pointing to a beta version of a library (i.e. a pre-release version). This is usually an undesired state, because beta versions can contain work that is still in progress and should not be shipped along with a release. With the model we have just created, we can run a query that tells us which projects fall into this scenario, so that we can react fast if needed; again, more control and visibility over the releasable artifacts. Here is an example of what it looks like.
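One simple way to flag such references relies on the NuGet convention that a pre-release version carries a hyphenated label after the version number (for example `1.0.0-beta`). The sketch below assumes the versions follow that convention; the project and library names are made up.

```python
def is_prerelease(version: str) -> bool:
    """True when the version carries a '-label' suffix such as '1.0.0-beta'."""
    # Drop SemVer 2.0 build metadata ("+...") before looking for the label.
    core = version.split("+", 1)[0]
    return "-" in core


# Hypothetical (project, library, version) rows.
references = [
    ("Billing.Service", "Acme.Core", "1.2.0"),
    ("Orders.Service", "Acme.Data", "2.0.0-beta1"),
    ("Reports.Service", "Acme.Core", "1.2.0+build.5"),
]

risky = [(project, library, version)
         for project, library, version in references
         if is_prerelease(version)]
print(risky)
```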
4. Which library versions can we delete from Nexus, without causing “side effects”?
I remember asking our OPS team this question because we needed to speed up the NuGet restore process, and it seemed that deleting old artifacts from Nexus improved it. However, the OPS team asked me: how do I know which specific artifacts (i.e. library versions) I can delete without causing problems for existing dependencies? As you can imagine, answering this question is kind of hard, because you might have to sit with the OPS team, iterate through each artifact and start guessing. With the model we have just created, this is no longer the case, because we can query the dependency graph and know upfront which specific artifacts can be deleted (i.e. those that do not appear in the graph).
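In set terms, the deletable artifacts are the ones stored in the repository minus everything reachable from the projects. Here is a minimal sketch of that idea with made-up artifacts; it assumes the complete list of stored artifacts is available, which is not something the CSV files described above are stated to contain.

```python
def reachable(roots, depends_on):
    """Collect every artifact reachable from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        artifact = stack.pop()
        if artifact in seen:
            continue
        seen.add(artifact)
        stack.extend(depends_on.get(artifact, ()))
    return seen


# Hypothetical data: artifacts are (library, version) pairs.
stored_in_nexus = {
    ("Acme.Core", "1.0.0"), ("Acme.Core", "1.2.0"),
    ("Acme.Data", "1.0.0"), ("Acme.Legacy", "0.3.0"),
}
referenced_by_projects = [("Acme.Data", "1.0.0")]
depends_on = {("Acme.Data", "1.0.0"): [("Acme.Core", "1.2.0")]}

in_use = reachable(referenced_by_projects, depends_on)
deletable = stored_in_nexus - in_use
print(sorted(deletable))
```

Note that `Acme.Core` 1.2.0 is kept even though no project references it directly, because it is a transitive dependency.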
5. Is there any project that depends on a Nexus artifact that no longer exists?
This question is related to the previous one, but with a different meaning: you might want to know which project dependencies no longer exist, because they will cause a build failure when you try to restore them. You can either build each project and see which ones fail for this reason, or you can query the model we have built and know upfront which projects fall into this scenario. Here is an example of what it looks like.
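The same check can be pictured as a lookup of each project reference against the set of artifacts that still exist. The data below is hypothetical and only illustrates the logic.

```python
# Hypothetical data: artifacts that still exist in the repository.
existing_artifacts = {("Acme.Core", "1.2.0"), ("Acme.Data", "1.0.0")}

# Hypothetical project -> referenced (library, version) pairs.
project_references = {
    "Billing.Service": [("Acme.Core", "1.2.0")],
    "Reports.Service": [("Acme.Core", "0.9.0"), ("Acme.Data", "1.0.0")],
}

# Keep only the projects with at least one reference that no longer exists.
broken_projects = {}
for project, refs in project_references.items():
    missing = [ref for ref in refs if ref not in existing_artifacts]
    if missing:
        broken_projects[project] = missing

print(broken_projects)
```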
6. What are the most used libraries?
One thing you gain by storing this dependency graph in a graph database is that you can visualize it: you can naturally see how nodes are connected to each other, which lets you determine things such as which libraries are used the most, among the other things we have just talked about and things this article does not cover.
If you are interested in knowing which Cypher queries I used to answer these questions, please check them here.
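Numerically, "most used" can be read as the number of distinct projects pointing at each library. A small sketch with made-up references:

```python
from collections import Counter

# Hypothetical (project, library) references.
references = [
    ("Billing.Service", "Acme.Core"),
    ("Orders.Service", "Acme.Core"),
    ("Reports.Service", "Acme.Core"),
    ("Orders.Service", "Acme.Data"),
    ("Reports.Service", "Acme.Data"),
    ("Billing.Service", "Acme.Logging"),
]

# set() removes duplicate rows so each project counts once per library.
usage = Counter(library for _project, library in set(references))
most_used = usage.most_common(2)
print(most_used)
```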
Analyzing a dependency graph, such as the one that library dependencies impose, is something every development team goes through sooner or later, and relying on a good tool is critical to doing it efficiently. In our experience, this is where Graph Oriented Databases work like a charm: they naturally let you see how these dependencies are connected, and they also let you write queries to discover interesting facts about them.
Thank you very much for reading this far. I would be very interested in your feedback, so please leave it in the comments.
If you would like to get in touch with us, you can do so through this link: https://yuxiglobal.com/contact