New generative AI tools are attracting considerable hype for their ability to automate various developer tasks. Some vendors even claim they can improve developer productivity by 10-20% or more. Yet many CTOs believe caution is warranted in discerning where and how these tools can make a meaningful difference across the full development pipeline. For example, gains in code generation need to be weighed against potentially increased costs for product planning, security, and testing teams.
ComplyAdvantage CTO Mark Watson has been kicking the tires of new AI development tools for his organization. Watson is no newcomer to AI, although his experience has been in developing AI rather than developing with it. ComplyAdvantage has incorporated traditional machine learning and deep learning into its compliance applications to help financial firms filter out fraud, money laundering, and other risks. It's also starting to explore new generative AI capabilities to help identify and quantify risks across a much larger data stream.
His key takeaway for CTOs is that they must focus on streamlining their toolchain, orchestrating better developer metrics, and building trust with the development team. AI tools need to be built on top of this solid foundation to succeed.
Over a long career with successful exits, such as Volantis, Antenna Software, Causata, and SKIPJAQ, Watson has learned that implementing and calibrating metrics across the different development and product teams is critical for success. He explains:
My approach to these things is that you can never have too much data, so whether it's data about what's happening in your engineering organization, or data about what's happening in your product, or data that supports the customer conversation and stuff like that. Supporting that, you need products that generate that data, which are typically the better ones. So GitLab is pretty key for us in terms of holding stuff together. And then we have a bunch of other stuff, such as our observability solutions. And we also run metrics across what the developers contribute to GitLab and link that into JIRA, for example, to look at developer behaviors and productivity, so that's a big part of what we do.
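As a rough illustration of the GitLab-to-Jira linkage Watson describes, the sketch below uses the python-gitlab client to count commits per author and check how many reference a Jira issue key in their messages. The URL, token, project path, and the commit-message convention are all assumptions for the example, not ComplyAdvantage's actual setup.

```python
import re
from collections import Counter

import gitlab  # python-gitlab client

# Placeholders: the URL, token, and project path below are hypothetical.
gl = gitlab.Gitlab("https://gitlab.example.com", private_token="YOUR_TOKEN")
project = gl.projects.get("your-group/your-service")

# Assumed convention: commit messages reference Jira keys like "PAY-123".
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

commits_per_author = Counter()
linked_commits = 0

for commit in project.commits.list(since="2024-01-01T00:00:00Z", all=True):
    commits_per_author[commit.author_name] += 1
    if JIRA_KEY.search(commit.message):
        linked_commits += 1  # commit is traceable back to a Jira issue

total = sum(commits_per_author.values())
print(f"{linked_commits}/{total} commits reference a Jira issue")
for author, count in commits_per_author.most_common():
    print(f"{author}: {count} commits")
```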
For example, better data has made it easier to see that getting developers to make many small commits makes it easier to test changes and pinpoint the root cause of problems. If an engineer saves up a large batch of code into one commit, troubleshooting becomes more of a detective exercise to work out what happened. So, they track these kinds of patterns across their engineering teams.
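One way to track that commit-size pattern is to mine the repository history directly. The sketch below, assuming a local Git checkout and an illustrative 400-line threshold for a "large" commit, summarizes commit sizes from git log --numstat.

```python
import statistics
import subprocess

# Measure commit sizes (lines added + deleted) from git history.
# The 400-line threshold below is illustrative, not a measured standard.
log = subprocess.run(
    ["git", "log", "--numstat", "--pretty=format:@%H"],
    capture_output=True, text=True, check=True,
).stdout

sizes = []
current = 0
for line in log.splitlines():
    if line.startswith("@"):          # marker line starting a new commit
        if current:
            sizes.append(current)
        current = 0
    elif line.strip():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":              # "-" means a binary file
            current += int(added) + int(deleted)
if current:
    sizes.append(current)

print(f"commits: {len(sizes)}, median size: {statistics.median(sizes)} lines")
print(f"large commits (>400 lines): {sum(s > 400 for s in sizes)}")
```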
But it's also important to consider how to show developers where they might improve rather than shaming them for doing the wrong thing. For example, managers and engineers can hypothesize about which approach works better and then test whether the hypothesis is correct.
When Watson first arrived, there were about 20 different development teams, and they were not running on a common cadence. This made it difficult to analyze their metrics comparably, so he worked with the managers to put them all on a common cadence.
Another essential insight is that standardizing tools and processes across the organization makes it easier to run good experiments. For example, Watson has found that having a few modern tools is better than lots of small ones. This also reduces the burden of linking tools together and makes it easier to calibrate metrics consistently. Watson says:
If you look at one thing in three tools that you combine together, you may end up with three different numbers measuring the same thing because they all might be subtly different. So, there's a kind of move, I think rightly, towards kind of more monolithic capabilities.
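To make Watson's point concrete, the toy example below computes a "deployments per week" figure three ways from the same made-up event log, the way three separate tools might, and gets three different numbers. The events and definitions are invented for illustration.

```python
from collections import Counter

# One week of made-up delivery events; three tools might each count a
# different subset of these as a "deployment".
events = [
    "merge", "pipeline_success",        # Monday
    "merge",                            # Tuesday
    "pipeline_success", "release_tag",  # Wednesday
    "merge", "pipeline_success",        # Friday
    "merge", "release_tag",             # Saturday
]

counts = Counter(events)
print(f"'deploys/week' if you count merges:          {counts['merge']}")
print(f"'deploys/week' if you count green pipelines: {counts['pipeline_success']}")
print(f"'deploys/week' if you count release tags:    {counts['release_tag']}")
```

Run it and you get 4, 3, and 2 for what each tool would happily label the same metric, which is exactly the calibration problem consolidation avoids.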
At ComplyAdvantage, the firm has standardized on GitLab for continuous integration, Argo for application delivery, and Jira for issue tracking. They also hook other tools into this core foundation for monitoring and security. This consolidated approach to metrics provides consistent feedback to drive process refinement. Watson observes that striking the right balance between reining in tool sprawl and giving developers freedom requires some forethought and diplomacy:
There's a tug from the developers towards technology proliferation, whereas there's a tug from the CTO level and VP of engineering level towards technology consolidation because it makes their life easier from a management point of view.
One big challenge is that developers tend to get enamored with the latest technology and want to incorporate it into their workflow. However, Watson is often skeptical of their arguments. His solution was to create a technology map curated by the developers themselves. He explains:
So, we established a tech radar. And then, because I didn't want to be seen as the monarch of the tech radar, I established a parliament of our senior developers to sort it, and that congress meets regularly on those things. But we did have to go through an initial paring down exercise to sort that out. You know, developers are pretty smart people, and if you try to be too autocratic about introducing things, they see it as a breach of fundamental human rights. So you've got to be quite careful about how you manage developers because they won't be shy in giving you their opinions if they think you've got it wrong.
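A tech radar is ultimately a simple curated data structure. The sketch below, with invented entries and the conventional adopt/trial/assess/hold rings, shows the kind of model a senior-developer "parliament" might review; it is illustrative rather than ComplyAdvantage's actual radar.

```python
from dataclasses import dataclass
from enum import Enum

class Ring(Enum):
    ADOPT = "adopt"    # standard choice, use by default
    TRIAL = "trial"    # approved for production pilots
    ASSESS = "assess"  # worth exploring in a spike
    HOLD = "hold"      # do not start new work with this

@dataclass
class RadarEntry:
    name: str
    ring: Ring
    rationale: str

# Entries are made up for illustration.
radar = [
    RadarEntry("GitLab CI", Ring.ADOPT, "Standard pipeline for all teams"),
    RadarEntry("Argo", Ring.ADOPT, "Standard application delivery"),
    RadarEntry("Shiny graph DB", Ring.ASSESS, "Hypothetical new proposal"),
]

def proposals_for_review(entries: list[RadarEntry]) -> list[RadarEntry]:
    """Entries the senior-developer group should debate at its next meeting."""
    return [e for e in entries if e.ring in (Ring.ASSESS, Ring.TRIAL)]

for entry in proposals_for_review(radar):
    print(f"{entry.name}: {entry.ring.value} - {entry.rationale}")
```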
Recently, ComplyAdvantage has begun exploring different AI development tools. Watson observed there have been some anecdotal stories about how developers can become more productive using these tools. However, he feels that it is important to quantify where and how a particular organization sees improvements. He explains:
I'm hearing from some of my peers that they have seen 20% productivity improvements in junior and mid-level programmers from these tools. But I am a kind of devil-in-the-details guy. So, it's kind of, 'Well, what do you describe as a junior and mid-level programmer?' and 'Do you have the same description as us?' So we're running some pilots. The other thing that fascinates me is how do you measure productivity in the first place. There are lots of different ways of doing that, and you have to baseline it. Does the improvement make sense, or is it some fake velocity metric or a vanity metric you're providing to justify your investment in this space?
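Baselining of the kind Watson describes can be as simple as comparing a pilot group's cycle times against a pre-pilot baseline and checking that the gap isn't noise. The sketch below, with entirely made-up cycle-time data (days from first commit to merge), uses a simple permutation test as one hedge against declaring a fake velocity win.

```python
import random
import statistics

# Made-up cycle times for illustration: before the tool vs. during the pilot.
baseline = [4.2, 3.8, 5.1, 6.0, 4.5, 3.9, 5.5, 4.8, 6.2, 5.0]
pilot    = [3.5, 4.0, 3.2, 4.4, 3.8, 3.1, 4.6, 3.7, 4.1, 3.4]

observed = statistics.mean(baseline) - statistics.mean(pilot)

def permutation_p_value(a, b, trials=10_000):
    """Fraction of random label shuffles producing a gap at least as large."""
    pooled, n_a, extreme = a + b, len(a), 0
    for _ in range(trials):
        random.shuffle(pooled)
        gap = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if gap >= observed:
            extreme += 1
    return extreme / trials

print(f"mean improvement: {observed:.2f} days")
print(f"one-sided p-value: {permutation_p_value(baseline, pilot):.3f}")
```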
GitLab recently compiled a survey on developer productivity to understand the problems companies were running into and the opportunities for improvement. Stephen Walters, Field CTO for GitLab, said one key finding is that organizations struggle to measure productivity. This matters from the perspective of value streams, which account for all of the efforts across the organization that go into creating value. Walters explains:
Something that we're getting as feedback from the marketplace is that productivity isn't necessarily translating itself into business outcomes and business productivity, which is why GitLab is focusing not just on improving the developer experience. This is about improving the experience across the entire value stream. It's for everybody involved. So we're looking at those generative AI capabilities within planning and within security so that we can ensure that when productivity improvements are made within one area of the organization, that translates and flows through to the entire value stream so that there's genuine business outcomes at the end of it.
Value Stream Mapping (VSM) is an emerging technique for understanding all the critical steps in a process in order to quantify the time and effort involved in creating new business value. ComplyAdvantage uses the term development pipeline to describe something similar. This holistic focus helps teams think beyond just the development process to consider the costs borne by other teams. Walters says:
So, within the entire software delivery value stream, the developer part is probably about five to ten percent of the entire process. Some actions occur beforehand around design, planning, and business decisions that are being made around the focus of what it is that's going to be delivered. And then there are also stages afterward. It's how you deploy. It's about ensuring security, quality, compliance, and governance.
For example, it might seem great if a developer producing five thousand lines of code per week uses generative AI to produce forty thousand lines. But if they introduce ten times as many security vulnerabilities in the process, it might cost the whole organization more to rectify them.
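A back-of-envelope model makes the trade-off visible. Every parameter below is an assumption for illustration: a baseline defect-injection rate, the tenfold vulnerability increase from the example above, and notional time savings and remediation costs.

```python
# Toy cost model: does the tool's time saving survive the downstream
# remediation cost? All parameter values are assumptions, not measured data.
def net_weekly_impact(
    baseline_vulns_per_week=1.0,   # assumed rate at 5,000 lines/week
    vuln_multiplier=10,            # "ten times as many vulnerabilities"
    hours_saved_per_week=8,        # assumed developer time freed by the tool
    remediation_hours_per_vuln=6,  # assumed cost to triage and fix one issue
    loaded_hourly_cost=100.0,      # assumed blended org cost per hour
):
    extra_vulns = baseline_vulns_per_week * (vuln_multiplier - 1)
    savings = hours_saved_per_week * loaded_hourly_cost
    downstream_cost = extra_vulns * remediation_hours_per_vuln * loaded_hourly_cost
    return savings - downstream_cost

print(f"net weekly impact per developer: ${net_weekly_impact():,.0f}")
```

With these particular assumptions, the tool is a net loss of about $4,600 per developer per week. The point is not the specific number but that remediation costs elsewhere in the value stream can easily dominate the local time savings.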
Vendor hype around AI development productivity tends to gloss over the fact that modern app development is a team sport that needs to consider the role of players across the organization in security, testing, and product planning. Efforts like GitLab's to quantify and support new tools across this broader value chain will be important for AI-enabled development to see meaningful adoption.
I was also struck by the challenges of consolidating meaningful metrics across a large development organization. At one level, it seems obvious that different tools should quantify metrics similarly. However, Watson's experience suggests that this is not always the case. It also seems important to enlist developers' feedback and support to balance the desire to try out the shiny new thing against the needs of the full development pipeline, spanning product ideation, testing, security, and deployment.