How Code Is Used
Joshua Tan

The following technical memo was prepared for trade negotiators and policymakers involved in the regulation of trade in AI goods and services. It was commissioned by the Digital Trade Alliance.

 

 

Introduction

 

In this memo, we review the concepts of source code and algorithm and the ways in which they show up across the fields of software engineering, theoretical computer science, and AI. We will refrain from too much debate on their precise definitions (a subject more suited for lawyers, philosophers, and lexicographers) and instead focus on the ways that these concepts are actually used by scientists and engineers. We then refine these technical use-cases of source code and algorithms into a simple model of how code is regulated.

 

In what follows, an algorithm is a specification of a computational process, and source code (or just “code”) is a specification of a computational process that can be compiled and executed by a computer. So, yes, to be perfectly clear: every example of source code is also an example of an algorithm. In the discussion below, we’ll consider some cases where the line between the two is very clear and other cases where that line is not very clear at all. Because of its importance to present debates about AI, we will also consider the distinct use-cases of AI models, defined here as the source code produced through an AI training process, as compared to other instances of source code.

 

Software engineering: specifications for the working developer

 

Software engineering deals with the production of (source) code, the structuring and organization of code, and the packaging of code into software. It provides concepts, algorithms, and practical principles for building efficient, production-ready software.

 

 

Engineers write source code using programming languages. To operate, a computer needs precise instructions in the form of binary machine code. Programmers and engineers very rarely write machine code directly. Instead, they write source code in a programming language, which is essentially a syntax together with a set of rules that determine how well-formed source code translates into machine code. Typically, a programming language comes with a compiler, an intricate program that performs this translation automatically.
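To make the translation step concrete, here is a minimal sketch in Python. One assumption to flag: Python's built-in compiler produces bytecode for the Python virtual machine rather than native machine code, so this shows the same kind of translation, not the exact pipeline of a compiled language such as C. The source string and the variable names are invented for illustration.

```python
import dis

# Human-readable source code, held as ordinary text.
source = "area = width * height"

# The built-in compiler checks that the text is well-formed and
# translates it into low-level instructions (Python bytecode).
code_object = compile(source, "<example>", "exec")

# Execute the compiled instructions with concrete inputs.
namespace = {"width": 3, "height": 4}
exec(code_object, namespace)
print(namespace["area"])  # 12

# List the low-level instructions that the compiler produced.
dis.dis(code_object)
```

The listing printed by the last line is what the machine actually runs; the one-line source string is what the engineer reads and writes.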

 

 

There are many different programming languages, and they come in many different flavors: functional, logical, visual, object-oriented, reactive, diagrammatic, and so on. What’s important to understand is that the vast majority of programming languages are essentially methods of manipulating text in the form of source code.

 

 

Engineers use code to manipulate computers. A computer is a machine that takes coded instructions and performs a set of physical operations, from spinning or slowing a hardware disk to emitting radio waves to passing a current through an array of silicon logic gates. Code triggers and manipulates these physical operations, which are also affected by other physical forces such as electrostatic noise (something extremely important to consider when designing computer chips). As a consequence, code is also limited by the computer(s) within which it operates. Indeed, code is often specialized to the kind of computer that it runs on—consider smart contract programming, or the whole field of edge computing.

 

 

Engineers optimize code (and algorithms) to minimize the physical resources that computers consume. Because computers are physical machines, every computation consumes real resources. There are two properties of code and algorithms that every engineer cares about to some degree: the amount of time that the code or algorithm takes to run, and the amount of memory that it needs while running. For a fixed algorithm or piece of code, we usually refer to these properties as its time and space complexity, respectively.
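A rough illustration of time complexity, assuming we measure time by counting how many items an algorithm has to examine. Both functions below look for the same value in the same sorted list; the function names and the data are invented for this sketch, and only time (not space) is measured.

```python
def linear_search_steps(items, target):
    """Scan left to right; return the number of items examined."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Halve the search range each time; return the number of items examined."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


data = list(range(1_000_000))
print(linear_search_steps(data, 999_999))  # 1000000
print(binary_search_steps(data, 999_999))  # at most 20
```

The two functions return the same answer to the underlying question, but the second does so with a tiny fraction of the work, which is the kind of difference engineers have in mind when they compare time complexity.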

 

 

Engineers communicate through source code. Source code is usually human-readable. That’s partly because source code is still mostly produced by humans (we cover machine-produced code later), and partly because software engineers use source code to communicate with other engineers. Source code of any significant complexity is typically produced by teams of developers and engineers. Further, once produced, code needs to be maintained and changed in response to bug reports, feature requests, and evolving hardware and software dependencies. Proper documentation of code is essential to maintainability, and software teams typically dedicate substantial time to documenting code. For example, annotations in source code, commonly called comments, are ignored by the compiler but appear in almost every example of source code.
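The claim that comments exist for human readers rather than for the machine can be checked directly. The sketch below, which assumes Python and its standard `ast` module, parses two versions of an invented function, one with comments and one without, and compares the resulting program structure.

```python
import ast

commented = """
# Convert a temperature from Celsius to Fahrenheit.
def to_fahrenheit(celsius):
    # Scale first, then shift by the freezing point offset.
    return celsius * 9 / 5 + 32
"""

uncommented = """
def to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32
"""

# ast.parse builds the structure that the compiler works from. Comments
# are discarded at this stage, so both versions parse identically.
same_program = ast.dump(ast.parse(commented)) == ast.dump(ast.parse(uncommented))
print(same_program)  # True
```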

 

 

Note, just because source code is supposed to be human-readable does not mean that it is easy to read. Programmers often produce a form of “spaghetti code” that is extremely hard to understand and thus to modify or maintain. Certain programs will also compress source code into unreadable forms; other programs will deliberately obfuscate source code.

 

 

Engineers structure code into modules. Just as long textbooks need to be sectioned into chapters, sections, and bibliographies, long pieces of code need to be structured into modules, classes, and libraries. Modularization is especially important if a piece of code needs to be reused across a given project, but it’s also a critical tool for managing the complexity of code as well as dependencies between different pieces of code. A lot of thought goes into the architectural questions of how to modularize code; how exactly this structuring takes place, and what goes into it, are beyond the scope of this memo.

 

 

Engineers access code (and algorithms) through pre-written libraries, packages, and API services. Software engineers do not write most of the code that gets compiled into their software applications. In modern software engineering, code is packaged into pre-written software libraries either built into the programming language or imported before compile time; according to npm, a package manager, over 97% of the code in modern web applications comes from software packages available on npm. This pre-written code is then compiled along with the software engineer’s own code in order to produce executable machine code. A large part of a software engineer’s skill lies in their familiarity with such libraries and knowing when to call them.

 

Similarly, by far the most common way an engineer today will interact with a known algorithm is by calling a software library that implements that algorithm. For more junior developers, while knowledge of fundamental algorithms is important (at least within code interviews), in practice that knowledge is often regarded as secondary to the ability to quickly call upon existing functions and libraries, the ability to manipulate common data structures (strings, arrays, tensors, databases, etc.), and the ability to communicate and document the intention of the code.
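As an illustration of calling an algorithm rather than writing it, the sketch below uses three functions from Python's standard library. The list of scores is invented; the library calls (`sorted`, `bisect.bisect_left`, `heapq.nlargest`) are real and each wraps a well-known algorithm.

```python
import bisect
import heapq

scores = [72, 95, 58, 88, 91, 64]

# Sorting: the standard library supplies the sorting algorithm.
ranked = sorted(scores)

# Binary search: bisect locates a value in a sorted list.
position = bisect.bisect_left(ranked, 88)

# Selection: heapq finds the largest items using a heap-based
# algorithm, without the caller having to implement one.
top_three = heapq.nlargest(3, scores)

print(ranked)     # [58, 64, 72, 88, 91, 95]
print(position)   # 3
print(top_three)  # [95, 91, 88]
```

None of the three algorithms appears in the engineer's own code; what appears is the knowledge of which library function to reach for.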

 

 

Of note: modern web platforms are often composed of many interoperable services accessed through application programming interfaces, or APIs. For example, in order to display a search result for “Pittsburgh”, a typical search engine may not only query its internal records but also call the API of a weather service to display the current weather and the API of a reviews website to list popular destinations and restaurants. These APIs often constitute part of the design and are implemented as part of the source code of the software / website.
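The Pittsburgh example might look roughly like the following in code. Everything here is hypothetical: the three service functions are local stand-ins with placeholder return values, whereas in a real platform the weather and reviews functions would each send a network request to another organization's API.

```python
# Hypothetical stand-ins for remote services. In a real system the last
# two would be network requests to other organizations' APIs.
def internal_search(query):
    return [f"placeholder search result for {query}"]

def weather_api(city):
    return {"city": city, "forecast": "placeholder forecast"}

def reviews_api(city):
    return [f"placeholder restaurant in {city}"]

def search_results_page(query):
    """Assemble one page of results from three independent services."""
    return {
        "results": internal_search(query),
        "weather": weather_api(query),
        "popular": reviews_api(query),
    }

page = search_results_page("Pittsburgh")
print(sorted(page))  # ['popular', 'results', 'weather']
```

The point of the sketch is that the calls to outside services sit inside the platform's own source code, alongside its internal logic.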

 

 

Engineers study and design algorithms to reason about problems. Quite often, a software engineer will encounter problems that no existing software library or function can solve. In these cases, knowledge of known algorithms and their use-cases becomes quite important. Software engineers use such known algorithms to reason about and define solutions for problems before implementing or testing those solutions through code. For example, understanding how MapReduce works (a particular algorithm in parallel computation) may help you solve a problem with a different form of parallel computation in edge computation. Or, knowing how the original PageRank algorithm works in Google’s search engine may help you generalize the algorithm to a different but related problem in organizing contribution graphs (ref. SourceCred). Reasoning about algorithms this way takes substantial creativity and expertise, and knowledge and command of a wide array of algorithms is considered one of the hallmarks of an experienced software engineer.
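To give a sense of what reasoning with MapReduce involves, here is a single-machine sketch of the pattern applied to word counting. It is an illustration of the map, group, and reduce steps only, not the distributed system used in production; the function names and documents are invented.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["code is law", "code is text", "law is text"]

# Each document could be mapped on a different machine; here the
# map calls simply run one after another.
emitted = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(emitted))
print(word_counts)
```

An engineer who understands why the map step can run independently on each document is well placed to spot the same structure in an unrelated problem.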

 

An important takeaway from this use-case: the algorithms that engineers encounter are typically more abstract than source code because they are used by engineers to reason abstractly about a problem without having to program it in a computer. A single algorithm like quicksort can be implemented in many different ways, through different source code, while remaining the same algorithm.
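The quicksort point can be made concrete with two pieces of source code that look quite different yet implement the same algorithm (pick a pivot, partition around it, recurse on each side). This is a sketch; the function names are ours, and the in-place version assumes the commonly taught Lomuto partition scheme.

```python
def quicksort_functional(items):
    """Quicksort that builds new lists at each step."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort_functional(smaller) + [pivot] + quicksort_functional(larger)


def quicksort_in_place(items, low=0, high=None):
    """Quicksort that rearranges the list it is given (Lomuto partition)."""
    if high is None:
        high = len(items) - 1
    if low >= high:
        return
    pivot = items[high]
    boundary = low
    for i in range(low, high):
        if items[i] < pivot:
            items[i], items[boundary] = items[boundary], items[i]
            boundary += 1
    items[boundary], items[high] = items[high], items[boundary]
    quicksort_in_place(items, low, boundary - 1)
    quicksort_in_place(items, boundary + 1, high)


values = [5, 2, 9, 1, 5, 6]
in_place = list(values)
quicksort_in_place(in_place)
print(quicksort_functional(values))  # [1, 2, 5, 5, 6, 9]
print(in_place)                      # [1, 2, 5, 5, 6, 9]
```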

 

Theoretical computer science: the algorithm as computational process

 

The field of theoretical computer science abstracts away from source code in order to study different properties and representations of algorithms including their denotational semantics, their operational semantics, their computational complexity, the properties of their programming language, and how they vary within different models of computation (quantum, probabilistic, and so on). Each of these subjects constitutes a research field in its own right, and each has a slightly different way of thinking about algorithms.

 

 

Computer scientists study and design algorithms to reason about problems. Computer science is a broad field of research, and many computer scientists study problems, come up with algorithms, and produce source code in the exact same way that software engineers do; they simply do so within a research-based context that emphasizes prototyping and experimentation rather than the implementation of business logic within production-ready software.

 

 

Computer scientists optimize algorithms by studying their properties. When a computer scientist designs an algorithm, they often want that algorithm to be in some sense optimal. We spoke earlier about space and time complexity, but there are many other properties that a computer scientist may want to optimize, depending on the particular algorithm. For example, they may want to optimize the speed of an error-correcting algorithm at the cost of small amounts of error. Or, they may want to optimize an image compression algorithm to give optimal resolutions for a fixed compression ratio. Or, alternately, they may want to optimize the design of a networking protocol to handle cases. It is usually not possible to optimize all desirable properties of an algorithm at once, so a computer scientist must make choices about which ones to focus on. A significant part of research in particular subjects of computer science involves understanding the relationships and tradeoffs between the different properties of algorithms.

 

 

Computer scientists reason about algorithms by proving facts about their behavior. A very common way in which computer scientists think about algorithms is to model them mathematically and then prove certain mathematical facts about the possible behavior of that algorithm. For example, computer scientists often model learning algorithms in terms of certain statistical operations in order to prove bounds on how many errors they can make on a typical data set. This aspect of computer science goes back to its origins as a branch of applied mathematics.
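As one standard illustration of such a bound (a textbook result, not one drawn from this memo), suppose a learning algorithm chooses a hypothesis h from a finite set H after seeing m independent training examples. Write err(h) for the true error rate of h and \widehat{err}(h) for its error rate on the training set. Hoeffding's inequality combined with a union bound gives that, with probability at least 1 - \delta, the following holds for every h in H at once:

```latex
\left| \operatorname{err}(h) - \widehat{\operatorname{err}}(h) \right|
  \;\le\; \sqrt{\frac{\ln |H| + \ln(2/\delta)}{2m}}
```

In words: the more training examples are available (larger m), the more closely performance on the training set is guaranteed to track performance on unseen data.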

 

 

Computer scientists manipulate code as a form of data. Code is just a form of text, and computers are excellent devices for processing and manipulating text. A particular way in which some computer scientists interact with code is to treat it as its own form of data. Lisp, for example, is a programming language in which source code itself is treated as a first-class object which can be created, manipulated, and called by other programs. Lisp is deeply connected to early research in artificial intelligence, and variants of the language are still actively taught and researched in university computer science departments around the world. More recently, computer scientists have begun to parse large data sets of code in order to better understand patterns and potential errors in code, and even to automatically generate code.
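Treating code as data is not limited to Lisp. The sketch below uses Python's standard `ast` module (and assumes Python 3.9 or later for `ast.unparse`) to read a one-line program as a data structure, rewrite it, and run the result. The program and the rewriting rule are invented for illustration.

```python
import ast

source = "result = 2 + 3"

class AddToMultiply(ast.NodeTransformer):
    """Rewrite every addition in a program into a multiplication."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

# Parse the text into a tree, which is ordinary data a program can inspect.
tree = ast.parse(source)

# Transform the tree, then turn it back into source text.
new_tree = ast.fix_missing_locations(AddToMultiply().visit(tree))
new_source = ast.unparse(new_tree)
print(new_source)  # result = 2 * 3

# The rewritten program can be compiled and run like any other.
namespace = {}
exec(compile(new_tree, "<rewritten>", "exec"), namespace)
print(namespace["result"])  # 6
```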

 

 

This particular use-case of code points at something important. Recall that source code is a specification of a computational process that can be compiled and executed by a computer. So what counts as source code depends on “what can be compiled and executed by a computer”, and this last quality is evolving. In the future and already to some degree today, programmers might write source code by providing natural-language requests to a sophisticated AI compiler that then compiles and executes the request. This shift is not unique to AI. There is a similar history of advances in compilers, programming language design, and other aspects of computer science; for example, the source code that computer scientists and engineers write today is already much closer to natural language than the binary that early programmers wrote or the algebra of logical circuits.

 

 

Computer scientists reason about algorithms by constructing models of computation. Theoretical computer science is replete with more or less obscure models of computation like Turing machines, finite automata, lambda calculi, and combinator algebras among other, even more exotic models (e.g. categorical, graphical, higher-dimensional). These mathematical models are especially important in the field of programming language theory. Computer scientists use them to associate a semantics to a programming language through which they can reason about the behavior, properties, and relative power of that language. Computer scientists also use these models to design new programming languages; an oft-repeated joke is that mathematicians invented Haskell (a programming language) in order to trick computer scientists into learning category theory (a branch of math).
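Of the models listed above, the finite automaton is the simplest to show. The sketch below defines one with two states that reads a string of binary digits and accepts it if it contains an even number of 1s. The state names and the task are chosen purely for illustration.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START_STATE = "even"
ACCEPTING_STATES = {"even"}

def accepts(bits):
    """Return True if the string contains an even number of 1s."""
    state = START_STATE
    for symbol in bits:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING_STATES

print(accepts("1001"))  # True
print(accepts("1011"))  # False
```

The whole machine is the transition table; the few lines that run it are the same for every finite automaton, which is what makes the model convenient to reason about mathematically.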

 

 

You might hear the term “Turing-complete” bandied around by computer scientists; a programming language or system is called Turing-complete if it can express any computation that a Turing machine can perform. To be clear, most computers are not Turing machines. Your cellphone is not a Turing machine, because your cellphone is not a closed system that is cut off from its external environment.

 

AI: towards robust algorithmic systems

 

As a field of research, artificial intelligence studies the ways in which computational agents reason and act in the world. A popular subfield of AI, machine learning, uses a mix of precise learning algorithms (such as gradient descent or AdaBoost), techniques (such as batch norm or dropout), and more open-ended learning architectures (such as generative adversarial networks or ensemble methods) in order to produce algorithmic systems that then operate on data. Without conflating the two subjects, we will address some of the ways in which AI engineers use machine learning code, algorithms, and data within practical AI systems.

 

 

AI engineers and researchers write and deploy code to computers. Much of what is commonly referred to as “AI” is just normal code deployed on normal computers. When you play a game like checkers or Starcraft against a computer opponent, you are interacting with what is for all intents and purposes a normal computer program developed by software engineers. What gives these examples the feel of AI is the interactive nature of the program, its role in standing in for a human opponent, and, often, its usage of algorithms like search or planning that were pioneered in the long history of AI research.

 

 

AI engineers and researchers modify code by training AI models with data. Many examples of AI are not only programmed by engineers but also trained on a data set, meaning that the code which constitutes the AI is adapted to a given task through a learning process that draws relevant information out of a data set. The results of training are often stored in large parameter sets. Such large parameter sets are the most typical form of a trained AI model, or just model, which we define here as the source code produced by a training process. Though these models do not fit our typical expectations of source code—they are not structured like typical source code, they are not produced directly by human developers, and they are not intended for nor usually accessible to human interpretation—trained AI models do constitute a form of source code insofar as they represent instructions to a computer. It does not matter that such models are not intended to be read by a human; many examples of source code (e.g. auto-generated source code) are also not intended to be read by a human. It does not matter that such models are not produced by a human; many variations of source code have been produced through automated means throughout the history of computer science. It does not matter whether the model is fully pre-trained or whether it is an assemblage of data, training algorithms, and hyperparameters that need to be trained and compiled before deployment; sometimes source code is compiled before delivery and sometimes it is not. For an engineer or computer scientist, these are all instances of source code. A typical AI application will include both typical source code and an AI model (or multiple models), sometimes interacting in complicated ways. 
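The relationship between hand-written code and learned parameters can be sketched with a deliberately tiny example. The data set, learning rate, and function names are invented for illustration, and real models hold millions or billions of parameters rather than two.

```python
# Training data: pairs (x, y) that follow y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def predict(params, x):
    """Ordinary, human-written code that applies the parameters."""
    weight, bias = params
    return weight * x + bias

def train(data, steps=5000, learning_rate=0.01):
    """Gradient descent on mean squared error; returns learned parameters."""
    weight, bias = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (predict((weight, bias), x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict((weight, bias), x) - y) for x, y in data) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

# The "model" is nothing more than these two learned numbers.
model = train(data)
print(model)                 # close to (2.0, 1.0)
print(predict(model, 10.0))  # close to 21.0
```

Nobody typed the values stored in `model`; they were produced by the training process from the data. Yet together with `predict` they determine exactly what the computer does, which is the sense in which the memo treats a trained model as a form of source code.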

 

 

Examples of training and models are especially prevalent in machine learning, but there are also examples outside of machine learning where data is used to train an agent, e.g. inductive inference. 

 

 

AI engineers and researchers build data pipelines and infrastructure. A significant part of building, training, and deploying any real-world AI application comes down to obtaining, cleaning, and transforming the data used to train and/or operate the AI. In many cases, these data engineering tasks take up to 45% of the time used to produce a model. This fact does not diminish the importance and complexity of building modern learning architectures and algorithms, which require substantial tinkering, engineering, and fine tuning to produce robust models. However, the importance of data pipelines emphasizes (1) the limitations of these architectures in the absence of good data (what people in AI often refer to as “garbage in, garbage out”), and (2) the substantial infrastructure necessary to produce good data.

 

 

While many applications are trained on data from the open internet (a practice that has created some controversy even before the recent round of foundation models), many of these pipelines also involve purchased data and/or some form of human-in-the-loop process, e.g. manually labeling and tagging of specific classification data or a human signal on mistakes in semi-supervised learning. Other parts of the data pipeline are executed entirely by code, e.g. data augmentation, cleaning, and de-duplication.
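The code-executed parts of a pipeline can be as simple as the following sketch, which cleans a handful of invented text records and removes duplicates. Real pipelines apply the same ideas at a far larger scale and with many more rules.

```python
raw_records = [
    "  The cat sat on the mat. ",
    "the cat sat on the mat.",
    "",
    "A dog barked.",
    "A  dog   barked.",
    "Birds sing at dawn.",
]

def clean(text):
    """Normalize case and whitespace so near-identical records match."""
    return " ".join(text.lower().split())

def build_dataset(records):
    """Clean every record, drop empty ones, and remove duplicates."""
    seen = set()
    dataset = []
    for record in records:
        cleaned = clean(record)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            dataset.append(cleaned)
    return dataset

dataset = build_dataset(raw_records)
print(dataset)
# ['the cat sat on the mat.', 'a dog barked.', 'birds sing at dawn.']
```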

 

 

AI engineers and researchers access and deploy pre-trained AI models. Notwithstanding the importance of data in training an AI, the field of AI is moving toward greater and greater use of pre-trained models within complex applications. Like the pre-written libraries and packages that they resemble, pre-trained models are often used to speed up development and simplify deployments. Indeed, an increasing number of AI applications today are built on top of so-called foundation models such as the open-source Stable Diffusion or OpenAI’s GPT-3.

 

 

While a pre-trained AI model is a form of pre-written software, there are some important differences between how software engineers typically use pre-written libraries and how AI engineers typically use pre-trained models. Perhaps most substantially, pre-trained AI models are often used to train and adapt new AI models, such as when an AI engineer uses a pre-trained discriminator within a generative adversarial network, when they bootstrap training, or within any form of transfer learning. This is reminiscent of how code from a library call is used and ported into a compiled application, but typically the code of a pre-trained AI model is not reproduced directly in a subsequently-trained model.
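One simple form of transfer learning can be sketched as follows: the pre-trained model is kept frozen and used as a component, while a small new set of parameters is trained on top of it for a new task. Every number and name below is hypothetical, and the "pre-trained" model is a two-parameter stand-in for what would in practice be a large downloaded model.

```python
# Hypothetical "pre-trained" parameters, standing in for a downloaded model.
PRETRAINED_WEIGHTS = (0.5, -0.25)

def pretrained_features(x1, x2):
    """Frozen feature extractor: its parameters are never updated here."""
    w1, w2 = PRETRAINED_WEIGHTS
    return w1 * x1 + w2 * x2

def train_head(examples, steps=2000, learning_rate=0.05):
    """Fit a new scale and offset on top of the frozen features."""
    scale, offset = 0.0, 0.0
    n = len(examples)
    for _ in range(steps):
        grad_s = grad_o = 0.0
        for (x1, x2), target in examples:
            feature = pretrained_features(x1, x2)
            error = scale * feature + offset - target
            grad_s += 2 * error * feature / n
            grad_o += 2 * error / n
        scale -= learning_rate * grad_s
        offset -= learning_rate * grad_o
    return scale, offset

# New task: the targets happen to equal 4 * feature + 1.
examples = [((2.0, 0.0), 5.0), ((0.0, 4.0), -3.0), ((4.0, 4.0), 5.0), ((2.0, 4.0), 1.0)]
head = train_head(examples)
print(head)  # close to (4.0, 1.0)
```

The new parameters in `head` were shaped by the pre-trained model without containing a copy of it, which is what makes this kind of reuse different from linking a library.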

 

In other ways, pre-trained AI models are just like pre-written libraries. For example, you can download many models directly to your machine, meaning that you have access to all the internal code and parameters of a trained AI model under whatever terms were attached to it, and then compile them directly into your own application. Others are “black boxes” deployed offsite and accessed through API services.

 

 

AI engineers design AI systems and AI algorithms to be robust. We often expect AI systems to appear as “autonomous” agents acting independently of their developers, but existing AI systems such as Facebook’s News Feed, Amazon’s Alexa, or the Google Car are rarely any more or less autonomous than traditional programs. For example, computer viruses are, by design, extremely autonomous, but we do not regard viruses as a form of AI. Further, many actual AI systems require a “human-in-the-loop” to function, often someone labeling data, checking boundary cases that the AI cannot handle, or supervising it to some extent.

 

 

What distinguishes an AI from a traditional program is its capacity to act under conditions of uncertainty or complexity. AI engineers often use the word ‘robustness’ to describe what they are seeking to achieve in designing these systems and algorithms: they want an AI to be robust to noise or to unforeseen variations in a domain or a task. In machine learning, one also uses the term ‘generalize’, as in ‘the algorithm generalizes its performance on the training data to data it has never seen before’. The method for achieving robustness or generalization is a complicated function of the particular algorithm or architecture and its implementation within an engineered system. For example, an algorithm based on first-order logic may have a hard time navigating a physical obstacle course, but so might a learning-optimized agent whose cameras are misaligned.

 

Regulating code

 

So far, we have covered a range of pragmatic use-cases for how software engineers, computer scientists, and AI engineers use the concepts of source code, algorithm, and AI model in practice. In this section, we connect these use-cases to questions around the regulation and ownership of source code.

 

 

First, we start with a simple model of how source code is regulated (see Figure 1). Then, for each use-case, we consider which regulatory mechanisms it informs or enables. Finally, we dive into two specific relationships: (1) how each use-case informs or does not inform the specific regulatory mechanism of legal ownership, and (2) how the specific use-case of pre-trained AI models relates to each listed regulatory mechanism.

 

 

Figure 1. A simple model of how source code is regulated.

 

 

How to regulate a piece of code

 

A piece of source code can be regulated in many ways. To an engineer, a number of constraints play directly into their choices in writing code, from resource constraints to the expressivity of the programming language to the presence of external libraries. Once code is written, it is almost always reviewed through other mechanisms, from code-based compile-time checks (many of which are built into a programming language) to run-time checks to norm- and rule-based mechanisms like the development paradigm and the commit review system. Once a piece of code leaves the engineer’s hands or is deployed, it must be tested and evaluated: through bug reports, through market adoption (by consumers or by other engineers), and ultimately through various laws. These examples are represented in Figure 1.
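The difference between a check made before code runs and a check made while it runs can be sketched in a few lines. This assumes Python, where the nearest equivalent of a compile-time check is the syntax check performed by the built-in `compile` function; the `withdraw` function and its rule are invented for illustration.

```python
def passes_compile_time_check(source):
    """Return True if the text is well-formed enough to compile."""
    try:
        compile(source, "<submitted>", "exec")
        return True
    except SyntaxError:
        return False

def withdraw(balance, amount):
    """Run-time check: reject inputs that only show up during execution."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount

print(passes_compile_time_check("total = price * quantity"))  # True
print(passes_compile_time_check("total = price * "))          # False

try:
    withdraw(balance=100, amount=250)
except ValueError as error:
    print(f"rejected at run time: {error}")
```

Both checks are enforced by code rather than by a person, which is why they count as local mechanisms in the sense discussed in this section.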

 

Some of these mechanisms are local: they can be enforced on a piece of code without involving anyone beyond the engineer writing it. Others are intrinsically non-local: they require other humans to implement. Moreover, some of these mechanisms are laws or rules that are enforced through a central authority; others are norms that are enforced by a community; other mechanisms are enforced by the architectural and code-based constraints of the space (“code is law”); and still others are enforced through market-based mechanisms such as price or availability through supply and demand.

 

From use to regulation

 

Understanding how code is actually used can help clarify and inform how each of these regulatory mechanisms actually works on code, and how it might work better. In Figure 2, we relate each use-case to the set of regulatory mechanisms that it supports or enables.

 

Figure 2. Relating the use-cases of source code and algorithm to regulatory mechanisms for governing code.

 

For example, the fact that engineers communicate through source code supports or qualifies the use of requirements & software specifications to regulate code (because requirements are often communicated and pegged directly within sections of source code), the use of development paradigms (because the particular paradigm, e.g. agile, often leads to different communication and documentation patterns), bug reports (because bug reports reference specific sections of code), and the commit process and platform (because many reviews depend on reading and understanding the intention of the source code, including documentation within the code).

 

An example: legal ownership

 

Consider the fact that code (and algorithms) can be legally owned. That is, code is intellectual property that can be owned (and traded), entailing rights that can then be enforced within legal jurisdictions. The fact of legal ownership changes the incentives around the production of code (e.g. due to who can control it and how it can be monetized), and thus changes what code is commissioned and written.

 

 

The legal ownership of code is influenced by several facts about code:

  • The fact that engineers structure code into modules implies that sufficiently-independent modules can be owned and sold as separate pieces of software. Note that a software module is different from a feature of software that may be turned on or off (as in many open-core business models in software); modules often correspond to features but not always.
  • The fact that engineers access code (and algorithms) through pre-written libraries or packages implies that, for most code to function, it needs to be accompanied by many external software libraries and other functional dependencies. This often adds costs or at least additional legal risk to using that software, especially if some of those dependencies come with certain copyright conditions.
  • Finally, the fact that computer scientists manipulate code as a form of data implies that source code can be regulated as a form of data. That is, laws developed for data can also, in principle, be applied to source code.

 

Of course, code does not have to be “owned”; it can be and often is open-sourced, facilitating the sharing and use of code.

 

An example: pre-trained AI models

 

AI engineers and researchers use pre-trained AI models in their work, from small discriminator models used to bootstrap learning for a specific domain to large foundation models that serve as the basis for entire applications.

 

 

The fact that AI engineers and researchers use pre-trained AI models has several implications for the regulatory mechanisms that govern code:

  • Government regulation of a particular AI model may need to accommodate the fact that AI models are used to train other AI models. For example, banning a particular AI model may also mean banning all additional models trained on that model.
  • Intellectual property rights over a pre-trained AI model typically entail rights against copying or reproducing that code in certain contexts, but what if another developer uses your pre-trained AI model in order to train their own model, or simply uses data generated by your AI model? Licenses may need to reflect this kind of use-case.
  • Due to the sensitivity of these parameter models, the owners and developers of pre-trained AI models may wish to hold them as trade secrets, which are widely used to guarantee revenue streams and ward off unauthorized uses. There are other reasons to maintain secrecy, e.g. OpenAI’s declared motivation to prevent AI from being used irresponsibly.
  • Pre-trained AI models, especially large foundation models, are challenging the typical, task-focused development paradigm for AI applications that has dominated the field since its inception. These new models are provided for a wide variety of general tasks and then sometimes used as a building block within specialized applications.
  • A new commit process or versioning system for updating or using pre-trained AI models within larger AI applications or in training new models could go a long way to ameliorating certain concerns related to the uses and rights associated with these models.
  • A pre-trained AI model constitutes an external software library, and using one in a program constitutes a dependency of that program.
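The first point in the list above, that restricting one model may implicate every model trained from it, amounts to a question about lineage. A minimal sketch, assuming lineage records of the kind shown were actually kept (the model names and the record format are entirely hypothetical):

```python
from collections import deque

# Hypothetical lineage records: each model lists the models it was trained from.
TRAINED_FROM = {
    "base-model": [],
    "chat-model": ["base-model"],
    "legal-assistant": ["chat-model"],
    "image-model": [],
    "caption-model": ["image-model", "base-model"],
}

def derived_models(root):
    """Return every model trained, directly or indirectly, from `root`."""
    children = {name: [] for name in TRAINED_FROM}
    for name, parents in TRAINED_FROM.items():
        for parent in parents:
            children[parent].append(name)
    found, queue = set(), deque([root])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found

print(sorted(derived_models("base-model")))
# ['caption-model', 'chat-model', 'legal-assistant']
```

In practice the hard part is not the traversal but the records: without a versioning or commit system that captures which models were trained from which, the table at the top of the sketch does not exist.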

 

Conclusion

 

In this essay we have explored algorithms as a superset of code and code as a superset of trained AI models, and we have presented some of the ways in which algorithms, code, and AI models are used by engineers and computer scientists. These use-cases are not comprehensive, and they are subject to change. The ways in which engineers and computer scientists interact with algorithms and code have clearly evolved over the past several decades, and the ways that they use and manipulate AI models are evolving dramatically year by year. As we have mentioned earlier, the very nature of source code is evolving because computers themselves are evolving. To be relevant, any regulation of code (whether by lawmakers and regulators, by companies and organizations, or by the engineers themselves) must keep these evolutions in mind.

 

Acknowledgements

 

We would like to thank Kristina Irion and Deepika Yadav for their helpful comments in the development of this essay.

 

Appendix

 

The data used to generate the figures can be found below:

  1. A simple regulatory model: https://airtable.com/shrRCyOBarSthGYll 
  2. Use-cases and regulatory mechanisms: https://airtable.com/shrKVKWm3tZGPXbtp