Determining the Source of Truth for Software Components

1812

Abstract: Having access to a list of software components and their respective meta-data is critical to performing various DevOps tasks successfully. After considering the varying requirements of the different tasks, we determined that representing a software component as a “collection of files” provided an optimal representation. Conversely, when file-level information is missing, most tasks become more costly or outright impossible to complete.

Introduction

Having access to the list of software components that comprise a software solution, sometimes referred to as the Software Bill of Materials (SBOM), is a requirement for the successful execution of the following DevOps tasks:

  • Open Source and Third-party license compliance
  • Security Vulnerability Management
  • Malware Protection
  • Export Compliance
  • Functionally Safe Certification

A community effort, led by the National Telecommunications and Information Administration (NTIA) [1], is underway to create an SBOM exchange format driven mainly by the Security Vulnerability Management task. The holy grail of an effective SBOM design is twofold:

  1. Define a commonly agreed-upon data structure that best represents a software       component and;
  2. Devise a method that uniquely and effectively identifies each software component.

A component must represent a broad spectrum of software types including (but not limited to): a single source file, a library, an application executable, a container, a Linux runtime, or a more complex system that is composed of some combination of these types. For instance, a collection of source files (e.g., busybox 1.27.2) or a collection of three containers (e.g., a half dozen scripts and documentation) are two examples.

Because we must handle an eclectic range of component types, finding the right granular level of representation is critical. If it is too large, we will not be able to represent all the component types and the corresponding meta-information required to support the various DevOps tasks. On the other hand, it may add unnecessary complexity, cost, and friction to adoption if it is too small.

Traditionally, components have been represented at the software package or archive level, where the name and version are the primary means of identifying the component. This has several challenges, with the two biggest ones being:

  1. The fact that two different software components can have the same name yet be different, and
  2. Conversely, two copies of software with different names could be identical.

Another traditional method is to rely on the hash of the software using one of several methods – e.g., SHA1, SHA256, or MD5. This works well when your software component represents a single file, but it presents a problem when describing more complex components composed of multiple files. For example, the same collection of source files (e.g., busybox-1.27.2 [2]) could be packaged using different archive methods (e.g., .zip, .gz, .bz2), resulting in the same set of files having different hashes due to the different archive methods used. 

After considering the different requirements for the various DevOps tasks listed above, and given the broad range of software component types, we concluded that representing a software component as a “collection of files” where the “file” serves as the atomic unit provides an optimal representation. 

This granular level enables access to metadata at the file level, leading to a higher quality outcome when performing the various DevOps tasks (e.g., file-level licensing for license compliance, file-level vulnerability data for security management, and file-level cryptography info for export compliance).  To compute the unique identifier for a given component we recommend taking the “hash” of all the “file hashes” of the files that comprise a component. This enables unique identification independent of how the files are packaged. We discuss this approach in more detail in the sections that follow.

Why the File Level Matters

To obtain the most accurate information to support the various DevOps tasks sufficiently, one would need access to metadata at the atomic file level. This should not be surprising given that files serve as the building blocks from which software is built. If we represented software at any higher level (e.g., just name and version), pertinent information would be lost. 

License Compliance

If you want to understand all the licenses that impose obligations and restrictions on a program or library, you will need to know all the licenses of the files from which it was built (derived). There are many instances where, although an open source software component’s top level license is declared to be one license, it is common to find a half dozen or more other licenses within the codebase which typically impose additional obligations.  Popular open source projects usually borrow from other projects with different licenses. The open sharing of code is the force behind the success of the Open Source movement. For this reason, we must accept license diversity as the rule rather than the exception.  

This means that a project is often subject to the obligations of multiple licenses. Consider the impact of this on the use of busybox, which provides a lot of latitude regarding the features included in a build. How one configures busybox will determine which files are used. Knowing which files are used is the only way to know which licenses are applicable. For instance, although the top-level license is GPL-2.0, the source file math.c [3] has three licenses governing it (GPL-2.0, MIT, and BSD) because it was derived from three different projects.  

If one distributed a solution that includes an instance of busybox derived from math.c and provided a written offer for source code, one would need to reproduce their respective license notices in the documentation to comply with the MIT and BSD licenses. Furthermore, we have recently seen an open source component with Apache as the top-level license, yet deep within the bowels of the source code lies a set of proprietary files. These examples illustrate why having file level information is mission-critical.

Security Vulnerability Management

The Heartbleed vulnerability was identified within the OpenSSL component in 2014. Many web servers used OpenSSL to provide secure communication between a browser and a website. If left unpatched, it would allow attackers unprecedented access to sensitive information such as login and password credentials [4]. This vulnerability could be isolated to a single line within a single file. Therefore, the easiest and most definitive way to understand whether one was exposed was to determine whether their instance of OpenSSL was built using that file.

The Amnesia:33 vulnerability announcement [5], reported in November 2020, suggested that any software solution that included the FNET component was affected. With only the name and version of the FNET component to go on, one would have incorrectly concluded the Zephyr LTS 1.14 operating system was vulnerable.  However, by examining the file level source, one could have quickly determined the impacted files were not part of the Zephyr build, making it definitively clear that Zephyr was not in fact vulnerable [6].  Having to conduct a product recall when a product is not affected would be highly unproductive and costly. However, in the absence of file-level information, the analysis would not be possible and would have likely caused unnecessary worry, work, and cost. These examples further illustrate why having access to file-level information is mission-critical.

Export Compliance

The output quality of an Export Compliance program also depends on having access to file-level data. Although different governments have different rules and requirements concerning software export license compliance, most policies center around the use of cryptography methods and algorithms. To understand which cryptography libraries and algorithms are implemented, one needs to inspect the file-level source code. Depending on how a given software solution is built and which cryptography-based files are used (or not used), one should classify the software concerning the different jurisdiction policies. Having access to file-level data would also enable one to determine the classification for any given jurisdiction dynamically. The requirements of the export compliance task also mean that knowing what is at the file level is mission-critical.

Functional Safety

The objective of the functional safety software certification process is to mitigate the unacceptable risk of physical injury or of damage to the health of people and/or property. The standards that govern functional safety (e.g., ISO/IEC 61508, ISO 26262, …) require that the full system context is known to assess and mitigate risk successfully. Therefore, full system transparency requires verification and validation at the source file level, which includes understanding all the source code and build tools used, and how it was configured. The inclusion of components of unknown content and provenance would increase risk and prohibit most certifications. Thus, functionally safe certification represents yet another task where having file-level information becomes mission-critical.

Component Identification

One of the biggest challenges in managing software components is the ability to identify each one uniquely. Developing a high confidence method ensures that two copies of a component are the same when they represent identical content and different if the content is not identical. Furthermore, we want to avoid creating a dependency on a central component registry as a requirement for determining a component’s identifier. Therefore, an additional requirement is to be able to compute a unique identifier simply by examining the component’s contents.

Understanding a component’s file-level composition can play a critical role in designing such a method. Recall that our goal is to allow a software component to represent a wide spectrum of component types ranging from a single source file to a collection of containers and other files. Each component could therefore be broken down into a collection of files. This representation enables the construction of a method that can uniquely identify any given component. 

File hash methods such as SHA1, SHA256, and MD5 are effective at uniquely identifying a single file. However, when representing a component as a collection of files, we can uniquely represent it by creating a meta-hash – i.e., by taking the hash of “all file hashes” of the files that comprise a component. That is, i) generate a hash for each file (e.g., using SHA256), ii) sort the list of hashes, and iii) take the hash of the sorted list. Thus, the meta-hash approach would enable us to identify a component based solely on its content uniquely, and no registry or repository of truth is required.  

Conclusion

Having access to software components and their respective metadata is mission-critical to executing various DevOps tasks. Therefore, it is vital to establish the right level of granularity to ensure we can capture all the required data. This challenge is further complicated by the need to handle an eclectic range of component types. Therefore, finding the right granular level of representation is critical. If it is too large, we will not represent all the component types and the meta-information needed to support the DevOps function. If it is too small, we could add unnecessary complexity, cost, and friction to adoption. We have determined that a file-level representation is optimal for representing the various component types, capturing all the necessary information, and providing an effective method to identify components uniquely.

References

[1] NTIA: Software Bill of Materials web page, https://www.ntia.gov/SBOM

[2] Busybox Project, https://busybox.net/

[3] Software Heritage archive: math.c, https://archive.softwareheritage.org/api/1/content/sha1:695d7abcac1da03e484bcb0defbee53d4652c347/raw/ 

[4] Wikipedia: Heartbleed, https://en.wikipedia.org/wiki/Heartbleed

[5] “AMNESIA:33: Researchers Disclose 33 Vulnerabilities Across Four Open Source TCP/IP Libraries”, https://www.tenable.com/blog/amnesia33-researchers-disclose-33-vulnerabilities-tcpip-libraries-uip-fnet-picotcp-nutnet

[6] Zephyr Security Update on Amnesia:33, https://www.zephyrproject.org/zephyr-security-update-on-amnesia33/