Environment variables in Linux

Aug 26, 2019


Intro

Environment variables are used to pass information to a process from the running environment, usually the shell. Many people know and interact with environment variables using the shell builtin called export.

A typical example: you have a C project that you compile with make. The make program uses the CC variable to decide which compiler to run, and a CC set in the environment overrides its built-in default of cc, the OS’s default C compiler. Say your default compiler is gcc and you want to compile with clang. You’d do something like this:

CC=clang make
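
From the program's side, a lookup like this usually amounts to a getenv(3) call with a fallback. The sketch below is not make's actual code; env_or_default is a hypothetical helper used only to illustrate the pattern.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the value of the environment variable
 * `name`, or `fallback` when the variable is not set at all. */
const char *env_or_default(const char *name, const char *fallback)
{
    const char *value = getenv(name);
    return value != NULL ? value : fallback;
}
```

If a program calling env_or_default("CC", "cc") is started as CC=clang ./prog, the call returns "clang". Started without the prefix, it returns "cc", unless CC is already exported in your shell.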

Environment variables are easy to set, but it is hard to know which ones a program expects. Often the only way to find out is to read the source code, or to hope the documentation is up to date. This hurts maintainability, because there is no clear, defined contract stating which variables are required. In the case of make, it is documented in the man page.

I’d like to share how environment variables work, along with a little-known caveat about the /proc filesystem. To do that, I’d first like to take a deep dive into how environment variables are actually set on Linux.

Deep dive

Before diving in, let’s talk about where environment variables live: on the process’s stack in user space, alongside the arguments passed on the command line (argv).

Linux process stack

Where do they come from?

There are three ways they can be set:

Inherited from the parent process (often from a shell).
Explicitly passed via a call to a specific variant of exec(3).
Set from within your program using setenv(3) or putenv(3) or, for the brave, by manipulating the environ(7) pointer directly. We will come back to this one.

Note: setenv(3) and putenv(3) happen to be for C, but every programming language ought to have a similar counterpart.
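
The second option in the list can be sketched as follows. This is a minimal illustration, assuming /bin/sh exists on the system; run_sh_with_env is a hypothetical helper name. The child gets exactly the environment passed in envp and nothing from the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `/bin/sh -c script` with exactly the
 * environment given in envp (a NULL-terminated "NAME=value" array).
 * Returns the child's exit status, or -1 on failure. */
int run_sh_with_env(const char *script, char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        char *const argv[] = { "sh", "-c", (char *)script, NULL };
        execve("/bin/sh", argv, envp);  /* envp replaces the inherited environment */
        _exit(127);                     /* only reached if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Passing an envp of { "GREETING=hello", NULL } and the script test "$GREETING" = hello should give an exit status of 0, because the shell sees only what was handed to execve(2).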

But where do they really come from?

On Linux, there’s a file called /etc/environment, which usually comes pre-populated by your distro (distribution). On most systems it is read at login by the pam_env PAM module rather than by the shell itself, and it seeds the environment of the login session. The login shell then loads files such as /etc/profile, and bash additionally reads ~/.bashrc for interactive non-login shells; both may set further environment variables.

Afterwards, the environment “persists” throughout the system by inheritance through the fork(2) and exec (usually execve(2)) calls.
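
Inheritance through fork(2) can be observed directly. The sketch below uses a hypothetical helper, env_is_inherited_but_private, to check two things: the child starts with the parent's value, and a change made in the child does not leak back into the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: sets `name` to "parent", forks, and lets the
 * child overwrite it. Returns 1 if the child inherited "parent" and
 * the parent still sees "parent" afterwards, 0 if not, -1 on error. */
int env_is_inherited_but_private(const char *name)
{
    if (setenv(name, "parent", 1) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        const char *v = getenv(name);
        int inherited = v != NULL && strcmp(v, "parent") == 0;
        setenv(name, "child", 1);   /* changes only the child's copy */
        _exit(inherited ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    const char *v = getenv(name);
    int unchanged = v != NULL && strcmp(v, "parent") == 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0 && unchanged;
}
```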

How did they get on the stack?

In the picture above, the environment variables are stored after argv. Here’s a snippet of how glibc starts your program’s main function: ¹

int __libc_start_main(
        /* Pointer to the program's main function */
        int (*main) (int, char**, char**),
        /* argc and argv */
        int argc, char **argv,
        /* Pointers to initialization and finalization functions */
        __typeof (main) init, void (*fini) (void),
        /* Finalization function for the dynamic linker */
        void (*rtld_fini) (void),
        /* End of stack */
        void* stack_end) {

...

char **ev = &argv[argc + 1];
__environ = ev;

...

}

You’ll notice that the function prototype contains argv but not envp, which would be the pointer to our environment variables. The environment pointer is obtained by skipping past all argc arguments, plus one more slot for the NULL pointer that terminates argv. It is then stored in a mysterious __environ variable. What is this variable?
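
The pointer arithmetic can be tried on a hand-built array. The sketch below does not touch the real stack; fake_vector is a made-up stand-in for the layout described above (argv pointers, NULL, environment pointers, NULL), and both function names are hypothetical.

```c
#include <stddef.h>

/* Made-up stand-in for the vector the kernel prepares:
 * two arguments, a NULL, two environment entries, a NULL. */
char *fake_vector[] = { "prog", "-v", NULL, "HOME=/root", "CC=clang", NULL };

/* Mirrors glibc's `&argv[argc + 1]`: skip argc argument pointers plus
 * the NULL that terminates argv. */
char **env_after_args(int argc, char **argv)
{
    return &argv[argc + 1];
}

/* Counts the entries in a NULL-terminated pointer array. */
size_t count_entries(char **vec)
{
    size_t n = 0;
    while (vec[n] != NULL)
        n++;
    return n;
}
```

With argc equal to 2, env_after_args(2, fake_vector) points at the "HOME=/root" slot, and count_entries on that pointer finds the two environment entries.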

The __environ variable lives in the glibc runtime as a global. You can reference it in your code because it has external linkage. ² Functions such as setenv(3) operate on __environ. This is the variable mentioned earlier that you can manipulate.

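#include <stdio.h>
#include <unistd.h>

/* Defined by the C library; declare it here to use it. */
extern char **environ;

int main(void)
{
    /* Print every "NAME=value" entry until the terminating NULL. */
    for (char **ev = environ; *ev != NULL; ev++)
        puts(*ev);
    return 0;
}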

Note: The code snippet shows environ, but this is the same variable as __environ.
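
To make the idea that the library functions operate on this global more concrete, here is a rough sketch of a getenv-style lookup written directly against environ. It is not glibc's implementation; lookup_env and agrees_with_getenv are hypothetical names, and the sketch ignores edge cases such as names containing '='.

```c
#include <stdlib.h>
#include <string.h>

extern char **environ;

/* Hypothetical lookup: scan the NULL-terminated environ array for an
 * entry that starts with "name=" and return a pointer to its value. */
char *lookup_env(const char *name)
{
    size_t len = strlen(name);

    if (environ == NULL)
        return NULL;
    for (char **ev = environ; *ev != NULL; ev++) {
        if (strncmp(*ev, name, len) == 0 && (*ev)[len] == '=')
            return *ev + len + 1;   /* points just past the '=' */
    }
    return NULL;
}

/* Returns 1 when the scan above and the C library's getenv(3) agree
 * on whether `name` is set and on its value. */
int agrees_with_getenv(const char *name)
{
    const char *mine = lookup_env(name);
    const char *libc = getenv(name);

    if (mine == NULL || libc == NULL)
        return mine == libc;
    return strcmp(mine, libc) == 0;
}
```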

But how did it get there? Why does glibc assume it is there?

The Linux kernel initiates the execution of your programs. It sets up the stack in such a way that the environment variables will follow the argv pointer. ³

Thus, before executing the main function, glibc looks at the memory where the kernel has prepared the stack and “wires” up the pointers. One of those pointers is the __environ pointer.

What about programs written and compiled with their own runtime?

All the examples have been in C – what about other programming languages? Golang is a language that brings its own runtime and produces statically linked binaries with it. How does it know about environment variables?

Well, Golang is no different. In fact, every runtime can retrieve the environment pointer in the same fashion as glibc, because the kernel sets up the stack the same way regardless of the runtime. Going back to the Golang example, its runtime package has a similar prelude that sets up environment variables before executing your main function. ⁴

Gotcha: /proc/<pid>/environ not reliable

A common way to view the environment of a process is by looking at /proc/<pid>/environ. The /proc filesystem is a trove of information on processes provided by the kernel. The kernel can show environ because it has an initial reference to the environment when it creates a new process.

This is a common technique when debugging. It is useful to know what the process’ environment is. Environment variables may be used to control the process’ behavior.

However, if a process changes its environment variables at runtime, /proc/<pid>/environ will show stale information. In other words, the /proc filesystem does not update the environ entry.

Why does it not update?

Recall that the environment is changed with setenv(3) or putenv(3). glibc provides those functions. In other words, they are not syscalls, so the kernel is not aware that either of them has been called. They are user-space calls, operating on user-space memory.
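
The staleness is easy to reproduce from inside a process. The sketch below assumes /proc is mounted and uses two hypothetical helpers: proc_environ_has scans /proc/self/environ for an exact "NAME=value" entry, and setenv_is_invisible_in_proc adds a variable with setenv(3) and compares what getenv(3) and /proc report.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if /proc/self/environ contains the exact entry
 * "NAME=value", 0 if it does not, -1 if the file could not be read. */
int proc_environ_has(const char *entry)
{
    FILE *f = fopen("/proc/self/environ", "r");
    if (f == NULL)
        return -1;

    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap);
    if (buf == NULL) {
        fclose(f);
        return -1;
    }
    /* Read the whole file, growing the buffer as needed. */
    while ((n = fread(buf + len, 1, cap - len - 1, f)) > 0) {
        len += n;
        if (len == cap - 1) {
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                fclose(f);
                return -1;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    fclose(f);
    buf[len] = '\0';    /* guarantee the last entry is terminated */

    /* Entries are separated by NUL bytes; compare each one. */
    int found = 0;
    for (size_t off = 0; off < len; off += strlen(buf + off) + 1) {
        if (strcmp(buf + off, entry) == 0) {
            found = 1;
            break;
        }
    }
    free(buf);
    return found;
}

/* Adds NAME=1 at runtime. Returns 1 if getenv(3) sees the variable
 * but /proc/self/environ does not, 0 otherwise, -1 on error. */
int setenv_is_invisible_in_proc(const char *name)
{
    char entry[256];
    if (snprintf(entry, sizeof entry, "%s=1", name) >= (int)sizeof entry)
        return -1;
    if (setenv(name, "1", 1) != 0)
        return -1;

    int in_proc = proc_environ_has(entry);
    if (in_proc < 0)
        return -1;
    return getenv(name) != NULL && in_proc == 0;
}
```

On a typical Linux system the second function is expected to return 1: the process itself sees the new variable, while the kernel's view through /proc does not include it.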

Specifically, the kernel has an internal data structure which stores a bunch of metadata about the process. This data structure looks like:

struct mm_struct {
    ...
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    ...
};

This struct is defined inside include/linux/mm_types.h.

When a new process is created, the kernel stores the address of some items on the stack, such as the start and end of argv, and the environment.

The environment addresses (env_start and env_end) are set in create_elf_tables() in fs/binfmt_elf.c when the ELF program is loaded.

What this means is that if you add an environment variable at runtime, the new variable is stored outside the region between env_start and env_end (glibc’s setenv(3) allocates it on the heap). But env_start and env_end are not updated to reflect that.

I plan to expand on the details above, such as briefly describing what ELF is. I also want to understand why modifying an existing environment variable, which has no effect on the environment addresses, is still not reflected in /proc/<pid>/environ. My instinct is some kind of caching. Stay tuned.
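
One way to probe the open question about modified variables is to check what setenv(3) does to the original string. The sketch below rests on an assumption rather than a finding from this post: that the C library (glibc, as far as its behaviour is commonly described) stores the replacement in newly allocated memory and repoints the environ entry, leaving the original bytes alone. If that holds, the region between env_start and env_end never changes, which would explain the stale output without any caching. The function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

extern char **environ;

/* Replaces the value of the first inherited variable via setenv(3).
 * Returns 1 if the original "NAME=value" bytes are unchanged and
 * getenv(3) now points somewhere else, 0 if not, and -1 if the
 * experiment could not run (for example, an empty environment). */
int original_bytes_survive_setenv(void)
{
    if (environ == NULL || environ[0] == NULL)
        return -1;

    char *original = environ[0];    /* normally set up at exec time */
    const char *eq = strchr(original, '=');
    if (eq == NULL || eq == original)
        return -1;

    char *snapshot = strdup(original);  /* copy taken before the change */
    char *name = strndup(original, (size_t)(eq - original));
    if (snapshot == NULL || name == NULL) {
        free(snapshot);
        free(name);
        return -1;
    }

    int result = -1;
    if (setenv(name, "changed-at-runtime", 1) == 0) {
        const char *now = getenv(name);
        int untouched = strcmp(original, snapshot) == 0;
        int moved = now != NULL && now != eq + 1 &&
                    strcmp(now, "changed-at-runtime") == 0;
        result = untouched && moved;
        /* Put the old value back so the caller's environment is intact. */
        setenv(name, snapshot + (eq - original) + 1, 1);
    }

    free(snapshot);
    free(name);
    return result;
}
```

A return value of 1 would be consistent with the explanation above; it does not prove what the kernel does, only that the bytes it would read were not modified.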

Conclusion

Environment variables live on the stack.
Environment variables are per-process.
Accessing or modifying environment variables is handled via a userspace call; this is frequently mistaken for being a syscall.
A process inherits the environment of its parent, unless instructed not to.
Each process has a /proc/<pid>/environ entry that shows the environment as it was set up at process-creation time.
The above entry is stale if the process changes its environment during execution.

Sources

For more information on the glibc: https://eli.thegreenplace.net/2012/08/13/how-statically-linked-programs-run-on-linux
