Intro
Environment variables are used to pass information to a process from the
running environment, usually the shell. Many people know and interact with
environment variables using the shell builtin called export.
For example, suppose you have a C project that you want to compile using make.
The make program uses the CC environment variable to figure out which compiler
to use. If CC is not set, it falls back on cc, which is the OS’s default C
compiler. Say your default compiler is gcc and you want to compile with clang.
You’d do something like this:
CC=clang make
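To make the consuming side concrete, here is a minimal sketch of how a program could read CC and fall back on cc. The function name pick_compiler is made up for illustration, and this is not how make itself is implemented.

```c
#include <stdlib.h>

/* Return the compiler named by the CC environment variable, or "cc"
 * when CC is not set. pick_compiler is a hypothetical helper that only
 * mirrors the fallback described above. */
const char *pick_compiler(void)
{
    const char *cc = getenv("CC");   /* NULL when CC is not set */
    return cc != NULL ? cc : "cc";
}
```

Run as `CC=clang ./prog`, a program calling pick_compiler() would get back clang. Run without CC in the environment, it would get back cc.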
Environment variables are easy to set, but it is hard to know whether a program
requires them. The only way to know is to read the source code, or hope the
documentation is up-to-date. This makes them poor for maintainability because
there is no clear, defined contract that dictates they are required. In the
case of make, this is stated in its man page.
I’d like to share how environment variables work. I’d also like to share a
little-known caveat about the /proc filesystem. In order to do that, I’d
like to first take a deep dive into how environment variables are actually set
on Linux.
Deep dive
Before diving in, let’s talk about where environment variables live.
Environment variables live in the program’s stack in user-space along with
arguments passed via command line (argv).
Where do they come from?
There are three ways they can be set:
- Inherited from the parent process (often from a shell).
- Explicitly passed via a call to a specific variant of exec(3).
- In your program, using setenv(3) or putenv(3). Or, for the brave, by manually
  manipulating the environ(7) pointer. We will talk more about this one later.
Note: setenv(3) and putenv(3) happen to be for C, but every programming
language ought to have a similar counterpart.
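The two C functions are not interchangeable, which the text above does not go into. As specified by POSIX, setenv(3) copies the name and value, while putenv(3) places the caller's own string into the environment. The sketch below demonstrates that difference. The function names and the DEMO_SET / DEMO_PUT variables are made up for illustration.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>

/* setenv(3) copies its arguments, so changing our buffer afterwards
 * does not change the environment. Returns 1 if that holds. */
int setenv_copies(void)
{
    char value[] = "one";
    if (setenv("DEMO_SET", value, 1) != 0)
        return -1;
    value[0] = 'X';                        /* mutate our buffer only */
    const char *seen = getenv("DEMO_SET");
    return seen != NULL && strcmp(seen, "one") == 0;
}

/* putenv(3) stores the pointer we hand it, so the string must stay
 * valid and edits to it show up in the environment. Returns 1 if so. */
int putenv_aliases(void)
{
    static char entry[] = "DEMO_PUT=one";  /* static: outlives the call */
    if (putenv(entry) != 0)
        return -1;
    memcpy(entry + 9, "two", 3);           /* overwrite the value in place */
    const char *seen = getenv("DEMO_PUT");
    return seen != NULL && strcmp(seen, "two") == 0;
}
```

A practical consequence is that passing a stack buffer to putenv(3) is a bug waiting to happen, since the environment would end up pointing at memory that goes away when the function returns.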
But where do they really come from?
In Linux, there’s a file called /etc/environment which comes pre-populated by
your distro (distribution). When you log in, this file is read to initialize
the session’s environment (on most distributions this is done by the PAM module
pam_env rather than by the shell itself). The shell will also load files such
as /etc/profile and, in the case of bash, ~/.bashrc, which may also set
environment variables.
Afterwards, the environment “persists” throughout the system by inheritance
through the fork(2) and exec (usually execve(2)) calls.
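Inheritance through fork(2) can be observed directly. The sketch below forks a child, lets the child look up a variable in the environment it inherited, and sends the value back over a pipe. The helper name child_getenv is made up for illustration, and the sketch only covers the fork half. With execve(2), the new program receives whatever envp array the caller passes in.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, have it look up `name` in its inherited environment,
 * and copy what it saw into buf. Returns 0 on success, -1 on error or
 * if the child did not find the variable. */
int child_getenv(const char *name, char *buf, size_t len)
{
    int fds[2];
    if (len == 0 || pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                        /* child */
        close(fds[0]);
        const char *v = getenv(name);
        if (v != NULL) {
            size_t n = strlen(v);
            if (write(fds[1], v, n) != (ssize_t)n)
                _exit(2);
        }
        _exit(v != NULL ? 0 : 1);
    }

    close(fds[1]);                         /* parent */
    size_t used = 0;
    ssize_t n;
    while (used < len - 1 &&
           (n = read(fds[0], buf + used, len - 1 - used)) > 0)
        used += (size_t)n;
    buf[used] = '\0';
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

If the parent calls setenv(3) before child_getenv(), the child sees the new value, because fork(2) duplicates the parent's user-space memory, environment included.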
How did they get on the stack?
In the picture above, the environment variables are stored after argv. Here’s
a snippet of how glibc starts your program’s main function: [1]
int __libc_start_main(
    /* Pointer to the program's main function */
    int (*main) (int, char**, char**),
    /* argc and argv */
    int argc, char **argv,
    /* Pointers to initialization and finalization functions */
    __typeof (main) init, void (*fini) (void),
    /* Finalization function for the dynamic linker */
    void (*rtld_fini) (void),
    /* End of stack */
    void* stack_end) {
    ...
    char **ev = &argv[argc + 1];
    __environ = ev;
    ...
}
You’ll notice that only argv is in the function prototype, but not envp,
which would be the pointer to our environment variables. Our environment
pointer is obtained by skipping past all the arguments (argc), plus one for
the NULL pointer that terminates argv. Then it is stored in a mysterious
__environ variable. What is this variable?
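The pointer arithmetic is easier to see on a small model. The sketch below applies the same &argv[argc + 1] computation to any array laid out like the initial stack: argc argument pointers, a NULL, the environment pointers, and a final NULL. The names env_after_argv and vector_len are made up for illustration.

```c
#include <stddef.h>

/* Given argc and an argv laid out as described above, return the start
 * of the environment pointers: argv holds argc entries plus a
 * terminating NULL, and the environment begins right after it. */
char **env_after_argv(int argc, char **argv)
{
    return &argv[argc + 1];
}

/* Count the entries of a NULL-terminated pointer vector. */
size_t vector_len(char **vec)
{
    size_t n = 0;
    while (vec[n] != NULL)
        n++;
    return n;
}
```

For a hand-built array such as { "prog", "-v", NULL, "HOME=/root", "CC=clang", NULL } with argc set to 2, env_after_argv() points at the "HOME=/root" slot. In a real program on Linux with glibc, I would expect env_after_argv(argc, argv) to equal environ at the top of main, as long as nothing has modified the environment yet.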
The __environ variable lives in the glibc runtime as a global. You can
reference it in your code because it has external linkage. [2] Functions such as
setenv(3) operate on __environ. This is the variable mentioned earlier that you
can manipulate.
#include <unistd.h>
...
extern char **environ;
...
int main(...
Note: The code snippet shows environ, but this is the same variable as
__environ.
But how did it get there? Why does glibc assume it is there?
The Linux kernel initiates the execution of your programs. It sets up the stack
in such a way that the environment variables follow the argv pointers. [3]
Thus, before executing the main function, glibc looks in memory to where the
kernel has prepared the stack, and “wires” up the pointers. One of those
pointers is the __environ pointer.
What about programs written and compiled with their own runtime?
All the examples have been in C – what about other programming languages? Golang is a language that brings its own runtime and creates statically compiled binaries with it. How does it know about environment variables?
Well, Golang is no different. In fact, every runtime retrieves the environment
pointer in the same fashion as glibc because the kernel sets up the stack the
same way, regardless of the runtime. Going back to the Golang example, Golang
has a similar prelude in its runtime package that sets up environment variables
before executing your main function. [4]
Gotcha: /proc/<pid>/environ not reliable
A common way to view the environment of a process is by looking at
/proc/<pid>/environ. The /proc filesystem is a trove of information on
processes provided by the kernel. The kernel can show environ because it has
an initial reference to the environment when it creates a new process.
This is a common technique when debugging. It is useful to know what the process’ environment is. Environment variables may be used to control the process’ behavior.
However, if a process changes its environment variables during runtime,
/proc/<pid>/environ will contain stale information. In other words,
the /proc filesystem does not update the environ entry.
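You can reproduce the staleness from inside a single process. The sketch below reads /proc/self/environ, whose entries are separated by NUL bytes, and reports whether a given variable appears in it. The function name proc_environ_has is made up for illustration, and the sketch assumes /proc is mounted.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if /proc/self/environ contains an entry NAME=..., 0 if it
 * does not, and -1 if the file cannot be read. */
int proc_environ_has(const char *name)
{
    FILE *f = fopen("/proc/self/environ", "rb");
    if (f == NULL)
        return -1;

    size_t cap = 4096, used = 0, n;
    char *data = malloc(cap);
    if (data == NULL) { fclose(f); return -1; }

    /* Read the whole file, growing the buffer as needed and always
     * leaving room for one terminating NUL byte. */
    while ((n = fread(data + used, 1, cap - used - 1, f)) > 0) {
        used += n;
        if (cap - used - 1 == 0) {
            char *bigger = realloc(data, cap * 2);
            if (bigger == NULL) { free(data); fclose(f); return -1; }
            data = bigger;
            cap *= 2;
        }
    }
    fclose(f);
    data[used] = '\0';

    /* Walk the NUL-separated NAME=value entries. */
    size_t len = strlen(name);
    int found = 0;
    for (size_t i = 0; i < used; i += strlen(data + i) + 1) {
        if (strncmp(data + i, name, len) == 0 && data[i + len] == '=') {
            found = 1;
            break;
        }
    }
    free(data);
    return found;
}
```

After a call such as setenv("DEMO_STALE_VAR", "1", 1), getenv(3) finds the variable, while I would expect proc_environ_has("DEMO_STALE_VAR") to return 0 on Linux with glibc.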
Why does it not update?
Recall that to change the environment, setenv(3) or putenv(3) is used. glibc
provides those functions. In other words, they are not syscalls, so the
kernel is not aware that any of those functions have been called. They are
user-space calls, operating on user-space memory.
Specifically, the kernel has an internal data structure which stores a bunch of metadata about the process. This data structure looks like:
struct mm_struct {
    ...
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    ...
};
This struct is defined inside include/linux/mm_types.h.
When a new process is created, the kernel stores the address of some items on
the stack, such as the start and end of argv, and the environment.
The environment addresses (env_start and env_end) are set in
create_elf_tables() inside fs/binfmt_elf.c when the ELF program is loaded.
What this means is, if you add an environment variable at runtime, glibc stores
it in newly allocated memory, outside the range between env_start and env_end
that the kernel recorded. And env_start and env_end are never updated to
reflect that.
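The recorded range can be inspected from user-space: according to proc(5), fields 50 and 51 of /proc/<pid>/stat are env_start and env_end. The sketch below reads them for the current process and offers a helper to check whether a pointer falls inside the range. The names read_env_range and in_range are made up for illustration, and the kernel may report zeros if access is not permitted.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read env_start and env_end (fields 50 and 51 of /proc/self/stat,
 * per proc(5)). Returns 0 on success, -1 on failure. */
int read_env_range(uintptr_t *start, uintptr_t *end)
{
    char line[4096];
    FILE *f = fopen("/proc/self/stat", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL)
        return -1;

    /* Field 2 (comm) is parenthesised and may contain spaces, so start
     * counting after the last ')'. The next token is field 3. */
    char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;
    p++;

    int field = 2;
    for (char *tok = strtok(p, " \n"); tok != NULL;
         tok = strtok(NULL, " \n")) {
        field++;
        if (field == 50)
            *start = (uintptr_t)strtoull(tok, NULL, 10);
        if (field == 51) {
            *end = (uintptr_t)strtoull(tok, NULL, 10);
            return 0;
        }
    }
    return -1;
}

/* Return 1 if ptr lies inside [start, end), else 0. */
int in_range(const void *ptr, uintptr_t start, uintptr_t end)
{
    uintptr_t a = (uintptr_t)ptr;
    return a >= start && a < end;
}
```

With glibc, I would expect the pointer returned by getenv(3) for a variable added via setenv(3) to fall outside the range, while the strings the process started with sit inside it.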
I plan to expand on the details above, such as briefly describing what ELF is.
I also plan to figure out why, if you modify an existing environment variable,
which has no effect on the environment addresses, the change is still not
reflected in /proc/<pid>/environ. My instinct is some kind of caching. Stay
tuned.
Conclusion
- Environment variables live on the stack.
- Environment variables are per-process.
- Accessing or modifying environment variables is handled via a userspace call; this is frequently mistaken for being a syscall.
- A process inherits the environment of its parent, unless instructed not to.
- Each process has an entry in /proc/<pid>/environ where a copy of the environment is made at process-creation time.
- The above entry is stale if the process changes its environment during execution.
Sources
For more information on glibc:
https://eli.thegreenplace.net/2012/08/13/how-statically-linked-programs-run-on-linux