Preface
At Survata, we do a lot of data processing using Python and its suite of data processing libraries like pandas and Scikit-learn. This means we use a lot of cloud computing resources, and as a result, our monthly hosting bill can be… hefty. One way to trim the amount you spend on cloud resources is to make sure you don’t ask for more resources than you actually use. Cloud providers make it really easy to spin up a multiple-GB-of-RAM server — but if your actual running process only uses a fraction of that memory, you’re wasting resources — and that means money!
However, you can’t optimize the resources you use if you don’t know what you’re actually using.
Option 1: Ask the operating system
The easiest way to track memory usage is to use the operating system itself. You can use top to provide an overview of the resources you’re using over time. Alternatively, if you want a spot inspection of resource usage, you can use the ps command:
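For example, with the BSD-style ps found on macOS, an invocation using the flags described below might look something like this (a sketch; flag behavior differs between platforms):

    # Sort by memory usage; show CPU%, memory%, and the command line (BSD/macOS ps).
    ps -m -o %cpu,%mem,command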
The -m flag instructs ps to show results in order of which processes are using the most memory. The -o flag controls which properties of each process are displayed — in this case, the percentage of CPU being used, the percentage of system memory being consumed, and the command line of the process being executed. The CPU percentage counts one full CPU core as 100% usage, so if you have a 4-core machine, it’s possible to see a total of up to 400% CPU usage. There are other output options to display other process properties, and other flags to ps to control which processes are displayed.
Combined with some creative shell scripting, you could write a monitoring script that uses ps to track the memory usage of your tasks over time. Most hosting providers also provide dashboards for monitoring machine-level resource usage. There are also profilers like py-spy that can wrap the execution of a Python process and measure its memory and CPU usage. These profilers use operating system calls, combined with a knowledge of how Python code executes, to take periodic measurements of your program as it runs and identify which parts of your code are using resources.
Unfortunately, this approach isn’t always viable for data pipeline tasks. In our situation, we’re using AWS Batch as a host for our compute tasks, which obscures the operating system-level interface. Each deployed task is wrapped in a Docker container; that task then nominates how much memory and CPU it needs to run.
This containerization process obscures how much memory is being used inside the container. From the hosting provider’s perspective, a Docker container that allocates 8GB of RAM is using all that memory, even if the code running inside the container only allocates a fraction of that amount.
So — we need to monitor memory usage inside the container.
Your first inclination might be to use the same operating system techniques, but inside the container. While this does technically work, general advice is that a Docker container should run a single process — so running a second monitoring process inside a container isn’t a good option.
Measuring memory usage from outside the running process also makes it harder to collect metrics that would let us correlate memory usage with properties of the data being analyzed. For example, does memory usage scale with the amount of data in the data set? Or is it related to the complexity of the analysis performed? When measuring at the level of the operating system, it can be difficult to collect metrics about the operation of the underlying analysis.
What we need is a way to monitor the memory usage of a running Python process, from inside that process.
Option 2: tracemalloc
The Python interpreter has a remarkable number of hooks into its operation that can be used to monitor and introspect into Python code as it runs. These hooks are used by pdb to provide debugging; they’re also used by coverage to provide test coverage. They’re also used by the tracemalloc module to provide a window into memory usage.
tracemalloc is a standard library module, added in Python 3.4, that tracks every individual memory block allocated by the Python interpreter. tracemalloc is able to provide extremely fine-grained information about memory allocations in the running Python process:
test_mem.py:

    #!/usr/bin/env python3
    import tracemalloc
    import time

    if __name__ == '__main__':
        tracemalloc.start()

        my_list = []
        for i in range(10000):
            my_list.extend(list(range(1000)))

        time.sleep(5)

        current, peak = tracemalloc.get_traced_memory()
        print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
        tracemalloc.stop()
Calling tracemalloc.start() starts the tracing process. While tracing is underway, you can ask for details of what has been allocated; in this case, we’re just asking for the current and peak memory allocation. Calling tracemalloc.stop() removes the hooks and clears any traces that have been gathered.
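To illustrate just how fine-grained that detail can get, tracemalloc can also attribute allocations to individual source lines via snapshots. Here is a minimal sketch (separate from the example above) using the standard library's take_snapshot() and statistics() APIs:

    import tracemalloc

    tracemalloc.start()

    data = [list(range(1000)) for _ in range(1000)]

    # Take a snapshot of all current allocations, grouped by source line.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:5]:
        # Each entry shows a file and line number, plus the size and count
        # of the allocations made there.
        print(stat)

    tracemalloc.stop()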
There’s a price to be paid for this level of detail, though. tracemalloc injects itself deep into the running Python process, which, as you might expect, comes with a performance cost. In our testing, we observed a 30% slowdown when running an analysis under tracemalloc. That might be acceptable when profiling an individual process, but in production, you really don’t want a 30% performance hit just so you can monitor memory usage.
Option 3: Sampling
Luckily, the Python standard library provides another way to observe memory usage: the resource module. The resource module provides basic facilities for measuring (and controlling) the system resources used by a program, including memory usage:
    import resource

    # Maximum resident set size used by this process
    # (reported in kilobytes on Linux, and in bytes on macOS).
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
However, unlike the tracemalloc module, the resource module doesn’t track usage over time; it only provides a point-in-time reading. So, we need to implement a way to sample memory usage over time. First, we define a class to perform the memory monitoring:
    import resource
    from time import sleep

    class MemoryMonitor:
        def __init__(self):
            self.keep_measuring = True

        def measure_usage(self):
            max_usage = 0
            while self.keep_measuring:
                max_usage = max(
                    max_usage,
                    resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
                )
                sleep(0.1)

            return max_usage
Then, we run the analysis in one thread of a ThreadPoolExecutor, with the monitor running in a second thread:

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor() as executor:
        monitor = MemoryMonitor()
        mem_thread = executor.submit(monitor.measure_usage)
        try:
            fn_thread = executor.submit(my_analysis_function)
            result = fn_thread.result()
        finally:
            monitor.keep_measuring = False
            max_usage = mem_thread.result()

        print(f"Peak memory usage: {max_usage}")
Using this approach, we’re effectively sampling memory usage over time. Most of the work will be done in the main analysis thread; but every 0.1s, the monitor thread will wake up, take a memory measurement, store it if memory usage has increased, and go back to sleep.
The performance overhead of this sampling approach is minimal. Although sampling every 0.1 seconds might sound like a lot, it’s an eternity in CPU time, and as a result, there is a negligible impact on overall processing time. This sampling rate can be tuned, too; if you do see an overhead, you can increase the pause between samples; or, if you need more precise data, you can decrease the pause.
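As a sketch of that tuning (an illustrative variation, not the class as written above), the sampling interval could be made a constructor argument:

    import resource
    from time import sleep

    class TunableMemoryMonitor:
        def __init__(self, interval=0.1):
            # interval: pause between samples, in seconds. A longer pause means
            # less overhead; a shorter pause means finer-grained measurements.
            self.keep_measuring = True
            self.interval = interval

        def measure_usage(self):
            max_usage = 0
            while self.keep_measuring:
                max_usage = max(
                    max_usage,
                    resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
                )
                sleep(self.interval)

            return max_usage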
The downside is that the sampling-based monitoring approach is imprecise. You’re only sampling memory usage, so short-lived memory allocation spikes will be lost in this analysis. However, for the purposes of optimizing cloud resource allocation, we only need rough numbers. We are only looking to answer whether our process is using 8GB or 10GB of RAM, not differentiate at the byte (or even megabyte) level.
Conclusion
It’s impossible to improve something you aren’t measuring. Armed with more information about the memory usage of our analysis tasks, we’re now in a much better position to optimize our resource usage. And, we’ve been able to collect that information with relatively little code and relatively little performance overhead.