NumPy Memory Mapping

When working with large datasets in Python, loading entire arrays into RAM can quickly exhaust your system’s memory. This is where NumPy memory mapping becomes invaluable: it lets you work with arrays stored on disk as if they were in memory, providing efficient access to large datasets without overwhelming your RAM. In this guide, we’ll explore how NumPy memory mapping works, its applications, and practical techniques for using memory-mapped arrays in your data processing workflows.

NumPy memory mapping creates a direct correspondence between disk files and array objects, enabling you to access and modify large datasets efficiently. This technique is particularly useful when dealing with datasets that are too large to fit in available RAM or when you need to share data between multiple processes.
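For example, because every process that maps the same file sees the same underlying pages, a memory-mapped file offers a simple way to share an array across worker processes. Below is a minimal sketch of that idea; the file name shared.dat and the fill_slice helper are illustrative names, not part of NumPy’s API:

import numpy as np
from multiprocessing import Process

def fill_slice(path, start, stop):
    # Each worker re-opens the same file; the pages are shared,
    # so writes become visible to the other processes
    arr = np.memmap(path, dtype='float64', mode='r+', shape=(1000,))
    arr[start:stop] = np.arange(start, stop)
    arr.flush()

if __name__ == '__main__':
    # The parent creates and sizes the file up front
    shared = np.memmap('shared.dat', dtype='float64', mode='w+', shape=(1000,))
    shared.flush()

    workers = [Process(target=fill_slice, args=('shared.dat', i * 250, (i + 1) * 250))
               for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print(shared[:5])  # Written by the first worker: [0. 1. 2. 3. 4.]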

Understanding NumPy Memory Mapping

NumPy memory mapping is implemented through the numpy.memmap class, which creates a memory-mapped file object that behaves like a standard NumPy array. Unlike regular arrays that reside entirely in RAM, memory-mapped arrays keep their data on disk and load only the portions you’re actively using into memory. This lazy-loading approach makes NumPy memory mapping extremely efficient for large-scale data operations.

The core advantage of NumPy memory mapping is that it allows you to perform array operations on datasets much larger than your available RAM. When you access elements of a memory-mapped array, the operating system automatically handles the transfer of data between disk and memory, making NumPy memory mapping transparent to your code.
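A quick way to see this laziness in action: creating or opening a memmap of around a gigabyte is nearly instantaneous, because no data is read until you index into it. The file name and size below are illustrative, and on most filesystems the file is created sparse rather than written in full:

import numpy as np

# Creating the mapping sizes the file but reads nothing into RAM
big = np.memmap('big.dat', dtype='float32', mode='w+', shape=(250_000_000,))
print(f"Mapped {big.nbytes / 1024**3:.2f} GB almost instantly")

# Only the pages backing this small slice are faulted into memory
chunk = big[1_000_000:1_000_010]
print(chunk)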

Creating Memory-Mapped Arrays

Memory-mapped arrays are created through the numpy.memmap class constructor, which accepts several parameters that control how the array interacts with the underlying file.

dtype Parameter

The dtype parameter specifies the data type of the array elements, which determines how the data is stored on disk and interpreted when accessed.

import numpy as np

# Create memory-mapped array with float64 dtype
mmap_float = np.memmap('data_float.dat', dtype='float64', mode='w+', shape=(100,))
print(f"Float array dtype: {mmap_float.dtype}")

mode Parameter

The mode parameter controls the file access mode. Different modes provide different levels of read and write access; each is covered in detail later in this guide.

# Read-only mode (the file must already exist)
mmap_readonly = np.memmap('data.dat', dtype='int32', mode='r', shape=(50,))

# Read-write mode (the file must already exist)
mmap_readwrite = np.memmap('data_rw.dat', dtype='float32', mode='r+', shape=(50,))

# Write mode (creates a new file, or overwrites an existing one)
mmap_write = np.memmap('data_new.dat', dtype='int64', mode='w+', shape=(50,))

shape Parameter

The shape parameter defines the dimensions of the memory-mapped array, just as with regular NumPy arrays.

# 1D memory-mapped array
mmap_1d = np.memmap('array_1d.dat', dtype='float32', mode='w+', shape=(1000,))

# 2D memory-mapped array
mmap_2d = np.memmap('array_2d.dat', dtype='int32', mode='w+', shape=(100, 50))

# 3D memory-mapped array
mmap_3d = np.memmap('array_3d.dat', dtype='float64', mode='w+', shape=(10, 20, 30))

offset Parameter

The offset parameter specifies the byte offset within the file where the array data begins, which is useful for reading a specific portion of a larger file, such as the payload after a fixed-size header.

# Create array starting at byte offset 1024
mmap_offset = np.memmap('data_offset.dat', dtype='float32', mode='w+', 
                        shape=(100,), offset=1024)
print(f"Array starts at byte offset: 1024")

Writing Data to Memory-Mapped Arrays

Memory-mapped arrays support the same assignment syntax as regular NumPy arrays. Writes go to the mapped pages and are carried back to the underlying file by the operating system; call flush() (covered below) to guarantee they reach disk.

# Create memory-mapped array for writing
mmap_write_data = np.memmap('write_data.dat', dtype='float64', mode='w+', shape=(10,))

# Write individual values
mmap_write_data[0] = 3.14159
mmap_write_data[1] = 2.71828

# Write array slice
mmap_write_data[2:5] = [1.0, 2.0, 3.0]

# Write using broadcasting
mmap_write_data[5:] = 9.99

Reading Data from Memory-Mapped Arrays

Reading from memory-mapped arrays works identically to reading from standard NumPy arrays; the operating system caches the pages you access, so repeated reads of the same region are served from memory.

# Create and populate memory-mapped array
mmap_read = np.memmap('read_data.dat', dtype='int32', mode='w+', shape=(20,))
mmap_read[:] = np.arange(20) * 10

# Read individual elements
value_0 = mmap_read[0]
value_10 = mmap_read[10]

# Read slices
first_five = mmap_read[:5]
last_five = mmap_read[-5:]

# Read with boolean indexing (selects the values divisible by 20)
multiples_of_20 = mmap_read[mmap_read % 20 == 0]
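Note that a basic slice like mmap_read[:5] is a view that is still backed by the file, while boolean and fancy indexing copy the selected values into ordinary memory. A quick demonstration:

# Writing through a slice view modifies the underlying mapping
view = mmap_read[:5]
view[0] = -1
print(mmap_read[0])   # -1: the change went through to the mapping

# A boolean-indexed result is a copy; modifying it leaves the file alone
copied = mmap_read[mmap_read > 100]
copied[0] = 0
print(mmap_read[11])  # still 110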

Flushing Memory-Mapped Arrays

The flush() method ensures that all changes to a memory-mapped array are written to disk immediately, rather than waiting for the operating system’s periodic write-back.

# Create memory-mapped array
mmap_flush = np.memmap('flush_data.dat', dtype='float32', mode='w+', shape=(100,))

# Modify data
mmap_flush[:] = np.random.rand(100)

# Force write to disk
mmap_flush.flush()
print("Data flushed to disk")

Memory-Mapped Array Modes

NumPy memory mapping supports several file access modes that determine how you can interact with the underlying data file.

Read-Only Mode (‘r’)

Read-only mode lets you access existing data while preventing any modification.

# First create some data
initial_data = np.memmap('readonly_data.dat', dtype='int32', mode='w+', shape=(50,))
initial_data[:] = np.arange(50)
initial_data.flush()
del initial_data  # Release the mapping so the file can be reopened

# Open in read-only mode
mmap_readonly = np.memmap('readonly_data.dat', dtype='int32', mode='r', shape=(50,))
print(f"First 5 values: {mmap_readonly[:5]}")

Copy-on-Write Mode (‘c’)

Copy-on-write mode maps the file for reading, but any modification is redirected to an in-memory copy of the affected data, leaving the original file unchanged.

# Create original data
original = np.memmap('cow_data.dat', dtype='float64', mode='w+', shape=(30,))
original[:] = np.linspace(0, 1, 30)
original.flush()
del original

# Open in copy-on-write mode
mmap_cow = np.memmap('cow_data.dat', dtype='float64', mode='c', shape=(30,))
mmap_cow[0] = 999.0  # Changes only in-memory copy
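Reopening the file confirms that the write only affected the in-memory copy:

# The in-memory copy reflects the change...
print(mmap_cow[0])    # 999.0

# ...but the file on disk still holds the original value
verify = np.memmap('cow_data.dat', dtype='float64', mode='r', shape=(30,))
print(verify[0])      # 0.0, the first value of linspace(0, 1, 30)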

Read-Write Mode (‘r+’)

Read-write mode allows both reading and writing to an existing file.

# Create initial file
existing = np.memmap('readwrite_data.dat', dtype='int64', mode='w+', shape=(40,))
existing[:] = np.arange(40) ** 2
existing.flush()
del existing

# Open in read-write mode
mmap_rw = np.memmap('readwrite_data.dat', dtype='int64', mode='r+', shape=(40,))
mmap_rw[10:20] = 0  # Modify existing data
mmap_rw.flush()

Write Mode (‘w+’)

Write mode creates a new file, or overwrites an existing one, and provides both read and write access.

# Create new file with write mode
mmap_new = np.memmap('new_data.dat', dtype='float32', mode='w+', shape=(100, 100))
mmap_new[:] = np.random.randn(100, 100)
mmap_new.flush()

Working with Multidimensional Memory-Mapped Arrays

Memory mapping fully supports multidimensional arrays, enabling efficient handling of matrices and higher-dimensional data structures.

# Create 2D memory-mapped array
mmap_2d_array = np.memmap('matrix_data.dat', dtype='float64', mode='w+', shape=(50, 100))

# Populate with data
mmap_2d_array[:] = np.random.random((50, 100))

# Access rows
first_row = mmap_2d_array[0, :]

# Access columns
first_column = mmap_2d_array[:, 0]

# Access submatrices
submatrix = mmap_2d_array[10:20, 20:40]

Reshaping Memory-Mapped Arrays

Memory-mapped arrays can be reshaped just like regular arrays, as long as the total number of elements stays the same.

# Create 1D memory-mapped array
mmap_reshape = np.memmap('reshape_data.dat', dtype='int32', mode='w+', shape=(120,))
mmap_reshape[:] = np.arange(120)

# Reshape to 2D
mmap_2d_reshaped = mmap_reshape.reshape(10, 12)

# Reshape to 3D
mmap_3d_reshaped = mmap_reshape.reshape(4, 5, 6)
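The reshaped arrays are views onto the same mapped buffer, so a write through any of them lands in the same file:

# All three objects share the same file-backed buffer
mmap_2d_reshaped[0, 0] = -1
print(mmap_reshape[0])            # -1
print(mmap_3d_reshaped[0, 0, 0])  # -1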

Performing Operations on Memory-Mapped Arrays

Memory-mapped arrays support most NumPy operations, though operations typically load the data they touch into memory and return ordinary in-memory arrays.

# Create memory-mapped arrays
mmap_ops1 = np.memmap('ops1.dat', dtype='float64', mode='w+', shape=(1000,))
mmap_ops2 = np.memmap('ops2.dat', dtype='float64', mode='w+', shape=(1000,))

mmap_ops1[:] = np.random.randn(1000)
mmap_ops2[:] = np.random.randn(1000)

# Arithmetic operations
sum_result = mmap_ops1 + mmap_ops2
product_result = mmap_ops1 * mmap_ops2

# Statistical operations
mean_val = np.mean(mmap_ops1)
std_val = np.std(mmap_ops1)
max_val = np.max(mmap_ops1)
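Note that in recent NumPy releases, arithmetic results such as sum_result above are materialized in RAM: NumPy reads the operands from the mapping and builds the output as an ordinary array. For reductions over data too large for that, a chunked loop keeps the working set small; the chunk size below is an arbitrary illustration:

# Out-of-core mean: accumulate over fixed-size chunks so only one
# chunk's worth of pages is resident at a time
total = 0.0
chunk = 100
for start in range(0, len(mmap_ops1), chunk):
    total += mmap_ops1[start:start + chunk].sum()
chunked_mean = total / len(mmap_ops1)
print(f"Chunked mean matches np.mean: {np.isclose(chunked_mean, mean_val)}")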

Comprehensive Example: Large-Scale Sensor Data Processing

Here’s a complete example demonstrating NumPy memory mapping for processing large sensor datasets efficiently:

import numpy as np
import os

# Configuration for sensor data
NUM_SENSORS = 50
NUM_READINGS = 100000
SENSOR_FILE = 'sensor_readings.dat'
PROCESSED_FILE = 'processed_readings.dat'
STATISTICS_FILE = 'sensor_statistics.dat'

# Create memory-mapped array for raw sensor data
print("Creating sensor data array...")
sensor_data = np.memmap(SENSOR_FILE, dtype='float32', mode='w+', 
                        shape=(NUM_SENSORS, NUM_READINGS))

# Simulate sensor readings (temperature in Celsius)
print("Generating simulated sensor readings...")
for sensor_id in range(NUM_SENSORS):
    # Each sensor has slightly different baseline and noise characteristics
    baseline = 20.0 + sensor_id * 0.5
    noise = np.random.randn(NUM_READINGS) * 2.0
    trend = np.linspace(0, 5, NUM_READINGS)  # Gradual warming trend
    
    sensor_data[sensor_id, :] = baseline + noise + trend

sensor_data.flush()
print(f"Sensor data size: {sensor_data.nbytes / (1024**2):.2f} MB")

# Process data: convert to Fahrenheit and apply calibration
print("\nProcessing sensor data...")
processed_data = np.memmap(PROCESSED_FILE, dtype='float32', mode='w+', 
                          shape=(NUM_SENSORS, NUM_READINGS))

# Process in chunks to minimize memory usage
chunk_size = 10000
for i in range(0, NUM_READINGS, chunk_size):
    end_idx = min(i + chunk_size, NUM_READINGS)
    
    # Convert Celsius to Fahrenheit and apply 2% calibration factor
    celsius_chunk = sensor_data[:, i:end_idx]
    fahrenheit_chunk = (celsius_chunk * 9/5 + 32) * 1.02
    processed_data[:, i:end_idx] = fahrenheit_chunk

processed_data.flush()

# Calculate statistics for each sensor
print("\nCalculating sensor statistics...")
statistics = np.memmap(STATISTICS_FILE, dtype='float64', mode='w+', 
                      shape=(NUM_SENSORS, 5))  # mean, std, min, max, median

for sensor_id in range(NUM_SENSORS):
    sensor_readings = processed_data[sensor_id, :]
    
    statistics[sensor_id, 0] = np.mean(sensor_readings)
    statistics[sensor_id, 1] = np.std(sensor_readings)
    statistics[sensor_id, 2] = np.min(sensor_readings)
    statistics[sensor_id, 3] = np.max(sensor_readings)
    statistics[sensor_id, 4] = np.median(sensor_readings)

statistics.flush()

# Display results
print("\n" + "="*70)
print("SENSOR STATISTICS (Temperature in Fahrenheit)")
print("="*70)
print(f"{'Sensor':<8} {'Mean':<10} {'Std Dev':<10} {'Min':<10} {'Max':<10} {'Median':<10}")
print("-"*70)

for sensor_id in range(min(10, NUM_SENSORS)):  # Show first 10 sensors
    stats = statistics[sensor_id]
    print(f"#{sensor_id:<7} {stats[0]:<10.2f} {stats[1]:<10.2f} {stats[2]:<10.2f} "
          f"{stats[3]:<10.2f} {stats[4]:<10.2f}")

# Identify anomalous readings (beyond 3 standard deviations)
print("\n" + "="*70)
print("ANOMALY DETECTION")
print("="*70)

anomaly_count = 0
for sensor_id in range(NUM_SENSORS):
    mean = statistics[sensor_id, 0]
    std = statistics[sensor_id, 1]
    
    sensor_readings = processed_data[sensor_id, :]
    anomalies = np.abs(sensor_readings - mean) > 3 * std
    num_anomalies = np.sum(anomalies)
    
    if num_anomalies > 0:
        anomaly_count += num_anomalies
        anomaly_percent = (num_anomalies / NUM_READINGS) * 100
        print(f"Sensor #{sensor_id}: {num_anomalies} anomalies ({anomaly_percent:.2f}%)")

print(f"\nTotal anomalies detected: {anomaly_count}")
print(f"Overall anomaly rate: {(anomaly_count / (NUM_SENSORS * NUM_READINGS)) * 100:.3f}%")

# Calculate cross-sensor correlations (sample of sensors)
print("\n" + "="*70)
print("CROSS-SENSOR CORRELATION ANALYSIS")
print("="*70)

sample_sensors = [0, 10, 20, 30, 40]
correlation_matrix = np.zeros((len(sample_sensors), len(sample_sensors)))

for i, sensor_i in enumerate(sample_sensors):
    for j, sensor_j in enumerate(sample_sensors):
        if i <= j:
            # Calculate correlation coefficient
            correlation = np.corrcoef(
                processed_data[sensor_i, :10000],  # Use first 10k readings
                processed_data[sensor_j, :10000]
            )[0, 1]
            correlation_matrix[i, j] = correlation
            correlation_matrix[j, i] = correlation

print("\nCorrelation Matrix (sample sensors):")
print(f"{'':>8}", end='')
for sensor in sample_sensors:
    print(f"S{sensor:<7}", end='')
print()

for i, sensor_i in enumerate(sample_sensors):
    print(f"S{sensor_i:<7}", end='')
    for j, sensor_j in enumerate(sample_sensors):
        print(f"{correlation_matrix[i, j]:>7.3f}", end=' ')
    print()

# File size information
print("\n" + "="*70)
print("MEMORY-MAPPED FILE INFORMATION")
print("="*70)
print(f"Raw sensor data: {os.path.getsize(SENSOR_FILE) / (1024**2):.2f} MB")
print(f"Processed data: {os.path.getsize(PROCESSED_FILE) / (1024**2):.2f} MB")
print(f"Statistics data: {os.path.getsize(STATISTICS_FILE) / (1024**2):.2f} MB")
print(f"Total disk usage: {(os.path.getsize(SENSOR_FILE) + os.path.getsize(PROCESSED_FILE) + os.path.getsize(STATISTICS_FILE)) / (1024**2):.2f} MB")

# Clean up
del sensor_data
del processed_data
del statistics

print("\nProcessing complete! All data saved to memory-mapped files.")

Sample Output (representative; exact values vary from run to run because the readings are randomly generated):

Creating sensor data array...
Generating simulated sensor readings...
Sensor data size: 19.07 MB

Processing sensor data...

Calculating sensor statistics...

======================================================================
SENSOR STATISTICS (Temperature in Fahrenheit)
======================================================================
Sensor   Mean       Std Dev    Min        Max        Median    
----------------------------------------------------------------------
#0       70.68      3.70       57.69      82.00      70.70     
#1       71.59      3.71       58.52      83.29      71.61     
#2       72.51      3.72       59.00      84.33      72.52     
#3       73.43      3.72       60.40      85.46      73.44     
#4       74.35      3.73       61.51      86.33      74.36     
#5       75.26      3.74       62.16      87.56      75.28     
#6       76.18      3.74       62.88      88.44      76.20     
#7       77.10      3.75       64.28      89.66      77.11     
#8       78.02      3.76       64.99      90.76      78.03     
#9       78.93      3.76       65.41      91.62      78.95     

======================================================================
ANOMALY DETECTION
======================================================================
Sensor #0: 267 anomalies (0.27%)
Sensor #1: 277 anomalies (0.28%)
Sensor #2: 294 anomalies (0.29%)
Sensor #3: 282 anomalies (0.28%)
Sensor #4: 272 anomalies (0.27%)
Sensor #5: 255 anomalies (0.26%)
Sensor #6: 264 anomalies (0.26%)
Sensor #7: 289 anomalies (0.29%)
Sensor #8: 271 anomalies (0.27%)
Sensor #9: 283 anomalies (0.28%)
...

Total anomalies detected: 13812
Overall anomaly rate: 0.276%

======================================================================
CROSS-SENSOR CORRELATION ANALYSIS
======================================================================

         S0     S10    S20    S30    S40    
S0       1.000 -0.015  0.004 -0.006  0.002 
S10     -0.015  1.000  0.012 -0.018  0.008 
S20      0.004  0.012  1.000  0.005 -0.011 
S30     -0.006 -0.018  0.005  1.000  0.016 
S40      0.002  0.008 -0.011  0.016  1.000 

======================================================================
MEMORY-MAPPED FILE INFORMATION
======================================================================
Raw sensor data: 19.07 MB
Processed data: 19.07 MB
Statistics data: 0.00 MB
Total disk usage: 38.15 MB

Processing complete! All data saved to memory-mapped files.

This example demonstrates how memory mapping enables efficient processing of large sensor datasets. The implementation creates memory-mapped arrays for the raw readings, the processed readings, and the statistical summaries, and by processing the data in chunks it can handle datasets much larger than available RAM. It covers realistic sensor-data simulation, temperature conversion, calibration adjustment, anomaly detection with statistical thresholds, and cross-sensor correlation analysis. The same approach is common in environmental monitoring systems, industrial IoT applications, and scientific data collection, where continuous sensor streams generate large volumes of data that need persistent storage and efficient analysis.

For more information on NumPy memory mapping and related functionality, visit the official NumPy documentation.