Part I: Python Fundamentals

Chapter 4

NumPy and Pandas — The Engineer's Power Tools

15 min read · 10 exercises

Up to this point, every calculation we have written processes one value at a time, or iterates through a list element by element. That works for a single well, a single depth, a single month. It does not scale. A producing field generates thousands of data points per day across dozens of wells. A reservoir simulation grid can contain millions of cells. A well log records measurements every six inches over thousands of feet of section.

NumPy and Pandas are the two libraries that make large-scale petroleum data analysis practical. NumPy provides fast array operations — performing the same calculation on thousands of values simultaneously, without a loop. Pandas provides the DataFrame — a tabular data structure that handles the messy, labeled, mixed-type data that petroleum engineers actually work with.

This chapter teaches both through real petroleum data tasks: loading production data, cleaning it, computing engineering quantities across entire fields, merging datasets from different sources, and producing the summary tables and visualizations that appear in engineering reports.

What You Will Learn

NumPy arrays — vectorized arithmetic, performance, and why loops disappear
Pandas Series and DataFrames — loading, indexing, filtering, and transforming tabular data
Data cleaning — handling missing values, outliers, unit inconsistencies, and physically impossible values
Merging and aggregating — joining well headers with production data, computing field-level summaries
Time series — resampling, rolling averages, and trend analysis for production surveillance

NumPy — Fast Arithmetic on Arrays

Why Arrays Matter

Consider a routine task: calculating hydrostatic pressure at many depths for a given mud weight. With a Python list and a loop, the formula is evaluated once per depth, one iteration at a time. With a NumPy array, you write the formula once and it applies to every depth. The comparison below uses 100,000 depth points.

main.py

import numpy as np
import time

mud_weight_ppg = 11.6
n_depths = 100_000  # 100,000 depth points — realistic for a fine-grid pressure profile

# === List approach — element by element ===
depths_list = [i * 0.1 for i in range(1, n_depths + 1)]  # 0.1 to 10,000 ft in 0.1-ft steps

start = time.perf_counter()
pressures_list = []
for d in depths_list:
    pressures_list.append(0.052 * mud_weight_ppg * d)
list_time = time.perf_counter() - start

# === NumPy approach — all at once ===
# Integer steps scaled to feet: a float step in np.arange can yield an off-by-one length
depths_array = np.arange(1, n_depths + 1) * 0.1

start = time.perf_counter()
pressures_array = 0.052 * mud_weight_ppg * depths_array
numpy_time = time.perf_counter() - start

print(f"Depths computed:     {n_depths:,}")
print(f"List approach:       {list_time*1000:.1f} ms")
print(f"NumPy approach:      {numpy_time*1000:.3f} ms")
print(f"Speedup:             {list_time/numpy_time:.0f}x faster")
print(f"\nFirst 5 pressures:   {pressures_array[:5]}")
print(f"Last 5 pressures:    {pressures_array[-5:]}")

The NumPy line 0.052 * mud_weight_ppg * depths_array applies the formula to all 100,000 elements in a single expression. There is no Python-level loop. The operation is vectorized: the iteration runs in optimized, compiled C code underneath, which is why it is dramatically faster. More importantly, the code is shorter and easier to read: one line that looks like the equation instead of several lines of loop mechanics.

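At 100,000 points, the speedup is typically one to two orders of magnitude, although the exact figure depends on hardware, Python version, and NumPy build. At 10 million points, common in seismic data and reservoir simulation grids, a single pass in a Python loop takes on the order of seconds while the NumPy equivalent takes tens of milliseconds. A workflow that chains many such passes is where that gap grows from seconds into minutes.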
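
Vectorization also extends beyond one dimension. The sketch below is an illustration rather than part of the chapter's running example: it uses NumPy broadcasting to evaluate the same hydrostatic formula for several mud weights at once. The depth values and mud weights are assumed numbers chosen only for the demonstration.

```python
import numpy as np

# Assumed inputs for illustration only
depths_ft = np.array([1000.0, 2500.0, 5000.0, 10000.0])   # shape (4,)
mud_weights_ppg = np.array([9.0, 11.6, 13.0])             # shape (3,)

# Turn depths into a column so every depth pairs with every mud weight.
# Shapes (4, 1) and (3,) broadcast to a (4, 3) grid:
# rows are depths, columns are mud weights.
pressure_grid_psi = 0.052 * depths_ft[:, np.newaxis] * mud_weights_ppg

print(pressure_grid_psi.shape)
print(pressure_grid_psi.round(0))

# Cross-check one cell against the scalar formula (5,000 ft, 11.6 ppg)
scalar_check = 0.052 * 11.6 * 5000.0
print(np.isclose(pressure_grid_psi[2, 1], scalar_check))
```

Each column of the grid is a full pressure profile for one mud weight, so comparing candidate mud weights needs no nested loops.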

Array Operations for Petroleum Calculations

main.py

import numpy as np

# Well log data — 500 depth points from a density-neutron log
np.random.seed(42)
n = 500
depth = np.linspace(9000, 9500, n)

# Synthetic log values
rhob = np.where((depth > 9150) & (depth < 9350),
                2.35 + np.random.normal(0, 0.03, n),
                2.55 + np.random.normal(0, 0.03, n))

rhob = np.clip(rhob, 1.8, 3.0)  # Physical bounds for bulk density

# === Porosity from Density Log ===
# Density porosity formula: φ = (ρma - ρb) / (ρma - ρf)
# where ρma = matrix density (sandstone ≈ 2.65 g/cc)
#       ρf  = fluid density (≈ 1.0 g/cc for water)
#       ρb  = measured bulk density

rho_matrix = 2.65
rho_fluid = 1.0

porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid)
porosity = np.clip(porosity, 0, 0.45)  # Porosity cannot be negative or > 45%

print(f"=== Density Porosity Calculation ===")
print(f"Depth range:     {depth[0]:.0f} – {depth[-1]:.0f} ft")
print(f"Data points:     {n}")
print(f"Matrix density:  {rho_matrix} g/cc (sandstone)")
print(f"Fluid density:   {rho_fluid} g/cc (water)")
print()
print(f"Porosity statistics:")
print(f"  Min:   {porosity.min():.3f} ({porosity.min()*100:.1f}%)")
print(f"  Max:   {porosity.max():.3f} ({porosity.max()*100:.1f}%)")
print(f"  Mean:  {porosity.mean():.3f} ({porosity.mean()*100:.1f}%)")
print(f"  Std:   {porosity.std():.3f}")

# Find the reservoir zone (porosity > 0.15)
reservoir_mask = porosity > 0.15
reservoir_depths = depth[reservoir_mask]
print(f"\nReservoir zone (φ > 15%):")
print(f"  Top:       {reservoir_depths[0]:.0f} ft")
print(f"  Bottom:    {reservoir_depths[-1]:.0f} ft")
print(f"  Thickness: {reservoir_depths[-1] - reservoir_depths[0]:.0f} ft")
print(f"  Avg φ:     {porosity[reservoir_mask].mean():.3f} ({porosity[reservoir_mask].mean()*100:.1f}%)")

Every operation above — subtraction, division, clipping, boolean masking — is applied to the entire array at once. The line porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid) computes porosity for all 500 depth points in a single expression. The boolean mask porosity > 0.15 produces an array of True/False values that can be used to select only the reservoir interval.
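
One caveat about the thickness printed above: subtracting the first flagged depth from the last gives a gross interval, which still includes any samples inside it that fall below the cutoff. A net figure can be estimated by counting only the flagged samples. The sketch below uses a small made-up log with a constant 1-ft sample spacing, and it assumes each sample represents one sample spacing of rock.

```python
import numpy as np

# Assumed toy log: 11 samples at a constant 1-ft spacing (values are made up)
depth = np.linspace(9000.0, 9010.0, 11)
porosity = np.array([0.05, 0.06, 0.18, 0.19, 0.12, 0.11,
                     0.20, 0.21, 0.17, 0.07, 0.05])

cutoff = 0.15
mask = porosity > cutoff

sample_spacing_ft = depth[1] - depth[0]

# Gross: first flagged sample to last flagged sample, tight streaks included
gross_ft = depth[mask][-1] - depth[mask][0]

# Net: only the samples that pass the cutoff
net_ft = mask.sum() * sample_spacing_ft

print(f"Gross interval: {gross_ft:.0f} ft")
print(f"Net above cutoff: {net_ft:.0f} ft")
```

Because mask is an array of True/False values, mask.sum() counts the True entries directly.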

main.py

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 10), sharey=True)

# Track 1: Bulk Density
ax1.plot(rhob, depth, color="#4682B4", linewidth=0.7)
ax1.set_xlabel("Bulk Density (g/cc)", fontsize=10)
ax1.set_ylabel("Depth (ft)", fontsize=11)
ax1.set_xlim(2.0, 2.8)
ax1.set_title("RHOB", fontsize=11, fontweight='bold')
ax1.invert_yaxis()
ax1.grid(True, alpha=0.15)

# Track 2: Density Porosity
ax2.plot(porosity * 100, depth, color="#2E8B57", linewidth=0.8)
ax2.axvline(x=15, color="#CC4444", linestyle="--", linewidth=1, alpha=0.6, label="Cutoff (15%)")
ax2.fill_betweenx(depth, 0, porosity * 100, where=(porosity > 0.15),
                  alpha=0.15, color="#FFD700")
ax2.set_xlabel("Porosity (%)", fontsize=10)
ax2.set_xlim(0, 35)
ax2.set_title("Density Porosity", fontsize=11, fontweight='bold')
ax2.legend(loc="lower right", fontsize=9)
ax2.grid(True, alpha=0.15)

# Highlight reservoir zone
for ax in [ax1, ax2]:
    ax.axhspan(9150, 9350, alpha=0.04, color="#FFD700")

fig.suptitle("Oso-Deep 003 — Porosity from Density Log", fontsize=13, fontweight='bold')
fig.tight_layout()
plt.show()

Linear Algebra — Reservoir Engineering Applications

NumPy's linear algebra capabilities are essential for reservoir engineering problems that involve systems of equations. A common example: solving for pressures in a multi-well system where each well influences its neighbors.

main.py

import numpy as np

# Simplified steady-state pressure calculation for 4 connected grid blocks.
# Each block's pressure depends on its neighbors and any wells producing from it.
#
# The system Ax = b represents the discretized flow equations:
# A = transmissibility matrix (how easily fluid flows between blocks)
# b = source/sink terms (wells producing or injecting)
# x = unknown pressures

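# Transmissibility matrix (symmetric, because flow is bidirectional).
# Each diagonal entry (3) is one larger than the two neighbor connections in its row.
# That extra unit can be read as a link to a fixed reference pressure of zero,
# which is what keeps the system non-singular.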
A = np.array([
    [ 3, -1, -1,  0],
    [-1,  3,  0, -1],
    [-1,  0,  3, -1],
    [ 0, -1, -1,  3],
])

# Source terms — negative means production (fluid leaving the system)
# Block 0: injector adding 500 bbl/d equivalent
# Block 3: producer taking 500 bbl/d equivalent
b = np.array([500, 0, 0, -500])

# Solve for pressures
pressures = np.linalg.solve(A, b)

print("Grid Block Pressures (relative units):")
for i, p in enumerate(pressures):
    well_type = "Injector" if b[i] > 0 else "Producer" if b[i] < 0 else "No well"
    print(f"  Block {i}: {p:8.1f}  ({well_type})")

# Verify the solution: A @ x should equal b
residual = np.linalg.norm(A @ pressures - b)
print(f"\nResidual (should be ~0): {residual:.2e}")

This is a preview of the discretized flow equations used in reservoir simulation. In Chapter 11, we will build a complete 1D reservoir simulator using these same principles applied to hundreds of grid blocks.
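
To give a feel for how such a matrix scales, here is a minimal sketch that assembles the equivalent system for a single row of blocks. It is not the Chapter 11 simulator. It assumes unit transmissibility between neighbors, a fixed-pressure boundary of zero at both ends, relative pressure units, and an arbitrary block count.

```python
import numpy as np

n_blocks = 5  # assumed size for illustration

# Every block connects to two neighbors with unit transmissibility (assumed).
# For the end blocks, the outer neighbor is a fixed-pressure boundary at zero
# (assumed), so every diagonal entry is 2 and the matrix is non-singular.
A = (2.0 * np.eye(n_blocks)
     - np.eye(n_blocks, k=1)
     - np.eye(n_blocks, k=-1))

# Injector in the first block, producer in the last block
b = np.zeros(n_blocks)
b[0] = 500.0
b[-1] = -500.0

pressures = np.linalg.solve(A, b)

print(pressures.round(1))
print("Symmetric:", np.allclose(A, A.T))
print("Residual: ", np.linalg.norm(A @ pressures - b))
```

Changing n_blocks is the only edit needed to grow the system, which is the main reason to build the matrix from np.eye offsets rather than typing it out.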

Pandas — Tabular Data for Real Engineering

Loading Production Data

main.py

import pandas as pd
import numpy as np

# Create a realistic multi-well production dataset
np.random.seed(123)

wells = ["OD-001", "OD-003", "OD-005", "OD-007"]
dates = pd.date_range("2025-01-01", periods=24, freq="MS")  # 24 months

# Each well has a different initial oil rate (qi), monthly decline (di),
# and initial water rate (wi)
base_rates = {
    "OD-001": (2400, 0.04, 300),
    "OD-003": (3150, 0.06, 420),
    "OD-005": (1800, 0.03, 150),
    "OD-007": (2950, 0.05, 80),
}

records = []
for well in wells:
    qi, di, wi = base_rates[well]

    for i, date in enumerate(dates):
        oil = qi * np.exp(-di * i) + np.random.normal(0, qi * 0.02)
        water = wi + 40 * i + np.random.normal(0, 30)
        gas = oil * (2.1 + np.random.normal(0, 0.1))
        fwhp = 800 - 8 * i + np.random.normal(0, 15)

        records.append({
            "well": well,
            "date": date,
            "oil_bopd": max(0, round(oil, 1)),
            "water_bwpd": max(0, round(water, 1)),
            "gas_mscfd": max(0, round(gas, 1)),
            "fwhp_psi": max(50, round(fwhp, 0)),
        })

# Introduce some realistic data quality issues
records[14]["oil_bopd"] = np.nan        # Missing value — sensor outage
records[27]["oil_bopd"] = -200          # Negative — database error
records[42]["water_bwpd"] = np.nan      # Missing
records[55]["oil_bopd"] = 15000         # Impossibly high — wrong well allocation
records[70]["fwhp_psi"] = np.nan        # Missing

df = pd.DataFrame(records)
df.to_csv("field_production_24mo.csv", index=False)

print(f"Dataset: {len(df)} records × {len(df.columns)} columns")
print(f"Wells: {df['well'].nunique()}")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"\nFirst 8 rows:")
print(df.head(8).to_string(index=False))
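
The block above writes the CSV but does not read it back. A CSV stores dates as plain text, so the reading side usually needs parse_dates to recover a datetime column. The sketch below uses an in-memory buffer as a stand-in for the file, and the three rows in it are made up for illustration.

```python
import io
import pandas as pd

# A few rows in the same column layout as field_production_24mo.csv.
# The values are made up; the empty oil_bopd field mimics a sensor outage.
csv_text = """well,date,oil_bopd,water_bwpd,gas_mscfd,fwhp_psi
OD-001,2025-01-01,2410.5,305.2,5060.1,802.0
OD-001,2025-02-01,,345.9,4875.3,790.0
OD-003,2025-01-01,3148.0,421.7,6611.4,798.0
"""

# io.StringIO stands in for a file path; pd.read_csv accepts either
loaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(loaded.dtypes)
print(loaded.to_string(index=False))
```

The empty field is read as NaN automatically, which is why the missing-value checks in the next section work on loaded files without extra handling.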

Inspecting and Understanding the Data

Before any analysis, you need to understand what you are working with. How many records? What types? Where are the gaps?

main.py

print("=== Data Quality Report ===\n")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns\n")

# Missing values
print("Missing values:")
for col in df.columns:
    n_missing = df[col].isna().sum()
    if n_missing > 0:
        print(f"  {col}: {n_missing} ({n_missing/len(df)*100:.1f}%)")

# Basic statistics
print("\nNumerical summary:")
print(df.describe().round(1).to_string())

# Check for physically impossible values
print(f"\nData quality flags:")
print(f"  Negative oil rates:     {(df['oil_bopd'] < 0).sum()}")
print(f"  Oil rate > 10,000 bopd: {(df['oil_bopd'] > 10000).sum()}")
print(f"  Negative water rates:   {(df['water_bwpd'] < 0).sum()}")
print(f"  FWHP below 50 psi:     {(df['fwhp_psi'] < 50).sum()}")

This report immediately reveals the data quality issues we introduced: missing values in three columns, one negative oil rate, and one impossibly high oil rate. In real field data, these problems are universal. The next section shows how to handle them.
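
The individual checks in the report can be gathered into one reusable helper. The function name count_out_of_range and the limits in the dictionary are hypothetical, chosen for this sketch; real limits depend on the field.

```python
import numpy as np
import pandas as pd

def count_out_of_range(df, limits):
    """Count values outside [low, high] for each column. NaN is not counted."""
    counts = {}
    for col, (low, high) in limits.items():
        out_of_range = (df[col] < low) | (df[col] > high)
        counts[col] = int(out_of_range.sum())
    return counts

# Small made-up sample with one negative rate, one spike, and two gaps
sample = pd.DataFrame({
    "oil_bopd": [2400.0, -200.0, 15000.0, np.nan],
    "water_bwpd": [300.0, 340.0, np.nan, 410.0],
})

# Assumed physical limits for illustration
limits = {"oil_bopd": (0, 10_000), "water_bwpd": (0, 20_000)}

flags = count_out_of_range(sample, limits)
print(flags)
```

Comparisons against NaN evaluate to False, so missing values are reported separately by isna() rather than being counted as out of range.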

Data Cleaning — Handling Real-World Petroleum Data

main.py

# Make a working copy — never modify the raw data
clean = df.copy()

# Step 1: Replace physically impossible values with NaN
# Oil rates cannot be negative and rarely exceed 8,000 bopd in this field
clean.loc[clean["oil_bopd"] < 0, "oil_bopd"] = np.nan
clean.loc[clean["oil_bopd"] > 8000, "oil_bopd"] = np.nan

print("After removing impossible values:")
print(f"  Total NaN in oil_bopd: {clean['oil_bopd'].isna().sum()}")

# Step 2: Interpolate missing values within each well
# Linear interpolation is appropriate for short gaps (1-2 months)
clean = clean.sort_values(["well", "date"])
for col in ["oil_bopd", "water_bwpd", "fwhp_psi"]:
    clean[col] = clean.groupby("well")[col].transform(
        lambda x: x.interpolate(method="linear", limit=2)
    )

print(f"  After interpolation: {clean['oil_bopd'].isna().sum()} remaining NaN")

# Step 3: Calculate derived columns
clean["total_liquid_bpd"] = clean["oil_bopd"] + clean["water_bwpd"]
clean["water_cut_pct"] = (clean["water_bwpd"] / clean["total_liquid_bpd"] * 100).round(1)
clean["gor_scf_bbl"] = (clean["gas_mscfd"] * 1000 / clean["oil_bopd"]).round(0)

print(f"\nCleaned dataset: {len(clean)} records")
print(f"New columns added: total_liquid_bpd, water_cut_pct, gor_scf_bbl")
print(f"\nSample of cleaned data:")
print(clean[clean["well"] == "OD-003"].head(6).to_string(index=False))
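
The limit=2 argument deserves a closer look, because it does not skip long gaps. It fills the first two missing values of any gap and leaves the rest as NaN. The sketch below shows this on a made-up series with a one-month gap and a four-month gap.

```python
import numpy as np
import pandas as pd

# Made-up monthly rates: a 1-value gap, then a 4-value gap
rates = pd.Series([1000.0, np.nan, 900.0, np.nan, np.nan, np.nan, np.nan, 650.0])

filled = rates.interpolate(method="linear", limit=2)

print(filled.tolist())
print("Remaining NaN:", int(filled.isna().sum()))
```

The short gap is filled completely, while the long gap is only partly filled. If a partially filled gap is not acceptable for a given analysis, the remaining NaN count is the signal to review those intervals by hand.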

Grouping and Aggregation — Field-Level Analysis

Individual well data becomes field-level intelligence through grouping and aggregation.

main.py

# Well-level summary — the kind of table that appears in every monthly report
well_summary = clean.groupby("well").agg(
    avg_oil=("oil_bopd", "mean"),
    latest_oil=("oil_bopd", "last"),
    peak_oil=("oil_bopd", "max"),
    avg_water_cut=("water_cut_pct", "mean"),
    latest_water_cut=("water_cut_pct", "last"),
    avg_gor=("gor_scf_bbl", "mean"),
    avg_fwhp=("fwhp_psi", "mean"),
    months=("date", "count"),
).round(0)

print("=== Well Performance Summary ===\n")
print(well_summary.to_string())

# Field totals
field_oil = clean.groupby("date")["oil_bopd"].sum()
field_water = clean.groupby("date")["water_bwpd"].sum()
field_wc = field_water / (field_oil + field_water) * 100

print(f"\n=== Field Totals ===")
print(f"Current field oil rate:  {field_oil.iloc[-1]:,.0f} bopd")
print(f"Current field water cut: {field_wc.iloc[-1]:.1f}%")
print(f"Peak field oil rate:     {field_oil.max():,.0f} bopd")
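
Rates in bopd are not volumes, so summing monthly rates does not give barrels produced. A hedged sketch of the conversion follows. It assumes each monthly rate is a calendar-day average for its month, and it uses a small made-up table rather than the chapter's dataset.

```python
import pandas as pd

# Made-up monthly average rates for two wells
monthly = pd.DataFrame({
    "well": ["OD-001"] * 3 + ["OD-003"] * 3,
    "date": list(pd.date_range("2025-01-01", periods=3, freq="MS")) * 2,
    "oil_bopd": [1000.0, 900.0, 800.0, 2000.0, 1800.0, 1600.0],
})

# Assumption: each rate is the calendar-day average for its month
monthly["days"] = monthly["date"].dt.days_in_month
monthly["oil_bbl"] = monthly["oil_bopd"] * monthly["days"]

# Running total per well
monthly["cum_oil_bbl"] = monthly.groupby("well")["oil_bbl"].cumsum()

print(monthly.to_string(index=False))
```

If the rates were producing-day averages instead, the multiplier would be producing days rather than calendar days.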

main.py

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5.5))

# Pivot to get one column per well
pivot = clean.pivot_table(index="date", columns="well", values="oil_bopd", aggfunc="sum")

colors = ["#2E8B57", "#4682B4", "#D4A847", "#CC6644"]
pivot.plot.area(ax=ax, stacked=True, color=colors, alpha=0.75, linewidth=0.5)

ax.set_xlabel("Date", fontsize=11)
ax.set_ylabel("Oil Rate (bopd)", fontsize=11)
ax.set_title("OML 58 — Field Oil Production by Well", fontsize=13, fontweight='bold')
ax.legend(title="Well", loc="upper right", fontsize=9)
ax.grid(axis='y', alpha=0.2)

fig.tight_layout()
plt.show()

Merging — Joining Well Headers with Production Data

Production data and well metadata typically live in separate tables. Merging them lets you analyze production by well type, formation, operator, or any other attribute.

main.py

# Well header table — static information about each well
headers = pd.DataFrame({
    "well": ["OD-001", "OD-003", "OD-005", "OD-007"],
    "well_type": ["Vertical", "Horizontal", "Vertical", "Horizontal"],
    "target_formation": ["E3000 Sand", "E3000 Sand", "D2000 Sand", "E3000 Sand"],
    "tvd_ft": [9800, 9650, 8400, 9900],
    "lateral_length_ft": [0, 4200, 0, 5100],
    # Completion dates precede the first production month (2025-01) for every well
    "completion_date": pd.to_datetime(["2023-06-15", "2024-03-22", "2022-11-01", "2024-08-10"]),
})

# Merge production data with well headers
merged = clean.merge(headers, on="well", how="left")

# Now we can analyze by well type
by_type = merged.groupby("well_type").agg(
    well_count=("well", "nunique"),
    avg_oil=("oil_bopd", "mean"),
    avg_water_cut=("water_cut_pct", "mean"),
    avg_gor=("gor_scf_bbl", "mean"),
).round(0)

print("Performance by Well Type:\n")
print(by_type.to_string())

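# Analyze by formation
# Note: total_oil sums monthly average rates (bopd). It is useful for ranking
# formations but is not a cumulative volume in barrels.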
by_fm = merged.groupby("target_formation").agg(
    wells=("well", "nunique"),
    total_oil=("oil_bopd", "sum"),
    avg_wc=("water_cut_pct", "mean"),
).round(0)

print(f"\nPerformance by Formation:\n")
print(by_fm.to_string())

The merge operation joined 96 production records with 4 header records, matching on the well column. This is equivalent to a VLOOKUP in Excel, but it works on millions of rows and does not break when you sort the data.
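
A left merge fails quietly when a well is missing from the header table or listed twice in it. The sketch below shows two pandas options that surface those problems: validate and indicator. The well OD-009 is a hypothetical well added only to produce an unmatched row.

```python
import pandas as pd

# Made-up production rows; OD-009 is hypothetical and has no header record
production = pd.DataFrame({
    "well": ["OD-001", "OD-001", "OD-003", "OD-009"],
    "oil_bopd": [2400.0, 2310.0, 3150.0, 500.0],
})
headers = pd.DataFrame({
    "well": ["OD-001", "OD-003"],
    "well_type": ["Vertical", "Horizontal"],
})

# validate raises pandas.errors.MergeError if a well appears twice in headers.
# indicator adds a _merge column recording where each row found a match.
merged = production.merge(headers, on="well", how="left",
                          validate="many_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "well"].unique()

print(merged.to_string(index=False))
print("Wells without a header record:", list(unmatched))
```

Checking the unmatched list right after the merge catches naming mismatches before they turn into NaN groups in later aggregations.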

Time Series — Resampling and Rolling Averages

Production data arrives at different frequencies: daily from SCADA, monthly from allocation, quarterly for regulatory reporting. Resampling converts between frequencies. Rolling averages smooth out noise to reveal underlying trends.

main.py

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Generate synthetic daily production data for one well
np.random.seed(456)
days = 365
dates_daily = pd.date_range("2025-01-01", periods=days, freq="D")

# Underlying decline + daily noise + occasional shut-ins
qi = 2800
di = 0.0018  # daily decline rate
base_rate = qi * np.exp(-di * np.arange(days))
noise = np.random.normal(0, 80, days)
daily_oil = base_rate + noise

# Simulate 5 brief shut-ins (maintenance, weather)
shutin_starts = [45, 112, 198, 267, 320]
for s in shutin_starts:
    duration = np.random.randint(1, 4)
    daily_oil[s:s+duration] = 0

daily_oil = np.maximum(daily_oil, 0)

daily_df = pd.DataFrame({
    "date": dates_daily,
    "oil_bopd": daily_oil
}).set_index("date")

# Calculate rolling average
daily_df["rolling_30d"] = daily_df["oil_bopd"].rolling(window=30, min_periods=10).mean()

# Resample to monthly averages
monthly = daily_df["oil_bopd"].resample("MS").mean()

# Plot
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(daily_df.index, daily_df["oil_bopd"], color="#CCCCCC", linewidth=0.5,
        alpha=0.7, label="Daily (raw)")
ax.plot(daily_df.index, daily_df["rolling_30d"], color="#2E8B57", linewidth=2,
        label="30-day rolling avg")
ax.scatter(monthly.index, monthly.values, color="#D4A847", zorder=5, s=40,
           edgecolors="white", linewidth=0.5, label="Monthly avg")

# Economic limit
ax.axhline(y=200, color="#CC4444", linestyle="--", linewidth=1, alpha=0.5,
           label="Economic limit (200 bopd)")

ax.set_xlabel("Date", fontsize=11)
ax.set_ylabel("Oil Rate (bopd)", fontsize=11)
ax.set_title("OD-003 — Daily Production with Rolling Average", fontsize=13, fontweight='bold')
ax.legend(loc="upper right", fontsize=9)
ax.set_ylim(0, 3200)
ax.grid(True, alpha=0.15)

fig.tight_layout()
plt.show()

print(f"Daily records:       {len(daily_df)}")
print(f"Monthly averages:    {len(monthly)}")
print(f"Initial rate:        {daily_df['oil_bopd'].iloc[:7].mean():,.0f} bopd (first week avg)")
print(f"Final rate:          {daily_df['oil_bopd'].iloc[-7:].mean():,.0f} bopd (last week avg)")
print(f"Annual decline:      {(1 - daily_df['oil_bopd'].iloc[-7:].mean() / daily_df['oil_bopd'].iloc[:7].mean()) * 100:.1f}%")
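
Comparing the first and last week is sensitive to noise in those two windows. If the decline is assumed to be exponential, an alternative is a straight-line fit to the logarithm of rate, which uses every producing day. The sketch below demonstrates the idea on noise-free synthetic rates built from the same qi and di as the example above, so the fit should recover them.

```python
import numpy as np

# Noise-free synthetic rates (same qi and di as the daily example)
qi_true, di_true = 2800.0, 0.0018
t_days = np.arange(365)
rate = qi_true * np.exp(-di_true * t_days)

# ln(q) = ln(qi) - di * t is a straight line, so a degree-1 fit recovers both.
# Shut-in days (rate of zero) must be excluded before taking the log.
producing = rate > 0
slope, intercept = np.polyfit(t_days[producing], np.log(rate[producing]), 1)

di_fit = -slope
qi_fit = np.exp(intercept)
annual_decline_pct = (1 - np.exp(-di_fit * 365)) * 100

print(f"Fitted qi:       {qi_fit:,.0f} bopd")
print(f"Fitted di:       {di_fit:.4f} per day")
print(f"Annual decline:  {annual_decline_pct:.1f}%")
```

On noisy field data the fitted values will not match any true parameters exactly, and a hyperbolic or harmonic decline would need a different model.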

Building a Monthly Production Report

This is the kind of deliverable that a production engineer creates every month. Pandas makes it a repeatable, auditable process.

main.py

# Latest month's data
latest_month = clean["date"].max()
latest = clean[clean["date"] == latest_month].copy()

# Previous month for comparison
prev_month = latest_month - pd.DateOffset(months=1)
previous = clean[clean["date"] == prev_month].copy()

# Build report
report = latest[["well", "oil_bopd", "water_bwpd", "gas_mscfd", "water_cut_pct", "fwhp_psi"]].copy()
report = report.merge(
    previous[["well", "oil_bopd"]].rename(columns={"oil_bopd": "prev_oil"}),
    on="well", how="left"
)
report["change_pct"] = ((report["oil_bopd"] - report["prev_oil"]) / report["prev_oil"] * 100).round(1)

# Add field total row
field_total = pd.DataFrame([{
    "well": "FIELD TOTAL",
    "oil_bopd": report["oil_bopd"].sum(),
    "water_bwpd": report["water_bwpd"].sum(),
    "gas_mscfd": report["gas_mscfd"].sum(),
    "water_cut_pct": (report["water_bwpd"].sum() /
                      (report["oil_bopd"].sum() + report["water_bwpd"].sum()) * 100),
    "fwhp_psi": report["fwhp_psi"].mean(),
    "prev_oil": report["prev_oil"].sum(),
    "change_pct": ((report["oil_bopd"].sum() - report["prev_oil"].sum()) /
                    report["prev_oil"].sum() * 100),
}])

report = pd.concat([report, field_total], ignore_index=True)

print(f"=== MONTHLY PRODUCTION REPORT — {latest_month.strftime('%B %Y')} ===\n")
print(report.round(1).to_string(index=False))
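
To make the month-over-month step repeatable for any reporting month, the comparison can be wrapped in a function. The name month_over_month is hypothetical, and the sketch covers only the oil rate comparison, not the full report. The sample data is made up.

```python
import pandas as pd

def month_over_month(df, month):
    """Per-well oil rate for `month` with percent change versus the prior month."""
    prev_month = month - pd.DateOffset(months=1)
    current = df.loc[df["date"] == month, ["well", "oil_bopd"]]
    previous = (df.loc[df["date"] == prev_month, ["well", "oil_bopd"]]
                  .rename(columns={"oil_bopd": "prev_oil"}))
    # Left merge keeps wells that have no prior-month record (prev_oil is NaN)
    out = current.merge(previous, on="well", how="left")
    out["change_pct"] = ((out["oil_bopd"] - out["prev_oil"])
                         / out["prev_oil"] * 100).round(1)
    return out

# Made-up two-month sample for two wells
data = pd.DataFrame({
    "well": ["OD-001", "OD-001", "OD-003", "OD-003"],
    "date": pd.to_datetime(["2026-11-01", "2026-12-01", "2026-11-01", "2026-12-01"]),
    "oil_bopd": [1000.0, 950.0, 800.0, 760.0],
})

mom = month_over_month(data, pd.Timestamp("2026-12-01"))
print(mom.to_string(index=False))
```

Passing the month as an argument means the same function can regenerate any historical report without edits to the code.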

Summary

This chapter covered the two libraries that form the backbone of petroleum data science:

NumPy arrays enable vectorized arithmetic — performing calculations on thousands of values without writing loops. Density porosity, hydrostatic pressure profiles, and linear algebra for reservoir systems all benefit from array operations.
Pandas DataFrames handle the labeled, mixed-type tabular data that petroleum engineers actually work with: production records, well headers, and surveillance metrics.
Data cleaning is not optional in petroleum data. Missing values, negative rates, impossible readings, and unit inconsistencies are the norm, not the exception. Systematic cleaning with documented steps is an engineering discipline.
Merging joins data from different sources — production tables with well headers, log data with formation tops — enabling analysis by well type, formation, operator, or any other attribute.
Time series operations — rolling averages and resampling — separate measurement noise from engineering signal, enabling meaningful trend analysis and forecasting.
The monthly production report — a standard industry deliverable — becomes a repeatable, auditable Pandas pipeline rather than a manual spreadsheet exercise.

In the next chapter, we focus entirely on visualization: the standard plots and chart types that petroleum engineers use to communicate data, identify problems, and support decisions.

Exercises


Exercise 4.1 (Practice)

Vectorized PVT Calculations

Using NumPy, implement the Standing correlation for bubble point pressure: $P_b = 18.2 \left[ \left( \frac{R_s}{\gamma_g} \right)^{0.83} \times 10^{(0.00091 \times T - 0.0125 \times \mathrm{API})} - 1.4 \right]$ ...
