Part I: Python Fundamentals
Chapter 4
NumPy and Pandas — The Engineer's Power Tools
Up to this point, every calculation we have written processes one value at a time, or iterates through a list element by element. That works for a single well, a single depth, a single month. It does not scale. A producing field generates thousands of data points per day across dozens of wells. A reservoir simulation grid can contain millions of cells. A well log records measurements every six inches over thousands of feet of section.
NumPy and Pandas are the two libraries that make large-scale petroleum data analysis practical. NumPy provides fast array operations — performing the same calculation on thousands of values simultaneously, without a loop. Pandas provides the DataFrame — a tabular data structure that handles the messy, labeled, mixed-type data that petroleum engineers actually work with.
This chapter teaches both through real petroleum data tasks: loading production data, cleaning it, computing engineering quantities across entire fields, merging datasets from different sources, and producing the summary tables and visualizations that appear in engineering reports.
What You Will Learn
- NumPy arrays — vectorized arithmetic, performance, and why loops disappear
- Pandas Series and DataFrames — loading, indexing, filtering, and transforming tabular data
- Data cleaning — handling missing values, outliers, unit inconsistencies, and physically impossible values
- Merging and aggregating — joining well headers with production data, computing field-level summaries
- Time series — resampling, rolling averages, and trend analysis for production surveillance
NumPy — Fast Arithmetic on Arrays
Why Arrays Matter
Consider a routine task: calculating hydrostatic pressure at 100 different depths for a given mud weight. With a Python list, you wrap the formula in a loop that evaluates it once per depth. With a NumPy array, you write the formula once and it applies to every depth.
The NumPy expression 0.052 * mud_weight_ppg * depths_array applies the formula to every element of the array, whether it holds 100 depths or 100,000. There is no explicit Python loop. The operation is vectorized: the iteration runs in optimized C code underneath, which is why it is dramatically faster. More importantly, the code is shorter and easier to read: one line that looks like the equation instead of four lines of loop mechanics.
At 100,000 points, the speedup is typically 50–200x. At 10 million points — common in seismic data and reservoir simulation grids — the difference between "runs in a second" and "runs for five minutes" is the difference between NumPy and a Python loop.
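The comparison described above can be sketched as follows. This is a minimal illustration, not a benchmark: the mud weight of 10.5 ppg and the 0 to 15,000 ft depth range are assumed values, and the names pressures_loop and pressures_psi are chosen here for illustration.

```python
import numpy as np

mud_weight_ppg = 10.5                                  # assumed mud weight, lb/gal
depths_array = np.linspace(0.0, 15_000.0, 100_000)     # assumed depths, ft TVD

# Loop version: one Python-level multiplication per depth.
pressures_loop = []
for depth in depths_array:
    pressures_loop.append(0.052 * mud_weight_ppg * depth)

# Vectorized version: the same formula applied to the whole array at once.
pressures_psi = 0.052 * mud_weight_ppg * depths_array

# Both approaches give the same numbers; only the speed and readability differ.
print(np.allclose(pressures_loop, pressures_psi))
```

To measure the speedup on your own machine, wrap each version in time.perf_counter() calls or use the timeit module; the ratio depends on hardware and array size.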
Array Operations for Petroleum Calculations
Every operation above — subtraction, division, clipping, boolean masking — is applied to the entire array at once. The line porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid) computes porosity for all 500 depth points in a single expression. The boolean mask porosity > 0.15 produces an array of True/False values that can be used to select only the reservoir interval.
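A minimal sketch of the density-porosity workflow described here is shown below. The bulk density log is synthetic (random values, not field data), and the matrix density, fluid density, clipping limits, and 0.5 ft sample spacing are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# 500 depth points at 0.5 ft spacing (assumed interval).
depth_ft = np.linspace(8_000.0, 8_249.5, 500)

# Synthetic bulk density log in g/cc, for illustration only.
rhob = rng.uniform(2.20, 2.70, size=500)

rho_matrix = 2.65   # assumed sandstone matrix density, g/cc
rho_fluid = 1.00    # assumed fluid density, g/cc

# Density porosity for every depth point in one expression.
porosity = (rho_matrix - rhob) / (rho_matrix - rho_fluid)

# Clip to an assumed physical range; rhob above rho_matrix would
# otherwise give negative porosity.
porosity = np.clip(porosity, 0.0, 0.45)

# Boolean mask: True where porosity exceeds the 15% cutoff.
reservoir_mask = porosity > 0.15

# Use the mask to select only the reservoir depths.
reservoir_depths = depth_ft[reservoir_mask]
net_thickness_ft = reservoir_mask.sum() * 0.5

print(f"Net reservoir thickness: {net_thickness_ft:.1f} ft")
```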
Linear Algebra — Reservoir Engineering Applications
NumPy's linear algebra capabilities are essential for reservoir engineering problems that involve systems of equations. A common example: solving for pressures in a multi-well system where each well influences its neighbors.
This is a preview of the discretized flow equations used in reservoir simulation. In Chapter 11, we will build a complete 1D reservoir simulator using these same principles applied to hundreds of grid blocks.
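As an illustration of the idea, the sketch below solves a small tridiagonal system with np.linalg.solve. It is a simplified stand-in, not the simulator from Chapter 11: it assumes steady-state, single-phase flow through five identical grid blocks between two fixed boundary pressures, so that each block's pressure is the average of its neighbors. The boundary pressures of 3,000 and 1,000 psi are hypothetical.

```python
import numpy as np

n = 5                                 # number of unknown block pressures
p_left, p_right = 3000.0, 1000.0      # assumed boundary pressures, psi

# Each row encodes p[i-1] - 2*p[i] + p[i+1] = 0:
# -2 on the main diagonal, +1 on the diagonals above and below.
A = (np.diag(-2.0 * np.ones(n))
     + np.diag(np.ones(n - 1), k=1)
     + np.diag(np.ones(n - 1), k=-1))

# Known boundary pressures move to the right-hand side.
b = np.zeros(n)
b[0] = -p_left
b[-1] = -p_right

# Solve A @ pressures = b for all block pressures at once.
pressures = np.linalg.solve(A, b)

print(np.round(pressures, 1))
```

Under these assumptions the solution is a straight-line pressure profile between the two boundaries, which makes the result easy to check by hand.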
Pandas — Tabular Data for Real Engineering
Loading Production Data
Inspecting and Understanding the Data
Before any analysis, you need to understand what you are working with. How many records? What types? Where are the gaps?
This report immediately reveals the data quality issues we introduced: missing values in three columns, one negative oil rate, and one impossibly high oil rate. In real field data, these problems are universal. The next section shows how to handle them.
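The kind of inspection described above can be sketched as follows. The dataset is synthetic and built inline so the example runs on its own: the well names, column names, injected errors, and the 10,000 bbl/d plausibility threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
wells = ["W-1", "W-2", "W-3", "W-4"]                      # hypothetical wells
dates = pd.date_range("2023-01-01", periods=24, freq="MS")

# One row per well per month: 4 wells x 24 months = 96 records.
df = pd.MultiIndex.from_product(
    [wells, dates], names=["well", "date"]
).to_frame(index=False)
df["oil_bpd"] = rng.uniform(200, 800, len(df)).round(1)
df["water_bpd"] = rng.uniform(50, 400, len(df)).round(1)
df["gas_mcfd"] = rng.uniform(100, 900, len(df)).round(1)

# Deliberately introduce data quality problems.
df.loc[5, "oil_bpd"] = np.nan
df.loc[17, "water_bpd"] = np.nan
df.loc[40, "gas_mcfd"] = np.nan
df.loc[30, "oil_bpd"] = -150.0        # negative rate
df.loc[60, "oil_bpd"] = 250_000.0     # impossibly high rate

# How many records? What types? Where are the gaps?
print(df.shape)
print(df.dtypes)
missing = df.isna().sum()
print(missing)
print(df["oil_bpd"].describe())

# Count physically suspect oil rates (threshold is an assumption).
n_negative = (df["oil_bpd"] < 0).sum()
n_too_high = (df["oil_bpd"] > 10_000).sum()
print(n_negative, n_too_high)
```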
Data Cleaning — Handling Real-World Petroleum Data
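One possible cleaning sequence for the problems identified in the previous section is sketched below. It is an assumption-laden example rather than a prescribed procedure: the six-row dataset is made up, the 10,000 bbl/d upper limit is hypothetical, and linear interpolation within each well is only one of several defensible ways to fill gaps.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with a gap, a negative rate, and an impossible rate.
raw = pd.DataFrame({
    "well": ["W-1"] * 6,
    "date": pd.date_range("2023-01-01", periods=6, freq="MS"),
    "oil_bpd": [520.0, np.nan, 495.0, -150.0, 250_000.0, 470.0],
})

MAX_OIL_BPD = 10_000.0   # assumed upper limit for a plausible oil rate

clean = raw.copy()       # never overwrite the raw data

# Step 1: identify physically impossible values.
impossible = (clean["oil_bpd"] < 0) | (clean["oil_bpd"] > MAX_OIL_BPD)

# Step 2: keep an audit flag for every value that will be altered.
clean["oil_flagged"] = impossible | clean["oil_bpd"].isna()

# Step 3: treat impossible values as missing.
clean.loc[impossible, "oil_bpd"] = np.nan

# Step 4: fill gaps by linear interpolation within each well.
clean["oil_bpd"] = clean.groupby("well")["oil_bpd"].transform(
    lambda s: s.interpolate(limit_direction="both")
)

print(clean)
```

Keeping the oil_flagged column means a reviewer can always see which values were measured and which were filled in.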
Grouping and Aggregation — Field-Level Analysis
Individual well data becomes field-level intelligence through grouping and aggregation.
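A short sketch of grouping and aggregation, using a synthetic four-well dataset built inline. The column names and the choice of summary statistics are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
wells = ["W-1", "W-2", "W-3", "W-4"]                      # hypothetical wells
dates = pd.date_range("2023-01-01", periods=24, freq="MS")

prod = pd.MultiIndex.from_product(
    [wells, dates], names=["well", "date"]
).to_frame(index=False)
prod["oil_bpd"] = rng.uniform(200, 800, len(prod))
prod["water_bpd"] = rng.uniform(50, 400, len(prod))

# Per-well summary: one row per well, one column per statistic.
per_well = prod.groupby("well").agg(
    avg_oil_bpd=("oil_bpd", "mean"),
    peak_oil_bpd=("oil_bpd", "max"),
    avg_water_bpd=("water_bpd", "mean"),
    months_on=("date", "count"),
)
per_well["avg_water_cut"] = per_well["avg_water_bpd"] / (
    per_well["avg_oil_bpd"] + per_well["avg_water_bpd"]
)

# Field-level rates: sum across wells for each month.
field_monthly = prod.groupby("date")[["oil_bpd", "water_bpd"]].sum()

print(per_well.round(2))
print(field_monthly.head())
```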
Merging — Joining Well Headers with Production Data
Production data and well metadata typically live in separate tables. Merging them lets you analyze production by well type, formation, operator, or any other attribute.
The merge operation joined 96 production records with 4 header records, matching on the well column. This is equivalent to a VLOOKUP in Excel, but it works on millions of rows and does not break when you sort the data.
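The merge described above can be sketched like this. Both tables are synthetic; the well names, well types, and formation labels are hypothetical. The validate argument is an optional safeguard that raises an error if the header table contains duplicate well names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
wells = ["W-1", "W-2", "W-3", "W-4"]                      # hypothetical wells
dates = pd.date_range("2023-01-01", periods=24, freq="MS")

# Production table: 96 records (4 wells x 24 months).
production = pd.MultiIndex.from_product(
    [wells, dates], names=["well", "date"]
).to_frame(index=False)
production["oil_bpd"] = rng.uniform(200, 800, len(production))

# Header table: one record per well.
headers = pd.DataFrame({
    "well": wells,
    "well_type": ["vertical", "horizontal", "horizontal", "vertical"],
    "formation": ["Formation A", "Formation A", "Formation B", "Formation B"],
})

# Left join on the shared "well" column; every production row keeps
# its place and gains the matching header attributes.
merged = production.merge(
    headers, on="well", how="left", validate="many_to_one"
)

# With the attributes attached, production can be analyzed by well type.
by_type = merged.groupby("well_type")["oil_bpd"].mean()

print(merged.shape)
print(by_type.round(1))
```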
Time Series — Resampling and Rolling Averages
Production data arrives at different frequencies: daily from SCADA, monthly from allocation, quarterly for regulatory reporting. Resampling converts between frequencies. Rolling averages smooth out noise to reveal underlying trends.
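A minimal sketch of both operations on a synthetic daily oil-rate series. The exponential decline, the noise level, and the 30-day window are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
days = pd.date_range("2023-01-01", "2023-12-31", freq="D")
t = np.arange(len(days))

# Synthetic daily oil rate: smooth decline plus measurement noise.
daily_oil = pd.Series(
    600.0 * np.exp(-0.001 * t) + rng.normal(0.0, 25.0, len(days)),
    index=days,
    name="oil_bpd",
)

# Resample daily rates to a monthly average rate.
monthly_avg = daily_oil.resample("MS").mean()

# Resample to quarterly volumes: summing daily rates (bbl/d) over days
# gives barrels produced in each quarter.
quarterly_vol = daily_oil.resample("QS").sum()

# 30-day rolling average to smooth the noise; the first 29 values are
# NaN because a full window is not yet available.
rolling_30d = daily_oil.rolling(window=30, min_periods=30).mean()

print(monthly_avg.round(1))
print(quarterly_vol.round(0))
```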
Building a Monthly Production Report
This is the kind of deliverable that a production engineer creates every month. Pandas makes it a repeatable, auditable process.
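One way to make such a report repeatable is to wrap the steps in a function that takes the production table and a month. The function below is a hypothetical sketch: the name monthly_report, the column names, and the report layout are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def monthly_report(prod: pd.DataFrame, month: str) -> pd.DataFrame:
    """Per-well average rates for one month, plus a field total row."""
    in_month = prod["date"].dt.to_period("M") == pd.Period(month, freq="M")
    report = (
        prod.loc[in_month]
        .groupby("well")[["oil_bpd", "water_bpd"]]
        .mean()
    )
    # Field total is the sum of the per-well rates.
    report.loc["FIELD TOTAL"] = report.sum()
    # Water cut is computed after the total row so it applies to the field too.
    report["water_cut"] = report["water_bpd"] / (
        report["oil_bpd"] + report["water_bpd"]
    )
    return report


# Synthetic monthly data for four hypothetical wells.
rng = np.random.default_rng(4)
wells = ["W-1", "W-2", "W-3", "W-4"]
dates = pd.date_range("2023-01-01", periods=24, freq="MS")
prod = pd.MultiIndex.from_product(
    [wells, dates], names=["well", "date"]
).to_frame(index=False)
prod["oil_bpd"] = rng.uniform(200, 800, len(prod))
prod["water_bpd"] = rng.uniform(50, 400, len(prod))

report = monthly_report(prod, "2023-06")
print(report.round(3))
```

Calling the same function with a different month string regenerates the table from the same code path, which is what makes the process auditable.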
Summary
This chapter covered the two libraries that form the backbone of petroleum data science:
- NumPy arrays enable vectorized arithmetic — performing calculations on thousands of values without writing loops. Density porosity, hydrostatic pressure profiles, and linear algebra for reservoir systems all benefit from array operations.
- Pandas DataFrames handle the labeled, mixed-type tabular data that petroleum engineers actually work with: production records, well headers, and surveillance metrics.
- Data cleaning is not optional in petroleum data. Missing values, negative rates, impossible readings, and unit inconsistencies are the norm, not the exception. Systematic cleaning with documented steps is an engineering discipline.
- Merging joins data from different sources — production tables with well headers, log data with formation tops — enabling analysis by well type, formation, operator, or any other attribute.
- Time series operations — rolling averages and resampling — separate measurement noise from engineering signal, enabling meaningful trend analysis and forecasting.
- The monthly production report — a standard industry deliverable — becomes a repeatable, auditable Pandas pipeline rather than a manual spreadsheet exercise.
In the next chapter, we focus entirely on visualization: the standard plots and chart types that petroleum engineers use to communicate data, identify problems, and support decisions.
Exercises
Vectorized PVT Calculations
Using NumPy, implement the Standing correlation for bubble point pressure: $P_b = 18.2 \left[ \left( \frac{R_s}{\gamma_g} \right)^{0.83} \times 10^{(0.00091 \times T - 0.0125 \times \mathrm{API})} - 1.4 \right]$ ...
Production Data Loader
Write a function load_production(filepath) that reads a CSV file, automatically detects date columns, converts them to datetime, handles missing value...
Decline Rate Calculator
For each well in the production dataset, calculate the monthly decline rate using: $D_i = \frac{q_i - q_{i+1}}{q_i \times \Delta t}$ ...
Data Cleaning Pipeline
The raw production dataset contains intentional errors (negative rates, impossibly high values, missing data). Write a complete cleaning pipeline that...
Multi-Well Comparison Dashboard
Create a 2×2 subplot figure for the field production dataset showing: (a) oil rate over time for all wells, (b) water cut over time for all wells, (c)...
Cumulative Production and EUR Estimation
For each well, calculate cumulative oil production using cumsum(). Plot cumulative oil vs. time. Estimate a simple EUR (Estimated Ultimate Recovery) b...
Allocation Reconciliation
In many fields, total production is measured at a central facility (fiscal metering), and individual well production is estimated through allocation. ...
Well Ranking System
Create a well ranking system that scores each well on multiple criteria: oil rate (higher is better), water cut (lower is better), GOR trend (stable i...
Pressure Survey Analysis
A pressure survey measures static bottomhole pressure (SBHP) at multiple times during a well's life. These measurements tell the reservoir engineer wh...
Field Summary Dashboard
Build a complete field summary that a production manager could present in a monthly review meeting. It should include: A summary table with one row pe...