PETRODATA REPOSITORY

Open Petroleum Datasets · Parquet Format

Download Parquet Files

daily_production.parquet ⬇ monthly_production.parquet ⬇ wells.parquet ⬇

High-performance columnar storage · Ready for analysis · Compressed & optimized

About This Dataset

This repository provides the Equinor Volve field data in Parquet format, converted from the original dataset. The Volve release was Norway's first fully disclosed oil field dataset and offers real-world data for data engineering and analysis workflows.

The data has been structured into three normalized tables: daily metrics, monthly aggregations, and well metadata. Perfect for learning SQL, data engineering, or building analytical pipelines.

Quick Start with DuckDB

Query the Parquet files directly, without loading them into a database. Here's a simple example using DuckDB:

import duckdb

# Connect to DuckDB (in-memory)
con = duckdb.connect()

# Query daily production data
result = con.execute("""
    SELECT
        w.wellbore_name,
        SUM(d.oil_volume) as total_oil,
        SUM(d.gas_volume) as total_gas,
        SUM(d.water_volume) as total_water
    FROM 'volve/daily_production.parquet' d
    JOIN 'volve/wells.parquet' w
        ON d.npd_wellbore_code = w.npd_wellbore_code
    WHERE d.date BETWEEN '2008-01-01' AND '2008-12-31'
    GROUP BY w.wellbore_name
    ORDER BY total_oil DESC
    LIMIT 10
""").fetchall()

for row in result:
    print(row)

No database setup is required. DuckDB reads the Parquet files in place and scans only the columns a query references, which suits exploratory analysis and prototyping.

Database Schema

The dataset consists of three interconnected tables tracking well production metrics at different time granularities.

wells

Well metadata and facility information

npd_wellbore_code

wellbore_code

wellbore_name

npd_field_code

npd_field_name

npd_facility_code

npd_facility_name

daily_production

Daily well production metrics and parameters

date

npd_wellbore_code

on_stream_hours

avg_downhole_pressure

avg_dp_tubing

avg_annulus_pressure

avg_wellhead_pressure

avg_downhole_temperature

avg_wellhead_temperature

avg_choke_size_percent

avg_choke_unit

dp_choke_size

oil_volume

gas_volume

water_volume

water_injection_volume

flow_kind

well_type

→ npd_wellbore_code references wells.npd_wellbore_code

monthly_production

Aggregated monthly production volumes

date

npd_wellbore_code

on_stream_hours

oil_volume_sm3

gas_volume_sm3

water_volume_sm3

gas_injection_sm3

water_injection_sm3

→ npd_wellbore_code references wells.npd_wellbore_code
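Both production tables reference wells.npd_wellbore_code, so it can be worth confirming that every production row resolves to a well before relying on inner joins. The sketch below is one possible check, not part of the published tooling: `orphan_check_sql` and `find_orphans` are hypothetical helper names, and the `volve/` paths are assumed to match the Quick Start example.

```python
def orphan_check_sql(child: str, parent: str, key: str = "npd_wellbore_code") -> str:
    """Build a query that counts child rows whose key is missing from parent.

    `child` and `parent` are SQL table expressions, for example a table name
    or a quoted Parquet path such as "'volve/wells.parquet'".
    """
    return f"""
        SELECT c.{key}, COUNT(*) AS orphan_rows
        FROM {child} AS c
        LEFT JOIN {parent} AS p ON c.{key} = p.{key}
        WHERE p.{key} IS NULL      -- no matching parent row
        GROUP BY c.{key}
        ORDER BY orphan_rows DESC
    """


def find_orphans(child_path: str, parent_path: str = "volve/wells.parquet"):
    """Run the check with DuckDB against Parquet files (paths are assumptions)."""
    import duckdb  # imported here so the SQL builder works without DuckDB installed

    sql = orphan_check_sql(f"'{child_path}'", f"'{parent_path}'")
    return duckdb.sql(sql).fetchall()
```

Calling `find_orphans('volve/daily_production.parquet')` should return an empty list when every daily row has a matching well. Rows with a NULL key are reported too, since NULL never satisfies the join condition.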

Source & License

Data from the Equinor Volve Data Village — Norway's first fully disclosed oil field dataset.

Licensed under CC BY-NC-SA 4.0 · Equinor official terms

Download Well Log Files

108 well log files from the FORCE 2020 Machine Learning Competition. The files below are grouped by quadrant.

Quadrant 7

2 wells ▼

7-1-1.parquet ⬇ 7-1-2_S.parquet ⬇

Quadrant 15

4 wells ▼

15-9-13.parquet ⬇ 15-9-14.parquet ⬇ 15-9-15.parquet ⬇ 15-9-17.parquet ⬇

Quadrant 16

15 wells ▼

16-1-2.parquet ⬇ 16-1-6_A.parquet ⬇ 16-2-6.parquet ⬇ 16-2-11_A.parquet ⬇ 16-2-16.parquet ⬇ 16-4-1.parquet ⬇ 16-5-3.parquet ⬇ 16-7-4.parquet ⬇ 16-7-5.parquet ⬇ 16-8-1.parquet ⬇ 16-10-1.parquet ⬇ 16-10-2.parquet ⬇ 16-10-3.parquet ⬇ 16-10-5.parquet ⬇ 16-11-1_ST3.parquet ⬇

Quadrant 17

1 well ▼

17-11-1.parquet ⬇

Quadrant 25

20 wells ▼

25-2-7.parquet ⬇ 25-2-13_T4.parquet ⬇ 25-2-14.parquet ⬇ 25-3-1.parquet ⬇ 25-4-5.parquet ⬇ 25-5-1.parquet ⬇ 25-5-3.parquet ⬇ 25-5-4.parquet ⬇ 25-6-1.parquet ⬇ 25-6-2.parquet ⬇ 25-6-3.parquet ⬇ 25-7-2.parquet ⬇ 25-8-5_S.parquet ⬇ 25-8-7.parquet ⬇ 25-9-1.parquet ⬇ 25-10-10.parquet ⬇ 25-11-5.parquet ⬇ 25-11-15.parquet ⬇ 25-11-19_S.parquet ⬇ 25-11-24.parquet ⬇

Quadrant 26

1 well ▼

26-4-1.parquet ⬇

Quadrant 29

2 wells ▼

29-3-1.parquet ⬇ 29-6-1.parquet ⬇

Quadrant 30

3 wells ▼

30-3-3.parquet ⬇ 30-3-5_S.parquet ⬇ 30-6-5.parquet ⬇

Quadrant 31

14 wells ▼

31-2-1.parquet ⬇ 31-2-7.parquet ⬇ 31-2-8.parquet ⬇ 31-2-9.parquet ⬇ 31-2-19_S.parquet ⬇ 31-3-1.parquet ⬇ 31-3-2.parquet ⬇ 31-3-3.parquet ⬇ 31-3-4.parquet ⬇ 31-4-5.parquet ⬇ 31-4-10.parquet ⬇ 31-5-4_S.parquet ⬇ 31-6-5.parquet ⬇ 31-6-8.parquet ⬇

Quadrant 32

1 well ▼

32-2-1.parquet ⬇

Quadrant 33

4 wells ▼

33-5-2.parquet ⬇ 33-6-3_S.parquet ⬇ 33-9-1.parquet ⬇ 33-9-17.parquet ⬇

Quadrant 34

21 wells ▼

34-2-4.parquet ⬇ 34-3-1_A.parquet ⬇ 34-3-3_A.parquet ⬇ 34-4-10_R.parquet ⬇ 34-5-1_A.parquet ⬇ 34-5-1_S.parquet ⬇ 34-6-1_S.parquet ⬇ 34-7-13.parquet ⬇ 34-7-20.parquet ⬇ 34-7-21.parquet ⬇ 34-8-1.parquet ⬇ 34-8-3.parquet ⬇ 34-8-7_R.parquet ⬇ 34-10-16_R.parquet ⬇ 34-10-19.parquet ⬇ 34-10-21.parquet ⬇ 34-10-33.parquet ⬇ 34-10-35.parquet ⬇ 34-11-1.parquet ⬇ 34-11-2_S.parquet ⬇ 34-12-1.parquet ⬇

Quadrant 35

19 wells ▼

35-3-7_S.parquet ⬇ 35-4-1.parquet ⬇ 35-6-2_S.parquet ⬇ 35-8-4.parquet ⬇ 35-8-6_S.parquet ⬇ 35-9-2.parquet ⬇ 35-9-5.parquet ⬇ 35-9-6_S.parquet ⬇ 35-9-8.parquet ⬇ 35-9-10_S.parquet ⬇ 35-11-1.parquet ⬇ 35-11-6.parquet ⬇ 35-11-7.parquet ⬇ 35-11-10.parquet ⬇ 35-11-11.parquet ⬇ 35-11-12.parquet ⬇ 35-11-13.parquet ⬇ 35-11-15_S.parquet ⬇ 35-12-1.parquet ⬇

Quadrant 36

1 well ▼

36-7-3.parquet ⬇
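The file names above appear to follow the NPD well name with `/` replaced by `-` and spaces replaced by `_` (for example `16/1-6 A` becoming `16-1-6_A.parquet`). That mapping is inferred from the listing rather than documented by the source, so the helpers below are a hypothetical sketch for locating a file from a well name and grouping files by quadrant.

```python
from collections import defaultdict


def well_file(well_name: str) -> str:
    """Map a well name such as '16/1-6 A' to '16-1-6_A.parquet' (assumed convention)."""
    return well_name.strip().replace("/", "-").replace(" ", "_") + ".parquet"


def quadrant(filename: str) -> int:
    """Return the quadrant number, i.e. the digits before the first hyphen."""
    return int(filename.split("-", 1)[0])


def by_quadrant(filenames):
    """Group file names by quadrant, preserving the input order within each group."""
    groups = defaultdict(list)
    for name in filenames:
        groups[quadrant(name)].append(name)
    return dict(groups)


print(well_file("25/2-13 T4"))
print(by_quadrant(["7-1-1.parquet", "15-9-13.parquet", "7-1-2_S.parquet"]))
```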

About FORCE 2020 Dataset

The FORCE 2020 Machine Learning Competition dataset contains well log data from 108 wells on the Norwegian Continental Shelf. Originally released for a lithofacies prediction challenge, it provides comprehensive petrophysical measurements suited to machine learning and data science applications.

Each well file contains depth-indexed measurements including gamma ray, resistivity, density, neutron porosity, and sonic logs, along with lithofacies classifications.

Quick Start with DuckDB

Query the well log files directly. Here's an example analyzing a single well:

import duckdb

# Connect to DuckDB (in-memory)
con = duckdb.connect()

# Analyze well 15-9-13
result = con.execute("""
    SELECT
        WELL,
        MIN(DEPTH_MD) as min_depth,
        MAX(DEPTH_MD) as max_depth,
        AVG(GR) as avg_gamma_ray,
        AVG(RHOB) as avg_density,
        COUNT(*) as samples
    FROM 'force_2020/wells/15-9-13.parquet'
    GROUP BY WELL
""").fetchall()

for row in result:
    print(row)

# Query multiple wells at once
multi_well = con.execute("""
    SELECT WELL, FORMATION, COUNT(*) as samples
    FROM 'force_2020/wells/*.parquet'
    WHERE FORMATION IS NOT NULL
    GROUP BY WELL, FORMATION
    ORDER BY WELL, samples DESC
""").fetchall()

# Print the per-well formation counts
for row in multi_well:
    print(row)

Well Log Schema

Each well file contains 29 columns with petrophysical measurements and metadata.

Well Log Columns

WELL

DEPTH_MD

X_LOC

Y_LOC

Z_LOC

GROUP

FORMATION

dataset

CALI

RSHA

RMED

RDEP

RHOB

SGR

NPHI

PEF

DTC

ROP

DTS

DCAL

DRHO

MUDWEIGHT

RMIC

ROPA

RXO

FORCE_2020_LITHOFACIES_LITHOLOGY

Log curves: GR (Gamma Ray), RHOB (Bulk Density), NPHI (Neutron Porosity), DTC/DTS (Sonic), RDEP/RMED/RSHA (Resistivity), CALI (Caliper), PEF (Photoelectric Factor)
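One detail of this schema matters when writing SQL: `GROUP` is a reserved SQL keyword, so it has to be double-quoted when used as a column name. The sketch below quotes every column name defensively. `quote_ident` and `select_columns_sql` are hypothetical helpers, and the file path is assumed to match the Quick Start example.

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def select_columns_sql(path: str, columns: list[str]) -> str:
    """Build a SELECT over a Parquet file with every column name quoted."""
    cols = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {cols} FROM '{path}'"


sql = select_columns_sql(
    "force_2020/wells/15-9-13.parquet",
    ["WELL", "DEPTH_MD", "GROUP", "FORMATION"],
)
print(sql)
# To run it (requires DuckDB and the file on disk):
#   import duckdb; duckdb.sql(sql).df()
```

Running `DESCRIBE SELECT * FROM '<file>'` in DuckDB is a quick way to confirm the exact column names and types in a given well file before building queries like this.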

Source & License

Well log data from the FORCE 2020 Machine Learning Competition (hosted on Xeek, now ThinkOnward), released for the lithofacies prediction challenge by FORCE and the Norwegian Petroleum Directorate (NPD).

License terms for the underlying well logs are published on the competition page. The competition required participant prediction submissions to be released under CC BY 4.0.

Download Argentina Files

Monthly production data for ~85,418 oil and gas wells in Argentina (2006–2025). Spanish column names preserved from source.

wells.parquet ⬇ well_operator_history.parquet ⬇ well_events.parquet ⬇

monthly_production (hive-partitioned by year)

_files.json · manifest for read_parquet / httpfs {}

Download a specific year

2006 ⬇ 2007 ⬇ 2008 ⬇ 2009 ⬇ 2010 ⬇ 2011 ⬇ 2012 ⬇ 2013 ⬇ 2014 ⬇ 2015 ⬇ 2016 ⬇ 2017 ⬇ 2018 ⬇ 2019 ⬇ 2020 ⬇ 2021 ⬇ 2022 ⬇ 2023 ⬇ 2024 ⬇ 2025 ⬇

README.md 📖 schema.md 📖 schema.json {} schema.sql ⌘

Three single-file tables · 20-year partitioned monthly series · Spanish column names

About This Dataset

Monthly production data for the Argentine oil and gas industry, sourced from the Secretaría de Energía public datasets. The data is organized into four tables according to how often each attribute changes per well: a static well master, a slowly changing operator history, operational state events, and a gap-filled monthly time series of ~17.6 million rows.

Spanish column names are preserved from the source (idpozo, cuenca, prod_pet, …). The full glossary of opaque codes (tef, vida_util, formprod) lives in schema.md.

Quick Start with DuckDB

Aggregate 2023 production by basin, joining the static well master to the hive-partitioned monthly series:

import duckdb

# Aggregate 2023 production by basin
result = duckdb.sql("""
    SELECT w.cuenca,
           SUM(m.prod_pet) AS oil_m3,
           SUM(m.prod_gas) AS gas_mm3
    FROM 'argentina/wells.parquet' w
    JOIN read_parquet(
      'argentina/monthly_production/anio=*/data.parquet',
      hive_partitioning = true
    ) m USING (idpozo)
    WHERE m.anio = 2023
    GROUP BY w.cuenca
    ORDER BY oil_m3 DESC
""").df()

Three more canonical patterns (single-well lookup, year-range aggregation, manifest-driven access via _files.json) live in the dataset README.
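As a rough illustration of the year-range pattern, the sketch below lists the partition files explicitly instead of using a glob, which may also help when reading over HTTP, where glob expansion may be unavailable. It assumes the `anio=YYYY/data.parquet` layout shown in the Quick Start query. `partition_paths` and `yearly_oil_sql` are hypothetical names, and the README's own version may differ.

```python
def partition_paths(first_year: int, last_year: int,
                    root: str = "argentina/monthly_production") -> list[str]:
    """List one data file per year partition, inclusive of both end years."""
    return [f"{root}/anio={year}/data.parquet"
            for year in range(first_year, last_year + 1)]


def yearly_oil_sql(first_year: int, last_year: int) -> str:
    """Build a query summing oil production per year over the chosen partitions."""
    files = ", ".join(f"'{p}'" for p in partition_paths(first_year, last_year))
    return f"""
        SELECT anio, SUM(prod_pet) AS oil_m3
        FROM read_parquet([{files}], hive_partitioning = true)
        GROUP BY anio
        ORDER BY anio
    """


print(partition_paths(2021, 2023))
# To run the query (requires DuckDB and the files on disk):
#   import duckdb; duckdb.sql(yearly_oil_sql(2021, 2023)).df()
```

Listing files explicitly means only the requested years are opened. The glob form with a `WHERE anio BETWEEN ...` filter is the shorter alternative when the full directory is available locally.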

Schema Documents

Full per-column documentation is published alongside the parquets:

README.md

Dataset overview + four canonical query examples

→ Open README.md

schema.md

English column docs (Spanish identifiers preserved), four-bucket rationale, glossary

→ Open schema.md

schema.json

Machine-readable column list, types, primary & foreign keys

→ Open schema.json

schema.sql

DDL that mirrors the published structure in a fresh DuckDB

→ Open schema.sql

Source & License

Data from Producción de petróleo y gas por pozo (Capítulo IV), published by the Secretaría de Energía on the Argentine open data portal (datos.energia.gob.ar).

All three source CSVs (produccin-de-pozos-de-gas-y-petrleo-*, capitulo-iv-pozos, listado-de-pozos-cargados-por-empresas-operadoras) are resources of this same dataset package.

Licensed under Creative Commons Attribution 4.0 (as declared on the dataset's portal page).

Download Petrobras 3W Files

Labelled 1-Hz sensor-data windows from the Petrobras 3W dataset. Pinned at upstream git tag v.1.70.0 (dataset version 2.0.0). This release publishes the event-class lookup, the real-Well master, the full Instance catalog, and the per-Instance Observations time-series hive-partitioned by event_class.

event_types.parquet ⬇ wells.parquet ⬇ instances.parquet ⬇ observations/_files.json 📜

README.md 📖 schema.md 📖 schema.json {} schema.sql ⌘ LICENSE-3W-DATA.md 📄

Lookup + Wells master + Instance catalog + per-Instance Observations (hive-partitioned by event_class) · pinned upstream identity logged on every publish

About This Dataset

The Petrobras 3W dataset is a corpus of ~2,228 labelled 1-Hz sensor-data windows recorded on Petrobras's offshore wells, framed around at most one anomaly event per window. The full corpus covers ten operational regimes (NORMAL plus nine anomaly categories such as Hydrate in Production Line and Severe Slugging) across ~40 distinct real wells, supplemented by simulated and hand-drawn instances.

Petrodb pins the upstream repository at git tag v.1.70.0 (dataset version 2.0.0). Refreshes happen only when a new upstream release is published, never silently.

Quick Start with DuckDB

Measure the labelled-data balance across the corpus from the Instance catalog alone (no Observations scan needed):

import duckdb

base = 'https://dev-petrodb.ocortez.com/petrobras_3w'
result = duckdb.sql(f"""
    SELECT
        et.event_class,
        et.description,
        COUNT(*)             AS n_instances,
        SUM(i.n_rows)        AS n_observations
    FROM '{base}/instances.parquet' i
    JOIN '{base}/event_types.parquet' et
        ON et.event_class = i.event_class
    GROUP BY et.event_class, et.description
    ORDER BY et.event_class
""").df()

The per-Instance Observations files are accessible via the hive-partitioned URL pattern observations/event_class=N/<instance_id>.parquet. Each file embeds instance_id, well_id, and well_kind as constant columns, so corpus-wide queries against a single event class do not need to join the catalog:

-- All real-Well Hydrate-in-Production-Line observations
SELECT instance_id, well_id, "timestamp", "P-PDG", "T-PDG", class
FROM 'https://dev-petrodb.ocortez.com/petrobras_3w/observations/event_class=8/*.parquet'
WHERE well_kind = 'real';
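If glob expansion over HTTPS is not available in a given DuckDB setup, the same query can be pointed at an explicit list of files built from the URL pattern above. This is a sketch under assumptions: the helper names are hypothetical, and the instance ids would come from the Instance catalog, which is assumed here to expose an `instance_id` column.

```python
BASE = "https://dev-petrodb.ocortez.com/petrobras_3w"


def observation_url(event_class: int, instance_id: str, base: str = BASE) -> str:
    """Build the URL of one per-Instance Observations file."""
    return f"{base}/observations/event_class={event_class}/{instance_id}.parquet"


def observations_sql(event_class: int, instance_ids: list[str], base: str = BASE) -> str:
    """Build a query over an explicit list of Observations files."""
    if not instance_ids:
        raise ValueError("at least one instance id is required")
    urls = ", ".join(f"'{observation_url(event_class, i, base)}'" for i in instance_ids)
    return f"""
        SELECT instance_id, well_id, "timestamp", "P-PDG", "T-PDG", class
        FROM read_parquet([{urls}])
        WHERE well_kind = 'real'
    """


# Usage sketch (requires DuckDB and network access; the catalog column
# name `instance_id` is an assumption):
#   import duckdb
#   ids = [r[0] for r in duckdb.sql(
#       f"SELECT instance_id FROM '{BASE}/instances.parquet' WHERE event_class = 8"
#   ).fetchall()]
#   df = duckdb.sql(observations_sql(8, ids)).df()
```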

Schema Documents

Full per-column documentation is published alongside the parquets:

README.md

Dataset overview, pinned upstream identity, query examples

→ Open README.md

schema.md

English column docs + 27-sensor glossary mirrored from upstream dataset.ini

→ Open schema.md

schema.json

Machine-readable column list, types, primary & foreign keys

→ Open schema.json

schema.sql

DDL that mirrors the published structure in a fresh DuckDB

→ Open schema.sql

Source & License

Upstream repository: https://github.com/petrobras/3W.git (pinned at git tag v.1.70.0, dataset version 2.0.0).

Licensed under Creative Commons Attribution 4.0. All credit for the underlying measurements, labelling, and dataset design belongs to Petrobras and the upstream maintainers.