Open Petroleum Datasets · Parquet Format
High-performance columnar storage · Ready for analysis · Compressed & optimized
This repository provides the Equinor Volve field data in Parquet format, converted from the original dataset. The Volve field was Norway's first fully disclosed oil field dataset, offering real-world data for data engineering and analysis workflows.
The data has been structured into three normalized tables: daily metrics, monthly aggregations, and well metadata. This layout is well suited to learning SQL, practicing data engineering, or building analytical pipelines.
Query the Parquet files directly without loading into a database. Here's a simple example using DuckDB:
    import duckdb

    # Connect to DuckDB (in-memory)
    con = duckdb.connect()

    # Query daily production data
    result = con.execute("""
        SELECT
            w.wellbore_name,
            SUM(d.oil_volume)   AS total_oil,
            SUM(d.gas_volume)   AS total_gas,
            SUM(d.water_volume) AS total_water
        FROM 'volve/daily_production.parquet' d
        JOIN 'volve/wells.parquet' w
          ON d.npd_wellbore_code = w.npd_wellbore_code
        WHERE d.date BETWEEN '2008-01-01' AND '2008-12-31'
        GROUP BY w.wellbore_name
        ORDER BY total_oil DESC
        LIMIT 10
    """).fetchall()

    for row in result:
        print(row)
No database setup required. DuckDB reads Parquet files directly and efficiently, making it perfect for exploratory analysis and prototyping.
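Because `fetchall()` returns plain tuples, it can help to give the columns names before doing further analysis. The sketch below assumes the four-column shape of the query above. The `WellTotals` type, the `water_cut` helper, and the sample rows are hypothetical placeholders, not values from the Volve data.

```python
from typing import NamedTuple, Optional


class WellTotals(NamedTuple):
    """One row of the per-well totals query (hypothetical helper type)."""
    wellbore_name: str
    total_oil: Optional[float]
    total_gas: Optional[float]
    total_water: Optional[float]


def water_cut(row: WellTotals) -> float:
    """Water share of produced liquid; 0.0 when no liquid was produced."""
    # SUM() yields NULL (None in Python) when every input is NULL,
    # so missing totals are treated as zero here.
    oil = row.total_oil or 0.0
    water = row.total_water or 0.0
    liquid = oil + water
    return water / liquid if liquid else 0.0


# Placeholder rows standing in for con.execute(...).fetchall().
# The names and volumes are made up for illustration only.
raw_rows = [
    ("WELL-A", 300.0, 45000.0, 100.0),
    ("WELL-B", None, None, None),
]

wells = [WellTotals(*r) for r in raw_rows]
for w in wells:
    print(f"{w.wellbore_name}: water cut {water_cut(w):.2f}")
```

Swapping `raw_rows` for the real `result` list should work as long as the SELECT keeps the same column order.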
The dataset consists of three interconnected tables tracking well production metrics at different time granularities.
Data from the Equinor Volve Data Village — Norway's first fully disclosed oil field dataset.
Licensed under CC BY-NC-SA 4.0 · Equinor official terms
108 well log files from the FORCE 2020 Machine Learning Competition. Search for specific wells or browse by quadrant.
The FORCE 2020 Machine Learning Competition dataset contains well log data from 108 wells on the Norwegian Continental Shelf. Originally released for a lithofacies prediction challenge, it provides petrophysical measurements suited to machine learning and data science applications.
Each well file contains depth-indexed measurements including gamma ray, resistivity, density, neutron porosity, and sonic logs, along with lithofacies classifications.
Query the well log files directly. Here's an example analyzing a single well:
    import duckdb

    # Connect to DuckDB (in-memory)
    con = duckdb.connect()

    # Analyze well 15-9-13
    result = con.execute("""
        SELECT
            WELL,
            MIN(DEPTH_MD) AS min_depth,
            MAX(DEPTH_MD) AS max_depth,
            AVG(GR)       AS avg_gamma_ray,
            AVG(RHOB)     AS avg_density,
            COUNT(*)      AS samples
        FROM 'force_2020/wells/15-9-13.parquet'
        GROUP BY WELL
    """).fetchall()

    for row in result:
        print(row)

    # Query multiple wells at once
    multi_well = con.execute("""
        SELECT
            WELL,
            FORMATION,
            COUNT(*) AS samples
        FROM 'force_2020/wells/*.parquet'
        WHERE FORMATION IS NOT NULL
        GROUP BY WELL, FORMATION
        ORDER BY WELL, samples DESC
    """).fetchall()
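To go from a well name to a file path, or to select one quadrant, a small helper can build the path strings used in the queries above. This is a sketch under one assumption inferred from the single example path: that each file is named after the well with `/` replaced by `-` (so `15/9-13` becomes `15-9-13.parquet`). The function names `well_file` and `quadrant_glob` are hypothetical.

```python
from pathlib import PurePosixPath

# Directory used in the example queries above.
WELLS_DIR = PurePosixPath("force_2020/wells")


def well_file(well_name: str) -> str:
    """Map a well name such as '15/9-13' to its assumed Parquet path."""
    # Assumption: the file name is the well name with '/' replaced by '-'.
    stem = well_name.strip().replace("/", "-")
    return str(WELLS_DIR / f"{stem}.parquet")


def quadrant_glob(quadrant: int) -> str:
    """Glob pattern for every well file whose name starts with one quadrant."""
    # The trailing '-' keeps quadrant 1 from also matching quadrant 15.
    return str(WELLS_DIR / f"{quadrant}-*.parquet")


print(well_file("15/9-13"))
print(quadrant_glob(15))
```

Either string can be dropped into the `FROM '...'` clause of the queries above; well names containing spaces or suffixes may follow a different file-naming rule.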
Each well file contains 29 columns with petrophysical measurements and metadata.
Well log data from the FORCE 2020 Machine Learning Competition (hosted on Xeek, now ThinkOnward), released for the lithofacies prediction challenge by FORCE and the Norwegian Petroleum Directorate (NPD).
License terms for the underlying well logs are published on the competition page. The competition required participant prediction submissions to be released under CC BY 4.0.
Monthly production data for ~85,418 oil and gas wells in Argentina (2006–2025). Spanish column names preserved from source.
Three single-file tables · 20-year partitioned monthly series · Spanish column names
Monthly production data for the Argentine oil and gas industry, sourced from the Secretaría de Energía public datasets. Organized into four tables by per-well change frequency: a static well master, slowly-changing operator history, operational state events, and a gap-filled monthly time series of ~17.6 million rows.
Spanish column names are preserved from the source (idpozo, cuenca, prod_pet, …). The full glossary of opaque codes (tef, vida_util, formprod) lives in schema.md.
Aggregate 2023 production by basin, joining the static well master to the hive-partitioned monthly series:
    import duckdb

    # Aggregate 2023 production by basin
    result = duckdb.sql("""
        SELECT
            w.cuenca,
            SUM(m.prod_pet) AS oil_m3,
            SUM(m.prod_gas) AS gas_mm3
        FROM 'argentina/wells.parquet' w
        JOIN read_parquet(
            'argentina/monthly_production/anio=*/data.parquet',
            hive_partitioning = true
        ) m USING (idpozo)
        WHERE m.anio = 2023
        GROUP BY w.cuenca
        ORDER BY oil_m3 DESC
    """).df()
Three more canonical patterns (single-well lookup, year-range aggregation, manifest-driven access via _files.json) live in the dataset README.
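As a rough illustration of the year-range pattern (not the README's own version), the partition paths can be listed explicitly instead of globbed, so only the requested years are opened. The sketch assumes the `anio=<year>/data.parquet` layout shown in the query above; `partition_paths` and `year_range_query` are hypothetical helper names.

```python
from typing import List


def partition_paths(first_year: int, last_year: int,
                    root: str = "argentina/monthly_production") -> List[str]:
    """List one Parquet path per year partition, inclusive of both ends."""
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    return [f"{root}/anio={year}/data.parquet"
            for year in range(first_year, last_year + 1)]


def year_range_query(first_year: int, last_year: int) -> str:
    """Build a DuckDB query that totals oil and gas per year."""
    files = ", ".join(f"'{p}'" for p in partition_paths(first_year, last_year))
    return (
        "SELECT anio, SUM(prod_pet) AS oil_m3, SUM(prod_gas) AS gas_mm3\n"
        f"FROM read_parquet([{files}], hive_partitioning = true)\n"
        "GROUP BY anio\n"
        "ORDER BY anio"
    )


print(year_range_query(2021, 2023))
```

The resulting string can be passed to `duckdb.sql(...)`. Filtering with `WHERE anio BETWEEN 2021 AND 2023` on the glob form is an alternative; the explicit list simply makes the set of files visible.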
Full per-column documentation is published alongside the Parquet files.
Data from Producción de petróleo y gas por pozo (Capítulo IV), published by the Secretaría de Energía on the Argentine open data portal (datos.energia.gob.ar).
All three source CSVs (produccin-de-pozos-de-gas-y-petrleo-*, capitulo-iv-pozos, listado-de-pozos-cargados-por-empresas-operadoras) are resources of this same dataset package.
Licensed under Creative Commons Attribution 4.0 (as declared on the dataset's portal page).
Labelled 1-Hz sensor-data windows from the Petrobras 3W dataset.
Pinned at upstream git tag v.1.70.0 (dataset version 2.0.0). This release publishes the event-class lookup, the real-Well master, the full Instance catalog, and the per-Instance Observations time series, hive-partitioned by event_class.
Lookup + Wells master + Instance catalog + per-Instance Observations (hive-partitioned by event_class) · pinned upstream identity logged on every publish
The Petrobras 3W dataset is a corpus of ~2,228 labelled 1-Hz sensor-data windows recorded on Petrobras's offshore wells, framed around at most one anomaly event per window. The full corpus covers ten operational regimes (NORMAL plus nine anomaly categories such as Hydrate in Production Line and Severe Slugging) across ~40 distinct real wells, supplemented by simulated and hand-drawn instances.
Petrodb pins the upstream repository at git tag v.1.70.0 (dataset version 2.0.0); refreshes are event-driven on new upstream releases, never silent.
Measure the labelled-data balance across the corpus from the Instance catalog alone (no Observations scan needed):
    import duckdb

    base = 'https://dev-petrodb.ocortez.com/petrobras_3w'

    result = duckdb.sql(f"""
        SELECT
            et.event_class,
            et.description,
            COUNT(*)      AS n_instances,
            SUM(i.n_rows) AS n_observations
        FROM '{base}/instances.parquet' i
        JOIN '{base}/event_types.parquet' et
          ON et.event_class = i.event_class
        GROUP BY et.event_class, et.description
        ORDER BY et.event_class
    """).df()
The per-Instance Observations files are accessible via the hive-partitioned URL pattern observations/event_class=N/<instance_id>.parquet. Each file embeds instance_id, well_id, and well_kind as constant columns, so corpus-wide queries against a single event class do not need to join the catalog:
    -- All real-Well Hydrate-in-Production-Line observations
    SELECT
        instance_id,
        well_id,
        "timestamp",
        "P-PDG",
        "T-PDG",
        class
    FROM 'https://dev-petrodb.ocortez.com/petrobras_3w/observations/event_class=8/*.parquet'
    WHERE well_kind = 'real';
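Glob patterns are not always expanded over plain HTTPS, depending on the DuckDB version and filesystem in use. In that case the same query can be given an explicit list of file URLs built from the documented pattern. The sketch below assumes the instance IDs have already been read from the Instance catalog (an `instance_id` column there is an assumption); `INSTANCE_A` and `INSTANCE_B` are placeholders, and the helper names are hypothetical.

```python
from typing import Iterable
from urllib.parse import quote

BASE = "https://dev-petrodb.ocortez.com/petrobras_3w"


def observation_url(event_class: int, instance_id: str, base: str = BASE) -> str:
    """URL of one Observations file, following the documented pattern."""
    # quote() guards against characters that are not URL-safe in an ID.
    return (f"{base}/observations/event_class={event_class}/"
            f"{quote(instance_id, safe='')}.parquet")


def observations_query(event_class: int, instance_ids: Iterable[str]) -> str:
    """Build a DuckDB query over an explicit list of Observations files."""
    urls = [observation_url(event_class, i) for i in instance_ids]
    if not urls:
        raise ValueError("at least one instance_id is required")
    file_list = ", ".join(f"'{u}'" for u in urls)
    return (
        'SELECT instance_id, well_id, "timestamp", "P-PDG", "T-PDG", class\n'
        f"FROM read_parquet([{file_list}])\n"
        "WHERE well_kind = 'real'"
    )


# Placeholder IDs; real ones would come from the Instance catalog.
print(observations_query(8, ["INSTANCE_A", "INSTANCE_B"]))
```

The printed SQL can be passed to `duckdb.sql(...)` once the placeholder IDs are replaced with real ones.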
Full per-column documentation is published alongside the Parquet files.
Upstream repository: https://github.com/petrobras/3W.git (pinned at git tag v.1.70.0, dataset version 2.0.0).
Licensed under Creative Commons Attribution 4.0. All credit for the underlying measurements, labelling, and dataset design belongs to Petrobras and the upstream maintainers.