
Concept

Why ioc_cleanup?

Cleaning tide gauge data is often:

  • ❌ manual
  • ❌ poorly documented
  • ❌ hard to reproduce
  • ❌ difficult to review or share

ioc_cleanup proposes a community-driven, version-controlled approach in which every cleaning decision is explicitly recorded and auditable.

Concept

The core idea of ioc_cleanup is declarative cleaning.

Instead of being buried in ad hoc scripts or notebooks, all cleaning decisions are:

  • Explicit
  • Version controlled
  • Human-readable
  • Reviewable

Cleaning logic lives entirely in JSON files.
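
As a rough illustration of what "declarative" means here, the sketch below treats a cleaning record as plain data rather than code. Every key name and value in it (station, sensor, notes, dropped_timestamps) is a hypothetical placeholder, not the actual ioc_cleanup schema, and canonical() is a helper invented for this example. The point is only that a decision stored as JSON can be parsed, extended, and written back in a stable form that produces small, reviewable diffs.

```python
import json

# Hypothetical record: key names and values are illustrative assumptions,
# not the real ioc_cleanup JSON format.
record_text = """
{
  "station": "abcd",
  "sensor": "rad",
  "notes": "isolated spike removed after visual inspection",
  "dropped_timestamps": ["2022-01-01T02:00:00"]
}
"""
record = json.loads(record_text)


def canonical(rec):
    """Serialize with a stable key order and indentation (hypothetical helper)."""
    return json.dumps(rec, indent=2, sort_keys=True)


# Recording one more decision only touches the list it belongs to;
# the original record is left unchanged.
updated = dict(
    record,
    dropped_timestamps=record["dropped_timestamps"] + ["2022-01-01T03:00:00"],
)

print(canonical(updated))
```

Because the output is deterministic, two contributors who record the same decision end up with identical files, which is what makes review through a version-control system practical.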

Why it matters

This methodology allows:

  • Flagging:
    • bad or corrupt data (individual timestamps or time ranges)
    • sensor breakpoints
    • singular phenomena (e.g. tsunamis, meteo-tsunamis, seiches, or unidentified events)
  • Reproducible cleaning
  • Transparent and traceable decisions stored in plain JSON
  • Peer review of cleaning decisions via GitHub
  • Easy extension to other datasets (e.g. GESLA, NDBC)
  • Gradual growth in station coverage through community contributions

Transformations

Each station/sensor pair is described by a JSON file located in:

./transformations/

These files define the transformation from raw data → clean signal by declaring:

  • valid time windows
  • dropped timestamps
  • dropped ranges
  • breakpoints
  • notes and metadata
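
The items above can be sketched as a small filter that turns raw samples into a clean signal. This is a minimal standard-library example under stated assumptions: the key names (start, end, dropped_timestamps, dropped_ranges), the inclusive bounds, and the apply_declaration() function are all hypothetical and are not taken from the ioc_cleanup package. Breakpoints and metadata are left out for brevity.

```python
from datetime import datetime, timedelta

# Hypothetical declaration: key names are illustrative assumptions,
# not the actual ioc_cleanup schema.
declaration = {
    "start": "2022-01-01T00:00:00",
    "end": "2022-01-01T06:00:00",
    "dropped_timestamps": ["2022-01-01T02:00:00"],
    "dropped_ranges": [["2022-01-01T04:00:00", "2022-01-01T05:00:00"]],
}


def apply_declaration(samples, decl):
    """Return the (timestamp, value) pairs that survive the declared rules.

    All bounds are treated as inclusive, which is an assumption of this sketch.
    """
    start = datetime.fromisoformat(decl["start"])
    end = datetime.fromisoformat(decl["end"])
    dropped = {datetime.fromisoformat(ts) for ts in decl["dropped_timestamps"]}
    ranges = [
        (datetime.fromisoformat(a), datetime.fromisoformat(b))
        for a, b in decl["dropped_ranges"]
    ]
    kept = []
    for t, value in samples:
        if not (start <= t <= end):
            continue  # outside the valid time window
        if t in dropped:
            continue  # individually dropped timestamp
        if any(a <= t <= b for a, b in ranges):
            continue  # inside a dropped range
        kept.append((t, value))
    return kept


# Synthetic hourly samples from 2021-12-31 23:00 to 2022-01-01 07:00
raw = [(datetime(2021, 12, 31, 23) + timedelta(hours=i), 0.1 * i) for i in range(9)]
clean = apply_declaration(raw, declaration)
print([t.isoformat() for t, _ in clean])
```

The raw samples are never modified; rerunning the same declaration on the same input always yields the same clean signal, which is the reproducibility property the JSON files are meant to provide.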

More details are given in the JSON format section.

Dataset Details

The following figures were generated with the helper scripts in the scripts/ folder:

  1. download_ioc.py to download IOC stations
  2. generate_maps.py to create maps and graphs for the online documentation
  3. save_cleaning_scenarios.py to create the time series graphs used in the online documentation

Steps 2 and 3 require that step 1 has already been run for all cleaned IOC stations.

The cleaned stations dataset can be retrieved with:

import ioc_cleanup as C
ioc = C.get_meta()
stats = C.calc_statistics(ioc, stations_dir=C.TRANSFORMATIONS_DIR, pattern="*.json")

Cleaned Stations

Data availability in the 2020 - 2025 period

Ratio of data removed in the 2020 - 2025 period