# Concept

## Why ioc_cleanup?

Cleaning tide gauge data is often:

- manual
- poorly documented
- hard to reproduce
- difficult to review or share

ioc_cleanup proposes a community-driven, version-controlled approach in which every cleaning decision is explicitly recorded and auditable.
## Concept

The core idea of ioc_cleanup is declarative cleaning. Instead of scripts or notebooks, all cleaning decisions are:

- Explicit
- Version-controlled
- Human-readable
- Reviewable

Cleaning logic lives entirely in JSON files.
## Why it matters

This methodology enables:

- Flagging of:
    - bad or corrupt data (timestamps / data ranges)
    - sensor breakpoints
    - singular phenomena (e.g. tsunamis, meteo-tsunamis, seiches, or unidentified events)
- Reproducible cleaning
- Transparent, traceable decisions stored in plain JSON
- Peer review of cleaning decisions via GitHub
- Easy extension to other datasets (e.g. GESLA, NDBC)
- Gradual growth in station coverage through community contributions
## Transformations

Each station/sensor pair is described by a JSON file in the project's transformations directory (`TRANSFORMATIONS_DIR`). These files define the transformation from raw data → clean signal by declaring:

- valid time windows
- dropped timestamps
- dropped ranges
- breakpoints
- notes and metadata

More details are given in the JSON format documentation.
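To make the declarative approach concrete, a transformation file might look like the following sketch. The field names here are illustrative assumptions for a single station/sensor pair, not the project's actual JSON schema:

```json
{
    "notes": "Sensor replaced in March 2021; readings during maintenance removed.",
    "start": "2020-01-01T00:00:00",
    "end": "2025-01-01T00:00:00",
    "dropped_timestamps": ["2021-03-02T10:00:00"],
    "dropped_date_ranges": [["2021-03-01T00:00:00", "2021-03-03T00:00:00"]],
    "breakpoints": ["2021-03-03T00:00:00"]
}
```

Because the file is plain JSON, a reviewer can inspect every dropped timestamp or range in a GitHub diff without running any code.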
## Dataset Details

The following figures were generated with the helper scripts in the `scripts/` folder:

1. `download_ioc.py` to download IOC stations
2. `generate_maps.py` to create maps and graphs for the online documentation
3. `save_cleaning_scenarios.py` to create the time series graphs used in the online documentation

Steps 2 and 3 require that step 1 has been run for all cleaned IOC stations.
The cleaned stations dataset can be retrieved with:

```python
import ioc_cleanup as C

ioc = C.get_meta()
stats = C.calc_statistics(ioc, stations_dir=C.TRANSFORMATIONS_DIR, pattern="*.json")
```
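As a sketch of how such a declarative spec turns into an actual cleaning step, the snippet below applies a window, dropped timestamps, and dropped ranges to a raw pandas series. The `spec` dict, its field names, and the `apply_spec` helper are hypothetical illustrations, not part of the ioc_cleanup API:

```python
import pandas as pd

# Hypothetical declarative cleaning spec; field names are illustrative,
# not the actual ioc_cleanup JSON schema.
spec = {
    "start": "2020-01-01 00:00:00",
    "end": "2020-01-01 05:00:00",
    "dropped_timestamps": ["2020-01-01 02:00:00"],
    "dropped_date_ranges": [["2020-01-01 03:00:00", "2020-01-01 04:00:00"]],
}

def apply_spec(series: pd.Series, spec: dict) -> pd.Series:
    # Keep only the valid time window declared in the spec.
    out = series.loc[spec["start"]:spec["end"]]
    # Drop individually flagged timestamps.
    out = out.drop(pd.to_datetime(spec.get("dropped_timestamps", [])), errors="ignore")
    # Drop flagged ranges (inclusive on both ends).
    for t0, t1 in spec.get("dropped_date_ranges", []):
        mask = (out.index >= pd.Timestamp(t0)) & (out.index <= pd.Timestamp(t1))
        out = out[~mask]
    return out

# Toy raw signal: 7 hourly samples.
raw = pd.Series(range(7), index=pd.date_range("2020-01-01", periods=7, freq="h"))
clean = apply_spec(raw, spec)
```

The same raw series always yields the same clean series for a given spec, which is what makes the cleaning reproducible and reviewable.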
## Cleaned Stations

Data availability in the 2020–2025 period.

Ratio of data removed in the 2020–2025 period.