# Data Management

The Data module provides comprehensive storage and management of [OHLCV](guides/data-management.md#ohlcv-data) (Open, High, Low, Close, Volume) data and other time-series [market data](guides/data-management.md).

## Quick Navigation

- **[Storage Architecture](#storage-architecture)** - Understanding Zarr and LMDB backends
- **[Historical Data](#historical-data-with-scrapers)** - Using Scrapers for bulk data collection
- **[Real-Time Data](#real-time-data-with-fetch)** - Fetching live data from exchanges
- **[Live Streaming](#live-data-streaming-with-watchers)** - Continuous data monitoring

## Prerequisites

- Basic understanding of [OHLCV data concepts](getting-started/index.md)
- Familiarity with [Exchange setup](exchanges.md)

## Related Topics

- **[Strategy Development](guides/strategy-development.md)** - Using data in trading strategies
- **[Watchers](watchers/watchers.md)** - Real-time data monitoring
- **[Processing](API/processing.md)** - Data transformation and analysis

## Storage Architecture

### Zarr Backend

Planar uses **Zarr** as its primary storage backend, which offers several advantages:

- **Columnar Storage**: Optimized for array-based data, similar to Feather or Parquet
- **Flexible Encoding**: Supports different compression and encoding schemes
- **Storage Agnostic**: Can be backed by various storage layers, including network-based systems
- **Chunked Access**: Efficient for time-series queries despite chunk-based reading

The framework wraps a Zarr subtype of `AbstractStore` in a [`Planar.Data.ZarrInstance`](@ref). The global `ZarrInstance` is accessible at `Data.zi[]`, with LMDB as the default underlying store.
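As a minimal sketch of how the global store can be reached from an interactive session (assuming Planar is loaded; only `Data.zi[]` and the `ZarrInstance` wrapper are taken from the text above, the rest is illustrative):

```julia
using Planar

# The global ZarrInstance referenced above; it wraps a Zarr
# `AbstractStore` subtype, with LMDB as the default backing store.
zi = Planar.Data.zi[]
@show typeof(zi)
```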
### Data Organization

[OHLCV data](guides/data-management.md#ohlcv-data) is organized hierarchically using [`Planar.Data.key_path`](@ref):

## Data Architecture Overview

The Data module provides a comprehensive [data management](guides/data-management.md) system with the following key components:

- **Storage Backend**: Zarr arrays with LMDB as the default store
- **Data Organization**: Hierarchical structure by exchange/source, pair, and timeframe
- **Data Types**: [OHLCV data](guides/data-management.md#ohlcv-data), generic time-series data, and cached metadata
- **Access Patterns**: Progressive loading for large datasets, contiguous time-series validation
- **Performance**: Chunked storage, compression, and optimized indexing

### Storage Hierarchy

Data is organized in a hierarchical structure:

```
ZarrInstance/
├── exchange_name/
│   ├── pair_name/
│   │   ├── timeframe/
│   │   │   ├── timestamp
│   │   │   ├── open
│   │   │   ├── high
│   │   │   ├── low
│   │   │   ├── close
│   │   │   └── volume
│   │   └── ...
│   └── ...
└── ...
```

## Data Collection Methods

Planar provides multiple methods for collecting market data, each optimized for different use cases.

## Historical Data with Scrapers

The Scrapers module provides access to historical data archives from major exchanges, offering the most efficient method for obtaining large amounts of historical data.

**Supported Exchanges**: Binance and Bybit archives

### Basic Scraper Usage

### Advanced Scraper Examples

Download multiple symbols and filter by quote currency using `bn.binancesyms()` and `scr.selectsyms()`.

### Market Types and Frequencies

Use different market types (`:spot`, `:um`, `:cm`), frequencies (`:daily`, `:monthly`), and data kinds (`:klines`, `:trades`, `:aggTrades`).

### Error Handling and Data Validation
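A hedged sketch of an archive download with basic error handling. Only `bn.binancesyms`, `scr.selectsyms`, the `:um`/`:monthly`/`:klines` symbols, and the `reset=true` keyword appear in this document; the import paths, the `binancedownload` function, and all argument shapes are assumptions for illustration:

```julia
using Scrapers: Scrapers as scr, BinanceData as bn  # import paths assumed

# Select USDT-margined symbols matching our bases (signature assumed).
syms = scr.selectsyms(["BTC", "ETH"], bn.binancesyms(market=:um))

try
    # Hypothetical bulk download of monthly kline archives.
    bn.binancedownload(syms; market=:um, freq=:monthly, kind=:klines)
catch e
    @warn "archive download failed, forcing a full redownload" exception = e
    # `reset=true` discards cached archives when they are corrupted (see below).
    bn.binancedownload(syms; market=:um, freq=:monthly, kind=:klines, reset=true)
end
```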
!!! warning "Download Caching"
    Downloads are cached - requesting the same pair path again will only download newer archives.
    If data becomes corrupted, pass `reset=true` to force a complete redownload.

!!! tip "Performance Optimization"
    - **Monthly Archives**: Use for historical [backtesting](guides/execution-modes.md#simulation-mode) (faster download, larger chunks)
    - **Daily Archives**: Use for recent data or frequent updates
    - **Parallel Downloads**: Consider for multiple symbols, but respect [exchange](exchanges.md) rate limits

## Real-Time Data with Fetch

The Fetch module downloads data directly from exchanges using [CCXT](exchanges.md#ccxt-integration), making it ideal for:

- Getting the most recent market data
- Filling gaps in historical data
- Real-time data updates for [live trading](guides/execution-modes.md#live-mode)

### Basic Fetch Usage

### Advanced Fetch Examples

### Multi-Exchange Data Collection

### Rate Limit Management

Use delays between requests and validate data quality. Implement error handling for failed requests.

!!! warning "Rate Limit Considerations"
    Direct exchange fetching is heavily rate-limited, especially for smaller [timeframes](guides/data-management.md#timeframes).
    Use archives for bulk historical data collection.
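The rate-limit management advice above can be sketched as a simple fetch loop. Only `fetch_candles` (for raw, unchecked responses) is named in this document; the exchange handle `exc`, the `tf"1m"` timeframe literal, and the argument order are assumptions:

```julia
using Planar

pairs = ["BTC/USDT", "ETH/USDT"]
for pair in pairs
    try
        # Raw, unchecked exchange response (validate before strategy use).
        fetch_candles(exc, tf"1m", pair)
    catch e
        @warn "fetch failed" pair exception = e
    end
    sleep(1.0)  # simple delay between requests to respect exchange limits
end
```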
!!! tip "Fetch Best Practices"
    - **Recent Updates**: Use fetch for recent data updates and gap filling
    - **Rate Limiting**: Implement delays between requests to respect exchange limits
    - **Data Validation**: Always validate fetched data before using in [strategies](guides/strategy-development.md)
    - **Raw Data**: Use `fetch_candles` for unchecked data when you need raw exchange responses

## Live Data Streaming with Watchers

The Watchers module enables real-time data tracking from exchanges and other sources, storing data locally for:

- Live trading operations
- Real-time data analysis
- Continuous market monitoring

### OHLCV Ticker Watcher

The ticker watcher monitors multiple pairs simultaneously using exchange ticker endpoints.

As a convention, the `view` property of a watcher shows the processed data. In this case, the candles processed by the `ohlcv_ticker_watcher` will be stored in a dict.

### Single-Pair OHLCV Watcher

There is another OHLCV watcher, based on trades, that tracks only one pair at a time with higher precision.

### Watcher Configuration

Configure watchers with custom intervals using the `timeout_interval`, `fetch_interval`, and `flush_interval` parameters. Use `wc.start!()` and `wc.stop!()` for lifecycle management.

### Orderbook Watcher

### Custom Data Processing

### Error Handling and Resilience

### Data Persistence and Storage

Other implemented watchers include the orderbook watcher and watchers that parse data feeds from third-party APIs.
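A hedged lifecycle sketch for the ticker watcher. The `ohlcv_ticker_watcher` name, the `view` convention, and the `start!`/`stop!`/`isrunning` calls come from this document; the constructor arguments and the `exc` exchange handle are assumptions:

```julia
using Planar

# Hypothetical construction: arguments depend on the actual watcher API.
wc = ohlcv_ticker_watcher(exc)

wc.start!()
sleep(60)            # let it collect for a while

# By convention, `view` exposes the processed data:
# here, a dict of candles keyed by pair.
candles = wc.view

wc.isrunning() && wc.stop!()
```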
!!! tip "Watcher Best Practices"
    - Monitor watcher health regularly with `wc.isrunning()`
    - Implement proper error handling and reconnection logic
    - Save data periodically to prevent loss during interruptions
    - Use appropriate fetch intervals to balance data freshness with rate limits
    - Consider using multiple watchers for redundancy in critical applications

## Custom Data Sources

Assuming you have your own pipeline to fetch candles, you can use the functions [`Planar.Data.save_ohlcv`](@ref) and [`Planar.Data.load_ohlcv`](@ref) to manage the data.

### Basic Custom Data Integration

Saving is easiest if you pass a standard OHLCV dataframe; otherwise you need to provide a `saved_col` argument that indicates the correct column index to use as the `timestamp` column (or use lower-level functions).

To load the data back:

### Advanced Custom Data Examples

### Custom Data Validation

### Working with Large Custom Datasets

### Generic Data Storage

If you want to save other kinds of data, use the [`Planar.Data.save_data`](@ref) and [`Planar.Data.load_data`](@ref) functions. Unlike the OHLCV functions, these don't check for contiguity, so it is possible to store sparse data. The data still requires a timestamp column, however: saved data can either be prepended or appended, so an index must be available to maintain order.

### Serialized Data Storage

While OHLCV data requires a concrete type for storage (`Float64` by default), generic data can either be saved with a shared type or serialized. To serialize the data while saving, pass the `serialize=true` argument to `save_data`; to load serialized data, pass `serialized=true` to `load_data`.
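The save/load workflow described above can be sketched as follows. The function names and the `serialize=true`/`serialized=true` keywords come from this document; the argument order (store, source, pair, timeframe) and the key naming are assumptions for illustration:

```julia
using Planar
using DataFrames, Dates

# A standard OHLCV dataframe from a custom pipeline.
df = DataFrame(
    timestamp = collect(DateTime(2024, 1, 1):Minute(1):DateTime(2024, 1, 1, 0, 59)),
    open = rand(60), high = rand(60), low = rand(60),
    close = rand(60), volume = rand(60),
)

zi = Planar.Data.zi[]
Planar.Data.save_ohlcv(zi, "mysource", "BTC/USDT", "1m", df)   # contiguity checked
df2 = Planar.Data.load_ohlcv(zi, "mysource", "BTC/USDT", "1m")

# Generic (possibly sparse) payloads: no contiguity check, optional serialization.
payload = [(now(), Dict("note" => "arbitrary metadata"))]
Planar.Data.save_data(zi, "mysource", "mykey", payload; serialize=true)
obj = Planar.Data.load_data(zi, "mysource", "mykey"; serialized=true)
```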
### Progressive Data Loading

When loading data from storage, you can use the `ZArray` directly by passing `raw=true` to `load_ohlcv`, or `as_z=true` or `with_z=true` to `load_data`. By managing the array directly you avoid materializing the entire dataset, which is necessary when dealing with large amounts of data.

Data is returned as a `DataFrame` with `open,high,low,close,volume,timestamp` columns.
Since these save/load functions require a timestamp column, they check that the provided index is contiguous: it should not have missing timestamps, according to the subject timeframe. You can disable those checks by passing `check=:none`.

!!! warning "Data Contiguity"
    OHLCV save/load functions validate timestamp contiguity by default. Use `check=:none` to disable validation for irregular data.

!!! tip "Performance Optimization"
    - Use progressive loading (`raw=true`) for datasets larger than available memory
    - Process data in chunks when dealing with very large time series
    - Consider serialization for complex data structures that don't fit standard numeric types

## Data Indexing and Access Patterns

The Data module implements dataframe indexing by dates, so you can conveniently access rows by date or date range.

### Advanced Indexing Examples

### Timeframe Management

With OHLCV data, you can access the timeframe of the series directly from the dataframe by calling `timeframe!(df)`. This either returns the previously set timeframe or infers it from the `timestamp` column. You can set the timeframe by calling e.g. `timeframe!(df, tf"1m")`, or `timeframe!!` to overwrite it.

### Efficient Data Slicing

### Data Aggregation and Resampling

## Caching and Performance Optimization

`Data.Cache.save_cache` and `Data.Cache.load_cache` can be used to store generic metadata like JSON payloads.
The data is saved in the Planar data directory, which is under `XDG_CACHE_DIR` if set, or under `$HOME/.cache` by default.

### Basic Caching Usage

### Advanced Caching Examples

### Performance Optimization Strategies

### Cache Management

### Storage Configuration Optimization

## Data Processing and Transformation

The Data module provides comprehensive tools for processing and transforming financial data. This section covers data cleaning, validation, and transformation techniques.

### Data Cleaning and Validation

### Gap Detection and Filling

### Data Transformation and Feature Engineering

## Storage Configuration and Optimization

This section covers advanced storage configuration, optimization techniques, and troubleshooting for the Zarr/LMDB backend.

### Zarr Storage Configuration

### LMDB Configuration and Tuning

### Storage Optimization Strategies

### Data Validation and Integrity

### Troubleshooting Storage Issues
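As an illustration of the progressive-loading pattern described earlier, large series can be walked in chunks instead of being materialized at once. The `raw=true` keyword and the `ZArray` return type come from this document; the `load_ohlcv` argument order, the exchange/pair names, and the slicing loop are assumptions:

```julia
using Planar

# `raw=true` returns the backing ZArray instead of a DataFrame.
za = Planar.Data.load_ohlcv(Planar.Data.zi[], "binance", "BTC/USDT", "1m"; raw=true)

chunk = 10_000
for start in 1:chunk:size(za, 1)
    stop = min(start + chunk - 1, size(za, 1))
    rows = za[start:stop, :]   # materialize only one chunk at a time
    # ... process `rows` ...
end
```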
!!! tip "Performance Best Practices"
    - Use progressive loading (`raw=true`) for datasets larger than available memory
    - Implement caching for expensive computations with appropriate TTL
    - Monitor cache size and clean up old entries regularly
    - Use chunked processing for very large datasets
    - Consider serialization for complex data structures that don't fit standard numeric types

## Real-Time Data Pipelines and Monitoring

This section covers advanced real-time data collection, processing, and monitoring using the Watchers system.

### Real-Time Data Pipeline Architecture

### Advanced Watcher Management

### Real-Time Data Processing

### Monitoring and Alerting

### Data Quality Monitoring

### Complete Pipeline Example

!!! warning "Storage Considerations"
    - Always backup data before performing repair operations
    - Monitor disk space regularly, especially when using compression
    - Validate data integrity periodically to catch corruption early
    - Use appropriate LMDB map sizes to avoid out-of-space errors

!!! tip "Real-Time Data Best Practices"
    - Implement comprehensive monitoring and alerting for production systems
    - Use multiple watchers per exchange for redundancy
    - Monitor data quality continuously to catch issues early
    - Implement automatic restart mechanisms for failed watchers
    - Cache processed data for quick access by trading [strategies](guides/strategy-development.md)
    - Set up proper logging and error handling for debugging issues
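The automatic-restart practice above can be sketched as a small health-check loop. The `isrunning`/`start!` calls come from this document; the watcher collection, its construction, and the intervals are illustrative assumptions:

```julia
using Planar

# `wc` is a previously constructed watcher (e.g. an OHLCV ticker watcher);
# the dict shape is illustrative only.
watchers = Dict("binance" => wc)

while true
    for (name, w) in watchers
        if !w.isrunning()
            @warn "watcher down, restarting" name
            w.start!()
        end
    end
    sleep(30)   # health-check interval
end
```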