# Data Management Guide

This comprehensive guide covers Planar's data management system for OHLCV (Open, High, Low, Close, Volume) data and other time-series market data. Learn how to efficiently collect, store, and access market data using multiple collection methods and storage backends.

## Quick Navigation

- **[Storage Architecture](#storage-architecture)** - Understanding Zarr and LMDB backends
- **[Data Collection Methods](#data-collection-methods)** - Overview of collection approaches
- **[Historical Data](#historical-data-collection)** - Using Scrapers for bulk data collection
- **[Real-Time Data](#real-time-data-fetching)** - Fetching live data from [exchanges](../exchanges.md)
- **[Live Streaming](#live-data-streaming)** - Continuous data monitoring with Watchers
- **[Custom Data Sources](#custom-data-sources)** - Integrating your own data
- **[Data Access Patterns](#data-access-patterns)** - Efficient data querying and indexing
- **[Optimization](../optimization.md)** - Caching and optimization [strategies](../guides/strategy-development.md)
- **[Troubleshooting](../troubleshooting/index.md)** - Common issues and solutions

## Prerequisites

- Basic understanding of [OHLCV data concepts](../getting-started/index.md)
- Familiarity with [exchanges](../exchanges.md)

## Related Topics

- **[Strategy Development](strategy-development.md)** - Using data in trading strategies
- **[Watchers](../watchers/watchers.md)** - Real-time data monitoring
- **[Processing](../API/processing.md)** - Data transformation and analysis

## Storage Architecture

### Zarr Backend

Planar uses **Zarr** as its primary storage backend, which offers several advantages for time-series data:

- **Columnar Storage**: Optimized for
array-based data, similar to Feather or Parquet
- **Flexible Encoding**: Supports different compression and encoding schemes
- **Storage Agnostic**: Can be backed by various storage layers, including network-based systems
- **Chunked Access**: Efficient for time-series queries despite chunk-based reading
- **Scalability**: Handles large datasets with progressive loading capabilities

The framework wraps a Zarr subtype of `AbstractStore` in a [`Planar.Data.ZarrInstance`](@ref). The global `ZarrInstance` is accessible at `Data.zi[]`, with LMDB as the default underlying store.

### Data Organization

OHLCV data is organized hierarchically using [`Planar.Data.key_path`](@ref):

```
ZarrInstance/
├── exchange_name/
│   ├── pair_name/
│   │   ├── timeframe/
│   │   │   ├── timestamp
│   │   │   ├── open
│   │   │   ├── high
│   │   │   ├── low
│   │   │   ├── close
│   │   │   └── volume
│   │   └── ...
│   └── ...
└── ...
```

### Storage Hierarchy Benefits

This hierarchical organization provides:

- **Logical Grouping**: Data organized by source, instrument, and timeframe
- **Efficient Queries**: Fast access to specific data subsets
- **Scalability**: Easy addition of new [exchanges](../exchanges.md), pairs, and timeframes
- **Data Integrity**: Consistent structure across all data sources
- **Performance**: Optimized for common access patterns

## Data Collection Methods

Planar provides multiple methods for collecting market data, each optimized for different use cases:

| Method | Use Case | Speed | Data Range | Rate Limits |
|--------|----------|-------|------------|-------------|
| **Scrapers** | Historical bulk data | Fast | Months/Years | None |
| **Fetch** | Recent data, gap filling | Medium | Days/Weeks | High |
| **Watchers** | Real-time streaming | Real-time | Live only | Low |

### Choosing the Right Method

- **Use Scrapers** for initial historical data collection and [backtesting](../guides/execution-modes.md#simulationmode) datasets
- **Use Fetch** for recent data updates and filling gaps in historical data
- **Use Watchers** for [live trading](../guides/execution-modes.md#live-mode) and real-time analysis

**⚠️ Data collection issues?** See [Performance Issues: Data-Related](../troubleshooting/performance-issues.md#data-related-performance-issues) for slow loading and database problems, or [Exchange Issues](../troubleshooting/exchange-issues.md) for connectivity problems.

## Historical Data Collection

The Scrapers module provides access to historical data archives from major exchanges, offering the most efficient method for obtaining large amounts of historical data.
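As a minimal sketch of this workflow, the snippet below downloads and loads Binance archive data. The `BinanceData.binancedownload`/`binanceload` names and the `market`, `freq`, and `kind` keywords are assumptions about the Scrapers API, not verified signatures:

```julia
# Activate Planar project (following the doc's convention)
import Pkg
Pkg.activate("Planar")

try
    using Planar
    @environment!  # assumed to bring the Scrapers module into scope

    # Download monthly kline archives for a symbol from Binance.
    # NOTE: function names and keywords here are assumptions about
    # the Scrapers module API; adjust to your Planar version.
    Scrapers.BinanceData.binancedownload("btc"; market=:data, freq=:monthly, kind=:klines)

    # Load the downloaded archives into an OHLCV dataframe
    df = Scrapers.BinanceData.binanceload("btc"; market=:data, freq=:monthly, kind=:klines)
    println(first(df, 5))
catch e
    @warn "Planar not available: $e"
end
```

Because downloads are cached, re-running the same call only fetches newer archives; pass `reset=true` to force a full redownload if data is corrupted.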
**Supported Exchanges**: Binance and Bybit archives

### Basic Scraper Usage

### Market Types and Frequencies

### Advanced Scraper Examples

### Error Handling and Data Validation

### Bybit Scrapers

!!! warning "Download Caching"
    Downloads are cached - requesting the same pair path again will only download newer archives.
    If data becomes corrupted, pass `reset=true` to force a complete redownload.

!!! tip "Performance Optimization"
    - **Monthly Archives**: Use for historical [backtesting](../guides/execution-modes.md#simulationmode) (faster download, larger chunks)
    - **Daily Archives**: Use for recent data or frequent updates
    - **Parallel Downloads**: Consider for multiple symbols, but respect [exchange](../exchanges.md) rate limits

## Real-Time Data Fetching

The Fetch module downloads data directly from exchanges using [CCXT](../exchanges.md#ccxt-integration), making it ideal for:

- Getting the most recent market data
- Filling gaps in historical data
- Real-time data updates for [live trading](../guides/execution-modes.md#live-mode)

### Basic Fetch Usage

### Advanced Fetch Examples

### Multi-Exchange Data Collection

### Rate Limit Management

### Data Validation and Quality Checks

!!! warning "Rate Limit Considerations"
    Direct [exchange](../exchanges.md) fetching is heavily rate-limited, especially for smaller timeframes.
    Use archives for bulk historical data collection.

!!! tip "Fetch Best Practices"
    - **Recent Updates**: Use fetch for recent data updates and gap filling
    - **Rate Limiting**: Implement delays between requests to respect exchange limits
    - **Data Validation**: Always validate fetched data before using in [strategies](../guides/strategy-development.md)
    - **Raw Data**: Use `fetch_candles` for unchecked data when you need raw exchange responses

## Live Data Streaming

The Watchers module enables real-time data tracking from exchanges and other sources, storing data locally for:

- Live trading operations
- Real-time data analysis
- Continuous market monitoring

### OHLCV Ticker Watcher

The ticker watcher monitors multiple pairs simultaneously using exchange ticker endpoints:

```julia
# Activate Planar project
import Pkg
Pkg.activate("Planar")

try
    using Planar
    @environment!

    # Example watcher output (this would be the result of displaying a watcher)
    println("Example watcher display:")
    println("17-element Watchers.Watcher20{Dict{String, NamedTup...Nothing, Float64}, Vararg{Float64, 7}}}}")
    println("Name: ccxt_ohlcv_ticker")
    println("Intervals: 5 seconds(TO), 5 seconds(FE), 6 minutes(FL)")
    println("Fetched: 2023-03-07T12:06:18.690 busy: true")
    println("Flushed: 2023-03-07T12:04:31.472")
    println("Active: true")
    println("Attempts: 0")

    # Note: In real usage, 'w' would be an actual watcher instance
    # w = create_watcher(...)
    # This would create the actual watcher

catch e
    @warn "Planar not available: $e"
end
```

As a convention, the `view` property of a watcher shows the processed data:

### Single-Pair OHLCV Watcher

There is another OHLCV watcher, based on trades, that tracks only one pair at a time with higher precision:

### Advanced Watcher Configuration

### Watcher Management

### Orderbook Watcher

### Custom Data Processing

### Error Handling and Resilience

### Data Persistence and Storage

!!! tip "Watcher Best Practices"
    - Monitor watcher health regularly with `isrunning(w)`
    - Implement proper error handling and reconnection logic
    - Save data periodically to prevent loss during interruptions
    - Use appropriate fetch intervals to balance data freshness with rate limits
    - Consider using multiple watchers for redundancy in critical applications

## Custom Data Sources

Assuming you have your own pipeline to fetch candles, you can use the functions [`Planar.Data.save_ohlcv`](@ref) and [`Planar.Data.load_ohlcv`](@ref) to manage the data.

### Basic Custom Data Integration

Saving is easiest if you pass a standard OHLCV dataframe; otherwise you need to provide a `saved_col` argument that indicates the correct column index to use as the `timestamp` column (or use lower-level functions).

To load the data back:

### Advanced Custom Data Examples

### Custom Data Validation

### Working with Large Custom Datasets

## Generic Data Storage

If you want to save other kinds of data, there are the [`Planar.Data.save_data`](@ref) and [`Planar.Data.load_data`](@ref) functions. Unlike the OHLCV functions, these don't check for contiguity, so it is possible to store sparse data.
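A sketch of storing sparse records with these functions is shown below. The key name and values are illustrative, and the exact argument order of `save_data`/`load_data` is an assumption; consult the API reference for the verified signatures:

```julia
# Activate Planar project (following the doc's convention)
import Pkg
Pkg.activate("Planar")

try
    using Planar
    using DataFrames
    using Dates
    @environment!

    # Sparse (non-contiguous) records: a timestamp column is still required,
    # but the gaps between rows are allowed since no contiguity check runs.
    df = DataFrame(
        timestamp = [DateTime(2023, 1, 1), DateTime(2023, 1, 5), DateTime(2023, 2, 1)],
        value = [1.0, 2.5, 3.2],
    )

    # NOTE: argument order and key naming are assumptions for illustration.
    Data.save_data(Data.zi[], "my_indicator", df)
    loaded = Data.load_data(Data.zi[], "my_indicator")
catch e
    @warn "Planar not available: $e"
end
```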
The data, however, still requires a timestamp column: since saved data can either be prepended or appended, an index must be available to maintain order.

### Serialized Data Storage

While OHLCV data requires a concrete type for storage (`Float64` by default), generic data can either be saved with a shared type or serialized. To serialize the data while saving, pass `serialize=true` to `save_data`; to load serialized data, pass `serialized=true` to `load_data`.

!!! warning "Data Contiguity"
    OHLCV save/load functions validate timestamp contiguity by default. Use `check=:none` to disable validation for irregular data.

!!! tip "Performance Optimization"
    - Use progressive loading (`raw=true`) for large datasets to avoid memory issues
    - Process data in chunks when dealing with very large time series
    - Consider serialization for complex data structures that don't fit standard numeric types

## Data Access Patterns

The Data module implements dataframe indexing by dates, so you can conveniently access rows by date:

### Advanced Indexing Examples

### Timeframe Management

With OHLCV data, you can access the timeframe of the series directly from the dataframe by calling `timeframe!(df)`. This will either return the previously set timeframe or infer it from the `timestamp` column. You can set the timeframe by calling e.g. `timeframe!(df, tf"1m")`, or `timeframe!!` to overwrite it.

### Efficient Data Slicing

### Progressive Data Loading

When loading data from storage, you can use the `ZArray` directly by passing `raw=true` to `load_ohlcv`, or `as_z=true` or `with_z=true` to `load_data`. By managing the array directly you avoid materializing the entire dataset, which matters when dealing with large amounts of data.
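The progressive pattern described above can be sketched as follows. The exchange name, pair, and chunk size are illustrative, and the assumption is that `raw=true` returns a `ZArray` indexed with rows first:

```julia
# Activate Planar project (following the doc's convention)
import Pkg
Pkg.activate("Planar")

try
    using Planar
    @environment!

    # Load the underlying ZArray instead of materializing a DataFrame.
    z = Data.load_ohlcv(Data.zi[], "binance", "BTC/USDT", "1m"; raw=true)

    # Process the array in fixed-size chunks to bound memory usage.
    chunk = 100_000
    for start in 1:chunk:size(z, 1)
        stop = min(start + chunk - 1, size(z, 1))
        window = z[start:stop, :]  # only this slice is read from storage
        # ... compute on `window` here ...
    end
catch e
    @warn "Planar not available: $e"
end
```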
### Data Aggregation and Resampling

```julia
# Activate Planar project
import Pkg
Pkg.activate("Planar")

try
    using Planar
    using DataFrames
    using Dates
    @environment!

    # Aggregate OHLCV data to a coarser timeframe
    function resample_ohlcv(data, target_timeframe)
        df = copy(data)  # avoid mutating the input dataframe
        # Group rows by the target timeframe period
        df.period = floor.(df.timestamp, target_timeframe)

        aggregated = combine(groupby(df, :period)) do group
            (
                timestamp = first(group.timestamp),
                open = first(group.open),
                high = maximum(group.high),
                low = minimum(group.low),
                close = last(group.close),
                volume = sum(group.volume),
            )
        end

        select!(aggregated, Not(:period))
        return aggregated
    end

    # Example: convert 1m data to 5m
    minute_data = Data.load_ohlcv(Data.zi[], "binance", "BTC/USDT", "1m")
    five_min_data = resample_ohlcv(minute_data, Minute(5))
catch e
    @warn "Planar or DataFrames not available: $e"
end
```

Data is returned as a `DataFrame` with `open`, `high`, `low`, `close`, `volume`, and `timestamp` columns.
Since these save/load functions require a timestamp column, they check that the provided index is contiguous: it should have no missing timestamps according to the subject timeframe. These checks can be disabled by passing `check=:none`.

This comprehensive data management guide provides everything you need to efficiently collect, store, and access market data in Planar. Start with the basic collection methods and gradually explore more advanced features as your data requirements grow.