# Data Management

The Data module provides comprehensive storage and management of [OHLCV](guides/data-management.md#ohlcv-data) (Open, High, Low, Close, Volume) data and other time-series [market data](guides/data-management.md).

## Quick Navigation

- **[Storage Architecture](#storage-architecture)** - Understanding Zarr and LMDB backends
- **[Historical Data](#historical-data-with-scrapers)** - Using Scrapers for bulk data collection
- **[Real-Time Data](#real-time-data-with-fetch)** - Fetching live data from exchanges
- **[Live Streaming](#live-data-streaming-with-watchers)** - Continuous data monitoring

## Prerequisites

- Basic understanding of [OHLCV data concepts](getting-started/index.md)
- Familiarity with [Exchange setup](exchanges.md)

## Related Topics

- **[Strategy Development](guides/strategy-development.md)** - Using data in trading strategies
- **[Watchers](watchers/watchers.md)** - Real-time data monitoring
- **[Processing](API/processing.md)** - Data transformation and analysis

## Storage Architecture

### Zarr Backend

Planar uses **Zarr** as its primary storage backend, which offers several advantages:

- **Columnar Storage**: Optimized for array-based data, similar to Feather or Parquet
- **Flexible Encoding**: Supports different compression and encoding schemes
- **Storage Agnostic**: Can be backed by various storage layers, including network-based systems
- **Chunked Access**: Efficient for time-series queries despite chunk-based reading

The framework wraps a Zarr subtype of `AbstractStore` in a [`Planar.Data.ZarrInstance`](@ref). The global `ZarrInstance` is accessible at `Data.zi[]`, with LMDB as the default underlying store.
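As a minimal sketch of how the global store can be reached from an interactive session (assuming Planar is loaded; only `Data.zi[]` and the `ZarrInstance` wrapper are taken from the text above, the rest is illustrative):

```julia
using Planar

# The global ZarrInstance referenced above; it wraps a Zarr
# `AbstractStore` subtype, with LMDB as the default backing store.
zi = Planar.Data.zi[]
@show typeof(zi)
```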
### Data Organization

[OHLCV data](guides/data-management.md#ohlcv-data) is organized hierarchically using [`Planar.Data.key_path`](@ref):

## Data Architecture Overview

The Data module provides a comprehensive [data management](guides/data-management.md) system with the following key components:

- **Storage Backend**: Zarr arrays with LMDB as the default store
- **Data Organization**: Hierarchical structure by exchange/source, pair, and timeframe
- **Data Types**: [OHLCV data](guides/data-management.md#ohlcv-data), generic time-series data, and cached metadata
- **Access Patterns**: Progressive loading for large datasets, contiguous time-series validation
- **Performance**: Chunked storage, compression, and optimized indexing

### Storage Hierarchy

Data is organized in a hierarchical structure:

```
ZarrInstance/
├── exchange_name/
│   ├── pair_name/
│   │   ├── timeframe/
│   │   │   ├── timestamp
│   │   │   ├── open
│   │   │   ├── high
│   │   │   ├── low
│   │   │   ├── close
│   │   │   └── volume
│   │   └── ...
│   └── ...
└── ...
```

## Data Collection Methods

Planar provides multiple methods for collecting market data, each optimized for different use cases.

## Historical Data with Scrapers

The Scrapers module provides access to historical data archives from major exchanges, offering the most efficient method for obtaining large amounts of historical data.

**Supported Exchanges**: Binance and Bybit archives

### Basic Scraper Usage

### Advanced Scraper Examples

Download multiple symbols and filter by quote currency using `bn.binancesyms()` and `scr.selectsyms()`.

### Market Types and Frequencies

Use different market types (`:spot`, `:um`, `:cm`), frequencies (`:daily`, `:monthly`), and data kinds (`:klines`, `:trades`, `:aggTrades`).

### Error Handling and Data Validation
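A hedged sketch of an archive download with basic error handling. Only `bn.binancesyms`, `scr.selectsyms`, the `:um`/`:monthly`/`:klines` symbols, and the `reset=true` keyword appear in this document; the import paths, the `binancedownload` function, and all argument shapes are assumptions for illustration:

```julia
using Scrapers: Scrapers as scr, BinanceData as bn  # import paths assumed

# Select USDT-margined symbols matching our bases (signature assumed).
syms = scr.selectsyms(["BTC", "ETH"], bn.binancesyms(market=:um))

try
    # Hypothetical bulk download of monthly kline archives.
    bn.binancedownload(syms; market=:um, freq=:monthly, kind=:klines)
catch e
    @warn "archive download failed, forcing a full redownload" exception = e
    # `reset=true` discards cached archives when they are corrupted (see below).
    bn.binancedownload(syms; market=:um, freq=:monthly, kind=:klines, reset=true)
end
```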
!!! warning "Download Caching"
    Downloads are cached - requesting the same pair path again will only download newer archives.
    If data becomes corrupted, pass `reset=true` to force a complete redownload.

!!! tip "Performance Optimization"
    - **Monthly Archives**: Use for historical [backtesting](guides/execution-modes.md#simulation-mode) (faster download, larger chunks)
    - **Daily Archives**: Use for recent data or frequent updates
    - **Parallel Downloads**: Consider for multiple symbols, but respect [exchange](exchanges.md) rate limits

## Real-Time Data with Fetch

The Fetch module downloads data directly from exchanges using [CCXT](exchanges.md#ccxt-integration), making it ideal for:

- Getting the most recent market data
- Filling gaps in historical data
- Real-time data updates for [live trading](guides/execution-modes.md#live-mode)

### Basic Fetch Usage

### Advanced Fetch Examples

### Multi-Exchange Data Collection

### Rate Limit Management

Use delays between requests and validate data quality. Implement error handling for failed requests.

!!! warning "Rate Limit Considerations"
    Direct exchange fetching is heavily rate-limited, especially for smaller [timeframes](guides/data-management.md#timeframes).
    Use archives for bulk historical data collection.
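The rate-limit management advice above can be sketched as a simple fetch loop. Only `fetch_candles` (for raw, unchecked responses) is named in this document; the exchange handle `exc`, the `tf"1m"` timeframe literal, and the argument order are assumptions:

```julia
using Planar

pairs = ["BTC/USDT", "ETH/USDT"]
for pair in pairs
    try
        # Raw, unchecked exchange response (validate before strategy use).
        fetch_candles(exc, tf"1m", pair)
    catch e
        @warn "fetch failed" pair exception = e
    end
    sleep(1.0)  # simple delay between requests to respect exchange limits
end
```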
!!! tip "Fetch Best Practices"
    - **Recent Updates**: Use fetch for recent data updates and gap filling
    - **Rate Limiting**: Implement delays between requests to respect exchange limits
    - **Data Validation**: Always validate fetched data before using in [strategies](guides/strategy-development.md)
    - **Raw Data**: Use `fetch_candles` for unchecked data when you need raw exchange responses

## Live Data Streaming with Watchers

The Watchers module enables real-time data tracking from exchanges and other sources, storing data locally for:

- Live trading operations
- Real-time data analysis
- Continuous market monitoring

### OHLCV Ticker Watcher

The ticker watcher monitors multiple pairs simultaneously using exchange ticker endpoints.

As a convention, the `view` property of a watcher shows the processed data. In this case, the candles processed by the `ohlcv_ticker_watcher` will be stored in a dict.

### Single-Pair OHLCV Watcher

There is another OHLCV watcher, based on trades, that tracks only one pair at a time with higher precision.

### Watcher Configuration

Configure watchers with custom intervals using the `timeout_interval`, `fetch_interval`, and `flush_interval` parameters. Use `wc.start!()` and `wc.stop!()` for lifecycle management.

### Orderbook Watcher

### Custom Data Processing

### Error Handling and Resilience

### Data Persistence and Storage

Other implemented watchers include the orderbook watcher and watchers that parse data feeds from third-party APIs.
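A hedged lifecycle sketch for the ticker watcher. The `ohlcv_ticker_watcher` name, the `view` convention, and the `start!`/`stop!`/`isrunning` calls come from this document; the constructor arguments and the `exc` exchange handle are assumptions:

```julia
using Planar

# Hypothetical construction: arguments depend on the actual watcher API.
wc = ohlcv_ticker_watcher(exc)

wc.start!()
sleep(60)            # let it collect for a while

# By convention, `view` exposes the processed data:
# here, a dict of candles keyed by pair.
candles = wc.view

wc.isrunning() && wc.stop!()
```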
!!! tip "Watcher Best Practices"
    - Monitor watcher health regularly with `wc.isrunning()`
    - Implement proper error handling and reconnection logic
    - Save data periodically to prevent loss during interruptions
    - Use appropriate fetch intervals to balance data freshness with rate limits
    - Consider using multiple watchers for redundancy in critical applications

## Custom Data Sources

Assuming you have your own pipeline to fetch candles, you can use the functions [`Planar.Data.save_ohlcv`](@ref) and [`Planar.Data.load_ohlcv`](@ref) to manage the data.

### Basic Custom Data Integration

Saving is easiest if you pass a standard OHLCV dataframe; otherwise you need to provide a `saved_col` argument that indicates the correct column index to use as the `timestamp` column (or use lower-level functions).

To load the data back:

### Advanced Custom Data Examples

### Custom Data Validation

### Working with Large Custom Datasets

### Generic Data Storage

If you want to save other kinds of data, use the [`Planar.Data.save_data`](@ref) and [`Planar.Data.load_data`](@ref) functions. Unlike the OHLCV functions, these don't check for contiguity, so it is possible to store sparse data. The data still requires a timestamp column, however: saved data can either be prepended or appended, so an index must be available to maintain order.

### Serialized Data Storage

While OHLCV data requires a concrete type for storage (`Float64` by default), generic data can either be saved with a shared type or serialized. To serialize the data while saving, pass the `serialize=true` argument to `save_data`; to load serialized data, pass `serialized=true` to `load_data`.
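The save/load workflow described above can be sketched as follows. The function names and the `serialize=true`/`serialized=true` keywords come from this document; the argument order (store, source, pair, timeframe) and the key naming are assumptions for illustration:

```julia
using Planar
using DataFrames, Dates

# A standard OHLCV dataframe from a custom pipeline.
df = DataFrame(
    timestamp = collect(DateTime(2024, 1, 1):Minute(1):DateTime(2024, 1, 1, 0, 59)),
    open = rand(60), high = rand(60), low = rand(60),
    close = rand(60), volume = rand(60),
)

zi = Planar.Data.zi[]
Planar.Data.save_ohlcv(zi, "mysource", "BTC/USDT", "1m", df)   # contiguity checked
df2 = Planar.Data.load_ohlcv(zi, "mysource", "BTC/USDT", "1m")

# Generic (possibly sparse) payloads: no contiguity check, optional serialization.
payload = [(now(), Dict("note" => "arbitrary metadata"))]
Planar.Data.save_data(zi, "mysource", "mykey", payload; serialize=true)
obj = Planar.Data.load_data(zi, "mysource", "mykey"; serialized=true)
```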
### Progressive Data Loading

When loading data from storage, you can use the `ZArray` directly by passing `raw=true` to `load_ohlcv`, or `as_z=true` or `with_z=true` to `load_data`. By managing the array directly you avoid materializing the entire dataset, which is necessary when dealing with large amounts of data.

Data is returned as a `DataFrame` with `open,high,low,close,volume,timestamp` columns.
Since these save/load functions require a timestamp column, they check that the provided index is contiguous: it should not have missing timestamps, according to the subject timeframe. You can disable those checks by passing `check=:none`.

!!! warning "Data Contiguity"
    OHLCV save/load functions validate timestamp contiguity by default. Use `check=:none` to disable validation for irregular data.

!!! tip "Performance Optimization"
    - Use progressive loading (`raw=true`) for datasets larger than available memory
    - Process data in chunks when dealing with very large time series
    - Consider serialization for complex data structures that don't fit standard numeric types

## Data Indexing and Access Patterns

The Data module implements dataframe indexing by dates, so you can conveniently access rows by date or date range.

### Advanced Indexing Examples

### Timeframe Management

With OHLCV data, you can access the timeframe of the series directly from the dataframe by calling `timeframe!(df)`. This either returns the previously set timeframe or infers it from the `timestamp` column. You can set the timeframe by calling e.g. `timeframe!(df, tf"1m")`, or `timeframe!!` to overwrite it.

### Efficient Data Slicing

### Data Aggregation and Resampling

## Caching and Performance Optimization

`Data.Cache.save_cache` and `Data.Cache.load_cache` can be used to store generic metadata like JSON payloads.
The data is saved in the Planar data directory, which is under `XDG_CACHE_DIR` if set, or under `$HOME/.cache` by default.

### Basic Caching Usage

### Advanced Caching Examples

### Performance Optimization Strategies

### Cache Management

### Storage Configuration Optimization

## Data Processing and Transformation

The Data module provides comprehensive tools for processing and transforming financial data. This section covers data cleaning, validation, and transformation techniques.

### Data Cleaning and Validation

### Gap Detection and Filling

### Data Transformation and Feature Engineering

## Storage Configuration and Optimization

This section covers advanced storage configuration, optimization techniques, and troubleshooting for the Zarr/LMDB backend.

### Zarr Storage Configuration

### LMDB Configuration and Tuning

### Storage Optimization Strategies

### Data Validation and Integrity

### Troubleshooting Storage Issues
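As an illustration of the progressive-loading pattern described earlier, large series can be walked in chunks instead of being materialized at once. The `raw=true` keyword and the `ZArray` return type come from this document; the `load_ohlcv` argument order, the exchange/pair names, and the slicing loop are assumptions:

```julia
using Planar

# `raw=true` returns the backing ZArray instead of a DataFrame.
za = Planar.Data.load_ohlcv(Planar.Data.zi[], "binance", "BTC/USDT", "1m"; raw=true)

chunk = 10_000
for start in 1:chunk:size(za, 1)
    stop = min(start + chunk - 1, size(za, 1))
    rows = za[start:stop, :]   # materialize only one chunk at a time
    # ... process `rows` ...
end
```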
!!! tip "Performance Best Practices"
    - Use progressive loading (`raw=true`) for datasets larger than available memory
    - Implement caching for expensive computations with appropriate TTL
    - Monitor cache size and clean up old entries regularly
    - Use chunked processing for very large datasets
    - Consider serialization for complex data structures that don't fit standard numeric types

## Real-Time Data Pipelines and Monitoring

This section covers advanced real-time data collection, processing, and monitoring using the Watchers system.

### Real-Time Data Pipeline Architecture

### Advanced Watcher Management

### Real-Time Data Processing

### Monitoring and Alerting

### Data Quality Monitoring

### Complete Pipeline Example

!!! warning "Storage Considerations"
    - Always backup data before performing repair operations
    - Monitor disk space regularly, especially when using compression
    - Validate data integrity periodically to catch corruption early
    - Use appropriate LMDB map sizes to avoid out-of-space errors

!!! tip "Real-Time Data Best Practices"
    - Implement comprehensive monitoring and alerting for production systems
    - Use multiple watchers per exchange for redundancy
    - Monitor data quality continuously to catch issues early
    - Implement automatic restart mechanisms for failed watchers
    - Cache processed data for quick access by trading [strategies](guides/strategy-development.md)
    - Set up proper logging and error handling for debugging issues
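The automatic-restart practice above can be sketched as a small health-check loop. The `isrunning`/`start!` calls come from this document; the watcher collection, its construction, and the intervals are illustrative assumptions:

```julia
using Planar

# `wc` is a previously constructed watcher (e.g. an OHLCV ticker watcher);
# the dict shape is illustrative only.
watchers = Dict("binance" => wc)

while true
    for (name, w) in watchers
        if !w.isrunning()
            @warn "watcher down, restarting" name
            w.start!()
        end
    end
    sleep(30)   # health-check interval
end
```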