# Data Management Guide

This comprehensive guide covers Planar's data management system for OHLCV (Open, High, Low, Close, Volume) data and other time-series market data. Learn how to efficiently collect, store, and access market data using multiple collection methods and storage backends.

## Quick Navigation

- **[Storage Architecture](#storage-architecture)** - Understanding Zarr and LMDB backends
- **[Data Collection Methods](#data-collection-methods)** - Overview of collection approaches
- **[Historical Data](#historical-data-collection)** - Using Scrapers for bulk data collection
- **[Real-Time Data](#real-time-data-fetching)** - Fetching live data from [exchanges](../exchanges.md)
- **[Live Streaming](#live-data-streaming)** - Continuous data monitoring with Watchers
- **[Custom Data Sources](#custom-data-sources)** - Integrating your own data
- **[Data Access Patterns](#data-access-patterns)** - Efficient data querying and indexing
- **[Optimization](../optimization.md)** - Caching and optimization [strategies](../guides/strategy-development.md)
- **[Troubleshooting](../troubleshooting/index.md)** - Common issues and solutions

## Prerequisites

- Basic understanding of [OHLCV data concepts](../getting-started/index.md)
- Familiarity with [exchanges](../exchanges.md)

## Related Topics

- **[Strategy Development](strategy-development.md)** - Using data in trading strategies
- **[Watchers](../watchers/watchers.md)** - Real-time data monitoring
- **[Processing](../API/processing.md)** - Data transformation and analysis

## Storage Architecture

### Zarr Backend

Planar uses **Zarr** as its primary storage backend, which offers several advantages for time-series data:

- **Columnar Storage**: Optimized for array-based data, similar to Feather or Parquet
- **Flexible Encoding**: Supports different compression and encoding schemes
- **Storage Agnostic**: Can be backed by various storage layers, including network-based systems
- **Chunked Access**: Efficient for time-series queries despite chunk-based reading
- **Scalability**: Handles large datasets with progressive loading capabilities

The framework wraps a Zarr subtype of `AbstractStore` in a [`Planar.Data.ZarrInstance`](@ref). The global `ZarrInstance` is accessible at `Data.zi[]`, with LMDB as the default underlying store.
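
For example, the global instance can be retrieved directly once the Planar environment is loaded (a minimal sketch; most of the `Data` functions in this guide accept it as their first argument):

```julia
using Planar
@environment!

# The global ZarrInstance, backed by LMDB by default
zi = Data.zi[]
```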

### Data Organization

OHLCV data is organized hierarchically using [`Planar.Data.key_path`](@ref):

```
ZarrInstance/
├── exchange_name/
│   ├── pair_name/
│   │   ├── timeframe/
│   │   │   ├── timestamp
│   │   │   ├── open
│   │   │   ├── high
│   │   │   ├── low
│   │   │   ├── close
│   │   │   └── volume
│   │   └── ...
│   └── ...
└── ...
```

### Storage Hierarchy Benefits

This hierarchical organization provides:

- **Logical Grouping**: Data organized by source, instrument, and timeframe
- **Efficient Queries**: Fast access to specific data subsets
- **Scalability**: Easy addition of new [exchanges](../exchanges.md), pairs, and timeframes
- **Data Integrity**: Consistent structure across all data sources
- **Performance**: Optimized for common access patterns

## Data Collection Methods

Planar provides multiple methods for collecting market data, each optimized for different use cases:

| Method | Use Case | Speed | Data Range | Rate Limits |
|--------|----------|-------|------------|-------------|
| **Scrapers** | Historical bulk data | Fast | Months/Years | None |
| **Fetch** | Recent data, gap filling | Medium | Days/Weeks | High |
| **Watchers** | Real-time streaming | Real-time | Live only | Low |

### Choosing the Right Method

- **Use Scrapers** for initial historical data collection and [backtesting](../guides/execution-modes.md#simulationmode) datasets
- **Use Fetch** for recent data updates and filling gaps in historical data
- **Use Watchers** for [live trading](../guides/execution-modes.md#live-mode) and real-time analysis

**⚠️ Data collection issues?** See [Performance Issues: Data-Related](../troubleshooting/performance-issues.md#data-related-performance-issues) for slow loading and database problems, or [Exchange Issues](../troubleshooting/exchange-issues.md) for connectivity problems.

## Historical Data Collection

The Scrapers module provides access to historical data archives from major exchanges, offering the most efficient method for obtaining large amounts of historical data.

**Supported Exchanges**: Binance and Bybit archives

### Basic Scraper Usage
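
A minimal sketch of downloading and loading Binance archive data. The helper names `binancedownload`/`binanceload` and their keyword arguments are assumptions about the Scrapers module; check the Scrapers API reference for the exact signatures:

```julia
using Planar
@environment!

# Download archives for a symbol (downloads are cached across runs)
# NOTE: function names are assumed; verify against the Scrapers API
Scrapers.BinanceData.binancedownload("btc"; quote_currency="usdt")

# Load the downloaded archives into an OHLCV DataFrame
df = Scrapers.BinanceData.binanceload("btc"; quote_currency="usdt")
```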

### Market Types and Frequencies

### Advanced Scraper Examples

### Error Handling and Data Validation

### Bybit Scrapers
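
Bybit archives can be sketched the same way. Again, `bybitdownload`/`bybitload` are assumed to mirror the Binance helpers; only the `reset=true` flag is documented below, so verify the rest against the Scrapers API:

```julia
using Planar
@environment!

# NOTE: function names are assumed; verify against the Scrapers API
Scrapers.BybitData.bybitdownload("eth")   # fetch archives (cached)
df = Scrapers.BybitData.bybitload("eth")  # load as an OHLCV DataFrame

# Force a complete redownload if local archives become corrupted
Scrapers.BybitData.bybitdownload("eth"; reset=true)
```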

!!! warning "Download Caching"
    Downloads are cached - requesting the same pair path again will only download newer archives.
    If data becomes corrupted, pass `reset=true` to force a complete redownload.

!!! tip "Performance Optimization"
    - **Monthly Archives**: Use for historical [backtesting](../guides/execution-modes.md#simulationmode) (faster download, larger chunks)
    - **Daily Archives**: Use for recent data or frequent updates
    - **Parallel Downloads**: Consider for multiple symbols, but respect [exchange](../exchanges.md) rate limits

## Real-Time Data Fetching

The Fetch module downloads data directly from exchanges using [CCXT](../exchanges.md#ccxt-integration), making it ideal for:

- Getting the most recent market data
- Filling gaps in historical data
- Real-time data updates for [live trading](../guides/execution-modes.md#live-mode)

### Basic Fetch Usage
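
A hedged sketch of fetching recent candles through CCXT. Only `fetch_candles` is named explicitly later in this guide; `fetch_ohlcv` and `getexchange!` are assumptions about the Fetch and Exchanges APIs, so verify the exact names and signatures against the API reference:

```julia
using Planar
@environment!

# NOTE: `getexchange!` and `fetch_ohlcv` are assumed helpers
exc = getexchange!(:binance)

# Fetch the most recent 1m candles for a pair (checked and stored)
data = fetch_ohlcv(exc, tf"1m", ["BTC/USDT"]; from=-1000)

# Raw, unchecked candles straight from the exchange response
raw = fetch_candles(exc, tf"1m", ["BTC/USDT"]; from=-1000)
```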

### Advanced Fetch Examples

### Multi-Exchange Data Collection

### Rate Limit Management

### Data Validation and Quality Checks

!!! warning "Rate Limit Considerations"
    Direct [exchange](../exchanges.md) fetching is heavily rate-limited, especially for smaller timeframes.
    Use archives for bulk historical data collection.

!!! tip "Fetch Best Practices"
    - **Recent Updates**: Use fetch for recent data updates and gap filling
    - **Rate Limiting**: Implement delays between requests to respect exchange limits
    - **Data Validation**: Always validate fetched data before using in [strategies](../guides/strategy-development.md)
    - **Raw Data**: Use `fetch_candles` for unchecked data when you need raw exchange responses

## Live Data Streaming

The Watchers module enables real-time data tracking from exchanges and other sources, storing data locally for:

- Live trading operations
- Real-time data analysis
- Continuous market monitoring

### OHLCV Ticker Watcher

The ticker watcher monitors multiple pairs simultaneously using exchange ticker endpoints:

```julia
# Activate Planar project
import Pkg
Pkg.activate("Planar")

try
    using Planar
    @environment!

    # Example watcher output (this would be the result of displaying a watcher)
    println("Example watcher display:")
    println("17-element Watchers.Watcher20{Dict{String, NamedTup...Nothing, Float64}, Vararg{Float64, 7}}}}")
    println("Name: ccxt_ohlcv_ticker")
    println("Intervals: 5 seconds(TO), 5 seconds(FE), 6 minutes(FL)")
    println("Fetched: 2023-03-07T12:06:18.690 busy: true")
    println("Flushed: 2023-03-07T12:04:31.472")
    println("Active: true")
    println("Attempts: 0")

    # Note: in real usage, `w` would be an actual watcher instance
    # w = create_watcher(...)  # this would create the actual watcher

catch e
    @warn "Planar not available: $e"
end
```

As a convention, the `view` property of a watcher shows the processed data:
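
For instance (a sketch assuming `w` is a running ticker watcher; that `view` is keyed by pair and holds OHLCV frames is an assumption about the watcher's processed layout):

```julia
# Processed data accumulated by the watcher
processed = w.view

# Assumed layout: an OHLCV DataFrame per tracked pair
btc_ohlcv = processed["BTC/USDT"]
```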

### Single-Pair OHLCV Watcher

There is another OHLCV watcher, based on trades, which tracks only one pair at a time with higher precision.

### Advanced Watcher Configuration

### Watcher Management

### Orderbook Watcher

### Custom Data Processing

### Error Handling and Resilience

### Data Persistence and Storage

!!! tip "Watcher Best Practices"
    - Monitor watcher health regularly with `isrunning(w)`
    - Implement proper error handling and reconnection logic
    - Save data periodically to prevent loss during interruptions
    - Use appropriate fetch intervals to balance data freshness with rate limits
    - Consider using multiple watchers for redundancy in critical applications

## Custom Data Sources

Assuming you have your own pipeline to fetch candles, you can use the functions [`Planar.Data.save_ohlcv`](@ref) and [`Planar.Data.load_ohlcv`](@ref) to manage the data.

### Basic Custom Data Integration

To save the data, it is easiest to pass a standard OHLCV dataframe; otherwise you need to provide a `saved_col` argument that indicates the correct column index to use as the `timestamp` column (or use lower-level functions).
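
A minimal sketch with a hand-built standard OHLCV frame. The `save_ohlcv` argument order here mirrors the `load_ohlcv` call shown at the end of this guide and is an assumption; `"mysource"` is a hypothetical source key:

```julia
using Planar
using DataFrames, Dates
@environment!

# A standard OHLCV frame: timestamp, open, high, low, close, volume
df = DataFrame(
    timestamp = collect(DateTime(2024, 1, 1):Minute(1):DateTime(2024, 1, 1, 0, 59)),
    open = rand(60), high = rand(60), low = rand(60),
    close = rand(60), volume = rand(60),
)

# NOTE: signature assumed to mirror load_ohlcv; "mysource" is hypothetical
Data.save_ohlcv(Data.zi[], "mysource", "BTC/USDT", "1m", df)
```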

To load the data back:
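
For example (the signature matches the `load_ohlcv` call shown at the end of this guide; `"mysource"` is a hypothetical source key under which data was previously saved):

```julia
df = Data.load_ohlcv(Data.zi[], "mysource", "BTC/USDT", "1m")
```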

### Advanced Custom Data Examples

### Custom Data Validation

### Working with Large Custom Datasets

## Generic Data Storage

If you want to save other kinds of data, use the [`Planar.Data.save_data`](@ref) and [`Planar.Data.load_data`](@ref) functions. Unlike the OHLCV functions, these don't check for contiguity, so it is possible to store sparse data. The data still requires a timestamp column, however, because saved data can either be prepended or appended, so an index must be available to maintain order.
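
A sketch of storing sparse, timestamped records. The exact `save_data`/`load_data` signatures are assumed to parallel the OHLCV functions, and `"mysource"`/`"events"` are hypothetical keys; verify against the Data API reference:

```julia
using Planar
using DataFrames, Dates
@environment!

# Sparse data is fine: no contiguity check, but a timestamp column is required
events = DataFrame(
    timestamp = [DateTime(2024, 1, 1), DateTime(2024, 1, 5)],  # gaps allowed
    value = [1.0, 2.5],
)

# NOTE: signatures and keys are assumptions
Data.save_data(Data.zi[], "mysource", "events", events)
loaded = Data.load_data(Data.zi[], "mysource", "events")
```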

### Serialized Data Storage

While OHLCV data requires a concrete type for storage (default `Float64`), generic data can either be saved with a shared type or serialized. To serialize the data while saving, pass `serialize=true` to `save_data`; to load serialized data, pass `serialized=true` to `load_data`.
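
A sketch using the documented `serialize`/`serialized` flags (the surrounding `save_data`/`load_data` signatures and the `"mysource"`/`"meta"` keys remain assumptions):

```julia
# Store heterogeneous values by serializing them on save
Data.save_data(Data.zi[], "mysource", "meta", df; serialize=true)

# Deserialize on load
meta = Data.load_data(Data.zi[], "mysource", "meta"; serialized=true)
```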

!!! warning "Data Contiguity"
    OHLCV save/load functions validate timestamp contiguity by default. Use `check=:none` to disable validation for irregular data.

!!! tip "Performance Optimization"
    - Use progressive loading (`raw=true`) for large datasets to avoid memory issues
    - Process data in chunks when dealing with very large time series
    - Consider serialization for complex data structures that don't fit standard numeric types

## Data Access Patterns

The Data module implements dataframe indexing by dates, so you can conveniently access rows by date:
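
For example (a sketch of the date-based indexing convention; the `dt""` and `dtr""` string macros for single dates and date ranges are assumptions, so check the Data API for the exact syntax):

```julia
# NOTE: macro syntax is assumed; `ohlcv` is a loaded OHLCV DataFrame
row = ohlcv[dt"2024-01-01T00:00:00", :]        # single row by date
slice = ohlcv[dtr"2024-01-01..2024-01-02", :]  # rows within a date range
```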

### Advanced Indexing Examples

### Timeframe Management

With OHLCV data, we can access the timeframe of the series directly from the dataframe by calling `timeframe!(df)`. This either returns the previously set timeframe or infers it from the `timestamp` column. You can set the timeframe by calling e.g. `timeframe!(df, tf"1m")`, or `timeframe!!` to overwrite it.
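
For example, using the functions named in the paragraph above:

```julia
tf = timeframe!(df)      # returns the set timeframe, or infers it from timestamps
timeframe!(df, tf"1m")   # sets the timeframe
timeframe!!(df, tf"5m")  # overwrites any previously set timeframe
```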

### Efficient Data Slicing

### Progressive Data Loading

When loading data from storage, you can work with the `ZArray` directly by passing `raw=true` to `load_ohlcv`, or `as_z=true` or `with_z=true` to `load_data`. By managing the array directly you can avoid materializing the entire dataset, which matters when dealing with large amounts of data.
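
A sketch of chunked iteration over the raw array (the chunk size and two-dimensional row/column layout are illustrative assumptions):

```julia
# Get the underlying ZArray instead of a materialized DataFrame
z = Data.load_ohlcv(Data.zi[], "binance", "BTC/USDT", "1m"; raw=true)

# Process in fixed-size chunks to keep memory bounded
chunk = 100_000
for start in 1:chunk:size(z, 1)
    stop = min(start + chunk - 1, size(z, 1))
    window = z[start:stop, :]  # only this slice is read from storage
    # ... process window ...
end
```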

### Data Aggregation and Resampling

```julia
# Activate Planar project
import Pkg
Pkg.activate("Planar")

try
    using Planar
    using DataFrames
    using Dates
    @environment!

    # Aggregate OHLCV rows into larger timeframe buckets
    function resample_ohlcv(data, target_timeframe)
        data = copy(data)
        # Group by target timeframe periods
        data.period = floor.(data.timestamp, target_timeframe)

        aggregated = combine(groupby(data, :period)) do group
            (
                timestamp = first(group.timestamp),
                open = first(group.open),
                high = maximum(group.high),
                low = minimum(group.low),
                close = last(group.close),
                volume = sum(group.volume),
            )
        end

        select!(aggregated, Not(:period))
        return aggregated
    end

    println("Data resampling function defined")

    # Example: convert 1m data to 5m
    minute_data = Data.load_ohlcv(Data.zi[], "binance", "BTC/USDT", "1m")
    five_min_data = resample_ohlcv(minute_data, Minute(5))

catch e
    @warn "Planar or DataFrames not available: $e"
end
```

Data is returned as a `DataFrame` with `open,high,low,close,volume,timestamp` columns.
Since these save/load functions require a timestamp column, they check that the provided index is contiguous: it should not have missing timestamps, according to the subject timeframe. It is possible to disable those checks by passing `check=:none`.

This comprehensive data management guide provides everything you need to efficiently collect, store, and access market data in Planar. Start with the basic collection methods and gradually explore more advanced features as your data requirements grow.