# json-stream

Simple streaming JSON parser.

`json-stream` is a JSON parser just like the standard library's
[`json.load()`](https://docs.python.org/3/library/json.html#json.load). It
will read a JSON document and convert it into native python types.

```python
import json_stream
data = json_stream.load(f)
```

Features:
* stream all JSON data types (objects or lists)
* stream nested data
* simple pythonic `list`-like/`dict`-like interface
* stream truncated or malformed JSON data (up to the first error)
* pure python
* no dependencies

Unlike `json.load()`, `json-stream` can _stream_ JSON data from a file-like
object. This has the following benefits:

* It does not require the whole JSON document to be read into memory up-front
* It can start producing data before the entire document has finished loading
* It only requires enough memory to hold the data currently being parsed

## What are the problems with standard `json.load()`?

The problems with `json.load()` stem from the fact that it must read the
whole JSON document into memory before parsing it.

### Memory usage

`json.load()` first reads the whole document into memory as a string. It
then starts parsing that string, converting the whole document into python
types that are again stored in memory. For a very large document, this could
be more memory than you have available on your system.

`json_stream.load()` does not read the whole document into memory; it only
buffers enough from the stream to produce the next item of data.

Additionally, in transient mode (see below) `json-stream` doesn't store up
all of the parsed data in memory either.

### Latency

`json.load()` produces all of its data only after parsing the whole document.
If you only care about the first 10 items in a list of 2 million items, you
have to wait until all 2 million items have been parsed first.

`json_stream.load()` produces data as soon as it is available in the stream.
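
For example, breaking out of the iteration early means the rest of the
document is never parsed at all. A minimal sketch (`huge.json` is a
hypothetical file containing a large top-level JSON list):

```python
import json_stream

with open("huge.json") as f:
    data = json_stream.load(f)  # returns a transient list-like object immediately
    for i, item in enumerate(data):
        print(item)
        if i == 9:
            break  # stop after 10 items; the rest is never parsed
```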
## Usage

### `json_stream.load()`

`json_stream.load()` has two modes of operation, controlled by
the `persistent` argument (default `False`).

It is also possible to "mix" the modes as you consume the data.

#### Transient mode (default)

This mode is appropriate if you can consume the data iteratively. You cannot
move backwards through the stream to read data that has already been skipped
over. It is the mode you **must** use if you want to process large amounts of
JSON data without consuming large amounts of memory.

In transient mode, only the data currently being read is stored in memory. Any
data previously read from the stream is discarded (it's up to you what to do
with it) and attempting to access this data results in a
`TransientAccessException`.

```python
import json_stream

# JSON: {"count": 3, "results": ["a", "b", "c"]}
data = json_stream.load(f)  # data is a transient dict-like object
# stream has been read up to "{"

# use data like a dict
results = data["results"]  # results is a transient list-like object
# stream has been read up to "[", we now cannot read "count"

# iterate transient list
for result in results:
    print(result)  # prints a, b, c
# stream has been read up to "]"

# attempt to read "count" from earlier in the stream
count = data["count"]  # will raise exception
# stream is now exhausted

# attempt to read from list that has already been iterated
for result in results:  # will raise exception
    pass
```

#### Persistent mode

In persistent mode all previously read data is stored in memory as
it is parsed. The returned `dict`-like or `list`-like objects
can be used just like normal data structures.

If you request an index or key that has already been read from the stream
then it is retrieved from memory. If you request an index or key that has
not yet been read from the stream, the request blocks until that item
is found in the stream.

```python
import json_stream

# JSON: {"count": 1, "results": ["a", "b", "c"]}
data = json_stream.load(f, persistent=True)
# data is a streaming dict-like object
# stream has been read up to "{"

# use data like a dict
results = data["results"]  # results is a streaming list-like object
# stream has been read up to "["
# "count" has been stored in data

# use results like a list
a_result = results[1]  # a_result = "b"
# stream has been read up to the middle of the list
# "a" and "b" have been stored in results

# read earlier data from memory
count = data["count"]  # count = 1

# consume rest of list
results.read_all()
# stream has been read up to "}"
# "c" is now stored in results too
# results.is_streaming() == False

# consume everything
data.read_all()
# stream is now exhausted
# data.is_streaming() == False
```

Persistent mode is not appropriate if you care about memory consumption, but
it provides an experience identical to `json.load()`.

#### Mixed mode

In some cases you need random access to parts of the data, but still want
only that specific data to take up memory.

For example, you might have a very long list of objects, but you cannot always
access the keys of the objects in stream order. You want to be able to iterate
the list transiently, but access the result objects persistently.

This can be achieved using the `persistent()` method of all the `list`-like or
`dict`-like objects json_stream produces. Calling `persistent()` causes the
existing transient object to produce persistent child objects.

Note that the `persistent()` method makes the children of the object it
is called on persistent, not the object it is called on.

```python
import json_stream

# JSON: {"results": [{"x": 1, "y": 3}, {"y": 4, "x": 2}]}
# note that the keys of the inner objects are not ordered
data = json_stream.load(f)  # data is a transient dict-like object

# iterate transient list, but produce persistent items
for result in data['results'].persistent():
    # result is a persistent dict-like object
    print(result['x'])  # print x
    print(result['y'])  # print y (error on second result without .persistent())
    print(result['x'])  # print x again (error without .persistent())
```

The opposite is also possible, going from persistent mode to transient mode,
though the use cases for this are more esoteric.

```python
from io import StringIO

import json_stream

json = """{"a": 1, "x": ["long", "list", "I", "don't", "want", "in", "memory"], "b": 2}"""
data = json_stream.load(StringIO(json), persistent=True).transient()
# data is a persistent dict-like object that produces transient children

print(data["a"])  # prints 1
x = data["x"]  # x is a transient list, you can use it accordingly
print(x[0])  # prints long

# access earlier data from memory
print(data["a"])  # this would have raised an exception if data was transient

print(data["b"])  # prints 2

# we have now moved past all the data in the transient list
print(x[0])  # will raise exception
```

### visitor pattern

You can also parse using a visitor-style approach, where a function you supply
is called for each data item as it is parsed (depth-first).

This uses a transient parser under the hood, so it does not consume memory for
the whole document.

```python
import json_stream

# JSON: {"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]}

def visitor(item, path):
    print(f"{item} at path {path}")

json_stream.visit(f, visitor)
```

Output:
```
1 at path ('x',)
{} at path ('y',)
1 at path ('xxxx', 0)
2 at path ('xxxx', 1)
1 at path ('xxxx', 2, 'yyyy')
z at path ('xxxx', 3)
1 at path ('xxxx', 4)
[] at path ('xxxx', 5)
```
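
Since the visitor receives the path of each item, it can also be used to pick
specific values out of a document without storing anything else. A minimal
sketch (the JSON document and the `price` key are illustrative, not part of
the library):

```python
from io import StringIO

import json_stream

json = '{"items": [{"name": "a", "price": 1}, {"name": "b", "price": 2}]}'
prices = []

def visitor(item, path):
    # path is a tuple of keys/indexes, e.g. ('items', 0, 'price')
    if path and path[-1] == 'price':
        prices.append(item)

json_stream.visit(StringIO(json), visitor)
print(prices)  # [1, 2]
```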
### Stream a URL

#### urllib

```python
import urllib.request
import json_stream

with urllib.request.urlopen('http://example.com/data.json') as response:
    data = json_stream.load(response)
```

#### requests

```python
import requests
import json_stream.requests

with requests.get('http://example.com/data.json', stream=True) as response:
    data = json_stream.requests.load(response)
```
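
In both cases, `data` is the usual transient `dict`-like object, so you can
consume it with the same patterns shown earlier, as long as you do so inside
the `with` block while the response is still open. A sketch, assuming the
remote document contains a top-level "results" list:

```python
import requests
import json_stream.requests

with requests.get('http://example.com/data.json', stream=True) as response:
    data = json_stream.requests.load(response)
    for result in data['results']:
        print(result)  # each item is parsed as the response streams in
```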
# Future improvements

* Allow long strings in the JSON to be read as streams themselves
* Allow transient mode on seekable streams to seek to data earlier in
  the stream instead of raising a `TransientAccessException`
* A more efficient tokenizer?

# Alternatives

## NAYA

[NAYA](https://github.com/danielyule/naya) is a pure python JSON parser for
parsing a simple JSON list as a stream.

### Why not NAYA?

* It can only stream JSON containing a top-level list
* It does not provide a pythonic `dict`/`list`-like interface

## Yajl-Py

[Yajl-Py](https://pykler.github.io/yajl-py/) is a wrapper around the Yajl JSON
library that can be used to generate SAX style events while parsing JSON.

### Why not Yajl-Py?

* It's not pure python
* It does not provide a pythonic `dict`/`list`-like interface

# Build

```bash
cd ~/sources/json-stream/
python3 -m venv ~/build/
. ~/build/bin/activate
pip install --upgrade build twine
python -m build
twine upload dist/*
```

# Acknowledgements

The JSON tokenizer used in the project was taken from the
[NAYA](https://github.com/danielyule/naya) project.