# json-stream

Simple streaming JSON parser.

`json-stream` is a JSON parser just like the standard library's
[`json.load()`](https://docs.python.org/3/library/json.html#json.load). It
will read a JSON document and convert it into native python types.

```python
import json_stream
data = json_stream.load(f)
```

Features:
* stream all JSON data types (objects or lists)
* stream nested data
* simple pythonic `list`-like/`dict`-like interface
* stream truncated or malformed JSON data (up to the first error)
* pure python
* no dependencies

Unlike `json.load()`, `json-stream` can _stream_ JSON data from a file-like
object. This has the following benefits:

* It does not require the whole JSON document to be read into memory up-front
* It can start producing data before the entire document has finished loading
* It only requires enough memory to hold the data currently being parsed

## What are the problems with standard `json.load()`?

The problems with `json.load()` stem from the fact that it must read the
whole JSON document into memory before parsing it.

### Memory usage

`json.load()` first reads the whole document into memory as a string. It
then starts parsing that string, converting the whole document into python
types that are again stored in memory. For a very large document, this could
be more memory than you have available on your system.

`json_stream.load()` does not read the whole document into memory; it only
buffers enough from the stream to produce the next item of data.

Additionally, in transient mode (see below) `json-stream` doesn't store up
all of the parsed data in memory either.

### Latency

`json.load()` produces all of its data only after parsing the whole document.
If you only care about the first 10 items in a list of 2 million items, you
have to wait until all 2 million items have been parsed first.

`json_stream.load()` produces data as soon as it is available in the stream.
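
For example, breaking out of the iteration early means the rest of the
document is never parsed at all. A minimal sketch (`huge.json` is a
hypothetical file containing a large top-level JSON list):

```python
import json_stream

with open("huge.json") as f:
    data = json_stream.load(f)  # returns a transient list-like object immediately
    for i, item in enumerate(data):
        print(item)
        if i == 9:
            break  # stop after 10 items; the rest is never parsed
```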
## Usage

### `json_stream.load()`

`json_stream.load()` has two modes of operation, controlled by
the `persistent` argument (default `False`).

It is also possible to "mix" the modes as you consume the data.

#### Transient mode (default)

This mode is appropriate if you can consume the data iteratively. You cannot
move backwards through the stream to read data that has already been skipped
over. It is the mode you **must** use if you want to process large amounts of
JSON data without consuming large amounts of memory.

In transient mode, only the data currently being read is stored in memory. Any
data previously read from the stream is discarded (it's up to you what to do
with it) and attempting to access this data results in a
`TransientAccessException`.

```python
import json_stream

# JSON: {"count": 3, "results": ["a", "b", "c"]}
data = json_stream.load(f)  # data is a transient dict-like object
# stream has been read up to "{"

# use data like a dict
results = data["results"]  # results is a transient list-like object
# stream has been read up to "[", we now cannot read "count"

# iterate transient list
for result in results:
    print(result)  # prints a, b, c
# stream has been read up to "]"

# attempt to read "count" from earlier in the stream
count = data["count"]  # will raise exception
# stream is now exhausted

# attempt to read from list that has already been iterated
for result in results:  # will raise exception
    pass
```

#### Persistent mode

In persistent mode all previously read data is stored in memory as
it is parsed. The returned `dict`-like or `list`-like objects
can be used just like normal data structures.

If you request an index or key that has already been read from the stream
then it is retrieved from memory. If you request an index or key that has
not yet been read from the stream, the request blocks until that item
is found in the stream.

```python
import json_stream

# JSON: {"count": 1, "results": ["a", "b", "c"]}
data = json_stream.load(f, persistent=True)
# data is a streaming dict-like object
# stream has been read up to "{"

# use data like a dict
results = data["results"]  # results is a streaming list-like object
# stream has been read up to "["
# "count" has been stored in data

# use results like a list
a_result = results[1]  # a_result = "b"
# stream has been read up to the middle of the list
# "a" and "b" have been stored in results

# read earlier data from memory
count = data["count"]  # count = 1

# consume rest of list
results.read_all()
# stream has been read up to "}"
# "c" is now stored in results too
# results.is_streaming() == False

# consume everything
data.read_all()
# stream is now exhausted
# data.is_streaming() == False
```

Persistent mode is not appropriate if you care about memory consumption, but
it provides an experience identical to `json.load()`.

#### Mixed mode

In some cases you need random access to parts of the data, but still want
only that specific data to take up memory.

For example, you might have a very long list of objects, but you cannot always
access the keys of the objects in stream order. You want to be able to iterate
the list transiently, but access the result objects persistently.

This can be achieved using the `persistent()` method of all the `list`-like or
`dict`-like objects json_stream produces. Calling `persistent()` causes the
existing transient object to produce persistent child objects.

Note that the `persistent()` method makes the children of the object it
is called on persistent, not the object it is called on.

```python
import json_stream

# JSON: {"results": [{"x": 1, "y": 3}, {"y": 4, "x": 2}]}
# note that the keys of the inner objects are not ordered
data = json_stream.load(f)  # data is a transient dict-like object

# iterate transient list, but produce persistent items
for result in data['results'].persistent():
    # result is a persistent dict-like object
    print(result['x'])  # print x
    print(result['y'])  # print y (error on second result without .persistent())
    print(result['x'])  # print x again (error without .persistent())
```

The opposite is also possible, going from persistent mode to transient mode,
though the use cases for this are more esoteric.

```python
from io import StringIO

import json_stream

json = """{"a": 1, "x": ["long", "list", "I", "don't", "want", "in", "memory"], "b": 2}"""
data = json_stream.load(StringIO(json), persistent=True).transient()
# data is a persistent dict-like object that produces transient children

print(data["a"])  # prints 1
x = data["x"]  # x is a transient list, you can use it accordingly
print(x[0])  # prints long

# access earlier data from memory
print(data["a"])  # this would have raised an exception if data was transient

print(data["b"])  # prints 2

# we have now moved past all the data in the transient list
print(x[0])  # will raise exception
```

### visitor pattern

You can also parse using a visitor-style approach, where a function you supply
is called for each data item as it is parsed (depth-first).

This uses a transient parser under the hood, so it does not consume memory for
the whole document.

```python
import json_stream

# JSON: {"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]}

def visitor(item, path):
    print(f"{item} at path {path}")

json_stream.visit(f, visitor)
```

Output:
```
1 at path ('x',)
{} at path ('y',)
1 at path ('xxxx', 0)
2 at path ('xxxx', 1)
1 at path ('xxxx', 2, 'yyyy')
z at path ('xxxx', 3)
1 at path ('xxxx', 4)
[] at path ('xxxx', 5)
```
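
Since the visitor receives the path of each item, it can also be used to pick
specific values out of a document without storing anything else. A minimal
sketch (the JSON document and the `price` key are illustrative, not part of
the library):

```python
from io import StringIO

import json_stream

json = '{"items": [{"name": "a", "price": 1}, {"name": "b", "price": 2}]}'
prices = []

def visitor(item, path):
    # path is a tuple of keys/indexes, e.g. ('items', 0, 'price')
    if path and path[-1] == 'price':
        prices.append(item)

json_stream.visit(StringIO(json), visitor)
print(prices)  # [1, 2]
```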
### Stream a URL

#### urllib

```python
import urllib.request
import json_stream

with urllib.request.urlopen('http://example.com/data.json') as response:
    data = json_stream.load(response)
```

#### requests

```python
import requests
import json_stream.requests

with requests.get('http://example.com/data.json', stream=True) as response:
    data = json_stream.requests.load(response)
```
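
In both cases, `data` is the usual transient `dict`-like object, so you can
consume it with the same patterns shown earlier, as long as you do so inside
the `with` block while the response is still open. A sketch, assuming the
remote document contains a top-level "results" list:

```python
import requests
import json_stream.requests

with requests.get('http://example.com/data.json', stream=True) as response:
    data = json_stream.requests.load(response)
    for result in data['results']:
        print(result)  # each item is parsed as the response streams in
```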
# Future improvements

* Allow long strings in the JSON to be read as streams themselves
* Allow transient mode on seekable streams to seek to data earlier in
  the stream instead of raising a `TransientAccessException`
* A more efficient tokenizer?

# Alternatives

## NAYA

[NAYA](https://github.com/danielyule/naya) is a pure python JSON parser for
parsing a simple JSON list as a stream.

### Why not NAYA?

* It can only stream JSON containing a top-level list
* It does not provide a pythonic `dict`/`list`-like interface

## Yajl-Py

[Yajl-Py](https://pykler.github.io/yajl-py/) is a wrapper around the Yajl JSON
library that can be used to generate SAX style events while parsing JSON.

### Why not Yajl-Py?

* It's not pure python
* It does not provide a pythonic `dict`/`list`-like interface

# Build

```bash
cd ~/sources/json-stream/
python3 -m venv ~/build/
. ~/build/bin/activate
pip install --upgrade build twine
python -m build
twine upload dist/*
```

# Acknowledgements

The JSON tokenizer used in the project was taken from the
[NAYA](https://github.com/danielyule/naya) project.