d1_client.iter package

This package contains iterators that provide a convenient way to retrieve and iterate over Node contents.

Although this directory is not a package, this __init__.py file is required for pytest to be able to reach test directories below this directory.

Submodules

d1_client.iter.logrecord module

Log Record Iterator.

Iterator that provides a convenient way to retrieve log records from a DataONE node and iterate over the results.

Log records are automatically retrieved from the node in batches as required.

The LogRecordIterator takes a CoordinatingNodeClient or MemberNodeClient together with filters to select a set of log records. It returns an iterator object which enables using a Python for loop for iterating over the matching log records.

Log records are retrieved from the Node only when required. This avoids storing a large list of records in memory.
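The batching behavior described above can be pictured with a minimal sketch. It does not use d1_client: fetch_page is a hypothetical stand-in for a call such as getLogRecords(start=..., count=...), and the in-memory list stands in for the records held by a Node. The real iterator may differ in its details.

```python
def iter_paged(fetch_page, start=0, count=100):
    """Yield items one at a time, fetching a new page only when needed.

    fetch_page(start, count) is a hypothetical callable that returns a list
    with at most `count` items, beginning at offset `start`.
    """
    while True:
        page = fetch_page(start, count)
        if not page:
            return  # An empty page means there is nothing left to retrieve.
        for item in page:
            yield item
        start += len(page)


# Stand-in for a Node: 250 fake records held in a list.
FAKE_RECORDS = ["record-%d" % i for i in range(250)]
calls = []  # Records each (start, count) request to show how paging proceeds.


def fake_fetch_page(start, count):
    calls.append((start, count))
    return FAKE_RECORDS[start:start + count]


records = list(iter_paged(fake_fetch_page, start=0, count=100))
```

Because iter_paged is a generator, only one page is held in memory at a time when the caller consumes it with a for loop instead of list().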

The LogRecordIterator repeatedly calls the Node’s getLogRecords() API method. The CN implementation of this method yields log records for objects for which the caller has access. Log records are not provided for public objects. This is also how getLogRecords() is implemented in the Metacat Member Node. In GMN, the authentication requirements for this method are configurable. Other MNs are free to choose how, or whether, to implement access control for this method.

To authenticate to the target Node, provide a valid CILogon signed certificate when creating the CoordinatingNodeClient or MemberNodeClient.

See the CNCore.getLogRecords() and MNCore.getLogRecords() specifications in the DataONE Architecture Documentation for more information.

Example

#!/usr/bin/env python

import logging

import d1_client.mnclient
from d1_client.iter.logrecord import LogRecordIterator

logging.basicConfig(level=logging.INFO)
# Base URL of the Member Node from which to retrieve log records.
base_url = "https://mn-unm-1.dataone.org/mn"
client = d1_client.mnclient.MemberNodeClient(base_url)
log_record_iterator = LogRecordIterator(client)
# Records are fetched from the Node in batches as the loop advances.
for event in log_record_iterator:
    print("Event      = %s" % event.event)
    print("Timestamp  = %s" % event.dateLogged.isoformat())
    print("IP Address = %s" % event.ipAddress)
    print("Identifier = %s" % event.identifier)
    print("User agent = %s" % event.userAgent)
    print("Subject    = %s" % event.subject)
    print('-' * 79)

class d1_client.iter.logrecord.LogRecordIterator(client, get_log_records_arg_dict=None, start=0, count=100)

Bases: object

Log Record Iterator.

__init__(client, get_log_records_arg_dict=None, start=0, count=100)

Log Record Iterator.

Parameters
  • client – d1_client.cnclient.CoordinatingNodeClient or d1_client.mnclient.MemberNodeClient

    A client that has been initialized with the base_url and, optionally, other connection parameters for the DataONE node from which log records are to be retrieved.

    Log records for an object are typically available only to subjects that have elevated permissions on the object, so an unauthenticated (public) connection may not receive any log records. See the CoordinatingNodeClient and MemberNodeClient classes for details on how to authenticate.

  • get_log_records_arg_dict – dict

    If this argument is set, it is passed as keyword arguments to getLogRecords().

    The iterator calls the getLogRecords() API method as necessary to retrieve the log records. The method supports a limited set of filters: currently fromDate, toDate, event, pidFilter and idFilter.

    To use these filters, pass a dict with matching keys and the expected values in this argument. E.g.:

    { 'fromDate': datetime.datetime(2009, 1, 1) }
    
  • start – int

    If a section of the log records has been retrieved earlier, it can be skipped by setting a start value.

  • count – int

    The number of log records to retrieve in each getLogRecords() call.

    Depending on network conditions and Node implementation, changing this value from its default may affect performance and resource usage.
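As a rough illustration of how these arguments combine, the sketch below builds a filter dict and forwards it as keyword arguments together with the paging values. Here get_log_records is a hypothetical stand-in that only echoes what it receives. It is not the DataONE client method, and the filter values are made up for illustration.

```python
import datetime


def get_log_records(**kwargs):
    """Hypothetical stand-in for a client's getLogRecords(); echoes its arguments."""
    return kwargs


# Filter names are taken from the list above; the values are examples only.
get_log_records_arg_dict = {
    "fromDate": datetime.datetime(2009, 1, 1),
    "event": "read",
}

# Paging values corresponding to the iterator's start and count parameters.
start, count = 200, 50

# The dict is unpacked into keyword arguments alongside the paging values.
received = get_log_records(start=start, count=count, **get_log_records_arg_dict)
```

With start=200, the first 200 matching records are skipped, and each request asks for 50 records.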

d1_client.iter.logrecord_multi module

Multiprocessed LogRecord Iterator.

Fast retrieval of event log records from a DataONE Node.

See the additional notes in the docstring of the d1_client.iter.sysmeta_multi module.

class d1_client.iter.logrecord_multi.LogRecordIteratorMulti(base_url, page_size=1000, max_workers=16, max_result_queue_size=100, max_task_queue_size=16, api_major=2, client_arg_dict=None, get_log_records_arg_dict=None)

Bases: d1_client.iter.base_multi.MultiprocessedIteratorBase

d1_client.iter.node module

Iterate over the nodes that are registered in a DataONE environment.

For each Node in the environment, returns a PyXB representation of a DataONE Node document.

https://releases.dataone.org/online/api-documentation-v2.0/apis/Types.html#Types.Node

class d1_client.iter.node.NodeListIterator(base_url, api_major=2, client_arg_dict=None, listNodes_dict=None)

Bases: object

d1_client.iter.objectlist module

Implements an iterator that iterates over the entire ObjectList for a DataONE node. Data is retrieved from the target only when required.

The ObjectListIterator takes a CoordinatingNodeClient or MemberNodeClient together with filters to select a set of objects. It returns an iterator object which enables using a Python for loop for iterating over the matching objects. Using the ObjectListIterator is appropriate when a large percentage of the total number of objects is expected to be returned, or when one of the limited number of filters can be used for selecting the desired set of objects.

If finer-grained filtering is required, DataONE’s Solr index should be used. It can be accessed using the Solr Client.

Object information is retrieved from the Node only when required. This avoids storing a large object list in memory.

The ObjectListIterator repeatedly calls the Node’s listObjects() API method. The CN implementation of this method yields only public objects and objects for which the caller has access. This is also how listObjects() is implemented in the Metacat and GMN Member Nodes. However, other MNs are free to choose how, or whether, to implement access control for this method.

To authenticate to the target Node, provide a valid CILogon signed certificate when creating the CoordinatingNodeClient or MemberNodeClient.

Example:

#!/usr/bin/env python
import d1_client.cnclient
from d1_client.iter.objectlist import ObjectListIterator

# The Base URL for a DataONE Coordinating Node or Member Node.
base_url = 'https://cn.dataone.org/cn'
# Start retrieving objects from this position.
start = 0
# Maximum number of entries to retrieve.
max_items = 500
# Maximum number of entries to retrieve per call.
pagesize = 100

# Any DataONEBaseClient derivative can be used; a CN client matches base_url.
client = d1_client.cnclient.CoordinatingNodeClient(base_url)
ol = ObjectListIterator(client, start=start, pagesize=pagesize, max=max_items)
counter = start
print("---")
print("total: %d" % len(ol))
print("---")
for o in ol:
    print("-")
    print("  item     : %d" % counter)
    print("  pid      : %s" % o.identifier.value())
    print("  modified : %s" % o.dateSysMetadataModified)
    print("  format   : %s" % o.formatId)
    print("  size     : %s" % o.size)
    print("  checksum : %s" % o.checksum.value())
    print("  algorithm: %s" % o.checksum.algorithm)
    counter += 1

Output:

---
total: 5
---
-
  item     : 1
  pid      : knb-lter-lno.9.1
  modified : 2011-01-13 18:42:32.469000
  format   : eml://ecoinformatics.org/eml-2.0.1
  size     : 6751
  checksum : 9039F0388DC76B1A13B0F139520A8D90
  algorithm: MD5
-
  item     : 2
  pid      : LB30XX_030MTV2021R00_20080516.50.1
  modified : 2011-01-12 22:51:00.774000
  format   : eml://ecoinformatics.org/eml-2.0.1
  size     : 14435
  checksum : B2200FB7FAE18A3517AA9E2EA680EE09
  algorithm: MD5
-
  ...
class d1_client.iter.objectlist.ObjectListIterator(client, start=0, fromDate=None, pagesize=500, max=-1, nodeId=None)

Bases: object

Implements an iterator that iterates over the entire ObjectList for a DataONE node.

Data is retrieved from the target only when required.

__init__(client, start=0, fromDate=None, pagesize=500, max=-1, nodeId=None)

Initializes the iterator.

TODO: Extend this with date range and other restrictions

Parameters
  • client (DataONEBaseClient or derivative) – The client instance used for retrieving the object list.

  • start (integer) – The zero-based starting index (default: 0)

  • fromDate (DateTime) –

  • pagesize (integer) – Number of items to retrieve in a single request, i.e. the page size (default: 500)

  • max (integer) – Maximum number of items to retrieve (default: -1, which retrieves all)
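To make the interaction between start, pagesize and max concrete, the following sketch computes the sequence of page requests that would cover a requested range. It is one plausible reading of these parameters, not the iterator's actual implementation: page_requests, max_items and total are hypothetical names, and treating -1 as "no limit" is an assumption based on the max=-1 default.

```python
def page_requests(start=0, pagesize=500, max_items=-1, total=0):
    """Return the (start, count) pairs needed to cover the requested range.

    max_items=-1 is assumed to mean "no limit". total is the number of
    objects available on the Node (hypothetical input for this sketch).
    """
    end = total if max_items < 0 else min(total, start + max_items)
    requests = []
    while start < end:
        count = min(pagesize, end - start)  # The last page may be shorter.
        requests.append((start, count))
        start += count
    return requests


# 250 items requested from a Node holding 1000 objects, 100 per request.
limited = page_requests(start=0, pagesize=100, max_items=250, total=1000)

# No limit, starting near the end of the list.
unlimited = page_requests(start=900, pagesize=500, max_items=-1, total=1000)
```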

d1_client.iter.objectlist_multi module

Multiprocessed ObjectList Iterator.

Fast retrieval of ObjectList from a DataONE Node.

See the additional notes in the docstring of the d1_client.iter.sysmeta_multi module.

class d1_client.iter.objectlist_multi.ObjectListIteratorMulti(base_url, page_size=1000, max_workers=16, max_result_queue_size=100, max_task_queue_size=16, api_major=2, client_arg_dict=None, list_objects_arg_dict=None)

Bases: d1_client.iter.base_multi.MultiprocessedIteratorBase

d1_client.iter.sysmeta_multi module

Multiprocessed System Metadata iterator.

Parallel download of a set of SystemMetadata documents from a CN or MN. The SystemMetadata to download can be selected by the filters that are available in the MNRead.listObjects() and CNRead.listObjects() API calls. For MNs, these include: fromDate, toDate, formatId and identifier. For CNs, these include the ones supported by MNs plus nodeId.

Note: Unhandled exceptions raised in client code while iterating over results from this iterator, or in the iterator itself, will not be shown and may cause the client code to hang. This is a limitation of the multiprocessing module.

If there is an error when retrieving a System Metadata document, such as NotAuthorized, an object that is derived from d1_common.types.exceptions.DataONEException is returned instead.
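Because errors are returned as items rather than raised, client code has to check the type of each result. A minimal sketch of that check follows. FakeDataONEException is a hypothetical stand-in for d1_common.types.exceptions.DataONEException, and the result list stands in for what the iterator would yield.

```python
class FakeDataONEException(Exception):
    """Hypothetical stand-in for d1_common.types.exceptions.DataONEException."""


def split_results(results, error_type):
    """Separate successfully retrieved items from returned error objects."""
    ok, errors = [], []
    for item in results:
        if isinstance(item, error_type):
            errors.append(item)
        else:
            ok.append(item)
    return ok, errors


# Stand-in for iterator output: two documents and one returned error.
results = ["sysmeta-a", FakeDataONEException("NotAuthorized"), "sysmeta-b"]
ok, errors = split_results(results, FakeDataONEException)
```

With the real iterator, the same isinstance() check would presumably be made against DataONEException inside the for loop.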

The iterator creates as many DataONE clients and HTTP or HTTPS connections as there are workers. Each worker reuses a single connection, first for retrieving a page of results, then for retrieving all System Metadata objects in that page.

There is an unidentified performance bottleneck somewhere in this iterator, but it is not the pickling and unpickling of sysmeta_pyxb.

Notes on MAX_QUEUE_SIZE:

Queues that become too large can cause deadlocks:

https://stackoverflow.com/questions/21641887/python-multiprocessing-process-hangs-on-join-for-large-queue

Each item in the queue is a potentially large SysMeta PyXB object, so we set a low max queue size.
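The effect of a low maximum queue size can be sketched as follows. To keep the example short and portable, it uses threading and queue.Queue rather than the multiprocessing module that the iterator is built on; the bounding idea is the same, but this is an illustration, not the iterator's code. The names and the size of 4 are chosen for the example.

```python
import queue
import threading

MAX_QUEUE_SIZE = 4  # Deliberately small, in the spirit of the note above.
SENTINEL = None     # Marks the end of the stream.


def producer(q, n_items):
    """Put n_items on the queue, then a sentinel."""
    for i in range(n_items):
        # put() blocks while the queue already holds MAX_QUEUE_SIZE items,
        # so the producer can never run far ahead of the consumer.
        q.put("item-%d" % i)
    q.put(SENTINEL)


result_queue = queue.Queue(maxsize=MAX_QUEUE_SIZE)
worker = threading.Thread(target=producer, args=(result_queue, 20))
worker.start()

consumed = []
max_seen = 0  # Largest queue size observed by the consumer.
while True:
    max_seen = max(max_seen, result_queue.qsize())
    item = result_queue.get()
    if item is SENTINEL:
        break
    consumed.append(item)
worker.join()
```

The queue is fully drained before join() is called, which is the ordering that avoids the hang described in the linked discussion.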

class d1_client.iter.sysmeta_multi.SystemMetadataIteratorMulti(base_url, page_size=1000, max_workers=16, max_result_queue_size=100, max_task_queue_size=16, api_major=2, client_arg_dict=None, list_objects_arg_dict=None, get_system_metadata_arg_dict=None)

Bases: d1_client.iter.base_multi.MultiprocessedIteratorBase