d1_client.iter package
This package contains iterators that provide a convenient way to retrieve and iterate over Node contents.
Although this directory is not a package, this __init__.py file is required for pytest to be able to reach test directories below this directory.
Submodules
d1_client.iter.base_multi module
Base for multiprocessed DataONE type iterator.
class d1_client.iter.base_multi.MultiprocessedIteratorBase(base_url, page_size, max_workers, max_result_queue_size, max_task_queue_size, api_major, client_arg_dict, page_arg_dict, item_proc_arg_dict, page_func, iter_func, item_proc_func)
Bases: object
property total
-
property
d1_client.iter.base_multi.create_client(base_url='https://cn.dataone.org/cn', api_major=2, client_arg_dict=None)
d1_client.iter.logrecord module
Log Record Iterator.
Iterator that provides a convenient way to retrieve log records from a DataONE node and iterate over the results.
Log records are automatically retrieved from the node in batches as required.
The LogRecordIterator takes a CoordinatingNodeClient or MemberNodeClient together with filters to select a set of log records. It returns an iterator object which enables using a Python for loop for iterating over the matching log records.
Log records are retrieved from the Node only when required. This avoids storing a large list of records in memory.
The LogRecordIterator repeatedly calls the Node’s getLogRecords() API method. The CN implementation of this method yields log records for objects to which the caller has access. Log records are not provided for public objects. This is also how getLogRecords() is implemented in the Metacat Member Node. In GMN, the requirements for authentication for this method are configurable. Other MNs are free to choose how or if to implement access control for this method.
To authenticate to the target Node, provide a valid CILogon signed certificate when creating the CoordinatingNodeClient or MemberNodeClient.
See the CNCore.getLogRecords() and MNCore.getLogRecords() specifications in the DataONE Architecture Documentation for more information.
Example
    #!/usr/bin/env python
    import logging

    import d1_client.mnclient
    from d1_client.iter.logrecord import LogRecordIterator

    logging.basicConfig(level=logging.INFO)

    # Base URL of the Member Node to retrieve log records from.
    base_url = "https://mn-unm-1.dataone.org/mn"
    client = d1_client.mnclient.MemberNodeClient(base_url)
    log_record_iterator = LogRecordIterator(client)
    for event in log_record_iterator:
        print("Event = %s" % event.event)
        print("Timestamp = %s" % event.dateLogged.isoformat())
        print("IP Address = %s" % event.ipAddress)
        print("Identifier = %s" % event.identifier)
        print("User agent = %s" % event.userAgent)
        print("Subject = %s" % event.subject)
        print("-" * 79)
class d1_client.iter.logrecord.LogRecordIterator(client, get_log_records_arg_dict=None, start=0, count=100)
Bases: object
Log Record Iterator.
__init__(client, get_log_records_arg_dict=None, start=0, count=100)
Log Record Iterator.
- Parameters
client – d1_client.cnclient.CoordinatingNodeClient or d1_client.mnclient.MemberNodeClient
A client that has been initialized with the base_url and, optionally, other connection parameters for the DataONE node from which log records are to be retrieved.
Log records for an object are typically available only to subjects that have elevated permissions on the object, so an unauthenticated (public) connection may not receive any log records. See the CoordinatingNodeClient and MemberNodeClient classes for details on how to authenticate.
get_log_records_arg_dict – dict
If this argument is set, it is passed as keyword arguments to getLogRecords().
The iterator calls the getLogRecords() API method as necessary to retrieve the log records. The method supports a limited set of filters: currently fromDate, toDate, event, pidFilter and idFilter.
To use these filters, pass a dict with the filter names as keys and the desired filter values as values. E.g.:
{ 'fromDate': datetime.datetime(2009, 1, 1) }
start – int
If a section of the log records has been retrieved earlier, it can be skipped by setting a start value.
count – int
The number of log records to retrieve in each getLogRecords() call.
Depending on network conditions and Node implementation, changing this value from its default may affect performance and resource usage.
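The interaction between start, count and on-demand retrieval can be illustrated with a minimal sketch. It does not use d1_client: FakeLogClient, get_page and iter_paged are hypothetical names invented for this illustration, and the real iterator may differ in its details. The sketch only shows the general pattern of fetching one page at a time and requesting the next page when the current one is exhausted.

```python
class FakeLogClient:
    """Hypothetical stand-in for a node client; not part of d1_client."""

    def __init__(self, total):
        self._records = ["record-%d" % i for i in range(total)]
        self.calls = []  # (start, count) of each simulated request

    def get_page(self, start, count):
        # Simulates one paged API call returning at most `count` records.
        self.calls.append((start, count))
        return self._records[start:start + count]


def iter_paged(client, start=0, count=100):
    """Yield records one at a time, fetching a new page only when needed."""
    while True:
        page = client.get_page(start, count)
        if not page:
            return  # an empty page means there is nothing left to retrieve
        for record in page:
            yield record
        start += len(page)


client = FakeLogClient(total=250)
# Skip the first 50 records and retrieve the rest in pages of 100.
records = list(iter_paged(client, start=50, count=100))
print(len(records), records[0], client.calls)
```

Because iter_paged is a generator, no request is made until the first record is asked for, and at most one page is held in memory at a time.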
-
d1_client.iter.logrecord_multi module
Multiprocessed LogRecord Iterator.
Fast retrieval of event log records from a DataONE Node.
See additional notes in SysMeta iter docstring.
class d1_client.iter.logrecord_multi.LogRecordIteratorMulti(base_url='https://cn.dataone.org/cn', page_size=1000, max_workers=16, max_result_queue_size=100, max_task_queue_size=16, api_major=2, client_arg_dict=None, get_log_records_arg_dict=None)
d1_client.iter.node module
Iterate over the nodes that are registered in a DataONE environment.
For each Node in the environment, returns a PyXB representation of a DataONE Node document.
https://releases.dataone.org/online/api-documentation-v2.0/apis/Types.html#Types.Node
class d1_client.iter.node.NodeListIterator(cn_client)
Bases: object
d1_client.iter.objectlist module
Implements an iterator that iterates over the entire ObjectList for a DataONE node. Data is retrieved from the target only when required.
The ObjectListIterator takes a CoordinatingNodeClient or MemberNodeClient together with filters to select a set of objects. It returns an iterator object which enables using a Python for loop for iterating over the matching objects. Using the ObjectListIterator is appropriate in circumstances where a large percentage of the total number of objects is expected to be returned or when one of the limited number of filters can be used for selecting the desired set of objects.
If more fine-grained filtering is required, DataONE’s Solr index should be used. It can be accessed using the Solr Client.
Object information is retrieved from the Node only when required. This avoids storing a large object list in memory.
The ObjectListIterator repeatedly calls the Node’s listObjects() API method. The CN implementation of this method yields only public objects and objects to which the caller has access. This is also how listObjects() is implemented in the Metacat and GMN Member Nodes. However, other MNs are free to choose how or if to implement access control for this method.
To authenticate to the target Node, provide a valid CILogon signed certificate when creating the CoordinatingNodeClient or MemberNodeClient.
Example:
    #!/usr/bin/env python
    import d1_client.cnclient
    from d1_client.iter.objectlist import ObjectListIterator

    # The Base URL for a DataONE Coordinating Node or Member Node.
    base_url = "https://cn.dataone.org/cn"
    # Start retrieving objects from this position.
    start = 0
    # Maximum number of entries to retrieve.
    max_items = 500
    # Maximum number of entries to retrieve per call.
    pagesize = 100

    client = d1_client.cnclient.CoordinatingNodeClient(base_url)
    ol = ObjectListIterator(client, start=start, pagesize=pagesize, max=max_items)
    counter = start
    print("---")
    print("total: %d" % len(ol))
    print("---")
    for o in ol:
        print("-")
        print(" item : %d" % counter)
        print(" pid : %s" % o.identifier.value())
        print(" modified : %s" % o.dateSysMetadataModified)
        print(" format : %s" % o.formatId)
        print(" size : %s" % o.size)
        print(" checksum : %s" % o.checksum.value())
        print(" algorithm: %s" % o.checksum.algorithm)
        counter += 1
Output:
---
total: 5
---
-
item : 1
pid : knb-lter-lno.9.1
modified : 2011-01-13 18:42:32.469000
format : eml://ecoinformatics.org/eml-2.0.1
size : 6751
checksum : 9039F0388DC76B1A13B0F139520A8D90
algorithm: MD5
-
item : 2
pid : LB30XX_030MTV2021R00_20080516.50.1
modified : 2011-01-12 22:51:00.774000
format : eml://ecoinformatics.org/eml-2.0.1
size : 14435
checksum : B2200FB7FAE18A3517AA9E2EA680EE09
algorithm: MD5
-
...
class d1_client.iter.objectlist.ObjectListIterator(client, start=0, fromDate=None, pagesize=500, max=-1, nodeId=None)
Bases: object
Implements an iterator that iterates over the entire ObjectList for a DataONE node.
Data is retrieved from the target only when required.
__init__(client, start=0, fromDate=None, pagesize=500, max=-1, nodeId=None)
Initializes the iterator.
TODO: Extend this with date range and other restrictions
- Parameters
client (DataONEBaseClient or derivative) – The client instance used to retrieve the object list.
start (integer) – Zero-based index of the first item to retrieve (default: 0)
fromDate (DateTime) –
pagesize (integer) – Number of items to retrieve in a single request, i.e. per page (default: 500)
max (integer) – Maximum number of items to retrieve (default: -1, which retrieves all items)
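How start, pagesize and max might combine into individual page requests can be sketched in plain Python. The helper below, page_requests, is a hypothetical function written for this illustration and is not part of d1_client. It assumes that the node reports a total object count, that a negative max means "no limit", and that the last page is shortened to fit; the actual iterator may handle these cases differently.

```python
def page_requests(total, start=0, pagesize=500, max_items=-1):
    """Return the (start, count) pairs needed to cover the requested range.

    total is the number of objects the node reports. A negative max_items
    is treated as "no limit" (an assumption based on the max=-1 default).
    """
    end = total if max_items < 0 else min(total, start + max_items)
    requests = []
    while start < end:
        # The final page may be shorter than pagesize.
        count = min(pagesize, end - start)
        requests.append((start, count))
        start += count
    return requests


# All 1200 objects, in pages of 500.
print(page_requests(total=1200, start=0, pagesize=500))
# At most 250 objects, starting at index 100, in pages of 100.
print(page_requests(total=1200, start=100, pagesize=100, max_items=250))
```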
-
d1_client.iter.objectlist_multi module
Multiprocessed ObjectList Iterator.
Fast retrieval of ObjectList from a DataONE Node.
See additional notes in SysMeta iter docstring.
class d1_client.iter.objectlist_multi.ObjectListIteratorMulti(base_url='https://cn.dataone.org/cn', page_size=1000, max_workers=16, max_result_queue_size=100, max_task_queue_size=16, api_major=2, client_arg_dict=None, list_objects_arg_dict=None)
d1_client.iter.sysmeta_multi module
Multiprocessed System Metadata iterator.
Parallel download of a set of SystemMetadata documents from a CN or MN. The SystemMetadata to download can be selected by the filters that are available in the MNRead.listObjects() and CNRead.listObjects() API calls. For MNs, these include: fromDate, toDate, formatId and identifier. For CNs, these include the ones supported by MNs plus nodeId.
Note: Unhandled exceptions raised in client code while iterating over results from this iterator, or in the iterator itself, will not be shown and may cause the client code to hang. This is a limitation of the multiprocessing module.
If there is an error when retrieving a System Metadata document, such as NotAuthorized, an object that is derived from d1_common.types.exceptions.DataONEException is returned instead.
The iterator creates as many DataONE clients and HTTP or HTTPS connections as there are workers. A single connection is reused, first for retrieving a page of results, then for retrieving all System Metadata objects in that page.
There is a bottleneck somewhere in this iterator, but it’s not pickle/unpickle of sysmeta_pyxb.
Notes on MAX_QUEUE_SIZE:
Queues that become too large can cause deadlocks: https://stackoverflow.com/questions/21641887/python-multiprocessing-process-hangs-on-join-for-large-queue
Each item in the queue is a potentially large SysMeta PyXB object, so we set a low max queue size.
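The two ideas described above, a small bounded result queue and errors that are handed back as values rather than raised, can be illustrated with a minimal sketch. For brevity and determinism it uses threads and queue.Queue from the standard library instead of the multiprocessing module, and every name in it (FakeNodeError, fetch, worker, run) is hypothetical. It is not the implementation used by this iterator.

```python
import queue
import threading


class FakeNodeError(Exception):
    """Hypothetical stand-in for a DataONEException subclass."""


def fetch(pid):
    """Pretend to download one document; fail for some identifiers."""
    if pid % 4 == 0:
        raise FakeNodeError("not authorized: %d" % pid)
    return "sysmeta-%d" % pid


def worker(task_queue, result_queue):
    while True:
        pid = task_queue.get()
        if pid is None:  # sentinel: no more work for this worker
            break
        try:
            result_queue.put(fetch(pid))
        except FakeNodeError as e:
            # Pass the error along as a value instead of raising it.
            result_queue.put(e)


def run(pids, max_workers=4, max_result_queue_size=2):
    task_queue = queue.Queue()
    # With a small maxsize, put() blocks until the consumer catches up,
    # so only a few (potentially large) results are buffered at once.
    result_queue = queue.Queue(maxsize=max_result_queue_size)
    for pid in pids:
        task_queue.put(pid)
    threads = []
    for _ in range(max_workers):
        task_queue.put(None)  # one sentinel per worker
        t = threading.Thread(target=worker, args=(task_queue, result_queue))
        t.start()
        threads.append(t)
    results, errors = [], []
    # Drain the result queue before joining, so no worker stays blocked.
    for _ in pids:
        item = result_queue.get()
        (errors if isinstance(item, Exception) else results).append(item)
    for t in threads:
        t.join()
    return results, errors


results, errors = run(list(range(1, 11)))
print(len(results), "documents,", len(errors), "errors")
```

Client code that consumes such a stream checks each item with isinstance() before using it, and drains the queue before waiting for the workers to finish.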
class d1_client.iter.sysmeta_multi.SystemMetadataIteratorMulti(base_url='https://cn.dataone.org/cn', page_size=1000, max_workers=16, max_result_queue_size=100, max_task_queue_size=16, api_major=2, client_arg_dict=None, list_objects_arg_dict=None, get_system_metadata_arg_dict=None)