d1_common package

DataONE Common Library.

Although this directory is not a package, this __init__.py file is required for pytest to be able to reach test directories below this directory.

Submodules

d1_common.bagit module

Create and validate BagIt Data Packages / zip file archives.

d1_common.bagit.validate_bagit_file(bagit_path)

Check if a BagIt file is valid.

Raises

ServiceFailure – If the BagIt zip archive file fails any of the following checks:

  • Is a valid zip file.

  • The tag and manifest files are correctly formatted.

  • Contains all the files listed in the manifests.

  • The file checksums match the manifests.
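The checks listed above can be sketched with the standard library alone. This is an illustration, not the d1_common implementation: the helper names `find_bag_problems` and `build_demo_bag`, the single `manifest-md5.txt` manifest and the `bag` root directory are all assumptions, and tag file validation is left out.

```python
import hashlib
import io
import zipfile


def find_bag_problems(zip_bytes, root="bag", manifest_name="manifest-md5.txt"):
    """Return a list of problems; an empty list means every check passed."""
    try:
        zf = zipfile.ZipFile(io.BytesIO(zip_bytes))
    except zipfile.BadZipFile:
        return ["not a valid zip file"]
    problems = []
    with zf:
        names = set(zf.namelist())
        manifest_path = "{}/{}".format(root, manifest_name)
        if manifest_path not in names:
            return ["manifest is missing"]
        for line in zf.read(manifest_path).decode("utf-8").splitlines():
            if not line.strip():
                continue
            # Each manifest line is assumed to be "<hex checksum> <relative path>".
            parts = line.split(None, 1)
            if len(parts) != 2:
                problems.append("malformed manifest line: " + line)
                continue
            expected_hex, rel_path = parts
            member = "{}/{}".format(root, rel_path)
            if member not in names:
                problems.append("missing file: " + rel_path)
            elif hashlib.md5(zf.read(member)).hexdigest() != expected_hex.lower():
                problems.append("checksum mismatch: " + rel_path)
    return problems


def build_demo_bag(payload, recorded_payload=None):
    """Build a tiny in-memory bag; recorded_payload fakes a wrong manifest entry."""
    recorded = payload if recorded_payload is None else recorded_payload
    digest = hashlib.md5(recorded).hexdigest()
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("bag/data/a.txt", payload)
        zf.writestr("bag/manifest-md5.txt", digest + "  data/a.txt\n")
    return buf.getvalue()
```

Unlike validate_bagit_file(), which raises ServiceFailure, the sketch returns the problems it finds so that they can be inspected directly.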

d1_common.bagit.create_bagit_stream(dir_name, payload_info_list)

Create a stream containing a BagIt zip archive.

Parameters
  • dir_name – str The name of the root directory in the zip file, under which all the files are placed (avoids “zip bombs”).

  • payload_info_list – list List of payload_info_dict, each dict describing a file.

    • keys: pid, filename, iter, checksum, checksum_algorithm

    • If the filename is None, the pid is used for the filename.

d1_common.checksum module

Utilities for handling checksums.

Warning

The MD5 checksum algorithm is not cryptographically secure. It’s possible to craft a sequence of bytes that yields a predetermined checksum.

d1_common.checksum.create_checksum_object_from_stream(f, algorithm='SHA-1')

Calculate the checksum of a stream.

Parameters
  • f – file-like object Only requirement is a read() method that returns bytes.

  • algorithm – str Checksum algorithm, MD5 or SHA1 / SHA-1.

Returns

Populated Checksum PyXB object.

d1_common.checksum.create_checksum_object_from_iterator(itr, algorithm='SHA-1')

Calculate the checksum of an iterator.

Parameters
  • itr – iterable Object which supports the iterator protocol.

  • algorithm – str Checksum algorithm, MD5 or SHA1 / SHA-1.

Returns

Populated Checksum PyXB object.

d1_common.checksum.create_checksum_object_from_bytes(b, algorithm='SHA-1')

Calculate the checksum of bytes.

Warning

This method requires the entire object to be buffered in (virtual) memory, which should normally be avoided in production code.

Parameters
  • b – bytes Raw bytes

  • algorithm – str Checksum algorithm, MD5 or SHA1 / SHA-1.

Returns

Populated PyXB Checksum object.

d1_common.checksum.calculate_checksum_on_stream(f, algorithm='SHA-1', chunk_size=1048576)

Calculate the checksum of a stream.

Parameters
  • f – file-like object Only requirement is a read() method that returns bytes.

  • algorithm – str Checksum algorithm, MD5 or SHA1 / SHA-1.

  • chunk_size – int Number of bytes to read from the file and add to the checksum at a time.

Returns

Checksum as a hexadecimal string, with length decided by the algorithm.

Return type

str
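The chunked approach that the chunk_size parameter implies can be sketched with hashlib. The function name `checksum_on_stream_sketch` and the `HASHLIB_BY_NAME` mapping are assumptions made for this example; they are not part of d1_common.

```python
import hashlib
import io

# Assumed mapping from the algorithm names used in this module to hashlib.
HASHLIB_BY_NAME = {"MD5": hashlib.md5, "SHA1": hashlib.sha1, "SHA-1": hashlib.sha1}


def checksum_on_stream_sketch(f, algorithm="SHA-1", chunk_size=1024 * 1024):
    """Hash a file-like object one chunk at a time and return the hex digest."""
    calculator = HASHLIB_BY_NAME[algorithm]()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            # read() returns empty bytes at end of stream.
            break
        calculator.update(chunk)
    return calculator.hexdigest()


demo_hex = checksum_on_stream_sketch(io.BytesIO(b"abc" * 1000), chunk_size=7)
```

Because only one chunk is held at a time, memory use stays bounded regardless of the size of the stream.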

d1_common.checksum.calculate_checksum_on_iterator(itr, algorithm='SHA-1')

Calculate the checksum of an iterator.

Parameters
  • itr – iterable Object which supports the iterator protocol.

  • algorithm – str Checksum algorithm, MD5 or SHA1 / SHA-1.

Returns

Checksum as a hexadecimal string, with length decided by the algorithm.

Return type

str

d1_common.checksum.calculate_checksum_on_bytes(b, algorithm='SHA-1')

Calculate the checksum of bytes.

Warning

This method requires the entire object to be buffered in (virtual) memory, which should normally be avoided in production code.

Parameters
  • b – bytes Raw bytes

  • algorithm – str Checksum algorithm, MD5 or SHA1 / SHA-1.

Returns

Checksum as a hexadecimal string, with length decided by the algorithm.

Return type

str

d1_common.checksum.are_checksums_equal(checksum_a_pyxb, checksum_b_pyxb)

Determine if checksums are equal.

Parameters

checksum_a_pyxb, checksum_b_pyxb – PyXB Checksum objects to compare.

Returns

bool
  • True: The checksums contain the same hexadecimal values calculated with the same algorithm. Identical checksums guarantee (for all practical purposes) that the checksums were calculated from the same sequence of bytes.

  • False: The checksums were calculated with the same algorithm but the hexadecimal values are different.

Raises

ValueError – The checksums were calculated with different algorithms, hence cannot be compared.
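The comparison rule described above can be sketched as follows. `ChecksumSketch` is a hypothetical stand-in for the PyXB Checksum type, and treating the hexadecimal value as case-insensitive is an assumption of this example.

```python
from collections import namedtuple

# Hypothetical stand-in for a PyXB Checksum: algorithm name plus hex value.
ChecksumSketch = namedtuple("ChecksumSketch", ["algorithm", "hex_value"])


def are_checksums_equal_sketch(a, b):
    """Compare two checksums, refusing to compare across algorithms."""
    if a.algorithm != b.algorithm:
        raise ValueError(
            "Cannot compare checksums calculated with different algorithms: "
            "{} and {}".format(a.algorithm, b.algorithm)
        )
    # Assumption: hex digits are compared without regard to case.
    return a.hex_value.lower() == b.hex_value.lower()
```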

d1_common.checksum.get_checksum_calculator_by_dataone_designator(dataone_algorithm_name)

Get a checksum calculator.

Parameters

dataone_algorithm_name – str Checksum algorithm, MD5 or SHA1 / SHA-1.

Returns

Checksum calculator from the hashlib library

Object that supports update(arg), digest(), hexdigest() and copy().

d1_common.checksum.get_default_checksum_algorithm()

Get the default checksum algorithm.

Returns

Checksum algorithm that is supported by DataONE, the DataONE Python stack and is in common use within the DataONE federation. Currently, SHA-1.

The returned string can be passed as the algorithm_str to the functions in this module.

Return type

str

d1_common.checksum.is_supported_algorithm(algorithm_str)

Determine if string is the name of a supported checksum algorithm.

Parameters

algorithm_str – str String that may or may not contain the name of a supported algorithm.

Returns

bool
  • True: The string contains the name of a supported algorithm and can be passed as the algorithm_str to the functions in this module.

  • False: The string is not a supported algorithm.

d1_common.checksum.get_supported_algorithms()

Get a list of the checksum algorithms that are supported by the DataONE Python stack.

Returns

List of algorithms that are supported by the DataONE Python stack and can be passed as the algorithm_str to the functions in this module.

Return type

list

d1_common.checksum.format_checksum(checksum_pyxb)

Create string representation of a PyXB Checksum object.

Parameters

checksum_pyxb – PyXB Checksum object

Returns

Combined hexadecimal value and algorithm name.

Return type

str

d1_common.const module

System wide constants for the Python DataONE stack.

d1_common.date_time module

Utilities for handling date-times in DataONE.

Timezones (tz):

  • A datetime object can be tz-naive or tz-aware.

  • tz-naive: The datetime does not include timezone information. As such, it does not by itself fully specify an absolute point in time. The exact point in time depends on the timezone in which the time is specified, and that information may not be accessible to the end user. However, as timezones go from GMT-12 to GMT+14, and when including a possible daylight saving offset of 1 hour, a tz-naive datetime will always be within 14 hours of the real time.

  • tz-aware: The datetime includes a timezone, specified as an abbreviation or as an hour and minute offset. It specifies an exact point in time.
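The difference between the two kinds of datetime can be shown with the standard library. The helper name `is_tz_aware` is an assumption for this example; the check it performs follows the definition of "aware" in the Python datetime documentation.

```python
import datetime

# Same wall-clock values, with and without timezone information.
naive = datetime.datetime(2020, 1, 1, 12, 0, 0)
aware = datetime.datetime(
    2020, 1, 1, 12, 0, 0,
    tzinfo=datetime.timezone(datetime.timedelta(hours=2, minutes=30)),
)


def is_tz_aware(dt):
    """A datetime is aware only if its tzinfo yields a UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


# Only the aware datetime identifies an exact point in time, so only it can
# be converted to UTC without guessing.
aware_in_utc = aware.astimezone(datetime.timezone.utc)
```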

class d1_common.date_time.UTC

Bases: datetime.tzinfo

datetime.tzinfo based class that represents the UTC timezone.

Date-times in DataONE should have timezone information that is fixed to UTC. A naive Python datetime can be fixed to UTC by attaching it to this datetime.tzinfo based class.

utcoffset(dt)

Returns:

UTC offset of zero

tzname(dt=None)

Returns:

str: “UTC”

dst(dt=None)

Args: dt: Ignored.

Returns: timedelta(0), meaning that daylight saving is never in effect.

class d1_common.date_time.FixedOffset(name, offset_hours=0, offset_minutes=0)

Bases: datetime.tzinfo

datetime.tzinfo derived class that represents any timezone as fixed offset in minutes east of UTC.

  • Date-times in DataONE should have timezone information that is fixed to UTC. A naive Python datetime can be fixed to UTC by attaching it to this datetime.tzinfo based class.

  • See the UTC class for representing timezone in UTC.

__init__(name, offset_hours=0, offset_minutes=0)

Args: name: str Name of the timezone this offset represents.

offset_hours:

Number of hours offset from UTC.

offset_minutes:

Number of minutes offset from UTC.

utcoffset(dt)

Args: dt: Ignored.

Returns

The time offset from UTC.

Return type

datetime.timedelta

tzname(dt)

Args: dt: Ignored.

Returns: Name of the timezone this offset represents.

dst(dt=None)

Args: dt: Ignored.

Returns: timedelta(0), meaning that daylight saving is never in effect.
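A fixed-offset tzinfo of the kind documented above can be sketched in a few lines. `FixedOffsetSketch` is a hypothetical class that mirrors the documented constructor and methods; it is not the d1_common class itself.

```python
import datetime


class FixedOffsetSketch(datetime.tzinfo):
    """Timezone with a constant offset east of UTC and no daylight saving."""

    def __init__(self, name, offset_hours=0, offset_minutes=0):
        self._name = name
        self._offset = datetime.timedelta(
            hours=offset_hours, minutes=offset_minutes
        )

    def utcoffset(self, dt):
        return self._offset

    def tzname(self, dt):
        return self._name

    def dst(self, dt=None):
        # Daylight saving is never in effect for a fixed offset.
        return datetime.timedelta(0)


# Hypothetical usage: noon at UTC+1 is 11:00 in UTC.
plus_one = FixedOffsetSketch("UTC+1", offset_hours=1)
noon_plus_one = datetime.datetime(2020, 1, 1, 12, 0, tzinfo=plus_one)
noon_in_utc = noon_plus_one.astimezone(datetime.timezone.utc)
```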

d1_common.date_time.is_valid_iso8601(iso8601_str)

Determine if string is a valid ISO 8601 date, time, or datetime.

Parameters

iso8601_str – str String to check.

Returns

True if string is a valid ISO 8601 date, time, or datetime.

Return type

bool

d1_common.date_time.has_tz(dt)

Determine if datetime has timezone (is not naive)

Parameters

dt – datetime

Returns

bool
  • True: datetime is tz-aware.

  • False: datetime is tz-naive.

d1_common.date_time.is_utc(dt)

Determine if datetime has timezone and the timezone is in UTC.

Parameters

dt – datetime

Returns

True if datetime has timezone and the timezone is in UTC

Return type

bool

d1_common.date_time.are_equal(a_dt, b_dt, round_sec=1)

Determine if two datetimes are equal with fuzz factor.

A naive datetime (no timezone information) is assumed to be in UTC.

Parameters
  • a_dt – datetime Timestamp to compare.

  • b_dt – datetime Timestamp to compare.

  • round_sec – int or float Round the timestamps to the closest second divisible by this value before comparing them.

    E.g.:

    • round_sec = 0.1: nearest 10th of a second.

    • round_sec = 1: nearest second.

    • round_sec = 30: nearest half minute.

    Timestamps may lose resolution or otherwise change slightly as they pass through various transformations and storage systems. As a result, timestamps that have been processed in different systems may fail an exact equality comparison even if they started out identical. Rounding avoids such problems as long as the error introduced to the original timestamp is smaller than the rounding value. Rounding also reduces the resolution of the compared values, so round_sec should be kept as low as possible. The default value of 1 second should be a good tradeoff in most cases.

Returns

bool
  • True: If the two datetimes are equal after being rounded by round_sec.
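A minimal sketch of a fuzzy comparison along these lines, assuming that both timestamps are first reduced to POSIX seconds and then assigned to the nearest multiple of round_sec. The function name `are_equal_sketch` is hypothetical, and the real function may round differently at exact midpoints.

```python
import datetime


def are_equal_sketch(a_dt, b_dt, round_sec=1):
    """Compare two datetimes after rounding to the nearest round_sec seconds."""

    def rounded_bucket(dt):
        if dt.tzinfo is None:
            # A naive datetime is assumed to be in UTC.
            dt = dt.replace(tzinfo=datetime.timezone.utc)
        # Index of the nearest multiple of round_sec since the POSIX epoch.
        return round(dt.timestamp() / round_sec)

    return rounded_bucket(a_dt) == rounded_bucket(b_dt)
```

Two timestamps that sit close together but on opposite sides of a rounding boundary still compare as unequal, which is one reason to keep the rounding value as small as the use case permits.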

d1_common.date_time.ts_from_dt(dt)

Convert datetime to POSIX timestamp.

Parameters

dt – datetime

  • Timezone aware datetime: The tz is included and adjusted to UTC (since timestamp is always in UTC).

  • Naive datetime (no timezone information): Assumed to be in UTC.

Returns

int or float
  • The number of seconds since Midnight, January 1st, 1970, UTC.

  • If dt contains sub-second values, the returned value will be a float with fraction.

See also

dt_from_ts() for the reverse operation.

d1_common.date_time.dt_from_ts(ts, tz=None)

Convert POSIX timestamp to a timezone aware datetime.

Parameters
  • ts – int or float, optionally with fraction The number of seconds since Midnight, January 1st, 1970, UTC.

  • tz – datetime.tzinfo

    • If supplied: The dt is adjusted to that tz before being returned. It does not affect the ts, which is always in UTC.

    • If not supplied: the dt is returned in UTC.

Returns

datetime

Timezone aware datetime, in UTC.

See also

ts_from_dt() for the reverse operation.
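The round trip between datetimes and POSIX timestamps can be sketched with the standard library. The function names are hypothetical, and unlike the documented ts_from_dt(), this sketch always returns a float.

```python
import datetime


def ts_from_dt_sketch(dt):
    """Seconds since 1970-01-01T00:00:00 UTC; naive input is taken as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=datetime.timezone.utc)
    return dt.timestamp()


def dt_from_ts_sketch(ts, tz=None):
    """Timezone aware datetime for a POSIX timestamp, in UTC unless tz is given."""
    return datetime.datetime.fromtimestamp(ts, tz or datetime.timezone.utc)
```

Supplying tz changes only how the result is expressed; the point in time that the timestamp identifies stays the same.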

d1_common.date_time.http_datetime_str_from_dt(dt)

Format datetime to HTTP Full Date format.

Parameters

dt – datetime

  • tz-aware: Used in the formatted string.

  • tz-naive: Assumed to be in UTC.

Returns

str

The returned format is a fixed-length subset of that defined by RFC 1123 and is the preferred format for use in the HTTP Date header. E.g.:

Sat, 02 Jan 1999 03:04:05 GMT
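The same format can be produced with email.utils from the standard library. This is a sketch of the formatting rule, not the d1_common implementation, and the function name is hypothetical.

```python
import datetime
import email.utils


def http_datetime_str_sketch(dt):
    """Format a datetime as an RFC 1123 date with a literal GMT suffix."""
    if dt.tzinfo is None:
        # A naive datetime is assumed to be in UTC.
        dt = dt.replace(tzinfo=datetime.timezone.utc)
    # format_datetime() only accepts usegmt=True for datetimes in UTC.
    dt = dt.astimezone(datetime.timezone.utc)
    return email.utils.format_datetime(dt, usegmt=True)


example_str = http_datetime_str_sketch(datetime.datetime(1999, 1, 2, 3, 4, 5))
```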

d1_common.date_time.xsd_datetime_str_from_dt(dt)

Format datetime to a xs:dateTime string.

Parameters

dt – datetime

  • tz-aware: Used in the formatted string.

  • tz-naive: Assumed to be in UTC.

Returns

str

The returned format can be used as the date in xs:dateTime XML elements. It will be on the form YYYY-MM-DDTHH:MM:SS.mmm+00:00.

d1_common.date_time.dt_from_http_datetime_str(http_full_datetime)

Parse HTTP Full Date formats and return as datetime.

Parameters

http_full_datetime – str Each of the allowed formats is supported:

  • Sun, 06 Nov 1994 08:49:37 GMT ; RFC 822, updated by RFC 1123

  • Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036

  • Sun Nov 6 08:49:37 1994 ; ANSI C’s asctime() format

HTTP Full Dates are always in UTC.

Returns

datetime

The returned datetime is always timezone aware and in UTC.
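One way to accept all three formats is to try a strptime() pattern for each in turn. The sketch below assumes an English (or C) locale for the day and month names; the function name and the format tuple are hypothetical.

```python
import datetime

# One strptime pattern per allowed HTTP Full Date format.
HTTP_DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",   # RFC 822, updated by RFC 1123
    "%A, %d-%b-%y %H:%M:%S GMT",   # RFC 850
    "%a %b %d %H:%M:%S %Y",        # ANSI C asctime()
)


def dt_from_http_datetime_str_sketch(http_full_datetime):
    """Parse an HTTP Full Date and return a timezone aware datetime in UTC."""
    for date_format in HTTP_DATE_FORMATS:
        try:
            naive = datetime.datetime.strptime(http_full_datetime, date_format)
        except ValueError:
            continue
        # HTTP Full Dates are always in UTC, so the timezone can be attached.
        return naive.replace(tzinfo=datetime.timezone.utc)
    raise ValueError("Not an HTTP Full Date: {!r}".format(http_full_datetime))
```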

d1_common.date_time.dt_from_iso8601_str(iso8601_str)

Parse ISO8601 formatted datetime string.

Parameters

iso8601_str – str ISO 8601 formatted datetime.

  • tz-aware: Used in the formatted string.

  • tz-naive: Assumed to be in UTC.

  • Partial strings are accepted as long as they’re on the general form. Everything from just 2014 to 2006-10-20T15:34:56.123+02:30 will work. The sections that are not present in the string are set to zero in the returned datetime.

  • See test_iso8601.py in the iso8601 package for examples.

Returns

datetime

The returned datetime is always timezone aware and in UTC.

Raises

d1_common.date_time.iso8601.ParseError – If iso8601_str is not on the general form of ISO 8601.

d1_common.date_time.normalize_datetime_to_utc(dt)

Adjust datetime to UTC.

Apply the timezone offset to the datetime and set the timezone to UTC.

This is a no-op if the datetime is already in UTC.

Parameters

dt – datetime

  • tz-aware: Used in the formatted string.

  • tz-naive: Assumed to be in UTC.

Returns

datetime

The returned datetime is always timezone aware and in UTC.

Notes

This forces a new object to be returned, which fixes an issue with serialization to XML in PyXB. PyXB uses a mixin together with datetime to handle the XML xs:dateTime. That type keeps track of timezone information included in the original XML doc, which conflicts if we return it here as part of a datetime mixin.

See also

cast_naive_datetime_to_tz()

d1_common.date_time.cast_naive_datetime_to_tz(dt, tz=UTC)

If datetime is tz-naive, set it to tz. If datetime is tz-aware, return it unmodified.

Parameters
  • dt – datetime tz-naive or tz-aware datetime.

  • tz – datetime.tzinfo The timezone to which to adjust tz-naive datetime.

Returns

datetime

tz-aware datetime.

Warning

This will change the actual moment in time that is represented if the datetime is naive and represents a date and time not in tz.

See also

normalize_datetime_to_utc()
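The difference between normalizing and casting is easy to miss: normalizing keeps the moment in time and changes the timezone in which it is expressed, while casting keeps the wall-clock values and attaches a timezone to them. A minimal sketch with hypothetical function names:

```python
import datetime

UTC_TZ = datetime.timezone.utc
PLUS_TWO = datetime.timezone(datetime.timedelta(hours=2))


def normalize_to_utc_sketch(dt):
    """Express the same moment in UTC; naive input is assumed to be UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=UTC_TZ)
    return dt.astimezone(UTC_TZ)


def cast_naive_to_tz_sketch(dt, tz=UTC_TZ):
    """Attach tz to a naive datetime; return an aware datetime unmodified."""
    return dt.replace(tzinfo=tz) if dt.tzinfo is None else dt


aware = datetime.datetime(2020, 1, 1, 12, 0, tzinfo=PLUS_TWO)
naive = datetime.datetime(2020, 1, 1, 12, 0)

normalized = normalize_to_utc_sketch(aware)      # 10:00 UTC, same moment
cast = cast_naive_to_tz_sketch(naive, PLUS_TWO)  # 12:00 at UTC+2
```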

d1_common.date_time.strip_timezone(dt)

Make datetime tz-naive by stripping away any timezone information.

Parameters
  • dt – datetime

    • tz-aware: The timezone information is removed.

    • tz-naive: Returned unchanged.

Returns

datetime

tz-naive datetime.

d1_common.date_time.utc_now()

Returns: tz-aware datetime: The current local date and time adjusted to the UTC timezone.

Notes

  • Local time is retrieved from the local machine clock.

  • Relies on correctly set timezone on the local machine.

  • Relies on current tables for Daylight Saving periods.

  • Local machine timezone can be checked with: $ date +'%z %Z'.

d1_common.date_time.date_utc_now_iso()

Returns:

str: The current local date as an ISO 8601 string in the UTC timezone

Does not include the time.

d1_common.date_time.local_now()

Returns:

tz-aware datetime : The current local date and time in the local timezone

d1_common.date_time.local_now_iso()

Returns:

str : The current local date and time as an ISO 8601 string in the local timezone

d1_common.date_time.to_iso8601_utc(dt)

Args: dt: datetime.

Returns: str: ISO 8601 string in the UTC timezone

d1_common.date_time.create_utc_datetime(*datetime_parts)

Create a datetime with timezone set to UTC.

Parameters

tuple of int – year, month, day, hour, minute, second, microsecond

Returns

datetime

d1_common.date_time.round_to_nearest(dt, n_round_sec=1.0)

Round datetime up or down to nearest divisor.

Round datetime up or down to nearest number of seconds that divides evenly by the divisor.

Any timezone is preserved but ignored in the rounding.

Parameters
  • dt – datetime

  • n_round_sec – int or float Divisor for rounding

Examples

  • n_round_sec = 0.1: nearest 10th of a second.

  • n_round_sec = 1: nearest second.

  • n_round_sec = 30: nearest half minute.
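A sketch of rounding to the nearest divisor, under the assumption that the datetime is measured from a fixed reference point that carries the same tzinfo. The function name is hypothetical, and values that fall exactly halfway between two multiples follow Python's round(), which may differ from the library.

```python
import datetime


def round_to_nearest_sketch(dt, n_round_sec=1.0):
    """Round dt to the nearest multiple of n_round_sec seconds."""
    # The reference shares dt's tzinfo, so the timezone is preserved in the
    # result and plays no part in the arithmetic.
    reference = datetime.datetime(1970, 1, 1, tzinfo=dt.tzinfo)
    seconds = (dt - reference).total_seconds()
    rounded_seconds = round(seconds / n_round_sec) * n_round_sec
    return reference + datetime.timedelta(seconds=rounded_seconds)
```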

d1_common.env module

Utilities for handling DataONE environments.

d1_common.env.get_d1_env_keys()

Get the DataONE env dict keys in preferred order.

Returns

DataONE env dict keys

Return type

list

d1_common.env.get_d1_env(env_key)

Get the values required in order to connect to a DataONE environment.

Returns

Values required in order to connect to a DataONE environment.

Return type

dict

d1_common.env.get_d1_env_by_base_url(cn_base_url)

Given the BaseURL for a CN, return the DataONE environment dict for the CN’s environment.

d1_common.logging_context module

Context manager that enables temporary changes in logging level.

Source: https://docs.python.org/2/howto/logging-cookbook.html

class d1_common.logging_context.LoggingContext(logger, level=None, handler=None, close=True)

Bases: object

Logging Context Manager.

__init__(logger, level=None, handler=None, close=True)

Args: logger: logger Logger for which to change the logging level.

level:

Temporary logging level.

handler:

Optional logging handler to use. Supplying a new handler allows temporarily changing the logging format as well.

close:

Automatically close handler (if supplied).

d1_common.multipart module

Utilities for handling MIME Multipart documents.

d1_common.multipart.parse_response(response, encoding='utf-8')

Parse a multipart Requests.Response into a tuple of BodyPart objects.

Parameters
  • response – Requests.Response

  • encoding – The parser will assume that any text in the HTML body is encoded with this encoding when decoding it for use in the text attribute.

Returns

tuple of BodyPart

Members: headers (CaseInsensitiveDict), content (bytes), text (Unicode), encoding (str).

d1_common.multipart.parse_str(mmp_bytes, content_type, encoding='utf-8')

Parse multipart document bytes into a tuple of BodyPart objects.

Parameters
  • mmp_bytes – bytes Multipart document.

  • content_type – str Must be on the form, multipart/form-data; boundary=<BOUNDARY>, where <BOUNDARY> is the string that separates the parts of the multipart document in mmp_bytes. In HTTP requests and responses, it is passed in the Content-Type header.

  • encoding – str The coding used for the text in the HTML body.

Returns

tuple of BodyPart

Members: headers (CaseInsensitiveDict), content (bytes), text (Unicode), encoding (str).

d1_common.multipart.normalize(body_part_tup)

Normalize a tuple of BodyPart objects to a string.

Normalization is done by sorting the body_parts by the Content-Disposition headers, which are typically on the form, form-data; name="name_of_part".

d1_common.multipart.is_multipart(header_dict)

Parameters

header_dict – CaseInsensitiveDict

Returns

True if header_dict has a Content-Type key (case insensitive) with value that begins with ‘multipart’.

Return type

bool
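The header check can be sketched with a plain dict standing in for the CaseInsensitiveDict. The function name is hypothetical, and lowercasing the header value before the comparison is an assumption of this example.

```python
def is_multipart_sketch(header_dict):
    """Return True if a Content-Type header starts with 'multipart'."""
    for key, value in header_dict.items():
        # A plain dict is case sensitive, so the key is compared in lower case.
        if key.lower() == "content-type":
            return value.strip().lower().startswith("multipart")
    return False
```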

d1_common.node module

Utilities for handling the DataONE Node and NodeList types.

d1_common.node.pyxb_to_dict(node_list_pyxb)

Returns

Representation of node_list_pyxb, keyed on the Node identifier (urn:node:*).

Return type

dict

Example:

{
  u'urn:node:ARCTIC': {
    'base_url': u'https://arcticdata.io/metacat/d1/mn',
    'description': u'The US National Science Foundation...',
    'name': u'Arctic Data Center',
    'ping': None,
    'replicate': 0,
    'state': u'up',
    'synchronize': 1,
    'type': u'mn'
  },
  u'urn:node:BCODMO': {
    'base_url': u'https://www.bco-dmo.org/d1/mn',
    'description': u'Biological and Chemical Oceanography Data...',
    'name': u'Biological and Chemical Oceanography Data...',
    'ping': None,
    'replicate': 0,
    'state': u'up',
    'synchronize': 1,
    'type': u'mn'
  },
}

d1_common.object_format_cache module

Local cache of the DataONE ObjectFormatList for a given DataONE environment.

As part of the metadata for a science object, DataONE stores a type identifier called an ObjectFormatID. The ObjectFormatList allows mapping ObjectFormatIDs to filename extensions and content type.

The cache is stored in a file and is automatically updated periodically.

Simple methods for looking up elements of the ObjectFormatList are provided.

Examples

Section of an ObjectFormatList:

{
  '-//ecoinformatics.org//eml-access-2.0.0beta4//EN': {
    'extension': 'xml',
    'format_name': 'Ecological Metadata Language, Access module, version 2.0.0beta4',
    'format_type': 'METADATA',
    'media_type': {
      'name': 'text/xml',
      'property_list': []
    }
  },
  '-//ecoinformatics.org//eml-access-2.0.0beta6//EN': {
    'extension': 'xml',
    'format_name': 'Ecological Metadata Language, Access module, version 2.0.0beta6',
    'format_type': 'METADATA',
    'media_type': {
      'name': 'text/xml',
      'property_list': []
    }
  },
}

class d1_common.object_format_cache.Singleton

Bases: object

class d1_common.object_format_cache.ObjectFormatListCache(cn_base_url='https://cn.dataone.org/cn', object_format_cache_path='/home/docs/checkouts/readthedocs.org/user_builds/dataone-python/checkouts/latest/lib_common/src/d1_common/object_format_cache.json', cache_refresh_period=datetime.timedelta(days=30), lock_file_path='/tmp/object_format_cache.lock')

Bases: d1_common.object_format_cache.Singleton

__init__(cn_base_url='https://cn.dataone.org/cn', object_format_cache_path='/home/docs/checkouts/readthedocs.org/user_builds/dataone-python/checkouts/latest/lib_common/src/d1_common/object_format_cache.json', cache_refresh_period=datetime.timedelta(days=30), lock_file_path='/tmp/object_format_cache.lock')
Parameters
  • cn_base_url – str: BaseURL for a CN in the DataONE Environment being targeted.

    This can usually be left at the production root, even if running in other environments.

  • object_format_cache_path – str Path to a file in which the cached ObjectFormatList is or will be stored.

    By default, the path is set to a cache file that is distributed together with this module.

    The directories must exist. The file is created if it doesn’t exist. The file is recreated whenever needed. Paths under “/tmp” will typically cause the file to have to be recreated after reboot while paths under “/var/tmp/” typically persist over reboot.

  • cache_refresh_period – datetime.timedelta or None Period of time in which to use the cached ObjectFormatList before refreshing it by downloading a new copy from the CN. The ObjectFormatList does not change often, so a month is probably a sensible default.

    Set to None to disable refresh. When refresh is disabled, object_format_cache_path must point to an existing file.

property object_format_dict

Direct access to a native Python dict representing cached ObjectFormatList.

get_content_type(format_id, default=None)
get_filename_extension(format_id, default=None)
refresh_cache()

Force a refresh of the local cached version of the ObjectFormatList.

This is typically not required, as the cache is refreshed automatically after the configured cache_refresh_period.

is_valid_format_id(format_id)
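The lookup methods can be pictured as plain dict accesses on the structure shown in the Examples section of this module. The functions below are a hypothetical sketch; in particular, returning default for an unknown format_id is an assumption about how the default argument is used.

```python
# One entry in the shape of the ObjectFormatList example for this module.
OBJECT_FORMATS = {
    "-//ecoinformatics.org//eml-access-2.0.0beta4//EN": {
        "extension": "xml",
        "format_name": "Ecological Metadata Language, Access module, "
        "version 2.0.0beta4",
        "format_type": "METADATA",
        "media_type": {"name": "text/xml", "property_list": []},
    },
}


def get_content_type_sketch(format_id, default=None):
    try:
        return OBJECT_FORMATS[format_id]["media_type"]["name"]
    except KeyError:
        return default


def get_filename_extension_sketch(format_id, default=None):
    try:
        return OBJECT_FORMATS[format_id]["extension"]
    except KeyError:
        return default


def is_valid_format_id_sketch(format_id):
    return format_id in OBJECT_FORMATS
```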

d1_common.replication_policy module

Utilities for handling the DataONE ReplicationPolicy type.

The Replication Policy is an optional section of the System Metadata which may be used to enable or disable replication, set the desired number of replicas and specify remote MNs to either prefer or block as replication targets.

Examples:

ReplicationPolicy:

<replicationPolicy replicationAllowed="true" numberReplicas="3">
  <!--Zero or more repetitions:-->
  <preferredMemberNode>node1</preferredMemberNode>
  <preferredMemberNode>node2</preferredMemberNode>
  <preferredMemberNode>node3</preferredMemberNode>
  <!--Zero or more repetitions:-->
  <blockedMemberNode>node4</blockedMemberNode>
  <blockedMemberNode>node5</blockedMemberNode>
</replicationPolicy>
d1_common.replication_policy.has_replication_policy(sysmeta_pyxb)

Args: sysmeta_pyxb: SystemMetadata PyXB object.

Returns: bool: True if SystemMetadata includes the optional ReplicationPolicy section.

d1_common.replication_policy.sysmeta_add_preferred(sysmeta_pyxb, node_urn)

Add a remote Member Node to the list of preferred replication targets to this System Metadata object.

Also remove the target MN from the list of blocked Member Nodes if present.

If the target MN is already in the preferred list and not in the blocked list, this function is a no-op.

Parameters
  • sysmeta_pyxb – SystemMetadata PyXB object. System Metadata in which to add the preferred replication target.

    If the System Metadata does not already have a Replication Policy, a default replication policy which enables replication is added and populated with the preferred replication target.

  • node_urn – str Node URN of the remote MN that will be added. On the form urn:node:MyMemberNode.

d1_common.replication_policy.sysmeta_add_blocked(sysmeta_pyxb, node_urn)

Add a remote Member Node to the list of blocked replication targets to this System Metadata object.

The blocked node will not be considered a possible replication target for the associated System Metadata.

Also remove the target MN from the list of preferred Member Nodes if present.

If the target MN is already in the blocked list and not in the preferred list, this function is a no-op.

Parameters
  • sysmeta_pyxb – SystemMetadata PyXB object. System Metadata in which to add the blocked replication target.

    If the System Metadata does not already have a Replication Policy, a default replication policy which enables replication is added and then populated with the blocked replication target.

  • node_urn – str Node URN of the remote MN that will be added. On the form urn:node:MyMemberNode.

d1_common.replication_policy.sysmeta_set_default_rp(sysmeta_pyxb)

Set a default, empty, Replication Policy.

This will clear any existing Replication Policy in the System Metadata.

The default Replication Policy disables replication and sets number of replicas to 0.

Parameters

sysmeta_pyxb – SystemMetadata PyXB object. System Metadata in which to set a default Replication Policy.

d1_common.replication_policy.normalize(rp_pyxb)

Normalize a ReplicationPolicy PyXB type in place.

The preferred and blocked lists are sorted alphabetically. As blocked nodes override preferred nodes, any node present in both lists is removed from the preferred list.

Parameters

rp_pyxb – ReplicationPolicy PyXB object The object will be normalized in place.

d1_common.replication_policy.is_preferred(rp_pyxb, node_urn)

Parameters
  • rp_pyxb – ReplicationPolicy PyXB object The object will be normalized in place.

  • node_urn – str Node URN of the remote MN for which to check preference.

Returns

True if node_urn is a preferred replica target.

As blocked nodes override preferred nodes, return False if node_urn is in both lists.

Return type

bool

d1_common.replication_policy.is_blocked(rp_pyxb, node_urn)

Parameters
  • rp_pyxb – ReplicationPolicy PyXB object The object will be normalized in place.

  • node_urn – str Node URN of the remote MN for which to check preference.

Returns

True if node_urn is a blocked replica target.

As blocked nodes override preferred nodes, return True if node_urn is in both lists.

Return type

bool
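The rule that blocked nodes override preferred nodes can be sketched on a plain dict in the layout that pyxb_to_dict() in this module documents. The function names are hypothetical and operate on dicts, not on PyXB objects.

```python
def normalize_rp_sketch(rp_dict):
    """Return a normalized copy: sorted lists, blocked overriding preferred."""
    blocked = set(rp_dict["blockedMemberNode"])
    preferred = set(rp_dict["preferredMemberNode"]) - blocked
    return {
        "allowed": rp_dict["allowed"],
        "num": rp_dict["num"],
        "blockedMemberNode": sorted(blocked),
        "preferredMemberNode": sorted(preferred),
    }


def is_blocked_sketch(rp_dict, node_urn):
    return node_urn in rp_dict["blockedMemberNode"]


def is_preferred_sketch(rp_dict, node_urn):
    # A node that appears in both lists counts as blocked, not preferred.
    return node_urn in rp_dict["preferredMemberNode"] and not is_blocked_sketch(
        rp_dict, node_urn
    )
```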

d1_common.replication_policy.are_equivalent_pyxb(a_pyxb, b_pyxb)

Check if two ReplicationPolicy objects are semantically equivalent.

The ReplicationPolicy objects are normalized before comparison.

Parameters

a_pyxb, b_pyxb – ReplicationPolicy PyXB objects to compare

Returns

True if the resulting policies for the two objects are semantically equivalent.

Return type

bool

d1_common.replication_policy.are_equivalent_xml(a_xml, b_xml)

Check if two ReplicationPolicy XML docs are semantically equivalent.

The ReplicationPolicy XML docs are normalized before comparison.

Parameters

a_xml, b_xml – ReplicationPolicy XML docs to compare

Returns

True if the resulting policies for the two objects are semantically equivalent.

Return type

bool

d1_common.replication_policy.add_preferred(rp_pyxb, node_urn)

Add a remote Member Node to the list of preferred replication targets.

Also remove the target MN from the list of blocked Member Nodes if present.

If the target MN is already in the preferred list and not in the blocked list, this function is a no-op.

Parameters
  • rp_pyxb – ReplicationPolicy PyXB object. Replication Policy in which to add the preferred replication target.

  • node_urn – str Node URN of the remote MN that will be added. On the form urn:node:MyMemberNode.

d1_common.replication_policy.add_blocked(rp_pyxb, node_urn)

Add a remote Member Node to the list of blocked replication targets.

Also remove the target MN from the list of preferred Member Nodes if present.

If the target MN is already in the blocked list and not in the preferred list, this function is a no-op.

Parameters
  • rp_pyxb – ReplicationPolicy PyXB object. Replication Policy in which to add the blocked replication target.

  • node_urn – str Node URN of the remote MN that will be added. On the form urn:node:MyMemberNode.

d1_common.replication_policy.pyxb_to_dict(rp_pyxb)

Convert ReplicationPolicy PyXB object to a normalized dict.

Parameters

rp_pyxb – ReplicationPolicy to convert.

Returns

Replication Policy as normalized dict.

Return type

dict

Example:

{
  'allowed': True,
  'num': 3,
  'blockedMemberNode': {'urn:node:NODE1', 'urn:node:NODE2', 'urn:node:NODE3'},
  'preferredMemberNode': {'urn:node:NODE4', 'urn:node:NODE5'},
}

d1_common.replication_policy.dict_to_pyxb(rp_dict)

Convert dict to ReplicationPolicy PyXB object.

Parameters

rp_dict – Native Python structure representing a Replication Policy.

Example:

{
  'allowed': True,
  'num': 3,
  'blockedMemberNode': {'urn:node:NODE1', 'urn:node:NODE2', 'urn:node:NODE3'},
  'preferredMemberNode': {'urn:node:NODE4', 'urn:node:NODE5'},
}

Returns

ReplicationPolicy PyXB object.

d1_common.resource_map module

Read and write DataONE OAI-ORE Resource Maps.

DataONE supports a system that allows relationships between Science Objects to be described. These relationships are stored in OAI-ORE Resource Maps.

This module provides functionality for the most common use cases when parsing and generating Resource Maps for use in DataONE.

For more information about how Resource Maps are used in DataONE, see:

https://releases.dataone.org/online/api-documentation-v2.0.1/design/DataPackage.html

Common RDF-XML namespaces:

dc: <http://purl.org/dc/elements/1.1/>
foaf: <http://xmlns.com/foaf/0.1/>
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
rdfs1: <http://www.w3.org/2001/01/rdf-schema#>
ore: <http://www.openarchives.org/ore/terms/>
dcterms: <http://purl.org/dc/terms/>
cito: <http://purl.org/spar/cito/>

Note

In order for Resource Maps to be recognized and indexed by DataONE, they must be created with formatId set to http://www.openarchives.org/ore/terms.

d1_common.resource_map.createSimpleResourceMap(ore_pid, scimeta_pid, sciobj_pid_list)

Create a simple OAI-ORE Resource Map with one Science Metadata document and any number of Science Data objects.

This creates a document that establishes an association between a Science Metadata object and any number of Science Data objects. The Science Metadata object contains information that is indexed by DataONE, allowing both the Science Metadata and the Science Data objects to be discoverable in DataONE Search. In search results, the objects will appear together and can be downloaded as a single package.

Parameters
  • ore_pid – str Persistent Identifier (PID) to use for the new Resource Map

  • scimeta_pid – str PID for an object that will be listed as the Science Metadata that is describing the Science Data objects.

  • sciobj_pid_list – list of str List of PIDs that will be listed as the Science Data objects that are being described by the Science Metadata.

Returns

OAI-ORE Resource Map

Return type

ResourceMap

d1_common.resource_map.createResourceMapFromStream(in_stream, base_url='https://cn.dataone.org/cn')

Create a simple OAI-ORE Resource Map with one Science Metadata document and any number of Science Data objects, using a stream of PIDs.

Parameters
  • in_stream – The first non-blank line is the PID of the resource map itself. Second line is the science metadata PID and remaining lines are science data PIDs.

    Example stream contents:

    PID_ORE_value
    sci_meta_pid_value
    data_pid_1
    data_pid_2
    data_pid_3
    
  • base_url – str Root of the DataONE environment in which the Resource Map will be used.

Returns

OAI-ORE Resource Map

Return type

ResourceMap
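The stream layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, not the library's implementation: split_pid_stream is a hypothetical helper, and it assumes one PID per line with blank lines ignored anywhere in the stream.

```python
import io


def split_pid_stream(in_stream):
    """Split a stream of PIDs into (ore_pid, scimeta_pid, scidata_pid_list).

    Hypothetical helper for illustration only. Assumes one PID per line and
    that blank lines anywhere in the stream can be skipped.
    """
    pids = [line.strip() for line in in_stream if line.strip()]
    if len(pids) < 2:
        raise ValueError("Need a Resource Map PID and a Science Metadata PID")
    # First PID: the Resource Map. Second: Science Metadata. Rest: Science Data.
    return pids[0], pids[1], pids[2:]


stream = io.StringIO(
    "\nPID_ORE_value\nsci_meta_pid_value\ndata_pid_1\ndata_pid_2\ndata_pid_3\n"
)
ore_pid, scimeta_pid, scidata_pid_list = split_pid_stream(stream)
```

The three values correspond to the ore_pid, scimeta_pid and scidata_pid_list arguments of the ResourceMap class.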

class d1_common.resource_map.ResourceMap(ore_pid=None, scimeta_pid=None, scidata_pid_list=None, base_url='https://cn.dataone.org/cn', api_major=2, ore_software_id='DataONE.org Python ITK 3.4.5', *args, **kwargs)

Bases: rdflib.graph.ConjunctiveGraph

OAI-ORE Resource Map.

__init__(ore_pid=None, scimeta_pid=None, scidata_pid_list=None, base_url='https://cn.dataone.org/cn', api_major=2, ore_software_id='DataONE.org Python ITK 3.4.5', *args, **kwargs)

Create an OAI-ORE Resource Map.

Parameters
  • ore_pid – str Persistent Identifier (PID) to use for the new Resource Map

  • scimeta_pid – str PID for an object that will be listed as the Science Metadata that is describing the Science Data objects.

  • scidata_pid_list – list of str List of PIDs that will be listed as the Science Data objects that are being described by the Science Metadata.

  • base_url – str Root of the DataONE environment in which the Resource Map will be used.

  • api_major – The DataONE API version to use for the DataONE Resolve API. Clients call the Resolve API to get a list of download locations for the objects in the Resource Map.

  • ore_software_id – str Optional string which identifies the software that was used for creating the Resource Map. If specified, should be in the form of a UserAgent string.

  • args and kwargs – Optional arguments forwarded to rdflib.ConjunctiveGraph.__init__().

initialize(pid, ore_software_id='DataONE.org Python ITK 3.4.5')

Create the basic ORE document structure.

serialize_to_transport(doc_format='xml', *args, **kwargs)

Serialize ResourceMap to UTF-8 encoded XML document.

Parameters
  • doc_format – str One of: xml, n3, turtle, nt, pretty-xml, trix, trig and nquads.

  • args and kwargs – Optional arguments forwarded to rdflib.ConjunctiveGraph.serialize().

Returns

UTF-8 encoded XML doc.

Return type

bytes

Note

Only the default, “xml”, is automatically indexed by DataONE.

serialize_to_display(doc_format='pretty-xml', *args, **kwargs)

Serialize ResourceMap to an XML doc that is pretty printed for display.

Parameters
  • doc_format – str One of: xml, n3, turtle, nt, pretty-xml, trix, trig and nquads.

  • args and kwargs – Optional arguments forwarded to rdflib.ConjunctiveGraph.serialize().

Returns

Pretty printed Resource Map XML doc

Return type

str

Note

Only the default, “xml”, is automatically indexed by DataONE.

deserialize(*args, **kwargs)

Deserialize Resource Map XML doc.

The source is specified using one of source, location, file or data.

Parameters
  • source – InputSource, file-like object, or string In the case of a string the string is the location of the source.

  • location – str String indicating the relative or absolute URL of the source. Graph's absolutize method is used if a relative location is specified.

  • file – file-like object

  • data – str The document to be parsed.

  • format – str Used if format can not be determined from source. Defaults to rdf/xml. Format support can be extended with plugins.

    Built-in: xml, n3, nt, trix, rdfa

  • publicID – str Logical URI to use as the document base. If None specified the document location is used (at least in the case where there is a document location).

Raises

xml.sax.SAXException based exception – On parse error.

getAggregation()

Returns:

str : URIRef of the Aggregation entity

getObjectByPid(pid)
Parameters

pid – str

Returns

URIRef of the entry identified by pid.

Return type

str

addResource(pid)

Add a resource to the Resource Map.

Parameters

pid – str

setDocuments(documenting_pid, documented_pid)

Add a CiTO, the Citation Typing Ontology, triple asserting that documenting_pid documents documented_pid.

Adds assertion: documenting_pid cito:documents documented_pid

Parameters
  • documenting_pid – str PID of a Science Object that documents documented_pid.

  • documented_pid – str PID of a Science Object that is documented by documenting_pid.

setDocumentedBy(documented_pid, documenting_pid)

Add a CiTO, the Citation Typing Ontology, triple asserting that documented_pid isDocumentedBy documenting_pid.

Adds assertion: documented_pid cito:isDocumentedBy documenting_pid

Parameters
  • documented_pid – str PID of a Science Object that is documented by documenting_pid.

  • documenting_pid – str PID of a Science Object that documents documented_pid.

addMetadataDocument(pid)

Add a Science Metadata document.

Parameters

pid – str PID of a Science Metadata object.

addDataDocuments(scidata_pid_list, scimeta_pid=None)

Add Science Data object(s)

Parameters
  • scidata_pid_list – list of str List of one or more PIDs of Science Data objects

  • scimeta_pid – str PID of a Science Metadata object that documents the Science Data objects.

getResourceMapPid()

Returns:

str : PID of the Resource Map itself.

getAllTriples()

Returns:

list of tuples : Each tuple holds a subject, predicate, object triple

getAllPredicates()

Returns: list of str: All unique predicates.

Notes

Equivalent SPARQL:

SELECT DISTINCT ?p
WHERE {
  ?s ?p ?o .
}

getSubjectObjectsByPredicate(predicate)
Parameters

predicate – str Predicate for which to return subject, object tuples.

Returns

All subject/objects with predicate.

Return type

list of subject, object tuples

Notes

Equivalent SPARQL:

SELECT DISTINCT ?s ?o
WHERE {
  ?s <predicate> ?o .
}

getAggregatedPids()

Returns: list of str: All aggregated PIDs.

Notes

Equivalent SPARQL:

SELECT ?pid
WHERE {
  ?s ore:aggregates ?o .
  ?o dcterms:identifier ?pid .
}

getAggregatedScienceMetadataPids()

Returns: list of str: All Science Metadata PIDs.

Notes

Equivalent SPARQL:

SELECT DISTINCT ?pid
WHERE {
  ?s ore:aggregates ?o .
  ?o cito:documents ?o2 .
  ?o dcterms:identifier ?pid .
}

getAggregatedScienceDataPids()

Returns: list of str: All Science Data PIDs.

Notes

Equivalent SPARQL:

SELECT DISTINCT ?pid
WHERE {
  ?s ore:aggregates ?o .
  ?o cito:isDocumentedBy ?o2 .
  ?o dcterms:identifier ?pid .
}

asGraphvizDot(stream)

Serialize the graph to .DOT format for ingestion in Graphviz.

Args: stream: file-like object open for writing that will receive the resulting document.

parseDoc(doc_str, format='xml')

Parse an OAI-ORE Resource Map document.

See Also: rdflib.ConjunctiveGraph.parse for documentation on arguments.

d1_common.revision module

Utilities for working with revision / obsolescence chains.

d1_common.revision.get_identifiers(sysmeta_pyxb)

Get set of identifiers that provide revision context for SciObj.

Returns: tuple: PID, SID, OBSOLETES_PID, OBSOLETED_BY_PID

d1_common.revision.topological_sort(unsorted_dict)

Sort objects by dependency.

Sort a dict of obsoleting PID to obsoleted PID to a list of PIDs in order of obsolescence.

Parameters

unsorted_dict – dict Dict that holds obsolescence information. Each key/value pair establishes that the PID in key identifies an object that obsoletes an object identified by the PID in value.

Returns

sorted_list: A list of PIDs ordered so that all PIDs that obsolete an object are listed after the object they obsolete.

unconnected_dict: A dict of PID to obsoleted PID of any objects that could not be added to a revision chain. These items will have obsoletes PIDs that directly or indirectly reference a PID that could not be sorted.

Return type

tuple of sorted_list, unconnected_dict

Notes

unsorted_dict is modified by the sort and on return holds any items that could not be sorted.

The sort works by repeatedly iterating over an unsorted list of PIDs and moving PIDs to the sorted list as they become available. A PID is available to be moved to the sorted list if it does not obsolete a PID or if the PID it obsoletes is already in the sorted list.
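The loop described above can be sketched in a few lines. This is an illustration of the technique, not the library's code: topological_sort_sketch is a hypothetical name, a value of None is assumed to mean that the PID obsoletes nothing, and the sketch works on a copy instead of modifying its argument.

```python
def topological_sort_sketch(unsorted_dict):
    """Return (sorted_list, unconnected_dict) for a dict of PID -> obsoleted PID.

    Illustrative only. Assumes a value of None means the PID obsoletes nothing.
    Works on a copy, so the caller's dict is left untouched.
    """
    remaining = dict(unsorted_dict)
    sorted_list = []
    sorted_set = set()
    moved = True
    # Keep sweeping over the unsorted PIDs until a full pass moves nothing.
    while moved:
        moved = False
        for pid, obsoletes_pid in list(remaining.items()):
            # A PID is available if it obsoletes nothing, or if the PID it
            # obsoletes has already been placed in the sorted list.
            if obsoletes_pid is None or obsoletes_pid in sorted_set:
                sorted_list.append(pid)
                sorted_set.add(pid)
                del remaining[pid]
                moved = True
    return sorted_list, remaining


chain = {"v3": "v2", "v1": None, "v2": "v1", "orphan": "missing"}
sorted_list, unconnected_dict = topological_sort_sketch(chain)
```

Here "orphan" obsoletes a PID that never appears as a key, so it cannot be connected to a chain and is returned in the second element of the tuple.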

d1_common.revision.get_pids_in_revision_chain(client, did)

Parameters
  • client – d1_client.cnclient.CoordinatingNodeClient or d1_client.mnclient.MemberNodeClient.

  • did – str SID or a PID of any object in a revision chain.

Returns

All PIDs in the chain. The returned list is in the same order as the chain. The initial PID is typically obtained by resolving a SID. If the given PID is not in a chain, a list containing the single object is returned.

Return type

list of str

d1_common.revision.revision_list_to_obsoletes_dict(revision_list)

Args: revision_list: list of tuple Each tuple holds PID, SID, OBSOLETES_PID, OBSOLETED_BY_PID.

Returns: dict: Dict of obsoleted PID to obsoleting PID.

d1_common.revision.revision_list_to_obsoleted_by_dict(revision_list)

Args: revision_list: list of tuple Each tuple holds PID, SID, OBSOLETES_PID, OBSOLETED_BY_PID.

Returns: dict: Dict of obsoleting PID to obsoleted PID.

d1_common.system_metadata module

Utilities for handling the DataONE SystemMetadata type.

DataONE API methods such as MNStorage.create() require a Science Object and System Metadata pair.

Examples

Example v2 SystemMetadata XML document with all optional values included:

<v2:systemMetadata xmlns:v2="http://ns.dataone.org/service/types/v2.0">
  <!--Optional:-->
  <serialVersion>11</serialVersion>

  <identifier>string</identifier>
  <formatId>string</formatId>
  <size>11</size>
  <checksum algorithm="string">string</checksum>

  <!--Optional:-->
  <submitter>string</submitter>
  <rightsHolder>string</rightsHolder>

  <!--Optional:-->
  <accessPolicy>
    <!--1 or more repetitions:-->
    <allow>
      <!--1 or more repetitions:-->
      <subject>string</subject>
      <!--1 or more repetitions:-->
      <permission>read</permission>
    </allow>
  </accessPolicy>

  <!--Optional:-->
  <replicationPolicy replicationAllowed="true" numberReplicas="3">
    <!--Zero or more repetitions:-->
    <preferredMemberNode>string</preferredMemberNode>
    <!--Zero or more repetitions:-->
    <blockedMemberNode>string</blockedMemberNode>
  </replicationPolicy>

  <!--Optional:-->
  <obsoletes>string</obsoletes>
  <obsoletedBy>string</obsoletedBy>
  <archived>true</archived>
  <dateUploaded>2014-09-18T17:18:33</dateUploaded>
  <dateSysMetadataModified>2006-08-19T11:27:14-06:00</dateSysMetadataModified>
  <originMemberNode>string</originMemberNode>
  <authoritativeMemberNode>string</authoritativeMemberNode>

  <!--Zero or more repetitions:-->
  <replica>
    <replicaMemberNode>string</replicaMemberNode>
    <replicationStatus>failed</replicationStatus>
    <replicaVerified>2013-05-21T19:02:49-06:00</replicaVerified>
  </replica>

  <!--Optional:-->
  <seriesId>string</seriesId>

  <!--Optional:-->
  <mediaType name="string">
    <!--Zero or more repetitions:-->
    <property name="string">string</property>
  </mediaType>

  <!--Optional:-->
  <fileName>string</fileName>
</v2:systemMetadata>

d1_common.system_metadata.is_sysmeta_pyxb(sysmeta_pyxb)

Args: sysmeta_pyxb: Object that may or may not be a SystemMetadata PyXB object.

Returns

  • True if sysmeta_pyxb is a SystemMetadata PyXB object.

  • False if sysmeta_pyxb is not a PyXB object or is a PyXB object of a type other than SystemMetadata.

Return type

bool

d1_common.system_metadata.normalize_in_place(sysmeta_pyxb, reset_timestamps=False, reset_filename=False)

Normalize SystemMetadata PyXB object in-place.

Parameters
  • sysmeta_pyxb – SystemMetadata PyXB object to normalize.

  • reset_timestamps – bool True: Timestamps in the SystemMetadata are set to a standard value so that objects that are compared after normalization register as equivalent if only their timestamps differ.

Notes

The SystemMetadata is normalized by removing any redundant information and ordering all sections where there are no semantics associated with the order. The normalized SystemMetadata is intended to be semantically equivalent to the un-normalized one.

d1_common.system_metadata.are_equivalent_pyxb(a_pyxb, b_pyxb, ignore_timestamps=False, ignore_filename=False)

Determine if SystemMetadata PyXB objects are semantically equivalent.

Normalize then compare SystemMetadata PyXB objects for equivalency.

Parameters
  • a_pyxb, b_pyxb – SystemMetadata PyXB objects to compare

  • ignore_timestamps – bool True: Timestamps are ignored during the comparison.

  • ignore_filename – bool True: FileName elements are ignored during the comparison.

    This is necessary in cases where GMN returns a generated filename because one was not provided in the SysMeta.

Returns

True if SystemMetadata PyXB objects are semantically equivalent.

Return type

bool

Notes

The SystemMetadata is normalized by removing any redundant information and ordering all sections where there are no semantics associated with the order. The normalized SystemMetadata is intended to be semantically equivalent to the un-normalized one.

d1_common.system_metadata.are_equivalent_xml(a_xml, b_xml, ignore_timestamps=False)

Determine if two SystemMetadata XML docs are semantically equivalent.

Normalize then compare SystemMetadata XML docs for equivalency.

Parameters
  • a_xml, b_xml – bytes UTF-8 encoded SystemMetadata XML docs to compare

  • ignore_timestamps – bool True: Timestamps in the SystemMetadata are ignored so that objects that are compared register as equivalent if only their timestamps differ.

Returns

True if SystemMetadata XML docs are semantically equivalent.

Return type

bool

Notes

The SystemMetadata is normalized by removing any redundant information and ordering all sections where there are no semantics associated with the order. The normalized SystemMetadata is intended to be semantically equivalent to the un-normalized one.

d1_common.system_metadata.clear_elements(sysmeta_pyxb, clear_replica=True, clear_serial_version=True)

clear_replica causes any replica information to be removed from the object.

Clearing it makes comparisons ignore any differences in replica information, as this information is often different between MN and CN.

d1_common.system_metadata.update_elements(dst_pyxb, src_pyxb, el_list)

Copy elements specified in el_list from src_pyxb to dst_pyxb.

Only elements that are children of root are supported. See SYSMETA_ROOT_CHILD_LIST.

If an element in el_list does not exist in src_pyxb, it is removed from dst_pyxb.

d1_common.system_metadata.generate_system_metadata_pyxb(pid, format_id, sciobj_stream, submitter_str, rights_holder_str, authoritative_mn_urn, sid=None, obsoletes_pid=None, obsoleted_by_pid=None, is_archived=False, serial_version=1, uploaded_datetime=None, modified_datetime=None, file_name=None, origin_mn_urn=None, is_private=False, access_list=None, media_name=None, media_property_list=None, is_replication_allowed=False, prefered_mn_list=None, blocked_mn_list=None, pyxb_binding=None)

Generate a System Metadata PyXB object

Parameters
  • pid

  • format_id

  • sciobj_stream

  • submitter_str

  • rights_holder_str

  • authoritative_mn_urn

  • pyxb_binding

  • sid

  • obsoletes_pid

  • obsoleted_by_pid

  • is_archived

  • serial_version

  • uploaded_datetime

  • modified_datetime

  • file_name

  • origin_mn_urn

  • access_list

  • is_private

  • media_name

  • media_property_list

  • is_replication_allowed

  • prefered_mn_list

  • blocked_mn_list

Returns

systemMetadata PyXB object

d1_common.system_metadata.gen_checksum_and_size(sciobj_stream)
d1_common.system_metadata.gen_access_policy(pyxb_binding, sysmeta_pyxb, is_private, access_list)
d1_common.system_metadata.gen_replication_policy(pyxb_binding, prefered_mn_list=None, blocked_mn_list=None, is_replication_allowed=False)
d1_common.system_metadata.gen_media_type(pyxb_binding, media_name, media_property_list=None)

d1_common.type_conversions module

Utilities for handling the DataONE types.

  • Handle conversions between XML representations used in the D1 Python stack.

  • Handle conversions between v1 and v2 DataONE XML types.

The DataONE Python stack uses the following representations for the DataONE API XML docs:

  • As native Unicode str, typically “pretty printed” with indentations, when formatted for display.

  • As UTF-8 encoded bytes when sending or receiving over the network, or loading or saving as files.

  • Schema validation and manipulation in Python code as PyXB binding objects.

  • General processing as ElementTrees.

To allow conversions between all representations without implementing a separate conversion for each combination of input and output representation, a “hub and spokes” model is used. Native Unicode str was selected as the “hub” representation because:

  • PyXB provides translation to/from string and DOM.

  • ElementTree provides translation to/from string.
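The hub-and-spokes idea can be sketched with the standard library alone. The helper names below are hypothetical and are not the functions of this module, and the PyXB spoke is left out because PyXB is not part of the standard library. Each representation only needs a converter to and from str, and any other conversion is composed from two of those.

```python
import xml.etree.ElementTree as ET


# Spoke: UTF-8 bytes <-> str (the hub).
def bytes_to_str(xml_bytes, encoding="utf-8"):
    return xml_bytes.decode(encoding)


def str_to_bytes(xml_str, encoding="utf-8"):
    return xml_str.encode(encoding)


# Spoke: ElementTree <-> str (the hub).
def str_to_etree_sketch(xml_str):
    return ET.ElementTree(ET.fromstring(xml_str))


def etree_to_str_sketch(etree_obj):
    return ET.tostring(etree_obj.getroot(), encoding="unicode")


# A conversion between two spokes is composed by passing through the hub,
# so no direct bytes <-> ElementTree converter has to be written.
def bytes_to_etree(xml_bytes):
    return str_to_etree_sketch(bytes_to_str(xml_bytes))


doc_bytes = b"<identifier>my_pid</identifier>"
tree = bytes_to_etree(doc_bytes)
round_trip = str_to_bytes(etree_to_str_sketch(tree))
```

With N representations this needs 2N converters instead of N × (N − 1) direct ones.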

d1_common.type_conversions.get_version_tag_by_pyxb_binding(pyxb_binding)

Map PyXB binding to DataONE API version.

Given a PyXB binding, return the API major version number.

Parameters

pyxb_binding – PyXB binding object

Returns

DataONE API major version number, currently v1, 1, v2 or 2.

d1_common.type_conversions.get_pyxb_binding_by_api_version(api_major, api_minor=0)

Map DataONE API version tag to PyXB binding.

Given a DataONE API major version number, return PyXB binding that can serialize and deserialize DataONE XML docs of that version.

Parameters

api_major, api_minor – str or int DataONE API major and minor version numbers.

  • If api_major is an integer, it is combined with api_minor to form an exact version.

  • If api_major is a string of v1 or v2, api_minor is ignored and the latest PyXB binding available for the api_major version is returned.

Returns

E.g., d1_common.types.dataoneTypes_v1_1.

Return type

PyXB binding

d1_common.type_conversions.get_version_tag(api_major)

Parameters

api_major – int DataONE API major version. Valid versions are currently 1 or 2.

Returns

DataONE API version tag. Valid version tags are currently v1 or v2.

Return type

str

d1_common.type_conversions.extract_version_tag_from_url(url)

Extract a DataONE API version tag from a MN or CN service endpoint URL.

Parameters

url – str Service endpoint URL. E.g.: https://mn.example.org/path/v2/object/pid.

Returns

Valid version tags are currently v1 or v2.

Return type

str
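One way to pull a version tag out of an endpoint URL is a regular expression that looks for a path segment of the form v followed by digits. This is a sketch under that assumption and may not match how the library does it; extract_version_tag_sketch is a hypothetical name, and returning None for URLs without a tag is an assumption as well.

```python
import re


def extract_version_tag_sketch(url):
    """Return the first /v<digits>/ path segment found in url, or None.

    Hypothetical helper for illustration; the library's matching rules and
    its behavior for URLs without a version tag may differ.
    """
    match = re.search(r"/(v\d+)/", url)
    return match.group(1) if match else None


tag = extract_version_tag_sketch("https://mn.example.org/path/v2/object/pid")
no_tag = extract_version_tag_sketch("https://mn.example.org/path")
```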

d1_common.type_conversions.get_pyxb_namespaces()

Returns:

list of str: XML namespaces currently known to PyXB

d1_common.type_conversions.str_to_v1_str(xml_str)

Convert an API v2 XML doc to v1 XML doc.

Removes elements that are only valid for v2 and changes namespace to v1.

If doc is already v1, it is returned unchanged.

Parameters

xml_str – str API v2 XML doc. E.g.: SystemMetadata v2.

Returns

API v1 XML doc. E.g.: SystemMetadata v1.

Return type

str

d1_common.type_conversions.pyxb_to_v1_str(pyxb_obj)

Convert an API v2 PyXB object to v1 XML doc.

Removes elements that are only valid for v2 and changes namespace to v1.

Parameters

pyxb_obj – PyXB object API v2 PyXB object. E.g.: SystemMetadata v2_0.

Returns

API v1 XML doc. E.g.: SystemMetadata v1.

Return type

str

d1_common.type_conversions.str_to_v1_pyxb(xml_str)

Convert an API v2 XML doc to v1 PyXB object.

Removes elements that are only valid for v2 and changes namespace to v1.

Parameters

xml_str – str API v2 XML doc. E.g.: SystemMetadata v2.

Returns

API v1 PyXB object. E.g.: SystemMetadata v1_2.

Return type

PyXB object

d1_common.type_conversions.str_to_v2_str(xml_str)

Convert an API v1 XML doc to v2 XML doc.

All v1 elements are valid for v2, so only changes namespace.

Parameters

xml_str – str API v1 XML doc. E.g.: SystemMetadata v1.

Returns

API v2 XML doc. E.g.: SystemMetadata v2.

Return type

str

d1_common.type_conversions.pyxb_to_v2_str(pyxb_obj)

Convert an API v1 PyXB object to v2 XML doc.

All v1 elements are valid for v2, so only changes namespace.

Parameters

pyxb_obj – PyXB object API v1 PyXB object. E.g.: SystemMetadata v1_0.

Returns

API v2 XML doc. E.g.: SystemMetadata v2.

Return type

str

d1_common.type_conversions.str_to_v2_pyxb(xml_str)

Convert an API v1 XML doc to v2 PyXB object.

All v1 elements are valid for v2, so only changes namespace.

Parameters

xml_str – str API v1 XML doc. E.g.: SystemMetadata v1.

Returns

API v2 PyXB object. E.g.: SystemMetadata v2_0.

Return type

PyXB object

d1_common.type_conversions.is_pyxb(pyxb_obj)

Returns:

bool: True if pyxb_obj is a PyXB object.

d1_common.type_conversions.is_pyxb_d1_type(pyxb_obj)

Returns:

bool: True if pyxb_obj is a PyXB object holding a DataONE API type.

d1_common.type_conversions.is_pyxb_d1_type_name(pyxb_obj, expected_pyxb_type_name)
Parameters
  • pyxb_obj – object May be a PyXB object and may hold a DataONE API type.

  • expected_pyxb_type_name – str Case sensitive name of a DataONE type.

    E.g.: SystemMetadata, LogEntry, ObjectInfo.

Returns

True if object is a PyXB object holding a value of the specified type.

Return type

bool

d1_common.type_conversions.pyxb_get_type_name(obj_pyxb)

Args: obj_pyxb: PyXB object.

Returns

Name of the type the PyXB object is holding.

E.g.: SystemMetadata, LogEntry, ObjectInfo.

Return type

str

d1_common.type_conversions.pyxb_get_namespace_name(obj_pyxb)

Args: obj_pyxb: PyXB object.

Returns

Namespace and Name of the type the PyXB object is holding.

E.g.: {http://ns.dataone.org/service/types/v2.0}SystemMetadata

Return type

str

d1_common.type_conversions.str_is_v1(xml_str)
Parameters

xml_str – str DataONE API XML doc.

Returns

True if XML doc is a DataONE API v1 type.

Return type

bool

d1_common.type_conversions.str_is_v2(xml_str)
Parameters

xml_str – str DataONE API XML doc.

Returns

True if XML doc is a DataONE API v2 type.

Return type

bool

d1_common.type_conversions.str_is_error(xml_str)
Parameters

xml_str – str DataONE API XML doc.

Returns

True if XML doc is a DataONE Exception type.

Return type

bool

d1_common.type_conversions.str_is_identifier(xml_str)
Parameters

xml_str – str DataONE API XML doc.

Returns

True if XML doc is a DataONE Identifier type.

Return type

bool

d1_common.type_conversions.str_is_objectList(xml_str)
Parameters

xml_str – str DataONE API XML doc.

Returns

True if XML doc is a DataONE ObjectList type.

Return type

bool

d1_common.type_conversions.str_is_well_formed(xml_str)
Parameters

xml_str – str DataONE API XML doc.

Returns

True if XML doc is well formed.

Return type

bool

d1_common.type_conversions.pyxb_is_v1(pyxb_obj)
Parameters

pyxb_obj – PyXB object PyXB object holding an unknown type.

Returns

True if pyxb_obj holds an API v1 type.

Return type

bool

d1_common.type_conversions.pyxb_is_v2(pyxb_obj)
Parameters

pyxb_obj – PyXB object PyXB object holding an unknown type.

Returns

True if pyxb_obj holds an API v2 type.

Return type

bool

d1_common.type_conversions.str_to_pyxb(xml_str)

Deserialize API XML doc to PyXB object.

Parameters

xml_str – str DataONE API XML doc

Returns

Matching the API version of the XML doc.

Return type

PyXB object

d1_common.type_conversions.str_to_etree(xml_str, encoding='utf-8')

Deserialize API XML doc to an ElementTree.

Parameters
  • xml_str – bytes DataONE API XML doc

  • encoding – str Decoder to use when converting the XML doc bytes to a Unicode str.

Returns

Matching the API version of the XML doc.

Return type

ElementTree

d1_common.type_conversions.pyxb_to_str(pyxb_obj, encoding='utf-8')

Serialize PyXB object to XML doc.

Parameters
  • pyxb_obj – PyXB object

  • encoding – str Encoder to use when converting the Unicode strings in the PyXB object to XML doc bytes.

Returns

API XML doc, matching the API version of pyxb_obj.

Return type

str

d1_common.type_conversions.etree_to_str(etree_obj, encoding='utf-8')

Serialize ElementTree to XML doc.

Parameters
  • etree_obj – ElementTree

  • encoding – str Encoder to use when converting the Unicode strings in the ElementTree to XML doc bytes.

Returns

XML doc.

Return type

str

d1_common.type_conversions.etree_to_pretty_xml(etree_obj, encoding='unicode')

Serialize ElementTree to pretty printed XML doc.

Parameters
  • etree_obj – ElementTree

  • encoding – str Encoder to use when converting the Unicode strings in the ElementTree to XML doc bytes.

Returns

Pretty printed XML doc.

Return type

str

d1_common.type_conversions.pyxb_to_etree(pyxb_obj)

Convert PyXB object to ElementTree.

Parameters

pyxb_obj – PyXB object

Returns

Matching the API version of the PyXB object.

Return type

ElementTree

d1_common.type_conversions.etree_to_pyxb(etree_obj)

Convert ElementTree to PyXB object.

Parameters

etree_obj – ElementTree

Returns

Matching the API version of the ElementTree object.

Return type

PyXB object

d1_common.type_conversions.replace_namespace_with_prefix(tag_str, ns_reverse_dict=None)

Convert XML tag names with a namespace from the form {namespace}tag to the form prefix:tag.

Parameters
  • tag_str – str Tag name with namespace. E.g.: {http://www.openarchives.org/ore/terms/}ResourceMap.

  • ns_reverse_dict – dict A dictionary of namespace to prefix to use for the conversion. If not supplied, a default dict with the namespaces used in DataONE XML types is used.

Returns

Tag name with prefix. E.g.: ore:ResourceMap.

Return type

str
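The conversion amounts to splitting the {namespace}tag form and looking the namespace up in a dict. The sketch below shows one way to do that; the function name and the two-entry mapping are assumptions made for the example (the library's default dict is not reproduced here), as is returning the tag unchanged when the namespace is unknown.

```python
import re

# Assumed example mapping of namespace URI -> prefix, for illustration only.
NS_TO_PREFIX = {
    "http://www.openarchives.org/ore/terms/": "ore",
    "http://purl.org/dc/terms/": "dcterms",
}


def replace_namespace_with_prefix_sketch(tag_str, ns_reverse_dict=None):
    """Convert '{namespace}tag' to 'prefix:tag' (hypothetical sketch)."""
    if ns_reverse_dict is None:
        ns_reverse_dict = NS_TO_PREFIX
    match = re.fullmatch(r"\{(.*?)\}(.*)", tag_str)
    if match is None:
        return tag_str  # No namespace part to replace.
    namespace, local_name = match.groups()
    prefix = ns_reverse_dict.get(namespace)
    if prefix is None:
        return tag_str  # Unknown namespace: leave unchanged (assumption).
    return "{}:{}".format(prefix, local_name)


result = replace_namespace_with_prefix_sketch(
    "{http://www.openarchives.org/ore/terms/}ResourceMap"
)
```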

d1_common.type_conversions.etree_replace_namespace(etree_obj, ns_str)

In-place change the namespace of elements in an ElementTree.

Parameters
  • etree_obj – ElementTree

  • ns_str – str The namespace to set. E.g.: http://ns.dataone.org/service/types/v1.

d1_common.type_conversions.strip_v2_elements(etree_obj)

In-place remove elements and attributes that are only valid in v2 types.

Args: etree_obj: ElementTree ElementTree holding one of the DataONE API types that changed between v1 and v2.

d1_common.type_conversions.strip_system_metadata(etree_obj)

In-place remove elements and attributes that are only valid in v2 types from v1 System Metadata.

Args: etree_obj: ElementTree ElementTree holding a v1 SystemMetadata.

d1_common.type_conversions.strip_log(etree_obj)

In-place remove elements and attributes that are only valid in v2 types from v1 Log.

Args: etree_obj: ElementTree ElementTree holding a v1 Log.

d1_common.type_conversions.strip_logEntry(etree_obj)

In-place remove elements and attributes that are only valid in v2 types from v1 LogEntry.

Args: etree_obj: ElementTree ElementTree holding a v1 LogEntry.

d1_common.type_conversions.strip_node(etree_obj)

In-place remove elements and attributes that are only valid in v2 types from v1 Node.

Args: etree_obj: ElementTree ElementTree holding a v1 Node.

d1_common.type_conversions.strip_node_list(etree_obj)

In-place remove elements and attributes that are only valid in v2 types from v1 NodeList.

Args: etree_obj: ElementTree ElementTree holding a v1 NodeList.

d1_common.type_conversions.v2_0_tag(element_name)

Add a v2 namespace to a tag name.

Parameters

element_name – str The name of a DataONE v2 type. E.g.: NodeList.

Returns

The tag name with DataONE API v2 namespace. E.g.: {http://ns.dataone.org/service/types/v2.0}NodeList

Return type

str

d1_common.url module

Utilities for handling URLs in DataONE.

d1_common.url.parseUrl(url)

Return a dict containing scheme, netloc, url, params, query, fragment keys.

query is a dict where the values are always lists. If the query key appears only once in the URL, the list will have a single value.
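The list-valued query dict is the behavior of the standard library's parse_qs, so a comparable result can be sketched with urllib.parse. The function name is hypothetical, and mapping the "url" key to the path component is an assumption about the returned dict.

```python
from urllib.parse import parse_qs, urlparse


def parse_url_sketch(url):
    """Split a URL into a dict of its components (hypothetical sketch)."""
    parts = urlparse(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "url": parts.path,  # Assumption: the "url" key holds the path.
        "params": parts.params,
        "query": parse_qs(parts.query),  # Values are always lists.
        "fragment": parts.fragment,
    }


parsed = parse_url_sketch(
    "https://mn.example.org/v2/object?start=0&count=10&count=20#top"
)
```

A key that appears once ("start") yields a single-item list, and a repeated key ("count") yields one item per occurrence.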

d1_common.url.isHttpOrHttps(url)

URL is HTTP or HTTPS protocol.

Upper and lower case protocol names are recognized.

d1_common.url.encodePathElement(element)

Encode a URL path element according to RFC3986.

d1_common.url.decodePathElement(element)

Decode a URL path element according to RFC3986.

d1_common.url.encodeQueryElement(element)

Encode a URL query element according to RFC3986.

d1_common.url.decodeQueryElement(element)

Decode a URL query element according to RFC3986.

d1_common.url.stripElementSlashes(element)

Strip any slashes from the front and end of a URL element.

d1_common.url.joinPathElements(*elements)

Join two or more URL elements, inserting ‘/’ as needed.

Note: Any leading and trailing slashes are stripped from the resulting URL. An empty element (‘’) causes an empty spot in the path (‘//’).

d1_common.url.encodeAndJoinPathElements(*elements)

Encode URL path elements according to RFC3986 then join them, inserting ‘/’ as needed.

Note: Any leading and trailing slashes are stripped from the resulting URL. An empty element (‘’) causes an empty spot in the path (‘//’).
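The joining behavior described in the notes can be sketched as follows. These are hypothetical helpers, not the library's code. The set of characters the library leaves unencoded is not stated here, so the sketch encodes everything except unreserved characters, and it assumes slashes are stripped from each element before joining.

```python
from urllib.parse import quote


def join_path_elements_sketch(*elements):
    """Join elements with '/', stripping slashes from the ends of each one."""
    return "/".join(str(e).strip("/") for e in elements)


def encode_and_join_sketch(*elements):
    """Percent-encode each element, then join with '/'.

    safe="" also encodes '/' inside an element (assumption for this sketch).
    """
    return "/".join(quote(str(e), safe="") for e in elements)


joined = join_path_elements_sketch("/v2/", "object", "", "my pid")
encoded = encode_and_join_sketch("v2", "object", "doi:10.1/ab c")
```

The empty element in the first call produces the empty spot ('//') mentioned in the note.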

d1_common.url.normalizeTarget(target)

If necessary, modify target so that it ends with ‘/’.

d1_common.url.urlencode(query, doseq=0)

Modified version of the standard urllib.urlencode that conforms to RFC3986. The urllib version encodes spaces as ‘+’ which can lead to inconsistency. This version will always encode spaces as ‘%20’.

Encode a sequence of two-element tuples or dictionary into a URL query string.

If any values in the query arg are sequences and doseq is true, each sequence element is converted to a separate parameter.

If the query arg is a sequence of two-element tuples, the order of the parameters in the output will match the order of parameters in the input.
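In Python 3, the difference between the two encodings of a space can be shown with the standard library alone: urllib.parse.urlencode uses quote_plus by default, and passing quote_via=quote switches it to ‘%20’. This only illustrates the encoding difference and is not the implementation of d1_common.url.urlencode.

```python
from urllib.parse import quote, urlencode

# Default behavior: spaces become '+'.
default_encoded = urlencode({"q": "a b"})

# With quote_via=quote, spaces become '%20'. doseq=True expands each
# sequence value into a separate parameter, and a list of tuples keeps
# the parameters in input order.
rfc3986_encoded = urlencode(
    [("q", "a b"), ("tag", ["x y", "z"])], doseq=True, quote_via=quote
)
```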

d1_common.url.makeCNBaseURL(url)

Attempt to create a valid CN BaseURL when one or more sections of the URL are missing.

d1_common.url.makeMNBaseURL(url)

Attempt to create a valid MN BaseURL when one or more sections of the URL are missing.

d1_common.url.find_url_mismatches(a_url, b_url)

Given two URLs, return a list of any mismatches.

If the list is empty, the URLs are equivalent. Implemented by parsing and comparing the elements. See RFC 1738 for details.

d1_common.url.is_urls_equivalent(a_url, b_url)

d1_common.util module

General utilities often needed by DataONE clients and servers.

d1_common.util.log_setup(is_debug=False, is_multiprocess=False)

Set up a standardized log format for the DataONE Python stack. All Python components should use this function. If is_multiprocess is True, include process ID in the log so that logs can be separated for each process.

Output only to stdout and stderr.

d1_common.util.get_content_type(content_type)

Extract the MIME type value from a content type string.

Removes any subtype and parameter values that may be present in the string.

Parameters

content_type – str String with content type and optional subtype and parameter fields.

Returns

String with only content type

Return type

str

Example:

Input:   multipart/form-data; boundary=aBoundaryString
Returns: multipart/form-data
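A minimal sketch that reproduces the example above by dropping everything from the first ‘;’ onward. The function name is hypothetical and the library may handle edge cases differently.

```python
def get_content_type_sketch(content_type):
    """Return the part of a content type string before any parameters."""
    return content_type.split(";", 1)[0].strip()


mime_type = get_content_type_sketch(
    "multipart/form-data; boundary=aBoundaryString"
)
```
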
d1_common.util.nested_update(d, u)

Merge two nested dicts.

Nested dicts are sometimes used for representing various recursive structures. When updating such a structure, it may be convenient to present the updated data as a corresponding recursive structure. This function will then apply the update.

Parameters
  • d – dict dict that will be updated in-place. May or may not contain nested dicts.

  • u – dict dict with contents that will be merged into d. May or may not contain nested dicts.
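A recursive merge of this kind can be sketched as below. nested_update_sketch is a hypothetical stand-in, and replacing a non-dict value with a dict (or the reverse) is an assumption about how conflicts are resolved.

```python
from collections.abc import Mapping


def nested_update_sketch(d, u):
    """Merge u into d in place, descending into nested dicts."""
    for key, value in u.items():
        if isinstance(value, Mapping) and isinstance(d.get(key), Mapping):
            # Both sides hold a dict for this key: merge them recursively.
            nested_update_sketch(d[key], value)
        else:
            # Otherwise the value from u replaces whatever d had (assumption).
            d[key] = value
    return d


config = {"a": {"x": 1, "y": 2}, "b": 3}
nested_update_sketch(config, {"a": {"y": 20, "z": 30}, "c": 4})
```

A plain dict.update() would have replaced the whole "a" dict and lost the "x" key.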

class d1_common.util.EventCounter(logger_=logging)

Bases: object

Count events during a lengthy operation and write running totals and/or a summary to a logger when the operation has completed.

The summary contains the name and total count of each event that was counted.

Example

Summary written to the log:

Events:
Creating SciObj DB representations: 200
Retrieving revision chains: 200
Skipped Node registry update: 1
Updating obsoletedBy: 42
Whitelisted subject: 2

property event_dict

Provide direct access to the underlying dict where events are recorded.

Returns: dict: Events and event counts.

count(event_str, inc_int=1)

Count an event.

Parameters
  • event_str – The name of an event to count. Used as a key in the event dict. The same name will also be used in the summary.

  • inc_int – int Optional argument to increase the count for the event by more than 1.

log_and_count(event_str, msg_str=None, inc_int=None)

Count an event and write a message to a logger.

Parameters
  • event_str – str The name of an event to count. Used as a key in the event dict. The same name will be used in the summary. This also becomes a part of the message logged by this function.

  • msg_str – str Optional message with details about the events. The message is only written to the log. While the event_str functions as a key and must remain the same for the same type of event, msg_str may change between calls.

  • inc_int – int Optional argument to increase the count for the event by more than 1.

dump_to_log()

Write summary to logger with the name and number of times each event has been counted.

This function may be called at any point in the process. Counts are not zeroed.
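The counting pattern described for this class can be sketched with a plain dict. `EventCounterSketch` is a hypothetical stand-in written for illustration; the log message layout and the use of `info()` are assumptions, not the library's behavior.

```python
import logging


class EventCounterSketch:
    """Count named events and write a summary to a logger."""

    def __init__(self, logger_=logging):
        self._logger = logger_
        self._event_dict = {}

    @property
    def event_dict(self):
        return self._event_dict

    def count(self, event_str, inc_int=1):
        self._event_dict[event_str] = self._event_dict.get(event_str, 0) + inc_int

    def log_and_count(self, event_str, msg_str=None, inc_int=None):
        # The event name is always logged; the detail message is optional.
        if msg_str:
            self._logger.info("{}: {}".format(event_str, msg_str))
        else:
            self._logger.info(event_str)
        self.count(event_str, 1 if inc_int is None else inc_int)

    def dump_to_log(self):
        # Counts are left untouched, so this can be called repeatedly.
        self._logger.info("Events:")
        for event_str, count_int in sorted(self._event_dict.items()):
            self._logger.info("{}: {}".format(event_str, count_int))


counter = EventCounterSketch()
counter.count("Retrieving revision chains")
counter.count("Retrieving revision chains", 2)
counter.log_and_count("Whitelisted subject", "subject=CN=example")
counter.dump_to_log()
```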

d1_common.util.print_logging()

Context manager to temporarily suppress additional information such as timestamps when writing to loggers.

This makes logging look like print(). The main use case is in scripts that mix logging and print(), as Python uses separate streams for those, and output can and does end up getting shuffled if print() and logging are used interchangeably.

When entering the context, the logging levels on the current handlers are saved and then set to WARNING. A new DEBUG level handler with a formatter that does not write timestamps, etc., is then created.

When leaving the context, the DEBUG handler is removed and existing loggers are restored to their previous levels.

By modifying the log levels to WARNING instead of completely disabling the loggers, it is ensured that potentially serious issues can still be logged while the context manager is in effect.
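The save, modify and restore sequence described above can be sketched as a generator-based context manager. `print_logging_sketch` is a hypothetical name; operating on the handlers of the root logger and writing the plain handler to stdout are assumptions.

```python
import contextlib
import logging
import sys


@contextlib.contextmanager
def print_logging_sketch():
    root_logger = logging.getLogger()
    # Remember the level of each existing handler, then quiet it to WARNING.
    saved_level_list = [(h, h.level) for h in root_logger.handlers]
    for handler, _ in saved_level_list:
        handler.setLevel(logging.WARNING)
    # Temporary handler that writes only the message, like print().
    plain_handler = logging.StreamHandler(sys.stdout)
    plain_handler.setLevel(logging.DEBUG)
    plain_handler.setFormatter(logging.Formatter("%(message)s"))
    root_logger.addHandler(plain_handler)
    try:
        yield
    finally:
        # Undo both changes, also when the body raises.
        root_logger.removeHandler(plain_handler)
        for handler, level in saved_level_list:
            handler.setLevel(level)
```

Restoring in a `finally` clause keeps the logging configuration intact even if the code inside the `with` block fails.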

d1_common.util.save_json(py_obj, json_path)

Serialize a native object to JSON and save it normalized, pretty printed to a file.

The JSON string is normalized by sorting any dictionary keys.

Parameters
  • py_obj – object Any object that can be represented in JSON. Some types, such as datetimes, are automatically converted to strings.

  • json_path – str File path to which to write the JSON file. E.g.: The path must exist. The filename will normally end with “.json”.

See also

ToJsonCompatibleTypes()

d1_common.util.load_json(json_path)

Load JSON file and parse it to a native object.

Parameters

json_path – str File path from which to load the JSON file.

Returns

Typically a nested structure of list and dict objects.

Return type

object
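A save and load round trip along these lines can be sketched with the `json` module. The function names are hypothetical, and the indent width and UTF-8 file encoding are assumptions.

```python
import json
import os
import tempfile


def save_json_sketch(py_obj, json_path):
    """Write py_obj as pretty printed JSON with sorted keys."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(py_obj, f, sort_keys=True, indent=2)


def load_json_sketch(json_path):
    """Parse a JSON file to native Python objects."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "example.json")
    save_json_sketch({"b": [1, 2], "a": {"z": None}}, json_path)
    round_trip_obj = load_json_sketch(json_path)
```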

d1_common.util.format_json_to_normalized_pretty_json(json_str)

Normalize and pretty print a JSON string.

The JSON string is normalized by sorting any dictionary keys.

Parameters

json_str – A valid JSON string.

Returns

Normalized, pretty printed JSON string.

Return type

str

d1_common.util.serialize_to_normalized_pretty_json(py_obj)

Serialize a native object to normalized, pretty printed JSON.

The JSON string is normalized by sorting any dictionary keys.

Parameters

py_obj – object Any object that can be represented in JSON. Some types, such as datetimes, are automatically converted to strings.

Returns

Normalized, pretty printed JSON string.

Return type

str

d1_common.util.serialize_to_normalized_compact_json(py_obj)

Serialize a native object to normalized, compact JSON.

The JSON string is normalized by sorting any dictionary keys. It will be on a single line without whitespace between elements.

Parameters

py_obj – object Any object that can be represented in JSON. Some types, such as datetimes, are automatically converted to strings.

Returns

Normalized, compact JSON string.

Return type

str
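The difference between the pretty and compact forms comes down to two `json.dumps` arguments. The sketch below uses hypothetical function names and an assumed indent of two spaces; sorting the keys is what makes the output independent of dict insertion order.

```python
import json


def to_pretty_json_sketch(py_obj):
    # Sorted keys plus indentation: stable and readable.
    return json.dumps(py_obj, sort_keys=True, indent=2)


def to_compact_json_sketch(py_obj):
    # Sorted keys, single line, no whitespace between elements.
    return json.dumps(py_obj, sort_keys=True, separators=(",", ":"))


compact_str = to_compact_json_sketch({"b": 1, "a": [1, 2]})
pretty_str = to_pretty_json_sketch({"b": 1, "a": 2})
```

Because of the key sorting, two dicts with the same contents but different insertion order serialize to identical strings, which makes the output suitable for comparisons and diffs.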

class d1_common.util.ToJsonCompatibleTypes(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.encoder.JSONEncoder

Some native objects such as datetime.datetime are not automatically converted to strings for use as values in JSON.

This helper adds such conversions for types that the DataONE Python stack encounters frequently in objects that are to be JSON encoded.

default(o)

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
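For the datetime case mentioned in the class description, an encoder subclass could be sketched as follows. `DatetimeToStrEncoderSketch` is a hypothetical name, and using ISO 8601 via `isoformat()` as the string form is an assumption; the library may format datetimes differently.

```python
import datetime
import json


class DatetimeToStrEncoderSketch(json.JSONEncoder):
    def default(self, o):
        # Convert date and datetime objects to strings.
        if isinstance(o, (datetime.datetime, datetime.date)):
            return o.isoformat()
        # Anything else is left to the base class, which raises TypeError.
        return super().default(o)


json_str = json.dumps(
    {"created": datetime.datetime(2019, 1, 2, 3, 4, 5)},
    cls=DatetimeToStrEncoderSketch,
    sort_keys=True,
)
```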

d1_common.util.format_sec_to_dhm(sec)

Format seconds to days, hours, minutes.

Parameters

sec – float or int Number of seconds in a period of time.

Returns

Period of time represented as a string of the form 0d:00h:00m.

Return type

str
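The conversion is two rounds of integer division. The sketch below is hypothetical (`format_sec_to_dhm_sketch` is not the library function); the colon separators follow the documented form, and dropping leftover seconds instead of rounding is an assumption.

```python
def format_sec_to_dhm_sketch(sec):
    """Format a number of seconds as days, hours and minutes."""
    total_min_int = int(sec) // 60  # leftover seconds are dropped
    total_hour_int, m_int = divmod(total_min_int, 60)
    d_int, h_int = divmod(total_hour_int, 24)
    return "{}d:{:02d}h:{:02d}m".format(d_int, h_int, m_int)


# 1 day + 1 hour + 1 minute + 1 second
dhm_str = format_sec_to_dhm_sketch(86400 + 3600 + 60 + 1)
```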

d1_common.xml module

Utilities for handling XML docs.

d1_common.xml.deserialize(doc_xml, pyxb_binding=None)

Deserialize DataONE XML types to PyXB.

Parameters
  • doc_xml – UTF-8 encoded bytes

  • pyxb_binding – PyXB binding object. If not specified, the correct one should be selected automatically.

Returns

PyXB object

See also

deserialize_d1_exception() for deserializing DataONE Exception types.

d1_common.xml.deserialize_d1_exception(doc_xml)

Parameters

doc_xml – UTF-8 encoded bytes An XML doc that conforms to the dataoneErrors XML Schema.

Returns

DataONEException object

d1_common.xml.serialize_gen(obj_pyxb, encoding='utf-8', pretty=False, strip_prolog=False, xslt_url=None)

Serialize PyXB object to XML.

Parameters
  • obj_pyxb – PyXB object PyXB object to serialize.

  • encoding – str Encoding to use for XML doc bytes

  • pretty – bool True: Use pretty print formatting for human readability.

  • strip_prolog – bool True: Remove any XML prolog (e.g., <?xml version="1.0" encoding="utf-8"?>) from the resulting XML doc.

  • xslt_url – str If specified, add a processing instruction to the XML doc that specifies the download location for an XSLT stylesheet.

Returns

XML document

d1_common.xml.serialize_for_transport(obj_pyxb, pretty=False, strip_prolog=False, xslt_url=None)

Serialize PyXB object to XML bytes with UTF-8 encoding for transport over the network, filesystem storage and other machine usage.

Parameters
  • obj_pyxb – PyXB object PyXB object to serialize.

  • pretty – bool True: Use pretty print formatting for human readability.

  • strip_prolog – bool True: Remove any XML prolog (e.g., <?xml version="1.0" encoding="utf-8"?>) from the resulting XML doc.

  • xslt_url – str If specified, add a processing instruction to the XML doc that specifies the download location for an XSLT stylesheet.

Returns

UTF-8 encoded XML document

Return type

bytes

See also

serialize_to_xml_str()

d1_common.xml.serialize_to_xml_str(obj_pyxb, pretty=True, strip_prolog=False, xslt_url=None)

Serialize PyXB object to pretty printed XML str for display.

Parameters
  • obj_pyxb – PyXB object PyXB object to serialize.

  • pretty – bool False: Disable pretty print formatting. XML will not have line breaks.

  • strip_prolog – bool True: Remove any XML prolog (e.g., <?xml version="1.0" encoding="utf-8"?>) from the resulting XML doc.

  • xslt_url – str If specified, add a processing instruction to the XML doc that specifies the download location for an XSLT stylesheet.

Returns

Pretty printed XML document

Return type

str

d1_common.xml.reformat_to_pretty_xml(doc_xml)

Pretty print XML doc.

Parameters

doc_xml – str Well formed XML doc

Returns

Pretty printed XML doc

Return type

str
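Pretty printing a well formed XML doc can be sketched with `xml.dom.minidom` from the standard library. The function name and the two-space indent are assumptions, and this is not necessarily how the library does it.

```python
import xml.dom.minidom


def reformat_to_pretty_xml_sketch(doc_xml):
    """Return doc_xml with one element per line and nested indentation."""
    dom = xml.dom.minidom.parseString(doc_xml)
    return dom.toprettyxml(indent="  ")


pretty_xml = reformat_to_pretty_xml_sketch("<a><b>1</b><c/></a>")
```

Note that `toprettyxml()` adds whitespace around any text nodes that already contain line breaks, so input that is already indented tends to gain blank lines.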

d1_common.xml.are_equivalent_pyxb(a_pyxb, b_pyxb)

Return True if two PyXB objects are semantically equivalent, else False.

d1_common.xml.are_equivalent(a_xml, b_xml, encoding=None)

Return True if two XML docs are semantically equivalent, else False.

  • TODO: Include test for tails. Skipped for now because tails are not used in any D1 types.

d1_common.xml.are_equal_or_superset(superset_tree, base_tree)

Return True if superset_tree is equal to or a superset of base_tree.

  • Checks that all elements and attributes in base_tree are present in superset_tree and contain the same values. For elements, also checks that the order is the same.

  • Can be used for checking if one XML document is based on another, as long as all the information in base_tree is also present and unmodified in superset_tree.
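An ordered superset check of this kind can be sketched over `xml.etree.ElementTree` elements. The function below is a hypothetical illustration; ignoring surrounding whitespace in text and allowing extra elements between the matched ones are assumptions.

```python
import xml.etree.ElementTree as ET


def is_equal_or_superset_sketch(superset_el, base_el):
    """Return True if everything in base_el also appears in superset_el."""
    if superset_el.tag != base_el.tag:
        return False
    if (superset_el.text or "").strip() != (base_el.text or "").strip():
        return False
    # Every attribute in the base must exist with the same value.
    for name, value in base_el.attrib.items():
        if superset_el.attrib.get(name) != value:
            return False
    # Each base child must match a superset child, in the same order.
    # Sharing one iterator means a match can only be found further along.
    superset_iter = iter(superset_el)
    for base_child in base_el:
        if not any(is_equal_or_superset_sketch(s, base_child) for s in superset_iter):
            return False
    return True


base = ET.fromstring("<a><b>1</b><c>2</c></a>")
extended = ET.fromstring('<a x="1"><b>1</b><extra/><c>2</c></a>')
reordered = ET.fromstring("<a><c>2</c><b>1</b></a>")
```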

d1_common.xml.are_equal_xml(a_xml, b_xml)

Normalize and compare XML documents for equality. The document may or may not be a DataONE type.

Parameters
  • a_xml – str

  • b_xml – str XML documents to compare for equality.

Returns

True if the XML documents are semantically equivalent.

Return type

bool
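A comparison that ignores formatting differences can be sketched by parsing both documents and comparing the trees. The names below are hypothetical, and the normalization rules (whitespace around text is ignored, attribute order is ignored, element order matters) are assumptions rather than the library's definition of equivalence.

```python
import xml.etree.ElementTree as ET


def _norm_text(text):
    return (text or "").strip()


def are_equal_elements_sketch(a_el, b_el):
    # attrib is a dict, so attribute order does not affect the comparison.
    if a_el.tag != b_el.tag or a_el.attrib != b_el.attrib:
        return False
    if _norm_text(a_el.text) != _norm_text(b_el.text):
        return False
    if len(a_el) != len(b_el):
        return False
    return all(are_equal_elements_sketch(a, b) for a, b in zip(a_el, b_el))


def are_equal_xml_sketch(a_xml, b_xml):
    return are_equal_elements_sketch(ET.fromstring(a_xml), ET.fromstring(b_xml))


same = are_equal_xml_sketch(
    '<a x="1" y="2"><b>1</b></a>',
    '<a y="2" x="1">\n  <b> 1 </b>\n</a>',
)
```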

d1_common.xml.are_equal_pyxb(a_pyxb, b_pyxb)

Normalize and compare PyXB objects for equality.

Parameters
  • a_pyxb – PyXB object

  • b_pyxb – PyXB object PyXB objects to compare for equality.

Returns

True if the PyXB objects are semantically equivalent.

Return type

bool

d1_common.xml.are_equal_elements(a_el, b_el)

Normalize and compare ElementTrees for equality.

Parameters
  • a_el – ElementTree

  • b_el – ElementTree ElementTrees to compare for equality.

Returns

True if the ElementTrees are semantically equivalent.

Return type

bool

d1_common.xml.sort_value_list_pyxb(obj_pyxb)

In-place sort complex value siblings in a PyXB object.

Parameters

obj_pyxb – PyXB object

d1_common.xml.sort_elements_by_child_values(obj_pyxb, child_name_list)

In-place sort simple or complex elements in a PyXB object by values they contain in child elements.

Parameters
  • obj_pyxb – PyXB object

  • child_name_list – list of str List of element names that are direct children of the PyXB object.

d1_common.xml.format_diff_pyxb(a_pyxb, b_pyxb)

Create a diff between two PyXB objects.

Parameters
  • a_pyxb – PyXB object

  • b_pyxb – PyXB object

Returns

Differ-style delta

Return type

str

d1_common.xml.format_diff_xml(a_xml, b_xml)

Create a diff between two XML documents.

Parameters
  • a_xml – str

  • b_xml – str

Returns

Differ-style delta

Return type

str
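A Differ-style delta between two XML docs can be sketched with `difflib.ndiff`. Pretty printing both docs first, so that the diff works line by line, is an assumption, as is the function name.

```python
import difflib
import xml.dom.minidom


def format_diff_xml_sketch(a_xml, b_xml):
    """Return a Differ-style delta between two XML docs."""
    a_lines = xml.dom.minidom.parseString(a_xml).toprettyxml(indent="  ").splitlines()
    b_lines = xml.dom.minidom.parseString(b_xml).toprettyxml(indent="  ").splitlines()
    # ndiff prefixes lines with "- ", "+ ", "  " or "? ", as difflib.Differ does.
    return "\n".join(difflib.ndiff(a_lines, b_lines))


delta_str = format_diff_xml_sketch("<a><b>1</b></a>", "<a><b>2</b></a>")
```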

d1_common.xml.is_valid_utf8(o)

Determine if object is valid UTF-8 encoded bytes.

Parameters

o – str

Returns

True if object is bytes containing valid UTF-8.

Return type

bool

Notes

  • An empty bytes object is valid UTF-8.

  • Any type of object can be checked, not only bytes.
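The check and the two notes can be illustrated with a short sketch. `is_valid_utf8_sketch` is a hypothetical name, and treating every non-bytes object as invalid is an assumption based on the description of the return value.

```python
def is_valid_utf8_sketch(o):
    """Return True only for bytes objects that decode as UTF-8."""
    if not isinstance(o, bytes):
        return False
    try:
        o.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


results = [
    is_valid_utf8_sketch(b""),                   # empty bytes
    is_valid_utf8_sketch("æøå".encode("utf-8")),  # valid multi-byte sequences
    is_valid_utf8_sketch(b"\xff\xfe"),           # invalid start byte
    is_valid_utf8_sketch("a str, not bytes"),
]
```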

d1_common.xml.get_auto(obj_pyxb)

Return value from simple or complex PyXB element.

PyXB complex elements have a .value() member which must be called in order to retrieve the value of the element, while simple elements represent their values directly. This function allows retrieving element values without knowing the type of element.

Parameters

obj_pyxb – PyXB object

Returns

Value of the PyXB object.

Return type

str
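The idea of reading a value without knowing the element type can be sketched without PyXB. `ComplexElementStandIn` is a hypothetical stand-in for a PyXB complex element, and `get_auto_sketch` is not the library function.

```python
class ComplexElementStandIn:
    """Hypothetical stand-in for a PyXB complex element."""

    def __init__(self, value_str):
        self._value_str = value_str

    def value(self):
        return self._value_str


def get_auto_sketch(obj):
    # Complex elements expose value(); simple elements are the value itself.
    value_attr = getattr(obj, "value", None)
    return value_attr() if callable(value_attr) else obj


from_complex = get_auto_sketch(ComplexElementStandIn("urn:node:A"))
from_simple = get_auto_sketch("urn:node:A")
```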

d1_common.xml.get_opt_attr(obj_pyxb, attr_str, default_val=None)

Get an optional attribute value from a PyXB element.

The attributes for elements that are optional according to the schema and not set in the PyXB object are present and set to None.

PyXB validation will fail if required elements are missing.

Parameters
  • obj_pyxb – PyXB object

  • attr_str – str Name of an attribute that the PyXB object may contain.

  • default_val – any object Value to return if the attribute is not present.

Returns

Value of the attribute if present, else default_val.

Return type

str
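Since unset optional attributes are present and set to None, the lookup has to treat None the same as missing. The sketch below uses `types.SimpleNamespace` as a hypothetical stand-in for a PyXB object; the function name and attribute names are illustrative only.

```python
from types import SimpleNamespace


def get_opt_attr_sketch(obj, attr_str, default_val=None):
    """Return the attribute value, or default_val if it is missing or None."""
    value = getattr(obj, attr_str, None)
    return default_val if value is None else value


# Stand-in object: one attribute set, one present but unset.
element = SimpleNamespace(replicationAllowed="true", numberReplicas=None)

allowed = get_opt_attr_sketch(element, "replicationAllowed", "false")
replicas = get_opt_attr_sketch(element, "numberReplicas", 0)
```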

d1_common.xml.get_opt_val(obj_pyxb, attr_str, default_val=None)

Get an optional Simple Content value from a PyXB element.

The attributes for elements that are optional according to the schema and not set in the PyXB object are present and set to None.

PyXB validation will fail if required elements are missing.

Parameters
  • obj_pyxb – PyXB object

  • attr_str – str Name of an attribute that the PyXB object may contain.

  • default_val – any object Value to return if the attribute is not present.

Returns

Value of the attribute if present, else default_val.

Return type

str

d1_common.xml.get_req_val(obj_pyxb)

Get a required Simple Content value from a PyXB element.

The attributes for elements that are required according to the schema are always present, and provide a value() method.

PyXB validation will fail if required elements are missing.

Getting a Simple Content value from PyXB with .value() returns a PyXB object that lazily evaluates to a native Unicode string. This confused parts of the Django ORM that check types before passing values to the database. This function forces immediate conversion to Unicode.

Parameters

obj_pyxb – PyXB object

Returns

Value of the element.

Return type

str

exception d1_common.xml.CompareError

Bases: Exception

Raised when objects are compared and found not to be semantically equivalent.