Treatment Data Access

Access to Plazi Treatments

What is a treatment?

The Plazi treatment repository [1] deals with published scientific biosystematic literature. This literature documents and describes the world’s ca. 1.9 million known species in an estimated corpus of over 500 million published pages. The publications cited in Plazi are all available from the Biodiversity Literature Repository [2] at Zenodo/CERN.

Treatments are well-defined parts of articles that define the particular usage of a scientific name by an author at a given time (the publication) [3]. In other words, each scientific name has one to several treatments, depending on whether there is only an original description of a species or whether there are subsequent re-descriptions. Like bibliographic references, treatments can be cited, and subsequent usages of names cite earlier treatments.

Treatments are a synthesis of the knowledge of a given species at a given time. They can be very rich in data, whether explicit or implicit, detailed or summarized, and they include many references to external data sources, such as scientific names, collection codes, and DNA codes.

The data can be semantically enhanced and linked. Because treatments are parts of publications, they need to be extracted first. Most recently, treatments are tagged in electronic publications with the TaxPub extension [3] of the National Library of Medicine’s Journal Article Tag Suite (JATS), which allows automatic extraction. Still, the majority of the ca. 2000 journals and books publishing treatments use the PDF format at best. Plazi has tools to extract treatments, enhance the embedded data, and import it into its SRS Treatment Search Portal for public online access.

The data, that is, treatments and observation data, can be viewed as HTML, XML, or RDF, or can be harvested with the protocols described below. For harvesting, the data is provided as Darwin Core Archives.

What is a DarwinCore Archive?

The Darwin Core Archive format is a simple and extensible schema for sharing biodiversity data, especially catalogue data, based on the ratified Darwin Core terms and the Darwin Core text guidelines [4]. Darwin Core is a standard for describing sample data in the biodiversity informatics community. The archive format has been developed by the Global Biodiversity Information Facility (GBIF). Darwin Core Archives use a table-based, "spreadsheet-style" format that is comfortable and familiar to biologists. The format uses plain text files, but it is tied to processes that support consistency and stability.


Fig. Schematic representation of a Darwin Core Archive and its components [4]

The GBIF GNA format consists of a set of files in which one (or more) files represent the 'core' taxonomic data, with each row representing a single taxon reference. The DarwinCore Taxon class provides the majority of the concepts supported in the format; these allow taxonomic and nomenclatural semantics and syntax (classification, taxonomic and nomenclatural synonymy, status, etc.) to be expressed.

Other files represent "extensions" to this core table and allow additional data elements to be linked to a taxon in the core table with a many to one relationship. The overall topology of one or more of these extensions to the core table is referred to as a "star schema" and provides a compromise between an overly simple flat-file representation of data and more complex multi-related files. In addition to these files, an additional descriptor file named “meta.xml” serves as a key to the other files. Collectively, these files can be further zipped into a single compressed archive file for portability. This compressed file is known as a Darwin Core Archive (DwCA) file [4].
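The role of meta.xml as a key to the other files can be illustrated with a minimal Python sketch. It assumes that the descriptor follows the Darwin Core text guidelines, i.e. an archive element in the namespace http://rs.tdwg.org/dwc/text/ with one core element and any number of extension elements, each naming its data file under files/location. The function name describe_archive and the SAMPLE_META string are made up for illustration and are not taken from a real Plazi archive.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text guidelines descriptor (assumption: meta.xml uses it).
DWC_TEXT_NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def describe_archive(meta_xml: str) -> dict:
    """Return the core file(s) and extension file(s) named in a meta.xml string."""
    root = ET.fromstring(meta_xml)

    def locations(element):
        # Each <core>/<extension> names its data file(s) under <files><location>.
        found = element.findall("dwc:files/dwc:location", DWC_TEXT_NS)
        return [(loc.text or "").strip() for loc in found]

    core = root.find("dwc:core", DWC_TEXT_NS)
    return {
        "core": locations(core) if core is not None else [],
        "extensions": [
            name
            for ext in root.findall("dwc:extension", DWC_TEXT_NS)
            for name in locations(ext)
        ],
    }


# Hypothetical, heavily shortened descriptor used only to exercise the function.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" ignoreHeaderLines="1">
    <files><location>taxa.txt</location></files>
  </core>
  <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrences.txt</location></files>
  </extension>
</archive>"""

if __name__ == "__main__":
    print(describe_archive(SAMPLE_META))
```

The result separates the single core table from the extension tables, which mirrors the star schema: every extension row points back to a row of the core file.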


The Darwin Core Archive used by Plazi

There is one archive per article stored in Plazi, containing the data from all the treatments in the article. Archives contain nine files:

  • meta.xml: description of columns in data files
  • eml.xml: archive meta data, i.e., bibliographic citation of article, etc.
  • taxa.txt: the archive core file, containing one row per taxon in the nomenclature section of a treatment, thus one or multiple rows per treatment, with any after the first for each treatment handling synonymizations.
  • occurrences.txt: occurrence data, containing one row per materials citation, with an ID reference to taxa.txt
  • description.txt: description data, containing one row per descriptive treatment section, with an ID reference to taxa.txt
  • distribution.txt: general distribution data, one row per distribution statement, with an ID reference to taxa.txt
  • media.txt: full text treatments with HTML markup with additional meta data like a bibliographic citation, one row per treatment, with an ID reference to taxa.txt
  • references.txt: bibliographic references to individual treatments, one row per treatment, with an ID reference to taxa.txt
  • vernaculars.txt: vernacular names of treatment taxa, currently empty, as we do not have or mark this kind of data

For a detailed description of the content of each file see Appendix: Darwin Core Archive Content


Treatment Data representation in Plazi

The treatment data is stored in the Treatment Search Portal in native, generic XML included in tagged original publications. The tagged elements are (a) additionally stored in dedicated index structures to support search and (b) extracted and exported in several formats, including DwCA.

A treatment document includes two main elements, the header including the metadata based on the Metadata Object Description Schema (MODS) and the body.

<tax:taxonx>
   <tax:taxonxHeader>...</tax:taxonxHeader>
   <tax:taxonxBody>...</tax:taxonxBody>
</tax:taxonx>

The data XML can be converted via XSLT into HTML, TaxonX XML (a schema developed to model biosystematics legacy literature), and RDF.

HTML:

http://treatment.plazi.org/id/31F96F41-E3E0-02BD-8898-5A4F3A20E45A 

(this is also the persistent httpURI used as identifier for treatments)

Plain XML:

http://plazi.cs.umb.edu/GgServer/xslt/31F96F41E3E002BD88985A4F3A20E45A

TaxonX XML:

http://plazi.cs.umb.edu/GgServer/taxonx/31F96F41E3E002BD88985A4F3A20E45A

RDF:

http://plazi.cs.umb.edu/GgServer/rdf/31F96F41E3E002BD88985A4F3A20E45A

The terms used in TaxonX and RDF are either imported from existing schemas (such as Darwin Core for observation records and MODS for bibliographic data) or, where no suitable term exists, defined in schemas (TaxonX) or ontologies (RDF: in development).
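In the example URLs above, the persistent identifier uses the hyphenated form of the treatment UUID, while the GgServer URLs use the same 32 hexadecimal digits without hyphens. A small sketch of converting between the two forms follows; the function names are hypothetical, and the assumption that every treatment UUID follows the 8-4-4-4-12 grouping is inferred from the single example shown here.

```python
import re

# Base of the persistent httpURI, taken from the HTML example above.
PERSISTENT_BASE = "http://treatment.plazi.org/id/"


def compact_uuid(uuid: str) -> str:
    """Return the UUID as 32 upper-case hex digits without hyphens."""
    value = uuid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9A-F]{32}", value):
        raise ValueError(f"not a 32-digit hexadecimal UUID: {uuid!r}")
    return value


def hyphenated_uuid(uuid: str) -> str:
    """Return the UUID in 8-4-4-4-12 grouping."""
    value = compact_uuid(uuid)
    groups = [value[0:8], value[8:12], value[12:16], value[16:20], value[20:32]]
    return "-".join(groups)


def persistent_uri(uuid: str) -> str:
    """Build the persistent treatment URI from either UUID form."""
    return PERSISTENT_BASE + hyphenated_uuid(uuid)


if __name__ == "__main__":
    print(persistent_uri("31F96F41E3E002BD88985A4F3A20E45A"))
```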


Plazi API

Treatment data is open access and can be accessed via HTTP GET as described in detail below. The treatment data is provided in HTML, various XML flavors, and RDF.

Obtaining a list of all the treatments available from Plazi

HTTP GET

http://plazi.cs.umb.edu/GgServer/xml.rss.xml

Response (RSS, in Atom XML, encoded in UTF-8)


Entries of interest

- channel/item/link: the link to the XML treatment
- channel/item/title: the taxon name and authority
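The two entries listed above can be read with Python's standard XML parser. The sketch below assumes that the feed exposes plain channel/item/title and channel/item/link elements without XML namespaces, as the paths above suggest; the actual feed may differ. The function name list_treatments and the SAMPLE_FEED content (including its title) are invented for illustration.

```python
import xml.etree.ElementTree as ET


def list_treatments(feed_xml: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every channel/item in the feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.findall("channel/item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        entries.append((title, link))
    return entries


# Made-up one-item feed; a real response would be fetched from the URL above.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>Hypothetical taxon name Author, 2000</title>
      <link>http://plazi.cs.umb.edu/GgServer/xml/8C4CE845A6DEE6FDFD1600A70D5BC71B</link>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for title, link in list_treatments(SAMPLE_FEED):
        print(title, link)
```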


Accessing a particular DwC-Archive

HTTP GET

 http://plazi.cs.umb.edu/GgServer/dwca/<dataSetUUID>.zip

Replace <dataSetUUID> with any UUID from the GBIF-provided listing (see below). It is also possible to use the endpoint URL from that listing directly.


Example:

http://plazi.cs.umb.edu/GgServer/dwca/23A1465DDF212F7DA589F41341B83FCC.zip

Response (ZIP Archive, containing XML and tab separated TXT files, all encoded in UTF-8)
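As a rough sketch of this request in Python, using only the standard library: dwca_url builds the address from a dataset UUID, download_dwca performs the HTTP GET, and member_names lists the files in the returned ZIP. All three names are hypothetical helpers, and the timeout value is an arbitrary choice.

```python
import io
import zipfile
from urllib.request import urlopen

# Endpoint prefix as given in this section.
DWCA_BASE = "http://plazi.cs.umb.edu/GgServer/dwca/"


def dwca_url(dataset_uuid: str) -> str:
    """Build the archive URL for a dataset UUID."""
    return f"{DWCA_BASE}{dataset_uuid}.zip"


def download_dwca(dataset_uuid: str, timeout: float = 30.0) -> bytes:
    """Fetch the ZIP archive and return its raw bytes (needs network access)."""
    with urlopen(dwca_url(dataset_uuid), timeout=timeout) as response:
        return response.read()


def member_names(archive_bytes: bytes) -> list[str]:
    """List the file names contained in a downloaded archive, sorted."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return sorted(archive.namelist())


if __name__ == "__main__":
    data = download_dwca("23A1465DDF212F7DA589F41341B83FCC")
    print(member_names(data))
```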


Entries of interest:

  • eml.xml: an XML file containing the meta data of the publication, in MODS format
  • taxa.txt: a tab separated TXT file listing the taxa and treatments the DwC-Archive contains, plus higher taxonomy; the Identifier column takes the form <treatmentUUID>.taxon, and the treatment UUID can be used to access the treatment on the Plazi servers (see below)
  • occurrences.txt: a tab separated TXT file containing occurrence data; the TaxonID column references the Identifier column in taxa.txt, the data column headers are DwC terms
  • media.txt: a tab separated TXT file containing HTML versions of the treatments; the TaxonID column references the Identifier column in taxa.txt, the HTML treatments are located in the Description column
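  • references.txt: a tab separated TXT file containing bibliographic references to individual treatments, one row per treatment, with an ID reference to taxa.txt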


A detailed description of contents can be found here

http://github.com/plazi/Plazi-Communications/wiki/GBIF#darwin-core-archive
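The link between occurrences.txt and taxa.txt described above can be followed with a short Python sketch. It assumes that each TXT file starts with a header row and that the header spellings are exactly Identifier and TaxonID as named in this section; both should be checked against meta.xml of a real archive. The helper names are hypothetical.

```python
import csv
import io
import zipfile
from collections import defaultdict


def read_table(archive: zipfile.ZipFile, name: str) -> list[dict]:
    """Read a tab separated member file; the first row is assumed to hold headers."""
    with archive.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE))


def treatment_uuid(identifier: str) -> str:
    """Turn '<treatmentUUID>.taxon' into '<treatmentUUID>'."""
    return identifier.split(".", 1)[0]


def occurrences_by_treatment(archive_bytes: bytes) -> dict[str, list[dict]]:
    """Group occurrence rows by the treatment UUID of the taxon they reference."""
    grouped = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        # Identifiers of all taxa in the core file.
        known = {row["Identifier"] for row in read_table(archive, "taxa.txt")}
        for row in read_table(archive, "occurrences.txt"):
            # Keep only occurrences that point at a row of the core file.
            if row["TaxonID"] in known:
                grouped[treatment_uuid(row["TaxonID"])].append(row)
    return dict(grouped)
```

The keys of the returned dictionary are treatment UUIDs, so they can be used directly in the treatment URLs described in the next section.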


Accessing a particular treatment on the Plazi servers

HTTP GET

http://plazi.cs.umb.edu/GgServer/html/<treatmentUUID>

Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives


Example:

http://plazi.cs.umb.edu/GgServer/html/8C4CE845A6DEE6FDFD1600A70D5BC71B

Response (HTML, encoded in UTF-8): a web page displaying the treatment


HTTP GET

http://plazi.cs.umb.edu/GgServer/xml/<treatmentUUID>

Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives


Example:

http://plazi.cs.umb.edu/GgServer/xml/8C4CE845A6DEE6FDFD1600A70D5BC71B

Response (XML, encoded in UTF-8): the raw, generic XML version of the treatment, which all other representations are generated from


HTTP GET

http://plazi.cs.umb.edu/GgServer/taxonx/<treatmentUUID>

Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives


Example:

http://plazi.cs.umb.edu/GgServer/taxonx/8C4CE845A6DEE6FDFD1600A70D5BC71B

Response (XML, encoded in UTF-8): a TaxonX XML version of the treatment
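The three requests above differ only in one path segment, so a single helper can cover them. The following Python sketch is one possible way to wrap them; treatment_url and fetch_treatment are hypothetical names, and the timeout is an arbitrary choice.

```python
from urllib.request import urlopen

GGSERVER_BASE = "http://plazi.cs.umb.edu/GgServer"
# The three representations described in this section.
FORMATS = ("html", "xml", "taxonx")


def treatment_url(treatment_uuid: str, fmt: str = "xml") -> str:
    """Build the URL of one representation of a treatment."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}; expected one of {FORMATS}")
    return f"{GGSERVER_BASE}/{fmt}/{treatment_uuid}"


def fetch_treatment(treatment_uuid: str, fmt: str = "xml", timeout: float = 30.0) -> str:
    """HTTP GET the treatment and decode it as UTF-8 (needs network access)."""
    with urlopen(treatment_url(treatment_uuid, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(treatment_url("8C4CE845A6DEE6FDFD1600A70D5BC71B", "taxonx"))
```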


List of Plazi's available DwC-Archives from GBIF API

GBIF harvests Plazi data regularly and can be used as an alternative source.


HTTP GET

http://api.gbif.org/v1/organization/7ce8aef0-9e92-11dc-8738-b8a03c50a862/publishedDataset?limit=20&offset=<20k>

Replace <20k> with any multiple of 20 (including 0) to page through the list. It is also possible to use a limit other than 20, with the offset then being a multiple of that other limit.

Example (first 20 datasets):

http://api.gbif.org/v1/organization/7ce8aef0-9e92-11dc-8738-b8a03c50a862/publishedDataset?limit=20&offset=0
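The paging rule (offset is a multiple of the limit) can be expressed as a small Python helper. The name page_url is hypothetical; the organization key and base URL are copied from the request above.

```python
from urllib.parse import urlencode

GBIF_API = "http://api.gbif.org/v1"
# Plazi's organization key, as it appears in the request above.
PLAZI_ORG_KEY = "7ce8aef0-9e92-11dc-8738-b8a03c50a862"


def page_url(page: int, limit: int = 20) -> str:
    """Build the listing URL for a zero-based page number."""
    if page < 0 or limit < 1:
        raise ValueError("page must be >= 0 and limit must be >= 1")
    # The offset is always a multiple of the chosen limit.
    query = urlencode({"limit": limit, "offset": page * limit})
    return f"{GBIF_API}/organization/{PLAZI_ORG_KEY}/publishedDataset?{query}"


if __name__ == "__main__":
    print(page_url(0))
    print(page_url(2, limit=50))
```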

Response (JSON; the sample below shows a single result, with some arrays abbreviated as [...])

{
	"offset": 0,
	"limit": 1,
	"endOfRecords": false,
	"count": 1129,
	"results": [{
		"key": "3e8b196b-c482-47f1-9574-772141310c40",
		"installationKey": "7ce8aef1-9e92-11dc-8740-b8a03c50a999",
		"publishingOrganizationKey": "7ce8aef0-9e92-11dc-8738-b8a03c50a862",
		"external": false,
		"numConstituents": 0,
		"type": "CHECKLIST",
		"title": "Revision of the ant genus Myrmoteras in the Malay Archipelago (Hymenoptera, Formicidae).",
		"description": "UNAVAILABLE",
		"language": "eng",
		"homepage": "http://plazi.cs.umb.edu/GgServer/summary/23A1465DDF212F7DA589F41341B83FCC",
		"citation": {
			"text": "Plazi.org taxonomic treatments database: Revision of the ant genus Myrmoteras in the Malay Archipelago (Hymenoptera, Formicidae)."
		},
		"rights": "No known copyright restrictions apply. See Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53 for further explanation.",
		"lockedForAutoUpdate": false,
		"createdBy": "plazi",
		"modifiedBy": "crawler.gbif.org",
		"created": "2014-06-28T12:55:54.089+0000",
		"modified": "2014-11-25T13:29:20.716+0000",
		"contacts": [...],
		"endpoints": [{
			"key": 45389,
			"type": "DWC_ARCHIVE",
			"url": "http://plazi.cs.umb.edu/GgServer/dwca/23A1465DDF212F7DA589F41341B83FCC.zip",
			"createdBy": "plazi",
			"modifiedBy": "plazi",
			"created": "2014-06-28T12:55:54.604+0000",
			"modified": "2014-06-28T12:55:54.604+0000",
			"machineTags": []
		}],
		"machineTags": [...],
		"tags": [],
		"identifiers": [{
			"key": 23594,
			"type": "UUID",
			"identifier": "23A1465DDF212F7DA589F41341B83FCC",
			"createdBy": "plazi",
			"created": "2014-06-28T12:55:54.334+0000"
		}],
		"comments": [],
		"bibliographicCitations": [],
		"curatorialUnits": [],
		"taxonomicCoverages": [],
		"geographicCoverages": [],
		"temporalCoverages": [],
		"keywordCollections": [],
		"countryCoverage": [],
		"collections": [],
		"dataDescriptions": []
	}]
}


Entries of interest:

  • endOfRecords: if false, increasing offset will return further datasets
  • count: total number of available Plazi datasets
  • results.endpoints.url: the URL of the DwC-Archive for the dataset
  • results.identifiers.identifier: the UUID of the dataset
  • results.homepage: the URL of an HTML page listing the taxonomic treatments whose data is contained in the DwC-Archive
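A sketch of how the entries of interest might be pulled out of a response and how endOfRecords can drive paging is given below. It assumes the JSON layout shown in the sample response; summarize_page, iter_datasets, and the fetch_page callback are hypothetical names, and SAMPLE_PAGE is a shortened copy of the sample response. The callback keeps the network access out of the sketch; in practice it would issue the HTTP GET for the given offset and decode the JSON.

```python
import json
from typing import Callable, Iterator


def summarize_page(page: dict) -> list[dict]:
    """Extract homepage, archive URLs, and dataset UUIDs from one response page."""
    summaries = []
    for dataset in page.get("results", []):
        summaries.append({
            "homepage": dataset.get("homepage"),
            "archive_urls": [
                e["url"] for e in dataset.get("endpoints", [])
                if e.get("type") == "DWC_ARCHIVE"
            ],
            "uuids": [
                i["identifier"] for i in dataset.get("identifiers", [])
                if i.get("type") == "UUID"
            ],
        })
    return summaries


def iter_datasets(fetch_page: Callable[[int], dict], limit: int = 20) -> Iterator[dict]:
    """Yield dataset summaries page by page until endOfRecords is true."""
    offset = 0
    while True:
        page = fetch_page(offset)
        yield from summarize_page(page)
        # Stop when the API reports the last page (or omits the flag).
        if page.get("endOfRecords", True):
            break
        offset += limit


# Shortened copy of the sample response shown above.
SAMPLE_PAGE = json.loads("""{
  "offset": 0, "limit": 1, "endOfRecords": false, "count": 1129,
  "results": [{
    "homepage": "http://plazi.cs.umb.edu/GgServer/summary/23A1465DDF212F7DA589F41341B83FCC",
    "endpoints": [{"type": "DWC_ARCHIVE",
                   "url": "http://plazi.cs.umb.edu/GgServer/dwca/23A1465DDF212F7DA589F41341B83FCC.zip"}],
    "identifiers": [{"type": "UUID", "identifier": "23A1465DDF212F7DA589F41341B83FCC"}]
  }]
}""")
```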


References

  1. Plazi. http://plazi.org
  2. Biodiversity Literature Repository. https://zenodo.org/collection/user-biosyslit
  3. Catapano T. 2010. TaxPub: An Extension of the NLM/NCBI Journal Publishing DTD for Taxonomic Descriptions. Proceedings of the Journal Article Tag Suite Conference 2010.
  4. Darwin Core Archive.



Appendix: Darwin Core Archive Content

taxa.txt


occurrences.txt


description.txt


distribution.txt


media.txt


references.txt


vernaculars.txt


Further reading

Plazi background documents


Downloads

Download the description as PDF


Support and Questions

For support and questions, please contact our support

Version

20150223