colonel package

Module contents

Colonel - a Python 3 library for handling CoNLL data formats

class colonel.Sentence(elements=None, comments=None)[source]

Bases: object

Representation of a sentence.

This class is modeled starting from the CoNLL-U Format specification, which states that sentences consist of one or more word lines. Each word line contains a series of fields, first of all an ID, the value of which determines the kind of the whole line: a single word, a (multiword) token or an empty node.
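The three kinds of line can be told apart from the ID value alone. The following is a minimal sketch that does not use this library; the `classify_id` helper and its return values are hypothetical names chosen for illustration.

```python
import re

# Hypothetical helper, not part of colonel: classify a raw CoNLL-U ID value.
_WORD = re.compile(r"[0-9]+")                # e.g. "3"
_MULTIWORD = re.compile(r"[0-9]+-[0-9]+")    # e.g. "3-4"
_EMPTY_NODE = re.compile(r"[0-9]+\.[0-9]+")  # e.g. "8.1"


def classify_id(value: str) -> str:
    """Return 'word', 'multiword' or 'empty_node' for a raw ID string."""
    if _WORD.fullmatch(value):
        return "word"
    if _MULTIWORD.fullmatch(value):
        return "multiword"
    if _EMPTY_NODE.fullmatch(value):
        return "empty_node"
    raise ValueError(f"unrecognized ID value: {value!r}")
```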

Analogously, a Sentence here mostly consists of an ordered list of elements, each of which can be an instance of any BaseSentenceElement subclass, commonly a Word, a Multiword or an EmptyNode.

Since the CoNLL-U format allows the presence of comment lines before a sentence, the comments attribute is made available here as a simple list of strings.

comments

Miscellaneous comments related to the sentence.

For the time being, this project gives no particular meaning to the values of this attribute. However, the following guidelines should be followed in order to facilitate possible future usage and processing:

  • the presence of the leading # character (which denotes the start of a comment line in CoNLL-U format) is discouraged, in order to keep comments more format-independent;
  • each comment should always be stripped of leading/trailing spaces and newline characters.
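One way to follow both guidelines before storing a comment is sketched below. The `normalize_comment` function is a hypothetical helper, not something the library is known to provide.

```python
def normalize_comment(line: str) -> str:
    """Strip surrounding whitespace and a single leading '#' marker.

    Hypothetical helper, not part of colonel.
    """
    text = line.strip()
    if text.startswith("#"):
        # Drop the CoNLL-U comment marker, then any space that followed it.
        text = text[1:].strip()
    return text
```

The resulting strings could then be collected in a plain list and assigned to the comments attribute.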
elements

Ordered list of words, tokens and nodes which form the sentence.

Usually this list can be freely and directly manipulated, since the methods of the class always recompute their returned values accordingly; just pay particular attention when changing it in the middle of an iteration (see for example the words() and raw_tokens() methods).

is_valid()[source]

Returns whether or not the sentence is valid.

The checks implemented here are mostly based on the CoNLL-U format and on the most widely adopted practices in NLP and dependency parsing. They are limited to a minimum set of essential validations, so that you are free to use them as a foundation for other custom rules in your application.

A sentence is considered valid only if all of the following conditions apply:

  • there is at least one element of type Word;

  • every single element is valid as well - see BaseSentenceElement.is_valid() and its overrides in the subclasses;

  • the ordered sequence of the elements and their ID is valid, that is:

    • the sequence of Word.index values starts from 1 and increases by 1 at each step;
    • there are no duplicate indexes or overlapping ranges;
    • the EmptyNode elements (if any) are correctly placed after the Word element related to their EmptyNode.main_index (or before the first word of the sentence, when the main index is zero), and for each sequence of empty nodes their EmptyNode.sub_index starts from 1 and increases by 1 at each step;
    • the Multiword elements (if any) are correctly placed before the first Word included in their index range, and each range always covers existing Word elements in the sentence;
  • if one or more Word.head values are set (not None), each head must refer to the index of a Word existing within the sentence, or at least be equal to zero (0, for root grammatical relations).
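The word index and head conditions above can be sketched on plain data. This is only an illustration under stated assumptions: words are represented as hypothetical (index, head) tuples rather than Word instances, the function name is invented here, and empty nodes and multiword tokens are left out entirely. It is not the library's implementation.

```python
from typing import Optional, Sequence, Tuple


def words_are_consistent(words: Sequence[Tuple[int, Optional[int]]]) -> bool:
    """Check the index and head rules on hypothetical (index, head) pairs."""
    if not words:
        # At least one word is required.
        return False
    indexes = [index for index, _ in words]
    # Indexes must be exactly 1, 2, 3, ... in order (this also rules out duplicates).
    if indexes != list(range(1, len(words) + 1)):
        return False
    # A head, when set, must be 0 (root) or the index of an existing word.
    valid_heads = set(indexes) | {0}
    return all(head is None or head in valid_heads for _, head in words)
```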

Return type:bool
raw_tokens()[source]

Extracts the raw token sequence.

Iterates through elements and yields only those elements which represent the raw sequence of tokens in the sentence. The result includes Word and Multiword elements, skipping all Word items whose indexes are included in the range of a preceding Multiword.

Empty nodes are ignored.

This method does not perform any validity check on the elements, so if you want to ensure valid and meaningful results, please refer to is_valid(); unless you really know what you are doing, iterating over an invalid sentence could lead to wrong or incoherent results or unexpected behaviour.
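The selection rule described above can be sketched with stand-in classes. Everything in this block (`SketchWord`, `SketchMultiword`, `sketch_raw_tokens`) is hypothetical and only mirrors the documented behaviour; how the library implements raw_tokens() internally is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


# Hypothetical stand-ins for the library's element classes.
@dataclass
class SketchWord:
    index: int
    form: str


@dataclass
class SketchMultiword:
    first_index: int
    last_index: int
    form: str


def sketch_raw_tokens(
    elements: Iterable[object],
) -> Iterator[Union[SketchWord, SketchMultiword]]:
    covered_until = 0  # last word index covered by a preceding multiword
    for element in elements:
        if isinstance(element, SketchMultiword):
            covered_until = element.last_index
            yield element
        elif isinstance(element, SketchWord):
            if element.index > covered_until:
                yield element
        # Any other element kind (for example an empty node) is ignored.


# "vámonos al mar", the multiword example used in the CoNLL-U specification.
elements = [
    SketchMultiword(1, 2, "vámonos"),
    SketchWord(1, "vamos"),
    SketchWord(2, "nos"),
    SketchMultiword(3, 4, "al"),
    SketchWord(3, "a"),
    SketchWord(4, "el"),
    SketchWord(5, "mar"),
]
forms = [token.form for token in sketch_raw_tokens(elements)]
```

Here `forms` holds the three surface tokens of the sentence, while the five syntactic words are what words() is documented to return.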

Return type:Iterator[Union[Word, Multiword]]
to_conllu()[source]

Returns a CoNLL-U formatted representation of the sentence.

No validity check is performed on the sentence and its elements; elements and values not compatible with the CoNLL-U format could lead to incorrect output or to exceptions being raised.

Return type:str
words()[source]

Extracts the sequence of words.

Iterates through elements and yields Word elements only. This can be especially handy in many dependency parsing contexts, where the focus mostly resides among simple words and their relations, ignoring the additional information carried by empty nodes and (multiword) tokens.

This method does not perform any validity check on the elements, so if you want to ensure valid and meaningful results, please refer to is_valid(); unless you really know what you are doing, iterating over an invalid sentence could lead to wrong or incoherent results or unexpected behaviour.

Return type:Iterator[Word]
class colonel.Word(index=None, head=None, deprel=None, **kwargs)[source]

Bases: colonel.base_rich_sentence_element.BaseRichSentenceElement

Representation of a Word sentence element

deprel

Universal dependency relation to the head or a defined language-specific subtype of one.

It is compatible with CoNLL-U DEPREL field.

head

Head of the current word, which is usually a value of another Word’s index or zero (0, for root grammatical relations).

It is compatible with CoNLL-U HEAD field.

index

Word index.

It is compatible with CoNLL-U ID field.

The term index has been preferred over the more conventional ID, mostly for the purpose of preventing confusion, especially with Python’s id() built-in function (which returns the “identity” of an object).

is_valid()[source]

Returns whether or not the object can be considered valid, however ignoring the context of the sentence in which the word itself is possibly inserted.

In compliance with the CoNLL-U format, an instance of type Word is considered valid only when index is set to a value greater than zero (0).

Return type:bool
to_conllu()[source]

Returns a CoNLL-U formatted representation of the element.

No validity check is performed on the attributes; values not compatible with CoNLL-U format could lead to an incorrect output value or raising of exceptions.

Return type:str
class colonel.EmptyNode(main_index=None, sub_index=None, **kwargs)[source]

Bases: colonel.base_rich_sentence_element.BaseRichSentenceElement

Representation of an Empty Node sentence element

is_valid()[source]

Returns whether or not the object can be considered valid, ignoring the context of the sentence in which the empty node itself is possibly inserted.

In compliance with the CoNLL-U format, an instance of type EmptyNode is considered valid only when main_index is set to a value equal to or greater than zero (0) and sub_index is set to a value greater than zero (0).

Return type:bool
main_index

The primary index of the empty node.

This usually corresponds to the value of the Word.index after which the empty node is inserted, or to zero (0) if the empty node is inserted before the first word of the sentence (the one with index equal to 1).

It is compatible with CoNLL-U ID field, which in case of an empty node is a decimal number: the main index here corresponds to the integer part of such value.

sub_index

The secondary index of the empty node.

It is compatible with CoNLL-U ID field, which in case of an empty node is a decimal number: the sub index here corresponds to the decimal part of such value.
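Splitting such an ID into its two parts is safer on the string than on a float, because "8.10" and "8.1" are different sub indexes. The sketch below is a hypothetical helper, not the library's parser.

```python
from typing import Tuple


def split_empty_node_id(value: str) -> Tuple[int, int]:
    """Split an empty node ID such as '8.1' into (main_index, sub_index).

    Hypothetical helper, not part of colonel.
    """
    main, separator, sub = value.partition(".")
    if not separator:
        raise ValueError(f"not an empty node ID: {value!r}")
    # int() raises ValueError on its own if either part is not a number.
    return int(main), int(sub)
```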

to_conllu()[source]

Returns a CoNLL-U formatted representation of the element.

No validity check is performed on the attributes; values not compatible with CoNLL-U format could lead to an incorrect output value or raising of exceptions.

Return type:str
class colonel.Multiword(first_index=None, last_index=None, **kwargs)[source]

Bases: colonel.base_sentence_element.BaseSentenceElement

Representation of a Multiword Token sentence element

first_index

The first word index (inclusive) covered by the multiword token.

This usually corresponds to the value of the Word.index of the first Word which is part of this multiword token.

It is compatible with CoNLL-U ID field, which in case of a multiword token is a range of integer numbers, where first and last bound indexes are separated by a dash (-): the first index here corresponds to the value on the left.

is_valid()[source]

Returns whether or not the object can be considered valid, ignoring the context of the sentence in which the multiword token itself is possibly inserted.

In compliance with the CoNLL-U format, an instance of type Multiword is considered valid only when first_index is set to a value greater than zero (0) and last_index is set to a value greater than first_index.

Return type:bool
last_index

The last word index (inclusive) covered by the multiword token.

This usually corresponds to the value of the Word.index of the last Word which is part of this multiword token.

It is compatible with CoNLL-U ID field, which in case of a multiword token is a range of integer numbers, where first and last bound indexes are separated by a dash (-): the last index here corresponds to the value on the right.
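As an illustration of how the two bounds relate to the ID field and to the validity rule stated for is_valid(), here is a small sketch. Both function names are hypothetical and the code does not use the library.

```python
from typing import Tuple


def split_multiword_id(value: str) -> Tuple[int, int]:
    """Split a multiword ID such as '3-4' into (first_index, last_index).

    Hypothetical helper, not part of colonel.
    """
    first, separator, last = value.partition("-")
    if not separator:
        raise ValueError(f"not a multiword ID: {value!r}")
    return int(first), int(last)


def range_is_valid(first_index: int, last_index: int) -> bool:
    """Mirror the documented rule: first > 0 and last > first."""
    return first_index > 0 and last_index > first_index
```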

to_conllu()[source]

Returns a CoNLL-U formatted representation of the element.

No validity check is performed on the attributes; values not compatible with CoNLL-U format could lead to an incorrect output value or raising of exceptions.

Return type:str
class colonel.UposTag[source]

Bases: enum.Enum

Enumeration of Universal POS tags.

These tags mark the core part-of-speech categories according to the Universal Dependencies framework.

See also the UPOS field in the CoNLL-U format.

Note: always refer to the name of each member; values are automatically generated and thus MUST be considered opaque.
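The note above suggests converting to and from the UPOS field through member names only. The sketch below uses a stand-in enumeration (`SketchTag`, with only three members) so that it runs without the library; the same name-based lookup is standard behaviour of any enum.Enum subclass.

```python
from enum import Enum, auto


# Hypothetical stand-in with auto-generated, opaque values.
class SketchTag(Enum):
    ADJ = auto()
    NOUN = auto()
    VERB = auto()


def tag_from_upos(field: str) -> SketchTag:
    """Look a member up by its name; raises KeyError for unknown names."""
    return SketchTag[field]


def tag_to_upos(tag: SketchTag) -> str:
    """Serialize a member through its name, never through its value."""
    return tag.name
```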

ADJ = 1

adjective

ADP = 2

adposition

ADV = 3

adverb

AUX = 4

auxiliary

CCONJ = 5

coordinating conjunction

DET = 6

determiner

INTJ = 7

interjection

NOUN = 8

noun

NUM = 9

numeral

PART = 10

particle

PRON = 11

pronoun

PROPN = 12

proper noun

PUNCT = 13

punctuation

SCONJ = 14

subordinating conjunction

SYM = 15

symbol

VERB = 16

verb

X = 17

other