colonel.sentence module
Module providing the Sentence class.
class colonel.sentence.Sentence(elements=None, comments=None)
Bases: object
Representation of a sentence.
This class is modeled starting from the CoNLL-U Format specification, which states that sentences consist of one or more word lines. Each word line contains a series of fields, first of all an ID, the value of which determines the kind of the whole line: a single word, a (multiword) token or an empty node.
Analogously, here a Sentence mostly consists of an ordered list of elements, each of which can be an instance of any BaseSentenceElement subclass, commonly a Word, a Multiword or an EmptyNode.
Since the CoNLL-U format allows comment lines before a sentence, the comments attribute is made available here as a simple list of strings.
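The three kinds of word line described above can be told apart from the ID field alone. The following is a minimal sketch based on the CoNLL-U ID conventions (a plain integer for a word, an integer range for a multiword token, a decimal for an empty node); `classify_id` is a hypothetical helper written for illustration and is not part of this module.

```python
import re

# ID shapes as described by the CoNLL-U format:
#   "3"   -> word
#   "3-4" -> (multiword) token spanning words 3 to 4
#   "3.1" -> empty node placed after word 3 ("0.1" precedes the first word)
_WORD_ID = re.compile(r"[1-9][0-9]*")
_RANGE_ID = re.compile(r"[1-9][0-9]*-[1-9][0-9]*")
_EMPTY_ID = re.compile(r"[0-9]+\.[1-9][0-9]*")


def classify_id(raw_id: str) -> str:
    """Return 'word', 'multiword' or 'empty_node' for a CoNLL-U ID value."""
    if _WORD_ID.fullmatch(raw_id):
        return "word"
    if _RANGE_ID.fullmatch(raw_id):
        return "multiword"
    if _EMPTY_ID.fullmatch(raw_id):
        return "empty_node"
    raise ValueError(f"not a recognised CoNLL-U ID: {raw_id!r}")
```

A parser built this way would pick the element class to instantiate from the returned label.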
comments
Miscellaneous comments related to the sentence.
For the time being, this project assigns no particular meaning to the values of this attribute. However, the following guidelines should be followed to facilitate possible future usage and processing:
- the leading # character (which denotes the start of a comment line in CoNLL-U format) is discouraged, in order to keep comments more format-independent;
- each comment should always be stripped of leading/trailing spaces and newline characters.
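The two guidelines above can be applied with a small normalisation step before a comment is stored. This is only a sketch: `normalize_comment` is a hypothetical helper, not a function provided by this module.

```python
def normalize_comment(raw_line: str) -> str:
    """Strip surrounding whitespace and one leading '#' from a comment line."""
    text = raw_line.strip()
    if text.startswith("#"):
        # Drop the CoNLL-U comment marker, then any space that followed it.
        text = text[1:].strip()
    return text


raw_lines = ["# sent_id = 1\n", "  text = Hello world  "]
comments = [normalize_comment(line) for line in raw_lines]
```

The resulting list of plain strings is in the shape the comments attribute expects.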
elements
Ordered list of words, tokens and nodes which form the sentence.
Usually this list can be freely and directly manipulated, since the methods of the class always recompute their returned values accordingly; just pay particular attention when making changes during an iteration (see for example the words() and raw_tokens() methods).
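The caution about changing the list during an iteration applies to any lazy iterator over a list. The sketch below uses hypothetical stand-ins (`StubWord`, `iter_words`, `drop_words`) rather than the real classes, and shows one safe pattern: take a snapshot of the iterator with `list()` before mutating the underlying list.

```python
from dataclasses import dataclass
from typing import Iterator, List, Set


@dataclass
class StubWord:
    """Hypothetical stand-in for a word element."""
    index: int
    form: str


def iter_words(elements: List[object]) -> Iterator[StubWord]:
    """Lazily yield the StubWord items of a list, in order."""
    for element in elements:
        if isinstance(element, StubWord):
            yield element


def drop_words(elements: List[object], unwanted_forms: Set[str]) -> None:
    """Remove, in place, every StubWord whose form is in unwanted_forms."""
    # list(...) exhausts the generator first, so removing items afterwards
    # cannot make the iteration skip elements.
    for word in list(iter_words(elements)):
        if word.form in unwanted_forms:
            elements.remove(word)
```

After such a removal the remaining indexes are no longer consecutive, so they would need renumbering before the sentence could pass a validity check.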
is_valid()
Returns whether or not the sentence is valid.
The checks implemented here are mostly based on the CoNLL-U format and on the most widely adopted practices in NLP and dependency parsing. They are deliberately limited to a minimal set of essential validations, so that you are free to use this as a foundation for other custom rules in your application.
A sentence is considered valid only if all of the following conditions apply:
- there is at least one element of type Word;
- every single element is valid as well (see BaseSentenceElement.is_valid() and the overrides in its subclasses);
- the ordered sequence of the elements and their IDs is valid, that is:
  - the sequence of Word.index values starts from 1 and increases progressively in steps of 1;
  - there are no duplicate indexes or overlapping ranges;
  - the EmptyNode elements (if any) are correctly placed after the Word element related to their EmptyNode.main_index (or before the first word of the sentence, when the main index is zero), and for each sequence of empty nodes their EmptyNode.sub_index starts from 1 and increases progressively in steps of 1;
  - the Multiword elements (if any) are correctly placed before the first Word included in their index range, and each range always covers existing Word elements in the sentence;
- if one or more Word.head values are set (not None), each head must refer to the index of a Word existing within the sentence, or be equal to zero (0, for root grammatical relations).
Return type: bool
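Two of the conditions listed above (consecutive word indexes and resolvable heads) are easy to express in isolation. The following sketch is not the method's actual implementation; `StubWord` and both functions are hypothetical and model only the index and head attributes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubWord:
    """Hypothetical stand-in exposing only the attributes checked below."""
    index: int
    head: Optional[int] = None


def word_indexes_are_sequential(words: List[StubWord]) -> bool:
    """True if there is at least one word and the indexes run 1, 2, 3, ..."""
    indexes = [word.index for word in words]
    return bool(words) and indexes == list(range(1, len(words) + 1))


def heads_are_resolvable(words: List[StubWord]) -> bool:
    """True if every head that is set is 0 (root) or an existing word index."""
    known = {word.index for word in words}
    return all(
        word.head is None or word.head == 0 or word.head in known
        for word in words
    )
```

Custom rules for an application could be layered on top in the same style, each as a separate predicate.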
raw_tokens()
Extracts the raw token sequence.
Iterates through elements and yields only the elements which represent the raw sequence of tokens in the sentence. The result includes Word and Multiword elements, skipping all Word items whose indexes are included in the range of a preceding Multiword. Empty nodes are ignored.
This method does not perform any validity check on the elements, so if you want to ensure valid and meaningful results, please refer to is_valid(); unless you really know what you are doing, iterating over an invalid sentence could lead to wrong or incoherent results or unexpected behaviour.
Return type: Iterator[Union[Word, Multiword]]
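The skipping rule described above can be sketched as follows. The stub classes and their attribute names (`first_index`, `last_index`, and so on) are assumptions made for illustration, and `iter_raw_tokens` is a hypothetical function, not the method's actual source.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class StubWord:
    index: int
    form: str


@dataclass
class StubMultiword:
    first_index: int
    last_index: int
    form: str


@dataclass
class StubEmptyNode:
    main_index: int
    sub_index: int
    form: str


def iter_raw_tokens(
    elements: Iterable[object],
) -> Iterator[Union[StubWord, StubMultiword]]:
    """Yield multiword tokens plus the words not covered by a preceding one."""
    covered = range(0)  # index range of the most recent multiword (empty at start)
    for element in elements:
        if isinstance(element, StubMultiword):
            covered = range(element.first_index, element.last_index + 1)
            yield element
        elif isinstance(element, StubWord) and element.index not in covered:
            yield element
        # Empty nodes and covered words fall through and are skipped.


# Spanish "Vengo del mar", where "del" is the contraction of "de" + "el".
sentence = [
    StubWord(1, "Vengo"),
    StubMultiword(2, 3, "del"),
    StubWord(2, "de"),
    StubWord(3, "el"),
    StubWord(4, "mar"),
    StubEmptyNode(4, 1, "x"),
]
surface_forms = [token.form for token in iter_raw_tokens(sentence)]
```

Like the method it imitates, the sketch relies on the elements being correctly ordered and performs no validation of its own.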
to_conllu()
Returns a CoNLL-U formatted representation of the sentence.
No validity check is performed on the sentence and its elements; elements and values not compatible with the CoNLL-U format could lead to incorrect output or to exceptions being raised.
Return type: str
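For orientation, a CoNLL-U word line is made of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with an underscore standing for an unspecified value. The sketch below formats a single such line; `format_word_line` is a hypothetical helper and makes no claim about how this method builds its output.

```python
from typing import Optional


def format_word_line(
    index: int,
    form: str,
    lemma: Optional[str] = None,
    upos: Optional[str] = None,
    xpos: Optional[str] = None,
    feats: Optional[str] = None,
    head: Optional[int] = None,
    deprel: Optional[str] = None,
    deps: Optional[str] = None,
    misc: Optional[str] = None,
) -> str:
    """Join the ten CoNLL-U fields with tabs, using '_' for unset values."""
    fields = [index, form, lemma, upos, xpos, feats, head, deprel, deps, misc]
    return "\t".join("_" if field is None else str(field) for field in fields)


line = format_word_line(1, "Hello", upos="INTJ", head=0, deprel="root")
```

Values that themselves contain tabs or newlines would break the line layout, which is one way incompatible data can produce incorrect output.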
words()
Extracts the sequence of words.
Iterates through elements and yields Word elements only. This can be especially handy in many dependency parsing contexts, where the focus is mostly on simple words and their relations, ignoring the additional information carried by empty nodes and (multiword) tokens.
This method does not perform any validity check on the elements, so if you want to ensure valid and meaningful results, please refer to is_valid(); unless you really know what you are doing, iterating over an invalid sentence could lead to wrong or incoherent results or unexpected behaviour.
Return type: Iterator[Word]
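As an illustration of the dependency parsing use case mentioned above, a word-only sequence is enough to group dependents under their heads. `StubWord` and `dependents_by_head` are hypothetical names used only for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class StubWord:
    """Hypothetical stand-in with just the attributes needed here."""
    index: int
    form: str
    head: Optional[int] = None


def dependents_by_head(words: Iterable[StubWord]) -> Dict[int, List[int]]:
    """Map each head index (0 for root) to the indexes of its dependents."""
    tree: Dict[int, List[int]] = defaultdict(list)
    for word in words:
        if word.head is not None:  # words without a head are left out
            tree[word.head].append(word.index)
    return dict(tree)


words = [
    StubWord(1, "The", head=2),
    StubWord(2, "cat", head=3),
    StubWord(3, "sleeps", head=0),
]
tree = dependents_by_head(words)
```

Because only words carry heads in this sketch, empty nodes and multiword tokens never need to be considered.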