colonel¶
Colonel is a Python 3 library for handling CoNLL data formats.
colonel package¶
Subpackages¶
colonel.conllu package¶
Submodules¶
colonel.conllu.lexer module¶
Module providing the ConlluLexerBuilder
class and related
exception classes.
-
class
colonel.conllu.lexer.
ConlluLexerBuilder
[source]¶ Bases:
object
Class containing PLY Lex rules for processing the CoNLL-U format and for creating new related PLY
Lexer
instances.
Usually you can simply invoke the class method
build()
, which returns a PLY
Lexer
; this lexer instance is ready to process your input, using the rules provided by the
ConlluLexerBuilder
class itself.
-
classmethod
build
()[source]¶ Returns a PLY
Lexer
instance for CoNLL-U processing.
The returned lexer makes use of the rules defined by
ConlluLexerBuilder
.
Return type: Lexer
-
static
find_column
(token)[source]¶ Given a
LexToken
, returns the related column number.
Return type: int
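As an illustration of how a column number can be derived from a token position, here is a minimal sketch based on the idiom shown in the PLY documentation. It is not the library's implementation: the function name `column_of` and its explicit `text` argument are assumptions made for this example, whereas `find_column()` itself only receives the token.

```python
def column_of(text: str, lexpos: int) -> int:
    """Return the 1-based column of the character at absolute offset ``lexpos``.

    Hypothetical helper: ``text`` is the whole input and ``lexpos`` plays the
    role of the ``lexpos`` attribute of a PLY ``LexToken``.
    """
    # Offset of the first character of the line that contains ``lexpos``.
    # ``rfind`` returns -1 when there is no earlier newline, which gives 0.
    line_start = text.rfind("\n", 0, lexpos) + 1
    return lexpos - line_start + 1


sample = "1\tThe\n2\tcat\n"
offset = sample.index("cat")
print(column_of(sample, offset))
```

Note that a tab counts as a single character here, so the column is a character offset within the line rather than a visual position.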
-
states
= (('v0', 'exclusive'), ('v1', 'exclusive'), ('v2', 'exclusive'), ('v3', 'exclusive'), ('v4', 'exclusive'), ('v5', 'exclusive'), ('v6', 'exclusive'), ('v7', 'exclusive'), ('v8', 'exclusive'), ('v9', 'exclusive'), ('c1', 'exclusive'), ('c2', 'exclusive'), ('c3', 'exclusive'), ('c4', 'exclusive'), ('c5', 'exclusive'), ('c6', 'exclusive'), ('c7', 'exclusive'), ('c8', 'exclusive'), ('c9', 'exclusive'))¶
-
tokens
= ('NEWLINE', 'TAB', 'COMMENT', 'INTEGER_ID', 'RANGE_ID', 'DECIMAL_ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC')¶
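The token names above map onto the ten tab-separated fields of a CoNLL-U word line, with three distinct token types for the ID field. The sketch below illustrates that distinction with plain regular expressions; it is not the PLY lexer used by the library, and the patterns, `classify_id` and `split_word_line` are assumptions introduced only for this example.

```python
import re

# Hypothetical patterns; only the names mirror the ID-related token types.
ID_PATTERNS = (
    ("RANGE_ID", re.compile(r"[1-9][0-9]*-[1-9][0-9]*")),        # e.g. "3-4"
    ("DECIMAL_ID", re.compile(r"(0|[1-9][0-9]*)\.[1-9][0-9]*")),  # e.g. "8.1"
    ("INTEGER_ID", re.compile(r"[1-9][0-9]*")),                   # e.g. "3"
)

FIELD_NAMES = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
               "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")


def classify_id(value: str) -> str:
    """Return the token type name matching a raw ID field."""
    for name, pattern in ID_PATTERNS:
        if pattern.fullmatch(value):
            return name
    raise ValueError(f"not a valid ID field: {value!r}")


def split_word_line(line: str):
    """Split a word line into its ID token type and a field dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    return classify_id(fields[0]), dict(zip(FIELD_NAMES, fields))


kind, fields = split_word_line("1\tcat\tcat\tNOUN\t_\t_\t0\troot\t_\t_\n")
print(kind, fields["FORM"])
```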
-
exception
colonel.conllu.lexer.
IllegalCharacterError
(token)[source]¶ Bases:
colonel.conllu.lexer.LexerError
Exception raised by
ConlluLexerBuilder
when a lexer error caused by invalid input is encountered.
An exception instance must be initialized with the
LexToken
which the lexer was not able to process, so that
line_number
and
column_number
can be extracted; a short error message is also generated by the constructor.
-
column_number
= None¶ Column position, associated with
line_number
, containing the illegal character, or the start of an illegal sequence.
-
line_number
= None¶ Line number containing the illegal character, or the start of an illegal sequence.
-
-
exception
colonel.conllu.lexer.
LexerError
[source]¶ Bases:
Exception
Generic error class for
ConlluLexerBuilder
.
colonel.conllu.parser module¶
Module providing the ConlluParserBuilder
class and related
exception classes.
-
class
colonel.conllu.parser.
ConlluParserBuilder
[source]¶ Bases:
object
Class containing PLY Yacc rules for processing the CoNLL-U format and for creating new related PLY
LRParser
instances.
Usually you can simply invoke the class method
build()
, which returns a PLY
LRParser
; this parser instance is ready to process your input, using the rules provided by the
ConlluParserBuilder
class itself.
As usual, this class is paired with an associated lexer, which in this case is provided by
ConlluLexerBuilder
.
-
classmethod
build
()[source]¶ Returns a PLY
LRParser
instance for CoNLL-U processing.
The returned parser makes use of the rules defined by
ConlluParserBuilder
.
Return type: LRParser
-
static
p_sentence_with_comments
(prod)[source]¶ sentence : comments wordlines NEWLINE
Return type: None
-
static
p_wordline_emptynode
(prod)[source]¶ wordline : DECIMAL_ID TAB FORM TAB LEMMA TAB UPOS TAB XPOS TAB FEATS TAB HEAD TAB DEPREL TAB DEPS TAB MISC NEWLINE
Return type: None
-
static
p_wordline_multiword
(prod)[source]¶ wordline : RANGE_ID TAB FORM TAB LEMMA TAB UPOS TAB XPOS TAB FEATS TAB HEAD TAB DEPREL TAB DEPS TAB MISC NEWLINE
Return type: None
-
exception
colonel.conllu.parser.
IllegalEmptyNodeError
(prod)[source]¶ Bases:
colonel.conllu.parser.ParserError
Exception raised by
ConlluParserBuilder
when a word line was parsed correctly and has been recognised as an empty node line, but the data is not valid for this kind of element.
An exception instance must be initialized with the
YaccProduction
related to the word line containing illegal data, so that the
line_number
can be extracted; a short error message is also generated by the constructor.
-
exception
colonel.conllu.parser.
IllegalEofError
[source]¶ Bases:
colonel.conllu.parser.ParserError
Exception raised by
ConlluParserBuilder
when a parser error caused by an invalid end-of-file is encountered.
When this exception is raised, it means that the end of the input data has been reached, but some additional tokens were expected in order for the input to be valid CoNLL-U.
-
exception
colonel.conllu.parser.
IllegalMultiwordError
(prod)[source]¶ Bases:
colonel.conllu.parser.ParserError
Exception raised by
ConlluParserBuilder
when a word line was parsed correctly and has been recognised as a multiword token line, but the data is not valid for this kind of element.
An exception instance must be initialized with the
YaccProduction
related to the word line containing illegal data, so that the
line_number
can be extracted; a short error message is also generated by the constructor.
-
exception
colonel.conllu.parser.
IllegalTokenError
(t)[source]¶ Bases:
colonel.conllu.parser.ParserError
Exception raised by
ConlluParserBuilder
when a parser error caused by an invalid token is encountered.
An exception instance must be initialized with the
LexToken
which the parser was not able to process, so that all the exception attributes can be extracted; a short error message is also generated by the constructor.
-
column_number
= None¶ Column position, associated with
line_number
, related to the illegal token encountered, or to the first token of an illegal token sequence.
-
line_number
= None¶ Line number related to the illegal token encountered, or to the first token of an illegal token sequence.
-
type
= None¶ The type of the illegal token encountered, or of the first token of an illegal token sequence.
-
value
= None¶ The value of the illegal token encountered, or of the first token of an illegal token sequence.
-
-
exception
colonel.conllu.parser.
ParserError
[source]¶ Bases:
Exception
Generic error class for
ConlluParserBuilder
.
Module contents¶
This package provides methods and modules to process the CoNLL-U format.
In most situations it’s sufficient to use the parse()
and
to_conllu()
functions, without worrying about the implementation
under the hood.
In more detail, this package provides a lexical analyzer (see lexer
)
and a parser (see parser
) to transform the raw string input into
related Sentence
objects.
Lexer and parser classes are implemented taking advantage of the PLY (Python Lex-Yacc) library; you can learn more from the PLY documentation and from the Lex & Yacc Page.
-
colonel.conllu.
parse
(content)[source]¶ Parses a CoNLL-U string content, returning a list of sentences.
Raises:
- lexer.LexerError – (any specific subclass) in case of invalid input breaking the rules of the CoNLL-U lexer
- parser.ParserError – (any specific subclass) in case of invalid input breaking the rules of the CoNLL-U parser
Parameters: content (
str
) – CoNLL-U formatted string to be parsed
Returns: list of parsed
Sentence
items
-
colonel.conllu.
to_conllu
(sentences)[source]¶ Serializes a list of sentences to a formatted CoNLL-U string.
This function simply concatenates the output of
Sentence.to_conllu()
for each given sentence and does not perform any validity check; sentences and elements not compatible with the CoNLL-U format could lead to an incorrect output value or to exceptions being raised.
Parameters: sentences (List[Sentence]) – list of
Sentence
items
Return type: str
Returns: a CoNLL-U formatted representation of the sentences
Submodules¶
colonel.base_rich_sentence_element module¶
Module providing the BaseRichSentenceElement
class.
-
class
colonel.base_rich_sentence_element.
BaseRichSentenceElement
(lemma=None, upos=None, xpos=None, feats=None, deps=None, **kwargs)[source]¶ Bases:
colonel.base_sentence_element.BaseSentenceElement
Abstract class containing basic information in common with some specific elements being part of a sentence.
It is compliant with the CoNLL-U format, in the sense that it provides a common foundation for elements of type word and empty nodes, which can be made up of a richer set of fields in comparison to other elements, such as the (multiword) tokens.
-
deps
¶ Enhanced dependency graph, usually in the form of a list of head-deprel pairs.
It is compatible with CoNLL-U
DEPS
field.
You are free to assign to it any kind of value suitable for your project.
-
feats
¶ List of morphological features from the universal feature inventory or from a defined language-specific extension.
It is compatible with CoNLL-U
FEATS
field.
You are free to assign to it any kind of value suitable for your project.
-
is_valid
()[source]¶ Returns whether or not the object can be considered valid, however ignoring the context of the sentence in which the word itself is possibly inserted.
An instance of type
BaseRichSentenceElement
is always considered valid, independently from any value of its attributes (it doesn’t provide any additional check to the overridden superclass method).
-
lemma
¶ Lemma of the element.
It is compatible with CoNLL-U
LEMMA
field.
-
to_conllu
()[source]¶ Returns a CoNLL-U formatted representation of the element.
This method is expected to be overridden by each specific element.
-
upos
¶ Universal part-of-speech tag.
It is compatible with CoNLL-U
UPOS
field.
-
xpos
¶ Language-specific part-of-speech tag.
It is compatible with CoNLL-U
XPOS
field.
-
colonel.base_sentence_element module¶
Module providing the BaseSentenceElement
class.
-
class
colonel.base_sentence_element.
BaseSentenceElement
(form=None, misc=None)[source]¶ Bases:
object
Abstract class containing the minimum information in common with all specific elements being part of a sentence.
In the context of this library, it is expected that each item of a sentence is an instance of a
BaseSentenceElement
subclass.
The generic term element is used in order to prevent confusion, while each specialized element (i.e. a subclass of
BaseSentenceElement
) will adopt a more appropriate naming convention, so that, for example, a sentence will usually be formed by words, tokens or nodes.
-
form
¶ Word form or punctuation symbol.
It is compatible with CoNLL-U
FORM
field.
-
is_valid
()[source]¶ Returns whether or not the object can be considered valid, however ignoring the context of the sentence in which the word itself is possibly inserted.
An instance of type
BaseSentenceElement
is always considered valid, independently from any value of its attributes.
Return type: bool
-
misc
¶ Any other annotation.
It is compatible with CoNLL-U
MISC
field.
-
colonel.emptynode module¶
Module providing the EmptyNode
class.
-
class
colonel.emptynode.
EmptyNode
(main_index=None, sub_index=None, **kwargs)[source]¶ Bases:
colonel.base_rich_sentence_element.BaseRichSentenceElement
Representation of an Empty Node sentence element
-
is_valid
()[source]¶ Returns whether or not the object can be considered valid, however ignoring the context of the sentence in which the word itself is possibly inserted.
In compliance with the CoNLL-U format, an instance of type
EmptyNode
is considered valid only when
main_index
is set to a value equal to or greater than zero (0) and
sub_index
is set to a value greater than zero (0).
Return type: bool
-
main_index
¶ The primary index of the empty node.
This usually corresponds to the value of the
Word.index
after which the empty node is inserted, or to zero (0) if the empty node is inserted before the first word of the sentence (the one with index equal to 1).
It is compatible with CoNLL-U
ID
field, which in case of an empty node is a decimal number: the main index here corresponds to the integer part of such value.
-
sub_index
¶ The secondary index of the empty node.
It is compatible with CoNLL-U
ID
field, which in case of an empty node is a decimal number: the sub index here corresponds to the decimal part of such value.
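The validity rule and the decimal ID described above can be sketched as follows. This is a standalone illustration, not the `EmptyNode` implementation: both function names are hypothetical, and treating unset (`None`) indexes as invalid is an assumption drawn from the wording "is set to a value".

```python
from typing import Optional


def empty_node_is_valid(main_index: Optional[int], sub_index: Optional[int]) -> bool:
    """Mirror the documented rule: main_index >= 0 and sub_index > 0."""
    if main_index is None or sub_index is None:
        # Assumption: an index that has not been set makes the node invalid.
        return False
    return main_index >= 0 and sub_index > 0


def empty_node_id(main_index: int, sub_index: int) -> str:
    """Compose the decimal CoNLL-U ID from its integer and decimal parts."""
    return f"{main_index}.{sub_index}"


print(empty_node_is_valid(0, 1), empty_node_id(0, 1))
print(empty_node_is_valid(8, 0))
```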
-
colonel.multiword module¶
Module providing the Multiword
class.
-
class
colonel.multiword.
Multiword
(first_index=None, last_index=None, **kwargs)[source]¶ Bases:
colonel.base_sentence_element.BaseSentenceElement
Representation of a Multiword Token sentence element
-
first_index
¶ The first word index (inclusive) covered by the multiword token.
This usually corresponds to the value of the
Word.index
of the first
Word
which is part of this multiword token.
It is compatible with CoNLL-U
ID
field, which in case of a multiword token is a range of integer numbers, where first and last bound indexes are separated by a dash (-): the first index here corresponds to the value on the left.
-
is_valid
()[source]¶ Returns whether or not the object can be considered valid, however ignoring the context of the sentence in which the word itself is possibly inserted.
In compliance with the CoNLL-U format, an instance of type
Multiword
is considered valid only when
first_index
is set to a value greater than zero (0) and
last_index
is set to a value greater than
first_index
.
Return type: bool
-
last_index
¶ The last word index (inclusive) covered by the multiword token.
This usually corresponds to the value of the
Word.index
of the last
Word
which is part of this multiword token.
It is compatible with CoNLL-U
ID
field, which in case of a multiword token is a range of integer numbers, where first and last bound indexes are separated by a dash (-): the last index here corresponds to the value on the right.
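To make the range semantics concrete, the sketch below restates the documented validity rule and shows which word indexes an inclusive range covers. It is an independent illustration rather than the `Multiword` class itself; the function names are hypothetical, and rejecting unset (`None`) indexes is an assumption.

```python
from typing import List, Optional


def multiword_is_valid(first_index: Optional[int], last_index: Optional[int]) -> bool:
    """Mirror the documented rule: first_index > 0 and last_index > first_index."""
    if first_index is None or last_index is None:
        # Assumption: unset indexes make the multiword token invalid.
        return False
    return first_index > 0 and last_index > first_index


def multiword_id(first_index: int, last_index: int) -> str:
    """Compose the range ID, with both bounds separated by a dash."""
    return f"{first_index}-{last_index}"


def covered_word_indexes(first_index: int, last_index: int) -> List[int]:
    """Both bounds are inclusive, so the last index is part of the result."""
    return list(range(first_index, last_index + 1))


print(multiword_id(3, 4), covered_word_indexes(3, 4))
```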
-
colonel.sentence module¶
Module providing the Sentence
class.
-
class
colonel.sentence.
Sentence
(elements=None, comments=None)[source]¶ Bases:
object
Representation of a sentence.
This class is modeled starting from the CoNLL-U Format specification, which states that sentences consist of one or more word lines. Each word line contains a series of fields, first of all an ID, the value of which determines the kind of the whole line: a single word, a (multiword) token or an empty node.
Analogously, here a
Sentence
mostly consists of an ordered list of
elements
, which can be objects of any
BaseSentenceElement
subclass, commonly a
Word
, a
Multiword
or an
EmptyNode
.
Since the CoNLL-U format allows the presence of comment lines before a sentence, the
comments
attribute is made available here as a simple list of strings.
-
comments
¶ Miscellaneous comments related to the sentence.
For the time being, in the context of this project no particular meaning is given to the values of this attribute; however, the following guidelines should be followed in order to facilitate possible future usage and processing:
- the presence of the leading
#
character (which denotes the start of a comment line in CoNLL-U format) is discouraged, in order to keep comments more format-independent;
- each comment should always be stripped of leading/trailing spaces or newline characters.
-
elements
¶ Ordered list of words, tokens and nodes which form the sentence.
Usually this list can be freely and directly manipulated, since the methods of the class always recompute their returned value accordingly; just pay particular attention when making changes during an iteration (see for example the
words()
and
raw_tokens()
methods).
-
is_valid
()[source]¶ Returns whether or not the sentence is valid.
The checks implemented here are mostly based on the CoNLL-U format and on the most widely adopted practices in NLP and dependency parsing contexts, yet they include only a minimum set of essential validations, so that you are free to use this as a foundation for other custom rules in your application.
A sentence is considered valid only if all of the following conditions apply:
- there is at least one element of type Word;
- every single element is valid as well (see BaseSentenceElement.is_valid() and the overriding methods of its subclasses);
- the ordered sequence of the elements and their IDs is valid, that is:
  - the sequence of Word.index starts from 1 and progressively increases by 1 step;
  - there are no index duplicates or overlapping ranges;
  - the EmptyNode elements (if any) are correctly placed after the Word element related to their EmptyNode.main_index (or before the first word of the sentence, when the main index is zero), and for each sequence of empty nodes their EmptyNode.sub_index starts from 1 and progressively increases by 1 step;
  - the Multiword elements (if any) are correctly placed before the first Word included in their index range, and each range always covers existing Word elements in the sentence;
- if one or more Word.head values are set (not None), each head must refer to the index of a Word existing within the sentence, or at least be equal to zero (0, for root grammatical relations).
Return type: bool
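Two of the conditions listed above, the word index sequence and the head references, can be illustrated on plain lists of integers. This is only a partial sketch under simplifying assumptions: it ignores empty nodes and multiword tokens entirely, and the function names are hypothetical rather than part of the library.

```python
from typing import Optional, Sequence


def word_indexes_are_sequential(indexes: Sequence[int]) -> bool:
    """At least one word, with indexes starting from 1 and increasing by 1."""
    return len(indexes) > 0 and list(indexes) == list(range(1, len(indexes) + 1))


def heads_are_resolvable(indexes: Sequence[int],
                         heads: Sequence[Optional[int]]) -> bool:
    """Every head that is set must be 0 (root) or the index of an existing word."""
    known = set(indexes) | {0}
    return all(head is None or head in known for head in heads)


print(word_indexes_are_sequential([1, 2, 3]))
print(heads_are_resolvable([1, 2, 3], [2, 0, None]))
```

Because sequential indexes cannot repeat, the first check also rules out duplicate word indexes.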
-
raw_tokens
()[source]¶ Extracts the raw token sequence.
Iterates through
elements
and yields only the elements which represent the raw sequence of tokens in the sentence. The result includes
Word
and
Multiword
elements, skipping all
Word
items whose indexes are included in the range of a preceding
Multiword
.
Empty nodes are ignored.
This method does not perform any validity check among the elements, so if you want to ensure valid and meaningful results, please refer to
is_valid()
; unless you really know what you are doing, iterating an invalid sentence could lead to wrong or incoherent results or unexpected behaviours.
Return type: Iterator[Union[Word, Multiword]]
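The skipping behaviour described above can be sketched with small stand-in types. The classes `W`, `MW` and `EN` below are hypothetical substitutes for `Word`, `Multiword` and `EmptyNode`, and the generator is an independent illustration of the documented behaviour, not the library's code.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple, Union


class W(NamedTuple):   # stand-in for Word
    index: int
    form: str


class MW(NamedTuple):  # stand-in for Multiword
    first_index: int
    last_index: int
    form: str


class EN(NamedTuple):  # stand-in for EmptyNode
    main_index: int
    sub_index: int
    form: str


def raw_tokens(elements: Iterable[Union[W, MW, EN]]) -> Iterator[Union[W, MW]]:
    """Yield multiword tokens and the words not covered by a preceding one."""
    current_range: Optional[Tuple[int, int]] = None
    for element in elements:
        if isinstance(element, MW):
            current_range = (element.first_index, element.last_index)
            yield element
        elif isinstance(element, W):
            covered = (current_range is not None
                       and current_range[0] <= element.index <= current_range[1])
            if not covered:
                yield element
        # Empty nodes fall through and are ignored.


sentence = [
    MW(1, 2, "vámonos"), W(1, "vamos"), W(2, "nos"),
    MW(3, 4, "al"), W(3, "a"), W(4, "el"),
    W(5, "mar"), EN(5, 1, "_"),
]
forms = [token.form for token in raw_tokens(sentence)]
print(forms)
```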
-
to_conllu
()[source]¶ Returns a CoNLL-U formatted representation of the sentence.
No validity check is performed on the sentence and its elements; elements and values not compatible with the CoNLL-U format could lead to an incorrect output value or to exceptions being raised.
Return type: str
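For orientation, the general shape of a serialized CoNLL-U sentence can be sketched as below. This follows the CoNLL-U conventions (comment lines starting with `#`, ten tab-separated fields, `_` for unspecified values, a blank line after each sentence); it is not the output routine of this library, and `format_word_line` and `format_sentence` are hypothetical names. Prepending `# ` to each stored comment is an assumption based on the guideline that comments are kept without the leading `#`.

```python
from typing import Optional, Sequence


def format_word_line(fields: Sequence[Optional[object]]) -> str:
    """Join the ten fields with tabs, writing ``_`` for unset (None) values."""
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return "\t".join("_" if field is None else str(field) for field in fields)


def format_sentence(comments: Sequence[str],
                    word_lines: Sequence[Sequence[Optional[object]]]) -> str:
    """Emit comment lines, then word lines, then the blank line ending a sentence."""
    lines = [f"# {comment}" for comment in comments]
    lines.extend(format_word_line(fields) for fields in word_lines)
    return "\n".join(lines) + "\n\n"


line = format_word_line([1, "cat", "cat", "NOUN", None, None, 0, "root", None, None])
text = format_sentence(["sent_id = 1"], [[1, "cat", "cat", "NOUN", None, None, 0, "root", None, None]])
print(text, end="")
```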
-
words
()[source]¶ Extracts the sequence of words.
Iterates through
elements
and yields
Word
elements only. This can be especially handy in many dependency parsing contexts, where the focus is mostly on simple words and their relations, ignoring the additional information carried by empty nodes and (multiword) tokens.
This method does not perform any validity check among the elements, so if you want to ensure valid and meaningful results, please refer to
is_valid()
; unless you really know what you are doing, iterating an invalid sentence could lead to wrong or incoherent results or unexpected behaviours.
Return type: Iterator[Word]
-
colonel.upostag module¶
Module providing the UposTag
enumeration.
-
class
colonel.upostag.
UposTag
[source]¶ Bases:
enum.Enum
Enumeration of Universal POS tags.
These tags mark the core part-of-speech categories according to the Universal Dependencies framework.
See also the
UPOS
field in the CoNLL-U format.
Note: always refer to the name of each member; values are automatically generated and thus MUST be considered opaque.
-
ADJ
= 1¶ adjective
-
ADP
= 2¶ adposition
-
ADV
= 3¶ adverb
-
AUX
= 4¶ auxiliary
-
CCONJ
= 5¶ coordinating conjunction
-
DET
= 6¶ determiner
-
INTJ
= 7¶ interjection
-
NOUN
= 8¶ noun
-
NUM
= 9¶ numeral
-
PART
= 10¶ particle
-
PRON
= 11¶ pronoun
-
PROPN
= 12¶ proper noun
-
PUNCT
= 13¶ punctuation
-
SCONJ
= 14¶ subordinating conjunction
-
SYM
= 15¶ symbol
-
VERB
= 16¶ verb
-
X
= 17¶ other
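The advice to rely on member names rather than values can be illustrated with a reduced stand-in enumeration. `MiniUposTag` is hypothetical and contains only three of the tags; using `enum.auto()` is an assumption about how values might be generated, not a statement about the library's source.

```python
from enum import Enum, auto


class MiniUposTag(Enum):
    """Reduced, hypothetical stand-in for UposTag."""
    ADJ = auto()
    NOUN = auto()
    VERB = auto()


def tag_from_field(value: str) -> MiniUposTag:
    """Look a member up by its name, which matches the UPOS field text."""
    return MiniUposTag[value]


def tag_to_field(tag: MiniUposTag) -> str:
    """Serialize through the name, never through the opaque value."""
    return tag.name


print(tag_from_field("NOUN"), tag_to_field(MiniUposTag.VERB))
```

Looking up an unknown name raises `KeyError`, which gives a natural hook for rejecting tags outside the inventory.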
-
colonel.word module¶
Module providing the Word
class.
-
class
colonel.word.
Word
(index=None, head=None, deprel=None, **kwargs)[source]¶ Bases:
colonel.base_rich_sentence_element.BaseRichSentenceElement
Representation of a Word sentence element
-
deprel
¶ Universal dependency relation to the
head
or a defined language-specific subtype of one.
It is compatible with CoNLL-U
DEPREL
field.
-
head
¶ Head of the current word, which is usually a value of another Word’s
index
or zero (0, for
root
grammatical relations).
It is compatible with CoNLL-U
HEAD
field.
-
index
¶ Word index.
It is compatible with CoNLL-U
ID
field.
The term index has been preferred over the more conventional ID, mostly for the purpose of preventing confusion, especially with Python’s
id()
built-in function (which returns the “identity” of an object).
-
is_valid
()[source]¶ Returns whether or not the object can be considered valid, however ignoring the context of the sentence in which the word itself is possibly inserted.
In compliance with the CoNLL-U format, an instance of type
Word
is considered valid only when
index
is set to a value greater than zero (0).
Return type: bool
-
Module contents¶
Colonel - a Python 3 library for handling CoNLL data formats
The following classes are also available directly from the top-level colonel package; each is documented in full under its own module above:
- colonel.Sentence (see colonel.sentence)
- colonel.Word (see colonel.word)
- colonel.EmptyNode (see colonel.emptynode)
- colonel.Multiword (see colonel.multiword)
- colonel.UposTag (see colonel.upostag)