SMULTRON Swedish lemmatisation guidelines

These guidelines were used for selecting Swedish lemmas in SMULTRON, the Stockholm MULtilingual TReebank.

Goal

Each token gets exactly one lemma.

Method

Each token gets one lemma as suggested by the morphology system Swetwol.

Exceptions

The following tokens do not get a lemma:

  • XML tags
  • Punctuation symbols
  • Numbers that consist only of digits (and related symbols)
      Examples:  37  4,9  30'261  +4.3
  • Roman numerals
      Examples:  CX  IV
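
The exception list above can be sketched as a simple filter. This is only an illustration: the function name `gets_lemma`, the regular expressions and the punctuation set are assumptions, since the guidelines give examples rather than exact definitions. Note that "%" is deliberately not treated as punctuation, because it is lemmatised as an abbreviation.

```python
import re

# Hypothetical patterns, derived only from the examples in the guidelines.
XML_TAG = re.compile(r"</?[^<>]+>")
DIGIT_NUMBER = re.compile(r"[+-]?\d+(?:[.,']\d+)*")
ROMAN_NUMERAL = re.compile(
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)
# Assumed punctuation inventory; "%" is excluded on purpose.
PUNCTUATION_CHARS = set(".,;:!?()[]\"'/-–")


def gets_lemma(token: str) -> bool:
    """Return False for tokens that the guidelines exclude from lemmatisation."""
    if not token:
        return False
    if (
        XML_TAG.fullmatch(token)
        or DIGIT_NUMBER.fullmatch(token)
        or ROMAN_NUMERAL.fullmatch(token)
    ):
        return False
    if all(ch in PUNCTUATION_CHARS for ch in token):
        return False
    return True
```

A purely form-based check like this cannot replace the annotator: a sentence-initial "I" (the Swedish preposition) would be mistaken for a Roman numeral, so such cases still need to be decided in context.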

Case 1: Multiple Swetwol Lemmas

If Swetwol provides more than one lemma for the given token (and the given part of speech), then the lemma that is most appropriate in the given context is chosen. This applies in particular to the choice between different possible segmentations.

Example:

	framtidsutsikter    -->  framtid\s#utsikt
	                         NOT: fram#tid\s#utsikt
	bostadsområdena     -->  bostad\s#område
	                         NOT: bo#stad\s#område
	arbetsförhållanden  -->  arbet\s#förhållande
	                         NOT: arbetsför#hållande
	                         NOT: arbetsför#hål#land

The disambiguation between multiple lemmas is done manually.

Important: We add information to the Swetwol lemma by marking the linking 's' between compound parts with \s. (This linking morph is sometimes called an interfix.)
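
To make the notation concrete, here is a minimal sketch of how a segmented lemma could be read programmatically, assuming that '#' separates compound parts and that a trailing \s on a part marks the linking 's'. The function names `split_lemma` and `surface_base` are hypothetical and not part of Swetwol or SMULTRON.

```python
def split_lemma(lemma: str) -> list[tuple[str, bool]]:
    """Split a segmented lemma into (stem, has_linking_s) pairs."""
    parts = []
    for segment in lemma.split("#"):
        has_s = segment.endswith("\\s")
        stem = segment[:-2] if has_s else segment
        parts.append((stem, has_s))
    return parts


def surface_base(lemma: str) -> str:
    """Rebuild the unsegmented base form, re-inserting each linking 's'."""
    return "".join(
        stem + ("s" if has_s else "") for stem, has_s in split_lemma(lemma)
    )
```

For example, `split_lemma(r"framtid\s#utsikt")` yields the parts `framtid` (with linking 's') and `utsikt`, and `surface_base` joins them back into `framtidsutsikt`.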

Case 2: No Swetwol Lemma

If Swetwol does not provide a lemma, then the human annotator chooses the correct lemma.

If the word is a proper name (PoS=NE), then the lemma is identical to the word form, unless the name is in the genitive. In that case the genitive suffix -s is removed.

Example:

	IBMs      -->  IBM
	Schröders -->  Schröder

If the word is a foreign word (PoS=FM), then the lemma is identical to the word form, unless it is an English word in the plural. In that case the plural suffix -s is removed.

Example:

	Directors  -->  Director

If a foreign word is identical to a Swedish (loan) word, Swetwol might provide a lemma for it (including segmentations). In this case the Swetwol lemma is not used.

If the token is an abbreviation, then the full word is taken as the basis for lemmatisation.

Example:

	kl   -->  klocka
	%    -->  procent

Acronyms are not spelled out.

Example:

	SEB   -->  SEB
	USA   -->  USA
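
The Case 2 rules can be summarised in a small sketch. It assumes that the annotator has already decided whether a name is genitive, whether a foreign word is an English plural, and whether a token is an acronym; these decisions are passed in as flags because they cannot be read off the word form (a name such as "Hans" ends in -s without being genitive). The function name, the flags and the abbreviation table are hypothetical, and the table contains only the two examples given above.

```python
from typing import Optional

# Only the two abbreviation examples from the guidelines; a real list would be longer.
ABBREVIATIONS = {"kl": "klocka", "%": "procent"}


def fallback_lemma(
    token: str,
    pos: str,
    *,
    is_genitive: bool = False,
    is_english_plural: bool = False,
    is_acronym: bool = False,
) -> Optional[str]:
    """Suggest a lemma when Swetwol provides none; None means 'annotator decides'."""
    if is_acronym:
        return token  # acronyms are not spelled out
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]  # lemmatise the full word
    if pos == "NE":
        if is_genitive and token.endswith("s"):
            return token[:-1]  # strip the genitive -s
        return token
    if pos == "FM":
        if is_english_plural and token.endswith("s"):
            return token[:-1]  # strip the English plural -s
        return token
    return None
```

For instance, `fallback_lemma("IBMs", "NE", is_genitive=True)` gives `IBM`, while `fallback_lemma("Hans", "NE")` leaves the name unchanged.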

Deviations from the Swetwol Suggestions

If the token is an elliptical compound, then the full compound is taken as the basis for lemmatisation.

Example:

	lång- och kortfristiga               -->  lång#fristig och kort#fristig
	kapital- och likviditetsfrågor       -->  kapital#fråga och likviditet\s#fråga
	lågspänningsbrytare och -omkopplare  -->  låg#spänning\s#brytare och låg#spänning\s#omkopplare
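
The expansion of elliptical compounds can be illustrated with a short sketch that works on lemmas rather than word forms. It assumes that the complete conjunct has already been lemmatised and segmented, that the missing head is the last segment of that lemma, and that a missing modifier is everything before the last segment. The function name `expand_elliptical` is hypothetical.

```python
def expand_elliptical(fragment: str, full_lemma: str) -> str:
    """Complete an elliptical fragment using the segmented lemma of its partner.

    'lång-'       + 'kort#fristig'            -> 'lång#fristig'
    '-omkopplare' + 'låg#spänning\\s#brytare' -> 'låg#spänning\\s#omkopplare'
    """
    segments = full_lemma.split("#")
    if len(segments) < 2:
        raise ValueError("the complete conjunct must be a segmented compound")
    if fragment.endswith("-"):
        # The head is missing: borrow the last segment of the full lemma.
        return fragment[:-1] + "#" + segments[-1]
    if fragment.startswith("-"):
        # The modifier is missing: borrow everything before the head.
        return "#".join(segments[:-1]) + "#" + fragment[1:]
    raise ValueError("fragment must start or end with a hyphen")
```

The sketch does not handle a linking 's' on the fragment itself (a fragment written with a final -s before the hyphen would first have to be rewritten with \s by the annotator).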

For the lemmatisation of determiners and the pronouns with the same word forms we follow the SUC (Stockholm Umeå Corpus) conventions.

	Word form   PoS             Lemma
	de          Determiner DT   den
	de          Pronoun PN      de
	den         Determiner DT   den
	den         Pronoun PN      den
	dem         Pronoun PN      de
	det         Determiner DT   den
	det         Pronoun PN      det
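
Since this mapping is a closed list, it can be expressed as a lookup keyed on word form and part of speech. The following sketch uses hypothetical names (`DET_PRON_LEMMAS`, `determiner_lemma`) and assumes that sentence-initial capitalisation should be ignored.

```python
# (word form, PoS) -> lemma, exactly as listed in the table above.
DET_PRON_LEMMAS = {
    ("de", "DT"): "den",
    ("de", "PN"): "de",
    ("den", "DT"): "den",
    ("den", "PN"): "den",
    ("dem", "PN"): "de",
    ("det", "DT"): "den",
    ("det", "PN"): "det",
}


def determiner_lemma(word_form: str, pos: str):
    """Return the lemma for a listed determiner or pronoun, else None."""
    return DET_PRON_LEMMAS.get((word_form.lower(), pos))
```

Keying on the pair rather than on the word form alone reflects that the lemma depends on the part of speech: `det` is lemmatised as `den` when it is a determiner, but as `det` when it is a pronoun.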

Type Information

In order to subclassify foreign words and names we assign the following labels.

Each foreign word (PoS=FM) gets a label specifying its language. The label is the two-letter ISO 639-1 language code, written in upper case.

Example:

	Board  -->  Board   EN
	Crédit -->  Crédit  FR

Gender Information

Each noun (PoS=NN) gets a label specifying its grammatical gender. We use the following labels.

  • UTR - utrum
  • NEU - neuter
  • NONE - none

Example:

	utrustning    -->  utrustning   UTR
	antalet       -->  antal        NEU

Foreign words (PoS=UO) and names (PoS=PM) do not get a gender label.
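
The labelling rules for language and gender can be checked mechanically. The sketch below is an assumption-laden illustration: the record type `LemmaAnnotation` and the function `label_errors` are hypothetical, and because the guidelines refer to the foreign-word tag both as FM and as UO, the sketch accepts either.

```python
from dataclasses import dataclass
from typing import Optional

GENDER_LABELS = {"UTR", "NEU", "NONE"}
FOREIGN_TAGS = {"FM", "UO"}  # both tags occur in the guidelines


@dataclass
class LemmaAnnotation:
    token: str
    pos: str
    lemma: str
    gender: Optional[str] = None    # only for nouns (PoS=NN)
    language: Optional[str] = None  # only for foreign words


def label_errors(a: LemmaAnnotation) -> list[str]:
    """Return a list of violations of the labelling rules (empty if none)."""
    errors = []
    if a.pos == "NN":
        if a.gender not in GENDER_LABELS:
            errors.append("noun needs a gender label: UTR, NEU or NONE")
    elif a.gender is not None:
        errors.append("only nouns get a gender label")
    if a.pos in FOREIGN_TAGS:
        code = a.language or ""
        if not (len(code) == 2 and code.isalpha() and code.isupper()):
            errors.append("foreign word needs a two-letter language code")
    elif a.language is not None:
        errors.append("only foreign words get a language label")
    return errors
```

The check only tests the shape of the language code, not whether it is a registered ISO code.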

Misspelled Words

If a word is misspelled, then the word form itself is left uncorrected, in keeping with the principle of faithfulness to the original text. The lemma, however, is based on the (imagined) corrected word.

Example:

	avfallshateringsregler  -->  avfall\s#hantering\s#regel