

Portrait and Research Interests
I am Titulary Professor and postdoc staff member in Computational Linguistics (CL) at the Department of Computational Linguistics, head of the NLP group in Linguistics Research Infrastructure (LiRI Tech NLP) and of the Text Crunching Center (TCC), which offers computational linguistics services to the University and other partners.
I am a senior researcher in the URPP Digital Religion(s), in Project 8, where we advance hate speech detection tools, detect intolerance and apply content analysis methods to important social and religious issues.
I have been a senior lecturer and computing scientist (wissenschaftlicher Informatiker) at the English Department of the University of Zurich (Gerold Schneider's homepage at the English Department).
My research interests include corpus linguistics, semantic mining, automated media content analysis, cognitive linguistics, digital humanities, robust parsing, syntax, and formal grammar.
I am involved in research on automated media content analysis, and on Text Mining in the biomedical and many other domains. I am also doing research on Digital Humanities, learner language, variationist linguistics (genre, regions, contrastive, typology), and statistical methods.
I have published over 130 peer-reviewed articles and a coursebook on statistics.
In the winter term 2017/18 I worked as Substituting Professor for German Linguistics at TU Dortmund University.
From 2015 to 2017 I worked at the linguistics department of the University of Konstanz, substituting for Prof. Dr. Miriam Butt as Professor of Computational and General Linguistics.
Selected articles in bibliographical databases can be downloaded from ZORA or from my Google Scholar profile.
I co-supervise the following doctoral theses: Michi Amsler, Peter Makarov, Janis Goldzycher, Maud Reveilhac.
I have written my cumulative habilitation on using computational linguistics methods for descriptive linguistics, text mining and psycholinguistics.
I have written a low-complexity, broad-coverage probabilistic Dependency Parser for English, as a part of
I have also ported it to German, together with Rico Sennrich.
My Recent Publications related to the Department of Computational Linguistics (ZORA)
-
Investigating Linguistic Abilities of LLMs for Native Language Identification. In: Proceedings of the 14th Workshop on NLP for Computer Assisted Language Learning, Tallinn, Estonia, 5 March 2025.
-
Digital Dickens: An automated content analysis of Charles Dickens’ novels. In: Buschfeld, Sarah; Ronan, Patricia; Neumaier, Theresa; Wellinghoff, Andreas; Westermayer, Lisa. Crossing Boundaries through Corpora: Innovative corpus approaches within and beyond linguistics. Amsterdam: John Benjamins Publishing, 62-98.
-
Automatically detecting directives with SPICE Ireland. In: Schweinberger, Martin; Ronan, Patricia. Socio-Pragmatic Variation in Ireland: Using Pragmatic Variation to Construct Social Identities. Berlin: De Gruyter, 205-234.
-
Evaluating Transformers on the Ethical Question of Euthanasia. In: SwissText 2024, Chur, Switzerland, 10 June 2024 - 11 June 2024, 241-246.
-
Text Analytics for Corpus Linguistics and Digital Humanities: Simple R Scripts and Tools. London: Bloomsbury Academic.
-
The Visualisation and Evaluation of Semantic and Conceptual Maps. In: Laitinen, Mikko; Tyrkkö, Jukka. Linguistics across Disciplinary Borders: The March of Data. London: Bloomsbury Publishing, 67-94.
-
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 1 June 2024. Association for Computational Linguistics, 4405-4424.
-
Native Language Identification Improves Authorship Attribution. In: Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024), Trento, Italy, 2024. Association for Computational Linguistics, 289-296.
-
Investigating child language acquisition from a joint perspective: A comparison of traditional and new L1 speakers of English. In: Schmalz, Mirjam; Vida-Mannl, Manuela; Buschfeld, Sarah. Acquisition and Variation in World Englishes: Bridging Paradigms and Rethinking Approaches. Berlin: De Gruyter, 133-157.
-
Turkish Native Language Identification. In: 6th International Conference on Natural Language and Speech Processing (ICNLSP-2023), virtual, 16 December 2023 - 17 December 2023, 303-307.
-
Exploring Hybrid Linguistic Features for Turkish Text Readability. In: 6th International Conference on Natural Language and Speech Processing (ICNLSP-2023), virtual, 16 December 2023 - 17 December 2023, 223-232.
-
The LiRI Corpus Platform. In: CLARIN Annual Conference 2023, Leuven, Belgium, 16 October 2023 - 18 October 2023. CLARIN ERIC, 145-149.
-
“To boldly go where no man has gone before”: how iconic is the Star Trek split infinitive? Linguistics Vanguard, 9(s3):247-255.
-
Exploring the role of AI in classifying, analyzing, and generating case reports on assisted suicide cases: feasibility and ethical implications. Frontiers in Artificial Intelligence, 6:1328865.
-
Colloquialisation, compression and democratisation in British parliamentary debates. In: Korhonen, Minna; Kotze, Haidee; Tyrkkö, Jukka. Exploring Language and Society with Big Data: Parliamentary discourse across time and space. Amsterdam: John Benjamins Publishing, 336-372.
-
Swissdox@LiRI – a large database of media articles made accessible to researchers. In: CLARIN Annual Conference 2023, Leuven, 16 October 2023 - 18 October 2023. CLARIN ERIC, 111-115.
-
Differences in syntactic annotation affect retrieval. International Journal of Corpus Linguistics, 28(3):378-406.
-
Evaluating the Effectiveness of Natural Language Inference for Hate Speech Detection in Languages with Limited Labeled Data. In: The 7th Workshop on Online Abuse and Harms (WOAH), Toronto, Canada, 13 July 2023. Association for Computational Linguistics, 187-201.
-
Detecting and Analysing Learner Difficulties Using a Learner Corpus Without Error Tagging. In: Harrington, Kieran; Ronan, Patricia. Demystifying Corpus Linguistics for English Language Teaching. Cham: Palgrave Macmillan, 229-257.
-
Replicable semi-supervised approaches to state-of-the-art stance detection of tweets. Information Processing & Management, 60(2):103199.
-
Do Non-native Speakers Read Differently? Predicting Reading Times with Surprisal and Language Models of Native and Non-native Eye Tracking Data. In: Busse, Beatrix; Dumrukcic, Nina; Kleiber, Ingo. Language and Linguistics in a Complex World. Berlin: De Gruyter, 153-188.
-
Scaling Native Language Identification with Transformer Adapters. In: 5th International Conference on Natural Language and Speech Processing (ICNLSP), Trento, 16 December 2022 - 17 December 2022, Cornell University.
-
Complementing Kernel Density Estimation and Topic Modelling to Visualise Political Discourse. In: Digital Research Data and Human Sciences DRDHum Conference 2022, Jyväskylä, Finland, 1 December 2022 - 3 December 2022. University of Jyväskylä, 12-27.
-
Assessing How Attitudes to Migration in Social Media Complement Public Attitudes Found in Opinion Surveys. SPELL: Swiss Papers in English Language and Literature, 41:119-153.
-
Systematically Detecting Patterns of Social, Historical and Linguistic Change: The Framing of Poverty in Times of Poverty. Transactions of the Philological Society, 120(3):447-473.
-
Hypothesis Engineering for Zero-Shot Hate Speech Detection. In: Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), Gyeongju, Republic of Korea, 12 October 2022 - 17 October 2022. ACL, 75-90.
-
Comparing the coverage of the “marriage for all” vote on Twitter and in the newspapers. In: 2nd Workshop on Computational Linguistics for Political Text Analysis (CPSS-2022), Potsdam, Germany, 12 September 2022. CPSS, 55-62.
-
Correlations and predictions of reading times using language models and surprisal. In: Krug, Manfred; Schützler, Ole; Vetter, Fabian; Werner, Valentin. Perspectives on Contemporary English : Structure, Variation, Cognition. Berlin, Bern, Bruxelles, New York, Oxford, Warszawa, Wien: Peter Lang, 209-243.
-
Medical topics and style from 1500 to 2018. In: Hiltunen, Turo; Taavitsainen, Irma. Corpus pragmatic studies on the history of medical discourse. Amsterdam: Benjamins, 49-78.
-
Recent changes in spoken British English according to spoken BNC2014. In: Flach, Susanne; Hilpert, Martin. Broadening the spectrum of corpus linguistics: New approaches to variability and change. Amsterdam: John Benjamins Publishing, 173-195.
-
Measuring Attitudes to Migration in the Media automatically with Complementary Data Sources and Methods. In: Ronan, Patricia; Ziegler, Evelyn. Approaches to Migration and Language Identity. Oxford, Bern, Berlin, Bruxelles, New York, Wien: Peter Lang, 207-252.
-
Comparing data-driven to corpus-based approaches for diachronic variation: document-classification and overuse metrics. In: Schlüter, Julia; Schützler, Ole. Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge: Cambridge University Press, 291-322.
-
Syntactic changes in verbal clauses and noun phrases from 1500 onwards. In: Los, Bettelou; Cowie, Claire; Honeybone, Patrick. English Historical Linguistics: Change in Structure and Meaning. Amsterdam: John Benjamins Publishing, 163-200.
-
Challenges and best practices for digital unstructured data enrichment in health research: a systematic narrative review. medRxiv 22278137, Cold Spring Harbor Laboratory.
-
With a little help from familiar interlocutors: real-world language use in young and older adults. Aging & Mental Health, 25(12):2310-2319.
-
Linear and Non-Linear Age Trajectories of Language Use: A Laboratory Observation Study of Couples' Conflict Conversations. Journals of Gerontology, Series B: Psychological Sciences and Social Sciences, 75(9):e206-e214.
-
Changes in society and language: charting poverty. In: Rautinaho, Paula; Nurmi, Arja; Klemola, Juhani. Corpora and the changing society: studies in the evolution of English. Amsterdam: John Benjamins Publishing, 29-56.
-
Using Multilingual Resources to Evaluate CEFRLex for Learner Applications. In: 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, 11 May 2020 - 16 May 2020. European Language Resources Association, 346-355.
-
Spelling normalisation of Late Modern English: comparison and combination of VARD and character-based statistical machine translation. In: Kytö, Merja; Smitterberg, Eric. Late Modern English: novel encounters. Amsterdam: John Benjamins Publishing, 243-268.
-
A Man who Was Just an Incredible Man, an Incredible Man: Age Factors and Coherence in Donald Trump’s Spontaneous Speech. In: Schneider, Ulrike; Eitelmann, Matthias. Linguistic Inquiries into Donald Trump’s Language : From ‘Fake News’ to ‘Tremendous Success’. London: Bloomsbury, 62-84.
-
Statistics for Linguists: A patient, slow-paced introduction to statistics and to the programming language R. Zurich: Digitale Lehre und Forschung UZH.
-
Cognitive Aging Effects on Language Use in Real-Life Contexts: A Naturalistic Observation Study. In: The 41st Annual Meeting of the Cognitive Science Society, Montreal, QC, 24 July 2019 - 27 July 2019, CogSci.
-
Topics of eighteenth-century medical writing with triangulation of methods: LMEMT and the underlying reality. In: Taavitsainen, Irma; Hiltunen, Turo. Late Modern English medical texts: writing medicine in the eighteenth century (Including the LMEMT Corpus). Amsterdam: John Benjamins Publishing, 31-74.
-
Statistical MWE-aware parsing. In: Parmentier, Yannick; Waszczuk, Jakub. Representation and parsing of multiword expressions: current trends. Berlin: Language Science Press, 147-182.
-
Scholastic argumentation in Early English medical writing and its afterlife: new corpus evidence. In: Suhr, Carla; Nevalainen, Terttu; Taavitsainen, Irma. From data to evidence in English language research. Leiden: Brill, 191-221.
-
NLP Corpus Observatory – Looking for Constellations in Parallel Corpora to Improve Learners’ Collocational Skills. In: 7th Workshop on NLP for Computer Assisted Language Learning at SLTC 2018 (NLP4CALL 2018), Stockholm, 7 November 2018 - 7 November 2018, 69-78.
-
Detecting innovations in a parsed corpus of learner English. In: Deshors, Sandra C.; Götz, Sandra; Laporte, Samantha. Rethinking linguistic creativity in non-native Englishes. Amsterdam: John Benjamins Publishing, 47-74.
-
Differences between Swiss High German and German High German via data-driven methods. In: 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, 12 June 2018 - 13 June 2018. CEUR-WS, 17-25.
-
Differences between Swiss High German and German German via data-driven methods. In: SwissText 2018: 3rd Swiss Text Analytics Conference, Winterthur, 12 June 2018 - 13 June 2018.
-
From Lexical Bundles to Surprisal and Language Models: measuring the idiom principle on native and learner language. In: Kopaczyk, Joanna; Tyrkkö, Jukka. Applications of Pattern-driven Methods in Corpus Linguistics. Amsterdam: Benjamins, 15-56.
-
Tools and Methods for Processing and Visualizing Large Corpora. Studies in Variation, Contacts and Change in English, 19:online.
-
Measuring Encoding Efficiency in Swedish and English Language Learner Speech Production. In: Interspeech 2017, Stockholm, 19 August 2017 - 24 August 2017. ISCA, 1779-1783.
-
Saying Whatever It Takes: Creating and Analyzing Corpora from US Presidential Debate Transcripts. In: Corpus Linguistics Conference 2017, Birmingham, 25 July 2017 - 28 July 2017, 537-544.
-
Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts. In: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, Gothenburg, 22 May 2017, 40-46.
-
Crossing the Border Twice: Reimporting Prepositions to Alleviate L1-Specific Transfer Errors. In: Joint 6th Workshop on NLP for Computer Assisted Language Learning and 2nd Workshop on NLP for Research on Language Acquisition, Gothenburg, 22 May 2017. Linköping University Electronic Press, 18-26.
-
Statistical sequence and parsing models for descriptive linguistics and psycholinguistics. In: Timofeeva, Olga; Chevalier, Sarah; Gardner, Anne-Christine; Honkapohja, Alpo. New Approaches in English Linguistics : Building Bridges. Amsterdam: John Benjamins Publishing, 281-320.
-
Part-Of-Speech in Historical Corpora: Tagger Evaluation and Ensemble Systems on ARCHER. In: KONVENS 2016, Bochum, 19 September 2016 - 21 September 2016, RUB.
-
Detecting innovations in a parsed corpus of learner English. International Journal of Learner Corpus Research, 2(2):177-204.
-
Determining light verb constructions in contemporary British and Irish English. International Journal of Corpus Linguistics, 20(3):326-354.
-
Review of Automatic Treatment of Learner Corpus Data, Ana Diaz Negrillo, Nicolas Ballier and Paul Thompson, eds. (2013). International Journal of Learner Corpus Research, (1):172-177.
-
Parsing early and late modern English corpora. Literary and Linguistic Computing, 30(3):423-439.
-
Automated Media Content Analysis from the Perspective of Computational Linguistics. In: Sommer, Katharina; Wettstein, Martin; Wirth, Werner; Matthes, Jörg. Automatisierung in der Inhaltsanalyse. Köln: Herbert von Halem Verlag, 40-54.
-
Measuring the Public Accountability of New Modes of Governance. In: ACL Workshop on Language Technology and Computational Social Science, Baltimore, Maryland, USA, 24 June 2014 - 26 June 2014.
-
Measuring the public accountability of new modes of governance. In: ACL Workshop on Language Technologies and Computational Social Science, Baltimore, MD, USA, 26 June 2014 - 26 June 2014, 38-43.
-
Applying Computational Linguistics and Language Models: From Descriptive Linguistics to Text Mining and Psycholinguistics. 2014, University of Zurich, Philosophische Fakultät.
-
ODIN: a customizable literature curation tool. In: Fourth BioCreative Challenge Evaluation Workshop, Bethesda, MD, US, 7 October 2013 - 9 October 2013, 219-223.
-
Of-genitive versus s-genitive: A corpus-based analysis of possessive constructions in 20th-century English. In: Bennett, Paul; Durrell, Martin; Scheible, Silke; Whitt, Richard J. New Methods in Historical Corpora. Tübingen: Narr Verlag, 163-180.
-
Exploiting Synergies Between Open Resources for German Dependency Parsing, POS-tagging, and Morphological Analysis. In: Recent Advances in Natural Language Processing (RANLP 2013), Hissar, Bulgaria, 7 September 2013 - 13 September 2013, 601-609.
-
UZH in BioNLP 2013. In: Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria, 9 August 2013, 116-120.
-
Investigating Irish English With ICE-Ireland. Cahiers de l'institut de linguistique et des sciences du langage, 38(2013):137-162.
-
Using the OntoGene pipeline for the triage task of BioCreative 2012. Database, 2013:bas053.
-
Notes about the OntoGene pipeline. In: AAAI-2012 Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, Arlington, Virginia, USA, 2 November 2012 - 4 November 2012.
-
Using syntax features and document discourse for relation extraction on PharmGKB and CTD. In: SMBM 2012, Zurich, Switzerland, 3 September 2012 - 4 September 2012, 52-57.
-
Dependency parsing for interaction detection in pharmacogenomics. In: LREC 2012: The eighth international conference on Language Resources and Evaluation, Istanbul, 21 May 2012 - 25 May 2012.
-
Dependency bank. In: LREC 2012 Conference Workshop "Challenges in the Management of Large Corpora", Istanbul, Turkey, 22 May 2012 - 22 May 2012, 23-28.
-
Using semantic resources to improve a syntactic dependency parser. In: LREC 2012 Conference Workshop "Semantic Relations II", Istanbul, Turkey, 22 May 2012 - 22 May 2012, 67-76.
-
Adapting a parser to historical English. Helsinki: University of Helsinki.
-
Relation Mining Experiments in the Pharmacogenomics Domain. Journal of Biomedical Informatics, 45(5):851-861.
-
Using automatically parsed corpora to discover lexico-grammatical features of English varieties. In: 30th International Conference on Lexis and Grammar, Nicosia, Cyprus, 5 October 2011 - 8 October 2011, 251-258.
-
Detection of interaction articles and experimental methods in biomedical literature. BMC Bioinformatics, 12(Suppl 8):S13.
-
Text-Mining-Methoden im Semantic Web. Wirtschaftsinformatik und Management, 3:28-35.
-
An incremental model for the coreference resolution task of BioNLP 2011. In: BioNLP 2011, Portland, Oregon, USA, 23 June 2011 - 24 June 2011. Association for Computational Linguistics (ACL), 151-152.
-
A large-scale investigation of verb-attached prepositional phrases. Helsinki: University of Helsinki.
-
A data-driven approach to alternations based on protein-protein interactions. In: III Congreso Internacional de Lingüística de Corpus, Valencia, Spain, 7 April 2011 - 9 April 2011, 597-607.
-
OntoGene at CALBC II and Some Thoughts on the Need of Document-Wide Harmonization. In: Second CALBC Workshop, Hinxton, Cambridgeshire, UK, 16 March 2011 - 18 March 2011, 48-51.
-
Mining complex Drug/Gene/Disease relations. In: Pacific Symposium on Biocomputing Workshop "Mining the Pharmacogenomics Literature", Hawaii, 3 January 2011 - 7 January 2011.
-
The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12(Suppl 8):S3.
-
ODIN: an advanced interface for the curation of biomedical literature. Nature Precedings:online.
-
OntoGene (Team 65): preliminary analysis of participation in BioCreative III. In: BioCreative III workshop, Bethesda, Maryland, 13 September 2010 - 15 September 2010.
-
OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):472-480.
-
OntoGene in CALBC. In: First CALBC Workshop, Hinxton, Cambridgeshire, UK, 17 June 2010 - 18 June 2010, 30-31.
-
Text Mining Methoden im Semantic Web. HMD Praxis der Wirtschaftsinformatik, (271):35-46.
-
Effective Mining of Protein Interactions. In: Third international symposium on languages in biology and medicine (LBM 2009), Jeju Island, South Korea, 8 November 2009 - 10 November 2009, 115-118.
-
Using a parser as a heuristic tool for the description of New Englishes. In: The Fifth Corpus Linguistics Conference, Liverpool, UK, 20 July 2009 - 23 July 2009, online.
-
Using existing biomedical resources to detect and ground terms in biomedical literature. In: Combi, C; Shahar, Y; Abu-Hanna, A. Artificial Intelligence in Medicine: 12th Conference on Artificial Intelligence in Medicine, AIME 2009, Verona, Italy, July 18-22, 2009. Proceedings. Berlin: Springer, 225-234.
-
UZurich in the BioNLP 2009 Shared Task. In: BioNLP 2009 Companion Volume: Shared Task on Event Extraction, NAACL/HLT, Boulder, Colorado, 4 June 2009 - 5 June 2009, 28-36.
-
Detecting protein-protein interactions in biomedical texts using a parser and linguistic resources. In: Gelbukh, Alexander. Computational Linguistics and Intelligent Text Processing. Berlin: Springer, 406-417.
-
A New Hybrid Dependency Parser for German. In: Chiarcos, Christian; de Castilho, Richard Eckart; Stede, Manfred. Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Proceedings of the Biennial GSCL Conference 2009. Tübingen: Narr, 115-124.
-
Detecting and grounding terms in biomedical literature. Advances in Computational Linguistics, 41:15-26.
-
Parser-based analysis of syntax-lexis interactions. In: Jucker, Andreas H; Schreier, Daniel; Hundt, Marianne. Corpora: Pragmatics and Discourse. Amsterdam, The Netherlands: Rodopi, 477-502.
-
Detecting Protein-Protein Interactions in Biomedical Literature Using a Parser. In: Clematide, Simon; Klenner, Manfred; Volk, Martin. Searching Answers. Münster: MV Verlag, 109-118.
-
A framework for constituent-dependency conversion. In: 8th Conference on Treebanks and Linguistic Theories, Milano, 2009.
-
Towards automatic detection of experimental methods from biomedical literature. In: Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, 1 September 2008 - 3 September 2008, 61-68.
-
OntoGene in BioCreative II. Genome Biology, 9(Suppl 2):S13.
-
Dependency-based relation mining for biomedical literature. In: The 6th edition of the Language Resources and Evaluation Conference (LREC 2008), Marrakech, Morocco, 28 May 2008 - 30 May 2008, 1-6.
-
Tools for detection of protein interactions in biomedical literature. In: Genomes to Systems Conference 2008, Manchester, UK, 17 March 2008 - 19 March 2008.
-
Hybrid long-distance functional dependency parsing. 2008, University of Zurich, Faculty of Arts.
-
Pro3Gres parser in the CoNLL domain adaptation shared task. In: ACL Conference, Workshop on Computational Natural Language Learning (CoNLL-XI) Shared Task, Prague, June 2007, 1161-1165.
-
OntoGene in Biocreative II. In: Second BioCreative Challenge Evaluation Workshop, Madrid, Spain, 23 April 2007 - 25 April 2007.
-
Mining of functional relations between genes and proteins over biomedical scientific literature using a deep-linguistic approach. Artificial Intelligence in Medicine, 39(2):127-136.
-
An environment for relation mining over richly annotated corpora: the case of GENIA. BMC Bioinformatics, 7(Suppl 3):S3.
-
Discourse representation structures for ACE 5. ifi Technical Reports ifi2006.10, University of Zurich.
-
Extended discourse representation structures in Attempto Controlled English. ifi Technical Reports ifi2006.07, University of Zurich.
-
Tools for text mining over biomedical literature. In: ECAI2006, Riva del Garda, Italy, 2006, 825-826.
-
Relation mining over a corpus of scientific literature. In: 10th Conference on Artificial Intelligence in Medicine, AIME 2005, Aberdeen, Scotland, 23 July 2005 - 27 July 2005, 535-544.
-
A Broad-Coverage, Representationally Minimalist LFG Parser: Chunks and F-Structures Are Enough. In: LFG05, Bergen, Norway, 18 July 2005 - 20 July 2005.
-
Attempto Controlled English: a knowledge representation language readable by humans and machines. In: Reasoning Web, First International Summer School 2005, Msida, Malta, July 2005, 213-250.
-
Closing the gap: cognitively adequate, fast broad-coverage grammatical role parsing. In: 2nd International Workshop on Natural Language and Cognitive Science (NLUCS-2005), Miami, USA, May 2005, 178-184.
-
Extended discourse representation structures in Attempto Controlled English. ifi Technical Reports ifi-2005.8, University of Zurich.
-
Using Distributional Similarity to Organise BioMedical Terminology. Terminology, 11(1):3-4.
-
Exploiting technical terminology for knowledge management. In: Ontology Learning from Text: Methods, Evaluation and Applications, Amsterdam: IOS Press (Frontiers in artificial intelligence and applications, edited by J. Breuker et al., volume 123), 2005 - 2005, 140-154.
-
Mining relations in the GENIA corpus. In: Second European Workshop on Data Mining and Text Mining for Bioinformatics, Pisa, Italy, September 2004 - September 2004, 61-68.
-
A robust and hybrid deep-linguistic theory applied to large scale parsing. In: COLING-2004 Robust Methods in Analysis of Natural language Data, Geneva, Switzerland, August 2004 - August 2004, 14-23.
-
Answering questions in the genomics domain. In: ACL-2004 workshop on Question Answering in Restricted Domains, Barcelona, Spain, July 2004, 46-53.
-
Terminology expansion and relation identification between genes and pathways. In: Workshop on Terminology, Ontology and Knowledge Representation, Université Jean Moulin (Lyon 3), January 2004, 61-68.
-
Combining shallow and deep processing for a robust, fast, deep-linguistic dependency parser. In: European Summer School in Logic, Language and Information ESSLLI 2004, Nancy, France, 2004 - 2004, 41-50.
-
Fast, deep-linguistic statistical minimalist dependency parsing. In: COLING-2004 Recent Advances in Dependency Grammars, Geneva, Switzerland, 2004 - 2004, 33-40.
-
Steps towards a GENIA dependency treebank. In: Third Workshop on Treebanks and Linguistic Theories (TLT) 2004, Tübingen, Germany, 2004 - 2004, 137-149.
-
A low-complexity, broad-coverage probabilistic Dependency Parser for English. In: NAACL/HLT 2003 Student session, Edmonton, Canada, May 2003.
-
Extracting and using trace-free functional dependencies from the Penn Treebank to reduce parsing complexity. In: Treebanks and Linguistic Theories (TLT) 2003, Växjö, Sweden, 2003, 153-164.
-
Learning to Disambiguate Syntactic Relations. Linguistik Online: Learning and teaching (in) Computational Linguistics, 17(5):117-136.
-
Answer extraction in technical domains. In: Computational Linguistics and Intelligent Text Processing. Lecture Notes in Computer Science. VOL. 2276, Mexico City, Mexico, February 2002, 165-177.
-
Inkrementelle minimale logische Formen für die Antwortextraktion. In: 34th Linguistic Colloquium, University of Mainz, FASK, Mainz, Germany, 7 September 2000 - 10 September 2000, 7-10.
-
Answer extraction using a dependency grammar in ExtrAns. TAL, 41(1):127-156.
-
Adding manual constraints and lexical look-up to a brill-tagger for German. In: ESSLLI-98 Workshop on Recent Advances in Corpus Annotation, Saarbrücken, 1998.
-
Comparing a statistical and a rule-based tagger for German. In: Proc. of KONVENS-98, Bonn, 1998.
My Recent Publications related to the English Department (ZORA)
-
Digital Dickens: An automated content analysis of Charles Dickens’ novels. In: Buschfeld, Sarah; Ronan, Patricia; Neumaier, Theresa; Wellinghoff, Andreas; Westermayer, Lisa. Crossing Boundaries through Corpora: Innovative corpus approaches within and beyond linguistics. Amsterdam: John Benjamins Publishing, 62-98.
-
Automatically detecting directives with SPICE Ireland. In: Schweinberger, Martin; Ronan, Patricia. Socio-Pragmatic Variation in Ireland: Using Pragmatic Variation to Construct Social Identities. Berlin: De Gruyter, 205-234.
-
Text Analytics for Corpus Linguistics and Digital Humanities: Simple R Scripts and Tools. London: Bloomsbury Academic.
-
The Visualisation and Evaluation of Semantic and Conceptual Maps. In: Laitinen, Mikko; Tyrkkö, Jukka. Linguistics across Disciplinary Borders: The March of Data. London: Bloomsbury Publishing, 67-94.
-
Investigating child language acquisition from a joint perspective: A comparison of traditional and new L1 speakers of English. In: Schmalz, Mirjam; Vida-Mannl, Manuela; Buschfeld, Sarah. Acquisition and Variation in World Englishes: Bridging Paradigms and Rethinking Approaches. Berlin: De Gruyter, 133-157.
-
“To boldly go where no man has gone before”: how iconic is the Star Trek split infinitive? Linguistics Vanguard, 9(s3):247-255.
-
Colloquialisation, compression and democratisation in British parliamentary debates. In: Korhonen, Minna; Kotze, Haidee; Tyrkkö, Jukka. Exploring Language and Society with Big Data: Parliamentary discourse across time and space. Amsterdam: John Benjamins Publishing, 336-372.
-
Differences in syntactic annotation affect retrieval. International Journal of Corpus Linguistics, 28(3):378-406.
-
Detecting and Analysing Learner Difficulties Using a Learner Corpus Without Error Tagging. In: Harrington, Kieran; Ronan, Patricia. Demystifying Corpus Linguistics for English Language Teaching. Cham: Palgrave Macmillan, 229-257.
-
Replicable semi-supervised approaches to state-of-the-art stance detection of tweets. Information Processing & Management, 60(2):103199.
-
Assessing How Attitudes to Migration in Social Media Complement Public Attitudes Found in Opinion Surveys. SPELL: Swiss Papers in English Language and Literature, 41:119-153.
-
Systematically Detecting Patterns of Social, Historical and Linguistic Change: The Framing of Poverty in Times of Poverty. Transactions of the Philological Society, 120(3):447-473.
-
Medical topics and style from 1500 to 2018. In: Hiltunen, Turo; Taavitsainen, Irma. Corpus pragmatic studies on the history of medical discourse. Amsterdam: Benjamins, 49-78.
-
Recent changes in spoken British English according to spoken BNC2014. In: Flach, Susanne; Hilpert, Martin. Broadening the spectrum of corpus linguistics: New approaches to variability and change. Amsterdam: John Benjamins Publishing, 173-195.
-
Measuring Attitudes to Migration in the Media automatically with Complementary Data Sources and Methods. In: Ronan, Patricia; Ziegler, Evelyn. Approaches to Migration and Language Identity. Oxford, Bern, Berlin, Bruxelles, New York, Wien: Peter Lang, 207-252.
-
Comparing data-driven to corpus-based approaches for diachronic variation: document-classification and overuse metrics. In: Schlüter, Julia; Schützler, Ole. Data and Methods in Corpus Linguistics: Comparative Approaches. Cambridge: Cambridge University Press, 291-322.
-
Syntactic changes in verbal clauses and noun phrases from 1500 onwards. In: Los, Bettelou; Cowie, Claire; Honeybone, Patrick. English Historical Linguistics: Change in Structure and Meaning. Amsterdam: John Benjamins Publishing, 163-200.
-
With a little help from familiar interlocutors: real-world language use in young and older adults. Aging & Mental Health, 25(12):2310-2319.
-
Pluralized non-count nouns across Englishes: a corpus-linguistic approach to dialect typology. Corpus Linguistics and Linguistic Theory, 16(3):515-546.
-
Linear and Non-Linear Age Trajectories of Language Use: A Laboratory Observation Study of Couples' Conflict Conversations. Journals of Gerontology, Series B: Psychological Sciences and Social Sciences, 75(9):e206-e214.
-
Changes in society and language: charting poverty. In: Rautinaho, Paula; Nurmi, Arja; Klemola, Juhani. Corpora and the changing society: studies in the evolution of English. Amsterdam: John Benjamins Publishing, 29-56.
-
Using Multilingual Resources to Evaluate CEFRLex for Learner Applications. In: 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, 11 May 2020 - 16 May 2020. European Language Resources Association, 346-355.
-
Spelling normalisation of Late Modern English: comparison and combination of VARD and character-based statistical machine translation. In: Kytö, Merja; Smitterberg, Eric. Late Modern English: novel encounters. Amsterdam: John Benjamins Publishing, 243-268.
-
A Man who Was Just an Incredible Man, an Incredible Man: Age Factors and Coherence in Donald Trump’s Spontaneous Speech. In: Schneider, Ulrike; Eitelmann, Matthias. Linguistic Inquiries into Donald Trump’s Language : From ‘Fake News’ to ‘Tremendous Success’. London: Bloomsbury, 62-84.
-
Statistics for Linguists: A patient, slow-paced introduction to statistics and to the programming language R. Zurich: Digitale Lehre und Forschung UZH.
-
Enhancing the linguistic discovery potential of historical corpora: a twin-track approach using ARCHER. In: CL 2019 International Corpus Linguistics Conference, Cardiff, Wales, UK, 22 July 2019 - 26 July 2019, Gossip Theme.
-
Topics of eighteenth-century medical writing with triangulation of methods: LMEMT and the underlying reality. In: Taavitsainen, Irma; Hiltunen, Turo. Late Modern English medical texts: writing medicine in the eighteenth century (Including the LMEMT Corpus). Amsterdam: John Benjamins Publishing, 31-74.
-
Statistical MWE-aware parsing. In: Parmentier, Yannick; Waszczuk, Jakub. Representation and parsing of multiword expressions: current trends. Berlin: Language Science Press, 147-182.
-
Scholastic argumentation in Early English medical writing and its afterlife: new corpus evidence. In: Suhr, Carla; Nevalainen, Terttu; Taavitsainen, Irma. From data to evidence in English language research. Leiden: Brill, 191-221.
-
NLP Corpus Observatory – Looking for Constellations in Parallel Corpora to Improve Learners’ Collocational Skills. In: 7th Workshop on NLP for Computer Assisted Language Learning at SLTC 2018 (NLP4CALL 2018), Stockholm, 7 November 2018 - 7 November 2018, 69-78.
-
Detecting innovations in a parsed corpus of learner English. In: Deshors, Sandra C.; Götz, Sandra; Laporte, Samantha. Rethinking linguistic creativity in non-native Englishes. Amsterdam: John Benjamins Publishing, 47-74.
-
Differences between Swiss High German and German High German via data-driven methods. In: 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, 12 June 2018 - 13 June 2018. CEUR-WS, 17-25.
-
Differences between Swiss High German and German German via data-driven methods. In: SwissText 2018: 3rd Swiss Text Analytics Conference, Winterthur, 12 June 2018 - 13 June 2018.
-
From Lexical Bundles to Surprisal and Language Models: measuring the idiom principle on native and learner language. In: Kopaczyk, Joanna; Tyrkkö, Jukka. Applications of Pattern-driven Methods in Corpus Linguistics. Amsterdam: Benjamins, 15-56.
-
Tools and Methods for Processing and Visualizing Large Corpora. Studies in Variation, Contacts and Change in English, 19:online.
-
Measuring Encoding Efficiency in Swedish and English Language Learner Speech Production. In: Interspeech 2017, Stockholm, 19 August 2017 - 24 August 2017. ISCA, 1779-1783.
-
Saying Whatever It Takes: Creating and Analyzing Corpora from US Presidential Debate Transcripts. In: Corpus Linguistics Conference 2017, Birmingham, 25 July 2017 - 28 July 2017, 537-544.
-
Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts. In: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, Gothenburg, 22 May 2017 - 22 May 2017, 40-46.
-
Statistical sequence and parsing models for descriptive linguistics and psycholinguistics. In: Timofeeva, Olga; Chevalier, Sarah; Gardner, Anne-Christine; Honkapohja, Alpo. New Approaches in English Linguistics : Building Bridges. Amsterdam: John Benjamins Publishing, 281-320.
-
Introduction - The New Energy Crisis : Climate, Economics and Geopolitics. In: Timofeeva, Olga; Gardner, Anne-Christine; Honkapohja, Alpo; Chevalier, Sarah. New Approaches in English Linguistics : Building Bridges. Amsterdam: Springer, 1-12.
-
Part-Of-Speech in Historical Corpora: Tagger Evaluation and Ensemble Systems on ARCHER. In: KONVENS 2016, Bochum, 19 September 2016 - 21 September 2016, RUB.
-
Detecting innovations in a parsed corpus of learner English. International Journal of Learner Corpus Research, 2(2):177-204.
-
Introduction - New Approaches to English Linguistics : Building bridges. In: Timofeeva, Olga; Gardner, Anne-Christine; Honkapohja, Alpo; Chevalier, Sarah. New Approaches to English Linguistics : Building bridges. Amsterdam: John Benjamins Publishing, 1-12.
-
Determining light verb constructions in contemporary British and Irish English. International Journal of Corpus Linguistics, 20(3):326-354.
-
Review of Automatic Treatment of Learner Corpus Data, Ana Diaz Negrillo, Nicolas Ballier and Paul Thompson, eds. (2013). International Journal of Learner Corpus Research, (1):172-177.
-
Parsing early and late modern English corpora. Literary and Linguistic Computing, 30(3):423-439.
-
Of-genitive versus s-genitive: A corpus-based analysis of possessive constructions in 20th-century English. In: Bennett, Paul; Durrell, Martin; Scheible, Silke; Whitt, Richard J. New Methods in Historical Corpora. Tübingen: Narr Verlag, 163-180.
-
Investigating Irish English With ICE-Ireland. Cahiers de l'institut de linguistique et des sciences du langage, 38(2013):137-162.
-
Discovering new verb-preposition combinations in New Englishes. Studies in Variation, Contacts and Change in English, 13:online.
-
Dependency bank. In: LREC 2012 Conference Workshop "Challenges in the Management of Large Corpora", Istanbul, Turkey, 22 May 2012 - 22 May 2012, 23-28.
-
Using semantic resources to improve a syntactic dependency parser. In: LREC 2012 Conference Workshop "Semantic Relations II", Istanbul, Turkey, 22 May 2012 - 22 May 2012, 67-76.
-
Adapting a parser to historical English. Helsinki: University of Helsinki.
-
BNC Dependency Bank 1.0. In: Oksefjell, Signe; Ebeling, Jarle; Hasselgard, Hilde. Aspects of corpus linguistics: compilation, annotation, analysis. Helsinki: Research Unit for Variation, Contacts, and Change in English, online.
-
Semantic corpus trawling: Expressions of “courtesy” and “politeness” in the Helsinki Corpus. In: Suhr, Carla; Taavitsainen, Irma. Developing Corpus Methodology for Historical Pragmatics. Helsinki: Research Unit for Variation, Contacts and Change in English, 1.
-
Relative complexity in scientific discourse. English Language and Linguistics, 16(2):209-240.
-
"Off with their heads". Profiling TAM in ICE corpora. In: Hundt, Marianne; Gut, Ulrike. Mapping Unity and Diversity World-Wide. Corpus-Based Studies of New Englishes. Amsterdam: John Benjamins, 1-34.
-
Retrieving relatives from historical data. Literary and Linguistic Computing, 27(1):3-16.
-
Using automatically parsed corpora to discover lexico-grammatical features of English varieties. In: 30th International Conference on Lexis and Grammar, Nicosia, Cyprus, 5 October 2011 - 8 October 2011, 251-258.
-
Detection of interaction articles and experimental methods in biomedical literature. BMC Bioinformatics, 12(Suppl 8):S13.
-
Text-Mining-Methoden im Semantic Web. Wirtschaftsinformatik und Management, 3:28-35.
-
A large-scale investigation of verb-attached prepositional phrases. Helsinki: University of Helsinki.
-
A data-driven approach to alternations based on protein-protein interactions. In: III Congreso Internacional de Lingüística de Corpus, Valencia, Spain, 7 April 2011 - 9 April 2011, 597-607.
-
OntoGene (Team 65): preliminary analysis of participation in BioCreative III. In: BioCreative III workshop, Bethesda, Maryland, 13 September 2010 - 15 September 2010.
-
OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):472-480.
-
Text Mining Methoden im Semantic Web. HMD Praxis der Wirtschaftsinformatik, (271):35-46.
-
Multi-verbal expressions of ‘giving’ in Old English and Old Irish. In: Corpus Linguistics Conference, Liverpool, UK, 20 July 2009 - 23 July 2009, 116.
-
Using a parser as a heuristic tool for the description of New Englishes. In: The Fifth Corpus Linguistics Conference, Liverpool, UK, 20 July 2009 - 23 July 2009, online.
-
UZurich in the BioNLP 2009 Shared Task. In: BioNLP 2009 Companion Volume: Shared Task on Event Extraction, NAACL/HLT, Boulder, Colorado, 4 June 2009 - 5 June 2009, 28-36.
-
Detecting protein-protein interactions in biomedical texts using a parser and linguistic resources. In: Gelbukh, Alexander. Computational Linguistics and Intelligent Text Processing. Berlin: Springer, 406-417.
-
A New Hybrid Dependency Parser for German. In: Chiarcos, Christian; de Castilho, Richard Eckart; Stede, Manfred. Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Proceedings of the Biennial GSCL Conference 2009. Tübingen: Narr, 115-124.
-
Parser-based analysis of syntax-lexis interactions. In: Jucker, Andreas H; Schreier, Daniel; Hundt, Marianne. Corpora: Pragmatics and Discourse. Amsterdam, The Netherlands: Rodopi, 477-502.
-
Detecting Protein-Protein Interactions in Biomedical Literature Using a Parser. In: Clematide, Simon; Klenner, Manfred; Volk, Martin. Searching Answers. Münster: MV Verlag, 109-118.
-
Fishing for compliments: precision and recall in corpus-linguistic compliment research. In: Jucker, Andreas H; Taavitsainen, Irma. Speech acts in the history of English. Amsterdam: John Benjamins, 273-294.
-
A Broad-Coverage, Representationally Minimalist LFG Parser: Chunks and F-Structures Are Enough. In: LFG05, Bergen, Norway, 18 July 2005 - 20 July 2005.
Research Interests
My research interests include
- Natural Language Processing (NLP)
- Corpus Linguistics
- Robust Fast Broad-Coverage Parsing
- Dependency Grammar
- Text Mining, Information Extraction
- Semantic Web
- Information Retrieval
- BioMedical Parsing Applications
- Automated Media Content Analysis
- Formal Grammar
My interests also include UNIX and Mac OS X system administration, Prolog and Perl programming, desktop publishing, travelling, literature, jogging and cycling. I have taught Prolog, theoretical computer science, and Semantic Web at Fernfachhochschule Schweiz (Swiss distance learning UAS). I have taught Prolog and Perl at the CL department of the University of Geneva.
Dependency Grammar and Robust Parsing
I have written a low-complexity, broad-coverage probabilistic Dependency Parser for English, Pro3Gres, as part of my doctoral thesis.
I wrote my Master's thesis on Dependency Grammar and the partly dependency-based Link Grammar. I am currently developing Pro3Gres, a robust, probabilistic parser for a Dependency Grammar. In winter 2003/2004 and winter 2005/2006 I taught Dependency Grammar Parsing. In winter 2006/2007/2014 I taught Parsing Technology.
Corpus Linguistics
Both the English Seminar and the Department of Computational Linguistics have a long tradition in Corpus Linguistics research. I am a member of the ARCHER consortium. At the English Department, I am involved in the compilation of several corpora and in web interface access to them. In summer 2003, I taught a seminar on Corpus Linguistics. In summer 2006, I taught a colloquium on Corpus Linguistics. In spring 2008, I gave a lecture on Corpus Linguistics together with Fabio Rinaldi, and taught the workshop at the ICAME conference together with Hans Martin Lehmann and Nelleke Oostdijk. In autumn 2012, I taught a BA seminar on Corpus Linguistics.
BioMedical Parsing and Relation Finding
Our research on an important application of my high-precision robust parser started in 2005 and ran as an NFS project from 2008 to 2013 under the name OntoGene: Relation Finding in the BioMedical domain.
Automated Media Content Analysis
We use parsing and Opinion Mining in Automated Media Content Analysis projects. I lead subproject I.6 in the Swiss NCCR Democracy project and am part of the scientific network of the European ERC project POLCON.
Information Retrieval
From 2000 to 2004, I worked on an unsupervised text classification project at the CL department of the University of Geneva.
Question Answering
From 1999 to 2000, I worked on the ExtrAns Project in Zurich.
Formal Grammars
Since the winter term 1999/2000, I have occasionally taught the syntax course of the Zurich CL curriculum. We focus on GB, LFG and HPSG.