
Notes on NLP

Papers / Websites

The Annotated Transformer (Harvard): http://nlp.seas.harvard.edu/2018/04/03/attention.html

Attention Is All You Need (original paper): https://arxiv.org/abs/1706.03762
Distilling the Knowledge in a Neural Network: https://arxiv.org/abs/1503.02531
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models: https://arxiv.org/abs/1908.08962

TensorFlow Hub: https://tfhub.dev/
Google Research, BERT - Smaller Models on GitHub: https://github.com/google-research/bert
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/pdf/1810.04805.pdf

Websites

Getting meaning from text: self-attention step-by-step video (Romain Futrzynski):
https://peltarion.com/blog/data-science/self-attention-video
The Illustrated Transformer (Jay Alammar):
http://jalammar.github.io/illustrated-transformer/
Paper Dissected: “Attention is All You Need” Explained:
https://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
Speech and Language Processing, Dan Jurafsky and James H. Martin (3rd ed. draft):
https://web.stanford.edu/~jurafsky/slp3/

Videos

RNN W3L08 - Attention Model (Andrew Ng): https://www.youtube.com/watch?v=FMXUkEbjf9k&feature=youtu.be
Attention is all you need; Attentional Neural Network Models | Łukasz Kaiser | Masterclass: https://www.youtube.com/watch?v=rBCqOTEfxvg&feature=youtu.be
Transformer Neural Networks - EXPLAINED! (Attention is all you need) (CodeEmporium): https://www.youtube.com/watch?v=TQQlZhbC5ps&feature=youtu.be
Attention in Neural Networks (CodeEmporium):
https://www.youtube.com/watch?v=W2rWgXJBZhU&t
[Transformer] Attention Is All You Need | AISC Foundational (Joseph Palermo (Dessa)):
https://www.youtube.com/watch?v=S0KakHcj_rs&feature=youtu.be
Ivan Bilan: Understanding and Applying Self-Attention for NLP | PyData Berlin 2018: https://www.youtube.com/watch?v=OYygPG4d9H0&feature=youtu.be
Transformer (Attention is all you need) (Minsuk Heo):
https://www.youtube.com/watch?v=z1xs9jdZnuY&feature=youtu.be
Self-attention step-by-step | How to get meaning from text | Peltarion Platform (Romain Futrzynski):
https://www.youtube.com/watch?v=-9vVhYEXeyQ&feature=emb_logo

Literature overview on NLP

These tables give an overview of recent and influential literature in Natural Language Processing from the past few years.

General overview

NLP, transfer learning, language models.

Author	Title	Link to code	Abstract (short)
Vaswani et al. (2017)	Attention Is All You Need	Code used for training and evaluation: https://github.com/tensorflow/tensor2tensor	Introduction of a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Kim et al. (2017)	Structured Attention Networks	https://github.com/harvardnlp/struct-attn	In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees.
Radford et al. (2018)	Improving Language Understanding by Generative Pre-Training	https://github.com/openai/finetune-transformer-lm	Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. *GPT-1*
Devlin et al. (2018)	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	https://github.com/google-research/bert	Introduction of a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Radford et al. (2019)	Language Models are Unsupervised Multitask Learners	https://github.com/openai/gpt-2	Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. (…) Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.
Ruder (2019)	Neural Transfer Learning for Natural Language Processing	https://github.com/sebastianruder	Multiple novel methods for different transfer learning scenarios were presented and evaluated across a diversity of settings where they outperformed single-task learning as well as competing transfer learning methods.
Kovaleva et al. (2019)	Revealing the Dark Secrets of BERT	-	BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to its success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT.
Rogers et al. (2020)	A Primer in BERTology: What We Know About How BERT Works	-	This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression.
Brown et al. (2020)	Language Models are Few-Shot Learners	https://github.com/openai/gpt-3	Demonstration that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.
Schick and Schütze (2020)	It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners	https://github.com/timoschick/pet	We show that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements.
Jaegle et al. (2021)	Perceiver IO: A General Architecture for Structured Inputs & Outputs	https://github.com/deepmind/deepmind-research/tree/master/perceiver	The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original’s appealing properties by learning to flexibly query the model’s latent space to produce outputs of arbitrary size and semantics.
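
Most entries in the table above build on the attention mechanism introduced by Vaswani et al. (2017). As a rough illustration only, the following is a minimal sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, in plain Python. The function names and the toy numbers are illustrative assumptions and are not taken from any of the listed code bases; real implementations work on batched tensors and add multiple heads, masking and learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    queries: list of n_q vectors of length d_k
    keys:    list of n_k vectors of length d_k
    values:  list of n_k vectors of length d_v
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(d_v)
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example (hypothetical numbers): two tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

In this toy setup each query matches one key more closely than the other, so the corresponding value vector dominates its output, while the other value still contributes because softmax never assigns exactly zero weight.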

Specific overview

Speech recognition

Author	Title	Link to code	Abstract (short)
Amodei et al. (2015)	Deep Speech 2: End-to-End Speech Recognition in English and Mandarin	-	We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech—two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages.
Agarwal and Zesch (2019)	German End-to-end Speech Recognition based on DeepSpeech	https://github.com/AASHISHAG/deepspeech-german	Description of the process of training German models based on the Mozilla DeepSpeech architecture using publicly available data.

Information Extraction

Named Entity Recognition

Author	Title	Link to code	Abstract (short)
Anthofer (2017)	A Neural Network for Open Information Extraction from German Text	https://github.com/danielanthofer/nnoiegt	Systems that extract information from natural language texts usually need to consider language-dependent aspects like vocabulary and grammar. Compared to the development of individual systems for different languages, development of multilingual information extraction (IE) systems has the potential to reduce cost and effort. One path towards IE from different languages is to port an IE system from one language to another. PropsDE is an open IE (OIE) system that has been ported from the English system PropS to the German language.
Riedl and Padó (2018)	A Named Entity Recognition Shootout for German	https://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/german-ner/	We ask how to practically build a model for German named entity recognition (NER) that performs at the state of the art for both contemporary and historical texts, i.e., a big-data and a small-data scenario.
Torge et al. (2021)	Transfer Learning for Domain-Specific Named Entity Recognition in German	-	Investigation of different transfer learning approaches to recognize unknown domain-specific entities, including the influence on varying training data size.

HSRW EOLab Students Wiki
