A Primer on Explainable AI for Natural Language Processing  

Major breakthroughs in AI are accelerating the adoption of AI-driven tools in industry, research, and government. However, the reasoning process of modern AI models occurs in a black box. In this article we explore techniques developed to mitigate this problem in the domain of NLP. 

Natural Language Processing (NLP) refers to the field of techniques developed for the automated processing of human language in text or speech data. In recent years, NLP has become dominated by machine learning-based methods, culminating in the advent of large language models such as ChatGPT. For the remainder of this article, we survey the current state of Explainable AI (XAI) research in the NLP context, concentrating on some of the most widely adopted approaches. For a more exhaustive and in-depth survey of explainable NLP, see (Danilevsky et al. 2020).

Feature Importance 

Feature importance techniques try to identify the features of a textual input sample that contribute the most to the model’s final output. Typically this means identifying the most important words, but, depending on how text samples are encoded into mathematical vectors, feature importance techniques can also be used to identify important phrases or sentences. The result is often presented visually as a saliency map, which highlights each word with an intensity corresponding to that word’s importance. Figure 1 shows an example of a saliency map for a binary classification task. Popular feature importance techniques include LIME (Ribeiro et al. 2016), SHAP (Lundberg and Lee 2017) and first-derivative saliency (Li et al. 2015).

Figure 1. A saliency map constructed using the LIME explainer on a model trained to classify questions posted on Quora as either sincere or insincere. Words highlighted in blue indicate that the question is sincere, while words highlighted in orange indicate the opposite. The most opaque highlighting indicates the words with the greatest contribution (image source). 
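
To make the idea of word-level importance concrete, the following minimal sketch scores each word by how much the prediction drops when that word is removed. This is a simple occlusion (leave-one-out) scheme, not LIME, SHAP or gradient-based saliency, and the classifier is a hypothetical stand-in: `TOY_WEIGHTS` and `predict_insincere` are invented for illustration and are not taken from the model behind Figure 1.

```python
import math

# Hypothetical word weights standing in for a trained classifier (assumption).
TOY_WEIGHTS = {"stupid": 2.0, "idiots": 2.5, "why": 0.3, "how": -0.5, "learn": -1.0}


def predict_insincere(tokens):
    """Toy model: probability that a question is insincere."""
    score = sum(TOY_WEIGHTS.get(t.lower(), 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_importance(tokens, predict):
    """Importance of a token = drop in the prediction when that token is removed."""
    baseline = predict(tokens)
    scores = []
    for i, token in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append((token, baseline - predict(reduced)))
    return scores


question = "Why are people such idiots".split()
for token, score in occlusion_importance(question, predict_insincere):
    # Larger positive scores push the prediction towards "insincere".
    print(f"{token:>8s} {score:+.3f}")
```

The scores could be mapped to highlight intensities to draw a saliency map like the one in Figure 1. Unlike this sketch, LIME and SHAP take into account how words interact when several of them are removed at once.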


Example-Driven Explanations

Example-driven interpretability techniques do not provide explicit explanations for the model’s decision. Instead, the goal is to identify other samples that are considered similar from the point of view of the model. This allows an external auditor to check the similar samples and identify which common factors and differences are likely to have played a significant role. Figure 2 depicts results from example-driven methods developed in (Croce et al. 2019).

Figure 2. Here, the task consisted of sorting questions into categories according to the subject of the question (e.g., location, number, entity…). Each question pair above is considered similar by the model, but the questions do not always belong to the same class.
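
The retrieval step behind example-driven explanations can be sketched as a nearest-neighbour search in the model’s representation space. The sketch below is not the kernel-based method of (Croce et al. 2019). It assumes a toy bag-of-words `embed` function in place of a real model’s internal representation, and the training questions in `TRAIN` are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a model's internal representation: bag-of-words counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, training_set, k=2):
    """Return the k training examples closest to the query."""
    q = embed(query)
    ranked = sorted(training_set, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


# Hypothetical labelled training questions (assumption).
TRAIN = [
    ("Where is the Eiffel Tower located?", "location"),
    ("How many people live in Bern?", "number"),
    ("What is the capital of Switzerland?", "location"),
    ("How many moons does Jupiter have?", "number"),
]

for text, label in most_similar("How many rivers are in Switzerland?", TRAIN):
    print(f"{label:>8s}: {text}")
```

An auditor would then compare the retrieved questions and their labels with the query to judge which shared words or structures the model appears to rely on.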

Generated Explanations 

A third technique involves training generative language models such as GPT-3 to generate natural language explanations for the given task (e.g. “Generated Text: The candidate is good because they have a degree from a leading university.”). Training a model capable of generating such explanations typically requires a sufficiently large dataset annotated with human-written explanations. In order to correlate generated explanations with output, the model should be simultaneously trained to accomplish the target task (e.g. classifying job applications) and generate an explanation, using a combined loss function that compares both model output and generated explanations to samples from the training data. Such techniques have been explored in (Camburu et al. 2018). 
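
The combined loss described above can be sketched as a weighted sum of a task loss and an explanation loss. This is an assumption about the general form rather than the exact objective used in (Camburu et al. 2018). The function names and the `expl_weight` parameter are hypothetical, and the probabilities are plain Python lists in place of real model outputs.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class or token."""
    return -math.log(probs[target_index])


def combined_loss(label_probs, gold_label, token_probs, gold_tokens, expl_weight=1.0):
    """Task loss plus weighted explanation loss (assumed weighted-sum form)."""
    # How well the predicted label matches the annotated label.
    task_loss = cross_entropy(label_probs, gold_label)
    # How well each generated token matches the human-written explanation,
    # averaged over the explanation length.
    expl_loss = sum(
        cross_entropy(probs, token) for probs, token in zip(token_probs, gold_tokens)
    ) / len(gold_tokens)
    return task_loss + expl_weight * expl_loss


# One training sample: two classes, and a two-token explanation over a
# two-word vocabulary (all values invented for illustration).
loss = combined_loss(
    label_probs=[0.2, 0.8], gold_label=1,
    token_probs=[[0.5, 0.5], [0.25, 0.75]], gold_tokens=[0, 1],
)
print(f"combined loss: {loss:.4f}")
```

Because both terms are minimized together, the model is pushed to produce explanations and predictions from shared internal representations, although this alone does not guarantee that the explanation reflects how the prediction was made.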

Obstacles and Shortcomings 

Each of the three approaches described above shows promise, but they are also subject to several shortcomings. For instance, in many cases, it is not clear how well the explanations match the model’s actual decision-making process. Generated explanations, for example, are trained to imitate human-written explanations, so by design they provide valuable information about the reasoning of the data annotators rather than that of the model. There may be a trade-off between faithfully explaining the model’s behavior and generating explanations that are easily understood and can be used to audit or appeal unfair algorithmic decisions. In practice, the latter may often be more important.

It has also been shown that different techniques often lead to different, even conflicting, explanations, an instance of the so-called disagreement problem (Krishna et al. 2022). This necessitates the creation of metrics that attempt to measure the quality of different explanation techniques (DeYoung et al. 2019), as well as the level of agreement between various explainers. Ideally, interpretable models could be identified via their performance on a variety of metrics that measure different aspects of interpretability. In the end, a combination of approaches such as those covered in this article will be necessary to obtain different shades of understanding and create satisfactorily interpretable models.
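
One simple way to quantify agreement between two explainers is to compare their sets of top-ranked features, in the spirit of the feature agreement measure discussed in (Krishna et al. 2022). The sketch below assumes each explanation is a dictionary mapping words to attribution scores. The two explanations are invented values, not the output of any real explainer.

```python
def top_k(attributions, k):
    """Set of the k features with the largest absolute attribution."""
    ranked = sorted(attributions, key=lambda f: abs(attributions[f]), reverse=True)
    return set(ranked[:k])


def feature_agreement(expl_a, expl_b, k):
    """Fraction of the top-k features that two explanations have in common."""
    return len(top_k(expl_a, k) & top_k(expl_b, k)) / k


# Hypothetical attributions from two explainers for the same input (assumption).
explainer_a = {"idiots": 0.60, "why": 0.10, "people": 0.05, "such": 0.02}
explainer_b = {"idiots": 0.50, "such": 0.20, "why": 0.05, "people": -0.01}

for k in (1, 2, 4):
    print(f"top-{k} agreement: {feature_agreement(explainer_a, explainer_b, k):.2f}")
```

Here the two explainers agree on the single most important word but only partly on the top two, which illustrates how the measured level of agreement depends on the choice of k.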


References

  1.  Camburu, Oana-Maria, et al. “e-snli: Natural language inference with natural language explanations.” Advances in Neural Information Processing Systems 31 (2018).
  2.  Croce, Danilo, Daniele Rossini, and Roberto Basili. “Auditing deep learning processes through kernel-based explanatory models.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
  3.  Danilevsky, Marina, et al. “A survey of the state of explainable AI for natural language processing.” arXiv preprint arXiv:2010.00711 (2020).
  4.  DeYoung, Jay, et al. “ERASER: A benchmark to evaluate rationalized NLP models.” arXiv preprint arXiv:1911.03429 (2019).
  5.  Krishna, Satyapriya, et al. “The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective.” arXiv preprint arXiv:2202.01602 (2022).
  6.  Li, Jiwei, et al. “Visualizing and understanding neural models in nlp.” arXiv preprint arXiv:1506.01066 (2015).
  7.  Lundberg, Scott M., and Su-In Lee. “A unified approach to interpreting model predictions.” Advances in neural information processing systems 30 (2017).
  8.  Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why should I trust you?’ Explaining the predictions of any classifier.” Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016.
Creative Commons Licence

AUTHOR: Alexandre Puttick

Dr. Alexandre Puttick is a post-doctoral researcher in the Applied Machine Intelligence research group at the Bern University of Applied Sciences. His current research explores the development of clinical mental health tools and detecting and mitigating bias in AI-driven recruitment tools.
