Texts tend to have a hierarchical structure, and the importance of words and sentences is highly context dependent.

This post is a short tutorial on highlighting text using sample weights, with the result displayed in a Jupyter notebook. The weights can come from a model such as logistic regression or an attention model. Varying the intensity of the background color highlights the words that the model found important.
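
Weights taken from a real model are usually not on a 0 to 1 scale, and logistic regression coefficients can be negative. One possible way to prepare them for highlighting is to take absolute values and min-max scale them. The sketch below assumes the weights arrive as a plain dictionary; the helper name `normalize_weights` and the sample values are illustrative and not part of the original post.

```python
def normalize_weights(raw_weights):
    """Scale the magnitude of each raw weight into the range [0, 1]."""
    if not raw_weights:
        return {}
    # Use the absolute value so strongly negative weights also count as important
    magnitudes = {word: abs(w) for word, w in raw_weights.items()}
    lowest = min(magnitudes.values())
    highest = max(magnitudes.values())
    if highest == lowest:
        # All weights are equally important; give every word full intensity
        return {word: 1.0 for word in magnitudes}
    return {word: (m - lowest) / (highest - lowest) for word, m in magnitudes.items()}


# Hypothetical coefficients, for illustration only
raw = {"economy": 2.0, "growth": -1.0, "the": 0.0}
print(normalize_weights(raw))
```

Taking the absolute value discards the sign, so this choice only suits cases where a single highlight color is enough.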

import pandas as pd
import numpy as np
import html
import random
from IPython.display import display, HTML
# Escape special characters like & and < so the browser does not display something other than what you intended.
def html_escape(text):
    return html.escape(text)
# Sample text taken from: http://52.51.209.151/data-analysis-resources/an-analysis-of-the-impact-of-eu-membership-on-the-economic-development-of-ireland/


text = "Ireland became a member of the European Union (EU) and joined the single market on 1st January 1973. Before accession to the bloc, Ireland had decades of an underachieving economy which was heavily dependent on the UK. Since then it has transformed into a prosperous and confident country which is a major influence in the global politics. The economy has transformed from agricultural dependent to one driven by the tech industry and global exports. The membership has also affected every part of Irish society from the way the citizens work, travel or even shop [1].However, the recent political turmoil of Brexit and 2008 recession crisis has left certain citizens to wonder the importance of the membership. Sometimes, there is a doubt in EUs ability to provide a good living standard. Economic development requires economic growth to reflect economic as well as social growth. Since joining the EU, Ireland has developed economically and changed socially. The statistical and mathematical analysis of the data from World Bank shows that the economy has transformed from agriculture dependent to dependent on manufacturing merchandise, goods and services industry. There has been growth in the population due to reduced death rate and increased life expectancy. The birth rate has also decreased. The population is young and highly educated which helps the multinational companies making decision to base their operation in Ireland. The participation of females in the job market has increased over the years but females are also most likely to be employed in part-time jobs"
# Remove duplicate words from text
seen = set()
result = []
for item in text.split():
    if item not in seen:
        seen.add(item)
        result.append(item)
# Create a random sample weight (between 0 and 1) for each unique word
weights = [random.random() for _ in result]


df_coeff = pd.DataFrame(
    {'word': result,
     'num_code': weights
    })
# Map each word to its weight, looking the columns up by name rather than by position
word_to_coeff_mapping = {}
for _, row in df_coeff.iterrows():
    word_to_coeff_mapping[row['word']] = row['num_code']
max_alpha = 0.8
highlighted_text = []
for word in text.split():
    weight = word_to_coeff_mapping.get(word)

    if weight is not None:
        # Scale the weight (0 to 1) so the opacity never exceeds max_alpha
        highlighted_text.append('<span style="background-color:rgba(135,206,250,' + str(weight * max_alpha) + ');">' + html_escape(word) + '</span>')
    else:
        highlighted_text.append(html_escape(word))
highlighted_text = ' '.join(highlighted_text)
#highlighted_text
display(HTML(highlighted_text))
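
The highlighting steps can also be collected into one reusable function that works without pandas. This is a minimal sketch, assuming the weights are already between 0 and 1 and that max_alpha is meant as an upper bound on the opacity; the function name `highlight` and its parameters are illustrative and not part of the original code.

```python
import html


def highlight(text, word_weights, max_alpha=0.8, rgb=(135, 206, 250)):
    """Return an HTML string in which each weighted word gets a colored background."""
    parts = []
    for word in text.split():
        escaped = html.escape(word)
        weight = word_weights.get(word)
        if weight is None:
            # Words without a weight are shown without a background
            parts.append(escaped)
            continue
        # Clamp to [0, 1], then scale so the opacity never exceeds max_alpha
        alpha = max(0.0, min(1.0, weight)) * max_alpha
        parts.append(
            f'<span style="background-color:rgba({rgb[0]},{rgb[1]},{rgb[2]},{alpha:.2f});">'
            f'{escaped}</span>'
        )
    return " ".join(parts)


print(highlight("growth & economy", {"growth": 0.5, "economy": 1.0}))
```

In a notebook, the returned string can be passed to display(HTML(...)) in the same way as highlighted_text above.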

3 Responses

  1. It is not working because the word_to_coeff_mapping dictionary is built the wrong way round. Change word_to_coeff_mapping[row[1]] = (row[0]) to word_to_coeff_mapping[row[0]] = (row[1]).