"Topic: How to calculate sentiment value in TextBlob library?"

Topic: How is the sentiment value calculated in the TextBlob library?

A Simple Code Example

from textblob import TextBlob

TextBlob(text).sentiment[0]   # the polarity; "text" is any input string
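
For instance, a minimal runnable sketch (the sentence and the exact scores are illustrative):

from textblob import TextBlob

text = "TextBlob is really not great, but it is useful!"
blob = TextBlob(text)

print(blob.sentiment)           # Sentiment(polarity=..., subjectivity=...)
print(blob.sentiment.polarity)  # same value as blob.sentiment[0]
print(blob.sentiment[0])        # polarity, a float in [-1.0, 1.0]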

Class TextBlob(BaseBlob) source code

# textblob/blob.py

class TextBlob(BaseBlob):
    """A general text block, meant for larger bodies of text (esp. those
    containing sentences). Inherits from :class:`BaseBlob <BaseBlob>`.
    :param str text: A string.
    :param tokenizer: (optional) A tokenizer instance. If ``None``, defaults to
        :class:`WordTokenizer() <textblob.tokenizers.WordTokenizer>`.
    :param np_extractor: (optional) An NPExtractor instance. If ``None``,
        defaults to :class:`FastNPExtractor() <textblob.en.np_extractors.FastNPExtractor>`.
    :param pos_tagger: (optional) A Tagger instance. If ``None``, defaults to
        :class:`NLTKTagger <textblob.en.taggers.NLTKTagger>`.
    :param analyzer: (optional) A sentiment analyzer. If ``None``, defaults to
        :class:`PatternAnalyzer <textblob.en.sentiments.PatternAnalyzer>`.
    :param classifier: (optional) A classifier.
    """

    @cached_property
    def sentences(self):
        """Return list of :class:`Sentence <Sentence>` objects."""
        return self._create_sentence_objects()

    @cached_property
    def words(self):
        """Return a list of word tokens. This excludes punctuation characters.
        If you want to include punctuation characters, access the ``tokens``
        property.
        :returns: A :class:`WordList <WordList>` of word tokens.
        """
        return WordList(word_tokenize(self.raw, include_punc=False))

    @property
    def raw_sentences(self):
        """List of strings, the raw sentences in the blob."""
        return [sentence.raw for sentence in self.sentences]

    @property
    def serialized(self):
        """Returns a list of each sentence's dict representation."""
        return [sentence.dict for sentence in self.sentences]

    def to_json(self, *args, **kwargs):
        '''Return a json representation (str) of this blob.
        Takes the same arguments as json.dumps.
        .. versionadded:: 0.5.1
        '''
        return json.dumps(self.serialized, *args, **kwargs)

    @property
    def json(self):
        '''The json representation of this blob.
        .. versionchanged:: 0.5.1
            Made ``json`` a property instead of a method to restore backwards
            compatibility that was broken after version 0.4.0.
        '''
        return self.to_json()

    def _create_sentence_objects(self):
        '''Returns a list of Sentence objects from the raw text.
        '''
        sentence_objects = []
        sentences = sent_tokenize(self.raw)
        char_index = 0  # Keeps track of character index within the blob
        for sent in sentences:
            # Compute the start and end indices of the sentence
            # within the blob
            start_index = self.raw.index(sent, char_index)
            char_index += len(sent)
            end_index = start_index + len(sent)
            # Sentences share the same models as their parent blob
            s = Sentence(sent, start_index=start_index, end_index=end_index,
                tokenizer=self.tokenizer, np_extractor=self.np_extractor,
                pos_tagger=self.pos_tagger, analyzer=self.analyzer,
                parser=self.parser, classifier=self.classifier)
            sentence_objects.append(s)
        return sentence_objects

There is no attribute or method named "sentiment" in the TextBlob class, but it inherits from BaseBlob and therefore has all of BaseBlob's properties and methods. So we need to keep searching in BaseBlob, where we quickly find traces of sentiment.
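
A quick sanity check confirms the inheritance:

from textblob import TextBlob
from textblob.blob import BaseBlob

print(issubclass(TextBlob, BaseBlob))   # True
print('sentiment' in vars(TextBlob))    # False: not defined on TextBlob itself
print('sentiment' in vars(BaseBlob))    # True: defined on the base class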

Class BaseBlob(StringlikeMixin, BlobComparableMixin) source code

# textblob/blob.py

class BaseBlob(StringlikeMixin, BlobComparableMixin):
    """An abstract base class that all textblob classes will inherit from.
    Includes words, POS tag, NP, and word count properties. Also includes
    basic dunder and string methods for making objects like Python strings.
    ....
    """
    np_extractor = FastNPExtractor()
    pos_tagger = NLTKTagger()
    tokenizer = WordTokenizer()
    translator = Translator()
    analyzer = PatternAnalyzer()
    parser = PatternParser()

    def __init__(self, text, tokenizer=None,
        ...

    ...

    @cached_property
    def sentiment(self):
        """Return a tuple of form (polarity, subjectivity ) where polarity
        is a float within the range [-1.0, 1.0] and subjectivity is a float
        within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is
        very subjective.
        :rtype: namedtuple of the form ``Sentiment(polarity, subjectivity)``
        """
        return self.analyzer.analyze(self.raw)

    ...

The sentiment method is declared as a (cached) property, which makes it convenient to call. It returns self.analyzer.analyze(self.raw), which with the default analyzer is PatternAnalyzer().analyze(self.raw), as the sketch below shows.
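
In other words, with the default analyzer the two calls below are equivalent (a small sketch; the scores depend on the lexicon):

from textblob import TextBlob
from textblob.en.sentiments import PatternAnalyzer

text = "a horrible movie"
print(TextBlob(text).sentiment)          # uses the default PatternAnalyzer
print(PatternAnalyzer().analyze(text))   # same result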

So what is the PatternAnalyzer class? The import statement near the top of the file, "from textblob.sentiments import PatternAnalyzer", tells us where to look next.

Class PatternAnalyzer(BaseSentimentAnalyzer) source code

The path of this file: textblob/en/sentiments.py

# textblob/en/sentiments.py 

class PatternAnalyzer(BaseSentimentAnalyzer):
    """Sentiment analyzer that uses the same implementation as the
    pattern library. Returns results as a named tuple of the form:
    ``Sentiment(polarity, subjectivity, [assessments])``
    where [assessments] is a list of the assessed tokens and their
    polarity and subjectivity scores
    """
    kind = CONTINUOUS
    # This is only here for backwards-compatibility.
    # The return type is actually determined upon calling analyze()
    RETURN_TYPE = namedtuple('Sentiment', ['polarity', 'subjectivity'])

    def analyze(self, text, keep_assessments=False):
        """Return the sentiment as a named tuple of the form:
        ``Sentiment(polarity, subjectivity, [assessments])``.
        """
        #: Return type declaration
        if keep_assessments:
            Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity', 'assessments'])
            assessments = pattern_sentiment(text).assessments
            polarity, subjectivity = pattern_sentiment(text)
            return Sentiment(polarity, subjectivity, assessments)

        else:
            Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])
            return Sentiment(*pattern_sentiment(text))

So what is the algorithm behind these two final statements?

from collections import namedtuple
from textblob.en import sentiment as pattern_sentiment

Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])
return Sentiment(*pattern_sentiment(text))

Google namedtuple if you are not familiar with it: it creates a new tuple subclass with named fields, not an empty tuple. Here, Sentiment is such a class, and the statement "Sentiment(*pattern_sentiment(text))" instantiates it with the polarity and subjectivity values unpacked from pattern_sentiment(text). A small illustration:
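
from collections import namedtuple

Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])  # a new tuple subclass
s = Sentiment(-0.6, 1.0)

print(s.polarity)   # -0.6, access by field name
print(s[0])         # -0.6, still works by index, hence TextBlob(text).sentiment[0]
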
pattern_sentiment is an alias for sentiment, which is defined in textblob/en/__init__.py.

type(pattern_sentiment)     # textblob.en.Sentiment
len(pattern_sentiment)      # 2860

for k,v in pattern_sentiment.items():
    print(k, v)

'''
Output:
...
absolute {'JJ': [0.2, 0.9, 1.0], None: [0.2, 0.9, 1.0]}
absorbed {'JJ': [0.3, 0.9, 1.0], None: [0.3, 0.9, 1.0]}
absorbing {'JJ': [0.2, 0.95, 1.0], None: [0.2, 0.95, 1.0]}
absurd {'JJ': [-0.5, 1.0, 1.0], None: [-0.5, 1.0, 1.0]}
...
'''

# textblob.en.__init__.py

sentiment = Sentiment(
        path = os.path.join(MODULE, "en-sentiment.xml"),
      synset = "wordnet_id",
   negations = ("no", "not", "n't", "never"),
   modifiers = ("RB",),
   modifier  = lambda w: w.endswith("ly"),
   tokenizer = parser.find_tokens,
    language = "en"
)
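
Because Sentiment is also callable (see __call__ below), the lexicon object itself can score text directly. A small sketch, using the example from the source docstring:

from textblob.en import sentiment as pattern_sentiment

# Calling the lexicon returns an averaged (polarity, subjectivity) tuple
print(pattern_sentiment("a horrible movie"))   # (-0.6, 1.0) per the docstring example
# Individual lexicon entries map POS tags to [polarity, subjectivity, intensity]
print(pattern_sentiment["absurd"])             # {'JJ': [-0.5, 1.0, 1.0], None: [-0.5, 1.0, 1.0]}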

Class Sentiment(lazydict) source code

This class is in textblob/_text.py

# textblob/_text.py

### SENTIMENT POLARITY LEXICON #####################################################################
# A sentiment lexicon can be used to discern objective facts from subjective opinions in text.
# Each word in the lexicon has scores for:
# 1)     polarity: negative vs. positive    (-1.0 => +1.0)
# 2) subjectivity: objective vs. subjective (+0.0 => +1.0)
# 3)    intensity: modifies next word?      (x0.5 => x2.0)

# For English, adverbs are used as modifiers (e.g., "very good").
# For Dutch, adverbial adjectives are used as modifiers
# ("hopeloos voorspelbaar", "ontzettend spannend", "verschrikkelijk goed").
# Negation words (e.g., "not") reverse the polarity of the following word.

# Sentiment()(txt) returns an averaged (polarity, subjectivity)-tuple.
# Sentiment().assessments(txt) returns a list of (chunk, polarity, subjectivity, label)-tuples.

# Semantic labels are useful for fine-grained analysis, e.g.,
# negative words + positive emoticons could indicate cynicism.

class Sentiment(lazydict):

    def __init__(self, path="", language=None, synset=None, confidence=None, **kwargs):
        """ A dictionary of words (adjectives) and polarity scores (positive/negative).
            The value for each word is a dictionary of part-of-speech tags.
            The value for each word POS-tag is a tuple with values for
            polarity (-1.0-1.0), subjectivity (0.0-1.0) and intensity (0.5-2.0).
        """
        ....

    ...

    def __call__(self, s, negation=True, **kwargs):
        """ Returns a (polarity, subjectivity)-tuple for the given sentence,
            with polarity between -1.0 and 1.0 and subjectivity between 0.0 and 1.0.
            The sentence can be a string, Synset, Text, Sentence, Chunk, Word, Document, Vector.
            An optional weight parameter can be given,
            as a function that takes a list of words and returns a weight.
        """
        def avg(assessments, weighted=lambda w: 1):
            s, n = 0, 0
            for words, score in assessments:
                w = weighted(words)
                s += w * score
                n += w
            return s / float(n or 1)
        # A pattern.en.wordnet.Synset.
        # Sentiment(synsets("horrible", "JJ")[0]) => (-0.6, 1.0)
        if hasattr(s, "gloss"):
            a = [(s.synonyms[0],) + self.synset(s.id, pos=s.pos) + (None,)]
        # A synset id.
        # Sentiment("a-00193480") => horrible => (-0.6, 1.0)   (English WordNet)
        # Sentiment("c_267") => verschrikkelijk => (-0.9, 1.0) (Dutch Cornetto)
        elif isinstance(s, basestring) and RE_SYNSET.match(s) and hasattr(s, "synonyms"):
            a = [(s.synonyms[0],) + self.synset(s.id, pos=s.pos) + (None,)]
        # A string of words.
        # Sentiment("a horrible movie") => (-0.6, 1.0)
        elif isinstance(s, basestring):
            a = self.assessments(((w.lower(), None) for w in " ".join(self.tokenizer(s)).split()), negation)
        # A pattern.en.Text.
        elif hasattr(s, "sentences"):
            a = self.assessments(((w.lemma or w.string.lower(), w.pos[:2]) for w in chain(*s)), negation)
        # A pattern.en.Sentence or pattern.en.Chunk.
        elif hasattr(s, "lemmata"):
            a = self.assessments(((w.lemma or w.string.lower(), w.pos[:2]) for w in s.words), negation)
        # A pattern.en.Word.
        elif hasattr(s, "lemma"):
            a = self.assessments(((s.lemma or s.string.lower(), s.pos[:2]),), negation)
        # A pattern.vector.Document.
        # Average score = weighted average using feature weights.
        # Bag-of words is unordered: inject None between each two words
        # to stop assessments() from scanning for preceding negation & modifiers.
        elif hasattr(s, "terms"):
            a = self.assessments(chain(*(((w, None), (None, None)) for w in s)), negation)
            kwargs.setdefault("weight", lambda w: s.terms[w[0]])
        # A dict of (word, weight)-items.
        elif isinstance(s, dict):
            a = self.assessments(chain(*(((w, None), (None, None)) for w in s)), negation)
            kwargs.setdefault("weight", lambda w: s[w[0]])
        # A list of words.
        elif isinstance(s, list):
            a = self.assessments(((w, None) for w in s), negation)
        else:
            a = []
        weight = kwargs.get("weight", lambda w: 1) # [(w, p) for w, p, s, x in a]
        return Score(polarity = avg( [(w, p) for w, p, s, x in a], weight ),
                 subjectivity = avg([(w, s) for w, p, s, x in a], weight),
                  assessments = a)

    def assessments(self, words=[], negation=True):
        """ Returns a list of (chunk, polarity, subjectivity, label)-tuples for the given list of words:
            where chunk is a list of successive words: a known word optionally
            preceded by a modifier ("very good") or a negation ("not good").
        """
        a = []
        m = None # Preceding modifier (i.e., adverb or adjective).
        n = None # Preceding negation (e.g., "not beautiful").
        for w, pos in words:
            # Only assess known words, preferably by part-of-speech tag.
            # Including unknown words (polarity 0.0 and subjectivity 0.0) lowers the average.
            if w is None:
                continue
            if w in self and pos in self[w]:
                p, s, i = self[w][pos]
                # Known word not preceded by a modifier ("good").
                if m is None:
                    a.append(dict(w=[w], p=p, s=s, i=i, n=1, x=self.labeler.get(w)))
                # Known word preceded by a modifier ("really good").
                if m is not None:
                    a[-1]["w"].append(w)
                    a[-1]["p"] = max(-1.0, min(p * a[-1]["i"], +1.0))
                    a[-1]["s"] = max(-1.0, min(s * a[-1]["i"], +1.0))
                    a[-1]["i"] = i
                    a[-1]["x"] = self.labeler.get(w)
                # Known word preceded by a negation ("not really good").
                if n is not None:
                    a[-1]["w"].insert(0, n)
                    a[-1]["i"] = 1.0 / a[-1]["i"]
                    a[-1]["n"] = -1
                # Known word may be a negation.
                # Known word may be modifying the next word (i.e., it is a known adverb).
                m = None
                n = None
                if pos and pos in self.modifiers or any(map(self[w].__contains__, self.modifiers)):
                    m = (w, pos)
                if negation and w in self.negations:
                    n = w
            else:
                # Unknown word may be a negation ("not good").
                if negation and w in self.negations:
                    n = w
                # Unknown word. Retain negation across small words ("not a good").
                elif n and len(w.strip("'")) > 1:
                    n = None
                # Unknown word may be a negation preceded by a modifier ("really not good").
                if n is not None and m is not None and (pos in self.modifiers or self.modifier(m[0])):
                    a[-1]["w"].append(n)
                    a[-1]["n"] = -1
                    n = None
                # Unknown word. Retain modifier across small words ("really is a good").
                elif m and len(w) > 2:
                    m = None
                # Exclamation marks boost previous word.
                if w == "!" and len(a) > 0:
                    a[-1]["w"].append("!")
                    a[-1]["p"] = max(-1.0, min(a[-1]["p"] * 1.25, +1.0))
                # Exclamation marks in parentheses indicate sarcasm.
                if w == "(!)":
                    a.append(dict(w=[w], p=0.0, s=1.0, i=1.0, n=1, x=IRONY))
                # EMOTICONS: {("grin", +1.0): set((":-D", ":D"))}
                if w.isalpha() is False and len(w) <= 5 and w not in PUNCTUATION: # speedup
                    for (type, p), e in EMOTICONS.items():
                        if w in imap(lambda e: e.lower(), e):
                            a.append(dict(w=[w], p=p, s=1.0, i=1.0, n=1, x=MOOD))
                            break
        for i in range(len(a)):
            w = a[i]["w"]
            p = a[i]["p"]
            s = a[i]["s"]
            n = a[i]["n"]
            x = a[i]["x"]
            # "not good" = slightly bad, "not bad" = slightly good.
            a[i] = (w, p * -0.5 if n < 0 else p, s, x)
        return a

    def annotate(self, word, pos=None, polarity=0.0, subjectivity=0.0, intensity=1.0, label=None):
        ...

Put simply, the result returned by the assessments method looks like this:

from textblob.en import sentiment as pattern_sentiment
pattern_sentiment(text).assessments

'''
Output:
[(['accomplished'], 0.2, 0.5, None),
 (['internal'], 0.0, 0.0, None),
 (['internal'], 0.0, 0.0, None),
 (['particular'], 0.16666666666666666, 0.3333333333333333, None),
 (['useful'], 0.3, 0.0, None),
 (['other'], -0.125, 0.375, None),
 (['not', 'great'], -0.4, 0.75, None)]

 # The first column is the chunk of words.   (w)
 # The second column is the polarity.        (p)
 # The third column is the subjectivity.     (s)
 # The fourth column is the word's label.    (x)
 '''

Now we can see how the __call__ method works:

def __call__(self, s, negation=True, **kwargs):

    def avg(assessments, weighted=lambda w: 1):
            s, n = 0, 0
            for words, score in assessments:
                w = weighted(words)
                s += w * score
                n += w
            return s / float(n or 1)
    ....
    a = self.assessments(...)   # "a" has the same format as the output above
    weight = kwargs.get("weight", lambda w: 1) # [(w, p) for w, p, s, x in a]
    return Score(polarity = avg( [(w, p) for w, p, s, x in a], weight ),
                subjectivity = avg([(w, s) for w, p, s, x in a], weight),
                assessments = a)

We can see that the overall polarity and subjectivity of the text are simply the weighted averages of the per-chunk polarity and subjectivity values.
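
For example, we can reproduce the computation by hand on the assessments listed above (default weight of 1 per chunk):

# The assessments output shown earlier
assessments = [
    (['accomplished'], 0.2, 0.5, None),
    (['internal'], 0.0, 0.0, None),
    (['internal'], 0.0, 0.0, None),
    (['particular'], 0.16666666666666666, 0.3333333333333333, None),
    (['useful'], 0.3, 0.0, None),
    (['other'], -0.125, 0.375, None),
    (['not', 'great'], -0.4, 0.75, None),
]

def avg(pairs, weighted=lambda w: 1):
    s, n = 0, 0
    for words, score in pairs:
        w = weighted(words)
        s += w * score
        n += w
    return s / float(n or 1)

print(avg([(w, p) for w, p, s, x in assessments]))  # polarity     ~ 0.0202
print(avg([(w, s) for w, p, s, x in assessments]))  # subjectivity ~ 0.2798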

In fact, two main ingredients make TextBlob's sentiment analysis work:

  • The rules in the assessments() method of the Sentiment class, which define how the polarity and subjectivity of the current word are updated according to context.
  • The lexicon corpus: en-sentiment.xml (textblob/en/en-sentiment.xml).

Let's look at some interesting rules of TextBlob's sentiment analysis.

First, the lexicon corpus contains 2860 words, almost all of them adjectives. However, en-sentiment.xml has more than 2860 records, because some words carry different polarity and subjectivity values in different contexts. For example, there are 12 records for "rich"; when TextBlob loads the corpus, it averages those 12 polarity and 12 subjectivity values into the final values for "rich". We can inspect the loaded entry directly, as sketched below.
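
A small sketch; the exact numbers depend on the shipped en-sentiment.xml:

from textblob.en import sentiment as pattern_sentiment

# The 12 "rich" records in en-sentiment.xml are averaged into one entry on load
print(pattern_sentiment["rich"])   # {'JJ': [p, s, i], None: [p, s, i]} with averaged values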

The rules of TextBlob's analysis mainly concern two aspects:

  • modifiers (i.e., adverbs or adjectives)
  • negations (e.g., "not beautiful")

The defaults are set in the Sentiment constructor:
self.negations   = kwargs.get("negations", ("no", "not", "n't", "never"))
self.modifiers   = kwargs.get("modifiers", ("RB",))
self.modifier    = kwargs.get("modifier" , lambda w: w.endswith("ly"))
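
A quick demonstration of the negation rule. Given that "great" scores a polarity of 0.8 in the lexicon (consistent with the (['not', 'great'], -0.4, 0.75) assessment shown earlier):

from textblob import TextBlob

print(TextBlob("great").sentiment.polarity)      #  0.8
print(TextBlob("not great").sentiment.polarity)  # -0.4, i.e. 0.8 * -0.5 ("not good" = slightly bad)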

The annotated walkthrough below shows how assessments() applies these rules, first for words found in the lexicon (known words), then for words that are not (unknown words):

a = []  # result list
m = None # Preceding modifier (i.e., adverb or adjective).
n = None # Preceding negation (e.g., "not beautiful").
negation = True     # default argument

########### Current Word is Known ###########
########### Rule 1 ###########
# current Known word not preceded by a modifier ("good").
if m is None:
    a.append(dict(w=[w], p=p, s=s, i=i, n=1, x=self.labeler.get(w)))    # normal append a dictionary

########### Rule 2 ###########
# current Known word preceded by a modifier ("really good").
if m is not None:
    a[-1]["w"].append(w)            # append modifier word to the latest word to become one new chunck token
    a[-1]["p"] = max(-1.0, min(p * a[-1]["i"], +1.0))       # new chunk token's original intensity times current word's polarity (control it between(-1.0, 1.0))
    a[-1]["s"] = max(-1.0, min(s * a[-1]["i"], +1.0))       # same operation on the subjectivity
    a[-1]["i"] = i                                          # i update to the current known word's intensity
    a[-1]["x"] = self.labeler.get(w)                        # x update to the current known word's label

########### Rule 3 ############
# current Known word preceded by a negation ("not really good").
if n is not None:
    a[-1]["w"].insert(0, n)         # prepend the negation word to the latest chunk
    a[-1]["i"] = 1.0 / a[-1]["i"]   # the chunk's intensity becomes its reciprocal
    a[-1]["n"] = -1                 # mark the chunk as negated

########### Rule 4 ###########
# current Known word may be a negation.
# current Known word may be modifying the next word (i.e., it is a known adverb).
m = None        # remove the influence from previous modifier (known or unknown word)
n = None        # remove the influence from previous negation (known or unknown word)
if pos and pos in self.modifiers or any(map(self[w].__contains__, self.modifiers)):   # if current known word is a modifier
    m = (w, pos)        # assign (w, pos) tuple to m
if negation and w in self.negations:    # if current known word is a negation
    n = w               # assign current word to n


########### Current Word is Unknown ###########
########### Rule 5 ############
# current Unknown word may be a negation ("not good").
if negation and w in self.negations:        # if current unknown word is a negation
    n = w       # assign current word to n
# current Unknown word. Retain negation across small words ("not a good").
elif n and len(w.strip("'")) > 1:       # if a negation is pending and the current word has more than one letter
    n = None    # reset n

########### Rule 6 ############
# Unknown word may be a negation preceded by a modifier ("really not good").
if n is not None and m is not None and (pos in self.modifiers or self.modifier(m[0])):      # if modifier + negation
    a[-1]["w"].append(n)    # append negation word to the latest word to become one new chunck token
    a[-1]["n"] = -1         
    n = None                # reset n
# Unknown word. Retain modifier across small words ("really is a good").
elif m and len(w) > 2:  
    m = None    # Reset m

########### Rule 7 ############
# Exclamation marks boost previous word.
if w == "!" and len(a) > 0:
    a[-1]["w"].append("!")
    a[-1]["p"] = max(-1.0, min(a[-1]["p"] * 1.25, +1.0))

########### Rule 8 ############
# Exclamation marks in parentheses indicate sarcasm.
if w == "(!)":
    a.append(dict(w=[w], p=0.0, s=1.0, i=1.0, n=1, x=IRONY))

########### Rule 9 ############
# EMOTICONS: {("grin", +1.0): set((":-D", ":D"))}
if w.isalpha() is False and len(w) <= 5 and w not in PUNCTUATION: # speedup
    for (type, p), e in EMOTICONS.items():
        if w in imap(lambda e: e.lower(), e):
            a.append(dict(w=[w], p=p, s=1.0, i=1.0, n=1, x=MOOD))
            break
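
Rule 7 is easy to verify. Assuming "great" scores 0.8 as above, a single "!" multiplies the polarity by 1.25, with the result clamped to +1.0:

from textblob import TextBlob

print(TextBlob("great").sentiment.polarity)    # 0.8
print(TextBlob("great!").sentiment.polarity)   # 1.0 = min(0.8 * 1.25, 1.0)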

[Flowchart: the assessments() loop (source: https://github.com/sloria/TextBlob/blob/e6cd9791ae42e37b5a2132676f9ca69340e8d8c0/textblob/_text.py#L854). Initialize a = [], m = None, n = None. For each word: if the word is in the lexicon, fetch its p, s, i; if m is None, append a new word dict to "a", otherwise append the word to the last chunk and update p, s, i; if n is not None, prepend n to the last chunk and update i and n; reset m and n to None; then, if the word is a modifier, set m = (w, pos), and if it is a negation, set n = w.]

As a side note, the str.isalpha() check used in the Rule 9 speedup behaves like this:

'abc'.isalpha() ==> True
'abc1'.isalpha() ==> False
'abc:'.isalpha() ==> False

Finally, as the branches in __call__ show, Sentiment.__call__() supports many input formats: a plain string, a Synset or synset id, a pattern Text / Sentence / Chunk / Word, a pattern.vector.Document, a dict of (word, weight) items, or a list of words.
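
A few of these input formats in action (a sketch; the exact scores depend on the lexicon):

from textblob.en import sentiment as pattern_sentiment

print(pattern_sentiment("a horrible movie"))               # a string of words
print(pattern_sentiment(["a", "horrible", "movie"]))       # a list of words
print(pattern_sentiment({"horrible": 2.0, "movie": 1.0}))  # a dict of (word, weight) items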