Connective Media

Assignment 3: Data Analysis

Sasinda Ruksha <sp2335@cornell.edu> Yilin Xu <yx349@cornell.edu>

May 10, 2016

Introduction

What people say and share on social media reveals a considerable amount about their psychological state and personality. Whether people reveal their personality traits and psychological state differently online and offline, and in different contexts, has become an important research question, especially with the advent of social media platforms like Twitter and Facebook. There is significant research on how people behave online versus offline and on whether they try to idealize themselves on online platforms. In the case of Facebook, research has shown that the information in users’ profiles reflects their actual personalities, not an “idealized” version of themselves[1]. Furthermore, it has been shown that users’ Big Five personality traits can be predicted from the public information they share on Facebook[2].

“All the world’s a stage, and all the men and women merely players.” — Shakespeare (and Goffman)

We are aware that people use different social media platforms for different purposes, and that they keep a sense of audience in every mediated conversation in order to present themselves appropriately[3]. What we are interested in is whether, in general, different kinds of users on different platforms present themselves differently in different contexts. To investigate this, we analyzed data from three platforms: Twitter, Airbnb and The Unofficial Apple Weblog (TUAW). While these platforms usually provide different social contexts, they are similar in that their users’ self-presentations are mostly based on conventional signals.

Hypotheses

Our general hypothesis, H1, is that self-presentation varies across contexts. Twitter users may tend to present themselves in fixed, singular, and self-conscious ways when describing themselves, while being more dynamic and “careless” in tweets. TUAW consists of bloggers who write about Apple products for a ‘cognitively constructed’ audience, an imagined group of readers who may not actually read the blog[4]. As a result, their tone and language might differ from those on other platforms. On the other hand, on goal-oriented websites like Airbnb, hosts may be highly conscious of their self-presentation and of how their audience might interpret it. Our other hypotheses are as follows:

H2: Twitter brand accounts use more confident and positive language, are more analytical, and show less emotion.
H3: Language style and expressed emotions differ between males and females.
H4: Airbnb hosts’ descriptions show more positive emotion, extraversion, and openness than Twitter descriptions.
H5: TUAW blog posts are more analytical and show less emotion than tweets.

Data Sets

We used data from multiple sources: the gender classifier data from CrowdFlower (https://www.crowdflower.com/data-for-everyone/), self-descriptions of Airbnb hosts, and TUAW blog posts (http://socialcomputing.asu.edu/datasets/TUAW).

Gender Classifier Data: The data set was used to train a CrowdFlower AI gender predictor. Contributors were asked to view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a username, a random tweet, account profile and image, location, and even link and sidebar color.
Self-descriptions of Airbnb hosts: We scraped 20,000 users on Airbnb and got 1,000 self-descriptions of hosts.
TUAW blog posts: The dataset consists of blog posts crawled from TUAW. The blog site consists of a closed community of bloggers, where other users are allowed to comment on the blog posts. The dataset consists of blog posts from the period January 2004 till February 2007, in addition to metadata like the number of inlinks. We analyzed 100 blog posts.

Processing the Data

Filtering Twitter Dataset

The Twitter dataset was filtered to include only profiles that the annotators labelled as male, female, or brand with 100% confidence.

import pandas as pd

# Load the raw Twitter, Airbnb, and TUAW blog data
twitusers = pd.read_csv('data.in/gender-classifier-DFE-791531.csv')
abnbusers = pd.read_csv('data.in/cm_airbnb1k.csv', skip_blank_lines=True)
blgusers = pd.read_csv('data.in/blogs2k.csv')

# Keep only Twitter profiles whose label has full annotator confidence
males = twitusers[(twitusers['gender']=='male') & (twitusers['gender:confidence']==1)]
females = twitusers[(twitusers['gender']=='female') & (twitusers['gender:confidence']==1)]
brands = twitusers[(twitusers['gender']=='brand') & (twitusers['gender:confidence']==1)]
print 'all twusers:', len(twitusers)
print 'num males  :' ,len(males)
print 'num females:', len(females)
print 'num brands :', len(brands)
print 'aBnB users :', len(abnbusers)
print 'num blogs  :', len(blgusers)

all twusers: 20050
num males  : 4653
num females: 5367
num brands : 3784
aBnB users : 1042
num blogs  : 1830
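The filter above keeps a row only when its label matches and its `gender:confidence` equals 1. The same boolean-mask pattern can be sketched on a toy frame. The rows below are invented for illustration, `filter_confident` is a hypothetical helper that is not part of the notebook, and the sketch is written for Python 3:

```python
import pandas as pd

def filter_confident(df, label, min_conf=1.0):
    """Return rows labelled `label` whose annotator confidence is at least `min_conf`."""
    mask = (df['gender'] == label) & (df['gender:confidence'] >= min_conf)
    return df[mask]

# Invented rows, for illustration only
toy = pd.DataFrame({
    'gender': ['male', 'female', 'brand', 'male', 'unknown'],
    'gender:confidence': [1.0, 0.67, 1.0, 0.35, 1.0],
})

confident_males = filter_confident(toy, 'male')
confident_brands = filter_confident(toy, 'brand')
print(len(confident_males), len(confident_brands))
```

Lowering `min_conf` would trade label quality for sample size; the analysis in this report uses the strictest setting.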

IBM Watson Tone Analyzer API

We used the IBM Watson™ Tone Analyzer Service to extract emotions, social tendencies, and writing style from our data. The Tone Analyzer uses linguistic analysis to detect and interpret emotional, social, and language cues found in text. Emotions include anger, fear, joy, sadness, and disgust. Social tendencies are the Big Five personality traits: openness, conscientiousness, extraversion, agreeableness, and emotional range (neuroticism). Writing styles include confident, analytical, and tentative.

Refer to Appendix A for the analysis code and further description.

Making Sense of Data

We load the tone-analysed datasets into pandas dataframes and use plotly to visualize them.

import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from plotly.tools import FigureFactory as FF
import numpy as np

%matplotlib inline
%matplotlib notebook

#sign in to plotly
py.sign_in('sas', 'ir9n78uhxq')

# Read in the data
df_md = pd.read_csv('./data.ibm/descriptions_male.csv')
df_fd = pd.read_csv('./data.ibm/descriptions_female.csv')
df_bd = pd.read_csv('./data.ibm/descriptions_brand.csv')

df_mt = pd.read_csv('./data.ibm/tweets_male.csv')
df_ft = pd.read_csv('./data.ibm/tweets_female.csv')
df_bt = pd.read_csv('./data.ibm/tweets_brand.csv')

dfabnb_ad= pd.read_csv('./data.ibm/airbnb_ad.csv')
dfabnb_adr= pd.read_csv('./data.ibm/airbnb_ad_rand.csv')
dfblg_at= pd.read_csv('./data.ibm/dfblg_at.csv')
dfblg_atr= pd.read_csv('./data.ibm/dfblg_atr.csv')
#Union Male and Female data to get both gender dataframes
framesTD=[df_md, df_fd, df_bd]
df_ad= pd.concat(framesTD) # males , females and brands together
framesTT=[df_mt, df_ft, df_bt]
df_at= pd.concat(framesTT)

framesTD=[df_md, df_fd]
df_mfd= pd.concat(framesTD) #Twitter males and females together
framesTT=[df_mt, df_ft]
df_mft= pd.concat(framesTT)

print 'Tone Analysed Data \n------------------'
print 'all twusers:', len(df_ad)
print 'num males  :' ,len(df_md)
print 'num females:', len(df_fd)
print 'num brands :', len(df_bd)
print 'aBnB users :', len(dfabnb_ad)
print 'num blogs  :', len(dfblg_at)
print '\t    -----'
print 'total for viz:' , len(df_ad)+len(dfabnb_ad)+len(dfblg_at)
dfabnb_adr.tail(3)

Tone Analysed Data 
------------------
all twusers: 12098
num males  : 4647
num females: 5349
num brands : 2102
aBnB users : 1042
num blogs  : 1830
	    -----
total for viz: 14970

For AirBnB and blogs we additionally analysed a random sentence from each profile or post. If the document consists of only a single sentence or phrase, the whole document is treated as the sentence.
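The selection rule can be sketched without calling the API. The sketch assumes the layout produced by `flattenResponseFull` in Appendix A, where the first entry is the document-level result and any further entries are sentence-level results. `pick_random_sentence` is a hypothetical name, the sample rows are illustrative, and the code is written for Python 3:

```python
from random import Random

def pick_random_sentence(rows, rng):
    """rows[0] is the document-level result; rows[1:] are sentence-level results.

    Returns a randomly chosen sentence-level row when one exists,
    otherwise falls back to the document-level row.
    """
    if len(rows) > 1:
        return rows[rng.randint(1, len(rows) - 1)]
    return rows[0]

rng = Random(0)  # seeded so the draw is repeatable

# Illustrative rows only; real rows also carry the tone scores
single = [{'level': 'Document', 'text': 'Welcome!'}]
multi = [
    {'level': 'Document'},
    {'level': 'Sentence', 'text': 'I love NYC!'},
    {'level': 'Sentence', 'text': 'Hope you enjoy your stay!'},
]

picked_single = pick_random_sentence(single, rng)
picked_multi = pick_random_sentence(multi, rng)
print(picked_single['level'], picked_multi['level'])
```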

Boxplots: We chose box plots as a simple way of visualising the differences in each aspect of tone. The dashed lines in the figures show the mean and standard deviation.
Histograms: Histograms were used to further visualize the distribution of each aspect of tone.
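As a rough guide to reading the box plots, the quantities each box encodes can be computed directly. This is a minimal sketch using the Python 3 standard library on invented scores. The quartile convention used here (`method='inclusive'`) is an assumption and may differ slightly from the one plotly applies, and `box_summary` is a hypothetical helper:

```python
import statistics

def box_summary(scores):
    """Return the quartiles, mean, and standard deviation of a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method='inclusive')
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        'mean': statistics.mean(scores),
        'sd': statistics.pstdev(scores),  # population standard deviation
    }

# Invented tone scores in [0, 1], for illustration only
scores = [0.1, 0.2, 0.4, 0.4, 0.9]
summary = box_summary(scores)
print(summary)
```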

import warnings
warnings.filterwarnings('ignore')

def extractBoxTrace(df,cat, xgroups, ycolNames, color='#3D9970'):
    xs=[]
    ys=[]
    for i,xgroup in enumerate(xgroups):
        for index, row in df.iterrows():
            xs.append(xgroup)
            ys.append(row[ycolNames[i]])
    trace = go.Box(
                y=ys,
                x=xs,
                name=cat,
                marker=dict(
                    color= color
                ),
                boxmean='sd'
             )
    return trace

def extractHistTrace(xs, name):
    trace1 = go.Histogram(
        name=name,
        x=xs,
        xbins=dict(
                start=0,
                end=1,
                size=0.05
                )
    )
    return trace1
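`extractBoxTrace` builds its x and y lists by visiting every row once per group. A column-wise variant produces the same lists in the same order and may be faster on large frames. This is only a sketch of an alternative, not the code used for the figures; `box_xy` is a hypothetical helper and the toy frame is invented (Python 3):

```python
import pandas as pd

def box_xy(df, xgroups, ycol_names):
    """Build the long-format x/y lists for a grouped box plot, one column at a time."""
    xs, ys = [], []
    for group, col in zip(xgroups, ycol_names):
        values = df[col].tolist()
        xs.extend([group] * len(values))  # repeat the group label once per score
        ys.extend(values)
    return xs, ys

# Invented scores, for illustration only
toy = pd.DataFrame({
    'openness_big5': [0.1, 0.2],
    'extraversion_big5': [0.9, 0.8],
})
xs, ys = box_xy(toy, ['openness', 'extraversion'],
                ['openness_big5', 'extraversion_big5'])
print(xs, ys)
```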

Twitter Profile Descriptions: Compare Big 5

trace0=extractBoxTrace(df_md, 'male', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
trace1=extractBoxTrace(df_fd, 'female', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF4136' )
trace2=extractBoxTrace(df_bd, 'brand', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF851B' )

data = [trace0, trace1, trace2]
layout = go.Layout(
    yaxis=dict(
        title='Big 5 Score',
        zeroline=False
    ),
    boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='big5')

Figure 1 Big 5 Twitter Profile Descriptions

According to public commitment theory, participants in public conditions should use language that echoes a sense of commitment to the presentation, such as language related to certainty (e.g., “absolutely”, “forever”, “always”). Because brand Twitter accounts are more public than personal accounts, we expect them to conform more closely to this theory. Figure 1 shows that the conscientiousness and neuroticism of brand descriptions differ significantly from those of male and female descriptions. Higher conscientiousness suggests that brand accounts are more likely to be perceived as disciplined, dutiful, achievement-striving, confident, driven, or organized. Lower neuroticism means that they are more likely to be perceived as calm, bland, content, relaxed, unconcerned, or careful than personal accounts.

Tweets: Compare Emotions

traceE3=extractBoxTrace(df_mt, 'male-tweet', ['anger', 'disgust', 'fear', 'joy' ,'sadness' ],['anger', 'disgust', 'fear', 'joy' ,'sadness' ] )
traceE4=extractBoxTrace(df_ft, 'female-tweet', ['anger', 'disgust', 'fear', 'joy' ,'sadness' ],['anger', 'disgust', 'fear', 'joy' ,'sadness' ], '#FF4136' )
traceE5=extractBoxTrace(df_bt, 'brand-tweet', ['anger', 'disgust', 'fear', 'joy' ,'sadness' ],['anger', 'disgust', 'fear', 'joy' ,'sadness' ], '#FF851B' )

data = [traceE3, traceE4, traceE5]
layout = go.Layout(
    yaxis=dict(
        title='Emotion Score',
        zeroline=False
    ),
    boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='box-grouped')

Figure 2 Emotions of Twitter Accounts

Figure 2 shows that there is not much difference between the emotion scores of male, female and brand tweets and descriptions. It is important to note that the emotion scores only reveal the relative proportion of the five emotions expressed in a tweet or description. The figure suggests that the composition of emotions is basically the same across groups, while Figure 1 indicates that brand accounts are less emotional overall, as reflected in their lower neuroticism (emotional range) scores.

Twitter: Confidence in Language Tone

x1 = df_md['confident'].values
x2 = df_fd['confident'].values
x3 = df_bd['confident'].values

hist_data = [x1, x2, x3] # 
group_labels = ['twitter-male', 'twitter-female', 'twitter-brand'] #

trace1=extractHistTrace(x1, 'Male')
trace2=extractHistTrace(x2, 'Female')
trace3=extractHistTrace(x3, 'Brand')  # brand scores are in x3

data = [trace1, trace2, trace3]
layout = go.Layout(
    barmode='group'
)

# fig = go.Figure(data=data, layout=layout)
# py.iplot(fig, filename='twitter-confidence-sh')

fig = FF.create_distplot(hist_data, group_labels, bin_size=0.05, curve_type='kde')
py.iplot(fig, filename='Conf: Twitter Distplot', validate=False)

Figure 3 Confidence in Language Tone

Figure 3 shows that there is not much difference between the confidence scores of male, female and brand tweets and descriptions.

AirBnB vs Twitter Descriptions: Big 5

We first compared the tone of the complete AirBnB profile description with the Twitter profile description. We then compared random sentences from AirBnB profiles with Twitter profiles, to account for the character limit on Twitter descriptions.

traceB0=extractBoxTrace(df_mfd, 'twitter users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
traceB1=extractBoxTrace(df_bd, 'twitter brands', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'],'#FF851B' )
traceB2=extractBoxTrace(dfabnb_ad, 'airbnb users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF0000' )
data = [traceB0, traceB1, traceB2]
layout = go.Layout(
    yaxis=dict(
        title='Big 5 Score',
        zeroline=False
    ),
    boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='twVsABnB_big5_box')

Figure 4-1 Big 5 AirBnB vs Twitter Descriptions: complete profile text

traceB0=extractBoxTrace(df_mfd, 'twitter users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
traceB1=extractBoxTrace(df_bd, 'twitter brands', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'],'#FF851B' )
traceB2=extractBoxTrace(dfabnb_adr, 'airbnb users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF0000' )
data = [traceB0, traceB1, traceB2]
layout = go.Layout(
    yaxis=dict(
        title='Big 5 Score',
        zeroline=False
    ),
    boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='twVsABnB_big5rand_box')

Figure 4-2 Big 5 AirBnB vs Twitter Descriptions: Profile Text, Random Sentence

We were surprised to see that Airbnb host descriptions scored drastically low in openness, so we looked into the data. Short descriptions like “Welcome Home!”, “Welcome!”, and “Hope you enjoy your stay!” get openness scores of 0 but agreeableness and extraversion scores above 0.99. It is quite common for Airbnb hosts to use short greetings as a self-description to show their hospitality.

Another category of text with low openness but high agreeableness and extraversion is sentences expressing love, for example “I love traveling”, “I love my dog!!”, and “I love NYC!”. “Thank you” is also low in openness but high in agreeableness and extraversion. Longer sentences like “We enjoy meeting new people and making them feel welcome in our beautiful city.” score at around the same level in those three categories.

We then looked into sentences that are high in openness: “artist, photographer, entrepreneur, gypsy spirit and butterfly.”, “The people, the ambiance, the size, the summers, and yes even the rain.”, and “Owner of the property”. These sentences read more like poems, with no grammatical subject or verb. We think the reason might be that the Tone Analyzer defines openness as openness to ideas, so more straightforward language is considered less open. Airbnb hosts are much higher in extraversion and agreeableness; the extraversion result is in line with H4. There is a selection bias here, since extraverted and agreeable people are more likely to become Airbnb hosts.

The following table shows the p-values from hypothesis tests for a difference in means.

Comparison	Conscientiousness	Extraversion	Agreeableness	Neuroticism
male/female	0.0012	0.0056	0.0014	0.4122
male/brand	0	0	0	0
brand/female	0	0	0	0
individual/airbnb	0	0	0	0
individual/brand	0	0	0	0
brand/airbnb	0	0	0	0

The comparison labels are:
male = Twitter male users
female = Twitter female users
brand = Twitter brand pages
individual = Twitter Males and Females
airbnb = AirBnB users

Because the samples are large, the p-values are almost always below 0.05.
The differences in mean values are significant in all tested Big Five dimensions except neuroticism between Twitter male and female users.
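With samples this large, even a small difference in means yields a very small p-value, so it helps to look at an effect size alongside the test. The report does not state which test produced the table, so the following is only an illustrative sketch: it computes Welch's statistic with a normal approximation to the two-sided p-value (reasonable for large samples) together with Cohen's d, on synthetic scores. The function names and the data are assumptions, and the code is written for Python 3:

```python
import math
import random
import statistics

def welch_z_test(a, b):
    """Welch's statistic for a difference in means, with a normal-approximation two-sided p-value."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    p = math.erfc(abs(t) / math.sqrt(2))  # P(|Z| > |t|) for a standard normal Z
    return t, p

def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Synthetic scores: two large groups whose true means differ only slightly
rng = random.Random(42)
group_a = [rng.gauss(0.50, 0.20) for _ in range(10000)]
group_b = [rng.gauss(0.52, 0.20) for _ in range(10000)]

t_stat, p_value = welch_z_test(group_a, group_b)
effect = cohens_d(group_a, group_b)
print('t = %.2f, p = %.2g, d = %.3f' % (t_stat, p_value, effect))
```

A significant p-value paired with a small |d| corresponds to the situation described above: the difference is detectable but not drastic.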

AirBnB vs Twitter Descriptions: Language Style

x1 = df_mfd['confident'].values
x2 = df_bd['confident'].values
x3 = dfabnb_ad['confident'].values
hist_data = [x1, x2, x3] # 
group_labels = ['twitter-user', 'twitter-brand', 'airbnb-user'] #

# Create distplot with a kernel density estimate curve (curve_type='kde')
fig = FF.create_distplot(hist_data, group_labels, show_hist=True, bin_size=.2, curve_type='kde')

# Add title
fig['layout'].update(title='Confidence: AirBnB vs Twitter')

# Plot!
py.iplot(fig, filename='Lang_tone_conf_abnb_twit_hist', validate=False)

Figure 5 Language Style AirBnB vs Twitter Descriptions

Tweets vs Blogs: Big 5

traceB0=extractBoxTrace(df_mft, 'twitter user', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
traceB1=extractBoxTrace(df_bt, 'twitter brand', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF851B' )
traceB2=extractBoxTrace(dfblg_atr, 'blogger', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#8acfea' )

data = [traceB0, traceB1, traceB2]
layout = go.Layout(
    yaxis=dict(
        title='Big 5 Score',
        zeroline=False
    ),
    boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='tw-blg-big5')

Figure 6 Big 5 TUAW vs Tweets

The Big 5 characteristics of tweets and of random sentences from the blogs are compared in Figure 6. The side-by-side box plots show that the “personality” of a TUAW author, as perceived from blog posts, is more similar to Twitter brand accounts than to personal accounts. This suggests that the imagined audience of a tech blogger is more similar to that of a brand account. Bloggers might be more aware than individual users of how public their writing is and how large the potential audience is.

Conclusion

In this analysis of online self-presentation, we found that self-presentation varies across contexts, and that such differences can be analyzed through language use. More specifically, Twitter brand accounts are more conscientious and show less emotion; Airbnb hosts’ profile descriptions show significantly more extraversion and agreeableness than Twitter descriptions; and TUAW blog posts are more analytical and show less emotion than tweets.

However, there is not much difference in the language style (confident, analytical, tentative) scores of male, female and brand tweets and descriptions, and even though the difference between male and female language style is statistically significant, it is not drastic. We were also surprised to see that the openness of Airbnb hosts’ descriptions is much lower than that of Twitter users, but that might be caused by the limitations of the Tone Analyzer as our language analysis tool.

References

[1] M. Back, J. Stopfer, S. Vazire, S. Gaddis, S. Schmukle, B. Egloff, and S. Gosling. Facebook Profiles Reflect Actual Personality, Not Self-Idealization. Psychological Science, 21(3):372, 2010.
[2] J. Golbeck, C. Robles, and K. Turner. Predicting Personality with Social Media. CHI '11 Extended Abstracts on Human Factors in Computing Systems, pages 253-262. ACM, 2011.
[3] A. E. Marwick and d. boyd. I Tweet Honestly, I Tweet Passionately: Twitter Users, Context Collapse, and the Imagined Audience. New Media & Society, 13(1):114-133, 2011.
[4] d. boyd. A Blogger’s Blog: Exploring the Definition of a Medium. Reconstruction, 6(4), 2006. URL (consulted July 2009): http://reconstruction.eserver.org/064/boyd.shtml

Appendix A

Calling IBM Watson API

The following code block defines how to call the IBM API and transform the response into simpler flat-map representations.

import json
import pandas as pd
from watson_developer_cloud import ToneAnalyzerV3Beta
import plotly.plotly as py
import plotly.graph_objs as go

%matplotlib inline
%matplotlib notebook

tone_analyzer = ToneAnalyzerV3Beta(
    username='33fa1185-0d94-4fa9-bf8d-33cb4f5328fa',
    password='Gh0Tx2idlHqy',
    version='2016-02-11')

resp= tone_analyzer.tone(text='I am really happy. I may be happy. I might be sad. Wondering why')

def flattenResponseSummary(doc):
    flatMap={}
    for toneCat in doc['document_tone']['tone_categories']:
        for tone in toneCat['tones']:
            flatMap[tone['tone_id']]=tone['score']
    flatMap['level']='Document'
    return flatMap

def flattenResponseFull(doc):
    docTone=flattenResponseSummary(doc)
    sentTones=[]
    for sent in doc['sentences_tone']:
        flatMap={}
        flatMap['sentence_id']=sent['sentence_id']
        flatMap['text']=sent['text']
        flatMap['level']='Sentence'
        analyzed=False
        for toneCat in sent['tone_categories']:
             for tone in toneCat['tones']:
                flatMap[tone['tone_id']]=tone['score']
                analyzed=True                
        if analyzed:
            sentTones.append(flatMap)    
    r=[docTone]
    r.extend(sentTones)    
    return r

flat=flattenResponseFull(resp)
flat[1]

{u'agreeableness_big5': 0.911,
 u'analytical': 0.0,
 u'anger': 0.050725,
 u'confident': 0.0,
 u'conscientiousness_big5': 0.56,
 u'disgust': 0.024891,
 u'extraversion_big5': 0.854,
 u'fear': 0.02272,
 u'joy': 0.904239,
 'level': 'Sentence',
 u'neuroticism_big5': 0.934,
 u'openness_big5': 0.015,
 u'sadness': 0.104346,
 'sentence_id': 0,
 u'tentative': 0.0,
 'text': u'I am really happy.'}
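The flattening step can be tried offline on a hand-built response. The nested structure below is inferred only from the fields that `flattenResponseSummary` reads (`document_tone`, `tone_categories`, `tones`, `tone_id`, `score`); a real response may contain more fields, and the scores here are invented. The function is a renamed copy for illustration, written for Python 3:

```python
def flatten_document_tone(response):
    """Collapse the nested tone categories of a response into one flat dict."""
    flat = {}
    for category in response['document_tone']['tone_categories']:
        for tone in category['tones']:
            flat[tone['tone_id']] = tone['score']
    flat['level'] = 'Document'
    return flat

# Hand-built stand-in for an API response; scores are invented
mock_response = {
    'document_tone': {
        'tone_categories': [
            {'tones': [{'tone_id': 'joy', 'score': 0.9},
                       {'tone_id': 'sadness', 'score': 0.1}]},
            {'tones': [{'tone_id': 'confident', 'score': 0.0}]},
            {'tones': [{'tone_id': 'openness_big5', 'score': 0.015}]},
        ]
    }
}

flat_doc = flatten_document_tone(mock_response)
print(flat_doc)
```

Because every tone id becomes a column name, a list of such dicts converts directly into a pandas dataframe with one row per document.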

The code block below defines how to visualize the response from IBM. The following example visualizes the document-level tone values for Big 5, emotions and language style for the above document.

def plotToneScores(tone):
    emotions=[tone['anger'], tone['disgust'], tone['fear'], tone['joy'], tone['sadness']]
    emotionsLbls=['Anger', 'Disgust', 'Fear', 'Joy', 'Sadness']
    big5=[tone['openness_big5'],tone['conscientiousness_big5'], tone['extraversion_big5'], tone['agreeableness_big5'], tone['neuroticism_big5']]
    big5Lbls=['Openness','Conscientiousness', 'Extraversion' , 'Agreeableness', 'Neuroticism' ]
    style=[tone['analytical'], tone['confident'], tone['tentative']]
    styleLbl=['Analytical','Confident','Tentative']
    trace0=go.Bar(
                name='Big 5',
                x=big5,
                y= big5Lbls,
                orientation = 'h',
                marker= {
                            'color':"rgb(44, 160, 44)"
                        }
            )
    trace1=go.Bar(
                name='Emotions',
                x=emotions,
                y= emotionsLbls,
                orientation = 'h',
                marker= {
                            'color':"rgb(23, 190, 207)"
                        }
            )
    trace2=go.Bar(
                name='Lang Style',
                x=style,
                y= styleLbl,
                orientation = 'h',
                marker= {
                            'color':"rgb(255, 153, 0)"
                        }
            )
    
    data = [trace2, trace1, trace0]    

    return py.iplot(data, filename='horizontal-bar')

plotToneScores(flat[0])

Analysing the data

The code block below defines functions that take CSV files of tweets, profile descriptions, etc. and call the IBM Tone Analyzer on them.
Input: the CSV of tweets, Airbnb descriptions, etc., as a pandas dataframe.
Output: a list of analysis results, one for each row of the input.

def analyzeTweets(df, tweets, descriptions):
    i=0
    for index, row in df.iterrows():
        descText=str(row['description'])
        tweetText=str(row['text'])
        try:
            if len(descText)>0:
                descText=descText.decode('ascii', 'ignore')
                if len(descText)>0:
                    desc=tone_analyzer.tone(text=descText)
                    desc=flattenResponseSummary(desc)
                    desc['_unit_id']=row['_unit_id']
                    descriptions.append(desc)
            if len(tweetText)>0:
                tweetText=tweetText.decode('ascii', 'ignore')
                if len(tweetText)>0:
                    tweet=tone_analyzer.tone(text=tweetText)
                    tweet=flattenResponseSummary(tweet)
                    tweet['_unit_id']=row['_unit_id']
                    tweets.append(tweet)
            i+=1
            print i, row['_unit_id']
        except Exception as e:
            print e.message[:10], i, row['_unit_id']
    return  tweets, descriptions



from random import randint  # used below to pick a random sentence

def analyzeDescriptions(df,descriptions, randSents):    
    for index, row in df.iterrows():
        descText=str(row['description'])
        try:
            if len(descText)>0:
                descText=descText.decode('ascii', 'ignore')
                if len(descText)>0:
                    desc=tone_analyzer.tone(text=descText)
                    desc=flattenResponseFull(desc)
                    for d in desc:
                        d['_unit_id']=row['_unit_id']                    
                    descriptions.append(desc[0])

                    idx=0
                    #get random sentence for long docs
                    if len(desc)>1:                        
                        idx=randint(1,len(desc)-1)
                    randSents.append(desc[idx])
            print row['_unit_id']
        except Exception as e:
            print e.message[:10], index, row['_unit_id']
    return descriptions
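One caveat about the text handling in the functions above: when a cell is missing, pandas typically reads it as NaN, and `str()` of NaN is the string 'nan', which has nonzero length. The `len(...)>0` checks therefore do not skip missing descriptions. A minimal sketch of a cleaning helper that handles this case and strips non-ASCII characters is shown below. `clean_text` is a hypothetical helper that is not part of the notebook, and it is written for Python 3, where the ASCII filtering is expressed as encode-then-decode:

```python
import math

def clean_text(value):
    """Return ASCII-only, stripped text, or '' when the cell is missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''
    text = str(value)
    # Drop characters outside ASCII, then trim surrounding whitespace
    return text.encode('ascii', 'ignore').decode('ascii').strip()

# Invented cell values, for illustration only
examples = [float('nan'), None, 'Caf\u00e9 owner \u2615', '  hello  ']
cleaned = [clean_text(v) for v in examples]
print(cleaned)
```

An empty result could then be used to skip the API call for that row.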

The functions defined above are called on the Twitter, Airbnb and blog data files to produce the tone analysis output.

#
twitusers = pd.read_csv('data.in/gender-classifier-DFE-791531.csv')
abnbusers= pd.read_csv('data.in/cm_airbnb1k.csv', skip_blank_lines=True)
blgusers=pd.read_csv('data.in/blogs2k.csv')

males = twitusers[(twitusers['gender']=='male') & (twitusers['gender:confidence']==1)]
females = twitusers[(twitusers['gender']=='female') & (twitusers['gender:confidence']==1)]
brands =  twitusers[(twitusers['gender']=='brand') & (twitusers['gender:confidence']==1)]
print 'all twusers:', len(twitusers)
print 'num males  :' ,len(males)
print 'num females:', len(females)
print 'num brands :', len(brands)
print 'aBnB users :', len(abnbusers)
print 'num blogs  :', len(blgusers)


abnbDescriptions=[]
abnbRandom=[]
analyzeDescriptions(abnbusers, abnbDescriptions,abnbRandom)

len(abnbDescriptions)
len(abnbRandom)

dfabnb_ad=pd.DataFrame(abnbDescriptions)
dfabnb_adr=pd.DataFrame(abnbRandom)
dfabnb_ad.to_csv('data.out/airbnb_ad.csv')
dfabnb_adr.to_csv('data.out/airbnb_ad_rand.csv')

blgContent=[]
blgRandom=[]
analyzeDescriptions(blgusers, blgContent,blgRandom)
dfblg_at=pd.DataFrame(blgContent)
dfblg_atr=pd.DataFrame(blgRandom)
dfblg_at.to_csv('data.out/dfblg_at.csv')
dfblg_atr.to_csv('data.out/dfblg_atr.csv')

m_descs=[]
m_tweets=[]
# head(10) analyses only the first 10 rows; remove it to process every male profile
male_tweets, male_descs = analyzeTweets(males.head(10), m_tweets, m_descs)

df_md=pd.DataFrame(male_descs)
df_md=df_md.set_index('_unit_id')

df_mt=pd.DataFrame(male_tweets)
df_mt=df_mt.set_index('_unit_id')
# allusers.join(df_md, on='_unit_id')

fm_descs=[]
fm_tweets=[]
female_tweets, female_descs=analyzeTweets(females, fm_tweets, fm_descs)
df_fd=pd.DataFrame(female_descs)
df_fd=df_fd.set_index('_unit_id')

df_ft=pd.DataFrame(female_tweets)
df_ft=df_ft.set_index('_unit_id')


brand_descs=[]
brand_tweets=[]
analyzeTweets(brands, brand_tweets, brand_descs)
df_bd=pd.DataFrame(brand_descs)
df_bd=df_bd.set_index('_unit_id')

df_bt=pd.DataFrame(brand_tweets)
df_bt=df_bt.set_index('_unit_id')

df_md.to_csv('data.out/df_md.csv')
df_mt.to_csv('data.out/df_mt.csv')

df_fd.to_csv('data.out/df_fd.csv')
df_ft.to_csv('data.out/df_ft.csv')

df_bd.to_csv('data.out/df_bd.csv')
df_bt.to_csv('data.out/df_bt.csv')

Appendix B

AirBnB Scraper

We used the Python 'lxml' package to build a scraper that collects user profiles from AirBnB. lxml is a Python library for parsing XML and HTML documents. We keep only host profiles that have a description and at least one review from a guest.

from lxml import html
import requests
from random import randint
import pandas as pd

N=10
USER_PAGE="https://www.airbnb.com/users/show/"
pages=[]
profile_ids=[]
for i in range(0,N):
    if i%100==0: print len(pages)
    profile_id= randint(200000, 899999)
    URL=USER_PAGE+str(profile_id)
    page = requests.get(URL)
    tree = html.fromstring(page.content)
    pages.append(tree)
    profile_ids.append(profile_id)

print 'num pages collected', len(pages)
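The loop above draws each profile id independently with `randint`, so the same id can be drawn more than once and the same profile fetched twice. If unique ids are wanted, one option is to sample without replacement. This is a sketch of that alternative, not the code used to collect the data; `sample_profile_ids` is a hypothetical helper whose id range mirrors the one in the loop (Python 3):

```python
import random

def sample_profile_ids(n, low=200000, high=899999, seed=None):
    """Draw n distinct profile ids from the inclusive range [low, high]."""
    rng = random.Random(seed)
    return rng.sample(range(low, high + 1), n)

ids = sample_profile_ids(10, seed=1)
print(ids)
```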

users=[]
# for i in range(0,1)
for i,tree in enumerate(pages):
    rev_as_host=tree.xpath('//*[@id="reviews"]/div/div[@class="reviews_section as_host"]')
#     print 'is host:', i, len(rev_as_host)>0 
    description= tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[2]/p/text()')
    print description
    if len(description)>0:
        description=description.pop()
    #If non empty description and has reviews as a host    
    if len(description)>0 and rev_as_host:        
        name=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[1]/div[2]/h1/text()').pop().rstrip().lstrip().encode('ascii', 'ignore')
        address=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[1]/div[2]/div[1]/a/text()').pop().rstrip().lstrip()
        since=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[1]/div[2]/div[1]/span/text()').pop().rstrip().lstrip()
        # badges=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[3]/div')
        user={
            '_unit_id':profile_ids[i],
            'profile_id':profile_ids[i],
            'name':name,
            'address': address,
            'since': since,
            'description': description
        }
        users.append(user)

user_df=pd.DataFrame(users)
user_df.tail(2)
# user_df.to_csv('sampleabnb.csv', encoding='utf-8')

	Unnamed: 0	_unit_id	agreeableness_big5	anger	conscientiousness_big5	disgust	extraversion_big5	fear	joy	level	neuroticism_big5	openness_big5	sadness	sentence_id	text
1039	1039	601861	0.994	0.053038	0.560	0.158564	0.963	0.069856	0.194340	Sentence	0.149	0.030	0.211043	1.0	We get both here.
1040	1040	606673	0.632	0.095705	0.243	0.093001	0.784	0.036744	0.510069	Document	0.687	0.346	0.136052	NaN	NaN
1041	1041	472736	0.806	0.017947	0.541	0.025325	0.722	0.301987	0.808816	Document	0.673	0.309	0.037938	NaN	NaN