Sasinda Ruksha <sp2335@cornell.edu> Yilin Xu <yx349@cornell.edu>
May 10, 2016
What people say and share on social media reveals a considerable amount of information about their psychological state and personality. Whether people reveal their personality traits and psychological state differently online versus offline, and in different contexts, has become an important research question, especially with the advent of social media platforms like Twitter and Facebook. There is significant research on how people behave online versus offline, including whether they try to idealize themselves on online platforms. In the case of Facebook, research has shown that the information in users’ profiles reflects their actual personalities, not an “idealized” version of themselves[1]. Furthermore, it has been shown that users’ Big Five personality traits can be predicted from the public information they share on Facebook[2].
“All the world’s a stage, and all the men and women merely players.” — Shakespeare (and Goffman)
We are aware that people use different social media platforms for different purposes, and that they have a sense of audience in every mediated conversation in order to present themselves appropriately[3]. What we are interested in is whether, in general, different kinds of users on different platforms present themselves differently in different contexts. To investigate this, we analyzed data from three platforms: Twitter, Airbnb and The Unofficial Apple Weblog (TUAW). While these platforms usually provide different social contexts, they are similar in that their users’ self-presentations are mostly based on conventional signals.
Our general hypothesis, H1, is that self-presentation varies across contexts. Twitter users may tend to present themselves in fixed, singular, and self-conscious ways when describing themselves, while being more dynamic and “careless” in tweets. TUAW consists of bloggers who write about Apple products for a ‘cognitively constructed’ audience, an imagined group of readers who may not actually read the blog[4]. As a result, their tone and language in general might differ from those on other platforms. On the other hand, on goal-oriented websites like Airbnb, hosts may be highly conscious of their self-presentations and how their audience might interpret them. Our other hypotheses are as follows:
We used data from multiple sources: the gender classifier data from CrowdFlower (https://www.crowdflower.com/data-for-everyone/), self-descriptions of Airbnb hosts, and TUAW blog posts (http://socialcomputing.asu.edu/datasets/TUAW).
The Twitter dataset was filtered to include only profiles that the annotators labeled as male, female, or brand with 100% confidence.
twitusers = pd.read_csv('data.in/gender-classifier-DFE-791531.csv')
abnbusers= pd.read_csv('data.in/cm_airbnb1k.csv', skip_blank_lines=True)
blgusers=pd.read_csv('data.in/blogs2k.csv')
males = twitusers[(twitusers['gender']=='male') & (twitusers['gender:confidence']==1)]
females = twitusers[(twitusers['gender']=='female') & (twitusers['gender:confidence']==1)]
brands = twitusers[(twitusers['gender']=='brand') & (twitusers['gender:confidence']==1)]
print 'all twusers:', len(twitusers)
print 'num males :' ,len(males)
print 'num females:', len(females)
print 'num brands :', len(brands)
print 'aBnB users :', len(abnbusers)
print 'num blogs :', len(blgusers)
We used the IBM Watson™ Tone Analyzer Service to get emotions, social tendencies, and writing style from our data. The Tone Analyzer uses linguistic analysis to detect and interpret emotional, social, and language cues found in text. Emotions include anger, fear, joy, sadness, and disgust. Social tendencies are the Big Five personality traits: openness, conscientiousness, extraversion, agreeableness, and emotional range (neuroticism). Writing styles include confident, analytical, and tentative.
Refer to Appendix A for the analysis code and further description.
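To make the column names used in the rest of this report easier to follow, here is a minimal sketch of how a nested tone response can be collapsed into one flat row. It mirrors the flattening done in Appendix A; the `sample_response` dictionary and its scores are made up for illustration and are not real Tone Analyzer output.

```python
# Hypothetical response fragment: the nesting follows the keys used in
# Appendix A, but the scores are invented for illustration only.
sample_response = {
    'document_tone': {
        'tone_categories': [
            {'tones': [{'tone_id': 'joy', 'score': 0.62},
                       {'tone_id': 'anger', 'score': 0.05}]},
            {'tones': [{'tone_id': 'confident', 'score': 0.80}]},
            {'tones': [{'tone_id': 'openness_big5', 'score': 0.21}]},
        ]
    }
}


def flatten_document_tone(response):
    """Collapse the nested tone categories into one {tone_id: score} dict."""
    flat = {}
    for category in response['document_tone']['tone_categories']:
        for tone in category['tones']:
            flat[tone['tone_id']] = tone['score']
    return flat


flat_row = flatten_document_tone(sample_response)
```

Each flat row of this kind becomes one line of the CSV files loaded below, which is why columns such as `openness_big5` and `confident` can be read directly from the dataframes.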
We load the tone-analysed datasets into pandas dataframes and use plotly in order to visualize them.
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from plotly.tools import FigureFactory as FF
import numpy as np
%matplotlib inline
%matplotlib notebook
# Sign in to plotly (credentials redacted; replace the placeholders with your own)
py.sign_in('YOUR_PLOTLY_USERNAME', 'YOUR_PLOTLY_API_KEY')
# Read in the data
df_md = pd.read_csv('./data.ibm/descriptions_male.csv')
df_fd = pd.read_csv('./data.ibm/descriptions_female.csv')
df_bd = pd.read_csv('./data.ibm/descriptions_brand.csv')
df_mt = pd.read_csv('./data.ibm/tweets_male.csv')
df_ft = pd.read_csv('./data.ibm/tweets_female.csv')
df_bt = pd.read_csv('./data.ibm/tweets_brand.csv')
dfabnb_ad= pd.read_csv('./data.ibm/airbnb_ad.csv')
dfabnb_adr= pd.read_csv('./data.ibm/airbnb_ad_rand.csv')
dfblg_at= pd.read_csv('./data.ibm/dfblg_at.csv')
dfblg_atr= pd.read_csv('./data.ibm/dfblg_atr.csv')
#Union Male and Female data to get both gender dataframes
framesTD=[df_md, df_fd, df_bd]
df_ad= pd.concat(framesTD) # males , females and brands together
framesTT=[df_mt, df_ft, df_bt]
df_at= pd.concat(framesTT)
framesTD=[df_md, df_fd]
df_mfd= pd.concat(framesTD) #Twitter males and females together
framesTT=[df_mt, df_ft]
df_mft= pd.concat(framesTT)
print 'Tone Analysed Data \n------------------'
print 'all twusers:', len(df_ad)
print 'num males :' ,len(df_md)
print 'num females:', len(df_fd)
print 'num brands :', len(df_bd)
print 'aBnB users :', len(dfabnb_ad)
print 'num blogs :', len(dfblg_at)
print '\t -----'
print 'total for viz:' , len(df_ad)+len(dfabnb_ad)+len(dfblg_at)
dfabnb_adr.tail(3)
For AirBnB and blogs we additionally analysed a random sentence from the profile or content document. If the document is only a single sentence or phrase the whole document will be considered the sentence.
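The selection rule described above can be sketched as follows. This is an illustrative helper, not the exact code from Appendix A: `pick_random_sentence` and the sample rows are hypothetical names and values, and we assume the first row holds the document-level tone while any further rows hold sentence-level tones.

```python
import random


def pick_random_sentence(flat_rows, rng=random):
    """Return one sentence-level row chosen at random.

    flat_rows[0] is assumed to be the document-level row and any further
    rows are sentence-level rows. If there are no sentence rows, the
    document row itself is returned.
    """
    if len(flat_rows) > 1:
        return flat_rows[rng.randint(1, len(flat_rows) - 1)]
    return flat_rows[0]


# Made-up rows for illustration only
doc_only = [{'level': 'Document', 'joy': 0.9}]
doc_with_sentences = [
    {'level': 'Document', 'joy': 0.5},
    {'level': 'Sentence', 'sentence_id': 0, 'joy': 0.7},
    {'level': 'Sentence', 'sentence_id': 1, 'joy': 0.3},
]

single = pick_random_sentence(doc_only)
sampled = pick_random_sentence(doc_with_sentences, random.Random(0))
```

Passing a seeded `random.Random` instance makes the sampling repeatable, which can help when the analysis needs to be rerun.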
Boxplots: We chose box plots as a simple way of visualising the differences in each aspect of tone. The dashed lines in the figures show the mean and standard deviation.
Histograms: Histograms were used to further visualize the distribution of each aspect of tone.
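For readers who want the numbers behind a box, the following is a small sketch of the summary statistics a box plot with a mean and standard deviation overlay represents. It assumes Python 3.8 or later for `statistics.quantiles`; the plotting library may use a different quartile or standard deviation convention, so the values are indicative only. `box_summary` and `demo_scores` are hypothetical names with made-up data.

```python
import statistics


def box_summary(scores):
    """Return the quartiles, mean and sample standard deviation of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        'q1': q1,
        'median': median,
        'q3': q3,
        'mean': statistics.mean(scores),
        'sd': statistics.stdev(scores),
    }


# Made-up tone scores in [0, 1], for illustration only
demo_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
summary = box_summary(demo_scores)
```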
import warnings
warnings.filterwarnings('ignore')
def extractBoxTrace(df,cat, xgroups, ycolNames, color='#3D9970'):
    xs=[]
    ys=[]
    for i,xgroup in enumerate(xgroups):
        for index, row in df.iterrows():
            xs.append(xgroup)
            ys.append(row[ycolNames[i]])
    trace = go.Box(
        y=ys,
        x=xs,
        name=cat,
        marker=dict(
            color= color
        ),
        boxmean='sd'
    )
    return trace

def extractHistTrace(xs, name):
    trace1 = go.Histogram(
        name=name,
        x=xs,
        xbins=dict(
            start=0,
            end=1,
            size=0.05
        )
    )
    return trace1
trace0=extractBoxTrace(df_md, 'male', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
trace1=extractBoxTrace(df_fd, 'female', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF4136' )
trace2=extractBoxTrace(df_bd, 'brand', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF851B' )
data = [trace0, trace1, trace2]
layout = go.Layout(
yaxis=dict(
title='Big 5 Score',
zeroline=False
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='big5')
Figure 1 Big 5 Twitter Profile Descriptions
According to public commitment theory, participants in public conditions should use language that echoes a sense of commitment to the presentation, such as language related to certainty (e.g., “absolutely”, “forever”, “always”). Because brand Twitter accounts are more public than personal accounts, they are more in line with public commitment theory. Figure 1 shows that the conscientiousness and neuroticism of brand descriptions are significantly different from those of male and female descriptions. Higher conscientiousness suggests that brand accounts are more likely to be perceived as disciplined, dutiful, achievement-striving, confident, driven, or organized. Lower neuroticism means that they are more likely to be perceived as calm, bland, content, relaxed, unconcerned, or careful than personal accounts.
traceE3=extractBoxTrace(df_mt, 'male-tweet', ['anger', 'disgust', 'fear', 'joy' ,'sadness' ],['anger', 'disgust', 'fear', 'joy' ,'sadness' ] )
traceE4=extractBoxTrace(df_ft, 'female-tweet', ['anger', 'disgust', 'fear', 'joy' ,'sadness' ],['anger', 'disgust', 'fear', 'joy' ,'sadness' ], '#FF4136' )
traceE5=extractBoxTrace(df_bt, 'brand-tweet', ['anger', 'disgust', 'fear', 'joy' ,'sadness' ],['anger', 'disgust', 'fear', 'joy' ,'sadness' ], '#FF851B' )
data = [traceE3, traceE4, traceE5]
layout = go.Layout(
yaxis=dict(
title='Emotion Score',
zeroline=False
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='box-grouped')
Figure 2 Emotions of Twitter Accounts
Figure 2 shows that there isn’t much difference in the emotion scores of male, female and brand tweets and descriptions. It is important to note that the emotion scores only reveal the percentage of each of the five emotions shown in a tweet or description. This figure suggests that the composition of emotions is basically the same, but from Figure 1 we know that brand accounts in general are less emotional.
x1 = df_md['confident'].values
x2 = df_fd['confident'].values
x3 = df_bd['confident'].values
hist_data = [x1, x2, x3] #
group_labels = ['twitter-male', 'twitter-female', 'twitter-brand'] #
trace1=extractHistTrace(x1, 'Male')
trace2=extractHistTrace(x2, 'Female')
trace3=extractHistTrace(x3, 'Brand')
data = [trace1, trace2, trace3]
layout = go.Layout(
barmode='group'
)
# fig = go.Figure(data=data, layout=layout)
# py.iplot(fig, filename='twitter-confidence-sh')
fig = FF.create_distplot(hist_data, group_labels, bin_size=0.05, curve_type='kde')
py.iplot(fig, filename='Conf: Twitter Distplot', validate=False)
Figure 3 Confidence in Language Tone
Figure 3 shows that there isn’t much difference in the confidence scores of male, female and brand tweets and descriptions.
We first compared the tone of the complete AirBnB profile description with that of the Twitter profile description. We then compared random sentences from AirBnB profiles with Twitter profiles to account for the character limit on Twitter descriptions.
traceB0=extractBoxTrace(df_mfd, 'twitter users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
traceB1=extractBoxTrace(df_bd, 'twitter brands', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'],'#FF851B' )
traceB2=extractBoxTrace(dfabnb_ad, 'airbnb users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF0000' )
data = [traceB0, traceB1, traceB2]
layout = go.Layout(
yaxis=dict(
title='Big 5 Score',
zeroline=False
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='twVsABnB_big5_box')
Figure 4-1 Big 5 AirBnB vs Twitter Descriptions: complete profile text
traceB0=extractBoxTrace(df_mfd, 'twitter users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
traceB1=extractBoxTrace(df_bd, 'twitter brands', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'],'#FF851B' )
traceB2=extractBoxTrace(dfabnb_adr, 'airbnb users', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF0000' )
data = [traceB0, traceB1, traceB2]
layout = go.Layout(
yaxis=dict(
title='Big 5 Score',
zeroline=False
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='twVsABnB_big5rand_box')
Figure 4-2 Big 5 AirBnB vs Twitter Descriptions: Profile Text, Random Sentence
We were surprised to see that Airbnb host descriptions scored drastically low in openness, so we looked into the data. Short descriptions like “Welcome Home!”, “Welcome!”, and “Hope you enjoy your stay!” get openness scores of 0 but agreeableness and extraversion scores above 0.99. It is quite common for Airbnb hosts to use short greetings as a self-description to show their hospitality.
Another category of text that is low in openness but high in agreeableness and extraversion is sentences expressing love, for example “I love traveling”, “I love my dog!!”, and “I love NYC!”. “Thank you” is also low in openness but high in agreeableness and extraversion. Longer sentences like “We enjoy meeting new people and making them feel welcome in our beautiful city.” are around the same score level in those three categories.
We then looked into sentences that are high in openness: “artist, photographer, entrepreneur, gypsy spirit and butterfly.”, “The people, the ambiance, the size, the summers, and yes even the rain.”, and “Owner of the property”. We found that these sentences read more like poems, often missing a subject or a verb. We think the reason might be that the Tone Analyzer defines openness as openness to ideas, so more straightforward language is considered less open. Airbnb hosts are much higher in extraversion and agreeableness, as we mentioned in H4. There is a bias here, as extraverted and agreeable people are more likely to become Airbnb hosts.
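The kind of inspection described above can be sketched as a simple filter over the flattened tone rows. This is an illustration under stated assumptions: `low_openness_high_warmth`, the thresholds, and the sample rows are hypothetical, and the scores are invented to resemble the pattern we observed rather than copied from the data.

```python
def low_openness_high_warmth(rows, low=0.01, high=0.99):
    """Select rows with near-zero openness but very high
    agreeableness and extraversion scores."""
    return [
        row for row in rows
        if row['openness_big5'] <= low
        and row['agreeableness_big5'] > high
        and row['extraversion_big5'] > high
    ]


# Made-up scores for illustration only
sample_rows = [
    {'text': 'Welcome Home!', 'openness_big5': 0.0,
     'agreeableness_big5': 0.995, 'extraversion_big5': 0.993},
    {'text': 'Owner of the property', 'openness_big5': 0.90,
     'agreeableness_big5': 0.30, 'extraversion_big5': 0.20},
]

greetings = low_openness_high_warmth(sample_rows)
```

The same condition could be written as a boolean mask on the pandas dataframes used elsewhere in this report.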
The following table shows the p values from hypothesis tests for a difference in means.
Comparison | Openness | Conscientiousness | Extraversion | Agreeableness | Neuroticism |
---|---|---|---|---|---|
male/female | 0 | 0.0012 | 0.0056 | 0.0014 | 0.4122 |
male/brand | 0 | 0 | 0 | 0 | 0 |
brand/female | 0 | 0 | 0 | 0 | 0 |
individual/airbnb | 0 | 0 | 0 | 0 | 0 |
individual/brand | 0 | 0 | 0 | 0 | 0 |
brand/airbnb | 0 | 0 | 0 | 0 | 0 |
The comparison labels are:
male = Twitter male users
female = Twitter female users
brand = Twitter brand pages
individual = Twitter Males and Females
airbnb = AirBnB users
Since the number of samples is large, the p values are almost always lower than 0.05.
The differences in mean values are significant for all Big Five dimensions, except for neuroticism between Twitter male and female users.
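As an illustration of a test for a difference in means, here is a minimal sketch of a two-sided, large-sample z-test with unequal variances. It is an assumption for illustration, not necessarily the test used to produce the table; `mean_difference_test` and the demo groups are hypothetical, and the scores are made up.

```python
import math
import statistics


def mean_difference_test(sample_a, sample_b):
    """Two-sided large-sample z-test for a difference in means.

    Returns (difference in means, z statistic, p value). The normal
    approximation is only reasonable when both samples are large.
    """
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    z = (mean_a - mean_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return mean_a - mean_b, z, p_value


# Made-up scores for illustration only
group_a = [0.1, 0.2, 0.3, 0.4] * 50
group_b = [score + 0.2 for score in group_a]

diff, z, p = mean_difference_test(group_a, group_b)
```

Reporting the difference in means next to the p value makes it easier to judge whether a statistically significant difference is also a large one, which matters when the sample size is this big.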
x1 = df_mfd['confident'].values
x2 = df_bd['confident'].values
x3 = dfabnb_ad['confident'].values
hist_data = [x1, x2, x3] #
group_labels = ['twitter-user', 'twitter-brand', 'airbnb-user'] #
# Create distplot with curve_type set to 'kde' (kernel density estimate)
fig = FF.create_distplot(hist_data, group_labels, show_hist=True, bin_size=.2, curve_type='kde')
# Add title
fig['layout'].update(title='Confidence: AirBnB vs Twitter')
# Plot!
py.iplot(fig, filename='Lang_tone_conf_abnb_twit_hist', validate=False)
Figure 5 Language Style AirBnB vs Twitter Descriptions
traceB0=extractBoxTrace(df_mft, 'twitter user', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'] )
traceB1=extractBoxTrace(df_bt, 'twitter brand', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#FF851B' )
traceB2=extractBoxTrace(dfblg_atr, 'blogger', ['openness','conscientiousness', 'extraversion' , 'agreeableness', 'neuroticism' ],['openness_big5','conscientiousness_big5', 'extraversion_big5', 'agreeableness_big5', 'neuroticism_big5'], '#8acfea' )
data = [traceB0, traceB1, traceB2]
layout = go.Layout(
yaxis=dict(
title='Big 5 Score',
zeroline=False
),
boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='tw-blg-big5')
Figure 6 Big 5 TUAW vs Tweets
The Big Five characteristics of tweets and of random sentences from the blogs are compared in Figure 6. The side-by-side box plots show that the “personality” of a TUAW author as perceived from blog posts is more similar to that of Twitter brand accounts than to that of personal accounts. This suggests that the imagined audience of a tech blogger is more similar to that of a brand account. Bloggers might be more aware of the publicity and size of their potential audience in general than individual users are.
In this analysis of online self-presentations, we found that self-presentation varies across contexts. Such differences can be analyzed from language usage. To be more specific, Twitter brand accounts are more conscientious and show less emotion; Airbnb hosts’ profile descriptions show significantly more extraversion and agreeableness than Twitter descriptions; and TUAW blogs are more analytical and show less emotion than tweets.
However, there is not much difference in the language style (confident, analytical, tentative) scores of male, female and brand tweets and descriptions, and even though there is a statistically significant difference between male and female language style, the difference is not drastic. We were also surprised to see that the openness of Airbnb hosts’ descriptions is much lower than that of Twitter users, but that might be caused by the limitations of using the Tone Analyzer as our language analysis tool.
[1] Back M, Stopfer J, Vazire S, Gaddis S, Schmukle S, Egloff B, Gosling S. Facebook profiles reflect actual personality, not self-idealization. Psychological Science, 2010, 21(3): 372.
[2] Golbeck J, Robles C, Turner K. Predicting personality with social media. CHI'11 Extended Abstracts on Human Factors in Computing Systems. ACM, 2011: 253-262.
[3] Marwick A E, boyd d. I tweet honestly, I tweet passionately: Twitter users, context collapse, and the imagined audience. New Media & Society, 2011, 13(1): 114-133.
[4] boyd d. A blogger’s blog: Exploring the definition of a medium. Reconstruction, 2006, 6(4). Available at: http://reconstruction.eserver.org/064/boyd.shtml (consulted July 2009).
import json
import pandas as pd
from watson_developer_cloud import ToneAnalyzerV3Beta
import plotly.plotly as py
import plotly.graph_objs as go
%matplotlib inline
%matplotlib notebook
# Service credentials redacted; replace the placeholders with your own
tone_analyzer = ToneAnalyzerV3Beta(
    username='YOUR_WATSON_USERNAME',
    password='YOUR_WATSON_PASSWORD',
    version='2016-02-11')
resp= tone_analyzer.tone(text='I am really happy. I may be happy. I might be sad. Wondering why')
def flattenResponseSummary(doc):
    flatMap={}
    for toneCat in doc['document_tone']['tone_categories']:
        for tone in toneCat['tones']:
            flatMap[tone['tone_id']]=tone['score']
    flatMap['level']='Document'
    return flatMap

def flattenResponseFull(doc):
    docTone=flattenResponseSummary(doc)
    sentTones=[]
    for sent in doc['sentences_tone']:
        flatMap={}
        flatMap['sentence_id']=sent['sentence_id']
        flatMap['text']=sent['text']
        flatMap['level']='Sentence'
        analyzed=False
        for toneCat in sent['tone_categories']:
            for tone in toneCat['tones']:
                flatMap[tone['tone_id']]=tone['score']
                analyzed=True
        if analyzed:
            sentTones.append(flatMap)
    r=[docTone]
    r.extend(sentTones)
    return r
flat=flattenResponseFull(resp)
flat[1]
The code block below defines how to visualize the response from IBM. The following example visualizes the document-level tone values for the Big Five, emotions and language style for the document above.
def plotToneScores(tone):
    emotions=[tone['anger'], tone['disgust'], tone['fear'], tone['joy'], tone['sadness']]
    emotionsLbls=['Anger', 'Disgust', 'Fear', 'Joy', 'Sadness']
    big5=[tone['openness_big5'],tone['conscientiousness_big5'], tone['extraversion_big5'], tone['agreeableness_big5'], tone['neuroticism_big5']]
    big5Lbls=['Openness','Conscientiousness', 'Extraversion' , 'Agreeableness', 'Neuroticism' ]
    style=[tone['analytical'], tone['confident'], tone['tentative']]
    styleLbl=['Analytical','Confident','Tentative']
    trace0=go.Bar(
        name='Big 5',
        x=big5,
        y= big5Lbls,
        orientation = 'h',
        marker= {
            'color':"rgb(44, 160, 44)"
        }
    )
    trace1=go.Bar(
        name='Emotions',
        x=emotions,
        y= emotionsLbls,
        orientation = 'h',
        marker= {
            'color':"rgb(23, 190, 207)"
        }
    )
    trace2=go.Bar(
        name='Lang Style',
        x=style,
        y= styleLbl,
        orientation = 'h',
        marker= {
            'color':"rgb(255, 153, 0)"
        }
    )
    data = [trace2, trace1, trace0]
    return py.iplot(data, filename='horizontal-bar')

plotToneScores(flat[0])
The code block below defines functions that take CSV files of tweets, profile descriptions, etc. and call the IBM Tone Analyzer on them.
Input: the CSV of tweets, Airbnb descriptions, etc., as a pandas dataframe.
Output: a list of analysis results, one for each row in the input.
def analyzeTweets(df, tweets, descriptions):
    i=0
    for index, row in df.iterrows():
        descText=str(row['description'])
        tweetText=str(row['text'])
        try:
            if len(descText)>0:
                descText=descText.decode('ascii', 'ignore')
                if len(descText)>0:
                    desc=tone_analyzer.tone(text=descText)
                    desc=flattenResponseSummary(desc)
                    desc['_unit_id']=row['_unit_id']
                    descriptions.append(desc)
            if len(tweetText)>0:
                tweetText=tweetText.decode('ascii', 'ignore')
                if len(tweetText)>0:
                    tweet=tone_analyzer.tone(text=tweetText)
                    tweet=flattenResponseSummary(tweet)
                    tweet['_unit_id']=row['_unit_id']
                    tweets.append(tweet)
            i+=1
            print i, row['_unit_id']
        except Exception as e:
            print e.message[:10], i, row['_unit_id']
    return tweets, descriptions
from random import randint  # needed for the random sentence selection below

def analyzeDescriptions(df,descriptions, randSents):
    for index, row in df.iterrows():
        descText=str(row['description'])
        try:
            if len(descText)>0:
                descText=descText.decode('ascii', 'ignore')
                if len(descText)>0:
                    desc=tone_analyzer.tone(text=descText)
                    desc=flattenResponseFull(desc)
                    for d in desc:
                        d['_unit_id']=row['_unit_id']
                    descriptions.append(desc[0])
                    # default to the whole document when there are no sentence rows
                    idx=0
                    #get random sentence for long docs
                    if len(desc)>1:
                        idx=randint(1,len(desc)-1)
                    randSents.append(desc[idx])
            print row['_unit_id']
        except Exception as e:
            print e.message[:10], index, row['_unit_id']
    return descriptions
The functions defined above are called on the Twitter, Airbnb and blog data files to produce the tone analysis output.
#
twitusers = pd.read_csv('data.in/gender-classifier-DFE-791531.csv')
abnbusers= pd.read_csv('data.in/cm_airbnb1k.csv', skip_blank_lines=True)
blgusers=pd.read_csv('data.in/blogs2k.csv')
males = twitusers[(twitusers['gender']=='male') & (twitusers['gender:confidence']==1)]
females = twitusers[(twitusers['gender']=='female') & (twitusers['gender:confidence']==1)]
brands = twitusers[(twitusers['gender']=='brand') & (twitusers['gender:confidence']==1)]
print 'all twusers:', len(twitusers)
print 'num males :' ,len(males)
print 'num females:', len(females)
print 'num brands :', len(brands)
print 'aBnB users :', len(abnbusers)
print 'num blogs :', len(blgusers)
abnbDescriptions=[]
abnbRandom=[]
analyzeDescriptions(abnbusers, abnbDescriptions,abnbRandom)
len(abnbDescriptions)
len(abnbRandom)
dfabnb_ad=pd.DataFrame(abnbDescriptions)
dfabnb_adr=pd.DataFrame(abnbRandom)
dfabnb_ad.to_csv('data.out/airbnb_ad.csv')
dfabnb_adr.to_csv('data.out/airbnb_ad_rand.csv')
blgContent=[]
blgRandom=[]
analyzeDescriptions(blgusers, blgContent,blgRandom)
dfblg_at=pd.DataFrame(blgContent)
dfblg_atr=pd.DataFrame(blgRandom)
dfblg_at.to_csv('data.out/dfblg_at.csv')
dfblg_atr.to_csv('data.out/dfblg_atr.csv')
m_descs=[]
m_tweets=[]
# Note: only the first 10 male profiles are analysed in this call
male_tweets, male_descs =analyzeTweets(males.head(10), m_tweets, m_descs)
df_md=pd.DataFrame(male_descs)
df_md=df_md.set_index('_unit_id')
df_mt=pd.DataFrame(male_tweets)
df_mt=df_mt.set_index('_unit_id')
# allusers.join(df_md, on='_unit_id')
fm_descs=[]
fm_tweets=[]
female_tweets, female_descs=analyzeTweets(females, fm_tweets, fm_descs)
df_fd=pd.DataFrame(female_descs)
df_fd=df_fd.set_index('_unit_id')
df_ft=pd.DataFrame(female_tweets)
df_ft=df_ft.set_index('_unit_id')
brand_descs=[]
brand_tweets=[]
analyzeTweets(brands, brand_tweets, brand_descs)
df_bd=pd.DataFrame(brand_descs)
df_bd=df_bd.set_index('_unit_id')
df_bt=pd.DataFrame(brand_tweets)
df_bt=df_bt.set_index('_unit_id')
df_md.to_csv('data.out/df_md.csv')
df_mt.to_csv('data.out/df_mt.csv')
df_fd.to_csv('data.out/df_fd.csv')
df_ft.to_csv('data.out/df_ft.csv')
df_bd.to_csv('data.out/df_bd.csv')
df_bt.to_csv('data.out/df_bt.csv')
We used the Python 'lxml' package to build a scraper that collects user profiles from AirBnB. lxml is a Python library for parsing XML and HTML documents. We keep only host profiles that have a description and at least one review from a guest.
from lxml import html
import requests
from random import randint
import pandas as pd
N=10
USER_PAGE="https://www.airbnb.com/users/show/"
pages=[]
profile_ids=[]
for i in range(0,N):
    if i%100==0: print len(pages)
    profile_id= randint(200000, 899999)
    URL=USER_PAGE+str(profile_id)
    page = requests.get(URL)
    tree = html.fromstring(page.content)
    pages.append(tree)
    profile_ids.append(profile_id)
print 'num pages collected', len(pages)
users=[]
# for i in range(0,1)
for i,tree in enumerate(pages):
    rev_as_host=tree.xpath('//*[@id="reviews"]/div/div[@class="reviews_section as_host"]')
    # print 'is host:', i, len(rev_as_host)>0
    description= tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[2]/p/text()')
    print description
    if len(description)>0:
        description=description.pop()
    #If non empty description and has reviews as a host
    if len(description)>0 and rev_as_host:
        name=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[1]/div[2]/h1/text()').pop().rstrip().lstrip().encode('ascii', 'ignore')
        address=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[1]/div[2]/div[1]/a/text()').pop().rstrip().lstrip()
        since=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[1]/div[2]/div[1]/span/text()').pop().rstrip().lstrip()
        # badges=tree.xpath('//*[@id="site-content"]/div[2]/div/div[2]/div[3]/div')
        user={
            '_unit_id':profile_ids[i],
            'profile_id':profile_ids[i],
            'name':name,
            'address': address,
            'since': since,
            'description': description
        }
        users.append(user)
user_df=pd.DataFrame(users)
user_df.tail(2)
# user_df.to_csv('sampleabnb.csv', encoding='utf-8')