TheGameTheoreticWebTheGameTheoreticWebWeb(2.0)Mining:AnalyzingSocial Media

AnupamJoshi Joint work with Tim Finin and several students EbiquityGroup,UMBC [email protected] http://ebiquity.umbc.edu/ SocialMediaSocialMedia

• “Socialmediadescribesthe onlinetoolsandplatforms thatpeopleusetoshare opinions,insights, experiences,and

perspectives ” wikipedia • Levelofuserparticipation Twitterment beta andthoughtsharingacross variedtopics StateoftheBlogosphereStateofthe

“Blogosphere isthecollectivetermencompassingallasacommunityor socialnetwork’’ WikipediaNov06 Knowing&InfluencingyourAudienceKnowing&InfluencingyourAudience

• Yourgoalistocampaignforapresidential candidate • Howcanyoutrackthebuzzabouthim/her? • Whataretherelevantcommunitiesand bogs? • Whichcommunitiesaresupporters,whichareskeptical, whichareputoffbythehype? • Isyourcampaignhavinganeffect?Thedesiredeffect? • Whichbloggersareinfluentialwithpoliticalaudience?Of these,whicharealreadyonboardandwhicharelost causes? • Towhomshouldyousenddetailsortalkto? Knowing&InfluencingyourMarketKnowing&InfluencingyourMarket

• YourgoalistomarketApple’siPhone • Howcanyoutrackthebuzzaboutit? • Whataretherelevantcommunitiesand blogs? • Whichcommunitiesarefans,whichare suspicious,whichareputoffbythehype? • Isyouradvertisinghavinganeffect?The desiredeffect? • Whichbloggersareinfluentialinthis market?Ofthese,whicharealready onboardandwhicharelostcauses? • Towhomshouldyousenddetailsor evaluationsamples? OpinionsinSocialMediaOpinionsinSocialMedia

Reader’sPerspective “LastnightinBostonatamiddollarfundraiser Narrative“John Edwards is JohnEdwardsgaveafantasticspeech .Itwas Good!” onethe loosest most charismatic speechesIhave seenhimgive.Manyofthepointsandlinewere fromhisstandardstumpspeechbuttherewasa definiteconfidenceandsenseofhumorinhis Expressed delivery . Opinions HealsodwelledontheenvironmentmorethanI haveseenhimdoinotherspeeches.The environmentalsectionkickedoffwithwitha good and true linethatgotabigovation:“Onglobal 1 Opinionscan warming:AlGorewasright .”……..” influencethevotesof others

[1] http://www.dailykos.com/storyonly/2007/10/4/71218/3740 WhatisInfluence?WhatisInfluence?

“the act or power of producing an effect without apparent exertion of force or direct exercise of command ’’ MeasurableInfluence Theabilityofabloggertopersuadeanotherbloggerto • Takeactionbymeansofcreatinganewpostaboutthetopic andcommentingontheoriginal (text and graph mining) . • Quotetheblogger’sviewsinherpost (text mining) . • Linktotheoriginalpostvia,comments(graph mining) . • Linktothebloggerthroughothermeanslikedel.icio.us,digg, citeULike,Connotea,etc. (graph mining) • Subscribetothefeed (graph mining) . EpidemicbasedInfluenceModelsEpidemicbasedInfluenceModels “Find the minimum set of nodes, influencing which would maximize the infection in the network” Kempetal. Influence Graph

• LinearThresholdModel 1/3 Σ b ≥ θ Active wv v 2 1 wistheactiveneighborofv, 2/5 3 1/3 θv intrinsicthresholdforanode θv 1/3 • GreedyHeuristic 1 1 1/5 • Assignrandomθv • Computeapproxinfluencedset 2/5 5 • Ateachstep,addthenodethat Active increasesthemarginalgaininthesize 1/2 4 Inactive oftheinfluencedset 1/2

Otherapproaches:Latane’,iRank, LimitationsofExistingApproachesLimitationsofExistingApproaches

• Selectednodesmaybelongto First10nodesselectedusing differenttopics GreedyHillClimbingHeuristic • Opinionsorbiasnot http://www.engadget.com http://www.boingboing.net considered http://www.dailykos.com • Informationisspread http://postsecret.blogspot.com http://slashdot.org throughoutthenetworkwithout http://www.albinoblacksheep.com consideringsocialstructure http://www.opinionjournal.com http://profiles.blogdrive.com • Intrinsicthresholdθv isbasedona pseudorandomfunction http://godlessmom.blogspot.com http://thinkprogress.org • Staticviewofthenetwork,no TECH, POLITICS, DAILY/NEWS temporalevidence FindingCommunities(andFeeds)ThatMatterFindingCommunities(andFeeds)ThatMatter Analysis of Bloglines Feeds 83K publicly listed subscribers 2.8M feeds, 500K are unique Before Merge 26K users (35%) use folders to organize subscriptions Data collected in May 2006

TopAdvertisingFeeds 1. Adrants» MarketingandAdvertisingNewsWithAttitude 2. Adverblog:advertisingandnewmediamarketing 3. http://adrag.com 4. adfreak After Merge 5. AdJab 6. MITAdvertisingLab:futureofadvertisingandadvertisingtechnology 7. AdPulp:DailyJuicefromtheAdBiz 8. Advertising/DesignGoodness RelatedTags: advertising marketing media news design http://ftm.umbc.edu FeedsThatMatterFeedsThatMatter

TopFeedsfor“Politics” TopFeedsfor“Knitting” Mergedfolders:“political”,“politicalblogs” Mergedfolders“knittingblogs” • TalkingPointsMemo:byJoshuaMicah Marshall • YarnHarlotknitting • DailyKos:StateoftheNation • WendyKnits! • Eschaton • SeeEunnyKnit! • TheWashingtonMonthly • theblueblog • Wonkette,PoliticsforPeoplewithDirtyMinds • http://instapundit.com/ • Grumperinagoestolocalyarnshopsand HomeDepot • InformedComment • PowerLine • YouKnitWhat?? • AMERICAblog:Becauseagreatnation • MasonDixonKnitting deservesthetruth • knitandtonic • CrooksandLiars • CrazyAuntPurl • http://www.lollygirl.com/blog/ InfluenceinCommunitiesInfluenceinCommunities

http://michellemalkin.com/

http://instapundit.com http://dailykos.com http://volokh.com

http://crooksandliars.com http://rightwingnews.com

Communitiesdetectedusing“Fastalgorithmfordetecting communitystructureinnetworks ”,M.E.J.Newman AuthorityandPopularityAuthorityandPopularity

Authority Popularity • contributestoinfluence • Authorityandpopularityoften • Influencemaybesubjective. treatedequally • Asource,authoritativeinone communitycouldinfluence • Onblogsearchengines,authority anothercommunitynegatively. ismeasuredusinginlinks,which isatbestpopularity Withinacommunity,an • Popularitydoesn’tmeaninfluence authoritativesourcewouldbe influential. • Dilbertisextremelypopularbut notinfluential LinkPolarity/BiasLinkPolarity/Bias

• Linkingaloneisnotindicatorofinfluence • Polaritycanindicatethetypeofinfluence • Consistentnegative/positiveopinionoveraperiodoftime canindicatebias • Linkpolarity/citationsignalcanalsobehelpfulindetermining trust S tr o on tive pi gly StrongNega nio P n os Opinion itiv e

tive MildlyNega opinion DemocratBlog RepublicanBlog OurApproachtoLinkPolarityOurApproachtoLinkPolarity • ShallowSentimentAnalysis • Calculatethenumberofpositivelyoriented( Np )and Negativelyorientedwords( Nn )inthetextwindowaround thelink • ApplyStemming,basiccanonicalization • Corpusincludessimplebigramsoftheform“not_good”

• Polarity=(Np– Nn)/(Np+Nn) • Denominatoractsasanormalizationmechanism

• NaturalLanguageProcessingis shallow ,yetlarge scaleeffectshelp! LinkPolarityExampleLinkPolarityExample

• “StephenColbert'sperformanceattheWhiteHouseCorrespondents' Association dinnerhasgarneredhim huge applause intheblogosphereandalsoonCSpan whereitwasshownmorethanonce.Thoseofuswhohavebeen angry withBush forquitesometimebecauseofhis arrogant andfeckless corruption ofour countrywereevenmorethrilledtoseeandknowthathehadnorecoursebuttosit thereandwatchhisaspirationsforgreatnessbedestroyedbya master ofirony. This willbehis legacy :Istandbythisman.Istandbythismanbecausehestands forthings.Notonlyforthings,hestandsonthings.Thingslikeaircraftcarriersand rubbleandrecentlyfloodedcitysquares.Andthatsendsa strong message,that nomatterwhathappenstoAmerica,shewillalwaysrebound withthemost powerfully stagedphotoopsintheworld.WewhohavebeenwatchingStephen Colbertevisceratepoliticiansthathavecomeonhisshowknewhewasa gifted comedian.ButittookSaturday'sdinnertodemonstratehowincredibly effective the artformColberthaschosenisforexposingthePotemkinRegimeBushandhis henchmenhavecreated.Roveandtherightwingmachinehavenoanswertothe performancebuttosay"it bombed ", "itwasn'tfunny",andtohopethatbyignoring it,thecausticcleansingagentithaslobbedintotheircampcanbecontained.Yet, theRepublicanspinmeistersarethemastersofspin.”[2] This http://dailykos.com/storyonly/2006/4/30/1441/59811 Np=8 , Nn=4 ;Polarity= Np – Nn / Np + Nn =0.33 [2]http://www.pacificviews.org/weblog/archives/001989.html PropagatingInfluencePropagatingInfluence • BasedonworkofGuhaetal [1] formodelingpropagationof trustanddistrust • Framework

• Mij representsinfluence/biasfromuseritoj.(0<=M ij <=1)

• Mij isinitializedtothepolarityfromitoj. • BeliefMatrix M representstheinitialsetofknownbeliefs,andis sparse • GoalistocomputeallunknownvaluesinM • BeliefMatrixafteri th atomicpropagation

• Mi+1 =M i *C i • CombinedOperator T T T • Ci =a 1 *M+a 2 *M *M+a 3 *M +a 4 *M*M • a{0.4,0.4,0.1,0.1}representsweighingfactor [1]GuhaR,KumarR,RaghavanP,TomkinsA.Propagationoftrust anddistrust.In: Proceedings of the Thirteenth International Conference, New York, NY, USA, May 2004. ACM Press, 2004. ExperimentsExperiments

• Domain • PoliticalBlogosphere • DatasetfromBuzzmetrics [2] providespostpostlinkstructureover14million posts • Fewoffthetopicpostshelpaggregation • Potentialbusinessvalue

• ReferenceDataset • HandlabeleddatasetfromLadaAdamicetal [3] classifyingpoliticalblogsinto rightandleftleaningbloggers • Timeframe:2004presidentialelections,over1500blogsanalyzed • Overlapof300blogsbetweenBuzzmetricsandreferencedataset

• Goal • ClassifytheblogsinBuzzmetricsdatasetasdemocratandrepublicanand comparewithreferencedataset [2]LadaA.AdamicandNatalieGlance,"Thepoliticalblogosphereandthe2004USElection",inProceedingsoftheWWW2005Workshop Buzzmetrics– www.buzzmetrics.com EvaluationMetricsEvaluationMetrics

PolarityImprovesClassificationbyalmost26%

Confusion Matrix

• Accuracy=73% • TruePositiveRate(Recall)= 78% • FalsePositiveRate(FP)= 31% • TrueNegativeRate(Recall) =69% • FalseNegativeRate(FN)= 21% • Precision(R)=75% • Precision(D)=72% • ( SampleDataSampleData

• Trustpropagation compensatesforinitial incorrect polarity( DK– AT ) • Trustpropagation doesnot changecorrect polarity( AT DK ) • Trustpropagationassigns correctpolarityfor non existent directlinks( ATIP ) • Numbersin italics problematic( MMAT ) • Improvesentimentdetection? MSMClassificationResultsMSMClassificationResults InterestingObservationsInterestingObservations

• 24outof27sourcesclassified “correctly” • guardian , foxnews , humaneventsonline , mediamatters • MainOutliers “thenation” and “bostonglobe” • Bothleftandrightleaningblogs talknegativelyabout“nytimes” and“abcnews” andpositively about“rawstory” and“examiner” IdentifyingBiasusingKLDivergenceIdentifyingBiasusingKLDivergence 120 SplogsspoilthepartySplogsspoiltheparty 

100 Splogsintop100search results.Eachlinerepresentsa query. 80 hybrid cars

60 NumberofSplogs cholesterol

40

20

0 5 10 15 20 25 30PreliminaryWork:OpinionExtraction 35 40 45 50 55 60 65 70 75 80 85 90 95 100 SPLOGS!SPLOGS!

SPLOGS BY NUMBERS

• 75%ofupdatepings(eBiquity2006) • 20%ofindexedBlogosphere(Umbria2006) • 56%ofupdatepings(eBiquity2007)

56% of all active blogs are splogs! (2007) DATASETSDATASETS • SPLOG2005 • SampledSummer2005atTechnorati • Asearchengine,somanysplogsalreadyremoved • Labeledsamplesof700blogsand700splogs • OnlyBloghomepages • SPLOG2006 • SampledOct2006atWeblogs.com • Labeledsamplesof750blogsand750splogs • Bloghomepages+feeds EXPERIMENTALEXPERIMENTAL SETUPSETUP

• Binaryfeatureencoding • Top50Kselectedusingfrequencycount • SVMs – Defaultparameters – LinearKernel • Nostemmingorstopwordelimination • NaïveBayes • Tenfoldcrossvalidation URL

2005 2006 URL

• 3,4,5charactergramsfromURL • Capturesprofitablecontexts • Highlyeffectiveatstreams • Supportsanextremelylowcostclassifier

2005 2006 WORDS

2005 2006 WORDS

• Words(Text)onaBlog • Previouslyeffectiveintopicclassification • Capturesprofitableadvertisingcontexts • InterestingAuthenticGenreObserved

2005 2006 OUTLINKS

2005 2006 OUTLINKS

• Outlinkstokenizedbynonalphabets • SimilartoURLngrams,likelymorerobust

• Novelfeaturespace

2005 2006 ANCHORS

2005 2006 ANCHORS

• Anchortexttokenizedintowords • Subsumedbywords,butobfuscationdifficult • Capturepersonalizationofpublishingtemplate • Novelfeaturespace

2005 2006 SplogSplog softwaresoftware ?!?!

“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”

“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”

“Holy Grail Of Advertising... “ “Easily Dominate Any Market, Any $ 197 Search Engine, Any Keyword.” Capture HTML Stylistic Patterns in Authentic Blogs HTMLTAGS

2005 2006 HTMLTAGS

• UseHTMLTags– stylisticinformation • Capturesignaturesofsplogsoftware • Fullylanguageindependent • Novelfeaturespace

2005 2006 META-PING SYSTEM

Increasing Cost

PRE-INDEXING SPING FILTER LANGUAGELANGUAGE IDENTIFIERIDENTIFIER Ping Stream

REGULARREGULAR BLACKLISTSBLACKLISTS URLURL HOMEPAGEHOMEPAGE FEEDFEED EXPRESSIONSEXPRESSIONS WHITELISTSWHITELISTS FILTERSFILTERS FILTERSFILTERS FILTERSFILTERS

Ping Stream Ping Stream BLOGBLOG IDENTIFIERIDENTIFIER

IPIP BLACKLISTSBLACKLISTS PINGPING LOGLOG AUTHENTICAUTHENTIC BLOGSBLOGS THEGAMETHEORETICWEBTHEGAMETHEORETICWEB QouthPeterNorvigQouthPeterNorvig • “TheotherthingthatIhadn'treallythoughtaboutwhenwe startedthisallishowkindofgametheoreticthewholething is. Atfirstwethoughtofourselvesasthisobserverofthe Web. ThattheWebwasoutthereandwemadeacopyofit andindexeditandifpeoplewantedtheycouldcomeand accessthatindex.ButitwasjustareflectionoftheWebout there. Andnowweunderstandthatwe'recoevolvingwith theWebandthatwhenwemakeamoveitchangestheWeb andwhentheWebchangeswechangeandgoingbackand forth.Andsoallthesearchengineoptimizersandsoonare watchingandwhatwedoandwewatchwhattheydoandthe Webistheinteractionbetweenus. AndthatissomethingI hadn'tevenconsideredbeforewesawithappening.” • In Singularity 2007 ThisistrueofSocialMediaaswellThisistrueofSocialMediaaswell

• IfIknowthatyouareoutthere,tryingtoinfermy opinions(orpreventmefrom)thenIwill activelyworktodefeatthat.Sincethecontentisuser generated,Icandothatfairlyquickly. • Spamadaptationisaclassicexample. ADAPTIVEADAPTIVE CONTEXTCONTEXT • Changeindistributioninfeaturespace

• ConceptDrift– Seasonal,seeninbothsplogsand

blogs f1,f 2,f 3 ..f m

• AdversarialScenario– seeninsplogs

P( splog(x)/O(x) ) • ConceptDescriptionneedstobeupdated

P( O(x)/splog(x) ) ENSEMBLEENSEMBLE INTUITIONINTUITION

• Streamofunlabeledinstances(drifting) • Baseclassifierswithpotentially independentfeaturespaces • Isanensemble(probabilisticcommittee) ofthecataloguemorerobusttodrift? • Areinstancesclassifiedbytheensemble effectivetoretrainbaseclassifiers(semi supervisedlearning)? • Motivatedbycotraining ENSEMBLEENSEMBLE INTUITIONINTUITION

unlabeled instances

classify

classify classify

retrain

baseclassifiersensemblecommittee updatedclassifiers (probabilistic) POTENTIALPOTENTIAL TOTO ADAPTADAPT

URL Outlink

Anchor Tag

Chargram Wordgrams

Words EXPERIMENTALEXPERIMENTAL SETUPSETUP

• Acatalogofsevenclassifiers • SPLOG2005asbaselabeleddataset • SPLOG2006asevaluationstream • 10KTopFeatures • SVMbasedlearning • SPLOG2006separatedoutintounlabeled streamandtestset(3fold) • F1performancemetricevaluation RESULTSRESULTS –– WORD WORD DRIFTDRIFT RESULTSRESULTS –– ALL ALL CLASSIFIERSCLASSIFIERS ConclusionConclusion

• Using topic, social structure, opinions and temporal information wecandevelopanaccuratemodelfor influence,biasandtrustontheBlogosphere. • Weapplythisframeworkonrealworlddataand describetechniquesforidentifyinginfluenceonthe Blogosphere. • Splogsareabigissue– wehavedevelopedefficient techniquestodetecttheminnearrealtime. • DoestheGameTheoreticNatureofthissystem raisefundamentalnewchallengesforDataMining. BackupSlides GenerativeModelsforBlogosphereGenerativeModelsforBlogosphere

Graphsareeverywhere..andsoarePowerlaws!!

Insimplewords,powerlawcanbeexplainedby“richget richerphenomenon ” OR“20%ofthepopulationholds 80%ofthewealth ”

Consideringwebasagraph: InternetMappingProject [lumeta.com]

FriendshipNetwork[Moody‘01] Scalefreenetwork : Structureandproperties independentofnetworksize Fewhighconnectivity node(hubs)

http://www.prefuse.org/gallery/

Propertiesofinterest(graphtheory) Averagedegreeofnode,degreedistribution,degreecorrelation, distributionof strongly/weaklyconnectedcomponents,clusteringcoefficientand reciprocity GenerativeModelsforBlogosphereGenerativeModelsforBlogosphere

• Reducetimetogeneratedata crawlingtheblogosphereoverafewweeks samplingtherightblogstogetarepresentativesample

• Reducetimeinpreprocessinganddatacleaning removinglinkspointingoutsidethedataset,outsidethetimeframe splogremoval[1]

• Generategraphsofdifferentproperties\sizes averagedegreeofnode,degreedistributions

• Testingofnewalgorithmsforbloggraphs e.g.spreadofinfluenceinblogosphere[2],communitydetection[3]

• Extrapolation howwillfastgrowthaffecttheblogosphereproperties? howdoesthisaffecttheconnectedcomponents?

[1]Kolarietal“Svmsfortheblogosphere:Blogidentificationandsplogdetection,” inAAAISpringSymposiumonComputational ApproachestoAnalyzingWeblogs,2006. [2]Javaetal“Modelingthespreadofinfluenceontheblogosphere,” tech.rep.,UniversityofMaryland,BaltimoreCounty,March2006. [3]Linetal“DiscoveryofBlogCommunitiesbasedonMutualAwareness ExistingApproachesExistingApproaches

Barabasi Albert Erdos-Renyi preferential attachment random model web model

PreferentialAttachment:Thelikelihoodoflinkingtoapopularwebsiteishigher

• Twolevelnetwork:blogandpostlevel • Inlinksandoutlinkstoandfromposts • NEEDtomodelbloggerinteractions

[1] M. Newman, “The structure and function of complex networks,” 2003 [3] R. Albert, Statistical mechanics of complex networks . PhD thesis, 2001. [7] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, “Cascading behavior in large blog graphs”, ICWSM , 2007 [32] X. Shi, B. Tseng, and L. Adamic, “Looking at the blogosphere topology through different lenses” ICWSM , 2007 ModelParametersModelParameters

1. Probabilityofrandomreads( rR )

2. Probabilityofrandomlyselectingwriter( rW )

3. Probabilitythatnewnodedoesnotlinktotheexisting network( pD )

4. Growthexponent( g) – howmanylinksshouldbeaddedeverystep? ProposedModelProposedModel

1.Addnewblognode Step=1 I will not link to Reciprocallinks Step=2 2.Selectwriter anyone! Stronglyconnectedcomponents Subsetofnodeshavingdirected 3.Writersreadblogposts,write pathfromeverynodetoeveryother posts node dailykos Weaklyconnectedcomponents

Should I read Informationflow - randomly? - preferentially? michellemalkin Should I link to someone? If yes who? >> Preferentially based on indegree of node

Writer selection: randomly? OR Random writer Random >> Preferentially based on outdegree? destination PropertiesofSimulatedBlogGraphsPropertiesofSimulatedBlogGraphs

EffectoftextwindowsizeEffectoftextwindowsize

• Optimalwindowsizeis750charactersforourexperiments • Smallwindowsize– Nonopinionatedphrases • LargeWindowsize– Analysisofnonrelatedtext • Specifictoourexperiments,numbersmaynotbegeneralized EffectofatomicpropagationparametersEffectofatomicpropagationparameters

• XaxisBitset={directtrust,co–citation,transposetrustand trustcoupling}={0001 1111} • Eachparametersettoeither0oritsoptimalvalue • Collectiveinfluenceofallparametershelps! AtomicPropagationAtomicPropagation

• DirectPropagation B C • Given:AtrustsBandBtrustsC • Implies:AtrustsC • Operator:M A

• Cocitation • Given:AtrustsBandC,DtrustC A B • Implies:DtrustsB • Operator:M T *M D C AtomicPropagationContd…AtomicPropagationContd…

• TransposeTrust A B • Given:AtrustsBandCtrustsB • Implies:CtrustsA,AtrustsC T • Operator:M C

• TrustCoupling • Given:DtrustsA,AtrustsC A C andBtrustsC • Implies:DtrustsB D B • Operator:M*M T AtomicPropagationcontd…AtomicPropagationcontd…

• CombinedOperator T T T • Ci=a 1 *M+a 2 *M *M+a 3 *M +a 4 *M*M

• ai {0.4,0.4,0.1,0.1}representsweighingfactor

• BeliefMatrixafteri th atomicpropagation

• Mi+1 =M i *C i

• Weperformpropagationstill“convergence” (tillthe newiterationdoesnotchangevaluesinMabove “threshold”) SeparatingBlogWheatfromBlogChaffSeparatingBlogWheatfromBlogChaff Datacleaningfor • Splogremoval • Postcontentidentification

Pre Indexing Steps

NonNon EnglishEnglish TitleTitle andand CollectionCollection ParsingParsing SplogSplog Detection Detection BlogBlog removal removal ContentContent ExtractionExtraction

1 2 3 4

BlogVox:SeparatingBlogWheatfromBlogChaff" ,IJCAI2007Analyticsof NoisyandUnstructuredText DataCleaning:SplogsDataCleaning:Splogs

HostAds Indexaffiliates, Promote pageRank

Plagiarized content

PreliminaryWork:OpinionExtraction EffectofSplogsEffectofSplogs 100

Distributionofsplogsfor ‘spamterms’ inTRECcorpus 80 discount

pregnancy 60

NumberofSplogs insurance

40

20

"BlogTrackOpenTask:SpamBlog Classification" , TREC 2006 Blog Track Notebook , 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 DataCleaning:ContentIdentificationDataCleaning:ContentIdentification

• BaselineHeuristic • SVMMethod Data Aggregators Language Processing Fact Repository Interface 1 2 11 RSS OntoSem Ontology & Aggregator Instance browser

3 4 News Feeds TMRs FR Text Search 12

RDQL Query 13 6 5 OntoSem2OWL 9 Swoogle Index 14

Dekade Editor 7 OntoSem Ontology Inferred (OWL) 10 Triples Semantic RSS 15 Knowledge 8 TMR Editor Environment Semantic Web Tools BlogVoxOpinionExtractionSystemBlogVoxOpinionExtractionSystem

• TREC06 :Finding opinionated BlogVox posts,eitherpositiveornegative, aboutaquery • 2006TRECBlogcorpus: • 80Kblogs • 300Kposts • 50testqueries • BlogVox opinionextraction system • Documentandsentencelevel scorers • Combinedscoresusingan SVMmetalearner • Datacleaning:splogsand postidentification BrandMonitoring/BusinessAnalyticsBrandMonitoring/BusinessAnalytics

BlogAnalytics/MarketIntelligence Buzz Opinions Influence Reputation Competition FinancialAnalyst Limitations • Proprietary • Somecompaniesconductextensivemanualresearch TopCitedMediaSourcesTopCitedMediaSources

TopMSMSourcesontheBlogosphere PropagatingInfluencePropagatingInfluence

• Trustonly • Ignoredistrust(negativepolarities)completely

• FinalBeliefMatrix=M k ,M 0 =T • (K:Numberofatomicpropagationstillconvergence)

• OnestepDistrust • Distrustpropagatessinglestepwhiletrustpropagatesrepeatedly

• FinalBeliefMatrix=M k *(TD),M 0 =T • (K:Numberofatomicpropagationstillconvergence)

• PropagatedDistrust • Treatdistrustandtrustequivalent

• FinalBeliefMatrix=M k ,M 0 =T D • (K:Numberofatomicpropagationstillconvergence) Affiliate Programs SPAMDEXINGSPAMDEXING Context Ads

(i)

arbitrage ads/affiliate links

(ii)

in-links

Spam pages, JavaScript Spammer Spam Blogs Redirect owned [DOORWAY] domains Affiliate Program Buyers spamdex

(iii) Spam pages, Spam Blogs, Spam Comments, SERP Guestbook Spam Wiki Spam Search Engines WORDGRAMS

2005 2006 WORDGRAMS

• Word2grams,2adjacentwords • ShallowNLPtechniquetotacklewordsalad • Wordsaladlesscommoninwebspam(TFIDF) • Wordxgramfeatures,exponentialwithx

2005 2006 CHARACTERGRAMS

2005 2006 CHARACTERGRAMS

• 3,4,5charactergramsfromblogcontent • Cancapturecharactersalad(e.g.p1lls) • Featureselectionimportant

2005 2006