
Entity Spotting in Informal Text 



Published:  December 24, 2009
 
 
 
Notes:
 
Slide 1: Entity Spotting in Informal Text
Meena Nagarajan with Daniel Gruhl*, Jan Pieper*, Christine Robson*, Amit P. Sheth
Kno.e.sis, Wright State
* IBM Research - Almaden, San Jose CA
Thursday, October 29, 2009
Slide 2: Tracking Online Popularity
http://www.almaden.ibm.com/cs/projects/iis/sound/
Slide 3: Tracking Online Popularity
• What is the buzz in the online Music Community?
• Ranking and displaying top X music artists, songs, tracks, albums..
• Spotting entities, despamming, sentiment identification, aggregation, top X lists..
http://www.almaden.ibm.com/cs/projects/iis/sound/
Slide 4: Spotting music entities in user-generated content in online music forums (MySpace)
Slide 5: Chatter in Online Music Communities
http://knoesis.wright.edu/research/semweb/projects/music/
Slide 6: Goal: Semantic Annotation of artists, tracks, songs, albums..
Entity source: MusicBrainz RDF
Comment: Ohh these sour times... rock!
Annotated: Ohh these <track id=574623> sour times </track> ... rock!
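The annotation goal on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `annotate` function and the one-entry `TRACKS` dictionary are hypothetical, and only the example comment and the track id 574623 come from the slide.

```python
import re

# Hypothetical one-entry dictionary (lower-cased title -> track id).
# Only this title/id pair appears on the slide.
TRACKS = {"sour times": 574623}

def annotate(comment, tracks=TRACKS):
    """Wrap every case-insensitive, whole-word title match in a <track> tag."""
    if not tracks:
        return comment
    # Longer titles come first so they win over shorter overlapping titles.
    titles = sorted(tracks, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in titles) + r")\b",
        re.IGNORECASE,
    )

    def tag(match):
        track_id = tracks[match.group(1).lower()]
        return "<track id={}> {} </track>".format(track_id, match.group(1))

    return pattern.sub(tag, comment)

print(annotate("Ohh these sour times... rock!"))
```

A dictionary lookup like this only spots candidates. Deciding whether a spot really refers to the track is the disambiguation step the later slides address.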
Slide 7: Multiple Senses in the same Domain
• 60 songs with Merry Christmas
• 3600 songs with Yesterday
• 195 releases of American Pie
• 31 artists covering American Pie
Example: “Caught AMERICAN PIE on cable so much fun!”
Slide 8: Annotating UGC, other Challenges
• Several Cultural named entities
  • artifacts of culture, common words in everyday language
Examples:
“LOVED UR MUSIC YESTERDAY!”
“Just showing some Love to you Madonna you are The Queen to me”
“Lily your face lights up when you smile!”
Slide 9: Annotating UGC, other Challenges
• Informal Text
  • slang, abbreviations, misspellings..
  • indifferent approach to grammar..
• Context dependent terms
• Unknown distributions
Slide 10: Our Approach
Spotting and subsequent sense disambiguation of spots
Comment: Ohh these sour times... rock!
Annotated: Ohh these <track id=574623> sour times </track> ... rock!
Slide 11: Ground Truth Data Set
• 3 artists: Madonna, Rihanna, Lily Allen
• 1858 spots (MySpace UGC) using naive spotter over MusicBrainz artist metadata
• hand tagged by the 4 authors
• Adjudicate if a spot is an entity or not (or inconclusive)
Excerpt shown on the slide (.1 Ground Truth Data Set):
Our experimental evaluation focuses on user comments from the MySpace pages of three artists: Madonna, Rihanna and Lily Allen (see Table 2). The artists were selected to be popular enough to draw comment but different enough to provide variety. The entity definitions were taken from the MusicBrainz RDF (see Figure 1), which also includes some but not all common aliases and misspellings.
Table 2. Artists in the Ground Truth Data Set
Madonna | an artist with an extensive discography as well as a current album and concert tour
Rihanna | a pop singer with recent accolades including a Grammy Award and a very active MySpace presence
Lily Allen | an independent artist with song titles that include “Smile,” “Allright, Still”, “Naive”, and “Friday Night” who also generates a fair amount of buzz around her personal life not related to music
We establish a ground truth data set of 1858 entity spots for these artists (breakdown in Table 3). The data was obtained by crawling the artist’s MySpace page comments and identifying all exact string matches of the artist’s song titles. Only comments with at least one spot were retained. These spots were then hand ...
Table 3. Manual scoring agreements on naive entity spotter results.
Artist (Spots scored) | Good spots, 100% agreement | Good spots, 75% agreement | Bad spots, 100% agreement | Bad spots, 75% agreement | Precision (best case for naive spotter)
Rihanna (615) | 165 | 18 | 351 | 8 | 33%
Lily (523) | 268 | 42 | 10 | 100 | 73%
Madonna (720) | 138 | 24 | 503 | 20 | 23%
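The Precision column of Table 3 can be related to the good and bad spot counts with a short calculation. The slide does not state the formula, so the sketch below assumes precision is the number of good spots divided by all adjudicated (good plus bad) spots, with both agreement levels counted; the function name `naive_precision` is hypothetical.

```python
# Counts copied from Table 3, in the order
# (good @ 100% agreement, good @ 75%, bad @ 100%, bad @ 75%).
TABLE_3 = {
    "Rihanna": (165, 18, 351, 8),
    "Lily": (268, 42, 10, 100),
    "Madonna": (138, 24, 503, 20),
}

def naive_precision(good_100, good_75, bad_100, bad_75):
    """Assumed formula: good spots over all adjudicated (good + bad) spots."""
    good = good_100 + good_75
    return good / (good + bad_100 + bad_75)

for artist, counts in TABLE_3.items():
    # Truncate to a whole percent; the rounding used in the table is not stated.
    print(artist, int(naive_precision(*counts) * 100))
```

Under this assumption the truncated values are 33, 73 and 23, which agree with the Precision column. The adjudicated counts sum to fewer spots than the "Spots scored" figures, which is consistent with some spots having been judged inconclusive.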
Slide 12: Experiments and Results
Slide 13: Experiments
1. Light weight, edit distance based entity spotter
All entities from MusicBrainz
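The slide names a lightweight, edit-distance-based spotter without giving details. The sketch below shows one way such a spotter could work, assuming whitespace tokenisation, word n-grams of the same length as the title, and a hypothetical threshold `max_distance`; none of these choices are taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def spot(comment, titles, max_distance=1):
    """Return (title, matched_text) pairs for n-grams within max_distance."""
    words = comment.lower().split()
    spots = []
    for title in titles:
        n = len(title.split())
        for start in range(len(words) - n + 1):
            candidate = " ".join(words[start:start + n])
            if edit_distance(candidate, title.lower()) <= max_distance:
                spots.append((title, candidate))
    return spots

# The misspelling "tmes" is one edit away from "times", so it is still spotted.
print(spot("ohh these sour tmes rock", ["Sour Times", "Smile"]))
```

Tolerating a small edit distance catches misspellings common in informal text, at the cost of more false spots for short titles. Punctuation handling is left out of this sketch.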
Slide 14: Experiments
1. Naive spotter using all entities from all of MusicBrainz
“This new Merry Christmas tune is so good!” (but which one?)
Disambiguate between the 60+ Merry Christmas entries in MusicBrainz
Slide 15: Experiments
2. Constrain set of possible entities from MusicBrainz
- to increase spotting accuracy
- constrain using cues from the comment to eliminate alternatives
“This new Merry Christmas tune is so good!”
Slide 16: Experiments
3. Eliminate non-music mentions
Natural language and domain specific cues
“Your SMILE rocks!”
Slide 17: Restricted Entity Spotting
Slide 18: 2. Restricted Entity Spotting
• Investigating the relationship between number of entities used and spotting accuracy
• Understand systematic ways of scoping domain models for use in semantic annotation
• Experiments to gauge benefits of implementing particular constraints in annotator systems
  • harder artist age detector vs. easier gender detector?
Slide 19: 2a. Random Restrictions
• From all of MusicBrainz (281890 artists, 6220519 tracks) to songs of one artist (for all three artists)
• Domain restrictions of 10% of the RDF result in approximately 9.8 times improvement in precision
• Precision (best case for naive spotter): 33%, 73%, 23%
Excerpt shown on the slide:
... sets of artists that are factors of 10 smaller (10%, 1%, etc). These subsets always contain our three actual artists (Madonna, Rihanna and Lily Allen), because we are interested in simulating restrictions that remove invalid artists. The most restricted entity set contains just the songs of one artist (≈0.0001% of the MusicBrainz taxonomy). In order to rule out selection bias, we perform 200 random draws of sets of artists for each set size - a total of 1200 experiments. Figure 2 shows that the precision increases as the set of possible entities shrinks. For each set size, all 200 results are plotted and a best fit line has been added to indicate the average precision. Note that the figure is in log-log scale.
[Figure 2. Precision of a naive spotter using differently sized portions of the MusicBrainz taxonomy to spot song titles on artist’s MySpace pages. Axes: Percent of the Music Brainz Taxonomy; Precision of the Spotter. One curve each for Lily Allen, Rihanna and Madonna, with best fit lines.]
We observe that the curves in Figure 2 conform to a power law formula, specifically a Zipf distribution (1/n^{R^2}). Zipf’s law was originally applied to demonstrate the Zipf distribution in frequency of words in natural language corpora and has since been demonstrated in other corpora including web searches. Figure 2 shows that song titles in Informal English exhibit the same frequency characteristics as plain English. Furthermore, we can see that in the average case, a domain restriction of 10% of the MusicBrainz RDF will result approximately in a 9.8 times improvement in precision of a naive spotter. This result is remarkably consistent across all three artists. The R^2 values for the power lines on the three artists are 0.9776, 0.979, 0.9836, which gives a deviation of 0.61% in R^2 value between spots on the three MySpace pages.
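The power-law reading of this slide can be made concrete with a least-squares fit in log-log space. The sketch below uses synthetic points, not the paper's measurements: six set sizes (1200 experiments at 200 draws per size implies six sizes) whose precision grows 9.8 times for each tenfold restriction. The function name `fit_power_law` and the starting precision of 1e-5 are assumptions made for illustration.

```python
import math

def fit_power_law(fractions, precisions):
    """Least-squares line in log10-log10 space: log p = a + b * log f."""
    xs = [math.log10(f) for f in fractions]
    ys = [math.log10(p) for p in precisions]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic data (NOT from the paper): fraction of the taxonomy used,
# and a precision that rises 9.8x every time the fraction shrinks 10x.
fractions = [10.0 ** -k for k in range(6)]
precisions = [1e-5 * 9.8 ** k for k in range(6)]

slope, intercept, r_squared = fit_power_law(fractions, precisions)
gain_per_tenfold_restriction = 10 ** (-slope)
print(round(slope, 4), round(gain_per_tenfold_restriction, 2), round(r_squared, 4))
```

A straight line in log-log space with slope b means precision scales as (fraction)^b, so shrinking the entity set to 10% multiplies precision by 10^(-b). A gain of 9.8 corresponds to a slope of about -0.99.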
Slide 20: 2b. Real-world Constraints for Restrictions
• Systematic scoping of the RDF
• Question: Do real-world constraints from metadata reduce size of the entity spot set in a meaningful way?
• Experiments: Derived manually and tested for usefulness
Examples:
“Happy 25th Rhi!” (eliminate using Artist DOB - metadata in MusicBrainz)
“ur new album dummy is awesome” (eliminate using Album release dates - metadata in MusicBrainz)
Slide 21: Real-world Constraints
Restrictions over MusicBrainz
Key | Count | Restriction
Artist Career Length Restrictions - Applied to Madonna
B | 22 | 80’s artists with recent (within 1 year) album
C | 154 | First album 1983
D | 1,193 | 20-30 year career
Recent Album Restrictions - Applied to Madonna
E | 6,491 | Artists who released an album in the past year
F | 10,501 | Artists who released an album in the past 5 years
Artist Age Restrictions - Applied to Lily Allen
H | 112 | Artist born 1985, album in past 2 years
J | 284 | Artists born in 1985 (or bands founded in 1985)
L | 4,780 | Artists or bands under 25 with album in past 2 years
M | 10,187 | Artists or bands under 25 years old
Number of Album Restrictions - Applied to Lily Allen
K | 1,530 | Only one album, released in the past 2 years
N | 19,809 | Artists with only one album
Recent Album Restrictions - Applied to Rihanna
Q | 83 | 3 albums exactly, first album last year
R | 196 | 3+ albums, first album last year
S | 1,398 | First album last year
T | 2,653 | Artists with 3+ albums, one in the past year
U | 6,491 | Artists who released an album in the past year
Specific Artist Restrictions - Applied to each Artist
A | 1 | Madonna only
G | 1 | Lily Allen only
P | 1 | Rihanna only
Z | 281,890 | All artists in MusicBrainz
Table 4. The efficacy of various sample restrictions.
Example cues: D. “I’ve been your fan for 25 years!” M. “Happy 25th”
... consider three classes of restrictions - career, age and album based ...
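Restrictions such as H ("Artist born 1985, album in past 2 years") and N ("Artists with only one album") in Table 4 amount to metadata filters over the artist list. The sketch below illustrates the idea with a hypothetical `Artist` record and a made-up three-entry catalogue; the field names are not the MusicBrainz schema, and the reference year 2009 is an assumption based on the date of the talk.

```python
from dataclasses import dataclass

@dataclass
class Artist:
    # Hypothetical record; these field names are not MusicBrainz's schema.
    name: str
    born_or_founded: int
    album_years: tuple

def born_1985_recent_album(artist, current_year=2009):
    """Restriction H: born (or founded) 1985, album in the past 2 years."""
    recent = any(0 <= current_year - year <= 2 for year in artist.album_years)
    return artist.born_or_founded == 1985 and recent

def only_one_album(artist):
    """Restriction N: artists with only one album."""
    return len(artist.album_years) == 1

def restrict(artists, *constraints):
    """Keep the artists that satisfy every supplied constraint."""
    return [a for a in artists if all(c(a) for c in constraints)]

# Made-up catalogue for illustration only.
catalogue = [
    Artist("Artist A", 1985, (2008,)),
    Artist("Artist B", 1985, (2001, 2003)),
    Artist("Artist C", 1958, (1983, 2008)),
]

print([a.name for a in restrict(catalogue, born_1985_recent_album)])
```

Each constraint is an ordinary predicate, so constraints can be combined freely and the size of the surviving entity set can be compared against the Count column of the table.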
Slide 22: Real-world Constraints
• Applied different constraints to different artists
• Reduce potential entity spot size
• Run naive spotter
• Measure precision
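The four steps on this slide form a small evaluation loop. The sketch below is an illustration under stated assumptions rather than the authors' code: `naive_spot` does exact, case-insensitive substring matching, the comments and title lists are toy data, and `gold` is a hypothetical hand-labelled map from a spot to whether it is a real music mention.

```python
def naive_spot(comment, titles):
    """Exact, case-insensitive substring spotting of every title."""
    text = comment.lower()
    return [t for t in titles if t.lower() in text]

def precision(comments, titles, gold):
    """gold maps (comment_index, title) -> True for a real music mention."""
    spots = [(i, t) for i, c in enumerate(comments) for t in naive_spot(c, titles)]
    if not spots:
        return 0.0
    correct = sum(1 for s in spots if gold.get(s, False))
    return correct / len(spots)

# Toy data: two comments, one genuine mention of the track "Smile".
comments = [
    "Got your new album Smile. Loved it!",
    "Happy birthday, yesterday was fun",
]
gold = {(0, "Smile"): True}

all_titles = ["Smile", "Yesterday"]   # unrestricted entity set
restricted = ["Smile"]                # entity set after applying a constraint

print(precision(comments, all_titles, gold))
print(precision(comments, restricted, gold))
```

With the unrestricted list the common word "yesterday" produces a false spot and precision is 0.5; after the restriction removes that title, precision is 1.0. Substring matching is deliberately crude here and would also match inside longer words.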
Slide 23: Real-world Constraints
Rihanna: short career, recent album releases, 3 album releases etc....
“I heart your new album”
“I love all your 3 albums”
“You are most favorite new pop artist”
[Figure: Precision of the Spotter for the entire Music Brainz taxonomy, for each restriction (at least 3 albums; exactly 3 albums; artists whose first album was in the past 3 years; all artists who released an album in the past 3 years), and for a naive spotter trained on only Rihanna songs.]
Slide 24: Real-world Constraints
Age restrictions, only one album, last year releases, extensive career etc...
[Figure, panel "Madonna": Precision of the Spotter for the entire Music Brainz taxonomy, for each restriction (early 80's artists with an album in the past year; artists whose first release was in 1983; artists with a 20-30 year career; artists with an album in the past year; artists with an album in the past 5 years), and for a naive spotter trained on only Madonna songs.]
[Figure, panel "Lily Allen": Precision of the Spotter for the entire Music Brainz taxonomy, for each restriction (an album in the past two years; artists under 25 years old (or bands less than 25 years old); artists born in 1985 (or bands founded in 1985); artists with only one album), and for a naive spotter trained on only Lily Allen songs.]
Slide 25: Take aways..
• Real world restrictions closely follow distribution of random restrictions, conforming loosely to a Zipf distribution
• Confirms general effectiveness of limiting domain size regardless of restriction
• Choosing which constraints to implement is simple - pick whatever is easiest first
  • use metadata from the model to guide you
Slide 26: Non-music Mentions
Slide 27: Disambiguating Non-music References
UGC on Lily Allen’s page about her new track Smile
“Got your new album Smile. Loved it!”
“Keep your SMILE on!”
Slide 28: Binary Classification, SVM
Examples: “Got your new album Smile. Loved it!” / “Keep your SMILE on!”
Training data: 550 good spots, 550 bad spots
Test data: 120 good spots, 229 * 2 bad spots
Table 6. Features used by the SVM learner
Feature | Notation
Syntactic features (Notation-S)
+ POS tag of s | s.POS
POS tag of one token before s | s.POSb
POS tag of one token after s | s.POSa
Typed dependency between s and sentiment word | s.POS-TDsent *
Typed dependency between s and domain-specific term | s.POS-TDdom *
Boolean Typed dependency between s and sentiment | s.B-TDsent *
Boolean Typed dependency between s and domain-specific term | s.B-TDdom *
Word-level features (Notation-W)
+ Capitalization of spot s | s.allCaps
+ Capitalization of first letter of s | s.firstCaps
+ s in Quotes | s.inQuotes
Domain-specific features (Notation-D)
Sentiment expression in the same sentence as s | s.Ssent
Sentiment expression elsewhere in the comment | s.Csent
Domain-related term in the same sentence as s | s.Sdom
Domain-related term elsewhere in the comment | s.Cdom
+ Refers to basic features, others are advanced features
* These features apply only to one-word-long spots.
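The word-level and domain-specific rows of Table 6 are simple enough to compute without a parser. The sketch below extracts them for one spot, using the notation from the table as dictionary keys. The two small lexicons and the function names are hypothetical, sentence splitting is a crude punctuation split, and the syntactic features (POS tags, typed dependencies) are omitted because they need an NLP toolkit.

```python
import re

# Hypothetical toy lexicons; the slides do not list the actual terms used.
SENTIMENT_WORDS = {"love", "loved", "awesome", "rocks"}
DOMAIN_WORDS = {"album", "track", "song", "tune"}

def _has_term(text, lexicon):
    """True if any lower-cased word of text appears in the lexicon."""
    return any(w in lexicon for w in re.findall(r"[a-z']+", text.lower()))

def word_and_domain_features(comment, spot):
    """Features for the first exact-case occurrence of spot in comment."""
    sentences = [s for s in re.split(r"[.!?]+", comment) if s.strip()]
    own = next(s for s in sentences if spot in s)        # sentence holding s
    others = " ".join(s for s in sentences if s is not own)
    start = comment.index(spot)
    end = start + len(spot)
    in_quotes = (start > 0 and end < len(comment)
                 and comment[start - 1] in "\"'" and comment[end] in "\"'")
    return {
        "s.allCaps": spot.isupper(),
        "s.firstCaps": spot[:1].isupper(),
        "s.inQuotes": in_quotes,
        "s.Ssent": _has_term(own, SENTIMENT_WORDS),
        "s.Csent": _has_term(others, SENTIMENT_WORDS),
        "s.Sdom": _has_term(own, DOMAIN_WORDS),
        "s.Cdom": _has_term(others, DOMAIN_WORDS),
    }

print(word_and_domain_features("Got your new album Smile. Loved it!", "Smile"))
print(word_and_domain_features("Keep your SMILE on!", "SMILE"))
```

For the first comment the domain term "album" shares a sentence with the spot and a sentiment word appears elsewhere, while the second comment offers only capitalisation as evidence. Boolean dictionaries like these would then be turned into vectors for the SVM learner.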
Slide 29: Most Useful Combinations
• FP best: All features, other combinations
• TP best: word, domain, contextual
• TP next best: word, domain, contextual (POS)
• Values shown: 42-91, 78-50, 90-35
• Labels: Recall intensive, Precision intensive
• Not all syntactic features are useless, contrary to general belief, wrt informal text
Slide 30: Naive MB spotter + NLP
• Annotate using naive spotter
  • best case baseline (artist is known)
• follow with NLP analytics to weed out FPs
  • run on less than entire input data
PR tradeoffs: choosing feature combinations depending on end application requirement
[Figure: Precision & Recall for the naive spotter and across classifier accuracy splits (valid-invalid); series: Precision for Lily Allen, Precision for Rihanna, Precision for Madonna, Recall (all three).]
Slide 31: Summary..
• Real-time large-scale data processing
  • prohibits computationally intensive NLP techniques
• Simple inexpensive NL learners over a dictionary-based naive spotter can yield reasonable performance
  • restricting the taxonomy results in proportionally higher precision
• Spot + Disambiguate a feasible approach for (esply. Cultural) NER in Informal Text
Slide 32: Thank You!
• Bing, Yahoo, Google: Meena Nagarajan
• Contact us: {dgruhl, jhpieper, crobson}@us.ibm.com, {meena, amit}@knoesis.org
• More about this work:
http://www.almaden.ibm.com/cs/projects/iis/sound/
http://knoesis.wright.edu/researchers/meena

   