Slide 1: Entity Spotting in Informal Text
Meena Nagarajan, with Daniel Gruhl*, Jan Pieper*, Christine Robson*, Amit P. Sheth
Kno.e.sis, Wright State; IBM Research - Almaden, San Jose CA*
Thursday, October 29, 2009
Slide 2: Tracking Online Popularity
http://www.almaden.ibm.com/cs/projects/iis/sound/
Slide 3: Tracking Online Popularity
• What is the buzz in the online Music Community?
• Ranking and displaying top X music artists, songs, tracks, albums..
• Spotting entities, despamming, sentiment identification, aggregation, top X lists..
http://www.almaden.ibm.com/cs/projects/iis/sound/
Slide 4: Spotting music entities in user-generated content in online music forums (MySpace)
Slide 5: Chatter in Online Music Communities
http://knoesis.wright.edu/research/semweb/projects/music/
Slide 6: Goal: Semantic Annotation of artists, tracks, songs, albums..
MusicBrainz RDF
Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!
Slide 7: Multiple Senses in the same Domain
• 60 songs with Merry Christmas
• 3600 songs with Yesterday
• 195 releases of American Pie
• 31 artists covering American Pie
Caught AMERICAN PIE on cable so much fun!
Slide 8: Annotating UGC, other Challenges
• Several Cultural named entities
  • artifacts of culture, common words in everyday language
LOVED UR MUSIC YESTERDAY!
Just showing some Love to you Madonna you are The Queen to me
Lily your face lights up when you smile!
Slide 9: Annotating UGC, other Challenges
• Informal Text
  • slang, abbreviations, misspellings..
  • indifferent approach to grammar..
• Context dependent terms
• Unknown distributions
Slide 10: Our Approach
Spotting and subsequent sense disambiguation of spots
Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!
Slide 11: Ground Truth Data Set
• 3 artists: Madonna, Rihanna, Lily Allen
• 1858 spots (MySpace UGC) using naive spotter over MusicBrainz artist metadata
• Adjudicate if a spot is an entity or not (or inconclusive)
• hand tagged by the 4 authors

Our experimental evaluation focuses on user comments from the MySpace pages of three artists: Madonna, Rihanna and Lily Allen (see Table 2). The artists were selected to be popular enough to draw comment but different enough to provide variety. The entity definitions were taken from the MusicBrainz RDF (see Figure 1), which also includes some but not all common aliases and misspellings.

Table 2. Artists in the Ground Truth Data Set
Madonna     an artist with an extensive discography as well as a current album and concert tour
Rihanna     a pop singer with recent accolades including a Grammy Award and a very active MySpace presence
Lily Allen  an independent artist with song titles that include “Smile,” “Allright, Still”, “Naive”, and “Friday Night” who also generates a fair amount of buzz around her personal life not related to music

We establish a ground truth data set of 1858 entity spots for these artists (breakdown in Table 3). The data was obtained by crawling the artist’s MySpace page comments and identifying all exact string matches of the artist’s song titles. Only comments with at least one spot were retained. These spots were then hand tagged.

Table 3. Manual scoring agreements on naive entity spotter results.
Artist           Good spots, agreement   Bad spots, agreement   Precision
(Spots scored)   100%      75%           100%      75%          (best case for naive spotter)
Rihanna (615)    165       18            351       8            33%
Lily (523)       268       42            10        100          73%
Madonna (720)    138       24            503       20           23%
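The precision column of Table 3 can be checked with a few lines of arithmetic. The sketch below assumes that precision is computed as good spots over good plus bad spots, pooling both agreement levels and leaving inconclusive spots out; that formula is an inference from the numbers, not something the slides state.

```python
# Counts copied from Table 3: (good@100%, good@75%, bad@100%, bad@75%).
TABLE_3 = {
    "Rihanna": (165, 18, 351, 8),
    "Lily": (268, 42, 10, 100),
    "Madonna": (138, 24, 503, 20),
}


def naive_precision(good_100, good_75, bad_100, bad_75):
    """Good spots over all adjudicated spots (assumed definition)."""
    good = good_100 + good_75
    bad = bad_100 + bad_75
    return good / (good + bad)


for artist, counts in TABLE_3.items():
    # Truncating to whole percent reproduces the 33 / 73 / 23 in the table.
    print(artist, int(naive_precision(*counts) * 100))
```

Under this assumed definition the truncated percentages agree with the slide for all three artists.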
Slide 12: Experiments and Results
Slide 13: Experiments
1. Lightweight, edit distance based entity spotter
All entities from MusicBrainz
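As an illustration of what an edit distance based spotter might look like, here is a minimal sketch. The tokenization, the sliding window over word n-grams, the `max_dist` threshold and the misspelled sample comment are all assumptions made for the example; the slides do not describe the actual implementation.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def spot(comment, titles, max_dist=1):
    """Return (title, matched_text) pairs within max_dist edits of a title."""
    tokens = re.findall(r"[a-z0-9']+", comment.lower())
    hits = []
    for title in titles:
        target = title.lower()
        n = len(target.split())
        # Compare every window of n tokens against the title.
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            if edit_distance(window, target) <= max_dist:
                hits.append((title, window))
    return hits


# Hypothetical misspelled variant of the "sour times" comment.
hits = spot("ohh these sour timez... rock!", ["Sour Times", "Smile"])
print(hits)  # [('Sour Times', 'sour timez')]
```

A small `max_dist` tolerates misspellings like "timez" while keeping short titles from matching unrelated words.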
Slide 14: Experiments
1. Naive spotter using all entities from all of MusicBrainz
This new Merry Christmas tune is so good!
? but which one ?
Disambiguate between the 60+ Merry Christmas entries in MusicBrainz
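The ambiguity can be made concrete with a toy dictionary spotter. The catalog below is hypothetical (it is not MusicBrainz data), and plain substring matching stands in for the naive spotter; the point is only that one surface string maps to several candidate entities.

```python
from collections import defaultdict

# Hypothetical miniature catalog: (track_id, title, artist).
CATALOG = [
    (1, "Merry Christmas", "Artist A"),
    (2, "Merry Christmas", "Artist B"),
    (3, "Smile", "Artist C"),
]


def build_index(catalog):
    """Map each lowercased title to every (track_id, artist) carrying it."""
    index = defaultdict(list)
    for track_id, title, artist in catalog:
        index[title.lower()].append((track_id, artist))
    return index


def naive_spot(comment, index):
    """Exact substring match; returns all candidates for each spotted title."""
    text = comment.lower()
    return {title: cands for title, cands in index.items() if title in text}


hits = naive_spot("This new Merry Christmas tune is so good!", build_index(CATALOG))
print(hits)  # one spotted title, two candidate tracks
```

With the full taxonomy the candidate list for this title would hold the 60+ entries mentioned above, which is what the later restriction steps try to shrink.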
Slide 15: Experiments
2. Constrain set of possible entities from MusicBrainz
- to increase spotting accuracy
- constrain using cues from the comment to eliminate alternatives
This new Merry Christmas tune is so good!
Slide 16: Experiments
3. Eliminate non-music mentions
Natural language and domain specific cues
Your SMILE rocks!
Slide 17: Restricted Entity Spotting
Slide 18: 2. Restricted Entity Spotting
• Investigating the relationship between number of entities used and spotting accuracy
• Understand systematic ways of scoping domain models for use in semantic annotation
• Experiments to gauge benefits of implementing particular constraints in annotator systems
  • harder artist age detector vs. easier gender detector ?
Slide 19: 2a. Random Restrictions
• From all of MusicBrainz (281890 artists, 6220519 tracks) to songs of one artist (for all three artists)
• Domain restrictions of 10% of the RDF result in approximately 9.8 times improvement in precision

...sets of artists that are factors of 10 smaller (10%, 1%, etc). These subsets always contain our three actual artists (Madonna, Rihanna and Lily Allen), because we are interested in simulating restrictions that remove invalid artists. The most restricted entity set contains just the songs of one artist (≈0.0001% of the MusicBrainz taxonomy). In order to rule out selection bias, we perform 200 random draws of sets of artists for each set size - a total of 1200 experiments. Figure 2 shows that the precision increases as the set of possible entities shrinks. For each set size, all 200 results are plotted and a best fit line has been added to indicate the average precision. Note that the figure is in log-log scale.

[Figure 2. Precision of a naive spotter using differently sized portions of the MusicBrainz taxonomy to spot song titles on artist’s MySpace pages. Log-log plot; x-axis: Percent of the Music Brainz taxonomy; y-axis: Precision of the Spotter; one series per artist. Best case for naive spotter: Rihanna 33%, Lily Allen 73%, Madonna 23%.]

We observe that the curves in Figure 2 conform to a power law formula, specifically a Zipf distribution (1/nR2). Zipf’s law was originally applied to demonstrate the Zipf distribution in frequency of words in natural language corpora and has since been demonstrated in other corpora including web searches. Figure 2 shows that song titles in Informal English exhibit the same frequency characteristics as plain English. Furthermore, we can see that in the average case, a domain restriction to 10% of the MusicBrainz RDF will result approximately in a 9.8 times improvement in precision of a naive spotter. This result is remarkably consistent across all three artists. The R2 values for the power lines on the three artists are 0.9776, 0.979, 0.9836, which gives a deviation of 0.61% in R2 value between spots on the three MySpace pages.
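A power law appears as a straight line on log-log axes, so the "9.8 times per 10% restriction" figure corresponds to a slope. The sketch below works that relationship through: it derives the exponent implied by 9.8, generates synthetic points from it (these are NOT the measured data, and the base precision of 0.001 is an arbitrary assumption), and recovers the gain per factor-of-10 restriction with an ordinary least-squares fit in log space.

```python
import math


def loglog_fit(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns (slope, intercept)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    slope = sxy / sxx
    return slope, my - slope * mx


# Exponent implied by the slide: a 10x smaller entity set gives 9.8x precision.
k = math.log10(9.8)  # about 0.991, i.e. close to inverse proportionality

# Synthetic points: fraction of the taxonomy kept vs. precision.
fractions = [1.0, 0.1, 0.01, 0.001]
precisions = [0.001 * f ** (-k) for f in fractions]

slope, _ = loglog_fit(fractions, precisions)
gain_per_decade = 10 ** (-slope)
print(round(k, 3), round(gain_per_decade, 2))  # 0.991 9.8
```

An exponent this close to 1 is what the summary slide means by restrictions yielding proportionally higher precision.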
Slide 20: 2b. Real-world Constraints for Restrictions
• Systematic scoping of the RDF
• Question: Do real-world constraints from metadata reduce size of the entity spot set in a meaningful way?
• Experiments: Derived manually and tested for usefulness
“Happy 25th Rhi!” (eliminate using Artist DOB - metadata in MusicBrainz)
“ur new album dummy is awesome” (eliminate using Album release dates - metadata in MusicBrainz)
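A metadata-driven restriction of this kind amounts to filtering the artist list by predicates. The sketch below is one possible shape for it; the `Artist` record, its fields, the sample rows and the `restrict` helper are all hypothetical, and real values would have to come from the MusicBrainz metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Artist:
    name: str
    birth_year: Optional[int]       # founding year for bands
    last_album_year: Optional[int]


# Hypothetical records, not MusicBrainz data.
ARTISTS = [
    Artist("Artist A", 1985, 2009),
    Artist("Artist B", 1958, 2008),
    Artist("Artist C", 1985, 2001),
]


def restrict(artists, born=None, album_since=None):
    """Keep only artists consistent with cues read off a comment."""
    kept = []
    for a in artists:
        if born is not None and a.birth_year != born:
            continue
        if album_since is not None and (
            a.last_album_year is None or a.last_album_year < album_since
        ):
            continue
        kept.append(a)
    return kept


# "Born 1985, album in past 2 years", taking 2009 as the comment year.
names = [a.name for a in restrict(ARTISTS, born=1985, album_since=2007)]
print(names)  # ['Artist A']
```

Each keyword argument maps to one cue, so constraints can be tried alone or combined to see how far each shrinks the candidate set.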
Slide 21: Real-world Constraints
We consider three classes of restrictions - career, age and album based.

Key  Count    Restriction
Artist Career Length Restrictions - Applied to Madonna
B    22       80’s artists with recent (within 1 year) album
C    154      First album 1983
D    1,193    20-30 year career
Recent Album Restrictions - Applied to Madonna
E    6,491    Artists who released an album in the past year
F    10,501   Artists who released an album in the past 5 years
Artist Age Restrictions - Applied to Lily Allen
H    112      Artist born 1985, album in past 2 years
J    284      Artists born in 1985 (or bands founded in 1985)
L    4,780    Artists or bands under 25 with album in past 2 years
M    10,187   Artists or bands under 25 years old
Number of Album Restrictions - Applied to Lily Allen
K    1,530    Only one album, released in the past 2 years
N    19,809   Artists with only one album
Recent Album Restrictions - Applied to Rihanna
Q    83       3 albums exactly, first album last year
R    196      3+ albums, first album last year
S    1,398    First album last year
T    2,653    Artists with 3+ albums, one in the past year
U    6,491    Artists who released an album in the past year
Specific Artist Restrictions - Applied to each Artist
A    1        Madonna only
G    1        Lily Allen only
P    1        Rihanna only
Z    281,890  All artists in MusicBrainz
Table 4. The efficacy of various sample restrictions.

D. I’ve been your fan for 25 years!
M. Happy 25th
Slide 22: Real-world Constraints
• Applied different constraints to different artists
• Reduce potential entity spot size
• Run naive spotter
• Measure precision
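The reduce, spot, measure loop can be sketched as follows. The title lists and the gold labels are hypothetical stand-ins for the restricted MusicBrainz sets and the hand-tagged ground truth, and substring matching stands in for the naive spotter.

```python
def naive_spot(comment, titles):
    """Return every title that occurs as a substring of the comment."""
    text = comment.lower()
    return [t for t in titles if t.lower() in text]


def precision(comments, titles, gold):
    """gold maps (comment_index, title) -> True for hand-tagged valid mentions."""
    spots = [(i, t) for i, c in enumerate(comments) for t in naive_spot(c, titles)]
    if not spots:
        return 0.0
    good = sum(1 for s in spots if gold.get(s, False))
    return good / len(spots)


COMMENTS = [
    "Got your new album Smile. Loved it!",
    "Caught American Pie on cable so much fun!",
]
ALL_TITLES = ["Smile", "American Pie", "Fun"]  # hypothetical unrestricted set
RESTRICTED = ["Smile"]                         # hypothetical restricted set
GOLD = {(0, "Smile"): True}                    # hypothetical hand tags

p_all = precision(COMMENTS, ALL_TITLES, GOLD)
p_restricted = precision(COMMENTS, RESTRICTED, GOLD)
print(round(p_all, 2), p_restricted)  # 0.33 1.0
```

Shrinking the title list removes the spurious "American Pie" and "Fun" spots, which is the effect the constraint experiments measure at scale.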
Slide 23: Real-world Constraints
Rihanna: short career, recent album releases, 3 album releases etc....
“I heart your new album”
“I love all your 3 albums”
“You are most favorite new pop artist”
[Figure: Precision of the Spotter under the Rihanna restrictions, log-log axes. Labeled points range from the entire MusicBrainz taxonomy to a naive spotter trained on only Rihanna songs.]
Slide 24: Real-world Constraints
Age restrictions, only one album, last year releases, extensive career etc...
[Figure: Precision of the Spotter under real-world restrictions, log-log axes, one panel for Madonna and one for Lily Allen. In each panel the labeled points range from the entire MusicBrainz taxonomy to a naive spotter trained on only that artist’s songs.]
Slide 25: Take aways..
• Real world restrictions closely follow distribution of random restrictions, conforming loosely to a Zipf distribution
• Confirms general effectiveness of limiting domain size regardless of restriction
• Choosing which constraints to implement is simple - pick whatever is easiest first
  • use metadata from the model to guide you
Slide 26: Non-music Mentions
Slide 27: Disambiguating Nonmusic References
UGC on Lily Allen’s page about her new track Smile
Got your new album Smile. Loved it!
Keep your SMILE on!
Slide 28: Binary Classification, SVM
Syntactic features                                           Notation-S
+ POS tag of s                                               s.POS
POS tag of one token before s                                s.POSb
POS tag of one token after s                                 s.POSa
Typed dependency between s and sentiment word                s.POS-TDsent *
Typed dependency between s and domain-specific term          s.POS-TDdom *
Boolean Typed dependency between s and sentiment             s.B-TDsent *
Boolean Typed dependency between s and domain-specific term  s.B-TDdom *
Word-level features                                          Notation-W
+ Capitalization of spot s                                   s.allCaps
+ Capitalization of first letter of s                        s.firstCaps
+ s in Quotes                                                s.inQuotes
Domain-specific features                                     Notation-D
Sentiment expression in the same sentence as s               s.Ssent
Sentiment expression elsewhere in the comment                s.Csent
Domain-related term in the same sentence as s                s.Sdom
Domain-related term elsewhere in the comment                 s.Cdom
+ Refers to basic features, others are advanced features
* These features apply only to one-word-long spots.
Table 6. Features used by the SVM learner

Got your new album Smile. Loved it! Keep your SMILE on!

Training data
• 550 good spots
• 550 bad spots
Test data
• 120 good spots
• 229 * 2 bad spots
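The word-level and domain-specific rows of Table 6 can be computed without a parser, as sketched below. The two lexicons, the regex sentence splitter and the `features` helper are assumptions made for illustration; the syntactic features (POS tags, typed dependencies) need a tagger and parser and are left out.

```python
import re

SENTIMENT = {"loved", "love", "awesome", "rocks"}     # hypothetical lexicon
DOMAIN = {"album", "track", "song", "tune", "music"}  # hypothetical lexicon


def features(comment, start, end):
    """Word-level and domain features for the spot comment[start:end]."""
    spot = comment[start:end]
    # Rough sentence spans: runs of text ending in ., ! or ?
    spans = [(m.start(), m.end()) for m in re.finditer(r"[^.!?]+[.!?]*", comment)]
    s_start, s_end = next((a, b) for a, b in spans if a <= start < b)
    same = set(re.findall(r"[a-z']+", comment[s_start:s_end].lower()))
    rest = set(re.findall(r"[a-z']+", (comment[:s_start] + " " + comment[s_end:]).lower()))
    quoted = (
        start > 0
        and end < len(comment)
        and comment[start - 1] in "\"'“"
        and comment[end] in "\"'”"
    )
    return {
        "s.allCaps": spot.isupper(),
        "s.firstCaps": spot[:1].isupper(),
        "s.inQuotes": quoted,
        "s.Ssent": bool(same & SENTIMENT),
        "s.Csent": bool(rest & SENTIMENT),
        "s.Sdom": bool(same & DOMAIN),
        "s.Cdom": bool(rest & DOMAIN),
    }


comment = "Got your new album Smile. Loved it! Keep your SMILE on!"
first = features(comment, comment.index("Smile"), comment.index("Smile") + 5)
second = features(comment, comment.index("SMILE"), comment.index("SMILE") + 5)
print(first["s.Sdom"], second["s.allCaps"], second["s.Sdom"])  # True True False
```

On the slide's example the valid mention has a domain term ("album") in its own sentence, while the all-caps "SMILE" does not; a learner would see exactly this kind of contrast in the feature vectors.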
Slide 29: Most Useful Combinations
Precision intensive
• FP best: All features, other combinations 42-91 78-50
Recall intensive
• TP best: word, domain, contextual 90-35
• TP next best: word, domain, contextual (POS)
Not all syntactic features are useless, contrary to general belief, wrt informal text
Slide 30: Naive MB spotter + NLP
• Annotate using naive spotter
  • best case baseline (artist is known)
• follow with NLP analytics to weed out FPs
  • run on less than entire input data
• PR tradeoffs: choosing feature combinations depending on end application requirement
[Figure: Precision & Recall against classifier accuracy splits (valid-invalid), starting from the naive spotter. Series: Precision for Lily Allen, Precision for Rihanna, Precision for Madonna, Recall (all three).]
Slide 31: Summary..
• Real-time large-scale data processing
  • prohibits computationally intensive NLP techniques
• Simple inexpensive NL learners over a dictionary-based naive spotter can yield reasonable performance
  • restricting the taxonomy results in proportionally higher precision
• Spot + Disambiguate a feasible approach for (esply. Cultural) NER in Informal Text
Slide 32: Thank You!
• Bing, Yahoo, Google: Meena Nagarajan
• Contact us
  • {dgruhl, jhpieper, crobson}@us.ibm.com, {meena, amit}@knoesis.org
• More about this work
  • http://www.almaden.ibm.com/cs/projects/iis/sound/
  • http://knoesis.wright.edu/researchers/meena