Automatic analysis of scientific articles
Cyril Labbé, Université Grenoble Alpes - LIG - Sigma team, June 25, 2019
C. Labbé (UGA-LIG), Ike Antkare & Co, June 25, 2019
Table of Contents
1 Why write?
2 Publications and scientometrics
    Scientometrics: what for?
    SCIgen, a Probabilistic Context Free Grammar
3 Of the use of fake publications
    h-index hacking
    Resume padding
    Journal hijacking
4 Detection of SCIgen papers
    Google Search
    SciDetect: automatic detection
5 Automatic detection of questionable research papers
    Fact checking science
    Seek & Blastn tool
Why write? To build scientific knowledge
The ancestors (1665): in London, the Philosophical Transactions of the Royal Society; in Paris, the Journal des sçavans.
Specific features of scientific publications: an audience of specialists, and contributions to the "scientific debate" through original work.
Why write? Publishing an article
[Figure: the publication circuit. A scientist writes an article (example shown: "Computing Machinery and Intelligence"); an editor supervises its evaluation by peers (scientists); copyright is transferred; the publisher prints and distributes; readers and libraries pay an access fee.]
Why write? New scientific information systems
A large number of information sources:
    catalogues of scientific publishers
    open archives and social networks
The information has varied characteristics:
    paid or free access: public, restricted or private
    peer-reviewed or not
For varied purposes: state of the art / bibliometrics / scientometrics
The scientific article is at the heart of the system:
    How can the validity of the information presented be guaranteed?
    How can its quality be guaranteed?
    Are some systems more virtuous than others?
Publications and scientometrics
Scientometrics: what for? Ranking scientists and journals
Definition (Impact Factor): average number of citations to papers published by the journal over the last two years. Computed since 1975.
[Figure: citations versus time after publication, with a two-year window marked.]
Definition (h-index [Hirsch, 2005]): a scientist has index $h$ if $h$ of his or her $N_p$ papers have at least $h$ citations each and the other $(N_p - h)$ papers have $\le h$ citations each.
[Figure: number of citations versus papers; the papers axis runs from 0 to $N_p$, with $h$ marked on both axes.]
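The h-index definition above can be turned into a few lines of code. This is a minimal sketch, assuming the citation counts of one author are available as a plain list of integers; the function name `h_index` is chosen here for illustration and does not come from the slides.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the first `rank` papers all have >= rank citations
        else:
            break         # every later paper has even fewer citations
    return h


if __name__ == "__main__":
    print(h_index([10, 8, 5, 4, 3]))
```

Because the index only counts papers and citations, it says nothing about where the citations come from, which is what the citation-farm experiment later in the talk exploits.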
Scientometrics: what for? Ranking universities, journals and scientists
Who asks what:
    Librarian: What are the must-buys for my readers?
    Scientist: Where shall I submit my research?
    Research administration: Who shall I hire? Who deserves a promotion?
    Students: Where to study? With whom? In which country?
    Government: Who deserves investment? What for? Which scientific field?
Indicators:
    Impact Factor: average number of citations (....) over the last two years. Computed since 1975.
    h-index and variations (http://sci2s.ugr.es/hindex): h5-index, g-index, hm-index, a-index, hg-index, ar-index...
    ARWU: Academic Ranking of World Universities (Shanghai ranking), since 2003.
    Collaborative distance
Scientometrics: what for? Quantitative rules
In France...
    Publishing ("publiant"): at least 1 publication per year, or 2 rank-A publications over the period.
    Producing ("produisant"): the arguments that allow a non-publishing person to be considered as producing.
... and elsewhere
    "at least one international publication per year"
    Rules for defense (MS thesis, PhD thesis)
Scientometrics: what for? Chronos
[Figure: timeline from 2004 to 2018 placing Web of Science, Scopus, the h-index, PoP V1.0, and a label mentioning Abiteboul and the administrator of the Collège de France.]
Automatic text generation
SCIgen: a Probabilistic Context Free Grammar
PCFG: Probabilistic Context Free Grammar
Sets of symbols
    Set of non-terminal symbols: $N = \{SP, S, V, P\}$
    Set of terminal symbols: $\Sigma = \{$".", sing, dance, flight, seas, oceans, air, streets, hills, fields$\}$
Set of rules $R_i$
    $R_1$: SP → S .                                       $p(R_1) = 1$
    $R_2$: S → We shall V in the P                        $p(R_2) = 1/4$
    $R_3$: S → S, S                                       $p(R_3) = 1/2$
    $R_4$: S → We shall V in the P and in the P, S        $p(R_4) = 1/4$
    $R_{5..7}$: V → sing | dance | flight                 $p(R_i) = 1/3$, $i = 5..7$
    $R_{8..13}$: P → seas | oceans | air | streets | hills | fields    $p(R_i) = 1/6$, $i = 8..13$
Slide annotation: "Non zero probability to ∞" (the recursive rules can keep expanding forever).
Terminal string example:
    s: We shall sing in the air and in the hills, We shall dance in the fields.
    $p(s) = \prod_j p(R_j)$
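To make the toy grammar concrete, here is a minimal sketch of a sampler for it, together with the probability of the example string as the product of the probabilities of the rules used. The dictionary layout, the function names `sample` and `derivation_probability`, and the expansion budget are assumptions made for this illustration; this is not SCIgen's own code.

```python
import math
import random
from fractions import Fraction

# Toy grammar from the slide: non-terminal -> list of (probability, right-hand side).
# Any symbol that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "SP": [(Fraction(1), ["S", "."])],                                   # R1
    "S": [
        (Fraction(1, 4), ["We shall", "V", "in the", "P"]),              # R2
        (Fraction(1, 2), ["S", ",", "S"]),                               # R3
        (Fraction(1, 4), ["We shall", "V", "in the", "P",
                          "and in the", "P", ",", "S"]),                 # R4
    ],
    "V": [(Fraction(1, 3), [w]) for w in ("sing", "dance", "flight")],   # R5..7
    "P": [(Fraction(1, 6), [w]) for w in
          ("seas", "oceans", "air", "streets", "hills", "fields")],      # R8..13
}


def sample(rng, start="SP", max_expansions=1000):
    """Leftmost derivation. Returns (tokens, probability), or None if the
    expansion budget runs out (the recursive rules may never terminate)."""
    stack, tokens, prob, expansions = [start], [], Fraction(1), 0
    while stack:
        symbol = stack.pop()
        if symbol not in GRAMMAR:
            tokens.append(symbol)          # terminal: emit it
            continue
        expansions += 1
        if expansions > max_expansions:
            return None
        rules = GRAMMAR[symbol]
        p, rhs = rng.choices(rules, weights=[float(q) for q, _ in rules])[0]
        prob *= p
        stack.extend(reversed(rhs))        # leftmost symbol is expanded first
    return tokens, prob


def derivation_probability(rule_probabilities):
    """p(s) as the product of the probabilities of the rules in a derivation."""
    return math.prod(rule_probabilities, start=Fraction(1))


# Example string of the slide: R1, R4 (sing, air, hills), then R2 (dance, fields).
EXAMPLE = [Fraction(1), Fraction(1, 4), Fraction(1, 3), Fraction(1, 6),
           Fraction(1, 6), Fraction(1, 4), Fraction(1, 3), Fraction(1, 6)]

if __name__ == "__main__":
    print(derivation_probability(EXAMPLE))
    print(sample(random.Random()))
```

With these rule probabilities each S produces on average 1.25 new S symbols, so a derivation starting from S stops with probability 1/2 only; the budget in `sample` is there for the other half.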
SCIgen (2005), by J. Stribling, M. Krohn & D. Aguayo
"... maximize amusement, rather than coherence ..."
Structure of a generated paper: Title, Abstract, Introduction, Model, Impl, Eval, RelatedWork, Concl, References
Introduction: Intro_A, Intro_A2, Intro_A3, Intro_closing
Sample rules:
    Intro_A → Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
    Intro_A → In recent years, much research has been devoted to the SCI_ACT; ...
    Intro_A → SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
    Intro_A → The SCI_ACT is a SCI_ADJ SCI_PROBLEM.
    Intro_A → The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
    Intro_A → The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
    ...
    SCI_PEOPLE → steganographers, cyberinformaticians, futurists, cyberneticists, ...
    SCI_BUZZWORD_ADJ → omniscient, introspective, peer to peer, ambimorphic, ...
SCIgen: Chronos
[Figure: timeline from 2004 to 2014 placing Web of Science (Thomson Reuters), Scopus (Elsevier), the h-index, PoP V1.0, SCIgen, and a label mentioning Abiteboul and the administrator of the Collège de France.]
Of the use of fake publications: h-index hacking
Building a citation farm [Labbé, 2010]
[Figure: a modified SCIgen; Ike Antkare's 101 documents (numbered 0 and 1 to 100); real documents.]
Ike Antkare h-index [Labbé, 2010]
h-index hacking: Chronos
[Figure: timeline from 2004 to 2018 placing Web of Science (Thomson Reuters), Scopus (Elsevier), the h-index, PoP V1.0, SCIgen, Ike Antkare, and a label mentioning Abiteboul and the administrator of the Collège de France.]
Of the use of fake publications: Resume padding
[Figure: IEEE Xplore, 12 Nov. 2014]
[Figure: IEEE Xplore, 2 Feb. 2016]
Of the use of fake publications: Journal hijacking
Beware: hijacking. Jeffrey Beall, http://scholarlyoa.com
Publication: Gold Open Access
[Figure: the Gold Open Access circuit. A scientist writes an article (example shown: "Computing Machinery and Intelligence") and pays a publication fee; an editor supervises its evaluation by peers (scientists); copyright is transferred; the publisher prints and distributes; readers and libraries get access for free.]
Beware: predatory publishers
Get me off Your Fucking Mailing List
David Mazières (New York University) and Eddie Kohler (University of California, Los Angeles), http://www.mailavenger.org/
[Figure: first page of the paper. The abstract and the introduction consist of the sentence "Get me off your fucking mailing list." repeated throughout.]
Detection of SCIgen papers: Google Search
Phrase search
    Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
    In recent years, much research has been devoted to the SCI_ACT; ...
    SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
    The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
    The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
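The idea behind phrase search is that the fixed words of a template survive in every generated paper, whatever the placeholders expand to. A minimal sketch follows, assuming placeholders are written with underscores (SCI_PEOPLE, SCI_ACT, ...); the helper name `phrase_queries` and the three-word minimum are hypothetical choices, and no search engine is actually called.

```python
import re

# Placeholders are assumed to look like SCI_PEOPLE, SCI_GENERIC_NOUN, ...
PLACEHOLDER = re.compile(r"\bSCI_[A-Z_]+\b")


def phrase_queries(template, min_words=3):
    """Split a template on its placeholders and return the fixed fragments
    as quoted phrases, ready to paste into a search box."""
    queries = []
    for fragment in PLACEHOLDER.split(template):
        text = fragment.strip(" ,;.")
        # Very short fragments ("Many", "have") are too common to be useful.
        if len(text.split()) >= min_words:
            queries.append(f'"{text}"')
    return queries


if __name__ == "__main__":
    template = ("Many SCI_PEOPLE would agree that, had it not been for "
                "SCI_GENERIC_NOUN, ...")
    for query in phrase_queries(template):
        print(query)
```

Several such phrases found in one document are a much stronger signal than any single one, since each fragment on its own can occur in genuine papers.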
Detection of SCIgen papers: SciDetect, automatic detection
Intertextual distance [Labbé and Labbé, 2006]
    A = {le le chat}, relative frequencies (chat, le, un) = (1/3, 2/3, 0)
    B = {un chat chat}, relative frequencies (chat, le, un) = (2/3, 0, 1/3)
[Figure: the two texts drawn as frequency vectors on the axes "un", "chat" and "le".]
Intertextual distance: $D(A,B) = \frac{1}{2} \sum_{i \in (A \cup B)} |f_{i,A} - f_{i,B}| = \frac{2}{3}$
Interpretation: $D(A,B)$ is the proportion of word tokens that differ between the two texts.
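The formula can be checked on the two three-word texts of the example. This is a minimal sketch of the relative-frequency version shown on the slide, assuming texts are already split into word tokens; the function name `intertextual_distance` is illustrative, and any adjustment for texts of very different lengths is left out.

```python
from collections import Counter
from fractions import Fraction


def intertextual_distance(tokens_a, tokens_b):
    """D(A,B) = 1/2 * sum over the joint vocabulary of |f_iA - f_iB|,
    where f is the relative frequency of word i in each text."""
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one token")
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocabulary = set(count_a) | set(count_b)
    total = sum(
        abs(Fraction(count_a[w], n_a) - Fraction(count_b[w], n_b))
        for w in vocabulary
    )
    return total / 2


if __name__ == "__main__":
    a = "le le chat".split()
    b = "un chat chat".split()
    print(intertextual_distance(a, b))
```

The value ranges from 0 (same relative frequencies for every word) to 1 (no word in common), which makes distances between pairs of texts directly comparable.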
Hierarchical clustering [Labbé and Labbé, 2013]
Initial distance matrix:
         A      B      C      D
    A    0      0.2    0.3    0.5
    B    0.2    0      0.4    0.6
    C    0.3    0.4    0      0.3
    D    0.5    0.6    0.3    0
A and B form group I, with $D(I,x) = \frac{1}{2}(D(A,x) + D(B,x))$:
         I      C      D
    I    0      0.35   0.55
    C    0.35   0      0.3
    D    0.55   0.3    0
C and D form group J, with $D(I,J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} D(i,j)$:
         I      J
    I    0      0.45
    J    0.45   0
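The successive merges of the example can be reproduced with a small average-linkage loop. This is a sketch under the assumption that the distance between two groups is the mean of the pairwise distances between their members, which matches the 0.35, 0.55 and 0.45 values of the example; the names `average_linkage` and `DIST` are illustrative.

```python
from itertools import combinations

# Pairwise distances of the four-text example.
DIST = {
    frozenset("AB"): 0.2, frozenset("AC"): 0.3, frozenset("AD"): 0.5,
    frozenset("BC"): 0.4, frozenset("BD"): 0.6, frozenset("CD"): 0.3,
}


def average_linkage(dist, labels):
    """Repeatedly merge the two closest groups. Returns the list of merges
    as (members of first group, members of second group, distance)."""
    def group_distance(group_i, group_j):
        # Mean of the distances between every member of I and every member of J.
        total = sum(dist[frozenset((i, j))] for i in group_i for j in group_j)
        return total / (len(group_i) * len(group_j))

    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        group_i, group_j = min(combinations(clusters, 2),
                               key=lambda pair: group_distance(*pair))
        merges.append((sorted(group_i), sorted(group_j),
                       group_distance(group_i, group_j)))
        clusters = [c for c in clusters if c not in (group_i, group_j)]
        clusters.append(group_i | group_j)
    return merges


if __name__ == "__main__":
    for first, second, distance in average_linkage(DIST, "ABCD"):
        print(first, second, round(distance, 2))
```

The list of merges is what a dendrogram draws: texts joined at a small distance sit on the same low branch.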
Automatic detection [Labbé and Labbé, 2013]
Intertextual distance: