Automatic generation and detection of fake scientific papers

Séminaire Aristote (Aristote Seminar)

Cyril Labbé

Université Joseph Fourier - LIG

19 December 2014

C. Labbé (UJF-LIG), Ike Antkare & Co, 19 December 2014

Table of Contents

1 Preliminaries
    Scientometrics
    SCIgen, a Probabilistic Context Free Grammar

2 Ike Antkare, one of the great stars in the scientific firmament

3 Detection of SCIgen papers
    Google Search
    Automatic classification

Preliminaries: Scientometrics

Ranking scientists and journals

Definition (h-index [Hirsch, 2005]): A scientist has index $h$ if $h$ of his or her $N_p$ papers have at least $h$ citations each and the other $(N_p - h)$ papers have $\le h$ citations each.

[Figure: h-index illustration. Axis labels: Papers, Number of Citations; tick marks 0, h, N_p.]

Definition (Impact Factor): Average number of citations to papers published by the journal over the last two years. Computed since 1975.

[Figure: impact factor illustration. Axis: time after publication, with a 2-year window.]
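The h-index definition above can be made concrete with a few lines of Python. This is a minimal sketch, not the code of any citation tool named in these slides; the function name `h_index` and the sample citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the first `rank` papers all have >= rank citations
        else:
            break         # every later paper has even fewer citations
    return h


# Hypothetical citation counts for five papers.
print(h_index([10, 8, 5, 4, 3]))  # 4 papers have at least 4 citations each
```

Because the index depends only on citation counts, any tool that counts citations to documents it cannot vet will report whatever h-index those documents produce.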

Tools that count citations.

Toll-based tools:
  Provided by publishers (…, Thomson Reuters);
  Based on publishers' catalogs (ACM, IEEE, Springer, Elsevier);
  Selected venues only (all peer reviewed).

Free tools (…, CiteSeerX, ...):
  Crawling the web / selected catalogs / added by users;
  Social media (Google+, Scholarometer, Microsoft Academics...).

Free tools that compute indicators:
  Publish or Perish; Scholarometer; Microsoft Academics; Google+; and many more...

Chronos

[Timeline figure, 2004 to 2014. Entries: Web of Science (Thomson Reuters); Scopus (Elsevier); h-index; PoP V1.0; "Abiteboul par l'administrateur du Collège de France".]

Tools to generate publications.

Preliminaries: SCIgen, a Probabilistic Context Free Grammar

PCFG: Probabilistic Context Free Grammar

Sets of symbols
  Set of non-terminal symbols: $\mathcal{N} = \{SP, S, V, P\}$
  Set of terminal symbols: $\Sigma = \{$".", sing, dance, flight, seas, oceans, air, streets, hills, fields$\}$

Set of rules $R_i$
  $R_1$: $SP \to S$ .                                        $p(R_1) = 1$
  $R_2$: $S \to$ We shall $V$ in the $P$                     $p(R_2) = 1/4$
  $R_3$: $S \to S$ , $S$                                     $p(R_3) = 1/2$
  $R_4$: $S \to$ We shall $V$ in the $P$ and in the $P$ , $S$   $p(R_4) = 1/4$
  $R_{5..7}$: $V \to$ sing | dance | flight                  $p(R_i) = 1/3$, $i = 5..7$
  $R_{8..13}$: $P \to$ seas | oceans | air | streets | hills | fields   $p(R_i) = 1/6$, $i = 8..13$

Annotation on $R_3$: non-zero probability of an infinite derivation.

Terminal string example:
  $s$: We shall sing in the air and in the hills, We shall dance in the fields.
  $p(s) = \prod_{j} p(R_j)$, the product running over the rules $R_j$ used to derive $s$.
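To make the toy grammar concrete, here is a minimal Python sketch that samples a sentence and multiplies the probabilities of the rules it used. The rule set is read from the slide with S → S , S taken as R3. The names `GRAMMAR`, `derive`, `generate` and the `max_expansions` budget are illustrative assumptions. Because R3 lets a derivation grow without bound, the sketch discards and retries derivations that exceed the budget, so it samples only the short derivations.

```python
import random
from fractions import Fraction
from math import prod

# Toy grammar: non-terminal -> list of (probability, right-hand side).
GRAMMAR = {
    "SP": [(Fraction(1), ["S", "."])],                                  # R1
    "S": [
        (Fraction(1, 4), ["We shall", "V", "in the", "P"]),             # R2
        (Fraction(1, 2), ["S", ",", "S"]),                              # R3
        (Fraction(1, 4), ["We shall", "V", "in the", "P",
                          "and in the", "P", ",", "S"]),                # R4
    ],
    "V": [(Fraction(1, 3), [w]) for w in ("sing", "dance", "flight")],  # R5..R7
    "P": [(Fraction(1, 6), [w]) for w in
          ("seas", "oceans", "air", "streets", "hills", "fields")],     # R8..R13
}


class TooLong(Exception):
    """Raised when a derivation uses more rule applications than allowed."""


def derive(symbol, rng, budget):
    """Expand `symbol`; return (terminal tokens, probability of the derivation)."""
    if symbol not in GRAMMAR:
        return [symbol], Fraction(1)          # terminal: emitted as-is
    budget[0] -= 1
    if budget[0] < 0:
        raise TooLong
    rules = GRAMMAR[symbol]
    prob, rhs = rng.choices(rules, weights=[float(p) for p, _ in rules])[0]
    tokens = []
    for part in rhs:
        sub_tokens, sub_prob = derive(part, rng, budget)
        tokens.extend(sub_tokens)
        prob *= sub_prob                      # p(s) is the product over the rules used
    return tokens, prob


def generate(seed=None, max_expansions=200):
    """Sample one sentence, retrying when the derivation grows too large."""
    rng = random.Random(seed)
    while True:
        try:
            return derive("SP", rng, [max_expansions])
        except TooLong:
            continue


def render(tokens):
    return " ".join(tokens).replace(" ,", ",").replace(" .", ".")


# Probability of the slide's example: R1, R4 (sing, air, hills), R2 (dance, fields).
EXAMPLE_PROBABILITY = prod([Fraction(1), Fraction(1, 4), Fraction(1, 3),
                            Fraction(1, 6), Fraction(1, 6), Fraction(1, 4),
                            Fraction(1, 3), Fraction(1, 6)])

tokens, p = generate(seed=0)
print(render(tokens), p)
print(EXAMPLE_PROBABILITY)  # 1/31104
```

Exact `Fraction` arithmetic keeps the probability of a derivation readable as a product of the rule probabilities instead of a rounded float.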

C.LabbéQ (UJF-LIG) Ike Antkare & Co 19 Décembre 2014 7 / 28 Preliminaries SCIgen a Probabilistic Context Free Grammar

SCIgen 2005 by J. Stribling, M. Krohn & D. Aguayo

... maximize amusement, rather than coherence ...

Titre Introduction Model Impl Eval RelatedWork Concl References

Intro_A Intro_A2 Intro_A3 Intro_closing

ntro_A Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN,... I ! ntro_A In recent years, much research has been devoted to the SCI_ACT; ,... I ! ntro_A SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until... I ! ntro_A The SCI_ACT is a SCI_ADJSCI_PROBLEM. I ! ntro_A The SCI_ACT has SCI_VERBEDSCI_THING_MOD, and current trends... I ! ntro_A The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have... I ! ...... !

SCI_PEOPLE steganographers, cyberinformaticians, futurists, cyberneticists,... ! SCI_BUZZWORD_ADJ omniscient, introspective, peer to peer, ambimorphic,... !

Chronos

[Timeline figure, 2004 to 2014. Entries: Web of Science (Thomson Reuters); Scopus (Elsevier); h-index; SCIgen; PoP V1.0; "Abiteboul par l'administrateur du Collège de France".]

Ike Antkare, one of the great stars in the scientific firmament

Building a farm [Labbé, 2010]

[Figure: citation farm built with a modified SCIgen. Labels: Real Documents; Ike Antkare's 101 Documents (numbered 0, 1, ..., 100).]

Ike Antkare h-index [Labbé, 2010]

Chronos

[Timeline figure, 2004 to 2014. Entries: Web of Science (Thomson Reuters); Scopus (Elsevier); h-index; SCIgen; PoP V1.0; Ike Antkare; "Abiteboul par l'administrateur du Collège de France".]

Get cited or Perish

Conclusion

                            Completeness   Accuracy      Robustness
Google Scholar (free)       Good           Good enough   Spammable
WoK / Scopus (fee-based)    Incomplete     Error-free    Excellent

A scholar/scientist would never commit fraud like that...

Detection of SCIgen papers: Google Search

Phrase search and More Like This (IEEE, http://www.computer.org)

Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
In recent years, much research has been devoted to the SCI_ACT; ...
SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...

Corpus name   Downloaded from    Years       Type of papers   Number of papers   Acceptance rate   Corpus size
MLT           IEEE (ieee.org)    2008-2010   various          122                NA                122
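Phrase search over the SCIgen sentence templates listed above can be approximated offline with regular expressions. The sketch below is only an illustration and not the search engine queries used in the talk: each `SCI_*` placeholder becomes a short lazy wildcard, and the names `TEMPLATES`, `GAP`, `template_to_regex` and `matching_templates`, as well as the sample sentence, are assumptions.

```python
import re

# Sentence openers taken from the slide; SCI_* placeholders stand for grammar symbols.
TEMPLATES = [
    "Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN,",
    "In recent years, much research has been devoted to the SCI_ACT;",
    "SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until",
    "The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends",
    "The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have",
]

PLACEHOLDER = re.compile(r"SCI_[A-Z_]+")
GAP = r"\w[\w\s-]*?"   # one or more words, matched lazily


def template_to_regex(template):
    """Turn a template into a regex where each placeholder matches a few words."""
    token_patterns = []
    for token in template.split():
        # A token may mix a placeholder with punctuation, e.g. "SCI_GENERIC_NOUN,".
        pieces = PLACEHOLDER.split(token)
        token_patterns.append(GAP.join(re.escape(piece) for piece in pieces))
    return re.compile(r"\s+".join(token_patterns), re.IGNORECASE)


PATTERNS = [template_to_regex(t) for t in TEMPLATES]


def matching_templates(text):
    """Return the indices of the templates found somewhere in `text`."""
    return [i for i, pattern in enumerate(PATTERNS) if pattern.search(text)]


# Hypothetical sentence in the style of a generated introduction.
sample = ("Many steganographers would agree that, had it not been for "
          "neural networks, the study of caches might never have occurred.")
print(matching_templates(sample))
```

A hit on one template is only a hint, since genuine papers may contain similar phrasing; the distance-based classification presented later in the talk is the more robust test.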

Spotting the fake:

Corpus name   Downloaded from    Years       Type of papers   Number of papers   Acceptance rate   Corpus size
MLT           IEEE (ieee.org)    2008-2010   various          122                NA                122
Corpus Z      Conf. Web Site     2010        Track 1          58                 18.4%             153
                                             Track 2          33                 16.1%
                                             Track 3          36
                                             Demo             32                 36%
Ike           SCIgen             2009-2010   -                100                100%              100

Extract txt from pdf (without the references section).
Compute the distance matrix (on raw txt files) and build a dendrogram.
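The first preprocessing step above (dropping the references section) can be sketched as follows. This assumes the PDF has already been converted to plain text by some external tool, which is not shown, and that the references start at a heading line reading "References" or "Bibliography". The function name `strip_references` and that heading heuristic are assumptions, not the pipeline used by the author.

```python
import re

# A line containing only "References" or "Bibliography", optionally numbered.
REFERENCES_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text):
    """Return `text` truncated just before its last references heading, if any."""
    matches = list(REFERENCES_HEADING.finditer(text))
    if not matches:
        return text                      # no heading found: keep the text unchanged
    return text[: matches[-1].start()].rstrip() + "\n"


# Hypothetical extracted text.
paper = "Intro\nBody text\n\nReferences\n[1] A. Author. Some title.\n"
print(strip_references(paper))
```

Cutting at the last matching heading avoids truncating a paper whose body happens to contain a line reading "References" earlier on.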

Detection of SCIgen papers: Automatic classification

Intertextual Distance [Labbé and Labbé, 2006]

Relative frequencies of (chat, le, un):
  A: {le le chat}     $(\frac{1}{3}, \frac{2}{3}, \frac{0}{3})$
  B: {un chat chat}   $(\frac{2}{3}, \frac{0}{3}, \frac{1}{3})$

[Figure: the tokens of A (le le chat) and B (un chat chat) shown side by side, with weights 1/3 and 2/3.]

Intertextual Distance:
$$D(A,B) = \frac{1}{2} \sum_{i \in (A \cup B)} |f_{i,A} - f_{i,B}| = \frac{2}{3}$$

Interpretation: $D(A,B)$ = the proportion of word tokens that are different in the two texts.
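The distance above can be checked numerically with a short Python sketch. It follows the formula as shown, using relative word frequencies. The slide's example has two texts of equal length, and how the full method of [Labbé and Labbé, 2006] treats texts of different lengths is not shown here, so dividing each count by its own text length is an assumption. The function name `intertextual_distance` is illustrative.

```python
from collections import Counter


def intertextual_distance(tokens_a, tokens_b):
    """Half the sum of absolute differences between relative word frequencies."""
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)          # i ranges over A ∪ B
    return 0.5 * sum(abs(freq_a[w] / n_a - freq_b[w] / n_b) for w in vocabulary)


text_a = ["le", "le", "chat"]
text_b = ["un", "chat", "chat"]
print(intertextual_distance(text_a, text_b))  # approximately 2/3
```

The value ranges from 0 for identical frequency profiles to 1 for texts sharing no word, which is what makes a fixed threshold on it meaningful.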

Hierarchical Clustering [Labbé and Labbé, 2013]

Initial distance matrix:

      A     B     C     D
A     0     0.2   0.3   0.5
B     0.2   0     0.4   0.6
C     0.3   0.4   0     0.3
D     0.5   0.6   0.3   0

A and B are clustered in a group I:
$$D_{(I,x)} = \frac{1}{2} \left( D_{(A,x)} + D_{(B,x)} \right)$$

      I     C     D
I     0     0.35  0.55
C     0.35  0     0.3
D     0.55  0.3   0

C and D are clustered in a group J:
$$D(I,J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} D(i,j)$$

      I     J
I     0     0.45
J     0.45  0
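The worked example above can be reproduced with a small agglomerative clustering sketch. It assumes average linkage, meaning the distance between two groups is the mean of the pairwise distances between their members, which is the reading that reproduces the values 0.35, 0.55 and 0.45 on the slide. The names `average_linkage`, `LABELS`, `MATRIX` and `DIST` are illustrative.

```python
from itertools import combinations

LABELS = ["A", "B", "C", "D"]
MATRIX = [
    [0.0, 0.2, 0.3, 0.5],
    [0.2, 0.0, 0.4, 0.6],
    [0.3, 0.4, 0.0, 0.3],
    [0.5, 0.6, 0.3, 0.0],
]
# Pairwise distances between single texts, keyed by unordered pair.
DIST = {
    frozenset((LABELS[i], LABELS[j])): MATRIX[i][j]
    for i in range(len(LABELS))
    for j in range(i + 1, len(LABELS))
}


def average_linkage(labels, dist):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(label,) for label in labels]
    merges = []

    def between(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((c1, c2, between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


for left, right, height in average_linkage(LABELS, DIST):
    print(left, right, round(height, 2))
```

The merge history (pairs and heights) is exactly the information a dendrogram draws; libraries such as SciPy offer the same linkage for larger corpora.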

Dendrogram (Z, MLT, Ike)

[Figure: dendrogram of the three corpora; distance axis from 0.0 to 0.7; leaf groups labelled SCIgen, Corpus Z, MLT.]

SCIgen Detection: proposed method
http://scigendetection.imag.fr

Corpus           Downloaded from    Years   Field              Corpus size
arXiv (1)        arxiv.org          08–10   Computer Science   15338
MLT              ieee.org           08–10   Computer Science   122
SCIgen-Origin    Original SCIgen    –       Computer Science   236
SCIgen-Physics   Modified SCIgen    –       Physics            414

(1) open repository for scholarly papers

Let $t$ be a text under test, and let $D_{Fake}(t)$ be the distance between $t$ and the nearest fake.

If $D_{Fake}(t) < 0.55$, then SCIgen origin must be seriously considered (misclassification risk $< 10^{-5}$).
Else ($D_{Fake}(t) > 0.55$), non-SCIgen origin must be seriously considered.
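The decision rule above amounts to a nearest-neighbour test against a corpus of known fakes. The sketch below illustrates it under stated assumptions: the 0.55 threshold comes from the slide, while the distance function (relative word frequencies), the names `classify` and `nearest_fake_distance`, and the toy corpus are illustrative and not the code behind scigendetection.imag.fr.

```python
from collections import Counter

THRESHOLD = 0.55  # value given on the slide


def intertextual_distance(tokens_a, tokens_b):
    """Half the sum of absolute differences between relative word frequencies."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a[w] / n_a - freq_b[w] / n_b) for w in vocabulary)


def nearest_fake_distance(tokens, fake_corpus):
    """Distance between the text under test and the closest known fake."""
    return min(intertextual_distance(tokens, fake) for fake in fake_corpus)


def classify(tokens, fake_corpus, threshold=THRESHOLD):
    """Return (label, distance): 'SCIgen' when the nearest fake is closer than the threshold."""
    distance = nearest_fake_distance(tokens, fake_corpus)
    label = "SCIgen" if distance < threshold else "non-SCIgen"
    return label, distance


# Toy corpus of "known fakes" (hypothetical, far smaller than the real corpora).
fakes = [["le", "le", "chat"], ["many", "futurists", "would", "agree"]]
print(classify(["le", "chat", "le"], fakes))
print(classify(["un", "chat", "chat"], fakes))
```

The rule only needs distances to known generated texts, so supporting a new generator mainly means adding samples of its output to the fake corpus.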

Detection website
http://scigendetection.imag.fr

Demonstration site for the paper [Labbé and Labbé, 2013].
Input: MyConf.zip containing pdf files.
Output: the class (SCIgen / non-SCIgen) of each pdf, dendrogram, duplicates, ...
Used in production.

Automatic use (since February 2014): number of archives submitted > 51000 (number of papers tested > 100000).

Curative use: detection of already-published SCIgen papers: 120 IEEE, 16 Springer, 3 HAL.

Scopus, WoK, ...

[Timeline figure, 2004 to 2014. Entries: Web of Science (Thomson Reuters); Scopus (Elsevier); h-index; SCIgen; PoP V1.0; Ike Antkare; Nature; "Abiteboul par l'administrateur du Collège de France".]

Related/Ongoing Work

Spoofing [Beel and Gipp, 2010, Lopez-Cozar et al., 2012], Academic optim. [Beel et al., 2010];

Detecting methods: Bib. based [Xiong and Huang, 2009], Compression [Dalkilic et al., 2006], ad-hoc dist. [Lavoie and Krishnamoorthy, 2010], Phrase search [Springer, 2014], Structural distances between texts [Fahrenberg et al., 2014].

No SCIgen paper in arXiv (Computer Science).
[Figure: image borrowed from [Ginsparg, 2014]; PCA, only stop-words. Supposed non-Zipfian.]

Conclusion and Future/Ongoing work

Publication procedures, models and habits:
  Why fake papers were accepted, published and ... sold.
  Traditional publishers vs open access.
  Knowledge diffusion: better and less... or as much as possible.

Blind management rules...
  ... are an incentive to malpractice: salami slicing, plagiarism, faked data, ...

Automatic detection of new generators:
  Hand-written PCFG: find dense clusters inside a population.
  Study other kinds of generators (language models).

On the web today:
  Automatic knowledge extraction/detection/generation.
  How to separate the wheat from the chaff... and scale up!

Thanks

References

Beel, J. and Gipp, B. (2010). Academic search engine spam and google scholar's resilience against it. Journal of Electronic Publishing, 13(3).

Beel, J., Gipp, B., and Wilde, E. (2010). Academic search engine optimization (aseo). Journal of scholarly publishing, 41(2):176–190.

Dalkilic, M. M., Clark, W. T., Costello, J. C., and Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.

Fahrenberg, U., Biondi, F., Corre, K., Jégourel, C., Kongshøj, S., and Legay, A. (2014). Measuring structural distances between texts. CoRR, abs/1403.4024.

Ginsparg, P. (2014). Automated screening: Arxiv screens spot fake papers. 508(7494):44–44.

Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Science, 102:16569–16572.

Labbé, C. (2010). Ike antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2):48–52.

Labbé, C. and Labbé, D. (2006). A tool for literary studies: intertextual distance and tree classification. Literary and Linguistic Computing, 21(3):311–326.

Labbé, C. and Labbé, D. (2013). Duplicate and fake publications in the scientific literature: how many scigen papers in computer science? Scientometrics, 94(1):379–396.

Lavoie, A. and Krishnamoorthy, M. (2010). Algorithmic Detection of Computer Generated Text. ArXiv e-prints.

Lopez-Cozar, E. D., Robinson-García, N., and Torres-Salinas, D. (2012). Manipulating google scholar citations and google scholar metrics: Simple, easy and tempting. arXiv preprint arXiv:1212.0638.

Xiong, J. and Huang, T. (2009). An effective method to identify machine automatically generated paper. In Knowledge Engineering and Software Engineering, 2009. KESE '09. Pacific-Asia Conference on, pages 101–102.
