Ike Antkare: Genesis and Echoes

Cyril Labbé

Université Joseph Fourier - LIG

October 2014

C.Labbé (UJF-LIG) Ike Antkare & Co October 2014

Table of Contents

1 Preliminaries
  Scientometrics
  SCIgen, a Probabilistic Context Free Grammar

2 Ike Antkare, one of the great stars in the scientific firmament

3 Detection of SCIgen papers: June 2012
  Google Search
  Automatic classification

1 Preliminaries

Scientometrics

Ranking scientists and journals

Definition (h-index [Hirsch, 2005]): A scientist has index h if h of his or her $N_p$ papers have at least h citations each and the other $(N_p - h)$ papers have at most h citations each.

[Figure: plot relating "Number of Papers" (marked 0, h, $N_p$) and "Citations" (marked h).]

Definition (Impact Factor): Average number of citations to papers published by the journal over the last two years. Computed since 1975.

[Figure: time after publication, with a 2-year window marked.]
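The h-index definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any citation tool named in these slides; the function name `h_index` and the sample citation counts are made up for the example.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the first `rank` papers all have at least `rank` citations
        else:
            break      # later papers are cited even less, so h cannot grow
    return h


# Hypothetical citation counts for five papers.
print(h_index([10, 8, 5, 4, 3]))
```

Sorting first makes the check a single pass: once a paper has fewer citations than its rank, no later paper can raise h.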

Tools that count citations

Toll-based tools.
  Provided by publishers (..., Thomson Reuters);
  Based on publishers' catalogs (ACM, IEEE, Springer, Elsevier, ...);
  Selected venues only ( all peer reviewed).

Free tools: ..., CiteSeerX, ...
  Crawling the web and/or selected catalogs and/or added by users;
  Social media (Google+, Scholarometer, Microsoft Academics, ...).

Free tools that compute indicators:
  Publish or Perish; Scholarometer; Microsoft Academics; Google+; and many more...

Chronos

[Figure: timeline from 2004 to 2015 placing Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, the h-index, PoP V1.0, and "Abiteboul by the administrator of the Collège de France".]

Tools to generate publications.

SCIgen, a Probabilistic Context Free Grammar

PCFG: Probabilistic Context Free Grammar

Sets of symbols
  Set of non-terminal symbols: N = {SP, S, V, P}
  Set of terminal symbols: Σ = {".", sing, dance, flight, seas, oceans, air, streets, hills, fields}

Set of rules R_i
  R_1:     SP → S .                                  p(R_1) = 1
  R_2:     S → We shall V in the P                   p(R_2) = 1/4
  R_3:     S → We shall V in the P, S                p(R_3) = 1/2
  R_4:     S → We shall V in the P and in the P, S   p(R_4) = 1/4
  R_5..7:  V → sing | dance | flight                 p(R_i) = 1/3, i = 5..7
  R_8..13: P → seas | oceans | air | streets | hills | fields   p(R_i) = 1/6, i = 8..13

Annotation next to the rules for S: "Non zero probability to 1".

Terminal string example:
  s: We shall sing in the air and in the hills, We shall dance in the fields.
  $p(s) = \prod_{j} p(R_j)$, the product running over the rules $R_j$ used to derive s.
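A minimal Python sketch of this toy grammar may help make the rule probabilities concrete. It is an illustration only, not SCIgen's implementation: the names `GRAMMAR`, `generate`, `render` and `derivation_probability` are invented here, and multi-word fragments such as "We shall" are treated as single terminals for brevity.

```python
import random
from fractions import Fraction

# Toy grammar from the slide: non-terminal -> list of (probability, right-hand side).
GRAMMAR = {
    "SP": [(Fraction(1), ["S", "."])],
    "S": [
        (Fraction(1, 4), ["We shall", "V", "in the", "P"]),
        (Fraction(1, 2), ["We shall", "V", "in the", "P", ",", "S"]),
        (Fraction(1, 4), ["We shall", "V", "in the", "P", "and in the", "P", ",", "S"]),
    ],
    "V": [(Fraction(1, 3), [w]) for w in ("sing", "dance", "flight")],
    "P": [(Fraction(1, 6), [w])
          for w in ("seas", "oceans", "air", "streets", "hills", "fields")],
}


def generate(symbol="SP", rng=random):
    """Expand `symbol` at random; return (terminals, probability of the derivation)."""
    if symbol not in GRAMMAR:          # anything without a rule is a terminal
        return [symbol], Fraction(1)
    rules = GRAMMAR[symbol]
    weights = [float(p) for p, _ in rules]
    prob, rhs = rng.choices(rules, weights=weights, k=1)[0]
    words = []
    for part in rhs:
        sub_words, sub_prob = generate(part, rng)
        words.extend(sub_words)
        prob *= sub_prob               # p(s) is the product of the rule probabilities
    return words, prob


def render(words):
    """Join terminals into a sentence, attaching punctuation to the previous word."""
    return " ".join(words).replace(" ,", ",").replace(" .", ".")


def derivation_probability(rule_probs):
    """Multiply the probabilities of the rules used in one derivation."""
    result = Fraction(1)
    for p in rule_probs:
        result *= p
    return result


# Rules used for the slide's example string:
# R_1, R_4, V->sing, P->air, P->hills, R_2, V->dance, P->fields.
EXAMPLE_RULES = [Fraction(1), Fraction(1, 4), Fraction(1, 3), Fraction(1, 6),
                 Fraction(1, 6), Fraction(1, 4), Fraction(1, 3), Fraction(1, 6)]

if __name__ == "__main__":
    words, p = generate("SP", random.Random(0))
    print(render(words), p)
    print(derivation_probability(EXAMPLE_RULES))
```

Under this reading of the rules, the example string uses eight rule applications, so its probability is 1 · 1/4 · 1/3 · 1/6 · 1/6 · 1/4 · 1/3 · 1/6 = 1/31104.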

SCIgen: 2005 by J. Stribling, M. Krohn & D. Aguayo

... maximize amusement, rather than coherence ...

Title Introduction Model Impl Eval RelatedWork Concl References

Intro_A Intro_A2 Intro_A3 Intro_closing

Intro_A → Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
Intro_A → In recent years, much research has been devoted to the SCI_ACT; ...
Intro_A → SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
Intro_A → The SCI_ACT is a SCI_ADJ SCI_PROBLEM.
Intro_A → The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
Intro_A → The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
... → ...

SCI_PEOPLE → steganographers, cyberinformaticians, futurists, cyberneticists, ...
SCI_BUZZWORD_ADJ → omniscient, introspective, peer to peer, ambimorphic, ...

SCIgen example

Chronos

[Figure: timeline from 2004 to 2015 placing Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, the h-index, PoP V1.0, "Abiteboul by the administrator of the Collège de France", and now SCIgen.]

2 Ike Antkare, one of the great stars in the scientific firmament

SCIgen texts citing SCIgen texts [Labbé, 2010]

[Figure: a modified SCIgen generates Ike Antkare's 101 documents (numbered 0 to 100), shown alongside real documents, with citation links between them.]

Ike Antkare's h-index according to Google Scholar (2010)

Chronos

[Figure: timeline from 2004 to 2015 placing Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, the h-index, PoP V1.0, "Abiteboul by the administrator of the Collège de France", SCIgen, and now Ike Antkare.]

Get cited or Perish

Conclusion

                          Completeness  Accuracy     Robustness
Google Scholar (free)     Good          Good enough  Spammable
WoK / Scopus (fee-based)  Incomplete    Error free   Excellent

A scholar/scientist would never commit fraud like that...

3 Detection of SCIgen papers: June 2012

Google Search

Phrase search and More Like This (IEEE, http://www.computer.org)

Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
In recent years, much research has been devoted to the SCI_ACT; ...
SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...

Corpus name  Downloaded from  Years      Type of papers  Number of papers  Acceptance rate  Corpus size
MLT          IEEE (ieee.org)  2008–2010  various         122               NA               122
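One way to picture the phrase-search idea is to split a SCIgen template on its placeholders: the literal fragments become quoted search phrases, and the whole template becomes a pattern for checking candidate sentences. The sketch below is an assumption about how this could be done, not the procedure actually used for the MLT corpus; `fixed_phrases`, `template_regex` and the sample sentence are hypothetical.

```python
import re

# A placeholder is any SCI_* symbol; templates are assumed to separate them by spaces.
PLACEHOLDER = re.compile(r"SCI_[A-Z_]+")


def fixed_phrases(template, min_words=3):
    """Return the literal fragments of a template that are long enough to search for."""
    fragments = [part.strip(" ,;.") for part in PLACEHOLDER.split(template)]
    return [f for f in fragments if len(f.split()) >= min_words]


def template_regex(template):
    """Compile a template into a regex where each placeholder matches a short span."""
    literal_parts = [re.escape(part) for part in PLACEHOLDER.split(template)]
    return re.compile(r"(.+?)".join(literal_parts))


TEMPLATE = "Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN,"

# Hypothetical sentence in the style of a generated paper.
SENTENCE = ("Many steganographers would agree that, had it not been for "
            "the Internet, the study of compilers might never have occurred.")

if __name__ == "__main__":
    print(fixed_phrases(TEMPLATE))
    match = template_regex(TEMPLATE).search(SENTENCE)
    print(match.groups() if match else None)
```

The fragments returned by `fixed_phrases` could be pasted into a search engine as exact-phrase queries; the regex is only a local check on text already retrieved.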

Automatic classification

Intertextual Distance [Labbé and Labbé, 2006]

A: {le le chat}, relative frequencies of (chat, le, un): (1/3, 2/3, 0/3)
B: {un chat chat}, relative frequencies of (chat, le, un): (2/3, 0/3, 1/3)

[Figure: the word frequencies of A and B compared word by word (un, chat, le).]

Intertextual Distance: $D(A,B) = \frac{1}{2} \sum_{i \in (A \cup B)} |f_{i,A} - f_{i,B}| = \frac{2}{3}$

Interpretation:

D(A,B) = the proportion of word tokens that are different in the two texts.
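The two-text example above can be reproduced with a short Python sketch. It follows the slide's formula on relative frequencies only; any extra handling the published method applies (for instance to texts of very different lengths) is not modelled here, and the function names are chosen for this illustration.

```python
from collections import Counter
from fractions import Fraction


def relative_frequencies(tokens):
    """Map each word to its share of the tokens in the text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: Fraction(count, total) for word, count in counts.items()}


def intertextual_distance(tokens_a, tokens_b):
    """Half the sum of absolute differences in relative word frequency."""
    freq_a = relative_frequencies(tokens_a)
    freq_b = relative_frequencies(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)   # words of A union words of B
    total = sum(abs(freq_a.get(w, 0) - freq_b.get(w, 0)) for w in vocabulary)
    return total / 2


if __name__ == "__main__":
    a = "le le chat".split()
    b = "un chat chat".split()
    print(intertextual_distance(a, b))
```

Using `Fraction` keeps the result exact, so the slide's example gives |1/3 − 2/3| + |2/3 − 0| + |0 − 1/3| = 4/3, halved to 2/3.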

SCIgen Detection: proposed method
http://scigendetection.imag.fr

Corpus          Downloaded from  Years  Field             Corpus size
arXiv [1]       arxiv.org        08–10  Computer Science  15338
MLT             ieee.org         08–10  Computer Science  122
SCIgen-Origin   Original SCIgen  –      Computer Science  236
SCIgen-Physics  Modified SCIgen  –      Physics           414

Let t be a text under test and let $D_t^{Fake}$ be the distance between t and the nearest fake.

If $D_t^{Fake} < 0.55$, then SCIgen origin must be seriously considered (misclassification risk $< 10^{-5}$).
Else ($D_t^{Fake} > 0.55$), non-SCIgen origin must be seriously considered.

[1] open repository for scholarly papers
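The decision rule above amounts to a nearest-neighbour test against a set of known fakes. The Python sketch below shows that logic under stated assumptions: tokenisation is a plain lowercase whitespace split, the distance is the relative-frequency formula from the previous slide, and `TOY_FAKES` is a made-up stand-in for the SCIgen corpora. It is not the code behind scigendetection.imag.fr.

```python
from collections import Counter

THRESHOLD = 0.55  # cut-off quoted on the slide


def distance(tokens_a, tokens_b):
    """Half the summed absolute difference of relative word frequencies."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    vocabulary = set(counts_a) | set(counts_b)
    return sum(abs(counts_a[w] / size_a - counts_b[w] / size_b)
               for w in vocabulary) / 2


def nearest_fake_distance(text, fakes):
    """Distance between `text` and the closest known fake."""
    tokens = text.lower().split()
    return min(distance(tokens, fake.lower().split()) for fake in fakes)


def scigen_origin_suspected(text, fakes):
    """True when the nearest fake is closer than the threshold.

    A distance of exactly 0.55 is treated as non-SCIgen here; the slide
    does not say how that boundary case is handled.
    """
    return nearest_fake_distance(text, fakes) < THRESHOLD


# Hypothetical reference set; a real one would hold full generated papers.
TOY_FAKES = [
    "we shall sing in the air",
    "we shall dance in the fields",
]

if __name__ == "__main__":
    for candidate in ("we shall sing in the hills", "le le chat"):
        print(candidate, scigen_origin_suspected(candidate, TOY_FAKES))
```

Only the minimum distance matters, so the reference set can grow (for instance with output from a modified generator) without changing the rule.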

Dendrogram (Z, MLT, SCIgen)

[Figure: dendrogram with a distance axis from 0.0 to 0.7; leaves are marked by origin: SCIgen, Corpus Z, MLT.]

Chronos

[Figure: timeline from 2004 to 2015 placing Web of Science (Thomson Reuters), Scopus (Elsevier), Google Scholar, the h-index, PoP V1.0, "Abiteboul by the administrator of the Collège de France", SCIgen, Ike Antkare, and now Nature and "Scopus, WoK, ...".]

Related/Ongoing Work

Spoofing [Beel and Gipp, 2010, Lopez-Cozar et al., 2012]; academic search engine optimization [Beel et al., 2010].

Detection methods: bibliography-based [Xiong and Huang, 2009], compression [Dalkilic et al., 2006], ad hoc distance [Lavoie and Krishnamoorthy, 2010], phrase search [Springer, 2014].

No SCIgen paper in arXiv (Computer Science).

[Figure: image borrowed from [Ginsparg, 2014]; PCA, only stop-words. Supposed non-Zipfian.]

Conclusion and Future/Ongoing works

Publication procedures, models and habits
  Why fake papers were accepted, published and ... sold.
  Traditional publishers vs. open access.
  Knowledge diffusion: better and less... or as much as possible.
  Automatic knowledge extraction/detection/generation.

Blind management rules...
  ... are an incentive to malpractice: slicing, plagiarism, faked data, ...

Automatic detection of new generators
  Hand-written PCFG: find dense clusters inside a population.
  Study other kinds of generators (language models).

On the web today
  How to separate the wheat from the chaff... and scale up!

Thanks

Beel, J. and Gipp, B. (2010). Academic search engine spam and Google Scholar's resilience against it. Journal of Electronic Publishing, 13(3).

Beel, J., Gipp, B., and Wilde, E. (2010). Academic search engine optimization (ASEO). Journal of Scholarly Publishing, 41(2):176–190.

Dalkilic, M. M., Clark, W. T., Costello, J. C., and Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.

Ginsparg, P. (2014). Automated screening: ArXiv screens spot fake papers. Nature, 508(7494):44–44.

Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102:16569–16572.

Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2):48–52.

Labbé, C. and Labbé, D. (2006). A tool for literary studies: intertextual distance and tree classification. Literary and Linguistic Computing, 21(3):311–326.

Labbé, C. and Labbé, D. (2013). Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics, 94(1):379–396.

Lavoie, A. and Krishnamoorthy, M. (2010). Algorithmic Detection of Computer Generated Text. ArXiv e-prints.

Lopez-Cozar, E. D., Robinson-García, N., and Torres-Salinas, D. (2012). Manipulating Google Scholar citations and Google Scholar metrics: Simple, easy and tempting. arXiv preprint arXiv:1212.0638.

Xiong, J. and Huang, T. (2009). An effective method to identify machine automatically generated paper. In Knowledge Engineering and Software Engineering, 2009. KESE '09. Pacific-Asia Conference on, pages 101–102.