Automatic analysis of scientific articles

Cyril Labbé, Université Grenoble Alpes - LIG - Sigma team, June 25, 2019

C. Labbé (UGA-LIG), Ike Antkare & Co, June 25, 2019

Table of Contents

1 Why write?

2 Publications and scientometrics
    Scientometrics: what for?
    SCIgen, a Probabilistic Context-Free Grammar

3 Of the use of fake publications
    h-index hacking
    Resume padding
    Journal hijacking

4 Detection of SCIgen papers
    Google Search
    SciDetect: Automatic detection

5 Automatic detection of questionable research papers
    Fact checking science
    Seek & Blastn tool

Why write? To build scientific knowledge

The ancestors (1665): in London, the Philosophical Transactions of the Royal Society; in Paris, the Journal des sçavans.

Specific features of scientific publications: a readership of specialists, and contributions to the "scientific debate" through original work.

Why write? Publishing an article

[Figure: the traditional publication model. Labels: Scientist (writes the paper, here "Computing Machinery and Intelligence"); © transfer; Editor (supervision); Peers (scientists, peer evaluation); Publisher (print and distribute); Readers and libraries (access fee).]

Why write? New scientific information systems

A large number of information sources:
    the catalogues of scientific publishers
    open archives and social networks

Information with varied characteristics:
    paid or free access: public, restricted or private
    peer-reviewed or not

For varied purposes:
    state of the art / bibliometrics / scientometrics

The scientific article is at the heart of the system:
    How can the validity of the information presented be guaranteed?
    How can its quality be guaranteed?
    Are some systems more virtuous than others?

Publications and scientometrics

Scientometrics: what for? Ranking scientists and journals

Definition (Impact Factor): average number of citations to papers published by the journal over the last two years. Computed since 1975.

[Figure: citations as a function of time after publication, with a 2-year window.]

Definition (h-index [Hirsch, 2005]): a scientist has index $h$ if $h$ of his or her $N_p$ papers have at least $h$ citations each and the other $(N_p - h)$ papers have $\le h$ citations each.

[Figure: number of citations per paper, papers ranked from 0 to $N_p$, with $h$ marked on both axes.]
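The h-index definition above can be checked with a few lines of Python. This is a minimal sketch: the function name `h_index` and the citation counts are made up for illustration, they do not come from the slides.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the first `rank` papers all have >= rank citations
        else:
            break         # later papers are cited even less
    return h


if __name__ == "__main__":
    example = [10, 8, 5, 4, 3, 0]  # hypothetical citation counts
    print(h_index(example))
```

Sorting first makes the definition direct to read: the index is the last rank at which the citation count is still at least the rank.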

Scientometrics: what for? Ranking universities, journals and scientists

Who asks what:
    Librarian: What are the must-buys for my readers?
    Scientist: Where shall I submit my research?
    Research administration: Who shall I hire? Who deserves a promotion?
    Students: Where to study? With whom? In which country?
    Government: Who deserves investment? What for? Which scientific field?

Indicators:
    Impact Factor: average number of citations (...) over the last two years. Computed since 1975.
    h-index and variations (http://sci2s.ugr.es/hindex): h5-index, g-index, hm-index, a-index, hg-index, ar-index...
    ARWU: Academic Ranking of World Universities (Shanghai ranking), since 2003.
    Collaborative distance

Scientometrics: what for? Quantitative rules

In France...
    "Publiant" (publishing researcher): at least 1 publication per year, or 2 rank-A publications over the period.
    "Produisant" (producing researcher): the arguments that allow a non-publishing person to be considered as producing.

... and elsewhere
    "at least one international publication per year"
    Rules for defense (MS thesis, PhD thesis)

Scientometrics: what for? Chronos

[Figure: timeline from 2004 to 2018. Labels: Web of Science; Scopus; h-index; PoP V1.0; V1.0; "Abiteboul par l'administrateur du Collège de France" (Abiteboul, by the administrator of the Collège de France); automatic text generation.]

SCIgen, a Probabilistic Context-Free Grammar

PCFG: Probabilistic Context-Free Grammar

Sets of symbols
    Set of non-terminal symbols: $N = \{SP, S, V, P\}$
    Set of terminal symbols: $\Sigma = \{$".", sing, dance, flight, seas, oceans, air, streets, hills, fields$\}$

Set of rules $R_i$
    $R_1$:       $SP \to S$ .                                             $p(R_1) = 1$
    $R_2$:       $S \to$ We shall $V$ in the $P$                          $p(R_2) = 1/4$
    $R_4$:       $S \to$ We shall $V$ in the $P$ and in the $P$, $S$      $p(R_4) = 1/4$
    $R_3$:       $S \to S$, $S$                                           $p(R_3) = 1/2$
    $R_{5..7}$:  $V \to$ sing | dance | flight                            $p(R_i) = 1/3$, $i = 5..7$
    $R_{8..13}$: $P \to$ seas | oceans | air | streets | hills | fields   $p(R_i) = 1/6$, $i = 8..13$

    Annotation on rules $R_2$, $R_4$, $R_3$: non-zero probability to $\infty$.

Terminal string example:
    $s$: We shall sing in the air and in the hills, We shall dance in the fields.
    $p(s) = \prod_j p(R_j)$
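To make the toy PCFG concrete, here is a small Python sampler. It is a sketch under stated assumptions: the rule shapes are my reading of the slide (in particular the trailing ", S" of rule R4), and `GRAMMAR`, `generate`, `sample_sentence` and the expansion budget are hypothetical names and choices. Under this reading each S produces 1.25 further S symbols on average, so a derivation may never terminate, which is presumably what the "non zero probability" annotation refers to; the sampler therefore gives up after a fixed number of expansions.

```python
import random
from fractions import Fraction

# Toy grammar as read from the slide (rule shapes are an assumption).
GRAMMAR = {
    "SP": [(Fraction(1), ["S", "."])],                                                   # R1
    "S": [
        (Fraction(1, 4), ["We shall", "V", "in the", "P"]),                              # R2
        (Fraction(1, 4), ["We shall", "V", "in the", "P", "and in the", "P", ",", "S"]),  # R4
        (Fraction(1, 2), ["S", ",", "S"]),                                               # R3
    ],
    "V": [(Fraction(1, 3), [w]) for w in ("sing", "dance", "flight")],                  # R5..R7
    "P": [(Fraction(1, 6), [w])
          for w in ("seas", "oceans", "air", "streets", "hills", "fields")],            # R8..R13
}


def generate(symbol, rng, budget):
    """Expand `symbol`; return (tokens, probability), or None if the budget runs out."""
    if symbol not in GRAMMAR:          # terminal text is emitted as-is
        return [symbol], Fraction(1)
    if budget[0] <= 0:                 # guard against non-terminating derivations
        return None
    budget[0] -= 1
    rules = GRAMMAR[symbol]
    prob, rhs = rng.choices(rules, weights=[float(p) for p, _ in rules])[0]
    tokens = []
    for child in rhs:
        result = generate(child, rng, budget)
        if result is None:
            return None
        child_tokens, child_prob = result
        tokens += child_tokens
        prob *= child_prob             # p(s) is the product of the rule probabilities used
    return tokens, prob


def sample_sentence(seed, max_expansions=200):
    """Sample one sentence; return (text, probability) or None if the budget was hit."""
    result = generate("SP", random.Random(seed), [max_expansions])
    if result is None:
        return None
    tokens, prob = result
    text = " ".join(tokens).replace(" ,", ",").replace(" .", ".")
    return text, prob


# Probability of the slide's example string under this reading: R1, R4, R2 plus word rules.
EXAMPLE_PROBABILITY = (Fraction(1) * Fraction(1, 4) * Fraction(1, 3) * Fraction(1, 6) ** 2
                       * Fraction(1, 4) * Fraction(1, 3) * Fraction(1, 6))

if __name__ == "__main__":
    for seed in range(5):
        print(sample_sentence(seed))
    print(EXAMPLE_PROBABILITY)
```

Using `Fraction` keeps the derivation probability exact, which makes it easy to compare with a hand computation.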

SCIgen, a Probabilistic Context-Free Grammar

SCIgen (2005), by J. Stribling, M. Krohn & D. Aguayo

"... maximize amusement, rather than coherence ..."

Paper structure: Title, Introduction, Model, Impl, Eval, RelatedWork, Concl, References

Introduction: Intro_A Intro_A2 Intro_A3 Intro_closing

    Intro_A → Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
    Intro_A → In recent years, much research has been devoted to the SCI_ACT; ...
    Intro_A → SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
    Intro_A → The SCI_ACT is a SCI_ADJ SCI_PROBLEM.
    Intro_A → The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
    Intro_A → The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
    ...
    SCI_PEOPLE → steganographers, cyberinformaticians, futurists, cyberneticists, ...
    SCI_BUZZWORD_ADJ → omniscient, introspective, peer to peer, ambimorphic, ...

SCIgen, a Probabilistic Context-Free Grammar: Chronos

[Figure: timeline from 2004 to 2014. Labels: Web of Science (Thomson Reuters); Scopus; h-index; PoP V1.0; "Abiteboul par l'administrateur du Collège de France" (Abiteboul, by the administrator of the Collège de France); SCIgen.]

Of the use of fake publications

h-index hacking: Building a farm [Labbé, 2010]

[Figure: a citation farm. Labels: Modified SCIgen; Ike Antkare's 101 documents (numbered 0, 1, ..., 100); real documents.]
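The effect of such a farm on the h-index can be illustrated with a toy model. The linking scheme below (every generated document cites every other generated document) is an assumption made for this sketch; the slide does not spell out the exact scheme used in [Labbé, 2010]. The names `build_farm`, `citation_counts` and `h_index` are hypothetical.

```python
def build_farm(n_docs):
    """Return {doc: [cited docs]} where each document cites all the others (assumed scheme)."""
    return {doc: [other for other in range(n_docs) if other != doc]
            for doc in range(n_docs)}


def citation_counts(references):
    """Count how many times each document is cited inside the farm."""
    counts = {doc: 0 for doc in references}
    for cited in references.values():
        for target in cited:
            counts[target] = counts.get(target, 0) + 1
    return counts


def h_index(citations):
    """Largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


if __name__ == "__main__":
    farm = build_farm(101)                      # 101 documents, as on the slide
    print(h_index(citation_counts(farm).values()))
```

In this model every document receives n - 1 citations, so a farm of n documents yields an h-index of n - 1 for its single author, provided the citation index crawls and counts all of them.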

h-index hacking: Ike Antkare h-index [Labbé, 2010]

h-index hacking: Chronos

[Figure: timeline from 2004 to 2018. Labels: Web of Science (Thomson Reuters); Scopus (Elsevier); h-index; PoP V1.0; "Abiteboul par l'administrateur du Collège de France" (Abiteboul, by the administrator of the Collège de France); SCIgen; Ike Antkare.]

Resume padding: IEEE Xplore, 12 Nov. 2014

Resume padding: IEEE Xplore, 2 Feb. 2016

Journal hijacking

Beware: hijacking. Jeffrey Beall, http://scholarlyoa.com

Journal hijacking: Publication, Gold Open Access

[Figure: the Gold Open Access publication model. Labels: Scientist (writes the paper, here "Computing Machinery and Intelligence", and pays a publication fee); © transfer; Editor (supervision); Peers (scientists, peer evaluation); Publisher (print and distribute); Readers and libraries (access for free).]

Journal hijacking: Beware of predatory publishers

"Get me off Your Fucking Mailing List", David Mazières and Eddie Kohler (New York University; University of California, Los Angeles), http://www.mailavenger.org/

[Figure: first page of the paper. The abstract and the introduction consist solely of the sentence "Get me off your fucking mailing list." repeated.]

Detection of SCIgen papers

Google Search: Phrase search

The constant parts of the SCIgen rules can serve as exact-phrase queries:

    Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ...
    In recent years, much research has been devoted to the SCI_ACT; ...
    SCI_THING_MOD and SCI_THING_MOD, while SCI_ADJ in theory, have not until ...
    The SCI_ACT has SCI_VERBED SCI_THING_MOD, and current trends ...
    The implications of SCI_BUZZWORD_ADJ SCI_BUZZWORD_NOUN have ...
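One way to turn such rule templates into phrase searches is to keep only the text between placeholders and quote it. The sketch below assumes placeholders are written with underscores (SCI_PEOPLE, SCI_ACT, ...); the helper names `constant_phrases` and `phrase_query` and the three-word minimum are hypothetical choices, not part of any search engine API.

```python
import re

# Placeholders such as SCI_PEOPLE or SCI_BUZZWORD_ADJ (underscore spelling assumed).
PLACEHOLDER = re.compile(r"\bSCI_[A-Z_]+\b")


def constant_phrases(template, min_words=3):
    """Return the fixed fragments of a template that are long enough to search for."""
    phrases = []
    for fragment in PLACEHOLDER.split(template):
        cleaned = fragment.strip(" ,;.")        # drop surrounding punctuation and ellipses
        if len(cleaned.split()) >= min_words:
            phrases.append(cleaned)
    return phrases


def phrase_query(template):
    """Build a query string made of double-quoted exact phrases."""
    return " ".join(f'"{phrase}"' for phrase in constant_phrases(template))


if __name__ == "__main__":
    rule = "Many SCI_PEOPLE would agree that, had it not been for SCI_GENERIC_NOUN, ..."
    print(phrase_query(rule))
```

Short fragments such as "Many" are discarded because they would match far too many genuine papers.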

SciDetect: Automatic detection

Intertextual distance [Labbé and Labbé, 2006]

    A = {le, le, chat}, relative frequencies of (chat, le, un): $(\frac{1}{3}, \frac{2}{3}, 0)$
    B = {un, chat, chat}, relative frequencies of (chat, le, un): $(\frac{2}{3}, 0, \frac{1}{3})$

[Figure: bar charts of the relative frequencies of "un", "chat" and "le" in A and B, built up in three steps.]

Intertextual distance: $D(A,B) = \frac{1}{2} \sum_{i \in (A \cup B)} |f_{i,A} - f_{i,B}| = \frac{2}{3}$

Interpretation: $D(A,B)$ is the proportion of different words (word tokens) in the two texts.
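The toy example above can be reproduced in Python. This sketch follows the slide's formula on relative frequencies; it is not the SciDetect implementation, and the function name `intertextual_distance` is an assumption.

```python
from collections import Counter
from fractions import Fraction


def intertextual_distance(tokens_a, tokens_b):
    """Half the sum, over the joint vocabulary, of absolute relative-frequency differences."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)
    total = sum(abs(Fraction(freq_a[w], n_a) - Fraction(freq_b[w], n_b))
                for w in vocabulary)
    return total / 2


if __name__ == "__main__":
    a = ["le", "le", "chat"]
    b = ["un", "chat", "chat"]
    print(intertextual_distance(a, b))  # 2/3, as on the slide
```

The distance is 0 for texts with identical word frequencies and 1 for texts that share no word.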

SciDetect: Automatic detection

Hierarchical clustering [Labbé and Labbé, 2013]

Initial distance matrix:

         A      B      C      D
    A    0      0.2    0.3    0.5
    B    0.2    0      0.4    0.6
    C    0.3    0.4    0      0.3
    D    0.5    0.6    0.3    0

A and B form the group I: $D(I,x) = \frac{1}{2} (D(A,x) + D(B,x))$

         I      C      D
    I    0      0.35   0.55
    C    0.35   0      0.3
    D    0.55   0.3    0

C and D form the group J: $D(I,J) = \frac{1}{|I||J|} \sum_{i \in I} \sum_{j \in J} D(i,j)$

         I      J
    I    0      0.45
    J    0.45   0
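The successive matrices above can be recomputed with a short average-linkage routine. This is a sketch on the slide's four-text example only; `DIST`, `group_distance` and `agglomerate` are hypothetical names, and a real corpus would call for a library implementation.

```python
from itertools import combinations

LABELS = ["A", "B", "C", "D"]

# Pairwise distances from the slide's initial matrix.
DIST = {
    frozenset("AB"): 0.2, frozenset("AC"): 0.3, frozenset("AD"): 0.5,
    frozenset("BC"): 0.4, frozenset("BD"): 0.6, frozenset("CD"): 0.3,
}


def group_distance(group_i, group_j):
    """Average of the pairwise distances between members of the two groups."""
    pairs = [(i, j) for i in group_i for j in group_j]
    return sum(DIST[frozenset((i, j))] for i, j in pairs) / len(pairs)


def agglomerate(labels):
    """Repeatedly merge the two closest groups; return the list of merges."""
    groups = [(label,) for label in labels]
    merges = []
    while len(groups) > 1:
        g1, g2 = min(combinations(groups, 2), key=lambda pair: group_distance(*pair))
        merges.append((g1, g2, group_distance(g1, g2)))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return merges


if __name__ == "__main__":
    for left, right, dist in agglomerate(LABELS):
        print(left, right, round(dist, 2))
```

On this example A and B merge first (0.2), then C and D (0.3), and the two groups end up 0.45 apart, matching the matrices above.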

SciDetect: Automatic detection

Automatic detection [Labbé and Labbé, 2013]

Intertextual distance: $D(a,b)$ = proportion of different words (tokens) in the two texts.

Let $t$ be a text to test, and $D_t = \min_{f \in \mathrm{Fake}} D(t,f)$.

    If $D_t <$ threshold, then a SCIgen generation must be considered (risk $< 10^{-5}$).
    Otherwise, a non-SCIgen origin must be considered.

[Figure: hierarchical clustering, distance axis from 0.0 to 0.7. Labels: SCIGen Corpus, Fake, Z, MLT.]
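The decision rule above amounts to a nearest-neighbour test against a corpus of known generated texts. The sketch below is a simplification under assumptions: whitespace tokenisation, a toy two-document corpus, and a threshold passed in by the caller (the slide does not give its value). `distance` and `classify` are hypothetical names, not the SciDetect API.

```python
from collections import Counter


def distance(tokens_a, tokens_b):
    """Intertextual distance on relative word frequencies (between 0 and 1)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)
    return sum(abs(freq_a[w] / n_a - freq_b[w] / n_b) for w in vocabulary) / 2


def classify(text, fake_corpus, threshold):
    """Return (label, nearest distance) for `text` against known generated texts."""
    tokens = text.lower().split()
    nearest = min(distance(tokens, fake.lower().split()) for fake in fake_corpus)
    label = "SCIgen-like" if nearest < threshold else "not SCIgen-like"
    return label, nearest


if __name__ == "__main__":
    corpus = ["we shall sing in the air", "we shall dance in the fields"]  # toy stand-in
    print(classify("we shall sing in the hills", corpus, threshold=0.5))   # hypothetical threshold
    print(classify("completely unrelated wording here", corpus, threshold=0.5))
```

In practice the threshold would be calibrated on distances observed between genuine papers and the generated corpus, which is what the risk bound on the slide refers to.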

SciDetect: Automatic detection

SCIgen papers and their clones

SSME: Int. Conf. on Services Science, Management and Engineering, 2009.
    IEEE Xplore, indexed in Scopus and WoK
    150 papers, 4 SCIgen and 1 duplicate
    Official acceptance rate: 28%

SCIgen inside (publishers):
    120 IEEE (retracted or deleted)
    16 Springer (retracted)
    1 Elsevier (accepted, unpublished)

SCIgen inside (social networks):
    http://www.researchgate.net
    http://scholar.harvard.edu
    http://www.academia.edu

Other generators:
    Mathgen (http://thatsmathematics.com/mathgen/)
    The Postmodernism Generator (http://www.elsewhere.org/pomo/)
    scigen-physics (https://bitbucket.org/birkenfeld/scigen-physics)
    Auto. SBIR Grant Proposal Generator (http://www.nadovich.com/chris/randprop/)

SciDetect: Automatic detection

In the international scientific and general press (2014)

    Publishers withdraw more than 120 gibberish papers
    Science publisher fooled by gibberish papers
    Publier ou périr: faux articles pour faux congrès (Publish or perish: fake papers for fake conferences)
    More Computer-Generated Nonsense Papers Pulled From Science Journals
    Fraudulent scientific papers published, then withdrawn
    Wissenschaftsverlag löscht 16 sinnfreie Artikel (Science publisher deletes 16 nonsensical articles)
    Ike Antkare, le grand scientifique qui n'existait pas (Ike Antkare, the great scientist who did not exist)
    How Gobbledygook Ended Up in Respected Scientific Journals
    How computer-generated fake papers are flooding academia
    Science Publishers Remove Papers Generated as a Hoax
    Wieder ließen Fachverlage Nonsens ungeprüft durchgehen (Once again, specialist publishers let nonsense through unchecked)
    Fake Research Papers: How Did More Than 120 'Gibberish' Computer-Generated Studies Get Published?

SciDetect: Automatic detection: Chronos

[Figure: timeline from 2004 to 2018. Labels: Web of Science (Thomson Reuters); Scopus (Elsevier); h-index; PoP V1.0; "Abiteboul par l'administrateur du Collège de France" (Abiteboul, by the administrator of the Collège de France); SCIgen; Ike Antkare; Nature.]

SciDetect: Automatic detection

No SCIgen paper in arXiv (Computer Science)

Automated screening: "ArXiv screens spot fake papers"
    Only stop-words
    PCA
    Supposed non-Zipfian

[Figure: image borrowed from [Ginsparg, 2014].]

SciDetect: Automatic detection

Publication: Self-Archiving (Green Open Access)

[Figure: the Green Open Access model. Labels: Scientist (writes the paper, here "Computing Machinery and Intelligence", and uploads it to an open archive); © transfer; Editor (supervision); Peers (scientists, peer evaluation); Publisher (print and distribute); Open Archive (for free); Readers and libraries.]

SciDetect: Automatic detection

Where to find pirated papers

LibGen

Sci-Hub (Alexandra Elbakyan)

Bohannon J, Elbakyan A (2016) Data from: Who’s downloading pirated papers? Everyone. Dryad Digital Repository. https://doi.org/10.5061/dryad.q447c

SciDetect: Automatic detection

Overlay journals (épi-journaux)

[Figure: the overlay journal model. Labels: Scientist (writes the paper, here "Computing Machinery and Intelligence", and uploads it to an open archive); Open Archive (for free); Overlay Network; Editor (supervision, provides links to selected papers); Peers (scientists, peer evaluation); Readers and libraries.]

SciDetect: Automatic detection

Springer-funded SciDetect: http://scidetect.forge.imag.fr

Press release, March 2015:
    "The open source software discovers text that has been generated with the SCIgen computer program and other fake-paper generators like Mathgen and Physgen."
    "SciDetect is highly flexible and can be quickly customized to cope with new methods of automatically generating fake or random text"

Does not cope with other problems:
    Peer review rings
    Paper mills
    Black market and authorship selling

Automatic detection of questionable research papers

Automatic detection of questionable research papers [Byrne and Labbé, 2017b, Byrne and Labbé, 2017a]

Scientific ethics:
    Plagiarism, self-plagiarism, content reuse...
    N-gram signatures (hashing functions)

Nonsense detection:
    Paper generators (SCIgen, physic-gen, MathGen...)
    Authorship detection (intertextual distance)

Need to detect questionable scientific results:
    Fabrication (making up data or results)
    Falsification (manipulating data or results)
    False or unsupported affirmations
    Genuine errors
    ⇒ Error spreading, wrong beliefs, research irreproducibility
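As an illustration of the "N-gram signature with hashing functions" idea for spotting content reuse, here is a minimal Python sketch. The slide only names the technique; the choices below (word 3-grams, truncated SHA-1 digests, Jaccard overlap) and the names `ngram_signature` and `overlap` are assumptions made for this example.

```python
import hashlib


def ngram_signature(text, n=3):
    """Return the set of hashed word n-grams of `text`."""
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # A truncated digest keeps the signature compact; collisions are unlikely at this scale.
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:16] for g in grams}


def overlap(text_a, text_b, n=3):
    """Jaccard overlap of the two signatures: 0.0 (nothing shared) to 1.0 (identical)."""
    sig_a, sig_b = ngram_signature(text_a, n), ngram_signature(text_b, n)
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)


if __name__ == "__main__":
    print(overlap("a b c d", "a b c e"))
```

Comparing hashed n-grams rather than raw text lets signatures be stored and compared cheaply across a large collection.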

Fact checking science

Starting point: striking similarities, obvious errors

Jennifer Byrne:
    First reported TPD52L2 (20 years ago)
    5 publications with obvious errors!

5 publications from China:
    Single gene knockdown experiments
    Human cancer cell lines

Conclusions highlight potential therapy:
    ...TPD52L2... novel therapeutic target for glioma treatment.
    ...TPD52L2... novel clues for oral squamous cell carcinoma therapy.
    ...TPD52L2... therapeutic approach for the treatment of breast cancer.
    ...TPD52L2 is indispensable in gastric cancer proliferation.
    ...TPD52L2 could be a novel therapeutic target for human liver cancer.

Fact checking science

Obvious errors: example

PMID: 25262828, Materials and methods:
    "The shRNA sequence (5'-GCGGAGGGTTTGAAAGAATATCTCGAGATATTCTTTCAAACCCTCCGCTTTTTT-3') targeting TPD52L2 (NM_199360) was inserted into the pFH-L plasmid (Shanghai Hollybio, China). A scrambled shRNA that shared no homology with the mammalian genome (5'-CTAGCCCGGCCAAGGAAGTGCAATTGCATACTCGAGTATGCAATTGCACTTCCTTGGTTTTTTGTTAAT-3') was used as control."

Fact-check using blastn (NCBI), SeqA (the targeting sequence):

    Query= SeqA (evalue = 10)   Length=54
    Sequences producing significant alignments: ...
    > ... Homo sapiens tumor protein D52 like 2 (TPD52L2), ...   Length=2230
    Query 1     GCGGAGGGTTTGAAAGAATAT  21
                |||||||||||||||||||||
    Sbjct 894   GCGGAGGGTTTGAAAGAATAT  914
    ...
    Query 28    ATATTCTTTCAAACCCTCCGC  48
                |||||||||||||||||||||
    Sbjct 914   ATATTCTTTCAAACCCTCCGC  894

Fact-check using blastn (NCBI), SeqD (the control sequence):

    Query= SeqD (evalue = 10)   Length=68
    Sequences producing significant alignments: ...
    > ... Homo sapiens NIN1/PSMD8 binding protein 1 homolog (NOB1) ...   Length=1775
    Query 9     GCCAAGGAAGTGCAATTGCATA  30
                ||||||||||||||||||||||
    Sbjct 1505  GCCAAGGAAGTGCAATTGCATA  1526
    ...
    Query 37    TATGCAATTGCACTTCCTTGG  57
                |||||||||||||||||||||
    Sbjct 1526  TATGCAATTGCACTTCCTTGG  1506

Summary:
    SeqA targets gene TPD52L2 (21/21).
    SeqD, described as a non-targeting control, targets gene NOB1 (22/22).
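The two alignments reported for SeqA (query positions 1 to 21 on the plus strand, 28 to 48 on the minus strand) reflect the hairpin structure of an shRNA: the second arm is the reverse complement of the first. A short Python check, using only the sequence quoted on the slide; the helper name `reverse_complement` is an assumption of this sketch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


# Targeting shRNA sequence quoted on the slide, joined across its line break.
SEQ_A = "GCGGAGGGTTTGAAAGAATATCTC" "GAGATATTCTTTCAAACCCTCCGCTTTTTT"

sense = SEQ_A[0:21]        # query positions 1..21
loop = SEQ_A[21:27]        # query positions 22..27
antisense = SEQ_A[27:48]   # query positions 28..48

if __name__ == "__main__":
    print(sense, loop, antisense)
    print(reverse_complement(sense) == antisense)
```

This explains why blastn reports the same subject interval (894 to 914) twice, once in each orientation.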

Seek & Blastn tool

Seek & Blastn at a glance

Input text (Materials and methods, PMID 25262828): "The shRNA sequence (5'-GCG...TTT-3') targeting TPD52L2 (NM_199360) was inserted into the pFH-L plasmid (Shanghai Hollybio, China). A scrambled shRNA that shared no homology with the mammalian genome (5'-CTA...AAT-3') was used as control."

(1) Facts extraction: named entity recognition, extraction of the nucleotide sequences and of their status.

    Facts to check
    Status       DNA Seq
    Targeting    GCG...TTT
    Non-targ.    CTA...AAT

(2) Blastn call: the software gives the hit list.

(3) Comparison of the facts to check with the hit lists (Blastn results).

    Checked facts
    Status       DNA Seq      Hit list
    Targ.        GCG...TTT    TPD52L2, ...
    Non-targ.    CTA...AAT    NOB1, ...
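Steps (1) and (3) can be sketched in Python. This is an illustration, not the Seek & Blastn implementation: the regular expression, the keyword cues and the function names are assumptions, and step (2) is replaced by a hand-written `hits` dictionary holding the genes reported on the earlier slide instead of a real blastn call.

```python
import re

SEQUENCE = re.compile(r"5'-([ACGT]+)-3'")
NON_TARGETING_CUES = ("scrambled", "no homology", "non-targeting", "control")  # assumed cues


def extract_claims(text):
    """Step (1): find nucleotide sequences and the status claimed in the same sentence."""
    claims = []
    for sentence in re.split(r"(?<=\.)\s+", text):
        lowered = sentence.lower()
        for match in SEQUENCE.finditer(sentence):
            if any(cue in lowered for cue in NON_TARGETING_CUES):
                status, target = "non-targeting", None
            else:
                status = "targeting"
                gene = re.search(r"targeting\s+([A-Za-z0-9]+)", sentence)
                target = gene.group(1) if gene else None
            claims.append({"sequence": match.group(1), "status": status, "target": target})
    return claims


def check_claims(claims, hits):
    """Step (3): compare each claim with the genes hit by its sequence."""
    problems = []
    for claim in claims:
        genes = hits.get(claim["sequence"], [])
        if claim["status"] == "non-targeting" and genes:
            problems.append((claim["sequence"], "claimed non-targeting but hits " + ", ".join(genes)))
        elif claim["status"] == "targeting" and claim["target"] not in genes:
            problems.append((claim["sequence"], "claimed target not among hits"))
    return problems


SEQ_A = "GCGGAGGGTTTGAAAGAATATCTCGAGATATTCTTTCAAACCCTCCGCTTTTTT"
SEQ_D = "CTAGCCCGGCCAAGGAAGTGCAATTGCATACTCGAGTATGCAATTGCACTTCCTTGGTTTTTTGTTAAT"
TEXT = (
    "The shRNA sequence (5'-" + SEQ_A + "-3') targeting TPD52L2 (NM_199360) was inserted "
    "into the pFH-L plasmid (Shanghai Hollybio, China). A scrambled shRNA that shared no "
    "homology with the mammalian genome (5'-" + SEQ_D + "-3') was used as control."
)

if __name__ == "__main__":
    hits = {SEQ_A: ["TPD52L2"], SEQ_D: ["NOB1"]}  # stand-in for the blastn hit lists
    for problem in check_claims(extract_claims(TEXT), hits):
        print(problem)
```

Keyword cues of this kind are brittle, which is one reason to analyse the sentences in more depth to avoid false positives.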

Seek & Blastn tool

Ambiguities: polysemy, homonymy, structural ambiguity, ...

    Le président a le pouvoir de faire taire l'avocat. / Je ne vais pas pouvoir manger l'avocat. ("avocat": lawyer or avocado; "pouvoir": noun or verb)
    L'été à l'est a été très beau et l'est toujours. ("été": summer or been; "l'est": the east or is so)

    Je suis le secrétaire. ("suis": am or follow)
    Je vais à la grange et la ferme. ("la ferme": the farm or close it)

    Il poursuit la jeune fille à vélo. (who is on the bike?)
    Il a vu un homme avec un télescope. (who has the telescope?)
    Tous les participants prendront un bus. (one shared bus, or one bus each?)

Seek & Blastn tool

Seek & Blastn

Related works:
    Detection of statistically flawed papers
    Fake news detection

Seek & Blastn perspectives:
    Online tool: http://scigendetection.imag.fr/TPD52
    Avoid false positives, more in-depth analysis of sentences.

Retractions, error corrections:
    Retractions (≈ 18), expressions of concern (≈ 11), ≈ 45 to be treated
    Citation analysis (to be done)

Seek & Blastn tool: Chronos

[Figure: timeline from 2004 to 2018. Labels: Web of Science (Thomson Reuters); Scopus (Elsevier); h-index; PoP V1.0; "Abiteboul par l'administrateur du Collège de France" (Abiteboul, by the administrator of the Collège de France); SCIgen; Ike Antkare; Nature.]

Conclusion and future/ongoing work

Publication procedures, models and habits:
    Why fake papers were accepted, published and ... sold.
    Traditional publishers vs. open access.
    Knowledge diffusion: better and less... or as much as possible.

Blind management rules...
    ... are an incentive for malpractice: slicing, plagiarism, faked data, ...

Automatic detection of new generators:
    Hand-written PCFG: find dense clusters inside a population.
    Study other kinds of generators (language models).

On the web today:
    Automatic knowledge extraction/detection/generation.
    How to separate the wheat from the chaff... and scale up!

Thanks

References

Amancio, D. R. (2015). Comparing the topological properties of real and artificially generated scientific manuscripts. Scientometrics, 105(3):1763–1779.

Beel, J. and Gipp, B. (2010). Academic search engine spam and Google Scholar's resilience against it. Journal of Electronic Publishing, 13(3).

Beel, J., Gipp, B., and Wilde, E. (2010). Academic search engine optimization (ASEO). Journal of Scholarly Publishing, 41(2):176–190.

Byrne, J. A. and Labbé, C. (2017a). Fact checking nucleotide sequences in life science publications: The Seek & Blastn tool. In International Congress on Peer Review and Scientific Publication, Enhancing the quality and credibility of science, Chicago.

Byrne, J. A. and Labbé, C. (2017b). Striking similarities between publications from China describing single gene knockdown experiments in human cancer cell lines. Scientometrics, 110(3):1471–1493.

Dalkilic, M. M., Clark, W. T., Costello, J. C., and Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.

Fahrenberg, U., Biondi, F., Corre, K., Jégourel, C., Kongshøj, S., and Legay, A. (2014). Measuring structural distances between texts. CoRR, abs/1403.4024.

Ginsparg, P. (2014). Automated screening: ArXiv screens spot fake papers. Nature, 508(7494):44–44.

Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, 102:16569–16572.

Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2):48–52.

Labbé, C. and Labbé, D. (2006). A tool for literary studies: intertextual distance and tree classification. Literary and Linguistic Computing, 21(3):311–326.

Labbé, C. and Labbé, D. (2013). Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics, 94(1):379–396.

Lavoie, A. and Krishnamoorthy, M. (2010). Algorithmic Detection of Computer Generated Text. ArXiv e-prints.

Lopez-Cozar, E. D., Robinson-García, N., and Torres-Salinas, D. (2012). Manipulating Google Scholar citations and Google Scholar metrics: Simple, easy and tempting. arXiv preprint arXiv:1212.0638.

Xiong, J. and Huang, T. (2009). An effective method to identify machine automatically generated paper. In KESE '09. Pacific-Asia Conference, pages 101–102.