Distribution of Maximal Repeats from Tagged Sequential Data

2019/7/19. 23
AI Summer Program: Hadoop Map&Reduce Programming for Big Traffic Data Management Applications using the Class Frequency Distribution of Maximal Repeats from Tagged Sequential Data

Dr. Jing-Doo Wang (王經篤), Asia University (亞洲大學)

Chinese proverb: "Lao Wang sells melons" (『老王』賣瓜). Is it sweet and juicy?
From: http://www.9ht.com/xue/44228.html
From: http://www.pxmart.com.tw/px/ingredients.px?id=2592

Outline
• Introduction
– What is “Sequential Data”?
– A scalable approach of Maximal Repeat Extraction
• Applications with Tagged Sequential Data
– Analyzing Text Archaeology.
– Extracting Significant Travel Time Interval
– Mining for Biomarker.
– Improving Quality Control.
• Future Works

What is “Sequential Data”?
• Textual Data: News, Journal Articles, etc.
– http://edition.cnn.com/2017/11/22/health/jfk-assassination-back-pain/index.html
– From: https://www.udn.com/news/story/7266/2834500
– https://www.ncbi.nlm.nih.gov/pubmed/24372032
• Genomic Sequences
– From: http://blogs.nature.com/naturejobs/2015/10/08/big-data-the-impact-of-the-human-genome-project/
• Traffic Transportation
– https://tptis2015.blogspot.tw/2015/07/300-brt.html
– https://tptis2015.blogspot.tw/2017/10/blog-post.html
– https://attach.mobile01.com/640x480/attach/201312/mobile01-b004e8fd829e35140b3de0d91e847953.jpg
• Product Traceability
– http://www.slideshare.net/5045033/ss-1002323
– http://technews.tw/2016/04/11/tsmc-and-largan/

It's a Big Data problem!
How can we mine these “sequential data”?

What kind of “features” can be extracted from Sequential Data?
What kind of “Mineral” do you want (to mine)?

A scalable approach of Maximal Repeat Extraction
Journal of Supercomputing, April 2016
https://link.springer.com/article/10.1007/s11227-016-1713z?wt_mc=internal.event.1.SEM.ArticleAuthorOnlineFirst

Why use “Maximal Repeats” as features?
• Dictionary
– How to identify new words or phrases?
– e.g. “just do it”, “洪荒之力” (“prehistoric power”).
• N-gram (K-mers)
– 2-gram, 3-gram, …, 5-gram. (Google Ngram Viewer)
– The value of “N” is limited.
• Maximal Repeat
– The length of a maximal repeat is variable.
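The N-gram limitation in the list above can be illustrated with a minimal Python sketch. The helper name `ngram_counts` and the toy text are assumptions made for illustration; they are not taken from the talk. The sketch counts character N-grams with a fixed window, so any repeated phrase longer than N is only ever seen in fragments.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every contiguous substring of exactly n characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    # Hypothetical toy text; the repeated phrase "just do it" is 10 characters long.
    text = "just do it. they say just do it!"

    for n in (2, 3, 5):
        repeated = [gram for gram, count in ngram_counts(text, n).items() if count > 1]
        # Every reported gram has exactly n characters, so with n <= 5 the
        # 10-character phrase only appears as overlapping fragments.
        print(n, sorted(repeated))

    # Only a window of exactly the right size sees the whole phrase,
    # and that size is not known in advance.
    print(ngram_counts(text, 10)["just do it"])
```

A variable-length feature such as a maximal repeat avoids having to pick the window size up front.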
Example: Maximal Repeat
Pattern: “xabcyiiizabcqabcyrxar”
• Not maximal repeats
– ab
– bc
• Maximal repeats
– abc
– abcy
Here “ab” is always followed by “c” and “bc” is always preceded by “a”, so each can be extended without losing an occurrence; “abc” and “abcy” cannot.

Distinctive Pattern Mining (1)
Sequences and their classes. The classes are labeled by domain experts.
S1:********************************
S2:*********#****?***********@*****
S3:********************$***********
S4:*****&*******%******************
S5:********************$***********
S6:*********#****?************@****
S7:*****&*******%******************
S8:********************************
S9:*****&*******%******************
S10:*********#****?************@****
S11:********************************

Distinctive Pattern Mining (2)
The same sequences grouped by class:
********************************
********************************
********************************

*********#****?************@****
*********#****?************@****
*********#****?************@****

*****&*******%******************
*****&*******%******************
*****&*******%******************

********************$***********
********************$***********

Distinctive Pattern Mining (3)
Maximal repeats and their class frequency:
• #****?
• @****
• &*******%
• $**********
• *****

U.S. Patent Application
From: https://www.google.com/patents/US20170255634
Patent Publication Date: Sep. 7, 2017

Applications with Tagged Sequential Data
• Trend Analysis via Text Archaeology.
• Extracting Significant Travel Time Intervals from Gantry Timestamped Sequences.
• Mining for Biomarkers from Genomic Sequences.
• Improving Quality Control via Product Traceability.
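The maximal repeat example and the class frequency idea from the Distinctive Pattern Mining slides can be sketched in Python as follows. This is a brute-force illustration for toy inputs, not the scalable Hadoop approach the talk refers to. It assumes the common textbook definition (a substring that occurs at least twice and whose occurrences differ in both the preceding and the following character), and it assumes that "class frequency" means the number of sequences per class that contain the pattern. The function names `maximal_repeats` and `class_frequency` and the tagged toy data are hypothetical.

```python
from collections import Counter, defaultdict


def maximal_repeats(sequences, min_count=2):
    """Brute-force maximal repeats over a list of strings (toy sizes only).

    A substring is kept when it occurs at least `min_count` times and its
    occurrences differ in both the preceding and the following character,
    so no one-character extension covers every occurrence.
    """
    contexts = defaultdict(list)  # substring -> [(left, right), ...]
    for sid, seq in enumerate(sequences):
        n = len(seq)
        for i in range(n):
            for j in range(i + 1, n + 1):
                # Sequence boundaries act as unique context symbols.
                left = seq[i - 1] if i > 0 else ("^", sid)
                right = seq[j] if j < n else ("$", sid)
                contexts[seq[i:j]].append((left, right))
    repeats = {}
    for sub, ctx in contexts.items():
        if len(ctx) < min_count:
            continue
        lefts = {left for left, _ in ctx}
        rights = {right for _, right in ctx}
        if len(lefts) > 1 and len(rights) > 1:
            repeats[sub] = len(ctx)  # total number of occurrences
    return repeats


def class_frequency(tagged):
    """For each maximal repeat, count the sequences of each class containing it.

    `tagged` is a list of (class_label, sequence) pairs (assumed input shape).
    """
    patterns = maximal_repeats([seq for _, seq in tagged])
    return {
        p: dict(Counter(label for label, seq in tagged if p in seq))
        for p in patterns
    }


if __name__ == "__main__":
    found = maximal_repeats(["xabcyiiizabcqabcyrxar"])
    print(sorted(found))

    # Hypothetical tagged sequences: two classes, each with its own motif.
    tagged = [("c1", "xxabyy"), ("c1", "zzabww"),
              ("c2", "qqcdrr"), ("c2", "sscdtt")]
    table = class_frequency(tagged)
    print(table["ab"], table["cd"])
```

On the slide's string, this sketch reports "abc" and "abcy" (together with a few shorter repeats such as "a" and "i") and rejects "ab" and "bc". In the toy tagged data, "ab" appears only in class c1 and "cd" only in class c2, which is the kind of skewed class frequency distribution that marks a pattern as distinctive.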
From: http://www.mdpi.com/2076-3417/7/9/878

Superhighway
From: http://chiangchiafeng.tian.yam.com/posts/70456997

e-Tag
http://news.u-car.com.tw/article/16077

Electronic Toll Collection (ETC) system of the national freeways of the Republic of China
From: https://i.ytimg.com/vi/1ML2FFS2dJg/maxresdefault.jpg
https://attach.mobile01.com/640x480/attach/201312/mobile01-b004e8fd829e35140b3de0d91e847953.jpg

Gantry Sequences of different Vehicle Types (VT)
Gantry Timestamp Sequences with Timestamps
Gantry Timestamp Sequences with Timestamps for different Vehicle Types

Significant Time Intervals of Vehicles
Gantry, time:
05F0055N,13:33
05F0287N,13:15
05F0309N,13:13
05F0438N,13:06
05F0528N,13:00

Significant Time Intervals of Vehicles
Encoded gantry timestamps:
05F0528N_13_M1_00
05F0438N_13_M1_06
05F0309N_13_M1_13
05F0287N_13_M1_15
05F0055N_13_M1_33
Significant time intervals: ##4 ##5 ##
Class Frequency Distribution:
(2016-11-15_Mon_41#1#1)
(2016-11-29_Mon_41#1#1)
(2016-12-09_Thu_31#1#1)
(2016-12-20_Mon_31#1#1)

Weekday vs. 24 hours per day
Vehicle Types vs. 24 hours per day
Significant Patterns of Travel Time Intervals of Vehicles

Future Works

Cloud Computing Environment
• 1+5 cluster nodes
• 2+8 cluster nodes

Artificial Intelligence
• Cloud Computing
• Machine Learning
• Big Data

Leverage principle (槓桿原理)
The science of ancient Greece: Archimedes' fulcrum for lifting the Earth.
[Figure: a lever diagram. The fulcrum is Maximal Repeat Extraction with Class Frequency Distribution. Labels: Domain Knowledge (Expert), Relationship?, Labels (Tags), Sequential Data, Infrastructure (Cloud Computing).]
From: https://phycat.files.wordpress.com/2015/03/leverbigcorners.gif?w=810
Illustration: 紀玲玉

Acknowledgements (Precision Medicine)
• Jeffrey J.P. Tsai (蔡進發), President of Asia University
Project title: Precision Cancer Medicine Research Based on Biomedical Big Data Analysis (2/3)
Project number: MOST 106-2632-E-468-002
Project period: 106/08/01~107/07/31 (ROC calendar, i.e. 2017/08/01 to 2018/07/31)

Acknowledgements (Bioinformatics)
• Charles C.N. Wang
• Tsung-Chi Chen
• Wen-Ling Chan
• Rouh-Mei Hu
• Jan-Gowth Chang
• Yi-Chun Wang

Acknowledgements (Traffic Information Analysis)
• Director 黃銘崇
• Prof. 連耀南
• Prof. 潘信宏
• Prof. 何承遠

Acknowledgements (Big-Data: Hadoop Computing)
• Jazz Wang (王耀聰)
• Philip Lin (林奇暻)
• Wei-Chiu Chuang (莊偉赳)
– Apache Hadoop Committer/PMC member

Acknowledgements
• Hadoop Cluster Set Up and Consulting
– SYSTEX 精誠資訊 (2017)
• Herb Hsu (徐啟超)
– Athemaster 炬識科技股份有限公司 (2018)
• Ferrari
• Mr. 黃仁德, Information Development Office, Asia University (亞洲大學資訊發展處)

『老王』賣瓜，自賣自誇
Lao Wang, selling melons, praises his own goods.

Thank you for listening!
