Lecture 14 Introduction to Recurrent Neural Networks (Part 2)

STAT 479: Deep Learning, Spring 2019 Sebastian Raschka http://stat.wisc.edu/~sraschka/teaching/stat479-ss2019/

(Figure: a single-layer RNN and a multilayer RNN, shown unfolded over the time steps t-1, t, t+1, with inputs x, hidden layers h (h_1, h_2 in the multilayer case), and outputs y.)

Each hidden unit receives 2 inputs: the input x^{(t)} at the current time step and the hidden-layer activation h^{(t-1)} from the previous time step.

(Figure: the same single-layer and multilayer RNNs, unfolded over time.)

Weight matrices in a single-hidden layer RNN

(Figure: the unfolded single-hidden-layer RNN with the input-to-hidden weights W_{hx}, the hidden-to-hidden (recurrent) weights W_{hh}, and the hidden-to-output weights W_{yh}. The hidden-layer weights can be written jointly as W_h = [W_{hh}; W_{hx}].)


Net input:

z_h^{(t)} = W_{hx} x^{(t)} + W_{hh} h^{(t-1)} + b_h,    z_y^{(t)} = W_{yh} h^{(t)} + b_y

Activation:

h^{(t)} = \sigma_h\big(z_h^{(t)}\big)

Output:

y^{(t)} = \sigma_y\big(z_y^{(t)}\big)
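A minimal sketch of these equations as a forward pass over a whole sequence (the choices sigma_h = tanh and sigma_y = softmax are assumptions made here for concreteness; in practice torch.nn.RNN implements this loop for you):

import torch

def rnn_forward(x_seq, W_hx, W_hh, b_h, W_yh, b_y):
    h = torch.zeros(W_hh.shape[0])                 # h^(0), the initial hidden state
    outputs = []
    for x_t in x_seq:                              # one step per element of the sequence
        z_h = W_hx @ x_t + W_hh @ h + b_h          # net input of the hidden layer
        h = torch.tanh(z_h)                        # activation h^(t)  (sigma_h assumed tanh)
        z_y = W_yh @ h + b_y                       # net input of the output layer
        outputs.append(torch.softmax(z_y, dim=0))  # output y^(t)      (sigma_y assumed softmax)
    return outputs, h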

Backpropagation through time

(Figure: the unfolded RNN with a separate loss term L^{(t-1)}, L^{(t)}, L^{(t+1)} computed at each time step.)


Werbos, Paul J. "Backpropagation through time: what it does and how to do it." Proceedings of the IEEE 78, no. 10 (1990): 1550-1560.

The loss is computed as the sum over all time steps:

L = \sum_{t=1}^{T} L^{(t)}


The gradient of the loss at time step t with respect to the recurrent weights W_{hh} is

\frac{\partial L^{(t)}}{\partial W_{hh}} = \frac{\partial L^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial h^{(t)}} \cdot \left( \sum_{k=1}^{t} \frac{\partial h^{(t)}}{\partial h^{(k)}} \cdot \frac{\partial h^{(k)}}{\partial W_{hh}} \right)


The factor \partial h^{(t)} / \partial h^{(k)} is computed as a multiplication of adjacent time steps:

\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}}

This is very problematic: for distant time steps the product contains many factors, so the gradient tends to either vanish or explode (the vanishing/exploding gradient problem).
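A tiny illustration of why the repeated multiplication is problematic; the random matrices below are stand-ins for the adjacent-step Jacobians, not an actual RNN:

import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.5):
    jacobian = rng.normal(scale=scale, size=(10, 10)) / np.sqrt(10)
    product = np.eye(10)
    for _ in range(50):                 # 50 "time steps"
        product = product @ jacobian    # product of adjacent-step factors
    print(scale, np.linalg.norm(product))

# With small factors the norm collapses toward 0 (vanishing gradient),
# with large factors it blows up (exploding gradient).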

A good resource that explains backpropagation through time nicely:

Boden, Mikael. "A guide to recurrent neural networks and backpropagation." the Dallas project (2002).

Solutions to the vanishing/exploding gradient problems

1) Gradient clipping: set a maximum value for the gradients if they grow too large (solves only the exploding gradient problem)

2) Truncated backpropagation through time (TBPTT): simply limits the number of time steps the signal can backpropagate after each forward pass; e.g., even if the sequence has 100 elements/steps, we may only backpropagate through the most recent 20 or so (a minimal PyTorch sketch of 1) and 2) follows after this list)

3) Long short-term memory (LSTM): uses a memory cell for modeling long-range dependencies and avoids the vanishing gradient problem

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
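As referenced above, a minimal sketch of gradient clipping and truncated backpropagation through time in PyTorch. The names model, optimizer, loss_fn, and chunks are assumed placeholders, and the model is assumed to be an LSTM whose hidden state is an (h, c) tuple:

import torch

# Assumed setup: `model` is a recurrent torch.nn.Module, `optimizer` a torch.optim optimizer,
# `loss_fn` a loss function, and `chunks` yields (inputs, targets) pieces of a long sequence,
# each only a few time steps long (this chunking is the "truncation" in TBPTT).

hidden = None
for inputs, targets in chunks:
    optimizer.zero_grad()
    outputs, hidden = model(inputs, hidden)
    # TBPTT: detach the hidden state so gradients cannot flow further back than this chunk
    hidden = tuple(h.detach() for h in hidden)
    loss = loss_fn(outputs, targets)
    loss.backward()
    # Gradient clipping: rescale the gradients if their norm exceeds a maximum value
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()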

Long short-term memory (LSTM)

LSTM cell:

(Figure: the LSTM cell. It contains a forget gate f, an input gate i, an input node g, and an output gate o, each with its own weight matrices and bias, e.g. W_{fx}, W_{fh}, b_f for the forget gate. The cell state C and the activation h are passed on to the next time step, and h is also passed to the next layer. The cell takes the place of the hidden unit in the single-layer and multilayer RNN architectures shown earlier.)

In the cell diagram: the incoming C is the cell state at the previous time step and the outgoing C is the cell state at the current time step; the incoming h is the activation from the previous time step and the outgoing h is the activation for the next time step. The \odot symbols denote the element-wise multiplication operator, \oplus the element-wise addition operator, and the \sigma units are logistic sigmoid activation functions.

"Forget gate": controls which information is remembered, and which is forgotten; can reset the cell state:

f_t = \sigma\big(W_{fx} x^{(t)} + W_{fh} h^{(t-1)} + b_f\big)

Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. "Learning to forget: Continual prediction with LSTM." (1999): 850-855.

"Input gate":

i_t = \sigma\big(W_{ix} x^{(t)} + W_{ih} h^{(t-1)} + b_i\big)

"Input node":

g_t = \tanh\big(W_{gx} x^{(t)} + W_{gh} h^{(t-1)} + b_g\big)

Brief summary of the gates so far: the forget gate, input node, and input gate are combined to update the cell state:

C^{(t)} = \big(C^{(t-1)} \odot f_t\big) \oplus \big(i_t \odot g_t\big)

Output gate, for updating the values of the hidden units:

o_t = \sigma\big(W_{ox} x^{(t)} + W_{oh} h^{(t-1)} + b_o\big)

The updated activation (passed to the next layer and to the next time step) is then

h^{(t)} = o_t \odot \tanh\big(C^{(t)}\big)
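Putting the gate equations together, here is a minimal sketch of one LSTM time step written directly from the formulas above. The parameter names follow the slide notation (W_fx, b_f, ...); in practice torch.nn.LSTM or torch.nn.LSTMCell implement this for you:

import torch

def lstm_step(x_t, h_prev, C_prev, p):
    # p: dict of weight matrices and bias vectors, named as on the slides
    f_t = torch.sigmoid(p["W_fx"] @ x_t + p["W_fh"] @ h_prev + p["b_f"])   # forget gate
    i_t = torch.sigmoid(p["W_ix"] @ x_t + p["W_ih"] @ h_prev + p["b_i"])   # input gate
    g_t = torch.tanh(p["W_gx"] @ x_t + p["W_gh"] @ h_prev + p["b_g"])      # input node
    o_t = torch.sigmoid(p["W_ox"] @ x_t + p["W_oh"] @ h_prev + p["b_o"])   # output gate
    C_t = C_prev * f_t + i_t * g_t          # new cell state
    h_t = o_t * torch.tanh(C_t)             # new activation / hidden state
    return h_t, C_t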

Long short-term memory (LSTM)

• Still popular and widely used today

• A recent, related approach is the Gated Recurrent Unit (GRU):
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

• Nice article exploring LSTMs and comparing them to GRUs:
Jozefowicz, Rafal, Wojciech Zaremba, and Ilya Sutskever. "An empirical exploration of recurrent network architectures." In International Conference on Machine Learning, pp. 2342-2350. 2015.

GRU image source: https://en.wikipedia.org/wiki/Gated_recurrent_unit#/media/File:Gated_Recurrent_Unit,_base_type.svg

RNNs with LSTMs in PyTorch

Conceptually simple; the (very) hard part is the data processing pipeline.

# Variant 1: loop over the time steps explicitly with an LSTMCell
...
self.rnn = torch.nn.LSTMCell(input_size, hidden_size)
...
def forward(self, x):
    embedded = self.embedding(x)
    h = self.initial_hidden_state()
    for x_t in embedded:
        h = self.rnn(x_t, h)
    ...

# Variant 2: let torch.nn.LSTM iterate over the whole sequence internally
...
self.rnn = torch.nn.LSTM(input_size, hidden_size)
...
def forward(self, x):
    h_0 = self.initial_hidden_state()
    output, h = self.rnn(x, h_0)
    ...

These two are equivalent, but the bottom one is substantially faster.
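For context, here is a minimal, self-contained sketch of how these pieces are typically wired together for a small sequence classifier. The class name RNNClassifier and all hyperparameter values are made up for illustration; the tutorials linked on the following slides show complete versions:

import torch

class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.rnn = torch.nn.LSTM(embedding_dim, hidden_dim)
        self.fc = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, text):
        # text: LongTensor of word indices, shape (seq_len, batch_size)
        embedded = self.embedding(text)            # (seq_len, batch_size, embedding_dim)
        output, (h_n, c_n) = self.rnn(embedded)    # h_n: (1, batch_size, hidden_dim)
        return self.fc(h_n[-1])                    # classify from the last hidden state

model = RNNClassifier(vocab_size=20000, embedding_dim=128, hidden_dim=256, num_classes=2)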

Different Types of Sequence Modeling Tasks

(Figure: many-to-one, one-to-many, and two many-to-many architectures.)

Figure based on:

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy (http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

Many-to-One

(Figure: the unfolded single-hidden-layer RNN from before; in the many-to-one setting only the output of the final time step is used.)

Many-to-One ("Word"-RNN)

1 - Simple Sentiment Analysis (Simple RNN)
https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb

2 - Updated Sentiment Analysis (LSTM)
https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb

Optional:

Using TorchText with Your Own Datasets
https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/A%20-%20Using%20TorchText%20with%20Your%20Own%20Datasets.ipynb

4 - Convolutional Sentiment Analysis
https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/4%20-%20Convolutional%20Sentiment%20Analysis.ipynb

Many-to-One ("Character"-RNN)

Classifying Names with a Character-level RNN https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

(Figure: the sequence-modeling task diagrams again, annotated with "one", "training", and "generating new text"; the following slides walk through the "training" and "generating new text" modes.)

Generating Names with a Character-level RNN https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

At each time step, the softmax output assigns a probability to each possible "next letter".

For each input, ignore the prediction but use the "correct" next letter from the dataset.

(Figure: the many-to-many setup used during "training".)

Generating Names with a Character-level RNN https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
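A minimal sketch of this training scheme for a character-level model. The names model, optimizer, and the integer-encoded text encoded are assumed to exist from the data-processing steps described later:

import torch
import torch.nn.functional as F

inputs  = encoded[:-1].unsqueeze(1)   # all characters except the last, shape (seq_len, 1)
targets = encoded[1:]                 # the "correct" next letter for every position

optimizer.zero_grad()
logits, _ = model(inputs)             # per-time-step scores over the character vocabulary
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets)
loss.backward()
optimizer.step()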


To generate new text, sample from the softmax outputs and feed the sampled letter back in as the input at the next time step.

(Figure: the one-to-many setup used for "generating new text".)

Generating Names with a Character-level RNN https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
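A minimal sketch of this sampling loop. The mappings char2int / int2char and the trained model (taking an input and a hidden state, returning logits and the new hidden state) are assumed to exist:

import torch

def sample_text(model, start_char, length, char2int, int2char):
    model.eval()
    chars = [start_char]
    hidden = None
    x = torch.tensor([[char2int[start_char]]])             # shape (1, 1): one step, one sequence
    with torch.no_grad():
        for _ in range(length):
            logits, hidden = model(x, hidden)               # scores for the next letter
            probs = torch.softmax(logits.view(-1), dim=0)   # softmax output over the vocabulary
            idx = torch.multinomial(probs, num_samples=1)   # sample instead of taking the argmax
            chars.append(int2char[idx.item()])
            x = idx.view(1, 1)                              # feed the sampled letter back in
    return "".join(chars)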

Note that this approach works with both Word- and Character-RNNs.

Additional Tutorials: https://github.com/bentrevett/pytorch-seq2seq

(Figure: the many-to-many architecture used for sequence-to-sequence translation.)

Translation with a Sequence to Sequence Network and Attention (English to French) https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

Figure based on:

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy (http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

Data Processing for a Word RNN (e.g., for sentiment classification)


Step 1): Create one-hot encoded matrix

From the text corpus, extract the indices of the unique words, so that each sentence becomes a sequence of word indices (e.g., 21, 88, 15, 22). Each sentence is then represented as a one-hot encoded matrix in which each row is the one-hot encoded vector of one word (a 1 at the word's index position, e.g. pos. 21, and 0 elsewhere); shorter sentences are padded with all-zero row vectors.
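A small sketch of this step for a single sentence, assuming an illustrative vocabulary size of 100:

import numpy as np

vocab_size = 100
sentence = [21, 88, 15, 22]                         # word indices extracted from the corpus

one_hot = np.zeros((len(sentence), vocab_size))
one_hot[np.arange(len(sentence)), sentence] = 1.0   # one row per word, a 1 at each word's index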

Step 2): Multiply with embedding matrix

The embedding matrix is a trainable matrix of real values r_{i,j} with one row per vocabulary index: the number of rows is the number of unique words (n_words) and the number of columns is the number of features (the embedding size). Multiplying a sentence's one-hot encoded matrix with the embedding matrix produces a real-valued output matrix with one embedding vector per word position.

The "huge" matrix multiplication is very inefficient, so we replace it with an embedding "look-up":

Step 1: Read Sentences into a Word Matrix

Start from a matrix of all zeros with one row per sentence. Extract the indices of the unique words for each sentence in the text corpus and fill each row with them from the right side, so that shorter sentences keep leading zeros as padding (e.g., the sentence 21, 88, 19, 26 becomes the row 0 0 0 0 0 0 21 88 19 26).
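A small sketch of this step, assuming an illustrative fixed sequence length of 10 (padding on the left, i.e., filling from the right side):

import numpy as np

sentences = [[21, 88, 19, 26],
             [39, 29, 28, 24, 11, 5, 78, 39]]
seq_len = 10

word_matrix = np.zeros((len(sentences), seq_len), dtype=np.int64)     # matrix of all zeros
for row, sentence in enumerate(sentences):
    trimmed = sentence[-seq_len:]                                     # keep at most seq_len indices
    word_matrix[row, seq_len - len(trimmed):] = trimmed               # fill from the right side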

Step 2: "Look up" the rows that correspond to the word indices

(Figure: the same trainable embedding matrix as above, with one row r_{i,1}, ..., r_{i,k} per vocabulary index i; the number of rows is the number of unique words (n_words) and the number of columns is the number of features (the embedding size). Each word index simply selects its row.)
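A minimal sketch showing that the look-up returns exactly the rows the one-hot multiplication would have produced; torch.nn.Embedding is the PyTorch layer that implements this look-up (the sizes are illustrative):

import torch

vocab_size, embedding_size = 100, 6
embedding = torch.nn.Embedding(vocab_size, embedding_size)   # the trainable matrix of real values

word_indices = torch.tensor([21, 88, 15, 22])

# Embedding "look-up": just pick the rows of the embedding matrix
looked_up = embedding(word_indices)

# The "huge" (and inefficient) equivalent: one-hot encode, then matrix-multiply
one_hot = torch.nn.functional.one_hot(word_indices, vocab_size).float()
multiplied = one_hot @ embedding.weight

print(torch.allclose(looked_up, multiplied))   # True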

Data Processing for a Character RNN (e.g., for text generation)

Step 1: Break up text into characters

Data: "Hello world!" is broken into a sequence of characters.

Input sequence:            'H' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '!'
Predicted next character:  'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '!' EOS

Step 2: Define Mapping Dictionaries

char2int maps characters to integers (e.g., 'H' -> 2, 'e' -> 1, 'l' -> 3, 'o' -> 0, ...), and int2char maps the integers back to characters (e.g., 6 -> 'w', 0 -> 'o', 5 -> 'r', 3 -> 'l', 4 -> 'd', ...).
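A small sketch of how such dictionaries are typically built (the exact integer assigned to each character is arbitrary):

text = "Hello world!"

chars = sorted(set(text))                          # unique characters in the corpus
char2int = {ch: i for i, ch in enumerate(chars)}   # character -> integer
int2char = {i: ch for ch, i in char2int.items()}   # integer -> character

encoded = [char2int[ch] for ch in text]            # the text as a sequence of integers
decoded = "".join(int2char[i] for i in encoded)    # maps back to "Hello world!"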

Step 3: Define Inputs and Outputs

Convert the text corpus into one long sequence of integers, e.g.

49, 29, 29, 29, 5, 19, 27, 0, 7, 3, 36, 65, 27, 41, 31, 0, 26, 4, 31, 27, 86, 10, 27, 3, 84, 67, 12, 0, 80, 31, 27, 58, 31, 0, 36, 28, 0, 75, 19, 22, ..., 52, 84, 19, 31, 0, 22

From this sequence, create a training data array x and a training data array y. The outputs y are the characters shifted by 1 position, as we want to predict the next character. Both arrays are then cut into "batch size" rows, each of length "number of batches ✕ number of steps", so that mini-batches can be taken as consecutive slices of "number of steps" columns.
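A small sketch of this step; the stand-in corpus and the batch_size / num_steps values are illustrative:

import numpy as np

encoded = np.array([49, 29, 29, 29, 5, 19, 27, 0, 7, 3, 36, 65, 27, 41, 31] * 20)  # stand-in corpus
batch_size, num_steps = 4, 20

x = encoded[:-1]                          # inputs
y = encoded[1:]                           # outputs: the characters shifted by 1 position

# Keep only what fits evenly, then reshape to (batch size, num batches * num steps)
num_batches = len(x) // (batch_size * num_steps)
keep = num_batches * batch_size * num_steps
x = x[:keep].reshape(batch_size, -1)
y = y[:keep].reshape(batch_size, -1)

# Mini-batches are consecutive slices of num_steps columns: x[:, 0:20], x[:, 20:40], ...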

More details on how this is specifically implemented in PyTorch are shown in the excellent Jupyter notebooks linked on the earlier slides.
