(19) United States
(12) Patent Application Publication (10) Pub. No.: US 2021/0081752 A1
Chao et al. (43) Pub. Date: Mar. 18, 2021

(54) IMITATION LEARNING SYSTEM

(71) Applicant: Corporation, Santa Clara, CA (US)

(72) Inventors: Yu-Wei Chao, Seattle, WA (US); De-An Huang, Cupertino, CA (US); Christopher Jason Paxton, Pittsburgh, PA (US); Animesh Garg, Berkeley, CA (US); Dieter Fox, Seattle, WA (US)

(21) Appl. No.: 16/931,211

(22) Filed: Jul. 16, 2020

Related U.S. Application Data

(60) Provisional application No. 62/900,226, filed on Sep. 13, 2019.

Publication Classification

(51) Int. Cl.: G06N 3/00 (2006.01); G06N 20/00 (2006.01)

(52) U.S. Cl.: CPC G06N 3/008 (2013.01); G06N 20/00 (2019.01)

(57) ABSTRACT

Apparatuses, systems, and techniques to identify a goal of a demonstration. In at least one embodiment, video data of a demonstration is analyzed to identify a goal. Object trajectories identified in the video data are analyzed with respect to a task predicate satisfied by a respective object trajectory, and with respect to a motion predicate. Analysis of the trajectory with respect to the motion predicate is used to assess intentionality of a trajectory with respect to the goal.

FIG. 1: example 100 of demonstration learning: a computing device 102 with camera 108 observes a demonstrator 110 in area 130 (with box 112, can 114, pot 116, stove 120, and work surface 124), and a robotic device 106 operates in area 132 (with stove 122 and work surface 126).

FIG. 2: example 200 of video data segmentation: video data 202 is segmented by object trajectory into video segments 204, 206, and 208 (box 212, can 214, pot 216).

FIG. 3: example 300 of predicate identification over video segments 304, 306, and 308 for box 312, can 314, and pot 316.

FIG. 4: example 400 of trajectory analysis: video segments 404 and 408, each associated with a goal (g1, g2) and a corresponding label (b1, b2).

FIG. 5: example 500 relating a real environment 502 (work surface 504, stove 506) to an abstract environment 510 with burner, workspace, and storage regions (512, 514, 516, 518).
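The real-to-abstract mapping of FIG. 5 can be made concrete with a small example. Everything below is an illustrative assumption (the region coordinates, names, and predicate encoding are not specified in the application): concrete object positions are reduced to symbolic predicates over abstract regions such as burners, a workspace, and storage.

```python
# Hypothetical sketch of the real-to-abstract mapping suggested by FIG. 5.
# Regions are axis-aligned boxes in a normalized workspace; all names and
# coordinates are illustrative assumptions, not the application's API.

ABSTRACT_REGIONS = {
    "burner_1": (0.0, 0.0, 0.3, 0.3),   # (x_min, y_min, x_max, y_max)
    "burner_2": (0.4, 0.0, 0.7, 0.3),
    "workspace": (0.0, 0.4, 0.7, 1.0),
    "storage": (0.8, 0.0, 1.0, 1.0),
}

def region_of(position):
    """Return the abstract region containing an (x, y) object position."""
    x, y = position
    for name, (x0, y0, x1, y1) in ABSTRACT_REGIONS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def scene_predicates(object_positions):
    """Reduce a concrete scene to predicates like ('on', 'pot', 'burner_1')."""
    return {("on", obj, region_of(pos)) for obj, pos in object_positions.items()}

predicates = scene_predicates({"pot": (0.1, 0.1), "can": (0.9, 0.5)})
```

With this reduction, goal reasoning can operate on a small set of symbolic predicates rather than on raw object poses.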

FIG. 6: example process 600 of goal identification in demonstration learning:
(602) obtain video data of a demonstration;
(604) segment the video data by object trajectory;
(606) identify a motion predicate satisfied by manipulation of a first object, where the motion predicate enables a subsequent trajectory of a second object;
(608) identify a task predicate satisfied by manipulation of the second object;
(610) identify a goal of the demonstration based on task predicates classified as intentional or demonstrative.
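The flow of FIG. 6 can be sketched in a few lines of Python. This is a hedged, minimal illustration: the segmentation, predicate detection, and classification steps are stand-ins, and the per-segment tuple format is an assumption, not the application's actual representation.

```python
# Minimal sketch of goal identification (FIG. 6, steps 602-610), assuming
# video has already been segmented into per-object records of the form
# (object, task_predicate, enables_later_trajectory).

def identify_goal(segments):
    """Keep a task predicate as part of the goal only when its motion is
    classified as intentional; here, a manipulation that merely enables a
    later trajectory (a motion predicate) is treated as non-goal motion."""
    goal = []
    for obj, task_predicate, enables_later in segments:
        if task_predicate is not None and not enables_later:
            goal.append((obj, task_predicate))
    return goal

# Cooking example from the disclosure: moving the cracker box only clears
# the way for the pot, so it is excluded from the inferred goal.
segments = [
    ("box", "stored(box)", True),      # motion predicate: enables pot trajectory
    ("pot", "on(pot, stove)", False),  # intentional task predicate
]
goal = identify_goal(segments)
```

The point of the sketch is the filtering step: task predicates are only promoted to goals after the intentionality check, which is what distinguishes this flow from naive final-state imitation.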

FIG. 7: example process of goal reproduction by a robotic manipulation device:
(702) identify goal of demonstration;
(704) analyze video data of environment;
(706) identify predicates to complete goal;
(708) identify trajectories to satisfy predicates;
(710) cause robotic manipulation device to manipulate objects according to trajectories.
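A corresponding sketch of FIG. 7's reproduction loop, again with hypothetical planner and robot interfaces (the application does not specify these APIs):

```python
# Sketch of FIG. 7 (steps 702-710): find a trajectory satisfying each goal
# predicate, then issue it to a robotic manipulation device. The planner and
# the robot interface are hypothetical stand-ins for illustration.

def plan_trajectory(predicate, scene):
    """Toy 'planner': move the predicate's object to its target location."""
    obj, target = predicate
    return (obj, scene[obj], target)  # (object, start, end)

def reproduce_goal(goal_predicates, scene, robot):
    commands = [plan_trajectory(p, scene) for p in goal_predicates]
    for cmd in commands:
        robot.append(cmd)  # stand-in for sending a motion command
    return commands

robot_log = []
commands = reproduce_goal([("pot", "stove")], {"pot": "workspace"}, robot_log)
```

A real system would close the loop against the observed environment (step 704) before each command; the list-based "robot" here only records what would be sent.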

FIG. 8A: inference and/or training logic/hardware structure(s) 815: code and/or data storage 801, 805, arithmetic logic unit(s) 810, activation storage 820.
FIG. 8B: hardware structure(s) 815: code and/or data storage 801, 805 paired with computational hardware 802, 806, and activation storage 820.
FIG. 9: training and deployment of a neural network: training dataset 902, training framework 904, untrained neural network 906, trained neural network 908, new dataset 912, result 914.
FIG. 10: data center 1000: application layer 1040 (applications 1042), software layer 1030 (software 1032), framework layer 1020 (job scheduler 1022, configuration manager 1024, resource manager 1026, distributed file system 1028), data center infrastructure layer 1010 (resource orchestrator 1012, grouped computing resources 1014, node C.R.s 1016(1)-1016(N)).
FIGS. 11A-11D: example autonomous vehicle 1100: sensors (GNSS 1158, RADAR 1160, ultrasonic 1162, LIDAR 1164, IMU 1166, microphones 1196), cameras (stereo 1168, wide-view 1170, infrared 1172, surround 1174, mid-range 1176, long-range 1198), SoCs 1104 (CPUs 1106, GPUs 1108, processors 1110, caches 1112, accelerators 1114, data stores 1116), controllers 1136, ADAS system 1138, HD map 1122, actuation systems (steering 1154, propulsion 1150, brake 1146), and cloud-based server(s) 1178 (CPUs 1180, PCIe switches 1182, GPUs 1184) reachable over network(s) 1190.
FIG. 12: computer system 1200: processor 1202 (execution unit 1208 with packed instruction set 1209, cache 1204, register file 1206), memory controller hub 1216, memory 1220 (instructions 1219, data 1221), graphics/video card 1212, and I/O controller hub 1230 with data storage 1224, wireless transceiver 1226, serial expansion port 1227, flash BIOS 1228, audio controller 1229, network controller 1234, and legacy I/O controller 1223.
FIG. 13: computer system 1300: processor 1310 with display 1324, touch screen 1325, touch pad 1330, sensor hub 1340, EC 1335, TPM 1338, BIOS/FW flash 1322, SSD or HDD 1320, WLAN unit 1350, Bluetooth unit 1352, WWAN unit 1356, GPS 1355, camera 1354, audio codec and class D amp 1362, speakers 1363, headphones 1364, mic 1365, NFC unit 1345, thermal sensors 1339/1346, accelerometer 1341, ALS 1342, compass 1343, gyroscope 1344.
FIG. 14: computer system 1400: CPU 1402, main memory 1404, display devices 1406, input devices 1408, network interface 1422, communication bus 1410, and parallel processing system 1412 (PPUs 1414 with memories 1416, interconnect 1418, switch 1420).
FIG. 15: computer system 1500: computer 1510 coupled via USB interface 1540 to USB stick 1520 (USB interface logic 1550, processing unit 1530).
FIGS. 16A-16F: exemplary multi-GPU systems and a shared programming model: multi-core processors 1605, processor memories 1601, GPUs 1610(1)-1610(N), GPU memories 1620, graphics acceleration module 1646 (graphics processing engines 1631, graphics memories 1633), accelerator integration circuit 1636 (interfaces 1635/1637, cache 1638, MMU 1639, fetch 1644, registers 1645, interrupt management 1647, context management 1648), processor 1607 (cores 1660A-1660D with TLBs 1661 and caches 1662, coherence bus 1664, proxy circuit 1625, shared caches 1656), system memory 1614, process elements 1683, work descriptor 1684, OS 1695, hypervisor 1696, and unified memory with bias/coherence management circuits 1694A-1694E.
FIG. 17: SoC integrated circuit 1700: application processor(s) 1705, graphics processor 1710, image processor 1715, video processor 1720, USB 1725, UART 1730, SPI/SDIO 1735, I2S/I2C 1740, display 1745, HDMI 1750, MIPI 1755, flash 1760, memory controller 1765, security engine 1770.
FIG. 18A: graphics processor 1810: vertex processor 1805, fragment processors 1815A-1815N, MMUs 1820A-1820B, caches 1825A-1825B, interconnects 1830A-1830B.
FIG. 18B: graphics processor 1840: inter-core task manager (e.g., thread dispatcher) 1845, shader cores 1855A-1855N, tiling unit 1858, MMUs 1820A-1820B, caches 1825A-1825B, interconnects 1830A-1830B.
FIG. 19A: graphics core 1900: shared instruction cache 1902, slices 1901A-1901N (local instruction caches 1904, thread schedulers 1906, thread dispatchers 1908, registers 1910, AFUs 1912, FPUs 1914, ALUs 1916, ACUs 1913, DPFPUs 1915, MPUs 1917), texture unit 1918, cache/shared memory 1920.
FIG. 19B: GPGPU 1930: host interface 1932, global scheduler 1934, compute clusters 1936A-1936H, cache memory 1938, I/O hub 1939, GPU link 1940, memory controllers 1942A-1942B, memories 1944A-1944B.
FIG. 20: computing system 2000: processor(s) 2002, system memory 2004, memory hub 2005, parallel processor(s) 2012, I/O subsystem 2011, I/O hub 2007, input devices 2008, system storage 2014, I/O switch 2016, network adapters 2018/2019, add-in devices 2020, display devices 2010A-2010B, communication links 2006/2013, processing subsystem 2001.
FIG. 21A: parallel processor 2100: parallel processing unit 2102 (I/O unit 2104, host interface 2106, front end 2108, scheduler 2110, processing cluster array 2112 with clusters 2114A-2114N, memory crossbar 2116, memory interface 2118, partition units 2120A-2120N), parallel processor memory 2122 (memory units 2124A-2124N), memory hub 2105.
FIG. 21B: partition unit 2120: frame buffer interface 2125, ROP 2126, L2 cache 2121.
FIG. 21C: processing cluster 2114: pipeline manager 2132, graphics multiprocessor 2134, texture unit 2136, L1 cache 2148, data crossbar 2140, PreROP 2142, MMU 2145.
FIG. 21D: graphics multiprocessor 2134: instruction cache 2152, instruction unit 2154, address mapping unit 2156, register file 2158, GPGPU cores 2162, load/store unit 2166, memory and cache interconnect 2168, shared memory 2170, cache memory 2172.
FIG. 22: multi-GPU system 2200: processor 2202, host interface switch 2204, GPGPUs 2206A-2206D, P2P GPU links 2216.
FIG. 23: graphics processor 2300: ring interconnect 2302, command streamer 2303, pipeline front-end 2304, video front end 2334, media engine 2337 (VQE 2330, MFX 2333), geometry pipeline 2336, graphics cores 2380A-2380N (sub-cores 2350/2360 with EUs 2352/2362 and samplers 2354/2364, shared resources 2370).
FIG. 24: processor 2400: front end 2401 (prefetcher 2426, instruction decoder 2428, trace cache 2430, microcode ROM 2432, uop queue 2434), out-of-order engine 2403 (allocator/register renamer 2440, memory uop queue 2442, integer/floating point uop queue 2444, memory scheduler 2446, fast scheduler 2402, slow/general FP scheduler 2404, simple FP scheduler 2406), exe block 2411 (integer register file/bypass network 2408, FP register file/bypass network 2410, AGUs 2412/2414, fast ALUs 2416/2418, slow ALU 2420, FP 2422, FP move 2424).
FIG. 25: deep learning application processor 2500: processing clusters 2510(1)-2510(12), inter-chip links (ICLs) 2520, inter-chip controllers (ICCs) 2530, HBM2 2540, memory controllers 2542, HBM PHY 2544, management-controller CPU 2550, SPI/I2C/GPIO 2560, PCIe controller and DMA block 2570, PCIe x16 port 2580.
FIG. 26: neuromorphic processor 2600: neurons 2602 with neuron inputs 2604 and neuron outputs 2606, connected through synapses 2608, 2610, 2612, 2614.
FIG. 27: processing system 2700: processor(s) 2702 (cores 2707, cache 2704, register file 2706, instruction sequence 2709), memory device 2720 (instructions 2721, data 2722), memory controller 2716, graphics processor(s) 2708, external graphics processor 2712, display device 2711, interface bus(es) 2710, data storage device 2724, touch sensors 2725, wireless transceiver 2726, firmware interface 2728, platform controller hub 2730, network controller 2734, audio controller 2746, legacy I/O controller 2740, USB controller(s) 2742 (keyboard/mouse 2743, camera 2744).
FIG. 28: processor 2800: cores 2802A-2802N with cache units 2804A-2804N, shared cache unit(s) 2806, integrated graphics processor 2808, system agent core 2810 (display controller 2811, memory controller 2814, bus controller unit(s) 2816), ring 2812, I/O link 2813, embedded memory module 2818.
FIG. 29: graphics processor 2900: display controller 2902, BLIT engine 2904, video codec engine 2906, graphics processing engine 2910 (3D pipeline 2912, media pipeline 2916, 3D/media subsystem 2915), memory interface 2914, display device 2920.
FIG. 30: graphics processing engine 3010: command streamer 3003, 3D pipeline 3012, media pipeline 3016, graphics core array 3014 (graphics core(s) 3015A-3015B, shared function logic 3026), shared function logic 3020 (sampler 3021, math 3022, inter-thread communication 3023, cache(s) 3025), unified return buffer 3018.
FIG. 31: graphics processor core 3100: sub-cores 3101A-3101F (EU arrays 3102/3104, TD/IC 3103, 3D samplers 3105, media samplers 3106, shader processors 3107, SLM 3108), shared/cache memory 3110/3112, geometry and fixed function pipelines 3114/3136, additional fixed function logic 3116, graphics SoC interface 3137, graphics microcontroller 3138, media pipeline 3139.
FIG. 32A: thread execution logic 3200: shader processor 3202, thread dispatcher 3204, instruction cache 3206, execution units 3207A-3207N and 3208A-3208N with thread control 3209/3211, sampler 3210, data cache 3212, data port 3214.
FIG. 32B: graphics execution unit 3208: thread arbiter 3222, GRF 3224, ARF 3226, send unit 3230, branch unit 3232, SIMD FPUs 3234, SIMD ALUs 3235, instruction fetch unit 3237.
FIG. 33: parallel processing unit (PPU) 3300: I/O unit 3306, front end unit 3310, scheduler unit 3312, work distribution unit 3314, hub 3316, GPCs 3318, XBar 3320, memory partition units 3322, memory 3304, high-speed GPU interconnect 3308.
FIG. 34: general processing cluster (GPC) 3400: pipeline manager 3402, PRE-ROP 3404, DPCs 3406 (MPC 3410, primitive engine 3412, SM 3414), raster engine 3408, WDX 3416, MMU 3418.
FIG. 35: memory partition unit 3500: raster operations unit 3502, L2 cache 3504, memory interface 3506.
FIG. 36: streaming multiprocessor 3600: instruction cache 3602, scheduler units 3604 with dispatch 3606, register file 3608, cores 3610, SFUs 3612, LSUs 3614, interconnect network 3616, shared memory/L1 cache 3618.
FIG. 37: example data flow 3700 for an advanced computing pipeline: facility 3702, imaging data 3708, AI-assisted annotation 3710, labeled clinic data 3712, model training 3714, output model 3716, training system 3704, deployment system 3706, software 3718, services 3720, hardware 3722, model registry 3724.
FIG. 38: system 3800 for training and deploying machine learning models: training system 3704 (training pipelines 3804, pre-trained models 3806, model training 3714, AI-assisted annotation 3710), deployment system 3706 (deployment pipelines 3810, pipeline manager 3812, UI 3814, application orchestration system 3828), services 3720 (compute 3816, AI 3818, visualization 3820, GPUs/graphics 3822), DICOM adapters 3802A/3802B, AI system 3824, cloud 3826, parallel computing platform 3830.
FIG. 39: advanced computing pipeline 3810A: PACS server(s) 3904, DICOM adapter 3802B, DICOM reader 3906, CT reconstruction 3908, organ segmentation 3910, DICOM writer 3912, DICOM output 3914, stages 3916A-3916C, pipeline manager 3812, services 3720.
FIGS. 40A-40B: deployment pipelines 3810B (ultrasound device 4002: DICOM reader 3906, reconstruction 4006, detection 4008, visualization 4010, data augmentation library 4014, inference engine 4016, render component 4018) and 3810C (CT scanner 4020: DICOM reader 3906, exposure control AI 4024, patient movement detection 4026, coarse and fine detection AI 4028/4030/4032, visualization 4022, CT reconstruction 3908, DICOM writer 3912, PACS server(s) 3904).
FIGS. 41A-41B: model refinement and annotation: pre-trained models 3806, initial model 4104, customer dataset 4106, AI-assisted annotation 3710, model training 3714 (4108), refined model 4112 with improved accuracy 4110; raw images 4134, AI-assisted annotation tool 4136, training data 4138, annotation assistant server 4140, pre-trained models 4142.


IMITATION LEARNING SYSTEM [ 0019 ] FIG . 12 is a block diagram illustrating a computer system , according to at least one embodiment; CLAIM OF PRIORITY [ 0020 ] FIG . 13 is a block diagram illustrating a computer system , according to at least one embodiment; [ 0001 ] This application claims the benefit of U.S. Provi [ 0021 ] FIG . 14 illustrates a computer system , according to sional Application No. 62 / 900,226 , entitled “ MOTION at least one embodiment; REASONING FOR GOAL - BASED IMITATION LEARN [ 0022 ] FIG . 15 illustrates a computer system , according to ING ,” filed Sep. 13 , 2019 , the entire contents of which is at least one embodiment; incorporated herein by reference. [ 0023 ] FIG . 16A illustrates a computer system , according to at least one embodiment; TECHNICAL FIELD [ 0024 ] FIG . 16B illustrates a computer system , according [ 0002 ] At least one embodiment pertains to training robots to at least one embodiment; to achieve a goal based on observing a demonstration . For [ 0025 ] FIG . 16C illustrates a computer system , according example , at least one embodiment pertains to a robotic to at least one embodiment; device that observes a demonstration and , using various [ 0026 ] FIG . 16D illustrates a computer system , according novel techniques described herein , identifies the goal of the to at least one embodiment; demonstration . [ 0027 ] FIGS . 16E and 16F illustrate a shared program ming model , according to at least one embodiment ; BACKGROUND [ 0028 ] FIG . 17 illustrates exemplary integrated circuits [ 0003 ] Training robots to provide assistance presents a and associated graphics processors , according to at least one embodiment; variety of challenges . For example, a user might wish to [ 0029 ] FIGS . 18A and 18B illustrate exemplary integrated teach a robot to perform a task based on a demonstration of circuits and associated graphics processors , according to at that task by a user . 
However, techniques for training a robot based on observation may require significant amounts of computing resources, or may be prone to error.

BRIEF DESCRIPTION OF DRAWINGS

[0004] FIG. 1 illustrates an example of demonstration learning, according to at least one embodiment;

[0005] FIG. 2 illustrates an example of video data segmentation, according to at least one embodiment;

[0006] FIG. 3 illustrates an example of predicate identification, according to at least one embodiment;

[0007] FIG. 4 illustrates an example of trajectory analysis, according to at least one embodiment;

[0008] FIG. 5 illustrates example aspects of demonstration learning, according to at least one embodiment;

[0009] FIG. 6 illustrates an example of goal identification in demonstration learning, according to at least one embodiment;

[0010] FIG. 7 illustrates an example of goal reproduction by a robotic manipulation device, according to at least one embodiment;

[0011] FIG. 8A illustrates inference and/or training logic, according to at least one embodiment;

[0012] FIG. 8B illustrates inference and/or training logic, according to at least one embodiment;

[0013] FIG. 9 illustrates training and deployment of a neural network, according to at least one embodiment;

[0014] FIG. 10 illustrates an example data center system, according to at least one embodiment;

[0015] FIG. 11A illustrates an example of an autonomous vehicle, according to at least one embodiment;

[0016] FIG. 11B illustrates an example of camera locations and fields of view for the autonomous vehicle of FIG. 11A, according to at least one embodiment;

[0017] FIG. 11C is a block diagram illustrating an example system architecture for the autonomous vehicle of FIG. 11A, according to at least one embodiment;

[0018] FIG. 11D is a diagram illustrating a system for communication between cloud-based server(s) and the autonomous vehicle of FIG. 11A, according to at least one embodiment;

... least one embodiment;

[0030] FIGS. 19A and 19B illustrate additional exemplary graphics processor logic, according to at least one embodiment;

[0031] FIG. 20 illustrates a computer system, according to at least one embodiment;

[0032] FIG. 21A illustrates a parallel processor, according to at least one embodiment;

[0033] FIG. 21B illustrates a partition unit, according to at least one embodiment;

[0034] FIG. 21C illustrates a processing cluster, according to at least one embodiment;

[0035] FIG. 21D illustrates a graphics multiprocessor, according to at least one embodiment;

[0036] FIG. 22 illustrates a multi-graphics processing unit (GPU) system, according to at least one embodiment;

[0037] FIG. 23 illustrates a graphics processor, according to at least one embodiment;

[0038] FIG. 24 is a block diagram illustrating a processor micro-architecture for a processor, according to at least one embodiment;

[0039] FIG. 25 illustrates a deep learning application processor, according to at least one embodiment;

[0040] FIG. 26 is a block diagram illustrating an example neuromorphic processor, according to at least one embodiment;

[0041] FIG. 27 illustrates at least portions of a graphics processor, according to one or more embodiments;

[0042] FIG. 28 illustrates at least portions of a graphics processor, according to one or more embodiments;

[0043] FIG. 29 illustrates at least portions of a graphics processor, according to one or more embodiments;

[0044] FIG. 30 is a block diagram of a graphics processing engine of a graphics processor, in accordance with at least one embodiment;

[0045] FIG. 31 is a block diagram of at least portions of a graphics processor core, according to at least one embodiment;

[0046] FIGS. 32A and 32B illustrate thread execution logic including an array of processing elements of a graphics processor core, according to at least one embodiment;

[0047] FIG. 33 illustrates a parallel processing unit ("PPU"), according to at least one embodiment;

[0048] FIG. 34 illustrates a general processing cluster ("GPC"), according to at least one embodiment;

[0049] FIG. 35 illustrates a memory partition unit of a parallel processing unit ("PPU"), according to at least one embodiment;

[0050] FIG. 36 illustrates a streaming multi-processor, according to at least one embodiment;

[0051] FIG. 37 is an example data flow diagram for an advanced computing pipeline, in accordance with at least one embodiment;

[0052] FIG. 38 is a system diagram for an example system for training, adapting, instantiating and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment;

[0053] FIG. 39 includes an example illustration of an advanced computing pipeline 3810A for processing imaging data, in accordance with at least one embodiment;

[0054] FIG. 40A includes an example data flow diagram of a virtual instrument supporting an ultrasound device, in accordance with at least one embodiment;

[0055] FIG. 40B includes an example data flow diagram of a virtual instrument supporting a CT scanner, in accordance with at least one embodiment;

[0056] FIG. 41A illustrates a data flow diagram for a process to train a machine learning model, in accordance with at least one embodiment; and

[0057] FIG. 41B is an example illustration of a client-server architecture to enhance annotation tools with pre-trained annotation models, in accordance with at least one embodiment.

DETAILED DESCRIPTION

[0058] In at least one embodiment, a robotic device learns to perform a new task based on a video demonstration. The system is capable, in some cases, of learning the new task from as few as one demonstration, even if the demonstration includes steps that are accidental or non-essential to the goal of the demonstration. For example, in at least one embodiment, a robot is trained to perform a task based on a demonstration of the task by a human user. By leveraging supervised training or meta-learning, the robot learns to follow the demonstration and may subsequently perform the task independently of its owner. However, it may not be necessary, and may be undesirable, for the robot to closely follow each of the steps, actions, or details included in the demonstration. Rather, in at least one embodiment, the robot infers the goal or intention of the demonstrator in providing the demonstration, and subsequently performs the demonstrated task in view of the identified goal. The robot may do so without recreating every action performed by the demonstrator. For example, if the intent of a demonstration was to place a bowl on a counter, it may not matter whether the bowl was obtained from a cabinet or dishwasher, or whether the demonstrator's right or left hand was used to pick up the bowl. Accordingly, in at least one embodiment, the robot infers the intention of the human demonstrator, in order to generalize what the robot learns to wider scenarios.

[0059] In at least one embodiment, the goal or intention of a demonstration is ambiguous, or unknown, to the observing robotic device. More generally, the goal or intention of the demonstration is not known a priori by the observing system, and the system might never be explicitly informed of the demonstration's goal. One challenge is that, when applying goal-reasoning to a real-world demonstration, it may not be possible to determine the goal or intention from the final state of an observed scene, or from the observed sequence of high-level actions. For example, in a cooking demonstration, a robot might observe a box of crackers being moved and a pot being placed on the stove. It might be inferred that the goal of the demonstration was to store the box of crackers and to place the pot on the stove. However, it might be the case that the movement of the box of crackers was unintentional, or merely intended to move the box out of the way. In at least one embodiment, the goal of a demonstration is inferred based at least partially on analysis of object trajectories.

[0060] FIG. 1 illustrates an example of demonstration learning, according to at least one embodiment. In the example 100, a computing device 102 is connected to or includes a camera 108. The computing device 102 comprises at least one processor and a memory storing instructions that, in response to execution by the at least one processor, cause the computing device 102 to perform steps or operations for goal-based imitation learning, in accordance with embodiments described herein. The camera 108 may be any of a variety of cameras capable of capturing video data, including video cameras, infrared cameras, multispectral imaging cameras, and so forth.

[0061] In at least one embodiment, the camera 108 captures video data of an area 130 in which a demonstration is performed. The demonstration is performed, in at least one embodiment, by a human demonstrator 110. In the example 100, the demonstration relates to cooking. However, it will be appreciated that this example is intended to be illustrative, and as such the example should not be construed in a manner which would limit the scope of the present disclosure to only those embodiments that incorporate or involve the specific example provided.

[0062] In the example 100, a box 112, can 114, and pot 116 are placed on work surface 124. The demonstrator 110 intends to cook the contents of the can 114 by placing the can's contents into pot 116 and moving the pot 116 to a burner of a stove 120. This goal is not communicated explicitly to the computing device 102. In at least one embodiment, the computing device 102 infers the goal based on video data of the observation that is captured by the camera 108. Subsequently, robotic device 106 can achieve the goal again, in a potentially different area 132 with a different stove 122 and work surface 126.

[0063] In at least one embodiment, the inferred goal is performed by a robotic device 106. The robotic device 106 may include the computing device 102, or the robotic device 106 and the computing device 102 may be separate. In at least one embodiment, the robotic device 106 comprises at least one processor and a memory comprising instructions that, when executed by the processor(s), cause robotic device 106 to perform actions that accomplish the inferred goal. The robotic device 106 may include, in at least one embodiment, a robotic arm or other mechanism for manipulating objects. Such devices may be referred to as robotic manipulation devices. In at least one embodiment, the robotic device 106 is provided as an integrated device comprising computing device(s) 102 and camera 108.

[0064] In at least one embodiment, such as in the system depicted in FIG. 1, movements of objects in a demonstration are analyzed as potentially satisfying two different classes of predicates, those being task predicates and motion predicates. For example, the movement of a cracker box may satisfy a task predicate such as in_storage(cracker_box) and a motion predicate such as out_of_the_way(cracker_box). The predicate in_storage(cracker_box) may be classified as a task predicate, in that it may be a task related to the accomplishment of a potential goal. The predicate out_of_the_way(cracker_box), on the other hand, may be classified as a motion predicate because it relates to enabling the motion, or trajectories, of other objects. The trajectory of a motion predicate may therefore enable a subsequent trajectory that satisfies a task predicate. This may be seen, for example, in FIG. 2, which depicts a cracker_box being moved out_of_the_way to enable movement of a pot onto a stove. In at least one embodiment, the computing device 102, when performing goal-based imitation learning, attempts to classify predicates as task predicates, motion predicates, or both.
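The two predicate classes described above lend themselves to a simple symbolic representation. The following is an illustrative sketch only; the `Predicate` class and the sample atoms are hypothetical names, not an implementation prescribed by this disclosure:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Predicate:
    """A ground atom such as in_storage(cracker_box)."""
    name: str
    args: Tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.name}({', '.join(self.args)})"

# One and the same manipulation of the cracker box may satisfy both:
task_pred = Predicate("in_storage", ("cracker_box",))        # task predicate
motion_pred = Predicate("out_of_the_way", ("cracker_box",))  # motion predicate

# A final goal is a set of task predicates, G = {g_i}.
goal = {task_pred}
```

Deciding whether an observed trajectory demonstrates the task predicate, the motion predicate, or both is the inverse planning problem discussed in the paragraphs that follow.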
[0065] In at least one embodiment, the computing device 102, when performing goal-based imitation learning, operates under the assumption that the demonstrator intends to cooperatively and unambiguously communicate the goal of the demonstration. The demonstration may therefore be described as intent-expressive or legible. In order to discern this intent, the computing device 102 assumes that legibility of the goal resides, at least in part, in the low-level object trajectories, rather than the high-level actions. This assumption allows the classification of task predicates and motion predicates to be formulated as an inverse planning problem.

[0066] As illustrated in FIG. 1, video data of a real-world demonstration may be obtained by a camera 108 and processed by a computing device 102. Given the video demonstration, the computing device 102, in at least one embodiment, aims to output a symbolic goal of the demonstration. For example, given a demonstration D = [Z_1, ..., Z_T], embodiments may identify an intended goal G of the demonstration. Here, Z_t corresponds to a video frame of the demonstration D. In at least one embodiment, the demonstration is analyzed under the same task planning domain, or more generally the same knowledge, as the robot that is to subsequently perform the same task. For example, a planning domain definition language ("PDDL") file may be used for both goal identification and performance of a task to achieve the identified goal.

[0067] In at least one embodiment, a domain contains a list of ground atoms or predicates, such as in_storage(cracker_box), on_stove(pot), and so forth. The domain may also contain a set of grounded operators. A grounded operator, in at least one embodiment, comprises a name, a list of arguments, a list of preconditions for applying the operator, and identification of the state change effects of applying the operator. The preconditions may correspond to one or more of the domain's predicates. The grounded operators may sometimes be referred to as high-level actions, or as actions.

[0068] In at least one embodiment, a goal G = [g_1, ..., g_N] comprises a list of predicates g_i that, when satisfied, achieves the goal G. A subgoal g_i may also be described as a task predicate. A subgoal g_i may be selected from a subset of predicates in the domain.

[0069] In at least one embodiment, the robot is enabled to imitate the observed demonstration and apply it to different environments. For example, given a video demonstration D_e1 in an environment e1, identification of goal G permits a task planning problem to be solved in a new environment e2 ≠ e1. The robot can then execute the output plan in e2 based on the goal G that was derived from D_e1. In at least one embodiment, G is also independent of the agent's state and motion when the task domain definition is shared between e1 and e2. For example, in at least one embodiment, a demonstration is performed in an abstract or artificial environment, and the robot achieves the corresponding goal in a real environment.

[0070] FIG. 2 illustrates an example of video data segmentation, according to at least one embodiment. In the example 200, video data 202 of a demonstration D = [Z_1, ..., Z_T] is temporally separated into video segments {S_i} 204, 206, 208. In at least one embodiment, each segment i is assumed to achieve one of the task predicates g_i by manipulating a single object. Segmentation of the video demonstration may be based on analysis of the video data to identify object trajectories that achieve a task predicate. In at least one embodiment, action segmentation or other related, similar, or corresponding methods are used to segment the video data into video segments.

[0071] In at least one embodiment, each video segment 204, 206, 208 comprises video of an object being manipulated by a demonstrator. In the example 200, the first video segment 204 shows manipulation of box 212, the second video segment 206 shows manipulation of a can 214, and the third video segment 208 shows manipulation of a pot 216. These partitions may then be analyzed to determine whether they satisfy task or motion predicates.

[0072] FIG. 3 illustrates an example of task predicate identification, according to at least one embodiment. In at least one embodiment, analysis of a video segment is based on an assumption that the video segment achieves a task predicate g_i by manipulating an object. Analysis may also proceed based on an assumption that the video segment also achieves a motion predicate. Inverse planning can then be applied to determine if the video segment demonstrates an intention to achieve a task predicate g_i, a motion predicate, or both. Once the task predicates g_i achieved in all of the video segments have been determined, they can be pooled together, based on the domain definition, to identify a final goal G = {g_i}.

[0073] In the example 300 of FIG. 3, a first video segment 304, corresponding to video segment 204 in FIG. 2, shows a manipulation of box 312. This may be seen as satisfying a task predicate, such as in_storage(box). However, analysis of the box's trajectory may show that it also enabled the trajectory of pot 316 in a subsequent video segment 208. Consequently, the manipulation of box 312 also satisfies a motion predicate.

[0074] Continuing the example, the second video segment 306, corresponding to the video segment 206 in FIG. 2, shows a manipulation of a can 314. In particular, the example 300 shows the can 314 being moved to the pot 316 and the contents of the can 314 being emptied into the pot 316. This might be treated, in some embodiments, as a number of separate trajectories. However, for the purposes of this example, it is treated as one trajectory. In any event, the manipulation of can 314 satisfies a task predicate, such as in_pot(food).

[0075] The third video segment 308, corresponding to the third video segment 208 of FIG. 2, shows a manipulation of the pot 316 to place it onto the stove. This trajectory satisfies a task predicate on_stove(pot). As stated above, this trajectory was enabled by the prior manipulation of the box 312.
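The pooling described in paragraph [0072] can be sketched concretely. In this hypothetical helper (function name and data layout are illustrative assumptions, not an implementation the disclosure prescribes), per-segment task predicates arrive in demonstration order, predicates judged unintentional by motion reasoning are dropped, and predicates that merely serve as preconditions of later intentional predicates are removed, leaving the final goal G:

```python
def pool_goal(segment_predicates, preconditions):
    """Pool per-segment task predicates into a final goal G = {g_i}.

    segment_predicates: list of (predicate, intentional) pairs in
    demonstration order; the intentional flag comes from motion reasoning.
    preconditions: mapping from a predicate to the set of predicates it
    depends on, taken from the domain definition.
    """
    intentional = [p for p, ok in segment_predicates if ok]
    goal = set()
    for i, p in enumerate(intentional):
        # Drop p when it is only a precondition of a later intentional predicate.
        if any(p in preconditions.get(q, set()) for q in intentional[i + 1:]):
            continue
        goal.add(p)
    return goal

# FIG. 3 example: the box move mainly enables the pot's trajectory
# (a motion predicate), so in_storage(box) is treated as unintentional.
segments = [("in_storage(box)", False),
            ("in_pot(food)", True),
            ("on_stove(pot)", True)]
print(pool_goal(segments, preconditions={}))
```

With an added domain entry stating that cooked(food) presupposes on_stove(pot), the same helper would also discard on_stove(pot) as a mere precondition, matching the serving example discussed later in this disclosure.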

[0076] FIG. 4 illustrates an example of trajectory analysis, according to at least one embodiment. In at least one embodiment, identification of a motion predicate, or of a task predicate enabled by a prior trajectory, is based on trajectory analysis as described in relation to FIG. 4. In the example 400, video segment 404 shows manipulation of an object b_1, and video segment 408 shows manipulation of an object b_2. To facilitate understanding of the example, object b_1 may be presumed, for the purposes of the example, to correspond to the box 212, 312 of FIGS. 2 and 3, and object b_2 may be presumed to correspond to the pot 216, 316.

[0077] In at least one embodiment, after temporally segmenting the video segments 404, 408 of the demonstration based on task predicates, a segment S_i = [z_1, ..., z_t, ..., z_T] includes the observations z_t in the segment and the task predicate g_i achieved in the segment. Based on the arguments of the task predicate, it is also known that the object b_i is being manipulated in the segment. It may then be determined if the task predicate g_i achieved in the segment S_i is part of the actual goal of the demonstration. Aspects of goal recognition may be performed based on relationships between the different predicates. However, in some cases, it may not be possible to determine the intention of the demonstration based only on g_i.

[0078] To resolve such ambiguity at the predicate level, object motions or trajectories may be analyzed. For example, in addition to satisfying some task predicate g_i, the manipulation of an object may also achieve something else. For example, object b_1 is now at a new position and pose. Independently of the task predicate, this new position and pose of b_1 can also enable later object trajectories. With respect to FIG. 4, the motion of object b_1 might satisfy a task predicate g_1 and might also satisfy a motion predicate m_1.

[0079] In at least one embodiment, a high-level, action-based approach considers the task planning aspects of goal recognition, and analysis based on two-dimensional trajectories considers the motion-planning aspects of goal recognition. In at least one embodiment, both the task and motion planning aspects are used to understand the goal of a demonstration. For example, although moving the bowl onto the stove might symbolically only contain in_hand(bowl) as a precondition, there is also implicitly a motion constraint of a valid path that enables moving the bowl onto the stove. Satisfying this constraint by moving b_i to create the valid path can be referred to as satisfying a motion predicate m_i(b_i).

[0080] In some cases, moving an object achieves a task predicate but does not create a valid path for other objects. In at least one embodiment, such cases are interpreted as intending to demonstrate just the task predicate. In other cases, moving an object does not achieve any task predicate but does enable a new valid path; moving b_i can then be assumed to teach satisfying a motion predicate m_i(b_i).

[0081] In other cases, moving b_i satisfies both a task predicate g_i and a motion predicate m_i(b_i). If both g_i and m_i(b_i) are true in S_i, the intention of the motion, in relation to the goal of the demonstration, is less clear. For example, if moving the box 312 achieved the task predicate in_storage(cracker_box), and created a valid path for moving pot 316 onto the stove, it may be difficult to determine if in_storage(cracker_box) is part of the goal.

[0082] In at least one embodiment, a principle of rational action is applied to assess whether the motion or task predicate would be most efficiently brought about by the observed trajectory. In at least one embodiment, this is in line with Bayesian inverse planning. For example, let ξ_i be the trajectory of b_i in S_i. In at least one embodiment, the intention of S_i can then be determined by:

g* = argmax_{g ∈ {g_i, m_i(b_i)}} P(g | ξ_i), where P(g | ξ_i) ∝ P(ξ_i | g) P(g)

and

P(ξ_i | g) ∝ exp(−C(ξ_{s→g}) − C(ξ_{g→q})) / exp(−C(ξ_{s→q}))

[0083] Referring to the above equations, in at least one embodiment, s and q are the start and end locations of ξ_i; ξ_{s→g} and ξ_{g→q} are optimal trajectories to achieve g from s and to reach q from g; and C(·) is a function to compute the cost of a trajectory. In at least one embodiment, g corresponds to movement to a point in space. In at least one embodiment, g corresponds to movement through a region in space. In at least one embodiment, g is a task predicate or a motion predicate that can be satisfied by movement to a point or to a region in space. In the latter case, it may be inefficient to directly discretize the space and run a search algorithm. In at least one embodiment, a path-planning algorithm, such as rapidly exploring random tree ("RRT") or RRT*, is run from s to q to approximate optimal trajectories to achieve g_i or m_i(b_i). In at least one embodiment, other objects b_j (j ≠ i) are treated as obstacles and a configuration space is computed. When g_i is a region in space, the success condition for a path-planning algorithm, such as RRT*, is to reach the region g_i. Relatedly, m_i(b_i) means that b_i is not blocking the trajectory of b_{i+1}. In at least one embodiment, a convex hull that covers the trajectory of b_{i+1} is found, and RRT*, or some other path-planning algorithm, is used to find the shortest path such that b_i does not intersect with this convex hull.

[0084] Applying this to the example of FIG. 4, ξ_1 may refer to the trajectory of object b_1, and motion reasoning may comprise a determination as to whether P(m_1 | ξ_1) > P(g_1 | ξ_1). Embodiments may therefore determine, based on the trajectory ξ_1, whether the manipulation of b_1 in that segment was more likely to have been intended to satisfy motion predicate m_1 or task predicate g_1.

[0085] In at least one embodiment, motion predicate reasoning is used to determine, for each segment S_i, if the corresponding g_i is intentional. In at least one embodiment, motion predicate reasoning is performed according to techniques described in relation to FIG. 4.

[0086] In at least one embodiment, the task predicates g_i that have been determined to be intentional are pooled. In at least one embodiment, intentional task predicates are pooled by removing certain task predicates that are preconditions for later task predicates. For example, although a pot may be moved onto a stove, this may simply be a precondition for cooking what is inside the pot. Later, the pot is moved back onto the table to serve. It might therefore be inferred that the pot being on the stove is not the actual goal of the demonstration.

[0087] In at least one embodiment, reasoning at the object trajectory level is based on the pose of an object in three-dimensional space. In at least one embodiment, object poses are initialized using a convolutional neural network ("CNN"), such as PoseCNN. Detection output may then be used, in at least one embodiment, as initialization to an object pose tracking algorithm. For example, in at least one embodiment, an algorithm based on Rao-Blackwellized particle filtering ("RBPF"), such as PoseRBPF, is used to track the pose of an object. In at least one embodiment, the pose tracking includes information related to position and orientation, and may be referred to as 6D pose tracking. In at least one embodiment, multiple cameras are used, and a view of the object is selected based on a maximum particle score. The video frame observation Z_t may be transformed to an object-centric representation x_t = φ(Z_{1:t}) = [x_t^1, ..., x_t^K]. Here, φ is an object pose tracking pipeline, x_t is an object-centric representation of Z_t, and x_t^k is an estimated 6D pose of the k-th object. In at least one embodiment, x_t is extended with a detected hand center to improve the robustness of video segmentation based on hand-object interactions.

[0088] As described herein, video data of the demonstration can be segmented to enable reasoning about the object trajectories in other steps. In at least one embodiment, each segment comprises manipulation of a single object. In at least one embodiment, a neural network is used to segment the video, based on data collected via action segmentation.
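The pose-based segmentation idea can be sketched minimally: given per-frame object positions (orientation omitted for brevity), the object whose pose changes between consecutive frames is taken as the manipulated object, and a new segment starts whenever that object changes. The function name and threshold below are illustrative assumptions; embodiments additionally use hand detection and learned action segmentation:

```python
import numpy as np

def segment_by_motion(poses, eps=1e-3):
    """Split a demonstration into per-object manipulation segments.

    poses: array of shape (T, K, 3), the position of each of K objects
    in each of T frames. Returns (start_frame, end_frame, object) spans.
    """
    deltas = np.linalg.norm(np.diff(poses, axis=0), axis=2)  # (T-1, K)
    # For each frame transition, the index of the moving object, or None.
    moving = [int(d.argmax()) if d.max() > eps else None for d in deltas]
    segments, start, obj = [], None, None
    for t, k in enumerate(moving + [None]):  # trailing sentinel flushes last span
        if k != obj:
            if obj is not None:
                segments.append((start, t, obj))
            start, obj = t, k
    return segments

# Two objects: object 0 moves during frames 0-2, object 1 during frames 3-5.
poses = np.zeros((6, 2, 3))
poses[1, 0, 0], poses[2:, 0, 0] = 0.1, 0.2  # object 0 translates, then rests
poses[4, 1, 1], poses[5, 1, 1] = 0.1, 0.2   # object 1 translates later
print(segment_by_motion(poses))
```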
[0089] In at least one embodiment, the segmentation of the demonstration's video data is done based on object pose trajectories extracted for motion reasoning. For each time step t, there are known poses x_t^k for each object. By comparing x_t temporally, it may be determined that an object k is being moved or otherwise manipulated. This information may then be used to segment the video data according to the separate object trajectories. The segmentation, in at least one embodiment, is made more accurate by using hand detection to identify segments in which the demonstrator's hand is interacting with an object.

[0090] In at least one embodiment, after segmenting the video data, predicates for each frame are computed based on the estimated poses. By comparing predicates at the start and the end of the segment, the task predicate for a segment may be determined.

[0091] FIG. 5 illustrates example aspects of demonstration learning, according to at least one embodiment. In the example 500 of FIG. 5, a real environment 502 can comprise a work surface 504 and a stove 506. In at least one embodiment, task and motion reasoning are performed in view of an abstract environment 510, comprising a storage region 512, a workspace region 514, and two burner regions 516, 518. In at least one embodiment, motion and/or predicate reasoning is performed in view of these abstract regions.

[0092] In at least one embodiment, a domain definition defines various predicates, such as the following:

pour(X, Y): in_hand(X), on(X, Y) ⇒ in(X, Y)

cook(X): in(X, b), on_burner(b) ⇒ cooked(X)

[0093] A domain definition may also impose constraints on the inputs. For example, regarding the above predicates, the domain definition may state that pour is constrained by Y being a container.

[0094] In at least one embodiment, the domain definition defines possible tasks using predicate definitions, such as those provided above. The domain definition may define various objects such as soup, beans, cracker_box, mustard_bottle, and so forth, and indicate constraints on how they may be used with the various predicates. As an example of defining a task, cooking an ingredient may involve the above-mentioned task predicates of pour and cook. One of these task predicates might be prevented by a blocking object. The blocking object might have to be moved to a different region to enable the predicate, but moving the blocking object might not, in and of itself, be part of the goal. Using both task and motion reasoning, as described herein, may enable a robot to repeat the cooking task with greater efficiency and reliability, e.g., by not moving objects not related to the task if they are not blocking performance of the goal, or by allowing the robot to accomplish the goal even when certain objects that were manipulated in the demonstration are not present.

[0095] It will be appreciated that the examples provided are intended to be illustrative, and that as such, they should not be construed in a manner that would limit the present disclosure to only those embodiments that incorporate the specific examples provided. For example, although certain techniques are described herein in relation to a cooking task, the techniques are applicable in a wide variety of applications. These include, but are not limited to, automobile operation, self-driving cars, medical applications, entertainment, factory automation, and so forth.

[0096] FIG. 6 illustrates an example of goal identification in demonstration learning, according to at least one embodiment. Although the example process 600 is depicted as a sequence of operations, it will be appreciated that, in embodiments, the depicted operations may be altered in various ways, and that some operations may be omitted, reordered, or performed in parallel with other operations, except where an order is explicitly stated or logically implied, such as when the input from one operation depends upon the output of another operation.

[0097] The operations depicted by FIG. 6 may be performed by a system, such as the computing device 102 depicted in FIG. 1, comprising at least one processor and a memory comprising instructions that, in response to being executed by the at least one processor, cause the system to perform the depicted operations.

[0098] At 602, the example system obtains video data of a demonstration. In at least one embodiment, a computing device, such as the computing device 102 depicted in FIG. 1, obtains the video data via one or more cameras trained on the physical area in which the demonstration will occur. The video data may be in any of a variety of formats, including, but not limited to, MPEG, WMV, MOV, AVI, or FLV. In at least one embodiment, video data is provided as images, in formats that may include but are not limited to JPEG, GIF, or PNG.

[0099] The demonstration, in at least one embodiment, comprises a series of actions performed by a demonstrator. The actions of the demonstrator manipulate objects in the environment. For example, a human demonstrator may perform a series of actions such as moving a box, opening a can, turning a knob, placing a pot on a stove, and so forth. The human demonstrator performs these tasks with the intention of showing how some goal might be accomplished. Note, however, that the goal is not known a priori by the system. Rather, the system infers the goal based on the techniques described herein.
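The intention test of paragraphs [0082] through [0084] can be sketched numerically. In this illustrative fragment, straight-line distance stands in for the planner-computed optimal trajectory cost C(·) (a simplifying assumption; the disclosure describes RRT*-style approximation), the prior P(g) is taken as uniform, and each candidate predicate is paired with a hypothetical target point:

```python
import math

def likelihood(xi, g):
    """P(xi | g) up to a constant: exp(-C(s->g) - C(g->q)) / exp(-C(s->q)).

    The observed trajectory xi contributes its start s and end q;
    straight-line distance approximates the optimal cost C.
    """
    s, q = xi[0], xi[-1]
    detour = math.dist(s, g) + math.dist(g, q)
    return math.exp(-detour) / math.exp(-math.dist(s, q))

def infer_intention(xi, candidates):
    """Pick the candidate predicate whose target point best explains xi."""
    return max(candidates, key=lambda c: likelihood(xi, c[1]))[0]

# A box moved from (0, 0) to (2, 0): was the intent the storage spot at
# (1, 1), or the far shelf at (0, 2)? The smaller detour scores higher.
xi = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
candidates = [("in_storage(box)", (1.0, 1.0)), ("on_shelf(box)", (0.0, 2.0))]
print(infer_intention(xi, candidates))
```

The same ratio, applied to the task predicate g_1 versus the motion predicate m_1(b_1) of a segment, yields the comparison P(m_1 | ξ_1) > P(g_1 | ξ_1) used in the motion reasoning above.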
[0100] At 604, the system segments the video data by object trajectory. Segmentation of the video data, in at least one embodiment, comprises identifying starting and stopping points of portions of the video. In at least one embodiment, the segmentation is done so that each segment corresponds to a movement of an object. For example, a segment may show the movement of an object as it is picked up at a point A and moved to a point B. In some cases, a segment may show other actions, which may also be described as trajectories, such as rotating, opening, or otherwise acting upon an object. A segment, in some cases, may contain video data depicting one such action.

[0101] In at least one embodiment, the segmentation is done so that each segment can be analyzed to determine which predicates are satisfied by the movement. In at least one embodiment, a domain definition provides information usable by the system to identify predicates. A domain definition can be provided via a code or configuration file. The domain may relate to particular types of activities, such as cooking, driving, medical treatment, and so forth. In at least one embodiment, a domain definition describes task predicates, but not motion predicates.

[0102] At 606, the system identifies a motion predicate that is satisfied by manipulation of a first object. For example, the system may examine trajectories in subsequent video segments, and determine whether or not, or to what degree, those trajectories were enabled by the manipulation of the first object.

[0103] As described herein, for example with respect to FIG. 4, embodiments may analyze an object trajectory with respect to the trajectory satisfying a task predicate and a motion predicate. In at least one embodiment, the system identifies a task predicate from information in a domain definition. The task predicate identifies a condition that is satisfied by the observed trajectory of the object. For example, if an object is moved to one side of the work surface, the trajectory of the object might satisfy a predicate such as in_storage(object), where the "storage" area is the side of the work surface. A similar predicate might be satisfied by placing the object in a cabinet.

[0104] The trajectory of the first object is also analyzed with respect to satisfying a motion predicate. As described herein, for example with respect to FIG. 4, a trajectory may be classified as a motion predicate based on the trajectory's enabling of a subsequent trajectory. In at least one embodiment, this classification comprises assigning a percentage likelihood or other value, indicative of a degree to which a subsequent trajectory is enabled by the motion, or a degree to which the first object's trajectory is estimated to be intentional with respect to enabling the subsequent trajectory.

[0105] At 608, the system identifies a task predicate that was satisfied by manipulation of a second object. The first object's trajectory, in at least one embodiment, is analyzed with respect to the second object's trajectory to determine whether the demonstrator's intention was to satisfy the task predicate or the motion predicate. The system may determine that the trajectory was more likely intended to enable a subsequent trajectory, and classify the trajectory as satisfying a motion predicate. Similarly, if the system were to determine that the trajectory was not likely intended to enable a subsequent trajectory, it may classify the first trajectory as satisfying a task predicate. In at least one embodiment, these classifications are made by assigning a percentage or probability value to the respective classifications.

[0106] ... determined to be intentional or demonstrative of the goal. In at least one embodiment, a task predicate is used to identify a goal based on trajectory analysis. For example, a task predicate is used to identify a goal when the trajectory that satisfied the task predicate did not enable, or did not appear to be intended to enable, a subsequent trajectory. In at least one embodiment, using motion predicate reasoning, the system decides if a task predicate is intentional, and then pools all intentional task predicates into a final goal. In at least one embodiment, intentional task predicates which are preconditions for later intentional task predicates are removed, and those that remain constitute the final goal.

[0107] FIG. 7 illustrates an example of goal reproduction by a robotic manipulation device, according to at least one embodiment. Although the example process 700 is depicted as a sequence of operations, it will be appreciated that, in embodiments, the depicted operations may be altered in various ways, and that some operations may be omitted, reordered, or performed in parallel with other operations, except where an order is explicitly stated or logically implied, such as when the input from one operation depends upon the output of another operation.

[0108] The operations depicted by FIG. 7 may be performed by a system, such as robotic device 106 depicted in FIG. 1. The robotic device, in at least one embodiment, comprises a robotic manipulation device, such as a robotic arm, at least one processor, and a memory comprising instructions that are executable by the at least one processor, and which when executed cause the robotic device to perform the operations described in relation to FIG. 7. In general, the robotic device, having obtained the inferred goal, can be instructed to attempt to achieve the inferred goal in a new situation or environment. The robotic device may therefore be able to assist a user by performing a task the user had previously demonstrated.

[0109] At 702, in at least one embodiment, the robotic device obtains the goal of a demonstration. This may be performed, in at least one embodiment, by the robotic device using the techniques described in relation to FIG. 6. In some embodiments, another device, such as the computing device 102 depicted in FIG. 1, observes a demonstration and determines its goal. This may also be done, in at least one embodiment, according to the techniques described in relation to FIG. 6. In at least one embodiment, the determined goal is then transmitted to the robotic device.

[0110] In at least one embodiment, the goal comprises one or more predicates. In at least one embodiment, these predicates are those that were not identified as preconditions of subsequent predicates during the demonstration. In at least one embodiment, the predicates of the goal are representative of an end-state demonstrated by the user. In at least one embodiment, the goal comprises one or more predicates that are included in the goal based, at least in part, on intentionality associated with the predicate.

[0111] At 704, the robotic device analyzes video data of an environment. This may be different from the environment in which the demonstration occurred. For example, a cooking task might be demonstrated in one kitchen, and then subsequently performed in a different kitchen.

[0112] In at least one embodiment, analysis of the video data comprises detecting objects and object poses. This may
include, for example , identifying objects that are described [ 0106 ] At 610 , the system identifies a goal of the demon in a domain definition and identifying where those objects stration based on those task predicates that have been are located within the scene represented by the video data . US 2021/0081752 A1 Mar. 18 , 2021 7

[0113] At 706, the robotic device identifies predicates that are necessary to complete the goal, in view of the objects identified. In at least one embodiment, this is done by starting with one or more predicates that define the goal, and working backwards to identify intermediate and preliminary steps to achieve the goal.

[0114] At 708, the robotic device identifies trajectories that would satisfy the identified predicates. For example, the robotic device may determine that an ingredient to be placed in a pot must be moved from its current location to the location of the pot, in order to satisfy a predicate in_pot(ingredient). In some cases, other objects will need to be moved out of the way in order for the trajectory to be performed. The robotic device can identify such situations and respond to them. This is similar to the situation in which the demonstrator moves an object out of the way during the demonstration. However, such trajectories may be determined, during goal identification, to be intended as motion predicates rather than task predicates, and excluded from the robotic device's plan for achieving the goal. For example, if an object was moved during the demonstration to satisfy a motion predicate, but is not blocking another trajectory, the robotic device does not move the object. Similarly, if the object is not present when the robotic device achieves the goal, the robotic device's planning algorithm is not disrupted by the absence of the object.

[0115] At 710, the robotic device causes its robotic manipulation device, such as a robotic arm, to manipulate objects in the environment in order to accomplish the goal. In at least one embodiment, the robotic device formulates instructions that cause one or more of its manipulation devices to implement a trajectory. For example, in order to satisfy an in_pot(ingredient) predicate, the robotic device may formulate instructions such as "move arm to x1,y1,z1," "grasp object," "move arm to x2,y2,z2," and so on.

Inference and Training Logic

[0116] FIG. 8A illustrates inference and/or training logic 815 used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided below in conjunction with FIGS. 8A and/or 8B.

[0117] In at least one embodiment, inference and/or training logic 815 may include, without limitation, code and/or data storage 801 to store forward and/or output weight and/or input/output data, and/or other parameters to configure neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, training logic 815 may include, or be coupled to, code and/or data storage 801 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code, such as graph code, loads weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds. In at least one embodiment, code and/or data storage 801 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during forward propagation of input/output data and/or weight parameters during training and/or inferencing using aspects of one or more embodiments. In at least one embodiment, any portion of code and/or data storage 801 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory.

[0118] In at least one embodiment, any portion of code and/or data storage 801 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, code and/or data storage 801 may be cache memory, dynamic randomly addressable memory ("DRAM"), static randomly addressable memory ("SRAM"), non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, a choice of whether code and/or data storage 801 is internal or external to a processor, for example, or comprises DRAM, SRAM, flash, or some other storage type may depend on available storage on-chip versus off-chip, latency requirements of training and/or inferencing functions being performed, batch size of data used in inferencing and/or training of a neural network, or some combination of these factors.

[0119] In at least one embodiment, inference and/or training logic 815 may include, without limitation, a code and/or data storage 805 to store backward and/or output weight and/or input/output data corresponding to neurons or layers of a neural network trained and/or used for inferencing in aspects of one or more embodiments. In at least one embodiment, code and/or data storage 805 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during backward propagation of input/output data and/or weight parameters during training and/or inferencing using aspects of one or more embodiments. In at least one embodiment, training logic 815 may include, or be coupled to, code and/or data storage 805 to store graph code or other software to control timing and/or order in which weight and/or other parameter information is to be loaded to configure logic, including integer and/or floating point units (collectively, arithmetic logic units (ALUs)).

[0120] In at least one embodiment, code, such as graph code, causes the loading of weight or other parameter information into processor ALUs based on an architecture of a neural network to which such code corresponds. In at least one embodiment, any portion of code and/or data storage 805 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of code and/or data storage 805 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, code and/or data storage 805 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, a choice of whether code and/or data storage 805 is internal or external to a processor, for example, or comprises DRAM, SRAM, flash memory, or some other storage type may depend on available storage on-chip versus off-chip, latency requirements of training and/or inferencing functions being performed, batch size of data used in inferencing and/or training of a neural network, or some combination of these factors.

[0121] In at least one embodiment, code and/or data storage 801 and code and/or data storage 805 may be separate storage structures. In at least one embodiment, code and/or data storage 801 and code and/or data storage 805 may be a combined storage structure. In at least one embodiment, code and/or data storage 801 and code and/or data
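The plan-construction steps of the goal-reproduction process described earlier, namely working backwards from the goal predicates through their preconditions and then formulating low-level arm instructions per trajectory, can be sketched as follows. The precondition table, predicate names, and instruction strings are hypothetical illustrations, not the patent's planner:

```python
# Hypothetical sketch of backward-chaining from goal predicates to an
# ordered plan, then formulating simple manipulator instructions.
# The precondition table and locations are invented for illustration.

PRECONDITIONS = {
    "in_pot(ingredient)": ["holding(ingredient)"],
    "holding(ingredient)": [],
}

def order_steps(goal, preconditions=PRECONDITIONS):
    """Return predicates in the order they must be achieved (preconditions first)."""
    steps = []
    def visit(pred):
        for pre in preconditions.get(pred, []):
            visit(pre)  # recurse: preliminary steps come before the goal step
        if pred not in steps:
            steps.append(pred)
    for pred in goal:
        visit(pred)
    return steps

def formulate_instructions(step, locations):
    """Translate one predicate into low-level arm instructions."""
    if step == "holding(ingredient)":
        x, y, z = locations["ingredient"]
        return [f"move arm to {x},{y},{z}", "grasp object"]
    if step == "in_pot(ingredient)":
        x, y, z = locations["pot"]
        return [f"move arm to {x},{y},{z}", "release object"]
    return []

locations = {"ingredient": (0.2, 0.4, 0.1), "pot": (0.6, 0.3, 0.2)}
plan = [instr for step in order_steps(["in_pot(ingredient)"])
        for instr in formulate_instructions(step, locations)]
print(plan)
# → ['move arm to 0.2,0.4,0.1', 'grasp object',
#    'move arm to 0.6,0.3,0.2', 'release object']
```

A blocked-trajectory check (moving an obstructing object only when it actually blocks the path) would slot in before `formulate_instructions`, consistent with the behavior described for absent or non-blocking objects.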
storage 805 may be partially combined and partially separate. In at least one embodiment, any portion of code and/or data storage 801 and code and/or data storage 805 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory.

[0122] In at least one embodiment, inference and/or training logic 815 may include, without limitation, one or more arithmetic logic unit(s) ("ALU(s)") 810, including integer and/or floating point units, to perform logical and/or mathematical operations based, at least in part, on or indicated by training and/or inference code (e.g., graph code), a result of which may produce activations (e.g., output values from layers or neurons within a neural network) stored in an activation storage 820 that are functions of input/output and/or weight parameter data stored in code and/or data storage 801 and/or code and/or data storage 805. In at least one embodiment, activations stored in activation storage 820 are generated according to linear algebraic and/or matrix-based mathematics performed by ALU(s) 810 in response to performing instructions or other code, wherein weight values stored in code and/or data storage 805 and/or data storage 801 are used as operands along with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which may be stored in code and/or data storage 805 or code and/or data storage 801 or another storage on or off-chip.

[0123] In at least one embodiment, ALU(s) 810 are included within one or more processors or other hardware logic devices or circuits, whereas in another embodiment, ALU(s) 810 may be external to a processor or other hardware logic device or circuit that uses them (e.g., a co-processor). In at least one embodiment, ALUs 810 may be included within a processor's execution units or otherwise within a bank of ALUs accessible by a processor's execution units, either within the same processor or distributed between different processors of different types (e.g., central processing units, graphics processing units, fixed function units, etc.). In at least one embodiment, code and/or data storage 801, code and/or data storage 805, and activation storage 820 may share a processor or other hardware logic device or circuit, whereas in another embodiment, they may be in different processors or other hardware logic devices or circuits, or some combination of same and different processors or other hardware logic devices or circuits. In at least one embodiment, any portion of activation storage 820 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory. Furthermore, inferencing and/or training code may be stored with other code accessible to a processor or other hardware logic or circuit and fetched and/or processed using a processor's fetch, decode, scheduling, execution, retirement, and/or other logical circuits.

[0124] In at least one embodiment, activation storage 820 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, activation storage 820 may be completely or partially within or external to one or more processors or other logical circuits. In at least one embodiment, a choice of whether activation storage 820 is internal or external to a processor, for example, or comprises DRAM, SRAM, flash memory, or some other storage type may depend on available storage on-chip versus off-chip, latency requirements of training and/or inferencing functions being performed, batch size of data used in inferencing and/or training of a neural network, or some combination of these factors.

[0125] In at least one embodiment, inference and/or training logic 815 illustrated in FIG. 8A may be used in conjunction with an application-specific integrated circuit ("ASIC"), such as a TensorFlow® Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, inference and/or training logic 815 illustrated in FIG. 8A may be used in conjunction with central processing unit ("CPU") hardware, graphics processing unit ("GPU") hardware, or other hardware, such as field programmable gate arrays ("FPGAs").

[0126] FIG. 8B illustrates inference and/or training logic 815, according to at least one embodiment. In at least one embodiment, inference and/or training logic 815 may include, without limitation, hardware logic in which computational resources are dedicated or otherwise exclusively used in conjunction with weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, inference and/or training logic 815 illustrated in FIG. 8B may be used in conjunction with an application-specific integrated circuit (ASIC), such as a TensorFlow® Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, inference and/or training logic 815 illustrated in FIG. 8B may be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware, such as field programmable gate arrays (FPGAs). In at least one embodiment, inference and/or training logic 815 includes, without limitation, code and/or data storage 801 and code and/or data storage 805, which may be used to store code (e.g., graph code), weight values, and/or other information, including bias values, gradient information, momentum values, and/or other parameter or hyperparameter information. In at least one embodiment illustrated in FIG. 8B, each of code and/or data storage 801 and code and/or data storage 805 is associated with a dedicated computational resource, such as computational hardware 802 and computational hardware 806, respectively. In at least one embodiment, each of computational hardware 802 and computational hardware 806 comprises one or more ALUs that perform mathematical functions, such as linear algebraic functions, only on information stored in code and/or data storage 801 and code and/or data storage 805, respectively, the result of which is stored in activation storage 820.

[0127] In at least one embodiment, each of code and/or data storage 801 and 805 and corresponding computational hardware 802 and 806, respectively, correspond to different layers of a neural network, such that the resulting activation from one storage/computational pair 801/802 of code and/or data storage 801 and computational hardware 802 is provided as an input to the next storage/computational pair 805/806 of code and/or data storage 805 and computational hardware 806, in order to mirror a conceptual organization of a neural network. In at least one embodiment, each of storage/computational pairs 801/802 and 805/806 may correspond to more than one neural network layer. In at least one embodiment, additional storage/computation pairs (not
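The storage/computational pairing of FIG. 8B, in which each layer's weight storage is bound to its own compute resource and the activation produced by one pair is fed forward as the next pair's input, can be illustrated with a minimal pure-Python sketch. The layer sizes, weights, and ReLU choice below are invented for illustration, not taken from the patent:

```python
# Minimal sketch of FIG. 8B's organization: each layer pairs dedicated
# weight storage with dedicated compute, and one pair's activation is
# fed forward as input to the next pair. Weights are illustrative.

class StorageComputePair:
    """A weight store (code/data storage) bound to its own compute (ALUs)."""
    def __init__(self, weights, bias):
        self.weights = weights  # list of rows: this pair's dedicated storage
        self.bias = bias

    def compute(self, inputs):
        # Linear-algebraic function performed only on this pair's storage.
        out = []
        for row, b in zip(self.weights, self.bias):
            s = sum(w * x for w, x in zip(row, inputs)) + b
            out.append(max(0.0, s))  # ReLU activation
        return out

# Two pairs (801/802 and 805/806 in the figure's numbering), chained so the
# first pair's activation supplies the second pair's input.
pair_801_802 = StorageComputePair([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
pair_805_806 = StorageComputePair([[1.0, 1.0]], [0.1])

activation_820 = pair_801_802.compute([2.0, 1.0])  # activation storage
output = pair_805_806.compute(activation_820)
print(output)
```

The chaining mirrors the conceptual layer organization described for FIG. 8B; a real implementation would of course dispatch these products to hardware ALUs rather than Python loops.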
18 , 2021 9 shown ) subsequent to or in parallel with storage/ computa anomaly detection , which allows identification of data tion pairs 801/802 and 805/806 may be included in inference points in new dataset 912 that deviate from normal patterns and / or training logic 815 . of new dataset 912 . [ 0131 ] In at least one embodiment, semi -supervised learn Neural Network Training and Deployment ing may be used , which is a technique in which in training dataset 902 includes a mix of labeled and unlabeled data . In [ 0128 ] FIG.9 illustrates training and deployment of a deep at least one embodiment, training framework 904 may be neural network , according to at least one embodiment. In at used to perform incremental learning, such as through least one embodiment, untrained neural network 906 is transferred learning techniques . In at least one embodiment, trained using a training dataset 902. In at least one embodi incremental learning enables trained neural network 908 to ment, training framework 904 is a PyTorch framework , adapt to new dataset 912 without forgetting knowledge whereas in other embodiments, training framework 904 is a instilled within trained neural network 908 during initial TensorFlow , Boost , Caffe, Cognitive Toolkit / training CNTK , MXNet, Chainer, Keras , Deeplearning4j, or other training framework . In at least one embodiment, training Data Center framework 904 trains an untrained neural network 906 and [ 0132 ] FIG . 10 illustrates an example data center 1000 , in enables it to be trained using processing resources described which at least one embodiment may be used . In at least one herein to generate a trained neural network 908. In at least embodiment, data center 1000 includes a data center infra one embodiment, weights may be chosen randomly or by structure layer 1010 , a framework layer 1020 , a software pre -training using a deep belief network . In at least one layer 1030 and an application layer 1040 . 
embodiment, training may be performed in either a super [ 0133 ] In at least one embodiment, as shown in FIG . 10 , vised , partially supervised , or unsupervised manner. data center infrastructure layer 1010 may include a resource [ 0129 ] In at least one embodiment, untrained neural net orchestrator 1012 , grouped computing resources 1014 , and work 906 is trained using supervised learning, wherein node computing resources ( “ node C.R.s ” ) 1016 ( 1 ) -1016 ( N ), training dataset 902 includes an input paired with a desired where “ N ” represents a positive integer (which may be a output for an input, or where training dataset 902 includes different integer “ N ” than used in other figures ). In at least input having a known output and an output of neural one embodiment, node C.R.s 1016 ( 1 ) -1016 ( N ) may include , network 906 is manually graded . In at least one embodi but are not limited to , any number of central processing units ment, untrained neural network 906 is trained in a super ( “ CPUs ” ) or other processors ( including accelerators, field vised manner and processes inputs from training dataset 902 programmable gate arrays ( FPGAs ), graphics processors , and compares resulting outputs against a set of expected or etc. ) , memory storage devices 1018 ( 1 ) -1018 ( N ) ( e.g. , desired outputs. In at least one embodiment, errors are then dynamic read - only memory , solid state storage or disk propagated back through untrained neural network 906. In at drives ), network input/ output ( “ NW 1 / O ” ) devices , network least one embodiment, training framework 904 adjusts switches , virtual machines ( “ VMs” ), power modules , and weights that control untrained neural network 906. In at least cooling modules , etc. 
In at least one embodiment, one or one embodiment, training framework 904 includes tools to more node C.R.s from among node C.R.s 1016 ( 1 ) -1016 ( N ) monitor how well untrained neural network 906 is converg may be a server having one or more of above -mentioned ing towards a model , such as trained neural network 908 , computing resources . suitable to generating correct answers, such as in result 914 , [ 0134 ] In at least one embodiment, grouped computing based on input data such as a new dataset 912. In at least one resources 1014 may include separate groupings of node embodiment, training framework 904 trains untrained neural C.R.s housed within one or more racks ( not shown ) , or many network 906 repeatedly while adjust weights to refine an racks housed in data centers at various geographical loca output of untrained neural network 906 using a loss function tions ( also not shown ) . In at least one embodiment, separate and adjustment algorithm , such as stochastic gradient groupings of node C.R.s within grouped computing descent. In at least one embodiment, training framework 904 resources 1014 may include grouped compute , network , trains untrained neural network 906 until untrained neural memory or storage resources that may be configured or network 906 achieves a desired accuracy . In at least one allocated to support one or more workloads. In at least one embodiment, trained neural network 908 can then be embodiment, several node C.R.s including CPUs or proces deployed to implement any number of machine learning sors may grouped within one or more racks to provide operations. compute resources to support one or more workloads. In at [ 0130 ] In at least one embodiment, untrained neural net least one embodiment, one or more racks may also include work 906 is trained using unsupervised learning , wherein any number of power modules, cooling modules, and net untrained neural network 906 attempts to train itself using work switches , in any combination . 
unlabeled data . In at least one embodiment, unsupervised [ 0135 ] In at least one embodiment, resource orchestrator learning training dataset 902 will include input data without 1012 may configure or otherwise control one or more node any associated output data or “ ground truth ” data . In at least C.R.s 1016 ( 1 ) -1016 ( N ) and / or grouped computing resources one embodiment, untrained neural network 906 can learn 1014. In at least one embodiment, resource orchestrator groupings within training dataset 902 and can determine 1012 may include a software design infrastructure ( “ SDI ” ) how individual inputs are related to untrained dataset 902. In management entity for data center 1000. In at least one at least one embodiment, unsupervised training can be used embodiment, resource orchestrator 812 may include hard to generate a self -organizing map in trained neural network ware , software or some combination thereof. 908 capable of performing operations useful in reducing [ 0136 ] In at least one embodiment, as shown in FIG . 10 , dimensionality of new dataset 912. In at least one embodi framework layer 1020 includes a job scheduler 1022 , a ment, unsupervised training can also be used to perform configuration manager 1024 , a resource manager 1026 and US 2021/0081752 A1 Mar. 18 , 2021 10 a distributed file system 1028. In at least one embodiment, information using one or more machine learning models framework layer 1020 may include a framework to support according to one or more embodiments described herein . software 1032 of software layer 1030 and / or one or more For example, in at least one embodiment, a machine learning application ( s) 1042 of application layer 1040. 
In at least one model may be trained by calculating weight parameters embodiment, software 1032 or application ( s ) 1042 may according to a neural network architecture using software respectively include web - based service software or applica and computing resources described above with respect to tions , such as those provided by Amazon Web Services, data center 1000. In at least one embodiment, trained Google Cloud and Microsoft Azure . In at least one embodi machine learning models corresponding to one or more ment, framework layer 1020 may be , but is not limited to , a neural networks may be used to infer or predict information type of free and open - source software web application using resources described above with respect to data center framework such as Apache SparkTM ( hereinafter “ Spark ” ) 1000 by using weight parameters calculated through one or that may utilize distributed file system 1028 for large - scale more training techniques described herein . data processing ( e.g. , “ big data " ). In at least one embodi [ 0141 ] In at least one embodiment, data center may use ment, job scheduler 1032 may include a Spark driver to CPUs, application - specific integrated circuits ( ASICs ) , facilitate scheduling of workloads supported by various GPUs , FPGAs, or other hardware to perform training and / or layers of data center 1000. In at least one embodiment, inferencing using above - described resources . Moreover, one configuration manager 1024 may be capable of configuring or more software and / or hardware resources described above different layers such as software layer 1030 and framework may be configured as a service to allow users to train or layer 1020 including Spark and distributed file system 1028 performing inferencing of information, such as image rec for supporting large - scale data processing. In at least one ognition , speech recognition, or other artificial intelligence embodiment, resource manager 1026 may be capable of services . 
managing clustered or grouped computing resources mapped to or allocated for support of distributed file system 1028 and job scheduler 1022. In at least one embodiment, clustered or grouped computing resources may include grouped computing resources 1014 at data center infrastructure layer 1010. In at least one embodiment, resource manager 1026 may coordinate with resource orchestrator 1012 to manage these mapped or allocated computing resources.

[0137] In at least one embodiment, software 1032 included in software layer 1030 may include software used by at least portions of node C.R.s 1016(1)-1016(N), grouped computing resources 1014, and/or distributed file system 1028 of framework layer 1020. In at least one embodiment, one or more types of software may include, but are not limited to, Internet web page search software, e-mail virus scan software, database software, and streaming video content software.

[0138] In at least one embodiment, application(s) 1042 included in application layer 1040 may include one or more types of applications used by at least portions of node C.R.s 1016(1)-1016(N), grouped computing resources 1014, and/or distributed file system 1028 of framework layer 1020. In at least one embodiment, one or more types of applications may include, but are not limited to, any number of a genomics application, a cognitive compute application, and a machine learning application, including training or inferencing software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.

[0139] In at least one embodiment, any of configuration manager 1024, resource manager 1026, and resource orchestrator 1012 may implement any number and type of self-modifying actions based on any amount and type of data acquired in any technically feasible fashion. In at least one embodiment, self-modifying actions may relieve a data center operator of data center 1000 from making possibly bad configuration decisions and possibly avoid underutilized and/or poor performing portions of a data center.

[0140] In at least one embodiment, data center 1000 may include tools, services, software, or other resources to train one or more machine learning models or predict or infer

[0142] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in system FIG. 10 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0143] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

Autonomous Vehicle

[0144] FIG. 11A illustrates an example of an autonomous vehicle 1100, according to at least one embodiment. In at least one embodiment, autonomous vehicle 1100 (alternatively referred to herein as "vehicle 1100") may be, without limitation, a passenger vehicle, such as a car, a truck, a bus, and/or another type of vehicle that accommodates one or more passengers. In at least one embodiment, vehicle 1100 may be a semi-tractor-trailer truck used for hauling cargo. In at least one embodiment, vehicle 1100 may be an airplane, robotic vehicle, or other kind of vehicle.

[0145] Autonomous vehicles may be described in terms of automation levels, defined by National Highway Traffic Safety Administration ("NHTSA"), a division of US Department of Transportation, and Society of Automotive Engineers ("SAE") "Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles" (e.g., Standard No. J3016-201806, published on Jun. 15, 2018, Standard No. J3016-201609, published on
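The goal-identification idea referenced in paragraph [0143] (and in the abstract) can be illustrated with a small sketch: each object trajectory is checked against a task predicate (does the object end up in a goal state?) and a motion predicate (was the motion deliberate rather than accidental?). All function names, the region-based predicate, and the intentionality threshold below are illustrative assumptions, not the patent's actual implementation.

```python
# Hypothetical sketch of goal identification from demonstration trajectories.
# Trajectories are lists of (x, y) positions; all thresholds are assumed.

def satisfies_task_predicate(trajectory, goal_region):
    """Task predicate: the object ends inside an axis-aligned goal region."""
    x, y = trajectory[-1]
    (x0, y0), (x1, y1) = goal_region
    return x0 <= x <= x1 and y0 <= y <= y1

def motion_intentionality(trajectory):
    """Motion predicate: ratio of net displacement to total path length.
    Near 1.0 suggests direct, deliberate motion; near 0.0 suggests drift."""
    path = sum(((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
               for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]))
    if path == 0.0:
        return 0.0
    net = ((trajectory[-1][0] - trajectory[0][0]) ** 2 +
           (trajectory[-1][1] - trajectory[0][1]) ** 2) ** 0.5
    return net / path

def identify_goal(trajectories, goal_region, threshold=0.8):
    """Return ids of objects whose trajectories satisfy the task predicate
    with intentional (non-accidental) motion."""
    return [obj for obj, traj in trajectories.items()
            if satisfies_task_predicate(traj, goal_region)
            and motion_intentionality(traj) >= threshold]
```

Under this sketch, an object moved straight into the goal region counts toward the goal, while an object that wanders into the same region with a meandering path is rejected as unintentional.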

Sep. 30, 2016, and previous and future versions of this standard). In at least one embodiment, vehicle 1100 may be capable of functionality in accordance with one or more of Level 1 through Level 5 of autonomous driving levels. For example, in at least one embodiment, vehicle 1100 may be capable of conditional automation (Level 3), high automation (Level 4), and/or full automation (Level 5), depending on embodiment.

[0146] In at least one embodiment, vehicle 1100 may include, without limitation, components such as a chassis, a vehicle body, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and other components of a vehicle. In at least one embodiment, vehicle 1100 may include, without limitation, a propulsion system 1150, such as an internal combustion engine, hybrid electric power plant, an all-electric engine, and/or another propulsion system type. In at least one embodiment, propulsion system 1150 may be connected to a drive train of vehicle 1100, which may include, without limitation, a transmission, to enable propulsion of vehicle 1100. In at least one embodiment, propulsion system 1150 may be controlled in response to receiving signals from a throttle/accelerator(s) 1152.

[0147] In at least one embodiment, a steering system 1154, which may include, without limitation, a steering wheel, is used to steer vehicle 1100 (e.g., along a desired path or route) when propulsion system 1150 is operating (e.g., when vehicle 1100 is in motion). In at least one embodiment, steering system 1154 may receive signals from steering actuator(s) 1156. In at least one embodiment, a steering wheel may be optional for full automation (Level 5) functionality. In at least one embodiment, a brake sensor system 1146 may be used to operate vehicle brakes in response to receiving signals from brake actuator(s) 1148 and/or brake sensors.

[0148] In at least one embodiment, controller(s) 1136, which may include, without limitation, one or more system on chips ("SoCs") (not shown in FIG. 11A) and/or graphics processing unit(s) ("GPU(s)"), provide signals (e.g., representative of commands) to one or more components and/or systems of vehicle 1100. For instance, in at least one embodiment, controller(s) 1136 may send signals to operate vehicle brakes via brake actuator(s) 1148, to operate steering system 1154 via steering actuator(s) 1156, and to operate propulsion system 1150 via throttle/accelerator(s) 1152. In at least one embodiment, controller(s) 1136 may include one or more onboard (e.g., integrated) computing devices that process sensor signals and output operation commands (e.g., signals representing commands) to enable autonomous driving and/or to assist a human driver in driving vehicle 1100. In at least one embodiment, controller(s) 1136 may include a first controller for autonomous driving functions, a second controller for functional safety functions, a third controller for artificial intelligence functionality (e.g., computer vision), a fourth controller for infotainment functionality, a fifth controller for redundancy in emergency conditions, and/or other controllers. In at least one embodiment, a single controller may handle two or more of above functionalities, two or more controllers may handle a single functionality, and/or any combination thereof.

[0149] In at least one embodiment, controller(s) 1136 provide signals for controlling one or more components and/or systems of vehicle 1100 in response to sensor data received from one or more sensors (e.g., sensor inputs). In at least one embodiment, sensor data may be received from, for example and without limitation, global navigation satellite systems ("GNSS") sensor(s) 1158 (e.g., Global Positioning System sensor(s)), RADAR sensor(s) 1160, ultrasonic sensor(s) 1162, LIDAR sensor(s) 1164, inertial measurement unit ("IMU") sensor(s) 1166 (e.g., accelerometer(s), gyroscope(s), a magnetic compass or magnetic compasses, magnetometer(s), etc.), microphone(s) 1196, stereo camera(s) 1168, wide-view camera(s) 1170 (e.g., fisheye cameras), infrared camera(s) 1172, surround camera(s) 1174 (e.g., 360 degree cameras), long-range cameras (not shown in FIG. 11A), mid-range camera(s) (not shown in FIG. 11A), speed sensor(s) 1144 (e.g., for measuring speed of vehicle 1100), vibration sensor(s) 1142, steering sensor(s) 1140, brake sensor(s) (e.g., as part of brake sensor system 1146), and/or other sensor types.

[0150] In at least one embodiment, one or more of controller(s) 1136 may receive inputs (e.g., represented by input data) from an instrument cluster 1132 of vehicle 1100 and provide outputs (e.g., represented by output data, display data, etc.) via a human-machine interface ("HMI") display 1134, an audible annunciator, a loudspeaker, and/or via other components of vehicle 1100. In at least one embodiment, outputs may include information such as vehicle velocity, speed, time, map data (e.g., a High Definition map (not shown in FIG. 11A)), location data (e.g., vehicle's 1100 location, such as on a map), direction, location of other vehicles (e.g., an occupancy grid), information about objects and status of objects as perceived by controller(s) 1136, etc. For example, in at least one embodiment, HMI display 1134 may display information about presence of one or more objects (e.g., a street sign, caution sign, traffic light changing, etc.), and/or information about driving maneuvers vehicle has made, is making, or will make (e.g., changing lanes now, taking exit 34B in two miles, etc.).

[0151] In at least one embodiment, vehicle 1100 further includes a network interface 1124 which may use wireless antenna(s) 1126 and/or modem(s) to communicate over one or more networks. For example, in at least one embodiment, network interface 1124 may be capable of communication over Long-Term Evolution ("LTE"), Wideband Code Division Multiple Access ("WCDMA"), Universal Mobile Telecommunications System ("UMTS"), Global System for Mobile communication ("GSM"), IMT-CDMA Multi-Carrier ("CDMA2000") networks, etc. In at least one embodiment, wireless antenna(s) 1126 may also enable communication between objects in environment (e.g., vehicles, mobile devices, etc.), using local area network(s), such as Bluetooth, Bluetooth Low Energy ("LE"), Z-Wave, ZigBee, etc., and/or low power wide-area network(s) ("LPWANs"), such as LoRaWAN, SigFox, etc.

[0152] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in system FIG. 11A for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0153] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.
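The controller-to-HMI flow in paragraph [0150] — instrument-cluster inputs in, display messages such as speed and upcoming maneuvers out — can be sketched minimally. The field names, units, and message formats below are assumptions for illustration only, not part of the disclosure.

```python
# Illustrative sketch of [0150]: map instrument-cluster inputs to HMI messages.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ClusterInput:
    speed_kph: float              # assumed unit; real systems vary
    next_exit: Optional[str]      # e.g., "34B", or None if no maneuver pending
    distance_km: Optional[float]  # distance to that maneuver

def hmi_messages(inp: ClusterInput) -> List[str]:
    """Produce display strings for an HMI, in the style of [0150]'s examples."""
    msgs = [f"Speed: {inp.speed_kph:.0f} km/h"]
    if inp.next_exit is not None and inp.distance_km is not None:
        msgs.append(f"Taking exit {inp.next_exit} in {inp.distance_km:g} km")
    return msgs
```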
[0154] FIG. 11B illustrates an example of camera locations and fields of view for autonomous vehicle 1100 of FIG. 11A, according to at least one embodiment. In at least one embodiment, cameras and respective fields of view are one example embodiment and are not intended to be limiting. For instance, in at least one embodiment, additional and/or alternative cameras may be included and/or cameras may be located at different locations on vehicle 1100.

[0155] In at least one embodiment, camera types for cameras may include, but are not limited to, digital cameras that may be adapted for use with components and/or systems of vehicle 1100. In at least one embodiment, camera(s) may operate at automotive safety integrity level ("ASIL") B and/or at another ASIL. In at least one embodiment, camera types may be capable of any image capture rate, such as 60 frames per second (fps), 120 fps, 240 fps, etc., depending on embodiment. In at least one embodiment, cameras may be capable of using rolling shutters, global shutters, another type of shutter, or a combination thereof. In at least one embodiment, color filter array may include a red clear clear clear ("RCCC") color filter array, a red clear clear blue ("RCCB") color filter array, a red blue green clear ("RBGC") color filter array, a Foveon X3 color filter array, a Bayer sensors ("RGGB") color filter array, a monochrome sensor color filter array, and/or another type of color filter array. In at least one embodiment, clear pixel cameras, such as cameras with an RCCC, an RCCB, and/or an RBGC color filter array, may be used in an effort to increase light sensitivity.

[0156] In at least one embodiment, one or more of camera(s) may be used to perform advanced driver assistance systems ("ADAS") functions (e.g., as part of a redundant or fail-safe design). For example, in at least one embodiment, a Multi-Function Mono Camera may be installed to provide functions including lane departure warning, traffic sign assist, and intelligent headlamp control. In at least one embodiment, one or more of camera(s) (e.g., all cameras) may record and provide image data (e.g., video) simultaneously.

[0157] In at least one embodiment, one or more cameras may be mounted in a mounting assembly, such as a custom designed (three-dimensional ("3D") printed) assembly, in order to cut out stray light and reflections from within vehicle 1100 (e.g., reflections from dashboard reflected in windshield mirrors) which may interfere with camera image data capture abilities. With reference to wing-mirror mounting assemblies, in at least one embodiment, wing-mirror assemblies may be custom 3D printed so that a camera mounting plate matches a shape of a wing-mirror. In at least one embodiment, camera(s) may be integrated into wing-mirrors. In at least one embodiment, for side-view cameras, camera(s) may also be integrated within four pillars at each corner of a cabin.

[0158] In at least one embodiment, cameras with a field of view that include portions of an environment in front of vehicle 1100 (e.g., front-facing cameras) may be used for surround view, to help identify forward facing paths and obstacles, as well as aid in, with help of one or more of controller(s) 1136 and/or control SoCs, providing information critical to generating an occupancy grid and/or determining preferred vehicle paths. In at least one embodiment, front-facing cameras may be used to perform many similar ADAS functions as LIDAR, including, without limitation, emergency braking, pedestrian detection, and collision avoidance. In at least one embodiment, front-facing cameras may also be used for ADAS functions and systems including, without limitation, Lane Departure Warnings ("LDW"), Autonomous Cruise Control ("ACC"), and/or other functions such as traffic sign recognition.

[0159] In at least one embodiment, a variety of cameras may be used in a front-facing configuration, including, for example, a monocular camera platform that includes a CMOS ("complementary metal oxide semiconductor") color imager. In at least one embodiment, a wide-view camera 1170 may be used to perceive objects coming into view from a periphery (e.g., pedestrians, crossing traffic or bicycles). Although only one wide-view camera 1170 is illustrated in FIG. 11B, in other embodiments, there may be any number (including zero) of wide-view cameras on vehicle 1100. In at least one embodiment, any number of long-range camera(s) 1198 (e.g., a long-view stereo camera pair) may be used for depth-based object detection, especially for objects for which a neural network has not yet been trained. In at least one embodiment, long-range camera(s) 1198 may also be used for object detection and classification, as well as basic object tracking.

[0160] In at least one embodiment, any number of stereo camera(s) 1168 may also be included in a front-facing configuration. In at least one embodiment, one or more of stereo camera(s) 1168 may include an integrated control unit comprising a scalable processing unit, which may provide a programmable logic ("FPGA") and a multi-core microprocessor with an integrated Controller Area Network ("CAN") or Ethernet interface on a single chip. In at least one embodiment, such a unit may be used to generate a 3D map of an environment of vehicle 1100, including a distance estimate for all points in an image. In at least one embodiment, one or more of stereo camera(s) 1168 may include, without limitation, compact stereo vision sensor(s) that may include, without limitation, two camera lenses (one each on left and right) and an image processing chip that may measure distance from vehicle 1100 to a target object and use generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. In at least one embodiment, other types of stereo camera(s) 1168 may be used in addition to, or alternatively from, those described herein.

[0161] In at least one embodiment, cameras with a field of view that include portions of environment to sides of vehicle 1100 (e.g., side-view cameras) may be used for surround view, providing information used to create and update an occupancy grid, as well as to generate side impact collision warnings. For example, in at least one embodiment, surround camera(s) 1174 (e.g., four surround cameras as illustrated in FIG. 11B) could be positioned on vehicle 1100. In at least one embodiment, surround camera(s) 1174 may include, without limitation, any number and combination of wide-view cameras, fisheye camera(s), 360 degree camera(s), and/or similar cameras. For instance, in at least one embodiment, four fisheye cameras may be positioned on a front, a rear, and sides of vehicle 1100. In at least one embodiment, vehicle 1100 may use three surround camera(s) 1174 (e.g., left, right, and rear), and may leverage one or more other camera(s) (e.g., a forward-facing camera) as a fourth surround-view camera.
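The distance estimation that a stereo camera pair (paragraph [0160]) enables rests on the standard pinhole relation: with focal length f (in pixels) and lens baseline B (in meters), depth Z = f·B / disparity. The numeric values below are assumed for illustration; they are not parameters from the disclosure.

```python
# Minimal sketch of stereo depth estimation as described around [0160].

def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Return depth in meters from pixel disparity between left/right images."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

A point matched with 10 px of disparity by a camera with an 800 px focal length and a 0.5 m baseline sits about 40 m away; halving the disparity doubles the estimated depth, which is why distant objects need wide baselines.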
[0162] In at least one embodiment, cameras with a field of view that include portions of an environment behind vehicle 1100 (e.g., rear-view cameras) may be used for parking assistance, surround view, rear collision warnings, and creating and updating an occupancy grid. In at least one embodiment, a wide variety of cameras may be used including, but not limited to, cameras that are also suitable as front-facing camera(s) (e.g., long-range cameras 1198, mid-range camera(s) 1176, stereo camera(s) 1168, infrared camera(s) 1172, etc.), as described herein.

[0163] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in system FIG. 11B for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0164] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0165] FIG. 11C is a block diagram illustrating an example system architecture for autonomous vehicle 1100 of FIG. 11A, according to at least one embodiment. In at least one embodiment, each of components, features, and systems of vehicle 1100 in FIG. 11C is illustrated as being connected via a bus 1102. In at least one embodiment, bus 1102 may include, without limitation, a CAN data interface (alternatively referred to herein as a "CAN bus"). In at least one embodiment, a CAN may be a network inside vehicle 1100 used to aid in control of various features and functionality of vehicle 1100, such as actuation of brakes, acceleration, braking, steering, windshield wipers, etc. In at least one embodiment, bus 1102 may be configured to have dozens or even hundreds of nodes, each with its own unique identifier (e.g., a CAN ID). In at least one embodiment, bus 1102 may be read to find steering wheel angle, ground speed, engine revolutions per minute ("RPMs"), button positions, and/or other vehicle status indicators. In at least one embodiment, bus 1102 may be a CAN bus that is ASIL B compliant.

[0166] In at least one embodiment, in addition to, or alternatively from, CAN, FlexRay and/or Ethernet protocols may be used. In at least one embodiment, there may be any number of busses forming bus 1102, which may include, without limitation, zero or more CAN busses, zero or more FlexRay busses, zero or more Ethernet busses, and/or zero or more other types of busses using different protocols. In at least one embodiment, two or more busses may be used to perform different functions, and/or may be used for redundancy. For example, a first bus may be used for collision avoidance functionality and a second bus may be used for actuation control. In at least one embodiment, each bus of bus 1102 may communicate with any of components of vehicle 1100, and two or more busses of bus 1102 may communicate with corresponding components. In at least one embodiment, each of any number of system(s) on chip(s) ("SoC(s)") 1104 (such as SoC 1104(A) and SoC 1104(B)), each of controller(s) 1136, and/or each computer within vehicle may have access to same input data (e.g., inputs from sensors of vehicle 1100), and may be connected to a common bus, such as a CAN bus.

[0167] In at least one embodiment, vehicle 1100 may include one or more controller(s) 1136, such as those described herein with respect to FIG. 11A. In at least one embodiment, controller(s) 1136 may be used for a variety of functions. In at least one embodiment, controller(s) 1136 may be coupled to any of various other components and systems of vehicle 1100, and may be used for control of vehicle 1100, artificial intelligence of vehicle 1100, infotainment for vehicle 1100, and/or other functions.

[0168] In at least one embodiment, vehicle 1100 may include any number of SoCs 1104. In at least one embodiment, each of SoCs 1104 may include, without limitation, central processing units ("CPU(s)") 1106, graphics processing units ("GPU(s)") 1108, processor(s) 1110, cache(s) 1112, accelerator(s) 1114, data store(s) 1116, and/or other components and features not illustrated. In at least one embodiment, SoC(s) 1104 may be used to control vehicle 1100 in a variety of platforms and systems. For example, in at least one embodiment, SoC(s) 1104 may be combined in a system (e.g., system of vehicle 1100) with a High Definition ("HD") map 1122 which may obtain map refreshes and/or updates via network interface 1124 from one or more servers (not shown in FIG. 11C).

[0169] In at least one embodiment, CPU(s) 1106 may include a CPU cluster or CPU complex (alternatively referred to herein as a "CCPLEX"). In at least one embodiment, CPU(s) 1106 may include multiple cores and/or level two ("L2") caches. For instance, in at least one embodiment, CPU(s) 1106 may include eight cores in a coherent multiprocessor configuration. In at least one embodiment, CPU(s) 1106 may include four dual-core clusters where each cluster has a dedicated L2 cache (e.g., a 2 megabyte (MB) L2 cache). In at least one embodiment, CPU(s) 1106 (e.g., CCPLEX) may be configured to support simultaneous cluster operations enabling any combination of clusters of CPU(s) 1106 to be active at any given time.

[0170] In at least one embodiment, one or more of CPU(s) 1106 may implement power management capabilities that include, without limitation, one or more of following features: individual hardware blocks may be clock-gated automatically when idle to save dynamic power; each core clock may be gated when such core is not actively executing instructions due to execution of Wait for Interrupt ("WFI")/Wait for Event ("WFE") instructions; each core may be independently power-gated; each core cluster may be independently clock-gated when all cores are clock-gated or power-gated; and/or each core cluster may be independently power-gated when all cores are power-gated. In at least one embodiment, CPU(s) 1106 may further implement an enhanced algorithm for managing power states, where allowed power states and expected wakeup times are specified, and hardware/microcode determines which best power state to enter for core, cluster, and CCPLEX. In at least one embodiment, processing cores may support simplified power state entry sequences in software with work offloaded to microcode.
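Paragraph [0165] notes that bus 1102 may be read to find steering wheel angle, engine RPM, and other status indicators keyed by CAN ID. A minimal decoder can sketch this; the frame layout below (ID 0x25 carrying a signed 16-bit steering angle in 0.1-degree units followed by an unsigned 16-bit RPM, little-endian) is invented purely for illustration — real layouts come from a vehicle's DBC definitions.

```python
# Hedged sketch of reading vehicle status off a CAN bus, per [0165].
import struct

STATUS_FRAME_ID = 0x25  # hypothetical CAN ID, not from the patent

def decode_status_frame(can_id: int, payload: bytes):
    """Return (steering_angle_deg, engine_rpm) from a hypothetical status frame."""
    if can_id != STATUS_FRAME_ID or len(payload) < 4:
        raise ValueError("not a status frame")
    raw_angle, rpm = struct.unpack_from("<hH", payload)  # int16 angle, uint16 rpm
    return raw_angle / 10.0, rpm  # angle scaled at 0.1 degree per bit (assumed)
```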
[0171] In at least one embodiment, GPU(s) 1108 may include an integrated GPU (alternatively referred to herein as an "iGPU"). In at least one embodiment, GPU(s) 1108 may be programmable and may be efficient for parallel workloads. In at least one embodiment, GPU(s) 1108 may use an enhanced tensor instruction set. In at least one embodiment, GPU(s) 1108 may include one or more streaming microprocessors, where each streaming microprocessor may include a level one ("L1") cache (e.g., an L1 cache with at least 96 KB storage capacity), and two or more streaming microprocessors may share an L2 cache (e.g., an L2 cache with a 512 KB storage capacity). In at least one embodiment, GPU(s) 1108 may include at least eight streaming microprocessors. In at least one embodiment, GPU(s) 1108 may use compute application programming interface(s) (API(s)). In at least one embodiment, GPU(s) 1108 may use one or more parallel computing platforms and/or programming models (e.g., NVIDIA's CUDA model).

[0172] In at least one embodiment, one or more of GPU(s) 1108 may be power-optimized for best performance in automotive and embedded use cases. For example, in at least one embodiment, GPU(s) 1108 could be fabricated on Fin field-effect transistor ("FinFET") circuitry. In at least one embodiment, each streaming microprocessor may incorporate a number of mixed-precision processing cores partitioned into multiple blocks. For example, and without limitation, 64 FP32 cores and 32 FP64 cores could be partitioned into four processing blocks. In at least one embodiment, each processing block could be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA Tensor cores for deep learning matrix arithmetic, a level zero ("L0") instruction cache, a warp scheduler, a dispatch unit, and/or a 64 KB register file. In at least one embodiment, streaming microprocessors may include independent parallel integer and floating-point data paths to provide for efficient execution of workloads with a mix of computation and addressing calculations. In at least one embodiment, streaming microprocessors may include independent thread scheduling capability to enable finer-grain synchronization and cooperation between parallel threads. In at least one embodiment, streaming microprocessors may include a combined L1 data cache and shared memory unit in order to improve performance while simplifying programming.

[0173] In at least one embodiment, one or more of GPU(s) 1108 may include a high bandwidth memory ("HBM") and/or a 16 GB HBM2 memory subsystem to provide, in some examples, about 900 GB/second peak memory bandwidth. In at least one embodiment, in addition to, or alternatively from, HBM memory, a synchronous graphics random-access memory ("SGRAM") may be used, such as a graphics double data rate type five synchronous random-access memory ("GDDR5").

[0174] In at least one embodiment, GPU(s) 1108 may include unified memory technology. In at least one embodiment, address translation services ("ATS") support may be used to allow GPU(s) 1108 to access CPU(s) 1106 page tables directly. In at least one embodiment, when a GPU of GPU(s) 1108 memory management unit ("MMU") experiences a miss, an address translation request may be transmitted to CPU(s) 1106. In response, a CPU of CPU(s) 1106 may look in its page tables for a virtual-to-physical mapping for an address and transmit translation back to GPU(s) 1108, in at least one embodiment. In at least one embodiment, unified memory technology may allow a single unified virtual address space for memory of both CPU(s) 1106 and GPU(s) 1108, thereby simplifying GPU(s) 1108 programming and porting of applications to GPU(s) 1108.

[0175] In at least one embodiment, GPU(s) 1108 may include any number of access counters that may keep track of frequency of access of GPU(s) 1108 to memory of other processors. In at least one embodiment, access counter(s) may help ensure that memory pages are moved to physical memory of a processor that is accessing pages most frequently, thereby improving efficiency for memory ranges shared between processors.

[0176] In at least one embodiment, one or more of SoC(s) 1104 may include any number of cache(s) 1112, including those described herein. For example, in at least one embodiment, cache(s) 1112 could include a level three ("L3") cache that is available to both CPU(s) 1106 and GPU(s) 1108 (e.g., that is connected to CPU(s) 1106 and GPU(s) 1108). In at least one embodiment, cache(s) 1112 may include a write-back cache that may keep track of states of lines, such as by using a cache coherence protocol (e.g., MEI, MESI, MSI, etc.). In at least one embodiment, an L3 cache may include 4 MB of memory or more, depending on embodiment, although smaller cache sizes may be used.

[0177] In at least one embodiment, one or more of SoC(s) 1104 may include one or more accelerator(s) 1114 (e.g., hardware accelerators, software accelerators, or a combination thereof). In at least one embodiment, SoC(s) 1104 may include a hardware acceleration cluster that may include optimized hardware accelerators and/or large on-chip memory. In at least one embodiment, large on-chip memory (e.g., 4 MB of SRAM) may enable a hardware acceleration cluster to accelerate neural networks and other calculations. In at least one embodiment, a hardware acceleration cluster may be used to complement GPU(s) 1108 and to off-load some of tasks of GPU(s) 1108 (e.g., to free up more cycles of GPU(s) 1108 for performing other tasks). In at least one embodiment, accelerator(s) 1114 could be used for targeted workloads (e.g., perception, convolutional neural networks ("CNNs"), recurrent neural networks ("RNNs"), etc.) that are stable enough to be amenable to acceleration. In at least one embodiment, a CNN may include a region-based or regional convolutional neural networks ("RCNNs") and Fast RCNNs (e.g., as used for object detection) or other type of CNN.

[0178] In at least one embodiment, accelerator(s) 1114 (e.g., hardware acceleration cluster) may include one or more deep learning accelerator ("DLA"). In at least one embodiment, DLA(s) may include, without limitation, one or more Tensor processing units ("TPUs") that may be configured to provide an additional ten trillion operations per second for deep learning applications and inferencing. In at least one embodiment, TPUs may be accelerators configured to, and optimized for, performing image processing functions (e.g., for CNNs, RCNNs, etc.). In at least one embodiment, DLA(s) may further be optimized for a specific set of neural network types and floating point operations, as well as inferencing. In at least one embodiment, design of DLA(s) may provide more performance per millimeter than a typical general-purpose GPU, and typically vastly exceeds performance of a CPU.
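The address translation service flow in paragraph [0174] — a GPU MMU miss triggers a translation request to the CPU, which walks its page tables and returns the virtual-to-physical mapping — can be modeled with a few lines. The dictionary-based page table, page size, and TLB caching here are simplifying assumptions, not the hardware mechanism itself.

```python
# Toy model of the ATS flow in [0174]. Page tables are a {vpn: pfn} dict.
PAGE = 4096  # assumed page size

def cpu_translate(page_tables: dict, virt: int) -> int:
    """CPU-side lookup: map a virtual address via its page table."""
    frame = page_tables[virt // PAGE]  # raises KeyError on a genuine fault
    return frame * PAGE + virt % PAGE

def gpu_access(tlb: dict, page_tables: dict, virt: int) -> int:
    """GPU MMU: use a cached translation if present; on a miss, send a
    translation request to the CPU and cache the returned mapping."""
    vpn = virt // PAGE
    if vpn not in tlb:                 # MMU miss -> ATS request to CPU
        tlb[vpn] = page_tables[vpn]
    return tlb[vpn] * PAGE + virt % PAGE
```

After the first miss the mapping is cached, which is the payoff of the unified virtual address space: subsequent GPU accesses to the same page translate locally.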
In at least one embodiment, TPU(s) may perform several functions, including a single-instance convolution function, supporting, for example, INT8, INT16, and FP16 data types for both features and weights, as well as post-processor functions. In at least one embodiment, DLA(s) may quickly and efficiently execute neural networks, especially CNNs, on processed or unprocessed data for any of a variety of functions, including, for example and without limitation: a CNN for object identification and detection using data from camera sensors; a CNN for distance estimation using data from camera sensors; a CNN for emergency vehicle detection and identification and detection using data from microphones; a CNN for facial recognition and vehicle owner identification using data from camera sensors; and/or a CNN for security and/or safety related events.

[0179] In at least one embodiment, DLA(s) may perform any function of GPU(s) 1108, and by using an inference accelerator, for example, a designer may target either DLA(s) or GPU(s) 1108 for any function. For example, in at least one embodiment, a designer may focus processing of CNNs and floating point operations on DLA(s) and leave other functions to GPU(s) 1108 and/or accelerator(s) 1114.

[0180] In at least one embodiment, accelerator(s) 1114 may include programmable vision accelerator ("PVA"), which may alternatively be referred to herein as a computer vision accelerator. In at least one embodiment, PVA may be designed and configured to accelerate computer vision algorithms for advanced driver assistance system ("ADAS") 1138, autonomous driving, augmented reality ("AR") applications, and/or virtual reality ("VR") applications. In at least one embodiment, PVA may provide a balance between performance and flexibility. For example, in at least one embodiment, each PVA may include, for example and without limitation, any number of reduced instruction set computer ("RISC") cores, direct memory access ("DMA") engines, and/or any number of vector processors.

[0181] In at least one embodiment, RISC cores may interact with image sensors (e.g., image sensors of any cameras described herein), image signal processor(s), etc. In at least one embodiment, each RISC core may include any amount of memory. In at least one embodiment, RISC cores may use any of a number of protocols, depending on embodiment. In at least one embodiment, RISC cores may execute a real-time operating system ("RTOS"). In at least one embodiment, RISC cores may be implemented using one or more integrated circuit devices, application specific integrated circuits ("ASICs"), and/or memory devices. For example, in at least one embodiment, RISC cores could include an instruction cache and/or a tightly coupled RAM.

[0183] In at least one embodiment, vector processors may be programmable processors that may be designed to efficiently and flexibly execute programming for computer vision algorithms and provide signal processing capabilities. In at least one embodiment, a PVA may include a PVA core and two vector processing subsystem partitions. In at least one embodiment, a PVA core may include a processor subsystem, DMA engine(s) (e.g., two DMA engines), and/or other peripherals. In at least one embodiment, a vector processing subsystem may operate as a primary processing engine of a PVA, and may include a vector processing unit ("VPU"), an instruction cache, and/or vector memory (e.g., "VMEM"). In at least one embodiment, a VPU core may include a digital signal processor such as, for example, a single instruction, multiple data ("SIMD"), very long instruction word ("VLIW") digital signal processor. In at least one embodiment, a combination of SIMD and VLIW may enhance throughput and speed.

[0184] In at least one embodiment, each of vector processors may include an instruction cache and may be coupled to dedicated memory. As a result, in at least one embodiment, each of vector processors may be configured to execute independently of other vector processors. In at least one embodiment, vector processors that are included in a particular PVA may be configured to employ data parallelism. For instance, in at least one embodiment, plurality of vector processors included in a single PVA may execute a common computer vision algorithm, but on different regions of an image. In at least one embodiment, vector processors included in a particular PVA may simultaneously execute different computer vision algorithms, on one image, or even execute different algorithms on sequential images or portions of an image. In at least one embodiment, among other things, any number of PVAs may be included in hardware acceleration cluster and any number of vector processors may be included in each PVA. In at least one embodiment, PVA may include additional error correcting code ("ECC") memory, to enhance overall system safety.

[0185] In at least one embodiment, accelerator(s) 1114 may include a computer vision network on-chip and static random-access memory ("SRAM"), for providing a high-bandwidth, low latency SRAM for accelerator(s) 1114. In at least one embodiment, on-chip memory may include at least 4 MB SRAM, comprising, for example and without limitation, eight field-configurable memory blocks, that may be accessible by both a PVA and a DLA. In at least one embodiment, each pair of memory blocks may include an advanced peripheral bus ("APB") interface, configuration circuitry, a controller, and a multiplexer. In at least one embodiment, any type of memory may be used. In at least one embodiment, a PVA and a DLA may access memory via a backbone that provides a PVA and a DLA with high-speed access to memory.
In at least one embodiment, a backbone [ 0182 ] In at least one embodiment, DMA may enable may include a computer vision network on - chip that inter components of PVA to access system memory independently connects a PVA and a DLA to memory ( e.g. , using APB ) . of CPU ( s ) 1106. In at least one embodiment, DMA may [ 0186 ] In at least one embodiment, a computer vision support any number of features used to provide optimization network on - chip may include an interface that determines , to a PVA including, but not limited to , supporting multi before transmission of any control signal/ address /data , that dimensional addressing and / or circular addressing . In at both a PVA and a DLA provide ready and valid signals. In least one embodiment, DMA may support up to six or more at least one embodiment, an interface may provide for dimensions of addressing , which may include , without limi separate phases and separate channels for transmitting con tation , block width , block height, block depth , horizontal trol signals / addresses / data , as well as burst -type communi block stepping , vertical block stepping, and / or depth step cations for continuous data transfer . In at least one embodi ping . ment, an interface may comply with International US 2021/0081752 A1 Mar. 18 , 2021 16
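The ready/valid check described above for the computer vision network on-chip can be sketched in software. This is a minimal illustration only; the dictionary-based signal model and the function name `may_transfer` are assumptions made for the sketch, not details taken from the disclosure.

```python
# Illustrative sketch: a control/address/data transfer is permitted only
# when both agents (a PVA and a DLA) assert their ready and valid signals,
# mirroring the pre-transmission check described above.

def may_transfer(pva, dla):
    """Return True only if both agents assert ready and valid."""
    return all([pva["ready"], pva["valid"], dla["ready"], dla["valid"]])

pva = {"ready": True, "valid": True}
dla = {"ready": True, "valid": False}  # DLA has no valid data yet
print(may_transfer(pva, dla))          # no transfer this cycle
```

In real interconnects this gating is done per channel and per clock cycle; the sketch collapses that to a single boolean check.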

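The data parallelism described above, in which multiple vector processors of a single PVA execute a common computer vision algorithm on different regions of one image, can be sketched as follows. This is a hedged illustration in Python; the band partitioning, the function names, and the per-pixel kernel are assumptions for the example, not details of any actual PVA.

```python
# Illustrative sketch: partition an image (a list of pixel rows) into one
# contiguous band per "processor", apply the same kernel to every band,
# then reassemble the processed bands into a full image.

def split_bands(image, num_procs):
    """Partition rows into num_procs contiguous bands (last may be short)."""
    n = len(image)
    step = (n + num_procs - 1) // num_procs  # ceiling division
    return [image[i:i + step] for i in range(0, n, step)]

def run_kernel_per_band(image, num_procs, kernel):
    """Each simulated processor applies the same kernel to its own band."""
    results = []
    for band in split_bands(image, num_procs):
        results.append([[kernel(px) for px in row] for row in band])
    # Concatenate processed bands back into one image.
    return [row for band in results for row in band]

img = [[0, 10], [20, 30], [40, 50], [60, 70]]
out = run_kernel_per_band(img, 2, lambda px: px // 10)
print(out)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

On real hardware the bands would be processed concurrently and the kernel would be vectorized SIMD/VLIW code; the serial loop here only shows the data decomposition.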
[0187] In at least one embodiment, one or more of SoC(s) 1104 may include a real-time ray-tracing hardware accelerator. In at least one embodiment, a real-time ray-tracing hardware accelerator may be used to quickly and efficiently determine positions and extents of objects (e.g., within a world model), to generate real-time visualization simulations, for RADAR signal interpretation, for sound propagation synthesis and/or analysis, for simulation of SONAR systems, for general wave propagation simulation, for comparison to LIDAR data for purposes of localization and/or other functions, and/or for other uses.

[0188] In at least one embodiment, accelerator(s) 1114 can have a wide array of uses for autonomous driving. In at least one embodiment, a PVA may be used for key processing stages in ADAS and autonomous vehicles. In at least one embodiment, a PVA's capabilities are a good match for algorithmic domains needing predictable processing, at low power and low latency. In other words, a PVA performs well on semi-dense or dense regular computation, even on small data sets, which might require predictable run-times with low latency and low power. In at least one embodiment, such as in vehicle 1100, PVAs might be designed to run classic computer vision algorithms, as they can be efficient at object detection and operating on integer math.

[0189] For example, according to at least one embodiment of technology, a PVA is used to perform computer stereo vision. In at least one embodiment, a semi-global matching-based algorithm may be used in some examples, although this is not intended to be limiting. In at least one embodiment, applications for Level 3-5 autonomous driving use motion estimation/stereo matching on-the-fly (e.g., structure from motion, pedestrian recognition, lane detection, etc.). In at least one embodiment, a PVA may perform computer stereo vision functions on inputs from two monocular cameras.

[0190] In at least one embodiment, a PVA may be used to perform dense optical flow. For example, in at least one embodiment, a PVA could process raw RADAR data (e.g., using a 4D Fast Fourier Transform) to provide processed RADAR data. In at least one embodiment, a PVA is used for time of flight depth processing, by processing raw time of flight data to provide processed time of flight data, for example.

[0191] In at least one embodiment, a DLA may be used to run any type of network to enhance control and driving safety, including for example and without limitation, a neural network that outputs a measure of confidence for each object detection. In at least one embodiment, confidence may be represented or interpreted as a probability, or as providing a relative "weight" of each detection compared to other detections. In at least one embodiment, a confidence measure enables a system to make further decisions regarding which detections should be considered as true positive detections rather than false positive detections. In at least one embodiment, a system may set a threshold value for confidence and consider only detections exceeding the threshold value as true positive detections. In an embodiment in which an automatic emergency braking ("AEB") system is used, false positive detections would cause the vehicle to automatically perform emergency braking, which is obviously undesirable. In at least one embodiment, highly confident detections may be considered as triggers for AEB. In at least one embodiment, a DLA may run a neural network for regressing confidence value. In at least one embodiment, the neural network may take as its input at least some subset of parameters, such as bounding box dimensions, ground plane estimate obtained (e.g., from another subsystem), output from IMU sensor(s) 1166 that correlates with vehicle 1100 orientation, distance, 3D location estimates of object obtained from the neural network and/or other sensors (e.g., LIDAR sensor(s) 1164 or RADAR sensor(s) 1160), among others.

[0192] In at least one embodiment, one or more of SoC(s) 1104 may include data store(s) 1116 (e.g., memory). In at least one embodiment, data store(s) 1116 may be on-chip memory of SoC(s) 1104, which may store neural networks to be executed on GPU(s) 1108 and/or a DLA. In at least one embodiment, data store(s) 1116 may be large enough in capacity to store multiple instances of neural networks for redundancy and safety. In at least one embodiment, data store(s) 1116 may comprise L2 or L3 cache(s).

[0193] In at least one embodiment, one or more of SoC(s) 1104 may include any number of processor(s) 1110 (e.g., embedded processors). In at least one embodiment, processor(s) 1110 may include a boot and power management processor that may be a dedicated processor and subsystem to handle boot power and management functions and related security enforcement. In at least one embodiment, a boot and power management processor may be a part of a boot sequence of SoC(s) 1104 and may provide runtime power management services. In at least one embodiment, a boot and power management processor may provide clock and voltage programming, assistance in system low power state transitions, management of SoC(s) 1104 thermals and temperature sensors, and/or management of SoC(s) 1104 power states. In at least one embodiment, each temperature sensor may be implemented as a ring-oscillator whose output frequency is proportional to temperature, and SoC(s) 1104 may use ring-oscillators to detect temperatures of CPU(s) 1106, GPU(s) 1108, and/or accelerator(s) 1114. In at least one embodiment, if temperatures are determined to exceed a threshold, then a boot and power management processor may enter a temperature fault routine and put SoC(s) 1104 into a lower power state and/or put vehicle 1100 into a chauffeur to safe stop mode (e.g., bring vehicle 1100 to a safe stop).

[0194] In at least one embodiment, processor(s) 1110 may further include a set of embedded processors that may serve as an audio processing engine which may be an audio subsystem that enables full hardware support for multi-channel audio over multiple interfaces, and a broad and flexible range of audio I/O interfaces. In at least one embodiment, an audio processing engine is a dedicated processor core with a digital signal processor with dedicated RAM.

[0195] In at least one embodiment, processor(s) 1110 may further include an always-on processor engine that may provide necessary hardware features to support low power sensor management and wake use cases. In at least one embodiment, an always-on processor engine may include, without limitation, a processor core, a tightly coupled RAM, supporting peripherals (e.g., timers and interrupt controllers), various I/O controller peripherals, and routing logic.

[0196] In at least one embodiment, processor(s) 1110 may further include a safety cluster engine that includes, without limitation, a dedicated processor subsystem to handle safety management for automotive applications. In at least one embodiment, a safety cluster engine may include, without limitation, two or more processor cores, a tightly coupled RAM, support peripherals (e.g., timers, an interrupt controller, etc.), and/or routing logic. In a safety mode, two or more cores may operate, in at least one embodiment, in a lockstep mode and function as a single core with comparison logic to detect any differences between their operations. In at least one embodiment, processor(s) 1110 may further include a real-time camera engine that may include, without limitation, a dedicated processor subsystem for handling real-time camera management. In at least one embodiment, processor(s) 1110 may further include a high-dynamic range signal processor that may include, without limitation, an image signal processor that is a hardware engine that is part of a camera processing pipeline.

[0197] In at least one embodiment, processor(s) 1110 may include a video image compositor that may be a processing block (e.g., implemented on a microprocessor) that implements video post-processing functions needed by a video playback application to produce a final image for a player window. In at least one embodiment, a video image compositor may perform lens distortion correction on wide-view camera(s) 1170, surround camera(s) 1174, and/or on in-cabin monitoring camera sensor(s). In at least one embodiment, in-cabin monitoring camera sensor(s) are preferably monitored by a neural network running on another instance of SoC 1104, configured to identify in-cabin events and respond accordingly. In at least one embodiment, an in-cabin system may perform, without limitation, lip reading to activate cellular service and place a phone call, dictate emails, change a vehicle's destination, activate or change a vehicle's infotainment system and settings, or provide voice-activated web surfing. In at least one embodiment, certain functions are available to a driver when a vehicle is operating in an autonomous mode and are disabled otherwise.

[0198] In at least one embodiment, a video image compositor may include enhanced temporal noise reduction for both spatial and temporal noise reduction. For example, in at least one embodiment, where motion occurs in a video, noise reduction weights spatial information appropriately, decreasing weights of information provided by adjacent frames. In at least one embodiment, where an image or portion of an image does not include motion, temporal noise reduction performed by the video image compositor may use information from a previous image to reduce noise in a current image.

[0199] In at least one embodiment, a video image compositor may also be configured to perform stereo rectification on input stereo lens frames. In at least one embodiment, a video image compositor may further be used for user interface composition when an operating system desktop is in use, and GPU(s) 1108 are not required to continuously render new surfaces. In at least one embodiment, when GPU(s) 1108 are powered on and active doing 3D rendering, a video image compositor may be used to offload GPU(s) 1108 to improve performance and responsiveness.

[0200] In at least one embodiment, one or more SoC of SoC(s) 1104 may further include a mobile industry processor interface ("MIPI") camera serial interface for receiving video and input from cameras, a high-speed interface, and/or a video input block that may be used for a camera and related pixel input functions. In at least one embodiment, one or more of SoC(s) 1104 may further include input/output controller(s) that may be controlled by software and may be used for receiving I/O signals that are uncommitted to a specific role.

[0201] In at least one embodiment, one or more SoC of SoC(s) 1104 may further include a broad range of peripheral interfaces to enable communication with peripherals, audio encoders/decoders ("codecs"), power management, and/or other devices. In at least one embodiment, SoC(s) 1104 may be used to process data from cameras (e.g., connected over Gigabit Multimedia Serial Link and Ethernet channels), sensors (e.g., LIDAR sensor(s) 1164, RADAR sensor(s) 1160, etc. that may be connected over Ethernet channels), data from bus 1102 (e.g., speed of vehicle 1100, steering wheel position, etc.), data from GNSS sensor(s) 1158 (e.g., connected over an Ethernet bus or a CAN bus), etc. In at least one embodiment, one or more SoC of SoC(s) 1104 may further include dedicated high-performance mass storage controllers that may include their own DMA engines, and that may be used to free CPU(s) 1106 from routine data management tasks.

[0202] In at least one embodiment, SoC(s) 1104 may be an end-to-end platform with a flexible architecture that spans automation Levels 3-5, thereby providing a comprehensive functional safety architecture that leverages and makes efficient use of computer vision and ADAS techniques for diversity and redundancy, and provides a platform for a flexible, reliable driving software stack, along with deep learning tools. In at least one embodiment, SoC(s) 1104 may be faster, more reliable, and even more energy-efficient and space-efficient than conventional systems. For example, in at least one embodiment, accelerator(s) 1114, when combined with CPU(s) 1106, GPU(s) 1108, and data store(s) 1116, may provide for a fast, efficient platform for Level 3-5 autonomous vehicles.

[0203] In at least one embodiment, computer vision algorithms may be executed on CPUs, which may be configured using a high-level programming language, such as C, to execute a wide variety of processing algorithms across a wide variety of visual data. However, in at least one embodiment, CPUs are oftentimes unable to meet performance requirements of many computer vision applications, such as those related to execution time and power consumption, for example. In at least one embodiment, many CPUs are unable to execute complex object detection algorithms in real-time, which is used in in-vehicle ADAS applications and in practical Level 3-5 autonomous vehicles.

[0204] Embodiments described herein allow for multiple neural networks to be performed simultaneously and/or sequentially, and for results to be combined together to enable Level 3-5 autonomous driving functionality. For example, in at least one embodiment, a CNN executing on a DLA or a discrete GPU (e.g., GPU(s) 1120) may include text and word recognition, allowing reading and understanding of traffic signs, including signs for which a neural network has not been specifically trained. In at least one embodiment, a DLA may further include a neural network that is able to identify, interpret, and provide semantic understanding of a sign, and to pass that semantic understanding to path planning modules running on a CPU Complex.

[0205] In at least one embodiment, multiple neural networks may be run simultaneously, as for Level 3, 4, or 5 driving. For example, in at least one embodiment, a warning sign stating "Caution: flashing lights indicate icy conditions," along with an electric light, may be independently or collectively interpreted by several neural networks. In at least one embodiment, such warning sign itself may be identified as a traffic sign by a first deployed neural network (e.g., a neural network that has been trained), text "flashing lights indicate icy conditions" may be interpreted by a second deployed neural network, which informs a vehicle's path planning software (preferably executing on a CPU Complex) that when flashing lights are detected, icy conditions exist. In at least one embodiment, a flashing light may be identified by operating a third deployed neural network over multiple frames, informing a vehicle's path-planning software of a presence (or an absence) of flashing lights.
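The warning-sign example above, in which several independently run networks each interpret part of one scene and their results are combined for path planning, can be sketched as follows. The three functions are trivial placeholders standing in for trained neural networks; all names, return values, and the fusion logic are illustrative assumptions, not the disclosed method.

```python
# Illustrative sketch: three placeholder "networks" interpret a scene,
# and a fusion step combines their outputs into a path-planning hint.

def classify_sign(image):
    """First network: detect that a traffic sign is present (stubbed)."""
    return "traffic_sign"

def interpret_text(image):
    """Second network: read the sign's text (stubbed)."""
    return "flashing lights indicate icy conditions"

def detect_flashing_light(frames):
    """Third network: a light seen in some frames but not all is flashing."""
    return any(frames) and not all(frames)

def combine(image, frames):
    """Fuse per-network outputs into one hint for path planning."""
    hint = {
        "sign": classify_sign(image),
        "text": interpret_text(image),
        "flashing": detect_flashing_light(frames),
    }
    # Icy conditions are inferred only when the text warns about icy
    # conditions AND the light is actually flashing.
    hint["icy_conditions"] = hint["flashing"] and "icy" in hint["text"]
    return hint

print(combine(None, [True, False, True]))  # light alternates -> icy warning active
```

A real system would run the three networks concurrently on a DLA and/or GPU and fuse calibrated confidences rather than booleans; the sketch only shows the combine-then-decide structure.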
In at least one embodiment, to over multiple frames, informing a vehicle's path - planning communicate with other vehicles , a direct link may be software of a presence ( or an absence ) of flashing lights. In established between vehicle 110 and another vehicle and / or at least one embodiment, all three neural networks may run an indirect link may be established ( e.g. , across networks simultaneously, such as within a DLA and / or on GPU ( S ) and over the Internet ). In at least one embodiment, direct 1108 . links may be provided using a vehicle - to - vehicle commu [ 0206 ] In at least one embodiment, a CNN for facial nication link . In at least one embodiment, a vehicle - to recognition and vehicle owner identification may use data vehicle communication link may provide vehicle 1100 infor from camera sensors to identify presence of an authorized mation about vehicles in proximity to vehicle 1100 ( e.g. , driver and /or owner of vehicle 1100. In at least one embodi vehicles in front of, on a side of, and / or behind vehicle ment, an always - on sensor processing engine may be used to 1100 ) . In at least one embodiment, such aforementioned unlock a vehicle when an owner approaches a driver door functionality may be part of a cooperative adaptive cruise and turns on lights, and , in a security mode , to disable such control functionality of vehicle 1100 . vehicle when an owner leaves such vehicle . In this way, [ 0211 ] In at least one embodiment, network interface 1124 SoC ( s ) 1104 provide for security against theft and / or car may include an SoC that provides modulation and demodu jacking lation functionality and enables controller ( s ) 1136 to com [ 0207 ] In at least one embodiment, a CNN for emergency municate over wireless networks . 
In at least one embodi vehicle detection and identification may use data from ment, network interface 1124 may include a radio frequency microphones 1196 to detect and identify emergency vehicle front - end for up - conversion from baseband to radio fre sirens. In at least one embodiment, SoC ( s ) 1104 use a CNN quency, and down conversion from radio frequency to for classifying environmental and urban sounds , as well as baseband . In at least one embodiment, frequency conver classifying visual data . In at least one embodiment, a CNN sions may be performed in any technically feasible fashion . running on a DLA is trained to identify a relative closing For example , frequency conversions could be performed speed of an emergency vehicle ( e.g. , by using a Doppler through well -known processes, and / or using super- hetero effect ). In at least one embodiment, a CNN may also be dyne processes. In at least one embodiment, radio frequency trained to identify emergency vehicles specific to a local area front end functionality may be provided by a separate chip . in which a vehicle is operating, as identified by GNSS In at least one embodiment, network interfaces may include sensor (s ) 1158. In at least one embodiment, when operating wireless functionality for communicating over LTE , in Europe , a CNN will seek to detect European sirens, and WCDMA , UMTS , GSM , CDMA2000 , Bluetooth , Blu when in North America , a CNN will seek to identify only etooth LE , Wi -Fi , Z - Wave, ZigBee , LoRaWAN , and / or other North American sirens . In at least one embodiment, once an wireless protocols. emergency vehicle is detected , a control program may be [ 0212 ] In at least one embodiment, vehicle 1100 may used to execute an emergency vehicle safety routine , slow further include data store ( s ) 1128 which may include, with ing a vehicle , pulling over to a side of a road, parking a out limitation , off -chip ( e.g. , off SOC ( s ) 1104 ) storage . 
In at vehicle, and / or idling a vehicle , with assistance of ultrasonic least one embodiment, data store ( s ) 1128 may include , sensor (s ) 1162 , until emergency vehicles pass . without limitation , one or more storage elements including [ 0208 ] In at least one embodiment, vehicle 1100 may RAM , SRAM , dynamic random - access memory include CPU ( S) 1118 ( e.g. , discrete CPU ( s ) , or dCPU (s )) , ( “ DRAM ” ), video random - access memory ( “ VRAM ” ), flash that may be coupled to SoC ( s ) 1104 via a high - speed memory , hard disks , and / or other components and / or devices interconnect ( e.g. , PCIe ) . In at least one embodiment, CPU that may store at least one bit of data . ( s ) 1118 may include an X86 processor, for example. CPU ( S ) [ 0213 ] In at least one embodiment, vehicle 1100 may 1118 may be used to perform any of a variety of functions, further include GNSS sensor ( s ) 1158 ( e.g. , GPS and / or including arbitrating potentially inconsistent results between assisted GPS sensors ) , to assist in mapping , perception, ADAS sensors and SoC ( s ) 1104 , and / or monitoring status occupancy grid generation , and / or path planning functions. and health of controller ( s ) 1136 and / or an infotainment In at least one embodiment, any number of GNSS sensor( s ) system on a chip ( “ infotainment SoC ” ) 1130 , for example . 1158 may be used , including , for example and without [ 0209 ] In at least one embodiment, vehicle 1100 may limitation , a GPS using a USB connector with an Ethernet include GPU ( s ) 1120 ( e.g. , discrete GPU ( s ) , or dGPU ( s ) ), to - Serial ( e.g. , RS - 232 ) bridge . that may be coupled to SoC ( s ) 1104 via a high - speed [ 0214 ] In at least one embodiment, vehicle 1100 may interconnect ( e.g. , NVIDIA's NVLINK channel ). In at least further include RADAR sensor ( s ) 1160. 
In at least one one embodiment, GPU ( S) 1120 may provide additional embodiment, RADAR sensor ( s ) 1160 may be used by artificial intelligence functionality , such as by executing vehicle 1100 for long - range vehicle detection , even in dark redundant and / or different neural networks, and may be used ness and /or severe weather conditions . In at least one US 2021/0081752 A1 Mar. 18 , 2021 19 embodiment, RADAR functional safety levels may be ASIL sensor ( s ) 1164 may operate at functional safety level ASIL B. In at least one embodiment, RADAR sensor ( s ) 1160 may B. In at least one embodiment, vehicle 1100 may include use a CAN bus and /or bus 1102 ( e.g. , to transmit data multiple LIDAR sensors 1164 ( e.g. , two, four, six , etc.) that generated by RADAR sensor ( s ) 1160 ) for control and to may use an Ethernet channel ( e.g. , to provide data to a access object tracking data , with access to Ethernet channels Gigabit Ethernet switch ) . to access raw data in some examples. In at least one [ 0219 ] In at least one embodiment, LIDAR sensor ( s ) 1164 embodiment, a wide variety of RADAR sensor types may be may be capable of providing a list of objects and their used . For example, and without limitation , RADAR sensor distances for a 360 - degree field of view . In at least one ( s ) 1160 may be suitable for front, rear, and side RADAR embodiment, commercially available LIDAR sensor ( s ) 1164 use . In at least one embodiment, one or more sensor of may have an advertised range of approximately 100 m , with RADAR sensors ( s ) 1160 is a Pulse Doppler RADAR sensor . an accuracy of 2 cm to 3 cm , and with support for a 100 [ 0215 ] In at least one embodiment, RADAR sensor (s ) Mbps Ethernet connection , for example. In at least one 1160 may include different configurations, such as long embodiment, one or more non - protruding LIDAR sensors range with narrow field of view , short - range with wide field may be used . 
of view, short-range side coverage, etc. In at least one embodiment, long-range RADAR may be used for adaptive cruise control functionality. In at least one embodiment, long-range RADAR systems may provide a broad field of view realized by two or more independent scans, such as within a 250 m (meter) range. In at least one embodiment, RADAR sensor(s) 1160 may help in distinguishing between static and moving objects, and may be used by ADAS system 1138 for emergency brake assist and forward collision warning. In at least one embodiment, sensor(s) 1160 included in a long-range RADAR system may include, without limitation, monostatic multimodal RADAR with multiple (e.g., six or more) fixed RADAR antennae and a high-speed CAN and FlexRay interface. In at least one embodiment, with six antennae, a central four antennae may create a focused beam pattern, designed to record surroundings of vehicle 1100 at higher speeds with minimal interference from traffic in adjacent lanes. In at least one embodiment, another two antennae may expand field of view, making it possible to quickly detect vehicles entering or leaving a lane of vehicle 1100.

[0216] In at least one embodiment, mid-range RADAR systems may include, as an example, a range of up to 160 m (front) or 80 m (rear), and a field of view of up to 42 degrees (front) or 150 degrees (rear). In at least one embodiment, short-range RADAR systems may include, without limitation, any number of RADAR sensor(s) 1160 designed to be installed at both ends of a rear bumper. When installed at both ends of a rear bumper, in at least one embodiment, a RADAR sensor system may create two beams that constantly monitor blind spots in a rear direction and next to a vehicle. In at least one embodiment, short-range RADAR systems may be used in ADAS system 1138 for blind spot detection and/or lane change assist.

[0217] In at least one embodiment, vehicle 1100 may further include ultrasonic sensor(s) 1162. In at least one embodiment, ultrasonic sensor(s) 1162, which may be positioned at a front, a back, and/or side location of vehicle 1100, may be used for parking assist and/or to create and update an occupancy grid. In at least one embodiment, a wide variety of ultrasonic sensor(s) 1162 may be used, and different ultrasonic sensor(s) 1162 may be used for different ranges of detection (e.g., 2.5 m, 4 m). In at least one embodiment, ultrasonic sensor(s) 1162 may operate at functional safety levels of ASIL B.

[0218] In at least one embodiment, vehicle 1100 may include LIDAR sensor(s) 1164. In at least one embodiment, LIDAR sensor(s) 1164 may be used for object and pedestrian detection, emergency braking, collision avoidance, and/or other functions.

In such an embodiment, LIDAR sensor(s) 1164 may include a small device that may be embedded into a front, a rear, a side, and/or a corner location of vehicle 1100. In at least one embodiment, LIDAR sensor(s) 1164, in such an embodiment, may provide up to a 120-degree horizontal and 35-degree vertical field-of-view, with a 200 m range even for low-reflectivity objects. In at least one embodiment, front-mounted LIDAR sensor(s) 1164 may be configured for a horizontal field of view between 45 degrees and 135 degrees.

[0220] In at least one embodiment, LIDAR technologies, such as 3D flash LIDAR, may also be used. In at least one embodiment, 3D flash LIDAR uses a flash of a laser as a transmission source to illuminate surroundings of vehicle 1100 up to approximately 200 m. In at least one embodiment, a flash LIDAR unit includes, without limitation, a receptor, which records laser pulse transit time and reflected light on each pixel, which in turn corresponds to a range from vehicle 1100 to objects. In at least one embodiment, flash LIDAR may allow for highly accurate and distortion-free images of surroundings to be generated with every laser flash. In at least one embodiment, four flash LIDAR sensors may be deployed, one at each side of vehicle 1100. In at least one embodiment, 3D flash LIDAR systems include, without limitation, a solid-state 3D staring-array LIDAR camera with no moving parts other than a fan (e.g., a non-scanning LIDAR device). In at least one embodiment, a flash LIDAR device may use a 5 nanosecond class I (eye-safe) laser pulse per frame and may capture reflected laser light as a 3D range point cloud and co-registered intensity data.

[0221] In at least one embodiment, vehicle 1100 may further include IMU sensor(s) 1166. In at least one embodiment, IMU sensor(s) 1166 may be located at a center of a rear axle of vehicle 1100. In at least one embodiment, IMU sensor(s) 1166 may include, for example and without limitation, accelerometer(s), magnetometer(s), gyroscope(s), magnetic compass(es), and/or other sensor types. In at least one embodiment, such as in six-axis applications, IMU sensor(s) 1166 may include, without limitation, accelerometers and gyroscopes. In at least one embodiment, such as in nine-axis applications, IMU sensor(s) 1166 may include, without limitation, accelerometers, gyroscopes, and magnetometers.

[0222] In at least one embodiment, IMU sensor(s) 1166 may be implemented as a miniature, high-performance GPS-Aided Inertial Navigation System ("GPS/INS") that combines micro-electro-mechanical systems ("MEMS") inertial sensors, a high-sensitivity GPS receiver, and advanced Kalman filtering algorithms to provide estimates of position, velocity, and attitude. In at least one embodiment, IMU sensor(s) 1166 may enable vehicle 1100 to estimate its heading without requiring input from a magnetic sensor by directly observing and correlating changes in velocity from GPS to IMU sensor(s) 1166. In at least one embodiment, IMU sensor(s) 1166 and GNSS sensor(s) 1158 may be combined in a single integrated unit.
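The flash LIDAR receptor described above recovers a per-pixel range from laser pulse transit time. A minimal sketch of that conversion follows; the function names are illustrative assumptions and not part of this application:

```python
# Sketch: convert per-pixel laser pulse transit times into ranges,
# as a flash LIDAR receptor would. The pulse travels out and back,
# so range = c * t / 2. Names here are illustrative assumptions.
C = 299_792_458.0  # speed of light, m/s

def transit_time_to_range(transit_time_s: float) -> float:
    """Round-trip transit time (seconds) -> range to object (meters)."""
    return C * transit_time_s / 2.0

def frame_to_point_ranges(transit_times):
    """Apply the conversion to every pixel of one captured frame."""
    return [[transit_time_to_range(t) for t in row] for row in transit_times]

# A target at 200 m (the approximate illumination limit noted above)
# returns the pulse after roughly 1.33 microseconds:
t = 2 * 200.0 / C
print(round(transit_time_to_range(t)))  # 200
```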
[0223] In at least one embodiment, vehicle 1100 may include microphone(s) 1196 placed in and/or around vehicle 1100. In at least one embodiment, microphone(s) 1196 may be used for emergency vehicle detection and identification, among other things.

[0224] In at least one embodiment, vehicle 1100 may further include any number of camera types, including stereo camera(s) 1168, wide-view camera(s) 1170, infrared camera(s) 1172, surround camera(s) 1174, long-range camera(s) 1198, mid-range camera(s) 1176, and/or other camera types. In at least one embodiment, cameras may be used to capture image data around an entire periphery of vehicle 1100. In at least one embodiment, which types of cameras are used depends on vehicle 1100. In at least one embodiment, any combination of camera types may be used to provide necessary coverage around vehicle 1100. In at least one embodiment, a number of cameras deployed may differ depending on embodiment. For example, in at least one embodiment, vehicle 1100 could include six cameras, seven cameras, ten cameras, twelve cameras, or another number of cameras. In at least one embodiment, cameras may support, as an example and without limitation, Gigabit Multimedia Serial Link ("GMSL") and/or Gigabit Ethernet communications. In at least one embodiment, each camera might be as described with more detail previously herein with respect to FIG. 11A and FIG. 11B.

[0225] In at least one embodiment, vehicle 1100 may further include vibration sensor(s) 1142. In at least one embodiment, vibration sensor(s) 1142 may measure vibrations of components of vehicle 1100, such as axle(s). For example, in at least one embodiment, changes in vibrations may indicate a change in road surfaces. In at least one embodiment, when two or more vibration sensors 1142 are used, differences between vibrations may be used to determine friction or slippage of a road surface (e.g., when a difference in vibration is between a power-driven axle and a freely rotating axle).

[0226] In at least one embodiment, vehicle 1100 may include ADAS system 1138. In at least one embodiment, ADAS system 1138 may include, without limitation, an SoC, in some examples. In at least one embodiment, ADAS system 1138 may include, without limitation, any number and combination of an autonomous/adaptive/automatic cruise control ("ACC") system, a cooperative adaptive cruise control ("CACC") system, a forward crash warning ("FCW") system, an automatic emergency braking ("AEB") system, a lane departure warning ("LDW") system, a lane keep assist ("LKA") system, a blind spot warning ("BSW") system, a rear cross-traffic warning ("RCTW") system, a collision warning ("CW") system, a lane centering ("LC") system, and/or other systems, features, and/or functionality.

[0227] In at least one embodiment, ACC system may use RADAR sensor(s) 1160, LIDAR sensor(s) 1164, and/or any number of camera(s). In at least one embodiment, ACC system may include a longitudinal ACC system and/or a lateral ACC system. In at least one embodiment, a longitudinal ACC system monitors and controls distance to another vehicle immediately ahead of vehicle 1100 and automatically adjusts speed of vehicle 1100 to maintain a safe distance from vehicles ahead. In at least one embodiment, a lateral ACC system performs distance keeping, and advises vehicle 1100 to change lanes when necessary. In at least one embodiment, a lateral ACC is related to other ADAS applications, such as LC and CW.

[0228] In at least one embodiment, a CACC system uses information from other vehicles that may be received via network interface 1124 and/or wireless antenna(s) 1126 from other vehicles via a wireless link, or indirectly, over a network connection (e.g., over the Internet). In at least one embodiment, direct links may be provided by a vehicle-to-vehicle ("V2V") communication link, while indirect links may be provided by an infrastructure-to-vehicle ("I2V") communication link. In general, V2V communication provides information about immediately preceding vehicles (e.g., vehicles immediately ahead of and in a same lane as vehicle 1100), while I2V communication provides information about traffic further ahead. In at least one embodiment, a CACC system may include either or both I2V and V2V information sources. In at least one embodiment, given information of vehicles ahead of vehicle 1100, a CACC system may be more reliable, and it has potential to improve traffic flow smoothness and reduce congestion on a road.

[0229] In at least one embodiment, an FCW system is designed to alert a driver to a hazard, so that such driver may take corrective action. In at least one embodiment, an FCW system uses a front-facing camera and/or RADAR sensor(s) 1160, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to provide driver feedback, such as a display, speaker, and/or vibrating component. In at least one embodiment, an FCW system may provide a warning, such as in a form of a sound, visual warning, vibration, and/or a quick brake pulse.

[0230] In at least one embodiment, an AEB system detects an impending forward collision with another vehicle or other object, and may automatically apply brakes if a driver does not take corrective action within a specified time or distance parameter. In at least one embodiment, AEB system may use front-facing camera(s) and/or RADAR sensor(s) 1160, coupled to a dedicated processor, DSP, FPGA, and/or ASIC. In at least one embodiment, when an AEB system detects a hazard, it will typically first alert a driver to take corrective action to avoid collision and, if that driver does not take corrective action, that AEB system may automatically apply brakes in an effort to prevent, or at least mitigate, an impact of a predicted collision. In at least one embodiment, an AEB system may include techniques such as dynamic brake support and/or crash imminent braking.

[0231] In at least one embodiment, an LDW system provides visual, audible, and/or tactile warnings, such as steering wheel or seat vibrations, to alert a driver when vehicle 1100 crosses lane markings. In at least one embodiment, an LDW system does not activate when a driver indicates an intentional lane departure, such as by activating a turn signal. In at least one embodiment, an LDW system may use front-side facing cameras, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to provide driver feedback, such as a display, speaker, and/or vibrating component. In at least one embodiment, an LKA system is a variation of an LDW system. In at least one embodiment, an LKA system provides steering input or braking to correct vehicle 1100 if vehicle 1100 starts to exit its lane.
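The FCW/AEB escalation described above (warn a driver first, then automatically brake if that driver does not take corrective action within a time parameter) can be sketched with a time-to-collision rule. The thresholds and names below are illustrative assumptions, not values disclosed in this application:

```python
# Sketch: escalate from a forward-collision warning to automatic
# emergency braking using time-to-collision (TTC). The thresholds
# are illustrative assumptions, not values from the application.
WARN_TTC_S = 2.5   # alert the driver below this TTC
BRAKE_TTC_S = 1.0  # apply brakes below this TTC if the driver has not reacted

def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """Seconds until impact at the current closing speed (inf if opening)."""
    if closing_speed_mps <= 0.0:
        return float("inf")
    return gap_m / closing_speed_mps

def fcw_aeb_action(gap_m: float, closing_speed_mps: float,
                   driver_reacted: bool) -> str:
    ttc = time_to_collision(gap_m, closing_speed_mps)
    if ttc < BRAKE_TTC_S and not driver_reacted:
        return "apply_brakes"   # AEB: driver did not take corrective action
    if ttc < WARN_TTC_S:
        return "warn_driver"    # FCW: sound, visual, and/or vibration warning
    return "no_action"

print(fcw_aeb_action(20.0, 10.0, False))  # warn_driver (TTC = 2.0 s)
print(fcw_aeb_action(8.0, 10.0, False))   # apply_brakes (TTC = 0.8 s)
```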

[0232] In at least one embodiment, a BSW system detects and warns a driver of vehicles in an automobile's blind spot. In at least one embodiment, a BSW system may provide a visual, audible, and/or tactile alert to indicate that merging or changing lanes is unsafe. In at least one embodiment, a BSW system may provide an additional warning when a driver uses a turn signal. In at least one embodiment, a BSW system may use rear-side facing camera(s) and/or RADAR sensor(s) 1160, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to driver feedback, such as a display, speaker, and/or vibrating component.

[0233] In at least one embodiment, an RCTW system may provide visual, audible, and/or tactile notification when an object is detected outside a rear-camera range when vehicle 1100 is backing up. In at least one embodiment, an RCTW system includes an AEB system to ensure that vehicle brakes are applied to avoid a crash. In at least one embodiment, an RCTW system may use one or more rear-facing RADAR sensor(s) 1160, coupled to a dedicated processor, DSP, FPGA, and/or ASIC, that is electrically coupled to provide driver feedback, such as a display, speaker, and/or vibrating component.

[0234] In at least one embodiment, conventional ADAS systems may be prone to false positive results, which may be annoying and distracting to a driver, but typically are not catastrophic, because conventional ADAS systems alert a driver and allow that driver to decide whether a safety condition truly exists and act accordingly. In at least one embodiment, vehicle 1100 itself decides, in case of conflicting results, whether to heed a result from a primary computer or a secondary computer (e.g., a first controller or a second controller of controllers 1136). For example, in at least one embodiment, ADAS system 1138 may be a backup and/or secondary computer for providing perception information to a backup computer rationality module. In at least one embodiment, a backup computer rationality monitor may run redundant diverse software on hardware components to detect faults in perception and dynamic driving tasks. In at least one embodiment, outputs from ADAS system 1138 may be provided to a supervisory MCU. In at least one embodiment, if outputs from a primary computer and outputs from a secondary computer conflict, a supervisory MCU determines how to reconcile the conflict to ensure safe operation.

[0235] In at least one embodiment, a primary computer may be configured to provide a supervisory MCU with a confidence score, indicating that primary computer's confidence in a chosen result. In at least one embodiment, if that confidence score exceeds a threshold, that supervisory MCU may follow that primary computer's direction, regardless of whether that secondary computer provides a conflicting or inconsistent result. In at least one embodiment, where a confidence score does not meet a threshold, and where primary and secondary computers indicate different results (e.g., a conflict), a supervisory MCU may arbitrate between computers to determine an appropriate outcome.

[0236] In at least one embodiment, a supervisory MCU may be configured to run a neural network(s) that is trained and configured to determine, based at least in part on outputs from a primary computer and outputs from a secondary computer, conditions under which that secondary computer provides false alarms. In at least one embodiment, neural network(s) in a supervisory MCU may learn when a secondary computer's output may be trusted, and when it cannot. For example, in at least one embodiment, when that secondary computer is a RADAR-based FCW system, a neural network(s) in that supervisory MCU may learn when an FCW system is identifying metallic objects that are not, in fact, hazards, such as a drainage grate or manhole cover that triggers an alarm. In at least one embodiment, when a secondary computer is a camera-based LDW system, a neural network in a supervisory MCU may learn to override LDW when bicyclists or pedestrians are present and a lane departure is, in fact, a safest maneuver. In at least one embodiment, a supervisory MCU may include at least one of a DLA or a GPU suitable for running neural network(s) with associated memory. In at least one embodiment, a supervisory MCU may comprise and/or be included as a component of SoC(s) 1104.

[0237] In at least one embodiment, ADAS system 1138 may include a secondary computer that performs ADAS functionality using traditional rules of computer vision. In at least one embodiment, that secondary computer may use classic computer vision rules (if-then), and presence of a neural network(s) in a supervisory MCU may improve reliability, safety, and performance. For example, in at least one embodiment, diverse implementation and intentional non-identity makes an overall system more fault-tolerant, especially to faults caused by software (or software-hardware interface) functionality. For example, in at least one embodiment, if there is a software bug or error in software running on a primary computer, and non-identical software code running on a secondary computer provides a consistent overall result, then a supervisory MCU may have greater confidence that an overall result is correct, and a bug in software or hardware on that primary computer is not causing a material error.

[0238] In at least one embodiment, an output of ADAS system 1138 may be fed into a primary computer's perception block and/or a primary computer's dynamic driving task block. For example, in at least one embodiment, if ADAS system 1138 indicates a forward crash warning due to an object immediately ahead, a perception block may use this information when identifying objects. In at least one embodiment, a secondary computer may have its own neural network that is trained and thus reduces a risk of false positives, as described herein.

[0239] In at least one embodiment, vehicle 1100 may further include infotainment SoC 1130 (e.g., an in-vehicle infotainment system (IVI)). Although illustrated and described as an SoC, infotainment SoC 1130, in at least one embodiment, may not be an SoC, and may include, without limitation, two or more discrete components. In at least one embodiment, infotainment SoC 1130 may include, without limitation, a combination of hardware and software that may be used to provide audio (e.g., music, a personal digital assistant, navigational instructions, news, radio, etc.), video (e.g., TV, movies, streaming, etc.), phone (e.g., hands-free calling), network connectivity (e.g., LTE, WiFi, etc.), and/or information services (e.g., navigation systems, rear parking assistance, a radio data system, vehicle-related information such as fuel level, total distance covered, brake fluid level, oil level, door open/close, air filter information, etc.) to vehicle 1100. For example, infotainment SoC 1130 could include radios, disk players, navigation systems, video players, USB and Bluetooth connectivity, carputers, in-car entertainment, WiFi, steering wheel audio controls, hands-free voice control, a heads-up display ("HUD"), HMI display 1134, a telematics device, a control panel (e.g., for controlling and/or interacting with various components, features, and/or systems), and/or other components. In at least one embodiment, infotainment SoC 1130 may further be used to provide information (e.g., visual and/or audible) to user(s) of vehicle 1100, such as information from ADAS system 1138, autonomous driving information such as planned vehicle maneuvers, trajectories, surrounding environment information (e.g., intersection information, vehicle information, road information, etc.), and/or other information.
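The confidence-threshold behavior of the supervisory MCU described in [0235] above can be sketched as follows. The function signature, threshold value, and fallback choice are illustrative assumptions rather than anything disclosed in this application:

```python
# Sketch: supervisory MCU arbitration between a primary and a
# secondary computer, following the confidence-threshold behavior
# described above. Names and the threshold are illustrative.
def arbitrate(primary_result, primary_confidence, secondary_result,
              threshold=0.9):
    """Return the result the supervisory MCU should act on."""
    if primary_confidence >= threshold:
        # Confident primary: follow it regardless of any conflict.
        return primary_result
    if primary_result == secondary_result:
        # Below threshold but no conflict: the results agree.
        return primary_result
    # Below threshold and conflicting: the MCU must arbitrate; this
    # sketch falls back to the (e.g., rule-based) secondary computer.
    return secondary_result

print(arbitrate("brake", 0.95, "coast"))  # brake
print(arbitrate("brake", 0.40, "coast"))  # coast
```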
[0240] In at least one embodiment, infotainment SoC 1130 may include any amount and type of GPU functionality. In at least one embodiment, infotainment SoC 1130 may communicate over bus 1102 with other devices, systems, and/or components of vehicle 1100. In at least one embodiment, infotainment SoC 1130 may be coupled to a supervisory MCU such that a GPU of an infotainment system may perform some self-driving functions in an event that primary controller(s) 1136 (e.g., primary and/or backup computers of vehicle 1100) fail. In at least one embodiment, infotainment SoC 1130 may put vehicle 1100 into a chauffeur-to-safe-stop mode, as described herein.

[0241] In at least one embodiment, vehicle 1100 may further include instrument cluster 1132 (e.g., a digital dash, an electronic instrument cluster, a digital instrument panel, etc.). In at least one embodiment, instrument cluster 1132 may include, without limitation, a controller and/or supercomputer (e.g., a discrete controller or supercomputer). In at least one embodiment, instrument cluster 1132 may include, without limitation, any number and combination of a set of instrumentation such as a speedometer, fuel level, oil pressure, tachometer, odometer, turn indicators, gearshift position indicator, seat belt warning light(s), parking-brake warning light(s), engine-malfunction light(s), supplemental restraint system (e.g., airbag) information, lighting controls, safety system controls, navigation information, etc. In some examples, information may be displayed and/or shared among infotainment SoC 1130 and instrument cluster 1132. In at least one embodiment, instrument cluster 1132 may be included as part of infotainment SoC 1130, or vice versa.

[0242] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in system FIG. 11C for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0243] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0244] FIG. 11D is a diagram of a system 1176 for communication between cloud-based server(s) and autonomous vehicle 1100 of FIG. 11A, according to at least one embodiment. In at least one embodiment, system 1176 may include, without limitation, server(s) 1178, network(s) 1190, and any number and type of vehicles, including vehicle 1100. In at least one embodiment, server(s) 1178 may include, without limitation, a plurality of GPUs 1184(A)-1184(H) (collectively referred to herein as GPUs 1184), PCIe switches 1182(A)-1182(D) (collectively referred to herein as PCIe switches 1182), and/or CPUs 1180(A)-1180(B) (collectively referred to herein as CPUs 1180). In at least one embodiment, GPUs 1184, CPUs 1180, and PCIe switches 1182 may be interconnected with high-speed interconnects such as, for example and without limitation, NVLink interfaces 1188 developed by NVIDIA and/or PCIe connections 1186. In at least one embodiment, GPUs 1184 are connected via an NVLink and/or NVSwitch SoC, and GPUs 1184 and PCIe switches 1182 are connected via PCIe interconnects. Although eight GPUs 1184, two CPUs 1180, and four PCIe switches 1182 are illustrated, this is not intended to be limiting. In at least one embodiment, each of server(s) 1178 may include, without limitation, any number of GPUs 1184, CPUs 1180, and/or PCIe switches 1182, in any combination. For example, in at least one embodiment, server(s) 1178 could each include eight, sixteen, thirty-two, and/or more GPUs 1184.

[0245] In at least one embodiment, server(s) 1178 may receive, over network(s) 1190 and from vehicles, image data representative of images showing unexpected or changed road conditions, such as recently commenced road-work. In at least one embodiment, server(s) 1178 may transmit, over network(s) 1190 and to vehicles, neural networks 1192, updated or otherwise, and/or map information 1194, including, without limitation, information regarding traffic and road conditions. In at least one embodiment, updates to map information 1194 may include, without limitation, updates for HD map 1122, such as information regarding construction sites, potholes, detours, flooding, and/or other obstructions. In at least one embodiment, neural networks 1192 and/or map information 1194 may have resulted from new training and/or experiences represented in data received from any number of vehicles in an environment, and/or based at least in part on training performed at a data center (e.g., using server(s) 1178 and/or other servers).

[0246] In at least one embodiment, server(s) 1178 may be used to train machine learning models (e.g., neural networks) based at least in part on training data. In at least one embodiment, training data may be generated by vehicles, and/or may be generated in a simulation (e.g., using a game engine). In at least one embodiment, any amount of training data is tagged (e.g., where an associated neural network benefits from supervised learning) and/or undergoes other pre-processing. In at least one embodiment, any amount of training data is not tagged and/or pre-processed (e.g., where an associated neural network does not require supervised learning). In at least one embodiment, once machine learning models are trained, machine learning models may be used by vehicles (e.g., transmitted to vehicles over network(s) 1190), and/or machine learning models may be used by server(s) 1178 to remotely monitor vehicles.
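Paragraph [0246] above distinguishes tagged training data (used where a network benefits from supervised learning) from untagged data. A minimal sketch of routing incoming samples accordingly follows; the field names are illustrative assumptions:

```python
# Sketch: route incoming training samples by whether they carry a
# tag, as described above (tagged -> supervised training, untagged ->
# other pre-processing / unsupervised use). Field names are assumed.
def split_training_data(samples):
    """samples: iterable of dicts like {"image": ..., "tag": ... or None}."""
    tagged, untagged = [], []
    for sample in samples:
        (tagged if sample.get("tag") is not None else untagged).append(sample)
    return tagged, untagged

batch = [
    {"image": "img0", "tag": "pothole"},
    {"image": "img1", "tag": None},
    {"image": "img2", "tag": "construction"},
]
tagged, untagged = split_training_data(batch)
print(len(tagged), len(untagged))  # 2 1
```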

[0247] In at least one embodiment, server(s) 1178 may receive data from vehicles and apply data to up-to-date, real-time neural networks for real-time intelligent inferencing. In at least one embodiment, server(s) 1178 may include deep-learning supercomputers and/or dedicated AI computers powered by GPU(s) 1184, such as DGX and DGX Station machines developed by NVIDIA. However, in at least one embodiment, server(s) 1178 may include deep-learning infrastructure that uses CPU-powered data centers.

[0248] In at least one embodiment, deep-learning infrastructure of server(s) 1178 may be capable of fast, real-time inferencing, and may use that capability to evaluate and verify health of processors, software, and/or associated hardware in vehicle 1100. For example, in at least one embodiment, deep-learning infrastructure may receive periodic updates from vehicle 1100, such as a sequence of images and/or objects that vehicle 1100 has located in that sequence of images (e.g., via computer vision and/or other machine learning object classification techniques). In at least one embodiment, deep-learning infrastructure may run its own neural network to identify objects and compare them with objects identified by vehicle 1100 and, if results do not match and deep-learning infrastructure concludes that AI in vehicle 1100 is malfunctioning, then server(s) 1178 may transmit a signal to vehicle 1100 instructing a fail-safe computer of vehicle 1100 to assume control, notify passengers, and complete a safe parking maneuver.

[0249] In at least one embodiment, server(s) 1178 may include GPU(s) 1184 and one or more programmable inference accelerators (e.g., NVIDIA's TensorRT 3 devices). In at least one embodiment, a combination of GPU-powered servers and inference acceleration may make real-time responsiveness possible. In at least one embodiment, such as where performance is less critical, servers powered by CPUs, FPGAs, and other processors may be used for inferencing. In at least one embodiment, hardware structure(s) 815 are used to perform one or more embodiments. Details regarding hardware structure(s) 815 are provided herein in conjunction with FIGS. 8A and/or 8B.

Computer Systems

[0250] FIG. 12 is a block diagram illustrating an exemplary computer system, which may be a system with interconnected devices and components, a system-on-a-chip (SoC), or some combination thereof formed with a processor that may include execution units to execute an instruction, according to at least one embodiment. In at least one embodiment, a computer system 1200 may include, without limitation, a component, such as a processor 1202, to employ execution units including logic to perform algorithms for processing data, in accordance with present disclosure, such as in embodiments described herein. In at least one embodiment, computer system 1200 may include processors, such as PENTIUM® Processor family, Xeon™, Itanium®, XScale™ and/or StrongARM™, Intel® Core™, or Intel® Nervana™ microprocessors available from Intel Corporation of Santa Clara, Calif., although other systems (including PCs having other microprocessors, engineering workstations, set-top boxes, and like) may also be used. In at least one embodiment, computer system 1200 may execute a version of WINDOWS operating system available from Microsoft Corporation of Redmond, Wash., although other operating systems (UNIX and Linux, for example), embedded software, and/or graphical user interfaces may also be used.

[0251] Embodiments may be used in other devices such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants ("PDAs"), and handheld PCs. In at least one embodiment, embedded applications may include a microcontroller, a digital signal processor ("DSP"), system on a chip, network computers ("NetPCs"), set-top boxes, network hubs, wide area network ("WAN") switches, or any other system that may perform one or more instructions in accordance with at least one embodiment.

[0252] In at least one embodiment, computer system 1200 may include, without limitation, processor 1202, which may include, without limitation, one or more execution units 1208 to perform machine learning model training and/or inferencing according to techniques described herein. In at least one embodiment, computer system 1200 is a single-processor desktop or server system, but in another embodiment, computer system 1200 may be a multiprocessor system. In at least one embodiment, processor 1202 may include, without limitation, a complex instruction set computer ("CISC") microprocessor, a reduced instruction set computing ("RISC") microprocessor, a very long instruction word ("VLIW") microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor, for example. In at least one embodiment, processor 1202 may be coupled to a processor bus 1210 that may transmit data signals between processor 1202 and other components in computer system 1200.

[0253] In at least one embodiment, processor 1202 may include, without limitation, a Level 1 ("L1") internal cache memory ("cache") 1204. In at least one embodiment, processor 1202 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, cache memory may reside external to processor 1202. Other embodiments may also include a combination of both internal and external caches depending on particular implementation and needs. In at least one embodiment, a register file 1206 may store different types of data in various registers including, without limitation, integer registers, floating point registers, status registers, and an instruction pointer register.

[0254] In at least one embodiment, execution unit 1208, including, without limitation, logic to perform integer and floating point operations, also resides in processor 1202. In at least one embodiment, processor 1202 may also include a microcode ("ucode") read only memory ("ROM") that stores microcode for certain macro instructions. In at least one embodiment, execution unit 1208 may include logic to handle a packed instruction set 1209. In at least one embodiment, by including packed instruction set 1209 in an instruction set of a general-purpose processor, along with associated circuitry to execute instructions, operations used by many multimedia applications may be performed using packed data in processor 1202. In at least one embodiment, many multimedia applications may be accelerated and executed more efficiently by using a full width of a processor's data bus for performing operations on packed data, which may eliminate a need to transfer smaller units of data across that processor's data bus to perform one or more operations one data element at a time.
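The packed-data motivation in [0254] — performing one operation on several data elements at once rather than one element at a time — can be illustrated with a small sketch, with pure Python integers standing in for a processor's packed registers:

```python
# Sketch: "packed" arithmetic in the spirit of a packed instruction
# set -- four 8-bit lanes carried in one 32-bit word, all updated by
# a single integer addition instead of one element at a time.
# (Pure-Python illustration; lane sums must stay below 256 so no
# carry crosses into a neighboring lane.)
def pack4(a, b, c, d):
    """Pack four 8-bit values into one 32-bit word."""
    return a | (b << 8) | (c << 16) | (d << 24)

def unpack4(word):
    """Recover the four 8-bit lanes from a packed 32-bit word."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]

def packed_add(x, y):
    """Lane-wise add of two packed words using one addition."""
    return x + y

xs = pack4(1, 2, 3, 4)
ys = pack4(10, 20, 30, 40)
print(unpack4(packed_add(xs, ys)))  # [11, 22, 33, 44]
```

A real packed instruction set additionally saturates or wraps each lane independently on overflow; this sketch omits that for brevity.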

[0255] In at least one embodiment, execution unit 1208 may also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. In at least one embodiment, computer system 1200 may include, without limitation, a memory 1220. In at least one embodiment, memory 1220 may be a Dynamic Random Access Memory ("DRAM") device, a Static Random Access Memory ("SRAM") device, a flash memory device, or another memory device. In at least one embodiment, memory 1220 may store instruction(s) 1219 and/or data 1221 represented by data signals that may be executed by processor 1202.

[0256] In at least one embodiment, a system logic chip may be coupled to processor bus 1210 and memory 1220. In at least one embodiment, a system logic chip may include, without limitation, a memory controller hub ("MCH") 1216, and processor 1202 may communicate with MCH 1216 via processor bus 1210. In at least one embodiment, MCH 1216 may provide a high bandwidth memory path 1218 to memory 1220 for instruction and data storage and for storage of graphics commands, data, and textures. In at least one embodiment, MCH 1216 may direct data signals between processor 1202, memory 1220, and other components in computer system 1200 and bridge data signals between processor bus 1210, memory 1220, and a system I/O interface 1222. In at least one embodiment, a system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, MCH 1216 may be coupled to memory 1220 through high bandwidth memory path 1218, and a graphics/video card 1212 may be coupled to MCH 1216 through an Accelerated Graphics Port ("AGP") interconnect 1214.

[0257] In at least one embodiment, computer system 1200 may use system I/O interface 1222 as a proprietary hub interface bus to couple MCH 1216 to an I/O controller hub ("ICH") 1230. In at least one embodiment, ICH 1230 may provide direct connections to some I/O devices via a local I/O bus. In at least one embodiment, a local I/O bus may include, without limitation, a high-speed I/O bus for connecting peripherals to memory 1220, a chipset, and processor 1202. Examples may include, without limitation, an audio controller 1229, a firmware hub ("flash BIOS") 1228, a wireless transceiver 1226, a data storage 1224, a legacy I/O controller 1223 containing user input and keyboard interfaces 1225, a serial expansion port 1227, such as a Universal Serial Bus ("USB") port, and a network controller 1234. In at least one embodiment, data storage 1224 may comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.

[0258] In at least one embodiment, FIG. 12 illustrates a system, which includes interconnected hardware devices or "chips", whereas in other embodiments, FIG. 12 may illustrate an exemplary SoC. In at least one embodiment, devices illustrated in FIG. 12 may be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of computer system 1200 are interconnected using compute express link (CXL) interconnects.

[0259] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in system FIG. 12 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0260] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0261] FIG. 13 is a block diagram illustrating an electronic device 1300 for utilizing a processor 1310, according to at least one embodiment. In at least one embodiment, electronic device 1300 may be, for example and without limitation, a notebook, a tower server, a rack server, a blade server, a laptop, a desktop, a tablet, a mobile device, a phone, an embedded computer, or any other suitable electronic device.

[0262] In at least one embodiment, electronic device 1300 may include, without limitation, processor 1310 communicatively coupled to any suitable number or kind of components, peripherals, modules, or devices. In at least one embodiment, processor 1310 is coupled using a bus or interface, such as an I²C bus, a System Management Bus ("SMBus"), a Low Pin Count ("LPC") bus, a Serial Peripheral Interface ("SPI"), a High Definition Audio ("HDA") bus, a Serial Advance Technology Attachment ("SATA") bus, a Universal Serial Bus ("USB") (versions 1, 2, 3, etc.), or a Universal Asynchronous Receiver/Transmitter ("UART") bus. In at least one embodiment, FIG. 13 illustrates a system, which includes interconnected hardware devices or "chips", whereas in other embodiments, FIG. 13 may illustrate an exemplary SoC. In at least one embodiment, devices illustrated in FIG. 13 may be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of FIG. 13 are interconnected using compute express link (CXL) interconnects.

[0263] In at least one embodiment, FIG. 13 may include a display 1324, a touch screen 1325, a touch pad 1330, a Near Field Communications unit ("NFC") 1345, a sensor hub 1340, a thermal sensor 1346, an Express Chipset ("EC") 1335, a Trusted Platform Module ("TPM") 1338, BIOS/firmware/flash memory ("BIOS, FW Flash") 1322, a DSP 1360, a drive 1320 such as a Solid State Disk ("SSD") or a Hard Disk Drive ("HDD"), a wireless local area network unit ("WLAN") 1350, a Bluetooth unit 1352, a Wireless Wide Area Network unit ("WWAN") 1356, a Global Positioning System (GPS) unit 1355, a camera ("USB 3.0 camera") 1354 such as a USB 3.0 camera, and/or a Low Power Double Data Rate ("LPDDR") memory unit ("LPDDR3") 1315 implemented in, for example, an LPDDR3 standard. These components may each be implemented in any suitable manner.

[0264] In at least one embodiment, other components may be communicatively coupled to processor 1310 through the components described herein. In at least one embodiment, an accelerometer 1341, an ambient light sensor ("ALS") 1342, a compass 1343, and a gyroscope 1344 may be communicatively coupled to sensor hub 1340.
In at least one display devices 1406 that can be implemented using a embodiment, a thermal sensor 1339 , a fan 1337 , a keyboard conventional cathode ray tube ( “ CRT ” ), a liquid crystal 1336 , and touch pad 1330 may be communicatively coupled display ( “ LCD " ), a light emitting diode ( “ LED " ) display , a to EC 1335. In at least one embodiment , speakers 1363 , plasma display, or other suitable display technologies. In at headphones 1364 , and a microphone ( “ mic " ) 1365 may be least one embodiment, user input is received from input communicatively coupled to an audio unit ( " audio codec and devices 1408 such as keyboard , mouse , touchpad , micro class D amp " ) 1362 , which may in turn be communicatively phone, etc. In at least one embodiment, each module coupled to DSP 1360. In at least one embodiment, audio unit described herein can be situated on a single semiconductor 1362 may include , for example and without limitation , an platform to form a processing system . audio coder /decoder ( " codec ” ) and a class D amplifier. In at [ 0270 ] Inference and /or training logic 815 are used to least one embodiment, a SIM card ( “ SIM ” ) 1357 may be perform inferencing and / or training operations associated communicatively coupled to WWAN unit 1356. In at least with one or more embodiments . Details regarding inference one embodiment, components such as WLAN unit 1350 and and / or training logic 815 are provided herein in conjunction Bluetooth unit 1352 , as well as WWAN unit 1356 may be with FIGS . 8A and / or 8B . In at least one embodiment, implemented in a Next Generation Form Factor (“ NGFF ” ). inference and / or training logic 815 may be used in system [ 0265 ] Inference and / or training logic 815 are used to FIG . 14 for inferencing or predicting operations based , at perform inferencing and / or training operations associated least in part, on weight parameters calculated using neural with one or more embodiments . 
Details regarding inference network training operations , neural network functions and / and / or training logic 815 are provided herein in conjunction or architectures, or neural network use cases described with FIGS . 8A and / or 8B . In at least one embodiment, herein . inference and / or training logic 815 may be used in system [ 0271 ] In at least one embodiment, one or more circuits, FIG . 13 for inferencing or predicting operations based , at processors , systems , robots , or other devices or techniques least in part, on weight parameters calculated using neural are adapted , with reference to the above figure , to identify a network training operations, neural network functions and / goal of a demonstration based , at least partially, on the or architectures, or neural network use cases described techniques described above in relations to FIGS . 1-7 . In at herein . least one embodiment, one or more circuits , processors, systems , robots , or other devices or techniques are adapted, [ 0266 ] In at least one embodiment, one or more circuits, with reference to the above figure , to implement a robotic processors , systems , robots, or other devices or techniques device capable of observing a demonstration , identifying a are adapted , with reference to the above figure, to identify a goal of the demonstration , and achieving the goal by robotic goal of a demonstration based , at least partially, on the manipulation , based , at least partially, on the techniques techniques described above in relations to FIGS . 1-7 . In at described above in relations to FIGS . 1-7 . least one embodiment, one or more circuits, processors , [ 0272 ] FIG . 15 illustrates a computer system 1500 , systems , robots, or other devices or techniques are adapted , according to at least one embodiment. 
In at least one with reference to the above figure, to implement a robotic embodiment , computer system 1500 includes, without limi device capable of observing a demonstration , identifying a tation , a computer 1510 and a USB stick 1520. In at least one goal of the demonstration , and achieving the goal by robotic embodiment, computer 1510 may include, without limita manipulation , based , at least partially, on the techniques tion , any number and type of processor ( s ) ( not shown ) and described above in relations to FIGS . 1-7 . a memory (not shown ). In at least one embodiment, com [ 0267 ] FIG . 14 illustrates a computer system 1400 , puter 1510 includes , without limitation, a server , a cloud according to at least one embodiment. In at least one instance , a laptop, and a desktop computer. embodiment, computer system 1400 is configured to imple [ 0273 ] In at least one embodiment, USB stick 1520 ment various processes and methods described throughout includes , without limitation , a processing unit 1530 , a USB this disclosure . interface 1540 , and USB interface logic 1550. In at least one [ 0268 ] In at least one embodiment, computer system 1400 embodiment, processing unit 1530 may be any instruction comprises , without limitation , at least one central processing execution system , apparatus, or device capable of executing unit ( “ CPU ” ) 1402 that is connected to a communication bus instructions . In at least one embodiment, processing unit 1410 implemented using any suitable protocol , such as PCI 1530 may include , without limitation , any number and type ( " Peripheral Component Interconnect" ), peripheral compo of processing cores ( not shown ) . 
In at least one embodiment, nent interconnect express ( “ PCI - Express ” ), AGP ( “ Acceler processing unit 1530 comprises an application specific inte ated Graphics Port ” ) , HyperTransport, or any other bus or grated circuit ( " ASIC " ) that is optimized to perform any point - to - point communication protocol (s ). In at least one amount and type of operations associated with machine embodiment , computer system 1400 includes, without limi learning . For instance, in at least one embodiment, process tation , a main memory 1404 and control logic ( e.g. , imple ing unit 1530 is a tensor processing unit ( “#TPC ” ) that is mented as hardware, software, or a combination thereof) and optimized to perform machine learning inference operations. data are stored in main memory 1404 , which may take form In at least one embodiment, processing unit 1530 is a vision of random access memory ( “RAM ” ). In at least one embodi processing unit ( “ VPU ” ) that is optimized to perform ment, a network interface subsystem ( “ network interface ” ) machine vision and machine learning inference operations. 1422 provides an interface to other computing devices and [ 0274 ] In at least one embodiment, USB interface 1540 networks for receiving data from and transmitting data to may be any type of USB connector or USB socket . For other systems with computer system 1400 . instance , in at least one embodiment, USB interface 1540 is US 2021/0081752 A1 Mar. 18 , 2021 26 a USB 3.0 Type - C socket for data and power . In at least one access memories (DRAMs ) ( including stacked DRAMs) , embodiment, USB interface 1540 is a USB 3.0 Type - A Graphics DDR SDRAM ( GDDR ) ( e.g. , GDDR5 , GDDR6 ) , connector . In at least one embodiment, USB interface logic or High Bandwidth Memory ( HBM ) and / or may be non 1550 may include any amount and type of logic that enables volatile memories such as 3D XPoint or Nano - Ram . In at processing unit 1530 to interface with devices ( e.g. 
[0275] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in system FIG. 15 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0276] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0277] FIG. 16A illustrates an exemplary architecture in which a plurality of GPUs 1610(1)-1610(N) is communicatively coupled to a plurality of multi-core processors 1605(1)-1605(M) over high-speed links 1640(1)-1640(N) (e.g., buses, point-to-point interconnects, etc.). In at least one embodiment, high-speed links 1640(1)-1640(N) support a communication throughput of 4 GB/s, 30 GB/s, 80 GB/s or higher. In at least one embodiment, various interconnect protocols may be used including, but not limited to, PCIe 4.0 or 5.0 and NVLink 2.0. In various figures, "N" and "M" represent positive integers, values of which may be different from figure to figure.

[0278] In addition, and in at least one embodiment, two or more of GPUs 1610 are interconnected over high-speed links 1629(1)-1629(2), which may be implemented using similar or different protocols/links than those used for high-speed links 1640(1)-1640(N). Similarly, two or more of multi-core processors 1605 may be connected over a high-speed link 1628, which may be symmetric multi-processor (SMP) buses operating at 20 GB/s, 30 GB/s, 120 GB/s or higher. Alternatively, all communication between various system components shown in FIG. 16A may be accomplished using similar protocols/links (e.g., over a common interconnection fabric).

[0279] In at least one embodiment, each multi-core processor 1605 is communicatively coupled to a processor memory 1601(1)-1601(M), via memory interconnects 1626(1)-1626(M), respectively, and each GPU 1610(1)-1610(N) is communicatively coupled to GPU memory 1620(1)-1620(N) over GPU memory interconnects 1650(1)-1650(N), respectively. In at least one embodiment, memory interconnects 1626 and 1650 may utilize similar or different memory access technologies. By way of example, and not limitation, processor memories 1601(1)-1601(M) and GPU memories 1620 may be volatile memories such as dynamic random access memories (DRAMs) (including stacked DRAMs), Graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or High Bandwidth Memory (HBM), and/or may be non-volatile memories such as 3D XPoint or Nano-Ram. In at least one embodiment, some portion of processor memories 1601 may be volatile memory and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).

[0280] As described herein, although various multi-core processors 1605 and GPUs 1610 may be physically coupled to a particular memory 1601, 1620, respectively, a unified memory architecture may be implemented in which a virtual system address space (also referred to as "effective address" space) is distributed among various physical memories. For example, processor memories 1601(1)-1601(M) may each comprise 64 GB of system memory address space and GPU memories 1620(1)-1620(N) may each comprise 32 GB of system memory address space, resulting in a total of 256 GB addressable memory when M=2 and N=4. Other values for N and M are possible.

[0281] FIG. 16B illustrates additional details for an interconnection between a multi-core processor 1607 and a graphics acceleration module 1646 in accordance with one exemplary embodiment. In at least one embodiment, graphics acceleration module 1646 may include one or more GPU chips integrated on a line card which is coupled to processor 1607 via high-speed link 1640 (e.g., a PCIe bus, NVLink, etc.). In at least one embodiment, graphics acceleration module 1646 may alternatively be integrated on a package or chip with processor 1607.

[0282] In at least one embodiment, processor 1607 includes a plurality of cores 1660A-1660D, each with a translation lookaside buffer ("TLB") 1661A-1661D and one or more caches 1662A-1662D. In at least one embodiment, cores 1660A-1660D may include various other components for executing instructions and processing data that are not illustrated. In at least one embodiment, caches 1662A-1662D may comprise Level 1 (L1) and Level 2 (L2) caches. In addition, one or more shared caches 1656 may be included in caches 1662A-1662D and shared by sets of cores 1660A-1660D. For example, one embodiment of processor 1607 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, one or more L2 and L3 caches are shared by two adjacent cores. In at least one embodiment, processor 1607 and graphics acceleration module 1646 connect with system memory 1614, which may include processor memories 1601(1)-1601(M) of FIG. 16A.

[0283] In at least one embodiment, coherency is maintained for data and instructions stored in various caches 1662A-1662D, 1656 and system memory 1614 via inter-core communication over a coherence bus 1664. In at least one embodiment, for example, each cache may have cache coherency logic/circuitry associated therewith to communicate over coherence bus 1664 in response to detected reads or writes to particular cache lines. In at least one embodiment, a cache snooping protocol is implemented over coherence bus 1664 to snoop cache accesses.
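The address-space arithmetic in paragraph [0280] above can be made concrete. The following is an illustrative sketch only (the layout and the helper names are assumptions, not the patent's implementation): with M=2 processor memories of 64 GB each and N=4 GPU memories of 32 GB each, one contiguous 256 GB effective address space can be laid out, and any effective address resolved to the physical memory that backs it.

```python
GB = 1 << 30

def build_address_map(m=2, n=4, cpu_gb=64, gpu_gb=32):
    """Lay out processor and GPU memories back-to-back in one virtual space."""
    regions, base = [], 0
    for i in range(1, m + 1):
        regions.append((f"processor memory 1601({i})", base, base + cpu_gb * GB))
        base += cpu_gb * GB
    for j in range(1, n + 1):
        regions.append((f"GPU memory 1620({j})", base, base + gpu_gb * GB))
        base += gpu_gb * GB
    return regions, base  # base == total addressable bytes

def resolve(regions, addr):
    """Return (owning memory, offset within it) for an effective address."""
    for name, lo, hi in regions:
        if lo <= addr < hi:
            return name, addr - lo
    raise ValueError("address outside unified address space")

regions, total = build_address_map()
print(total == 256 * GB)              # True: 2*64 GB + 4*32 GB
print(resolve(regions, 130 * GB)[0])  # GPU memory 1620(1), 2 GB past the CPU memories
```

The point of the sketch is only the arithmetic: a single "effective address" space covers every physical memory, so a pointer value alone identifies which memory serves the access.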
[0284] In at least one embodiment, a proxy circuit 1625 communicatively couples graphics acceleration module 1646 to coherence bus 1664, allowing graphics acceleration module 1646 to participate in a cache coherence protocol as a peer of cores 1660A-1660D. In particular, in at least one embodiment, an interface 1635 provides connectivity to proxy circuit 1625 over high-speed link 1640, and an interface 1637 connects graphics acceleration module 1646 to high-speed link 1640.

[0285] In at least one embodiment, an accelerator integration circuit 1636 provides cache management, memory access, context management, and interrupt management services on behalf of a plurality of graphics processing engines 1631(1)-1631(N) of graphics acceleration module 1646. In at least one embodiment, graphics processing engines 1631(1)-1631(N) may each comprise a separate graphics processing unit (GPU). In at least one embodiment, graphics processing engines 1631(1)-1631(N) alternatively may comprise different types of graphics processing engines within a GPU, such as graphics execution units, media processing engines (e.g., video encoders/decoders), samplers, and blit engines. In at least one embodiment, graphics acceleration module 1646 may be a GPU with a plurality of graphics processing engines 1631(1)-1631(N), or graphics processing engines 1631(1)-1631(N) may be individual GPUs integrated on a common package, line card, or chip.

[0286] In at least one embodiment, accelerator integration circuit 1636 includes a memory management unit (MMU) 1639 for performing various memory management functions such as virtual-to-physical memory translations (also referred to as effective-to-real memory translations) and memory access protocols for accessing system memory 1614. In at least one embodiment, MMU 1639 may also include a translation lookaside buffer (TLB) (not shown) for caching virtual/effective to physical/real address translations. In at least one embodiment, a cache 1638 can store commands and data for efficient access by graphics processing engines 1631(1)-1631(N). In at least one embodiment, data stored in cache 1638 and graphics memories 1633(1)-1633(M) is kept coherent with core caches 1662A-1662D, 1656 and system memory 1614, possibly using a fetch unit 1644. As mentioned, this may be accomplished via proxy circuit 1625 on behalf of cache 1638 and memories 1633(1)-1633(M) (e.g., sending updates to cache 1638 related to modifications/accesses of cache lines on processor caches 1662A-1662D, 1656 and receiving updates from cache 1638).

[0287] In at least one embodiment, a set of registers 1645 store context data for threads executed by graphics processing engines 1631(1)-1631(N), and a context management circuit 1648 manages thread contexts. For example, context management circuit 1648 may perform save and restore operations to save and restore contexts of various threads during context switches (e.g., where a first thread is saved and a second thread is stored so that a second thread can be executed by a graphics processing engine). For example, on a context switch, context management circuit 1648 may store current register values to a designated region in memory (e.g., identified by a context pointer). It may then restore register values when returning to a context. In at least one embodiment, an interrupt management circuit 1647 receives and processes interrupts received from system devices.

[0288] In at least one embodiment, virtual/effective addresses from a graphics processing engine 1631 are translated to real/physical addresses in system memory 1614 by MMU 1639. In at least one embodiment, accelerator integration circuit 1636 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 1646 and/or other accelerator devices. In at least one embodiment, graphics accelerator module 1646 may be dedicated to a single application executed on processor 1607 or may be shared between multiple applications. In at least one embodiment, a virtualized graphics execution environment is presented in which resources of graphics processing engines 1631(1)-1631(N) are shared with multiple applications or virtual machines (VMs). In at least one embodiment, resources may be subdivided into "slices" which are allocated to different VMs and/or applications based on processing requirements and priorities associated with VMs and/or applications.

[0289] In at least one embodiment, accelerator integration circuit 1636 performs as a bridge to a system for graphics acceleration module 1646 and provides address translation and system memory cache services. In addition, in at least one embodiment, accelerator integration circuit 1636 may provide virtualization facilities for a host processor to manage virtualization of graphics processing engines 1631(1)-1631(N), interrupts, and memory management.

[0290] In at least one embodiment, because hardware resources of graphics processing engines 1631(1)-1631(N) are mapped explicitly to a real address space seen by host processor 1607, any host processor can address these resources directly using an effective address value. In at least one embodiment, one function of accelerator integration circuit 1636 is physical separation of graphics processing engines 1631(1)-1631(N) so that they appear to a system as independent units.

[0291] In at least one embodiment, one or more graphics memories 1633(1)-1633(M) are coupled to each of graphics processing engines 1631(1)-1631(N), respectively, and N=M. In at least one embodiment, graphics memories 1633(1)-1633(M) store instructions and data being processed by each of graphics processing engines 1631(1)-1631(N). In at least one embodiment, graphics memories 1633(1)-1633(M) may be volatile memories such as DRAMs (including stacked DRAMs), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and/or may be non-volatile memories such as 3D XPoint or Nano-Ram.

[0292] In at least one embodiment, to reduce data traffic over high-speed link 1640, biasing techniques can be used to ensure that data stored in graphics memories 1633(1)-1633(M) is data that will be used most frequently by graphics processing engines 1631(1)-1631(N) and preferably not used by cores 1660A-1660D (at least not frequently). Similarly, in at least one embodiment, a biasing mechanism attempts to keep data needed by cores (and preferably not graphics processing engines 1631(1)-1631(N)) within caches 1662A-1662D, 1656 and system memory 1614.

[0293] FIG. 16C illustrates another exemplary embodiment in which accelerator integration circuit 1636 is integrated within processor 1607. In this embodiment, graphics processing engines 1631(1)-1631(N) communicate directly over high-speed link 1640 to accelerator integration circuit 1636 via interface 1637 and interface 1635 (which, again, may be any form of bus or interface protocol). In at least one embodiment, accelerator integration circuit 1636 may perform similar operations as those described with respect to FIG. 16B, but potentially at a higher throughput given its close proximity to coherence bus 1664 and caches 1662A-1662D, 1656. In at least one embodiment, an accelerator integration circuit supports different programming models including a dedicated-process programming model (no graphics acceleration module virtualization) and shared programming models (with virtualization), which may include programming models which are controlled by accelerator integration circuit 1636 and programming models which are controlled by graphics acceleration module 1646.
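The biasing idea in paragraph [0292] above can be sketched as a per-page bias table. This is an illustrative model only (the table layout, the agent names, and the promotion threshold are assumptions, not the patent's mechanism): pages touched mostly by graphics processing engines are biased toward graphics memory, and pages touched mostly by host cores toward system memory.

```python
from collections import defaultdict

class BiasTable:
    """Toy per-page bias: place each page where it is accessed most often."""

    def __init__(self, threshold=4):
        self.gpu_hits = defaultdict(int)
        self.cpu_hits = defaultdict(int)
        self.threshold = threshold  # assumed hysteresis margin before migrating

    def record(self, page, agent):
        """Count an access by 'gpu' (graphics engine) or 'cpu' (host core)."""
        (self.gpu_hits if agent == "gpu" else self.cpu_hits)[page] += 1

    def placement(self, page):
        """Suggest where the page should live, given the access history."""
        diff = self.gpu_hits[page] - self.cpu_hits[page]
        if diff >= self.threshold:
            return "graphics memory"   # GPU-biased: avoid high-speed-link traffic
        if diff <= -self.threshold:
            return "system memory"     # host-biased
        return "unchanged"             # not enough evidence to migrate

bias = BiasTable()
for _ in range(6):
    bias.record(0x40, "gpu")
bias.record(0x40, "cpu")
print(bias.placement(0x40))  # graphics memory
```

The hysteresis margin mirrors the stated goal: data is only pulled into graphics memory when the graphics engines, and not the cores, are its frequent users.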
[0294] In at least one embodiment, graphics processing engines 1631(1)-1631(N) are dedicated to a single application or process under a single operating system. In at least one embodiment, a single application can funnel other application requests to graphics processing engines 1631(1)-1631(N), providing virtualization within a VM partition.

[0295] In at least one embodiment, graphics processing engines 1631(1)-1631(N) may be shared by multiple VM/application partitions. In at least one embodiment, shared models may use a system hypervisor to virtualize graphics processing engines 1631(1)-1631(N) to allow access by each operating system. In at least one embodiment, for single-partition systems without a hypervisor, graphics processing engines 1631(1)-1631(N) are owned by an operating system. In at least one embodiment, an operating system can virtualize graphics processing engines 1631(1)-1631(N) to provide access to each process or application.

[0296] In at least one embodiment, graphics acceleration module 1646 or an individual graphics processing engine 1631(1)-1631(N) selects a process element using a process handle. In at least one embodiment, process elements are stored in system memory 1614 and are addressable using an effective address to real address translation technique described herein. In at least one embodiment, a process handle may be an implementation-specific value provided to a host process when registering its context with graphics processing engine 1631(1)-1631(N) (that is, calling system software to add a process element to a process element linked list). In at least one embodiment, the lower 16 bits of a process handle may be an offset of a process element within a process element linked list.

[0297] FIG. 16D illustrates an exemplary accelerator integration slice 1690. In at least one embodiment, a "slice" comprises a specified portion of processing resources of accelerator integration circuit 1636. In at least one embodiment, an application effective address space 1682 within system memory 1614 stores process elements 1683. In at least one embodiment, process elements 1683 are stored in response to GPU invocations 1681 from applications 1680 executed on processor 1607. In at least one embodiment, a process element 1683 contains process state for corresponding application 1680. In at least one embodiment, a work descriptor (WD) 1684 contained in process element 1683 can be a single job requested by an application or may contain a pointer to a queue of jobs. In at least one embodiment, WD 1684 is a pointer to a job request queue in an application's effective address space 1682.

[0298] In at least one embodiment, graphics acceleration module 1646 and/or individual graphics processing engines 1631(1)-1631(N) can be shared by all or a subset of processes in a system. In at least one embodiment, an infrastructure for setting up process states and sending a WD 1684 to a graphics acceleration module 1646 to start a job in a virtualized environment may be included.

[0299] In at least one embodiment, a dedicated-process programming model is implementation-specific. In at least one embodiment, in this model, a single process owns graphics acceleration module 1646 or an individual graphics processing engine 1631. In at least one embodiment, when graphics acceleration module 1646 is owned by a single process, a hypervisor initializes accelerator integration circuit 1636 for an owning partition, and an operating system initializes accelerator integration circuit 1636 for an owning process when graphics acceleration module 1646 is assigned.

[0300] In at least one embodiment, in operation, a WD fetch unit 1691 in accelerator integration slice 1690 fetches next WD 1684, which includes an indication of work to be done by one or more graphics processing engines of graphics acceleration module 1646. In at least one embodiment, data from WD 1684 may be stored in registers 1645 and used by MMU 1639, interrupt management circuit 1647, and/or context management circuit 1648 as illustrated. For example, one embodiment of MMU 1639 includes segment/page walk circuitry for accessing segment/page tables 1686 within an OS virtual address space 1685. In at least one embodiment, interrupt management circuit 1647 may process interrupt events 1692 received from graphics acceleration module 1646. In at least one embodiment, when performing graphics operations, an effective address 1693 generated by a graphics processing engine 1631(1)-1631(N) is translated to a real address by MMU 1639.

[0301] In at least one embodiment, registers 1645 are duplicated for each graphics processing engine 1631(1)-1631(N) and/or graphics acceleration module 1646, and may be initialized by a hypervisor or an operating system. In at least one embodiment, each of these duplicated registers may be included in an accelerator integration slice 1690. Exemplary registers that may be initialized by a hypervisor are shown in Table 1.

1 Slice Control Register 2 Real Address (RA ) Scheduled Processes Area Pointer 3 Authority Mask Override Register 4 Interrupt Vector Table Entry Offset 5 Interrupt Vector Table Entry Limit 6 State Register 7 Logical Partition ID 8 Real address ( RA ) Hypervisor Accelerator Utilization Record Pointer 9 Storage Description Register US 2021/0081752 A1 Mar. 18 , 2021 29

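The process-element registration described above can be sketched in a few lines: registering a context appends a process element to a per-engine list, and a lower 16 bits of the returned process handle encode that element's offset within the list. The sketch below is purely illustrative; the class, field names, element size, and upper handle bits are assumptions, not any real driver interface.

```python
# Hypothetical sketch of process-element registration. Registering a
# context appends a process element; the lower 16 bits of the returned
# handle are the element's offset within the list, as the text describes.

ELEMENT_SIZE = 64  # assumed size of one process element, in bytes

class ProcessElementList:
    def __init__(self):
        self.elements = []  # stands in for a linked list in system memory

    def register_context(self, process_state, work_descriptor):
        """Add a process element; return an implementation-specific handle."""
        offset = len(self.elements) * ELEMENT_SIZE
        self.elements.append({"state": process_state, "wd": work_descriptor})
        # Upper bits could carry implementation-specific data (0xABCD is a
        # placeholder); lower 16 bits locate the process element.
        return (0xABCD << 16) | (offset & 0xFFFF)

def element_offset(handle):
    return handle & 0xFFFF  # lower 16 bits = offset into the element list

plist = ProcessElementList()
h0 = plist.register_context("state-A", {"queue": 0x1000})
h1 = plist.register_context("state-B", {"queue": 0x2000})
print(element_offset(h0), element_offset(h1))  # 0 64
```

System software resolving a handle would add this offset to the base of the process element linked list to find the element, which is why only the low 16 bits need to be architecturally defined here.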
[0302] Exemplary registers that may be initialized by an operating system are shown in Table 2.

TABLE 2
Operating System Initialized Registers

Register #  Description
1           Process and Thread Identification
2           Effective Address (EA) Context Save/Restore Pointer
3           Virtual Address (VA) Accelerator Utilization Record Pointer
4           Virtual Address (VA) Storage Segment Table Pointer
5           Authority Mask
6           Work descriptor

[0303] In at least one embodiment, each WD 1684 is specific to a particular graphics acceleration module 1646 and/or graphics processing engines 1631(1)-1631(N). In at least one embodiment, it contains all information required by a graphics processing engine 1631(1)-1631(N) to do work, or it can be a pointer to a memory location where an application has set up a command queue of work to be completed.

[0304] FIG. 16E illustrates additional details for one exemplary embodiment of a shared model. This embodiment includes a hypervisor real address space 1698 in which a process element list 1699 is stored. In at least one embodiment, hypervisor real address space 1698 is accessible via a hypervisor 1696 which virtualizes graphics acceleration module engines for operating system 1695.

[0305] In at least one embodiment, shared programming models allow for all or a subset of processes from all or a subset of partitions in a system to use a graphics acceleration module 1646. In at least one embodiment, there are two programming models where graphics acceleration module 1646 is shared by multiple processes and partitions, namely time-sliced shared and graphics directed shared.

[0306] In at least one embodiment, in this model, system hypervisor 1696 owns graphics acceleration module 1646 and makes its function available to all operating systems 1695. In at least one embodiment, for a graphics acceleration module 1646 to support virtualization by system hypervisor 1696, graphics acceleration module 1646 may adhere to certain requirements, such as: (1) an application's job request must be autonomous (that is, state does not need to be maintained between jobs), or graphics acceleration module 1646 must provide a context save and restore mechanism; (2) an application's job request is guaranteed by graphics acceleration module 1646 to complete in a specified amount of time, including any translation faults, or graphics acceleration module 1646 provides an ability to preempt processing of a job; and (3) graphics acceleration module 1646 must be guaranteed fairness between processes when operating in a directed shared programming model.

[0307] In at least one embodiment, application 1680 is required to make an operating system 1695 system call with a graphics acceleration module type, a work descriptor (WD), an authority mask register (AMR) value, and a context save/restore area pointer (CSRP). In at least one embodiment, graphics acceleration module type describes a targeted acceleration function for a system call. In at least one embodiment, graphics acceleration module type may be a system-specific value. In at least one embodiment, WD is formatted specifically for graphics acceleration module 1646 and can be in a form of a graphics acceleration module 1646 command, an effective address pointer to a user-defined structure, an effective address pointer to a queue of commands, or any other data structure to describe work to be done by graphics acceleration module 1646.

[0308] In at least one embodiment, an AMR value is an AMR state to use for a current process. In at least one embodiment, a value passed to an operating system is similar to an application setting an AMR. In at least one embodiment, if accelerator integration circuit 1636 (not shown) and graphics acceleration module 1646 implementations do not support a User Authority Mask Override Register (UAMOR), an operating system may apply a current UAMOR value to an AMR value before passing an AMR in a hypervisor call. In at least one embodiment, hypervisor 1696 may optionally apply a current Authority Mask Override Register (AMOR) value before placing an AMR into process element 1683. In at least one embodiment, CSRP is one of registers 1645 containing an effective address of an area in an application's effective address space 1682 for graphics acceleration module 1646 to save and restore context state. In at least one embodiment, this pointer is optional if no state is required to be saved between jobs or when a job is preempted. In at least one embodiment, context save/restore area may be pinned system memory.

[0309] Upon receiving a system call, operating system 1695 may verify that application 1680 has registered and been given authority to use graphics acceleration module 1646. In at least one embodiment, operating system 1695 then calls hypervisor 1696 with information shown in Table 3.

TABLE 3
OS to Hypervisor Call Parameters

Parameter #  Description
1            A work descriptor (WD)
2            An Authority Mask Register (AMR) value (potentially masked)
3            An effective address (EA) Context Save/Restore Area Pointer (CSRP)
4            A process ID (PID) and optional thread ID (TID)
5            A virtual address (VA) accelerator utilization record pointer (AURP)
6            Virtual address of storage segment table pointer (SSTP)
7            A logical interrupt service number (LISN)
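The call flow in paragraphs [0307]-[0310] (an operating system masks an application-supplied AMR value, calls a hypervisor with the Table 3 parameters, and the hypervisor masks again before placing the AMR into a process element) can be sketched as below. Function names, the parameter subset, and the bit widths are illustrative assumptions; no real hypervisor interface is implied.

```python
# Illustrative sketch (not a real hypervisor interface) of the two-stage
# authority-mask flow: OS applies UAMOR before the hypervisor call, and
# the hypervisor applies AMOR before queuing the process element.

def os_system_call(app_amr, uamor, wd, csrp, pid, lisn):
    # An operating system may apply a current UAMOR value to an AMR
    # value before passing an AMR in a hypervisor call.
    masked_amr = app_amr & uamor
    return {"WD": wd, "AMR": masked_amr, "CSRP": csrp,
            "PID": pid, "LISN": lisn}              # subset of Table 3

def hypervisor_call(params, amor, process_list):
    # A hypervisor may optionally apply a current AMOR value before
    # placing an AMR into a process element, then queue the element
    # on a process element linked list for a module type.
    element = dict(params)
    element["AMR"] = params["AMR"] & amor
    process_list.append(element)
    return element

elements = []
req = os_system_call(app_amr=0b1111, uamor=0b0110, wd=0x100,
                     csrp=0x200, pid=42, lisn=7)
elem = hypervisor_call(req, amor=0b0100, process_list=elements)
print(bin(elem["AMR"]))  # 0b100
```

Note how each masking stage can only clear authority bits, never set them, so an application can never end up with more authority than either the OS or the hypervisor permits.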
[0310] In at least one embodiment, upon receiving a hypervisor call, hypervisor 1696 verifies that operating system 1695 has registered and been given authority to use graphics acceleration module 1646. In at least one embodiment, hypervisor 1696 then puts process element 1683 into a process element linked list for a corresponding graphics acceleration module 1646 type. In at least one embodiment, a process element may include information shown in Table 4.

TABLE 4
Process Element Information

Element #  Description
1          A work descriptor (WD)
2          An Authority Mask Register (AMR) value (potentially masked)
3          An effective address (EA) Context Save/Restore Area Pointer (CSRP)
4          A process ID (PID) and optional thread ID (TID)
5          A virtual address (VA) accelerator utilization record pointer (AURP)
6          Virtual address of storage segment table pointer (SSTP)
7          A logical interrupt service number (LISN)
8          Interrupt vector table, derived from hypervisor call parameters
9          A state register (SR) value
10         A logical partition ID (LPID)
11         A real address (RA) hypervisor accelerator utilization record pointer
12         Storage Descriptor Register (SDR)

[0311] In at least one embodiment, hypervisor initializes a plurality of accelerator integration slice 1690 registers 1645.

[0312] As illustrated in FIG. 16F, in at least one embodiment, a unified memory is used, addressable via a common virtual memory address space used to access physical processor memories 1601(1)-1601(N) and GPU memories 1620(1)-1620(N). In this implementation, operations executed on GPUs 1610(1)-1610(N) utilize a same virtual/effective memory address space to access processor memories 1601(1)-1601(M) and vice versa, thereby simplifying programmability. In at least one embodiment, a first portion of a virtual/effective address space is allocated to processor memory 1601(1), a second portion to second processor memory 1601(N), a third portion to GPU memory 1620(1), and so on. In at least one embodiment, an entire virtual/effective memory space (sometimes referred to as an effective address space) is thereby distributed across each of processor memories 1601 and GPU memories 1620, allowing any processor or GPU to access any physical memory with a virtual address mapped to that memory.

[0313] In at least one embodiment, bias/coherence management circuitry 1694A-1694E within one or more of MMUs 1639A-1639E ensures cache coherence between caches of one or more host processors (e.g., 1605) and GPUs 1610 and implements biasing techniques indicating physical memories in which certain types of data should be stored. In at least one embodiment, while multiple instances of bias/coherence management circuitry 1694A-1694E are illustrated in FIG. 16F, bias/coherence circuitry may be implemented within an MMU of one or more host processors 1605 and/or within accelerator integration circuit 1636.

[0314] One embodiment allows GPU memories 1620 to be mapped as part of system memory, and accessed using shared virtual memory (SVM) technology, but without suffering performance drawbacks associated with full system cache coherence. In at least one embodiment, an ability for GPU memories 1620 to be accessed as system memory without onerous cache coherence overhead provides a beneficial operating environment for GPU offload. In at least one embodiment, this arrangement allows software of host processor 1605 to set up operands and access computation results, without overhead of traditional I/O DMA data copies. In at least one embodiment, such traditional copies involve driver calls, interrupts and memory mapped I/O (MMIO) accesses that are all inefficient relative to simple memory accesses. In at least one embodiment, an ability to access GPU memories 1620 without cache coherence overheads can be critical to execution time of an offloaded computation. In at least one embodiment, in cases with substantial streaming write memory traffic, for example, cache coherence overhead can significantly reduce an effective write bandwidth seen by a GPU 1610. In at least one embodiment, efficiency of operand setup, efficiency of results access, and efficiency of GPU computation may play a role in determining effectiveness of a GPU offload.

[0315] In at least one embodiment, selection of GPU bias and host processor bias is driven by a bias tracker data structure. In at least one embodiment, a bias table may be used, for example, which may be a page-granular structure (e.g., controlled at a granularity of a memory page) that includes 1 or 2 bits per GPU-attached memory page. In at least one embodiment, a bias table may be implemented in a stolen memory range of one or more GPU memories 1620, with or without a bias cache in a GPU 1610 (e.g., to cache frequently/recently used entries of a bias table). Alternatively, in at least one embodiment, an entire bias table may be maintained within a GPU.

[0316] In at least one embodiment, a bias table entry associated with each access to a GPU-attached memory 1620 is accessed prior to actual access to a GPU memory, causing the following operations. In at least one embodiment, local requests from a GPU 1610 that find their page in GPU bias are forwarded directly to a corresponding GPU memory 1620. In at least one embodiment, local requests from a GPU that find their page in host bias are forwarded to processor 1605 (e.g., over a high-speed link as described herein). In at least one embodiment, requests from processor 1605 that find a requested page in host processor bias complete a request like a normal memory read. Alternatively, requests directed to a GPU-biased page may be forwarded to a GPU 1610. In at least one embodiment, a GPU may then transition a page to a host processor bias if it is not currently using a page. In at least one embodiment, a bias state of a page can be changed either by a software-based mechanism, a hardware-assisted software-based mechanism, or, for a limited set of cases, a purely hardware-based mechanism.

[0317] In at least one embodiment, one mechanism for changing bias state employs an API call (e.g., OpenCL), which, in turn, calls a GPU's device driver which, in turn, sends a message (or enqueues a command descriptor) to a GPU directing it to change a bias state and, for some transitions, perform a cache flushing operation in a host. In at least one embodiment, a cache flushing operation is used for a transition from host processor 1605 bias to GPU bias, but is not for an opposite transition.

[0318] In at least one embodiment, cache coherency is maintained by temporarily rendering GPU-biased pages uncacheable by host processor 1605. In at least one embodiment, to access these pages, processor 1605 may request access from GPU 1610, which may or may not grant access right away. In at least one embodiment, thus, to reduce communication between processor 1605 and GPU 1610 it is beneficial to ensure that GPU-biased pages are those which are required by a GPU but not host processor 1605, and vice versa.

[0319] Hardware structure(s) 815 are used to perform one or more embodiments. Details regarding hardware structure(s) 815 may be provided herein in conjunction with FIGS. 8A and/or 8B.

[0320] FIG. 17 illustrates exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores, according to various embodiments described herein. In addition to what is illustrated, other logic and circuits may be included in at least one embodiment, including additional graphics processors/cores, peripheral interface controllers, or general-purpose processor cores.

[0321] FIG. 17 is a block diagram illustrating an exemplary system on a chip integrated circuit 1700 that may be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, integrated circuit 1700 includes one or more application processor(s) 1705 (e.g., CPUs), at least one graphics processor 1710, and may additionally include an image processor 1715 and/or a video processor 1720, any of which may be a modular IP core. In at least one embodiment, integrated circuit 1700 includes peripheral or bus logic including a USB controller 1725, a UART controller 1730, an SPI/SDIO controller 1735, and an I2S/I2C controller 1740. In at least one embodiment, integrated circuit 1700 can include a display device 1745 coupled to one or more of a high-definition multimedia interface (HDMI) controller 1750 and a mobile industry processor interface (MIPI) display interface 1755. In at least one embodiment, storage may be provided by a flash memory subsystem 1760 including flash memory and a flash memory controller. In at least one embodiment, a memory interface may be provided via a memory controller 1765 for access to SDRAM or SRAM memory devices. In at least one embodiment, some integrated circuits additionally include an embedded security engine 1770.

[0322] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in integrated circuit 1700 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0323] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0324] FIGS. 18A and 18B illustrate exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores, according to various embodiments described herein. In addition to what is illustrated, other logic and circuits may be included in at least one embodiment, including additional graphics processor cores, peripheral interface controllers, or general-purpose processor cores.

[0325] FIGS. 18A and 18B are block diagrams illustrating exemplary graphics processors for use within an SoC, according to embodiments described herein. FIG. 18A illustrates an exemplary graphics processor 1810 of a system on a chip integrated circuit that may be fabricated using one or more IP cores, according to at least one embodiment. FIG. 18B illustrates an additional exemplary graphics processor 1840 of a system on a chip integrated circuit that may be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, graphics processor 1810 of FIG. 18A is a low power graphics processor core. In at least one embodiment, graphics processor 1840 of FIG. 18B is a higher performance graphics processor core. In at least one embodiment, each of graphics processors 1810, 1840 can be variants of graphics processor 1710 of FIG. 17.

[0326] In at least one embodiment, graphics processor 1810 includes a vertex processor 1805 and one or more fragment processor(s) 1815A-1815N (e.g., 1815A, 1815B, 1815C, 1815D, through 1815N-1, and 1815N). In at least one embodiment, graphics processor 1810 can execute different shader programs via separate logic, such that vertex processor 1805 is optimized to execute operations for vertex shader programs, while one or more fragment processor(s) 1815A-1815N execute fragment (e.g., pixel) shading operations for fragment or pixel shader programs. In at least one embodiment, vertex processor 1805 performs a vertex processing stage of a 3D graphics pipeline and generates primitives and vertex data. In at least one embodiment, fragment processor(s) 1815A-1815N use primitive and vertex data generated by vertex processor 1805 to produce a framebuffer that is displayed on a display device. In at least one embodiment, fragment processor(s) 1815A-1815N are optimized to execute fragment shader programs as provided for in an OpenGL API, which may be used to perform similar operations as a pixel shader program as provided for in a Direct 3D API.

[0327] In at least one embodiment, graphics processor 1810 additionally includes one or more memory management units (MMUs) 1820A-1820B, cache(s) 1825A-1825B, and circuit interconnect(s) 1830A-1830B. In at least one embodiment, one or more MMU(s) 1820A-1820B provide for virtual to physical address mapping for graphics processor 1810, including for vertex processor 1805 and/or fragment processor(s) 1815A-1815N, which may reference vertex or image/texture data stored in memory, in addition to vertex or image/texture data stored in one or more cache(s) 1825A-1825B. In at least one embodiment, one or more MMU(s) 1820A-1820B may be synchronized with other MMUs within a system, including one or more MMUs associated with one or more application processor(s) 1705, image processors 1715, and/or video processors 1720 of FIG. 17, such that each processor 1705-1720 can participate in a shared or unified virtual memory system. In at least one embodiment, one or more circuit interconnect(s) 1830A-1830B enable graphics processor 1810 to interface with other IP cores within SoC, either via an internal bus of SoC or via a direct connection.

[0328] In at least one embodiment, graphics processor 1840 includes one or more shader core(s) 1855A-1855N (e.g., 1855A, 1855B, 1855C, 1855D, 1855E, 1855F, through 1855N-1, and 1855N) as shown in FIG. 18B, which provides for a unified shader core architecture in which a single core or type of core can execute all types of programmable shader code, including shader program code to implement vertex shaders, fragment shaders, and/or compute shaders. In at least one embodiment, a number of shader cores can vary. In at least one embodiment, graphics processor 1840 includes an inter-core task manager 1845, which acts as a thread dispatcher to dispatch execution threads to one or more shader cores 1855A-1855N, and a tiling unit 1858 to accelerate tiling operations for tile-based rendering, in which rendering operations for a scene are subdivided in image space, for example to exploit local spatial coherence within a scene or to optimize use of internal caches.

[0329] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in the integrated circuits of FIGS. 18A and/or 18B for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0330] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0331] FIGS. 19A and 19B illustrate additional exemplary graphics processor logic according to embodiments described herein. FIG. 19A illustrates a graphics core 1900 that may be included within graphics processor 1710 of FIG. 17, in at least one embodiment, and may be a unified shader core 1855A-1855N as in FIG. 18B in at least one embodiment. FIG. 19B illustrates a highly-parallel general-purpose graphics processing unit ("GPGPU") 1930 suitable for deployment on a multi-chip module in at least one embodiment.

[0332] In at least one embodiment, graphics core 1900 includes a shared instruction cache 1902, a texture unit 1918, and a cache/shared memory 1920 that are common to execution resources within graphics core 1900. In at least one embodiment, graphics core 1900 can include multiple slices 1901A-1901N or a partition for each core, and a graphics processor can include multiple instances of graphics core 1900. In at least one embodiment, slices 1901A-1901N can include support logic including a local instruction cache 1904A-1904N, a thread scheduler 1906A-1906N, a thread dispatcher 1908A-1908N, and a set of registers 1910A-1910N. In at least one embodiment, slices 1901A-1901N can include a set of additional function units (AFUs 1912A-1912N), floating-point units (FPUs 1914A-1914N), integer arithmetic logic units (ALUs 1916A-1916N), address computational units (ACUs 1913A-1913N), double-precision floating-point units (DPFPUs 1915A-1915N), and matrix processing units (MPUs 1917A-1917N).

[0333] In at least one embodiment, FPUs 1914A-1914N can perform single-precision (32-bit) and half-precision (16-bit) floating point operations, while DPFPUs 1915A-1915N perform double precision (64-bit) floating point operations. In at least one embodiment, ALUs 1916A-1916N can perform variable precision integer operations at 8-bit, 16-bit, and 32-bit precision, and can be configured for mixed precision operations. In at least one embodiment, MPUs 1917A-1917N can also be configured for mixed precision matrix operations, including half-precision floating point and 8-bit integer operations. In at least one embodiment, MPUs 1917A-1917N can perform a variety of matrix operations to accelerate machine learning application frameworks, including enabling support for accelerated general matrix to matrix multiplication (GEMM). In at least one embodiment, AFUs 1912A-1912N can perform additional logic operations not supported by floating-point or integer units, including trigonometric operations (e.g., sine, cosine, etc.).
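The request-routing rules of paragraphs [0315]-[0316] reduce to a small lookup-and-dispatch: one bias bit per GPU-attached page decides where each access is serviced. The sketch below models that with a Python dictionary; the page size, names, and return strings are illustrative assumptions, not details taken from the embodiment.

```python
# Minimal sketch of a page-granular bias table and the routing rules:
# one bit per GPU-attached page selects GPU bias or host-processor bias,
# and the requester/bias combination decides where an access goes.

PAGE_SIZE = 4096
HOST_BIAS, GPU_BIAS = 0, 1

bias_table = {}  # page index -> bias bit (could live in stolen GPU memory)

def set_bias(addr, bias):
    bias_table[addr // PAGE_SIZE] = bias

def route(requester, addr):
    """Return where a memory access is serviced."""
    bias = bias_table.get(addr // PAGE_SIZE, HOST_BIAS)
    if requester == "gpu":
        # GPU requests that find their page in GPU bias go directly to a
        # corresponding GPU memory; host-biased pages are forwarded to a
        # host processor over a high-speed link.
        return "gpu-memory" if bias == GPU_BIAS else "host"
    # Host requests complete like a normal read for host-biased pages;
    # requests directed to a GPU-biased page are forwarded to a GPU.
    return "host" if bias == HOST_BIAS else "gpu"

set_bias(0x0000, GPU_BIAS)
set_bias(0x2000, HOST_BIAS)
print(route("gpu", 0x0000), route("gpu", 0x2000),
      route("host", 0x0000), route("host", 0x2000))
# gpu-memory host gpu host
```

A transition from host bias to GPU bias would flip the table entry and, per paragraph [0317], also trigger a host cache-flushing operation; that step is omitted from this sketch.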

[0334] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in graphics core 1900 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0335] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0336] FIG. 19B illustrates a general-purpose processing unit (GPGPU) 1930 that can be configured to enable highly-parallel compute operations to be performed by an array of graphics processing units, in at least one embodiment. In at least one embodiment, GPGPU 1930 can be linked directly to other instances of GPGPU 1930 to create a multi-GPU cluster to improve training speed for deep neural networks. In at least one embodiment, GPGPU 1930 includes a host interface 1932 to enable a connection with a host processor. In at least one embodiment, host interface 1932 is a PCI Express interface. In at least one embodiment, host interface 1932 can be a vendor-specific communications interface or communications fabric. In at least one embodiment, GPGPU 1930 receives commands from a host processor and uses a global scheduler 1934 to distribute execution threads associated with those commands to a set of compute clusters 1936A-1936H. In at least one embodiment, compute clusters 1936A-1936H share a cache memory 1938. In at least one embodiment, cache memory 1938 can serve as a higher-level cache for cache memories within compute clusters 1936A-1936H.

[0337] In at least one embodiment, GPGPU 1930 includes memory 1944A-1944B coupled with compute clusters 1936A-1936H via a set of memory controllers 1942A-1942B. In at least one embodiment, memory 1944A-1944B can include various types of memory devices including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory.

[0338] In at least one embodiment, compute clusters 1936A-1936H each include a set of graphics cores, such as graphics core 1900 of FIG. 19A, which can include multiple types of integer and floating point logic units that can perform computational operations at a range of precisions, including precisions suited for machine learning computations. For example, in at least one embodiment, at least a subset of floating point units in each of compute clusters 1936A-1936H can be configured to perform 16-bit or 32-bit floating point operations, while a different subset of floating point units can be configured to perform 64-bit floating point operations.

[0339] In at least one embodiment, multiple instances of GPGPU 1930 can be configured to operate as a compute cluster. In at least one embodiment, communication used by compute clusters 1936A-1936H for synchronization and data exchange varies across embodiments. In at least one embodiment, multiple instances of GPGPU 1930 communicate over host interface 1932. In at least one embodiment, GPGPU 1930 includes an I/O hub 1939 that couples GPGPU 1930 with a GPU link 1940 that enables a direct connection to other instances of GPGPU 1930. In at least one embodiment, GPU link 1940 is coupled to a dedicated GPU-to-GPU bridge that enables communication and synchronization between multiple instances of GPGPU 1930. In at least one embodiment, GPU link 1940 couples with a high-speed interconnect to transmit and receive data to other GPGPUs or parallel processors. In at least one embodiment, multiple instances of GPGPU 1930 are located in separate data processing systems and communicate via a network device that is accessible via host interface 1932. In at least one embodiment, GPU link 1940 can be configured to enable a connection to a host processor in addition to or as an alternative to host interface 1932.

[0340] In at least one embodiment, GPGPU 1930 can be configured to train neural networks. In at least one embodiment, GPGPU 1930 can be used within an inferencing platform. In at least one embodiment, in which GPGPU 1930 is used for inferencing, GPGPU 1930 may include fewer compute clusters 1936A-1936H relative to when GPGPU 1930 is used for training a neural network. In at least one embodiment, memory technology associated with memory 1944A-1944B may differ between inferencing and training configurations, with higher bandwidth memory technologies devoted to training configurations. In at least one embodiment, an inferencing configuration of GPGPU 1930 can support inferencing-specific instructions. For example, in at least one embodiment, an inferencing configuration can provide support for one or more 8-bit integer dot product instructions, which may be used during inferencing operations for deployed neural networks.

[0341] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in GPGPU 1930 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0342] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.
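Paragraph [0340] mentions 8-bit integer dot-product instructions as an example of inferencing-specific support. The sketch below models one such operation, loosely patterned on a four-element int8 dot product that accumulates into a 32-bit value (the name dp4a is borrowed from CUDA's intrinsic of that shape and is used here only as an illustration; this Python model is not the embodiment's instruction).

```python
# Software model of a 4-element signed int8 dot product with a 32-bit
# accumulator, the kind of operation an inferencing configuration might
# expose as a single instruction.

def to_int8(x):
    """Interpret the low 8 bits of x as a signed int8 value."""
    x &= 0xFF
    return x - 256 if x >= 128 else x

def dp4a(a, b, acc=0):
    """Dot product of two packed 4x int8 words, accumulated into acc."""
    total = acc
    for i in range(4):
        total += to_int8(a >> (8 * i)) * to_int8(b >> (8 * i))
    return total

# Pack bytes (1, 2, 3, 4) and (5, 6, 7, 8):  1*5 + 2*6 + 3*7 + 4*8 = 70
a = (4 << 24) | (3 << 16) | (2 << 8) | 1
b = (8 << 24) | (7 << 16) | (6 << 8) | 5
print(dp4a(a, b))  # 70
```

Collapsing four multiplies and four adds into one instruction is what makes such operations attractive for deployed quantized networks, where activations and weights are commonly stored as 8-bit integers.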

[ 0343 ] FIG . 20 is a block diagram illustrating a computing communication interfaces and / or protocol ( s ), such as NV system 2000 according to at least one embodiment. In at Link high - speed interconnect, or interconnect protocols . least one embodiment, computing system 2000 includes a [ 0347 ] In at least one embodiment, parallel processor (s ) processing subsystem 2001 having one or more processor ( s ) 2012 incorporate circuitry optimized for graphics and video 2002 and a system memory 2004 communicating via an processing , including , for example, video output circuitry, interconnection path that may include a memory hub 2005 . and constitutes a graphics processing unit ( GPU ) . In at least In at least one embodiment, memory hub 2005 may be a one embodiment, parallel processor (s ) 2012 incorporate separate component within a chipset component or may be circuitry optimized for general purpose processing . In at integrated within one or more processor ( s ) 2002. In at least least embodiment, components of computing system 2000 one embodiment, memory hub 2005 couples with an I / O may be integrated with one or more other system elements subsystem 2011 via a communication link 2006. In at least on a single integrated circuit . For example , in at least one one embodiment, I / O subsystem 2011 includes an I / O hub embodiment, parallel processor ( s ) 2012 , memory hub 2005 , 2007 that can enable computing system 2000 to receive processor ( s ) 2002 , and I / O hub 2007 can be integrated into input from one or more input device ( s ) 2008. In at least one a system on chip ( SoC ) integrated circuit. In at least one embodiment, I / O hub 2007 can enable a display controller, embodiment, components of computing system 2000 can be which may be included in one or more processor ( s ) 2002 , to integrated into a single package to form a system in package provide outputs to one or more display device ( s ) 2010A . In ( SIP ) configuration. 
In at least one embodiment, at least a at least one embodiment, one or more display device ( s ) portion of components of computing system 2000 can be 2010A coupled with I / O hub 2007 can include a local , integrated into a multi- chip module ( MCM ) , which can be internal, or embedded display device . interconnected with other multi - chip modules into a modular [ 0344 ] In at least one embodiment, processing subsystem computing system . 2001 includes one or more parallel processor ( s ) 2012 [ 0348 ] Inference and / or training logic 815 are used to coupled to memory hub 2005 via a bus or other communi perform inferencing and / or training operations associated cation link 2013. In at least one embodiment, communica with one or more embodiments. Details regarding inference tion link 2013 may use one of any number of standards and /or training logic 815 are provided herein in conjunction based communication link technologies or protocols, such with FIGS . 8A and / or 8B . In at least one embodiment, as , but not limited to PCI Express, or may be a vendor inference and / or training logic 815 may be used in system specific communications interface or communications fab FIG . 2000 for inferencing or predicting operations based , at ric . In at least one embodiment, one or more parallel least in part, on weight parameters calculated using neural processor ( s ) 2012 form a computationally focused parallel network training operations, neural network functions and / or vector processing system that can include a large number or architectures , or neural network use cases described of processing cores and / or processing clusters , such as a herein . many - integrated core ( MIC ) processor . 
In at least one [ 0349 ] In at least one embodiment, one or more circuits, embodiment, some or all of parallel processor ( s ) 2012 form processors, systems , robots , or other devices or techniques a graphics processing subsystem that can output pixels to are adapted , with reference to the above figure , to identify a one of one or more display device ( s ) 2010A coupled via I / O goal of a demonstration based , at least partially, on the Hub 2007. In at least one embodiment, parallel processor ( s ) techniques described above in relations to FIGS. 1-7 . In at 2012 can also include a display controller and display least one embodiment, one or more circuits, processors , interface (not shown ) to enable a direct connection to one or systems , robots , or other devices or techniques are adapted , more display device ( s ) 2010B . with reference to the above figure, to implement a robotic [ 0345 ] In at least one embodiment, a system storage unit device capable of observing a demonstration , identifying a 2014 can connect to I /O hub 2007 to provide a storage goal of the demonstration , and achieving the goal by robotic mechanism for computing system 2000. In at least one manipulation, based , at least partially , on the techniques embodiment, an I / O switch 2016 can be used to provide an described above in relations to FIGS . 1-7 . interface mechanism to enable connections between I / O hub 2007 and other components , such as a network adapter 2018 Processors and / or a wireless network adapter 2019 that may be inte [ 0350 ] FIG . 21A illustrates a parallel processor 2100 grated into platform , and various other devices that can be according to at least one embodiment. In at least one added via one or more add - in device ( s ) 2020. In at least one embodiment, various components of parallel processor 2100 embodiment, network adapter 2018 can be an Ethernet may be implemented using one or more integrated circuit adapter or another wired network adapter. 
In at least one devices , such as programmable processors, application spe embodiment, wireless network adapter 2019 can include one cific integrated circuits ( ASICs ), or field programmable gate or more of a Wi- Fi, Bluetooth , near field communication arrays ( FPGA ). In at least one embodiment, illustrated ( NFC ) , or other network device that includes one or more parallel processor 2100 is a variant of one or more parallel wireless radios . processor ( s) 2012 shown in FIG . 20 according to an exem [ 0346 ] In at least one embodiment, computing system plary embodiment. 2000 can include other components not explicitly shown , [ 0351 ] In at least one embodiment, parallel processor 2100 including USB or other port connections, optical storage includes a parallel processing unit 2102. In at least one drives, video capture devices, and like , may also be con embodiment, parallel processing unit 2102 includes an I / O nected to I / O hub 2007. In at least one embodiment , com unit 2104 that enables communication with other devices , munication paths interconnecting various components in including other instances of parallel processing unit 2102. In FIG . 20 may be implemented using any suitable protocols , at least one embodiment, 1/0 unit 2104 may be directly such as PCI ( Peripheral Component Interconnect) based connected to other devices . In at least one embodiment, I / O protocols ( e.g. , PCI - Express ), or other bus or point- to - point unit 2104 connects with other devices via use of a hub or US 2021/0081752 A1 Mar. 18 , 2021 35 switch interface , such as a memory hub 2105. In at least one execution of such graphics processing operations , including embodiment, connections between memory hub 2105 and but not limited to , texture sampling logic to perform texture I / O unit 2104 form a communication link 2113. In at least operations, as well as tessellation logic and other vertex one embodiment, 1/0 unit 2104 connects with a host inter processing logic . 
In at least one embodiment, processing face 2106 and a memory crossbar 2116 , where host interface cluster array 2112 can be configured to execute graphics 2106 receives commands directed to performing processing processing related shader programs such as , but not limited operations and memory crossbar 2116 receives commands to , vertex shaders , tessellation shaders, geometry shaders , directed to performing memory operations. and pixel shaders . In at least one embodiment, parallel [ 0352 ] In at least one embodiment, when host interface processing unit 2102 can transfer data from system memory 2106 receives a command buffer via 1/0 unit 2104 , host via I / O unit 2104 for processing. In at least one embodiment, interface 2106 can direct work operations to perform those during processing , transferred data can be stored to on - chip commands to a front end 2108. In at least one embodiment, memory ( e.g. , parallel processor memory 2122 ) during front end 2108 couples with a scheduler 2110 , which is processing , then written back to system memory. configured to distribute commands or other work items to a [ 0356 ] In at least one embodiment, when parallel process processing cluster array 2112. In at least one embodiment, ing unit 2102 is used to perform graphics processing, scheduler 2110 ensures that processing cluster array 2112 is scheduler 2110 can be configured to divide a processing properly configured and in a valid state before tasks are workload into approximately equal sized tasks , to better distributed to a cluster of processing cluster array 2112. In enable distribution of graphics processing operations to at least one embodiment, scheduler 2110 is implemented via multiple clusters 2114A - 2114N of processing cluster array firmware logic executing on a microcontroller. In at least one 2112. 
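The approximately-equal division of a workload across clusters 2114A-2114N mentioned above can be sketched as follows; the splitting rule shown (sizes differing by at most one item) is an illustrative assumption, since the specification does not give scheduler 2110's actual policy.

```python
def split_workload(num_items, num_clusters):
    """Divide a workload into per-cluster task sizes that differ by at
    most one item, so every cluster receives a near-equal share."""
    base, remainder = divmod(num_items, num_clusters)
    # The first `remainder` clusters each take one extra item.
    return [base + (1 if i < remainder else 0) for i in range(num_clusters)]
```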
In at least one embodiment, portions of processing embodiment, microcontroller implemented scheduler 2110 cluster array 2112 can be configured to perform different is configurable to perform complex scheduling and work types of processing . For example , in at least one embodi distribution operations at coarse and fine granularity, ment, a first portion may be configured to perform vertex enabling rapid preemption and context switching of threads shading and topology generation , a second portion may be executing on processing array 2112. In at least one embodi configured to perform tessellation and geometry shading , ment, host software can prove workloads for scheduling on and a third portion may be configured to perform pixel processing cluster array 2112 via one of multiple graphics shading or other screen space operations, to produce a processing paths. In at least one embodiment, workloads can rendered image for display . In at least one embodiment, then be automatically distributed across processing array intermediate data produced by one or more of clusters cluster 2112 by scheduler 2110 logic within a microcon 2114A - 2114N may be stored in buffers to allow intermediate troller including scheduler 2110 . data to be transmitted between clusters 2114A - 2114N for [ 0353 ] In at least one embodiment, processing cluster further processing. array 2112 can include up to “ N ” processing clusters ( e.g. , [ 0357 ] In at least one embodiment, processing cluster cluster 2114A , cluster 2114B , through cluster 2114N ) , where array 2112 can receive processing tasks to be executed via “ N ” represents a positive integer (which may be a different scheduler 2110 , which receives commands defining process integer “ N ” than used in other figures ). In at least one ing tasks from front end 2108. 
In at least one embodiment, embodiment, each cluster 2114A - 2114N of processing clus processing tasks can include indices of data to be processed , ter array 2112 can execute a large number of concurrent e.g. , surface ( patch ) data , primitive data , vertex data, and / or threads. In at least one embodiment, scheduler 2110 can pixel data , as well as state parameters and commands allocate work to clusters 2114A - 2114N of processing cluster defining how data is to be processed ( e.g. , what program is array 2112 using various scheduling and / or work distribu to be executed ). In at least one embodiment, scheduler 2110 tion algorithms, which may vary depending on workload may be configured to fetch indices corresponding to tasks or arising for each type of program or computation . In at least may receive indices from front end 2108. In at least one one embodiment, scheduling can be handled dynamically by embodiment, front end 2108 can be configured to ensure scheduler 2110 , or can be assisted in part by compiler logic processing cluster array 2112 is configured to a valid state during compilation of program logic configured for execu before a workload specified by incoming command buffers tion by processing cluster array 2112. In at least one embodi ( e.g. , batch - buffers, push buffers , etc.) is initiated . ment, different clusters 2114A - 2114N of processing cluster [ 0358 ] In at least one embodiment, each of one or more array 2112 can be allocated for processing different types of instances of parallel processing unit 2102 can couple with a programs or for performing different types of computations. parallel processor memory 2122. In at least one embodi [ 0354 ] In at least one embodiment, processing cluster ment, parallel processor memory 2122 can be accessed via array 2112 can be configured to perform various types of memory crossbar 2116 , which can receive memory requests parallel processing operations. 
In at least one embodiment, from processing cluster array 2112 as well as I / O unit 2104 . processing cluster array 2112 is configured to perform In at least one embodiment, memory crossbar 2116 can general -purpose parallel compute operations. For example, access parallel processor memory 2122 via a memory inter in at least one embodiment, processing cluster array 2112 face 2118. In at least one embodiment, memory interface can include logic to execute processing tasks including 2118 can include multiple partition units ( e.g. , partition unit filtering of video and / or audio data , performing modeling 2120A , partition unit 2120B , through partition unit 2120N ) operations, including physics operations, and performing that can each couple to a portion ( e.g. , memory unit) of data transformations . parallel processor memory 2122. In at least one embodi [ 0355 ] In at least one embodiment, processing cluster ment, a number of partition units 2120A - 2120N is config array 2112 is configured to perform parallel graphics pro ured to be equal to a number of memory units , such that a cessing operations. In at least one embodiment, processing first partition unit 2120A has a corresponding first memory cluster array 2112 can include additional logic to support unit 2124A , a second partition unit 2120B has a correspond US 2021/0081752 A1 Mar. 18 , 2021 36 ing memory unit 2124B , and an N - th partition unit 2120N embodiment, partition unit 2120 includes an L2 cache 2121 , has a corresponding N - th memory unit 2124N . In at least a frame buffer interface 2125 , and a ROP 2126 ( raster one embodiment, a number of partition units 2120A - 2120N operations unit ) . In at least one embodiment, L2 cache 2121 may not be equal to a number of memory units. 
is a read /write cache that is configured to perform load and [ 0359 ] In at least one embodiment, memory units 2124A store operations received from memory crossbar 2116 and 2124N can include various types of memory devices, includ ROP 2126. In at least one embodiment, read misses and ing dynamic random access memory (DRAM ) or graphics urgent write - back requests are output by L2 cache 2121 to random access memory , such as synchronous graphics ran frame buffer interface 2125 for processing . In at least one dom access memory ( SGRAM ), including graphics double embodiment, updates can also be sent to a frame buffer via data rate ( GDDR ) memory . In at least one embodiment, frame buffer interface 2125 for processing . In at least one memory units 2124A - 2124N may also include 3D stacked embodiment, frame buffer interface 2125 interfaces with one memory, including but not limited to high bandwidth of memory units in parallel processor memory, such as memory ( HBM ) . In at least one embodiment, render targets , memory units 2124A - 2124N of FIG . 21 ( e.g. , within parallel such as frame buffers or texture maps may be stored across processor memory 2122 ) . memory units 2124A - 2124N , allowing partition units [ 0363 ] In at least one embodiment, ROP 2126 is a pro 2120A - 2120N to write portions of each render target in cessing unit that performs raster operations such as stencil, parallel to efficiently use available bandwidth of parallel z test , blending, etc. In at least one embodiment, ROP 2126 processor memory 2122. In at least one embodiment, a local then outputs processed graphics data that is stored in graph instance of parallel processor memory 2122 may be ics memory . In at least one embodiment, ROP 2126 includes excluded in favor of a unified memory design that utilizes compression logic to compress depth or color data that is system memory in conjunction with local cache memory. 
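The striping of render targets across memory units 2124A-2124N described above, which lets partition units 2120A-2120N write portions of each render target in parallel, can be illustrated with a toy interleaving function. The modulo mapping and the partition count are assumptions for this sketch only.

```python
NUM_PARTITIONS = 4  # illustrative stand-in for partition units 2120A-2120N

def partition_for_tile(tile_index, num_partitions=NUM_PARTITIONS):
    """Toy interleaving: successive render-target tiles land on successive
    partition units, so writes are spread across all memory units."""
    return tile_index % num_partitions

# A run of consecutive tiles touches every partition unit equally often,
# which is what makes parallel use of the available bandwidth possible.
counts = [0] * NUM_PARTITIONS
for tile in range(16):
    counts[partition_for_tile(tile)] += 1
```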
written to memory and decompress depth or color data that [ 0360 ] In at least one embodiment, any one of clusters is read from memory . In at least one embodiment, compres 2114A - 2114N of processing cluster array 2112 can process sion logic can be lossless compression logic that makes use data that will be written to any of memory units 2124A of one or more of multiple compression algorithms. In at 2124N within parallel processor memory 2122. In at least least one embodiment, a type of compression that is per one embodiment, memory crossbar 2116 can be configured formed by ROP 2126 can vary based on statistical charac to transfer an output of each cluster 2114A - 2114N to any teristics of data to be compressed. For example , in at least partition unit 2120A - 2120N or to another cluster 2114A one embodiment, delta color compression is performed on 2114N , which can perform additional processing operations depth and color data on a per -tile basis . on an output . In at least one embodiment, each cluster [ 0364 ] In at least one embodiment, ROP 2126 is included 2114A - 2114N can communicate with memory interface within each processing cluster ( e.g. , cluster 2114A - 2114N of 2118 through memory crossbar 2116 to read from or write to FIG . 21A ) instead of within partition unit 2120. In at least various external memory devices. In at least one embodi one embodiment, read and write requests for pixel data are ment, memory crossbar 2116 has a connection to memory transmitted over memory crossbar 2116 instead of pixel interface 2118 to communicate with I / O unit 2104 , as well fragment data . In at least one embodiment, processed graph as a connection to a local instance of parallel processor ics data may be displayed on a display device , such as one memory 2122 , enabling processing units within different of one or more display device ( s ) 2010 of FIG . 
20 , routed for processing clusters 2114A - 2114N to communicate with sys further processing by processor ( s ) 2002 , or routed for further tem memory or other memory that is not local to parallel processing by one of processing entities within parallel processing unit 2102. In at least one embodiment, memory processor 2100 of FIG . 21A . crossbar 2116 can use virtual channels to separate traffic [ 0365 ] FIG . 21C is a block diagram of a processing cluster streams between clusters 2114A - 2114N and partition units 2114 within a parallel processing unit according to at least 2120A - 2120N . one embodiment. In at least one embodiment, a processing [ 0361 ] In at least one embodiment, multiple instances of cluster is an instance of one of processing clusters 2114A parallel processing unit 2102 can be provided on a single 2114N of FIG . 21A . In at least one embodiment, processing add - in card , or multiple add - in cards can be interconnected . cluster 2114 can be configured to execute many threads in In at least one embodiment, different instances of parallel parallel, where “ thread ” refers to an instance of a particular processing unit 2102 can be configured to interoperate even program executing on a particular set of input data . In at if different instances have different numbers of processing least one embodiment, single - instruction , multiple - data cores , different amounts of local parallel processor memory, ( SIMD ) instruction issue techniques are used to support and / or other configuration differences . For example, in at parallel execution of a large number of threads without least one embodiment, some instances of parallel processing providing multiple independent instruction units . In at least unit 2102 can include higher precision floating point units one embodiment, single - instruction , multiple - thread ( SIMT ) relative to other instances. 
In at least one embodiment, techniques are used to support parallel execution of a large systems incorporating one or more instances of parallel number of generally synchronized threads, using a common processing unit 2102 or parallel processor 2100 can be instruction unit configured to issue instructions to a set of implemented in a variety of configurations and form factors, processing engines within each one of processing clusters . including but not limited to desktop, laptop , or handheld [ 0366 ] In at least one embodiment, operation of process personal computers , servers , workstations, game consoles , ing cluster 2114 can be controlled via a pipeline manager and /or embedded systems. 2132 that distributes processing tasks to SIMT parallel [ 0362 ] FIG . 21B is a block diagram of a partition unit processors . In at least one embodiment, pipeline manager 2120 according to at least one embodiment. In at least one 2132 receives instructions from scheduler 2110 of FIG . 21A embodiment, partition unit 2120 is an instance of one of and manages execution of those instructions via a graphics partition units 2120A - 2120N of FIG . 21A . In at least one multiprocessor 2134 and / or a texture unit 2136. In at least US 2021/0081752 A1 Mar. 18 , 2021 37 one embodiment, graphics multiprocessor 2134 is an exem at least one embodiment, any memory external to parallel plary instance of a SIMT parallel processor. However, in at processing unit 2102 may be used as global memory. In at least one embodiment, various types of SIMT parallel pro least one embodiment, processing cluster 2114 includes cessors of differing architectures may be included within multiple instances of graphics multiprocessor 2134 and can processing cluster 2114. In at least one embodiment, one or share common instructions and data , which may be stored in more instances of graphics multiprocessor 2134 can be L1 cache 2148 . included within a processing cluster 2114. 
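The SIMT execution model discussed above (a common instruction unit issuing one instruction stream to a set of generally synchronized threads) can be sketched in miniature. The dual-pass masking shown is one common way of handling a divergent branch under SIMT, offered here as an illustration rather than as the mechanism the specification uses; the branch condition and operations are arbitrary.

```python
def simt_execute(values):
    """Toy SIMT step: one instruction stream, many lanes. A divergent
    branch is handled by executing both paths with complementary
    active masks, so lanes stay in lock-step with the common stream."""
    n = len(values)
    results = [None] * n
    taken = [v % 2 == 0 for v in values]  # per-lane branch outcome
    # Pass 1: only lanes where the branch is taken are active.
    for lane in range(n):
        if taken[lane]:
            results[lane] = values[lane] // 2
    # Pass 2: remaining lanes execute the other path, mask inverted.
    for lane in range(n):
        if not taken[lane]:
            results[lane] = values[lane] * 3 + 1
    return results
```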
In at least one [ 0370 ] In at least one embodiment, each processing cluster embodiment, graphics multiprocessor 2134 can process data 2114 may include an MMU 2145 (memory management and a data crossbar 2140 can be used to distribute processed unit ) that is configured to map virtual addresses into physical data to one of multiple possible destinations, including other addresses. In at least one embodiment, one or more instances shader units . In at least one embodiment, pipeline manager of MMU 2145 may reside within memory interface 2118 of 2132 can facilitate distribution of processed data by speci FIG . 21A . In at least one embodiment, MMU 2145 includes fying destinations for processed data to be distributed via a set of page table entries ( PTEs ) used to map a virtual data crossbar 2140 . address to a physical address of a tile and optionally a cache [ 0367 ] In at least one embodiment, each graphics multi line index. In at least one embodiment, MMU 2145 may processor 2134 within processing cluster 2114 can include include address translation lookaside buffers ( TLB ) or an identical set of functional execution logic ( e.g. , arithmetic caches that may reside within graphics multiprocessor 2134 logic units , load - store units, etc. ) . In at least one embodi or L1 2148 cache or processing cluster 2114. In at least one ment, functional execution logic can be configured in a embodiment, a physical address is processed to distribute pipelined manner in which new instructions can be issued surface data access locally to allow for efficient request before previous instructions are complete . In at least one interleaving among partition units. In at least one embodi embodiment, functional execution logic supports a variety ment, a cache line index may be used to determine whether of operations including integer and floating point arithmetic , a request for a cache line is a hit or miss . 
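The MMU behavior described in paragraph [0370], page table entries (PTEs) mapping virtual addresses to physical addresses with a translation lookaside buffer caching recent translations, can be sketched minimally as follows. The page size, the dict-based page table, and the unbounded TLB are simplifying assumptions for illustration.

```python
PAGE_SIZE = 4096  # illustrative page size; not specified in the text

class ToyMMU:
    """Minimal MMU sketch: a page table of PTEs mapping virtual page
    numbers to physical frames, with a TLB caching recent entries."""

    def __init__(self, page_table):
        self.page_table = page_table  # {virtual page number -> frame}
        self.tlb = {}                 # cached translations
        self.tlb_hits = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
            frame = self.tlb[vpn]
        else:
            frame = self.page_table[vpn]  # KeyError models a page fault
            self.tlb[vpn] = frame
        return frame * PAGE_SIZE + offset
```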
comparison operations, Boolean operations, bit - shifting, and [ 0371 ] In at least one embodiment, a processing cluster computation of various algebraic functions. In at least one 2114 may be configured such that each graphics multipro embodiment, same functional- unit hardware can be lever cessor 2134 is coupled to a texture unit 2136 for performing aged to perform different operations and any combination of texture mapping operations, e.g. , determining texture functional units may be present. sample positions , reading texture data , and filtering texture [ 0368 ] In at least one embodiment, instructions transmit data . In at least one embodiment, texture data is read from ted to processing cluster 2114 constitute a thread. In at least an internal texture Ll cache ( not shown ) or from an L1 one embodiment, a set of threads executing across a set of cache within graphics multiprocessor 2134 and is fetched parallel processing engines is a thread group . In at least one from an L2 cache, local parallel processor memory, or embodiment, a thread group executes a common program on system memory , as needed . In at least one embodiment, each different input data. In at least one embodiment, each thread graphics multiprocessor 2134 outputs processed tasks to within a thread group can be assigned to a different pro data crossbar 2140 to provide processed task to another cessing engine within a graphics multiprocessor 2134. In at processing cluster 2114 for further processing or to store least one embodiment, a thread group may include fewer processed task in an L2 cache, local parallel processor threads than a number of processing engines within graphics memory , or system memory via memory crossbar 2116. In multiprocecessor 2134. 
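The thread-group sizing rules above can be made concrete with a small calculation: a thread group smaller than the number of processing engines leaves some engines idle, while a larger one executes over consecutive clock cycles. The engine counts in the example are illustrative only.

```python
import math

def schedule_thread_group(num_threads, num_engines):
    """Sketch of the sizing rules described above: returns how many
    consecutive cycles a thread group needs and how many engines sit
    idle during its final cycle."""
    cycles = math.ceil(num_threads / num_engines)
    idle_last_cycle = cycles * num_engines - num_threads
    return cycles, idle_last_cycle
```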
In at least one embodiment, when a at least one embodiment, a preROP 2142 (pre - raster opera thread group includes fewer threads than a number of tions unit) is configured to receive data from graphics processing engines , one or more of processing engines may multiprocessor 2134 , and direct data to ROP units , which be idle during cycles in which that thread group is being may be located with partition units as described herein ( e.g. , processed. In at least one embodiment, a thread group may partition units 2120A - 2120N of FIG . 21A ) . In at least one also include more threads than a number of processing embodiment, preROP 2142 unit can perform optimizations engines within graphics multiprocessor 2134. In at least one for color blending, organizing pixel color data , and perform embodiment, when a thread group includes more threads ing address translations. than number of processing engines within graphics multi [ 0372 ] Inference and / or training logic 815 are used to processor2134 , processing can be performed over consecu perform inferencing and / or training operations associated tive clock cycles . In at least one embodiment, multiple with one or more embodiments . Details regarding inference thread groups can be executed concurrently on a graphics and / or training logic 815 are provided herein in conjunction multiprocessor 2134 . with FIGS . 8A and /or 8B . In at least one embodiment, [ 0369 ] In at least one embodiment, graphics multiproces inference and / or training logic 815 may be used in graphics sor 2134 includes an internal cache memory to perform load processing cluster 2114 for inferencing or predicting opera and store operations. In at least one embodiment, graphics tions based , at least in part, on weight parameters calculated multiprocessor 2134 can forego an internal cache and use a using neural network training operations, neural network cache memory ( e.g. 
, L1 cache 2148 ) within processing functions and / or architectures , or neural network use cases cluster 2114. In at least one embodiment, each graphics described herein . multiprocessor 2134 also has access to L2 caches within [ 0373 ] In at least one embodiment, one or more circuits , partition units ( e.g. , partition units 2120A - 2120N of FIG . processors , systems , robots , or other devices or techniques 21A ) that are shared among all processing clusters 2114 and are adapted , with reference to the above figure , to identify a may be used to transfer data between threads . In at least one goal of a demonstration based , at least partially, on the embodiment, graphics multiprocessor 2134 may also access techniques described above in relations to FIGS . 1-7 . In at off -chip global memory, which can include one or more of least one embodiment, one or more circuits, processors , local parallel processor memory and / or system memory. In systems , robots , or other devices or techniques are adapted , US 2021/0081752 A1 Mar. 18 , 2021 38 with reference to the above figure , to implement a robotic [ 0378 ] In at least one embodiment, GPGPU cores 2162 device capable of observing a demonstration, identifying a include SIMD logic capable of performing a single instruc goal of the demonstration, and achieving the goal by robotic tion on multiple sets of data . In at least one embodiment, manipulation, based , at least partially, on the techniques GPGPU cores 2162 can physically execute SIMD4 , SIMD8 , described above in relations to FIGS . 1-7 . and SIMD16 instructions and logically execute SIMDI , [ 0374 ] FIG . 21D shows a graphics multiprocessor 2134 SIMD2 , and SIMD32 instructions. In at least one embodi according to at least one embodiment. 
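The distinction drawn above between physically executed SIMD widths (SIMD4, SIMD8, SIMD16) and logically executed ones (SIMD1, SIMD2, SIMD32) can be sketched as issuing a wide logical instruction over several passes of narrower hardware. The function and width names are illustrative assumptions.

```python
PHYSICAL_WIDTH = 8  # e.g. one SIMD8 logic unit

def execute_logical_simd(op, operands, physical_width=PHYSICAL_WIDTH):
    """Sketch of logically executing a wide SIMD instruction (e.g.
    SIMD32) on narrower hardware (e.g. SIMD8) by issuing the same
    operation over successive chunks of lanes."""
    results = []
    for start in range(0, len(operands), physical_width):
        chunk = operands[start:start + physical_width]  # one physical pass
        results.extend(op(x) for x in chunk)
    return results
```

This is also, in miniature, how eight SIMT threads performing the same operation could be serviced by a single SIMD8 unit in one pass.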
In at least one ment, SIMD instructions for GPGPU cores can be generated embodiment, graphics multiprocessor 2134 couples with at compile time by a shader compiler or automatically pipeline manager 2132 of processing cluster 2114. In at least generated when executing programs written and compiled one embodiment, graphics multiprocessor 2134 has an for single program multiple data ( SPMD ) or SIMT archi execution pipeline including but not limited to an instruction tectures . In at least one embodiment, multiple threads of a cache 2152 , an instruction unit 2154 , an address mapping program configured for an SIMT execution model can unit 2156 , a register file 2158 , one or more general purpose executed via a single SIMD instruction . For example, in at graphics processing unit ( GPGPU ) cores 2162 , and one or least one embodiment, eight SIMT threads that perform more load / store units 2166. In at least one embodiment, same or similar operations can be executed in parallel via a GPGPU cores 2162 and load / store units 2166 are coupled single SIMD8 logic unit . with cache memory 2172 and shared memory 2170 via a [ 0379 ] In at least one embodiment, memory and cache memory and cache interconnect 2168 . interconnect 2168 is an interconnect network that connects each functional unit of graphics multiprocessor 2134 to [ 0375 ] In at least one embodiment, instruction cache 2152 register file 2158 and to shared memory 2170. In at least one receives a stream of instructions to execute from pipeline embodiment, memory and cache interconnect 2168 is a manager 2132. In at least one embodiment, instructions are crossbar interconnect that allows load / store unit 2166 to cached in instruction cache 2152 and dispatched for execu implement load and store operations between shared tion by an instruction unit 2154. In at least one embodiment, memory 2170 and register file 2158. 
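The unified address space described above, where an instruction names a local, shared, or global location with a single address that address mapping unit 2156 translates, can be sketched with a range-based decode. The address ranges below are entirely hypothetical; the text does not give the real layout.

```python
# Hypothetical, illustrative layout of a unified address space.
SPACES = [
    ("local",  0x0000, 0x4000),
    ("shared", 0x4000, 0x8000),
    ("global", 0x8000, 0x10000),
]

def map_unified_address(addr):
    """Translate a unified address into (space, space-relative offset),
    as an address mapping unit conceptually would."""
    for name, lo, hi in SPACES:
        if lo <= addr < hi:
            return name, addr - lo
    raise ValueError("address outside unified address space")
```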
[0377] In at least one embodiment, GPGPU cores 2162 can each include floating point units (FPUs) and/or integer arithmetic logic units (ALUs) that are used to execute instructions of graphics multiprocessor 2134. In at least one embodiment, GPGPU cores 2162 can be similar in architecture or can differ in architecture. In at least one embodiment, a first portion of GPGPU cores 2162 include a single precision FPU and an integer ALU while a second portion of GPGPU cores include a double precision FPU. In at least one embodiment, FPUs can implement IEEE 754-2008 standard floating point arithmetic or enable variable precision floating point arithmetic. In at least one embodiment, graphics multiprocessor 2134 can additionally include one or more fixed function or special function units to perform specific functions such as copy rectangle or pixel blending operations. In at least one embodiment, one or more of GPGPU cores 2162 can also include fixed or special function logic.

[0378] In at least one embodiment, GPGPU cores 2162 include SIMD logic capable of performing a single instruction on multiple sets of data. In at least one embodiment, GPGPU cores 2162 can physically execute SIMD4, SIMD8, and SIMD16 instructions and logically execute SIMD1, SIMD2, and SIMD32 instructions. In at least one embodiment, SIMD instructions for GPGPU cores can be generated at compile time by a shader compiler or automatically generated when executing programs written and compiled for single program multiple data (SPMD) or SIMT architectures. In at least one embodiment, multiple threads of a program configured for an SIMT execution model can be executed via a single SIMD instruction. For example, in at least one embodiment, eight SIMT threads that perform same or similar operations can be executed in parallel via a single SIMD8 logic unit.
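The example above of eight SIMT threads executing via a single SIMD8 logic unit can be sketched as follows; the lane-wise add is a stand-in for the "same or similar operations" performed by the threads.

```python
# Sketch: eight SIMT threads performing the same operation are mapped onto
# one logical SIMD8 operation. The add is an assumed stand-in workload.

def simd8_add(a, b):
    """One SIMD8 'instruction': eight lane-wise adds in a single step."""
    assert len(a) == len(b) == 8
    return [x + y for x, y in zip(a, b)]

# Eight SIMT threads, each contributing one scalar operand pair per lane.
thread_operands_a = [0, 1, 2, 3, 4, 5, 6, 7]
thread_operands_b = [10] * 8
lanes = simd8_add(thread_operands_a, thread_operands_b)
# All eight threads' results are produced by the single SIMD8 step.
```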
[0379] In at least one embodiment, memory and cache interconnect 2168 is an interconnect network that connects each functional unit of graphics multiprocessor 2134 to register file 2158 and to shared memory 2170. In at least one embodiment, memory and cache interconnect 2168 is a crossbar interconnect that allows load/store unit 2166 to implement load and store operations between shared memory 2170 and register file 2158. In at least one embodiment, register file 2158 can operate at a same frequency as GPGPU cores 2162, thus data transfer between GPGPU cores 2162 and register file 2158 can have very low latency. In at least one embodiment, shared memory 2170 can be used to enable communication between threads that execute on functional units within graphics multiprocessor 2134. In at least one embodiment, cache memory 2172 can be used as a data cache, for example, to cache texture data communicated between functional units and texture unit 2136. In at least one embodiment, shared memory 2170 can also be used as a program managed cache. In at least one embodiment, threads executing on GPGPU cores 2162 can programmatically store data within shared memory in addition to automatically cached data that is stored within cache memory 2172.

[0380] In at least one embodiment, a parallel processor or GPGPU as described herein is communicatively coupled to host/processor cores to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general purpose GPU (GPGPU) functions. In at least one embodiment, a GPU may be communicatively coupled to host processor/cores over a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In at least one embodiment, a GPU may be integrated on a package or chip as cores and communicatively coupled to cores over an internal processor bus/interconnect internal to a package or chip. In at least one embodiment, regardless of a manner in which a GPU is connected, processor cores may allocate work to such GPU in a form of sequences of commands/instructions contained in a work descriptor. In at least one embodiment, that GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.
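The allocation of work to a GPU as sequences of commands contained in a work descriptor, as described above, might be modeled like this; the command names and queue structure are hypothetical.

```python
# Sketch: a host core hands work to a GPU as a work descriptor containing a
# sequence of commands. Command names and the deque-based queue are assumed.
from collections import deque

class WorkDescriptor:
    def __init__(self, commands):
        self.commands = list(commands)

gpu_queue = deque()

def submit(descriptor: WorkDescriptor):
    """Host side: allocate work to a GPU by enqueueing a work descriptor."""
    gpu_queue.append(descriptor)

def gpu_process_next():
    """Device side: dedicated logic drains commands from one descriptor."""
    descriptor = gpu_queue.popleft()
    return [f"executed:{cmd}" for cmd in descriptor.commands]

submit(WorkDescriptor(["copy_weights", "launch_kernel", "sync"]))
results = gpu_process_next()
```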
[0381] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in graphics multiprocessor 2134 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0382] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.
[0383] FIG. 22 illustrates a multi-GPU computing system 2200, according to at least one embodiment. In at least one embodiment, multi-GPU computing system 2200 can include a processor 2202 coupled to multiple general purpose graphics processing units (GPGPUs) 2206A-D via a host interface switch 2204. In at least one embodiment, host interface switch 2204 is a PCI express switch device that couples processor 2202 to a PCI express bus over which processor 2202 can communicate with GPGPUs 2206A-D. In at least one embodiment, GPGPUs 2206A-D can interconnect via a set of high-speed point-to-point GPU-to-GPU links 2216. In at least one embodiment, GPU-to-GPU links 2216 connect to each of GPGPUs 2206A-D via a dedicated GPU link. In at least one embodiment, P2P GPU links 2216 enable direct communication between each of GPGPUs 2206A-D without requiring communication over host interface bus 2204 to which processor 2202 is connected. In at least one embodiment, with GPU-to-GPU traffic directed to P2P GPU links 2216, host interface bus 2204 remains available for system memory access or to communicate with other instances of multi-GPU computing system 2200, for example, via one or more network devices. While in at least one embodiment GPGPUs 2206A-D connect to processor 2202 via host interface switch 2204, in at least one embodiment processor 2202 includes direct support for P2P GPU links 2216 and can connect directly to GPGPUs 2206A-D.
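The routing behavior described above, where GPU-to-GPU traffic uses direct P2P links while the host interface bus stays free for system-memory access, can be sketched as a simple policy function; the endpoint naming scheme is an assumption.

```python
# Sketch: traffic between two GPUs is routed over a direct P2P link, leaving
# the host interface bus free. Endpoint names are hypothetical.

def route(src: str, dst: str) -> str:
    """Pick an interconnect for a transfer between two endpoints."""
    if src.startswith("gpu") and dst.startswith("gpu"):
        return f"p2p_link:{src}->{dst}"   # dedicated GPU-to-GPU link (2216)
    return f"host_bus:{src}->{dst}"       # host interface bus (2204)

path_a = route("gpu0", "gpu2")            # GPU-to-GPU: takes a P2P link
path_b = route("gpu1", "system_memory")   # memory access: takes the host bus
```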
[0384] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, inference and/or training logic 815 may be used in multi-GPU computing system 2200 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0385] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0386] FIG. 23 is a block diagram of a graphics processor 2300, according to at least one embodiment. In at least one embodiment, graphics processor 2300 includes a ring interconnect 2302, a pipeline front-end 2304, a media engine 2337, and graphics cores 2380A-2380N. In at least one embodiment, ring interconnect 2302 couples graphics processor 2300 to other processing units, including other graphics processors or one or more general-purpose processor cores. In at least one embodiment, graphics processor 2300 is one of many processors integrated within a multi-core processing system.
[0387] In at least one embodiment, graphics processor 2300 receives batches of commands via ring interconnect 2302. In at least one embodiment, incoming commands are interpreted by a command streamer 2303 in pipeline front-end 2304. In at least one embodiment, graphics processor 2300 includes scalable execution logic to perform 3D geometry processing and media processing via graphics core(s) 2380A-2380N. In at least one embodiment, for 3D geometry processing commands, command streamer 2303 supplies commands to geometry pipeline 2336. In at least one embodiment, for at least some media processing commands, command streamer 2303 supplies commands to a video front end 2334, which couples with media engine 2337. In at least one embodiment, media engine 2337 includes a Video Quality Engine (VQE) 2330 for video and image post-processing and a multi-format encode/decode (MFX) 2333 engine to provide hardware-accelerated media data encoding and decoding. In at least one embodiment, geometry pipeline 2336 and media engine 2337 each generate execution threads for thread execution resources provided by at least one graphics core 2380.

[0388] In at least one embodiment, graphics processor 2300 includes scalable thread execution resources featuring graphics cores 2380A-2380N (which can be modular and are sometimes referred to as core slices), each having multiple sub-cores 2350A-2350N, 2360A-2360N (sometimes referred to as core sub-slices). In at least one embodiment, graphics processor 2300 can have any number of graphics cores 2380A. In at least one embodiment, graphics processor 2300 includes a graphics core 2380A having at least a first sub-core 2350A and a second sub-core 2360A. In at least one embodiment, graphics processor 2300 is a low power processor with a single sub-core (e.g., 2350A). In at least one embodiment, graphics processor 2300 includes multiple graphics cores 2380A-2380N, each including a set of first sub-cores 2350A-2350N and a set of second sub-cores 2360A-2360N. In at least one embodiment, each sub-core in first sub-cores 2350A-2350N includes at least a first set of execution units 2352A-2352N and media/texture samplers 2354A-2354N. In at least one embodiment, each sub-core in second sub-cores 2360A-2360N includes at least a second set of execution units 2362A-2362N and samplers 2364A-2364N. In at least one embodiment, each sub-core 2350A-2350N, 2360A-2360N shares a set of shared resources 2370A-2370N. In at least one embodiment, shared resources include shared cache memory and pixel operation logic.
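The command streamer's routing of 3D geometry commands versus media processing commands, as described above, might look like the following sketch; the command-tagging scheme and payload names are assumptions.

```python
# Sketch: a command streamer routes 3D geometry commands to the geometry
# pipeline and media commands to the video front end. Tags are hypothetical.

def command_streamer(commands):
    geometry_pipeline, video_front_end = [], []
    for kind, payload in commands:
        if kind == "3d":
            geometry_pipeline.append(payload)   # toward geometry pipeline 2336
        elif kind == "media":
            video_front_end.append(payload)     # toward video front end 2334
    return geometry_pipeline, video_front_end

geo, media = command_streamer(
    [("3d", "draw_mesh"), ("media", "decode_h264"), ("3d", "tessellate")]
)
```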
least one embodiment, one or more circuits, processors , [ 0389 ] Inference and / or training logic 815 are used to systems , robots, or other devices or techniques are adapted , perform inferencing and / or training operations associated with reference to the above figure , to implement a robotic with one or more embodiments . Details regarding inference device capable of observing a demonstration , identifying a and /or training logic 815 are provided herein in conjunction goal of the demonstration, and achieving the goal by robotic with FIGS . 8A and / or 8B . In at least one embodiment, US 2021/0081752 A1 Mar. 18 , 2021 40 inference and / or training logic 815 may be used in graphics embodiment, if more than four micro - ops are needed to processor 2300 for inferencing or predicting operations complete an instruction , instruction decoder 2428 may based , at least in part, on weight parameters calculated using access microcode ROM 2432 to perform that instruction . In neural network training operations, neural network functions at least one embodiment, an instruction may be decoded into and / or architectures, or neural network use cases described a small number of micro - ops for processing at instruction herein . decoder 2428. In at least one embodiment, an instruction [ 0390 ] In at least one embodiment, one or more circuits , may be stored within microcode ROM 2432 should a processors , systems , robots, or other devices or techniques number of micro - ops be needed to accomplish such opera are adapted , with reference to the above figure, to identify a tion . In at least one embodiment, trace cache 2430 refers to goal of a demonstration based , at least partially , on the an entry point programmable logic array ( “ PLA ” ) to deter techniques described above in relations to FIGS . 1-7 . 
In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0391] FIG. 24 is a block diagram illustrating micro-architecture for a processor 2400 that may include logic circuits to perform instructions, according to at least one embodiment. In at least one embodiment, processor 2400 may perform instructions, including x86 instructions, ARM instructions, specialized instructions for application-specific integrated circuits (ASICs), etc. In at least one embodiment, processor 2400 may include registers to store packed data, such as 64-bit wide MMX™ registers in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, Calif. In at least one embodiment, MMX registers, available in both integer and floating point forms, may operate with packed data elements that accompany single instruction, multiple data ("SIMD") and streaming SIMD extensions ("SSE") instructions. In at least one embodiment, 128-bit wide XMM registers relating to SSE2, SSE3, SSE4, AVX, or beyond (referred to generically as "SSEx") technology may hold such packed data operands. In at least one embodiment, processor 2400 may perform instructions to accelerate machine learning or deep learning algorithms, training, or inferencing.
[0392] In at least one embodiment, processor 2400 includes an in-order front end ("front end") 2401 to fetch instructions to be executed and prepare instructions to be used later in a processor pipeline. In at least one embodiment, front end 2401 may include several units. In at least one embodiment, an instruction prefetcher 2426 fetches instructions from memory and feeds instructions to an instruction decoder 2428 which in turn decodes or interprets instructions. For example, in at least one embodiment, instruction decoder 2428 decodes a received instruction into one or more operations called "micro-instructions" or "micro-operations" (also called "micro ops" or "uops") that a machine may execute. In at least one embodiment, instruction decoder 2428 parses an instruction into an opcode and corresponding data and control fields that may be used by micro-architecture to perform operations in accordance with at least one embodiment. In at least one embodiment, a trace cache 2430 may assemble decoded uops into program ordered sequences or traces in a uop queue 2434 for execution. In at least one embodiment, when trace cache 2430 encounters a complex instruction, a microcode ROM 2432 provides uops needed to complete an operation.

[0393] In at least one embodiment, some instructions may be converted into a single micro-op, whereas others need several micro-ops to complete full operation. In at least one embodiment, if more than four micro-ops are needed to complete an instruction, instruction decoder 2428 may access microcode ROM 2432 to perform that instruction. In at least one embodiment, an instruction may be decoded into a small number of micro-ops for processing at instruction decoder 2428. In at least one embodiment, an instruction may be stored within microcode ROM 2432 should a number of micro-ops be needed to accomplish such operation. In at least one embodiment, trace cache 2430 refers to an entry point programmable logic array ("PLA") to determine a correct micro-instruction pointer for reading microcode sequences to complete one or more instructions from microcode ROM 2432 in accordance with at least one embodiment. In at least one embodiment, after microcode ROM 2432 finishes sequencing micro-ops for an instruction, front end 2401 of a machine may resume fetching micro-ops from trace cache 2430.
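The decode path described above, where an instruction needing more than four micro-ops falls back to a microcode ROM, can be sketched as follows; the instruction table and micro-op names are hypothetical.

```python
# Sketch: instructions decoding to at most four micro-ops are handled by the
# decoder; longer sequences are read from a microcode ROM. Tables are assumed.

MICROCODE_ROM = {
    "rep_movs": ["load", "store", "inc_src", "inc_dst", "dec_count", "branch"],
}

SIMPLE_DECODE = {
    "add": ["alu_add"],
    "load_add": ["load", "alu_add"],
}

def decode(instruction):
    uops = SIMPLE_DECODE.get(instruction)
    if uops is not None and len(uops) <= 4:
        return ("decoder", uops)
    # More than four micro-ops needed: read the sequence from microcode ROM.
    return ("microcode_rom", MICROCODE_ROM[instruction])

source, uops = decode("rep_movs")   # complex instruction takes the ROM path
```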
[0394] In at least one embodiment, out-of-order execution engine ("out of order engine") 2403 may prepare instructions for execution. In at least one embodiment, out-of-order execution logic has a number of buffers to smooth out and re-order flow of instructions to optimize performance as they go down a pipeline and get scheduled for execution. In at least one embodiment, out-of-order execution engine 2403 includes, without limitation, an allocator/register renamer 2440, a memory uop queue 2442, an integer/floating point uop queue 2444, a memory scheduler 2446, a fast scheduler 2402, a slow/general floating point scheduler ("slow/general FP scheduler") 2404, and a simple floating point scheduler ("simple FP scheduler") 2406. In at least one embodiment, fast scheduler 2402, slow/general floating point scheduler 2404, and simple floating point scheduler 2406 are also collectively referred to herein as "uop schedulers 2402, 2404, 2406." In at least one embodiment, allocator/register renamer 2440 allocates machine buffers and resources that each uop needs in order to execute. In at least one embodiment, allocator/register renamer 2440 renames logic registers onto entries in a register file. In at least one embodiment, allocator/register renamer 2440 also allocates an entry for each uop in one of two uop queues, memory uop queue 2442 for memory operations and integer/floating point uop queue 2444 for non-memory operations, in front of memory scheduler 2446 and uop schedulers 2402, 2404, 2406. In at least one embodiment, uop schedulers 2402, 2404, 2406 determine when a uop is ready to execute based on readiness of their dependent input register operand sources and availability of execution resources uops need to complete their operation. In at least one embodiment, fast scheduler 2402 may schedule on each half of a main clock cycle while slow/general floating point scheduler 2404 and simple floating point scheduler 2406 may schedule once per main processor clock cycle. In at least one embodiment, uop schedulers 2402, 2404, 2406 arbitrate for dispatch ports to schedule uops for execution.
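The scheduling rule described above, dispatching a uop only when its input operands are ready and the execution resources it needs are available, might be modeled like this; the uop records and port names are assumptions.

```python
# Sketch: a uop scheduler dispatches a uop only when all source registers are
# ready and its execution port is free. Record layout and names are assumed.

def ready_uops(uops, ready_registers, free_ports):
    """Return names of uops whose sources are ready and whose port is free."""
    dispatched = []
    for uop in uops:
        if set(uop["sources"]) <= ready_registers and uop["port"] in free_ports:
            dispatched.append(uop["name"])
    return dispatched

pending = [
    {"name": "add1", "sources": {"r1", "r2"}, "port": "alu0"},
    {"name": "mul1", "sources": {"r3", "r4"}, "port": "alu1"},  # r4 not ready
]
dispatched = ready_uops(
    pending, ready_registers={"r1", "r2", "r3"}, free_ports={"alu0", "alu1"}
)
```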
[0395] In at least one embodiment, execution block 2411 includes, without limitation, an integer register file/bypass network 2408, a floating point register file/bypass network ("FP register file/bypass network") 2410, address generation units ("AGUs") 2412 and 2414, fast Arithmetic Logic Units (ALUs) ("fast ALUs") 2416 and 2418, a slow Arithmetic Logic Unit ("slow ALU") 2420, a floating point ALU ("FP") 2422, and a floating point move unit ("FP move") 2424. In at least one embodiment, integer register file/bypass network 2408 and floating point register file/bypass network 2410 are also referred to herein as "register files 2408, 2410." In at least one embodiment, AGUs 2412 and 2414, fast ALUs 2416 and 2418, slow ALU 2420, floating point ALU 2422, and floating point move unit 2424 are also referred to herein as "execution units 2412, 2414, 2416, 2418, 2420, 2422, and 2424." In at least one embodiment, execution block 2411 may include, without limitation, any number (including zero) and type of register files, bypass networks, address generation units, and execution units, in any combination.

[0396] In at least one embodiment, register networks 2408, 2410 may be arranged between uop schedulers 2402, 2404, 2406, and execution units 2412, 2414, 2416, 2418, 2420, 2422, and 2424. In at least one embodiment, integer register file/bypass network 2408 performs integer operations. In at least one embodiment, floating point register file/bypass network 2410 performs floating point operations. In at least one embodiment, each of register networks 2408, 2410 may include, without limitation, a bypass network that may bypass or forward just completed results that have not yet been written into a register file to new dependent uops. In at least one embodiment, register networks 2408, 2410 may communicate data with each other. In at least one embodiment, integer register file/bypass network 2408 may include, without limitation, two separate register files, one register file for a low-order thirty-two bits of data and a second register file for a high-order thirty-two bits of data. In at least one embodiment, floating point register file/bypass network 2410 may include, without limitation, 128-bit wide entries because floating point instructions typically have operands from 64 to 128 bits in width.
[0397] In at least one embodiment, execution units 2412, 2414, 2416, 2418, 2420, 2422, 2424 may execute instructions. In at least one embodiment, register networks 2408, 2410 store integer and floating point data operand values that micro-instructions need to execute. In at least one embodiment, processor 2400 may include, without limitation, any number and combination of execution units 2412, 2414, 2416, 2418, 2420, 2422, 2424. In at least one embodiment, floating point ALU 2422 and floating point move unit 2424 may execute floating point, MMX, SIMD, AVX and SSE, or other operations, including specialized machine learning instructions. In at least one embodiment, floating point ALU 2422 may include, without limitation, a 64-bit by 64-bit floating point divider to execute divide, square root, and remainder micro ops. In at least one embodiment, instructions involving a floating point value may be handled with floating point hardware. In at least one embodiment, ALU operations may be passed to fast ALUs 2416, 2418. In at least one embodiment, fast ALUs 2416, 2418 may execute fast operations with an effective latency of half a clock cycle. In at least one embodiment, most complex integer operations go to slow ALU 2420 as slow ALU 2420 may include, without limitation, integer execution hardware for long-latency type of operations, such as a multiplier, shifts, flag logic, and branch processing. In at least one embodiment, memory load/store operations may be executed by AGUs 2412, 2414. In at least one embodiment, fast ALU 2416, fast ALU 2418, and slow ALU 2420 may perform integer operations on 64-bit data operands. In at least one embodiment, fast ALU 2416, fast ALU 2418, and slow ALU 2420 may be implemented to support a variety of data bit sizes including sixteen, thirty-two, 128, 256, etc. In at least one embodiment, floating point ALU 2422 and floating point move unit 2424 may be implemented to support a range of operands having bits of various widths, such as 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.

[0398] In at least one embodiment, uop schedulers 2402, 2404, 2406 dispatch dependent operations before a parent load has finished executing. In at least one embodiment, as uops may be speculatively scheduled and executed in processor 2400, processor 2400 may also include logic to handle memory misses. In at least one embodiment, if a data load misses in a data cache, there may be dependent operations in flight in a pipeline that have left a scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, dependent operations might need to be replayed and independent ones may be allowed to complete. In at least one embodiment, schedulers and a replay mechanism of at least one embodiment of a processor may also be designed to catch instruction sequences for text string comparison operations.
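The speculative dispatch and replay mechanism described above can be sketched as follows; the dict-based cache model and the trace format are assumptions for illustration.

```python
# Sketch: dependents of a load are dispatched speculatively; on a data-cache
# miss they are re-executed once correct data arrives. Cache model is assumed.

def execute_with_replay(load_addr, dependents, cache):
    trace = [f"load:{load_addr}"] + [f"spec:{d}" for d in dependents]
    if load_addr not in cache:        # data-cache miss detected
        cache[load_addr] = "filled"   # line arrives from memory
        # Replay mechanism: re-execute dependents that used incorrect data.
        trace += [f"replay:{d}" for d in dependents]
    return trace

miss_trace = execute_with_replay("0x40", ["add1", "sub1"], cache={})
hit_trace = execute_with_replay("0x40", ["add1"], cache={"0x40": "filled"})
```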
[0399] In at least one embodiment, "registers" may refer to on-board processor storage locations that may be used as part of instructions to identify operands. In at least one embodiment, registers may be those that may be usable from outside of a processor (from a programmer's perspective). In at least one embodiment, registers might not be limited to a particular type of circuit. Rather, in at least one embodiment, a register may store data, provide data, and perform functions described herein. In at least one embodiment, registers described herein may be implemented by circuitry within a processor using any number of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, combinations of dedicated and dynamically allocated physical registers, etc. In at least one embodiment, integer registers store 32-bit integer data. A register file of at least one embodiment also contains eight multimedia SIMD registers for packed data.
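Register renaming, in which logical registers are mapped onto dynamically allocated physical register-file entries as described above, can be illustrated with a minimal sketch; the free-list policy and register names are assumptions.

```python
# Sketch: logical registers are renamed onto dynamically allocated physical
# registers. The FIFO free-list policy and names are assumptions.

class Renamer:
    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))  # free physical registers
        self.map = {}                          # logical name -> physical index

    def rename(self, logical: str) -> int:
        """Allocate a fresh physical register for a new value of `logical`."""
        phys = self.free.pop(0)
        self.map[logical] = phys
        return phys

r = Renamer(num_physical=8)
p0 = r.rename("eax")   # first write to eax gets physical register 0
p1 = r.rename("ebx")   # ebx gets physical register 1
p2 = r.rename("eax")   # a second write to eax gets a new physical register
```

Renaming each write to a fresh physical register is what removes false dependencies between successive uses of the same logical name.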
[0400] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, portions or all of inference and/or training logic 815 may be incorporated into execution block 2411 and other memory or registers shown or not shown. For example, in at least one embodiment, training and/or inferencing techniques described herein may use one or more of ALUs illustrated in execution block 2411. Moreover, weight parameters may be stored in on-chip or off-chip memory and/or registers (shown or not shown) that configure ALUs of execution block 2411 to perform one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0401] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0402] FIG. 25 illustrates a deep learning application processor 2500, according to at least one embodiment. In at least one embodiment, deep learning application processor 2500 uses instructions that, if executed by deep learning application processor 2500, cause deep learning application processor 2500 to perform some or all of processes and techniques described throughout this disclosure. In at least one embodiment, deep learning application processor 2500 is an application-specific integrated circuit (ASIC). In at least one embodiment, application processor 2500 performs matrix multiply operations either "hard-wired" into hardware, as a result of performing one or more instructions, or both. In at least one embodiment, deep learning application processor 2500 includes, without limitation, processing clusters 2510(1)-2510(12), Inter-Chip Links ("ICLs") 2520(1)-2520(12), Inter-Chip Controllers ("ICCs") 2530(1)-2530(2), high-bandwidth memory second generation ("HBM2") 2540(1)-2540(4), memory controllers ("Mem Ctrlrs") 2542(1)-2542(4), high bandwidth memory physical layer ("HBM PHY") 2544(1)-2544(4), a management-controller central processing unit ("management-controller CPU") 2550, a Serial Peripheral Interface, Inter-Integrated Circuit, and General Purpose Input/Output block ("SPI, I2C, GPIO") 2560, a peripheral component interconnect express controller and direct memory access block ("PCIe Controller and DMA") 2570, and a sixteen-lane peripheral component interconnect express port ("PCI Express x16") 2580.
[0403] In at least one embodiment, processing clusters 2510 may perform deep learning operations, including inference or prediction operations based on weight parameters calculated using one or more training techniques, including those described herein. In at least one embodiment, each processing cluster 2510 may include, without limitation, any number and type of processors. In at least one embodiment, deep learning application processor 2500 may include any number and type of processing clusters 2510. In at least one embodiment, Inter-Chip Links 2520 are bi-directional. In at least one embodiment, Inter-Chip Links 2520 and Inter-Chip Controllers 2530 enable multiple deep learning application processors 2500 to exchange information, including activation information resulting from performing one or more machine learning algorithms embodied in one or more neural networks. In at least one embodiment, deep learning application processor 2500 may include any number (including zero) and type of ICLs 2520 and ICCs 2530.

[0404] In at least one embodiment, HBM2s 2540 provide a total of 32 Gigabytes (GB) of memory. In at least one embodiment, HBM2 2540(i) is associated with both memory controller 2542(i) and HBM PHY 2544(i) where "i" is an arbitrary integer. In at least one embodiment, any number of HBM2s 2540 may provide any type and total amount of high bandwidth memory and may be associated with any number (including zero) and type of memory controllers 2542 and HBM PHYs 2544. In at least one embodiment, SPI, I2C, GPIO 2560, PCIe Controller and DMA 2570, and/or PCIe 2580 may be replaced with any number and type of blocks that enable any number and type of communication standards in any technically feasible fashion.
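The index-wise association of each HBM2 stack with its memory controller and HBM PHY, together with the 32 GB total, can be sketched as follows; the 8 GB-per-stack figure is an assumption chosen only to be consistent with the stated 32 GB total across four stacks.

```python
# Sketch: HBM2 stack 2540(i) pairs with memory controller 2542(i) and HBM PHY
# 2544(i). The 8 GB-per-stack capacity is an assumption (4 stacks x 8 = 32 GB).

def memory_complex(num_stacks=4, gb_per_stack=8):
    stacks = [
        {"hbm2": f"2540({i})", "controller": f"2542({i})", "phy": f"2544({i})"}
        for i in range(1, num_stacks + 1)
    ]
    return stacks, num_stacks * gb_per_stack

stacks, total_gb = memory_complex()
```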
[0405] Inference and/or training logic 815 are used to perform inferencing and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 815 are provided herein in conjunction with FIGS. 8A and/or 8B. In at least one embodiment, deep learning application processor 2500 is used to train a machine learning model, such as a neural network, to predict or infer information provided to deep learning application processor 2500. In at least one embodiment, deep learning application processor 2500 is used to infer or predict information based on a trained machine learning model (e.g., neural network) that has been trained by another processor or system or by deep learning application processor 2500. In at least one embodiment, processor 2500 may be used to perform one or more neural network use cases described herein.

[0406] In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to identify a goal of a demonstration based, at least partially, on the techniques described above in relation to FIGS. 1-7. In at least one embodiment, one or more circuits, processors, systems, robots, or other devices or techniques are adapted, with reference to the above figure, to implement a robotic device capable of observing a demonstration, identifying a goal of the demonstration, and achieving the goal by robotic manipulation, based, at least partially, on the techniques described above in relation to FIGS. 1-7.

[0407] FIG. 26 is a block diagram of a neuromorphic processor 2600, according to at least one embodiment. In at least one embodiment, neuromorphic processor 2600 may receive one or more inputs from sources external to neuromorphic processor 2600. In at least one embodiment, these inputs may be transmitted to one or more neurons 2602 within neuromorphic processor 2600. In at least one embodiment, neurons 2602 and components thereof may be implemented using circuitry or logic, including one or more arithmetic logic units (ALUs). In at least one embodiment, neuromorphic processor 2600 may include, without limitation, thousands or millions of instances of neurons 2602, but any suitable number of neurons 2602 may be used. In at least one embodiment, each instance of neuron 2602 may include a neuron input 2604 and a neuron output 2606. In at least one embodiment, neurons 2602 may generate outputs that may be transmitted to inputs of other instances of neurons 2602. For example, in at least one embodiment, neuron inputs 2604 and neuron outputs 2606 may be interconnected via synapses 2608.

[0408] In at least one embodiment, neurons 2602 and synapses 2608 may be interconnected such that neuromorphic processor 2600 operates to process or analyze information received by neuromorphic processor 2600. In at least one embodiment, neurons 2602 may transmit an output pulse (or "fire" or "spike") when inputs received through neuron input 2604 exceed a threshold. In at least one embodiment, neurons 2602 may sum or integrate signals received at neuron inputs 2604. For example, in at least one embodiment, neurons 2602 may be implemented as leaky integrate-and-fire neurons, wherein if a sum (referred to as a "membrane potential") exceeds a threshold value, neuron 2602 may generate an output (or "fire") using a transfer function such as a sigmoid or threshold function. In at least one embodiment, a leaky integrate-and-fire neuron may sum signals received at neuron inputs 2604 into a membrane potential and may also apply a decay factor (or leak) to reduce a membrane potential. In at least one embodiment, a leaky integrate-and-fire neuron may fire if multiple input signals are received at neuron inputs 2604 rapidly enough to exceed a threshold value (i.e., before a membrane potential decays too low to fire).
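The leaky integrate-and-fire behavior described above, where a membrane potential integrates inputs, decays over time, and triggers a spike once it exceeds a threshold, can be sketched as follows; the decay factor, threshold, and reset-to-zero policy are illustrative assumptions.

```python
# Sketch of a leaky integrate-and-fire neuron: inputs are summed into a
# membrane potential with a per-step decay (leak); the neuron fires when the
# potential exceeds a threshold. Parameter values are assumptions.

class LeakyIntegrateAndFireNeuron:
    def __init__(self, threshold=1.0, decay=0.5):
        self.threshold = threshold
        self.decay = decay
        self.potential = 0.0  # membrane potential

    def step(self, input_signal: float) -> bool:
        """Integrate one input; return True if the neuron fires."""
        self.potential = self.potential * self.decay + input_signal
        if self.potential > self.threshold:
            self.potential = 0.0   # reset after firing (assumed policy)
            return True
        return False

# Slow inputs decay away before the threshold is reached; rapid inputs fire.
slow_neuron = LeakyIntegrateAndFireNeuron()
slow = [slow_neuron.step(0.4) for _ in range(4)]
fast_neuron = LeakyIntegrateAndFireNeuron()
fast = [fast_neuron.step(0.7) for _ in range(2)]
```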
In at least one embodiment, neurons