Indexing Common Lisp with Kythe: A Demonstration
Jonathan Godbout

ABSTRACT

For decades Lispers have had the power of code cross-references (jump to definition, list callers, etc.) for any code they've loaded into their Lisp image. But what about cross-referencing code that isn't (or can't be) loaded into the image? Wouldn't it be great if we could ask "who, in the global Lisp community, calls this function?" The only option currently available is to download all Lisp code and use "grep" or similar text-based tools. At Google we use Kythe [4] as a cross-reference database for all Lisp code, whether loaded into our local Lisp image or not. We will show how Lisp is cross-referenced on a static web page with hyperlinks between definitions. With this we can also get call graphs and call hierarchies (some limitations apply).

ACM Reference Format:
Jonathan Godbout. 2020. Indexing Common Lisp with Kythe: A Demonstration. In Proceedings of the 13th European Lisp Symposium (ELS'20). ACM, New York, NY, USA, 3 pages. https://doi.org/10.5281/zenodo.3765987

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ELS'20, April 27–28 2020, Zürich, Switzerland
© 2020 Copyright held by the owner/author(s). ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.
https://doi.org/10.5281/zenodo.3765987

1 INTRODUCTION

Almost every project will have a large number of files and functions. As soon as the number of files goes above 1, or the number of possible on-screen pages goes above 1, users will get confused about what definitions are used where. SLIME [5] has jump-to-definition using "M-.", so when the code has been loaded into the Lisp image we can jump to function definitions and call sites. On websites with static code, such as https://www.github.com, where the code is viewed statically on screen, it would be nice to get hyperlinks between the definitions and their usage.

Kythe (https://kythe.io/) is a service that allows users to implement language-specific indexers and then to upload graphs describing the structure of the code. This allows code display and editing engines to provide services like jump-to-definition. At Google we have implemented a Lisp plugin for the Kythe indexer to produce cross-reference data for Google's Common Lisp code base. We will start with a brief overview of Kythe, and then discuss indexing Lisp.

2 KYTHE OVERVIEW

Kythe is a database for storing code graphs for large code bases across multiple languages. Its schema is designed to accommodate facets of different languages. Part of its schema are nodes, which name functions and variables, define exact locations in a file, or give other information about the indexed code. It has VNames, which uniquely identify a node in a code base. It has Edges, which annotate how two nodes relate to each other.

For example, take the variable object from threadp in Bordeaux-threads [7]:

    (defun threadp (object)
      (typep object 'sb-thread:thread))

The variable object next to threadp would have a node:

    {
      ticket: "kythe://corpus??lang=lisp?path=PATH#BORDEAUX-THREADS%3A%3AOBJECT%20%3AVARIABLE%20loc%3D%2825%3A16-25%3A22%29",
      kind: "variable",
      language: "lisp",
      name: "object",
      qualified_name: "object",
      location: {
        corpus: "corpus",
        path: "PATH/TO/bordeaux-threads/src/impl-sbcl.lisp",
        line_number: 25,
        line_number_end: 25,
        column_number: 16,
        column_number_end: 22
      },
      v_name: {
        signature: "BORDEAUX-THREADS::OBJECT :VARIABLE loc=(25:16-25:22)",
        corpus: "corpus",
        path: "PATH/TO/bordeaux-threads/src/impl-sbcl.lisp",
        language: "lisp"
      }
    }

The VName uniquely identifies the node. The slot kind tells which kind of node this is, so "variable" tells us this is a variable. The slot location gives the source location of the referenced code. The slot ticket is just a URI encoding of the VName. By location reference we mean a node containing the location of a form in the code.

There would be a second node for the instance of the variable which is the first argument to typep. Finally there would be an edge

    {
      source: node1,
      target: node2,
      edge_kind: ref
    }
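The relationship between the signature in the v_name slot and the fragment of the ticket can be sketched in a few lines of Common Lisp. This is only a sketch under stated assumptions, not the plugin's code: make-signature, unreserved-char-p and percent-encode are hypothetical names, the form is assumed to start and end on the same line, and the encoder assumes ASCII input and escapes everything outside the RFC 3986 unreserved set.

```lisp
;; Hypothetical helpers for illustration; not part of Kythe or the Lisp plugin.

(defun make-signature (package name kind line col-start col-end)
  "Build a signature string shaped like the one in the node above.
Assumes the form starts and ends on the same LINE."
  (format nil "~A::~A :~A loc=(~D:~D-~D:~D)"
          package name kind line col-start line col-end))

(defun unreserved-char-p (char)
  "True for ASCII letters, digits, and the characters - . _ ~"
  (or (and (< (char-code char) 128) (alphanumericp char))
      (find char "-._~")))

(defun percent-encode (string)
  "Percent-encode STRING, assuming it contains only ASCII characters."
  (with-output-to-string (out)
    (loop for char across string
          do (if (unreserved-char-p char)
                 (write-char char out)
                 (let ((code (char-code char)))
                   ;; DIGIT-CHAR returns uppercase hex digits.
                   (write-char #\% out)
                   (write-char (digit-char (floor code 16) 16) out)
                   (write-char (digit-char (mod code 16) 16) out))))))

;; Usage: encode the signature of the OBJECT variable from the example.
(percent-encode
 (make-signature "BORDEAUX-THREADS" "OBJECT" "VARIABLE" 25 16 22))
```

Applied to the example, the encoded string matches the part of the ticket after the # character. How the corpus, language and path components are assembled in front of it is not covered here.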

Figure 1: Kythe Calling the Lisp Indexer

In the edge at the end of Section 2, node1 and node2 are the first and second nodes discussed above. For full details on Kythe's schema please see https://kythe.io/docs/schema/.

3 STRATEGY

In an out-of-band process, we start up a Lisp indexing service, and have it load all the code required to populate the who-calls database with the requisite information. This is essentially how SLIME determines jump-to-definition targets (along with some heuristics needed for problems discussed later).

You may have:

    foo.lisp uses bar.lisp

The Lisp indexing plugin loads bar.lisp and foo.lisp into the Lisp image and the Lisp implementation determines the cross-reference information locally. If you are trying to create all cross-references for foo.lisp and bar is a function defined in bar.lisp, we can inspect the who-calls database to get this cross-reference.

In SBCL [3] you get all of the top-level defun and defvar forms, but none of the top-level forms that don't define a data structure that is needed later. For example, code that is run at start-up time, such as (setf *foo* 'foo), at the top level may not have a cross-reference in the who-calls database because the compiler can compile the call away. We will go through some examples.

Local variable bindings aren't stored in the who-calls database. If you have a function

    (defun print-a (a)
      (print a))

you would like to have a cross-reference from the a in print-a's lambda list to its use in the function's body. This is not stored in the who-calls database. To solve cases such as this we have a number of parsers (e.g. a "defun" parser) that will get the symbols to be bound and store their location. Iterating through all of the code, with the correct set of parsers, will give us all of the local definitions. Currently our parser is only a decent heuristic, and our method parser does not correctly cross-reference types.

Next we have hidden parameters that don't show up in the code or the who-calls database. Take for example:

    (defstruct bear cat)

    (defun set-bear-cat-friendly (my-bear-cat)
      ... lots of code ...
      (setf (bear-cat my-bear-cat) 'friendly))

We would like a reference from the bear-cat setter to the cat slot in the bear structure. In (most) Lisps this would be fine: we would just add a call to who-calls for (setf bear-cat). However, the Lisp language specification does not require such a function to exist. In fact SBCL does not create setf functions for structure-objects, so we must start by going through the code and creating location references for all structure-object accessors.

4 INTER-LANGUAGE REFERENCES

We often make calls from one language into another language, for example Lisp's foreign function calls into C. At Google, the most common format for data interchange between systems is called Protocol Buffers [2], or protobuf for short. A protobuf is a data interchange format that a language can implement.

To implement support for protobuf messages, languages can use their native structures, but they must serialize the messages into a standard format before sending them out. Then any other language that implements the protobuf standard can deserialize and read the messages. The content of the messages can be deserialized without knowledge of the protobuf schema used, but a protobuf schema detailing types and names is required for human-readable output. Here is an example protobuf schema defining one "message" (a structure) that contains a string:

    syntax = "proto2";
    package example;

    message HelloWorld {
      optional string hello_world_string = 1;
    }

Below we have Lisp code that creates the Lisp standard-object corresponding to the structure.

    (let ((my-proto
            (make-instance 'example:hello-world
                           :hello-world-string "hello-world")))
      (print (hello-world-string my-proto)))

We would like a reference from "hello-world-string" in the Lisp code to the "hello_world_string" in the protobuf schema. Kythe is just a database service that stores a graph of the code for contextualization in a language-agnostic form, so as long as you know the signature for the "hello_world_string" you can just create a cross-reference in Kythe.

5 MACROS

The use of a small number of parsers to understand local bindings is not ideal, but it is doable for the built-in commands. In contrast, Common Lisp is known for its powerful syntax-extending ability, namely macros. For a detailed look at macros please consult Let Over Lambda [6]; we will go over a basic example below.

    (defvar *process-data-mutex* (sb-thread:make-mutex))

    (defmacro with-data-mutex ((mutex) &body body)
      `(let ((,mutex *process-data-mutex*))
         (sb-thread:get-mutex ,mutex)
         ,@body
         (sb-thread:release-mutex ,mutex)))

    (defun process-data (data)
      (with-data-mutex (data-mutex)
        (format t "I have mutex ~a" data-mutex)
        (print data)))

The variable "data-mutex" is bound in the macro with-data-mutex, but we would need to create a new parser for with-data-mutex! This technique is inherently non-scalable; sadly we do not yet have a solution.

Two possible ways have been brought up to extend our indexer's support for macros. The first is updating the brief support for SBCL in the who-calls database, or in a contrib. This would necessarily be tightly bound to SBCL, and any language which wants decent macro cross-references would have to do the same. The other idea is to implement a code walker that expands macros and determines what variables are being bound during expansion. This would be less robust, but it would be compiler-independent.

REFERENCES

[1] Armed Bear Common Lisp. https://abcl.org/.
[2] Protocol Buffers. https://developers.google.com/protocol-buffers. Accessed: 2020-02-10.
[3] Steel Bank Common Lisp. http://www.sbcl.org/.
[4] Kythe: A pluggable, (mostly) language-agnostic ecosystem for building tools that work with code. https://kythe.io/, 2019. Accessed: 2020-02-10.
[5] Slime: The superior lisp interaction mode for emacs. https://www.economics.utoronto.ca/osborne/latex/BIBTEX.HTM, 2019. Accessed: 2020-02-10.
[6] Doug Hoyte. Let Over Lambda. Lulu.com, 2008.
[7] Stelian Ionescu. Bordeaux threads. https://github.com/sionescu/bordeaux-threads.

6 DOCUMENTATION

Kythe creates a code graph with nodes representing objects such as functions and variables, edges connecting those objects, and properties of the objects themselves. For functions this can include their docstrings and their variables. For globals this also includes their docstrings. Lisp makes this easy: the docstring of a function or global is available at run time as part of its description. That way, when you parse a function, you can just send Kythe its comment as a Kythe graph node.
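Standard Common Lisp exposes docstrings through the documentation function, so an indexer running inside the image can read them directly. The following is a minimal sketch: print-a and *greeting* are illustrative definitions, and docstring-facts is a hypothetical helper. How the result is packaged into Kythe nodes is not shown, since the text above does not specify it.

```lisp
(defun print-a (a)
  "Print A and return it."
  (print a))

(defvar *greeting* "hello"
  "Greeting used by the example.")

(defun docstring-facts (symbol)
  "Return a plist of the docstrings attached to SYMBOL.
Hypothetical helper: a real indexer would turn these into graph nodes."
  (list :function (and (fboundp symbol) (documentation symbol 'function))
        :variable (documentation symbol 'variable)))

;; Usage: look up the docstrings of the two example definitions.
(docstring-facts 'print-a)
(docstring-facts '*greeting*)
```

The standard permits an implementation to discard docstrings, so an entry may be NIL even when the source contains one. The sketch assumes an implementation such as SBCL that keeps them.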

7 SO WHAT IS IN THIS FOR ME?

The beauty of Kythe is that you get a graph of your code sitting in a database that you can use for code hyperlinking (as with Slime) or any other kind of code introspection. You can make code graphs over large projects, or multiple projects, without needing anything loaded in a REPL; the indexing is completely out-of-band. You could power Emacs without having to load Slime, though that seems far-fetched as we already have Slime. You can create a Kythe plugin for your own favorite inter-operating language, and have cross-references between them. For example, an index for a Java-based Lisp (ABCL [1]) is well within reason!

8 FUTURE WORK

Sadly, we do not have a great answer to local bindings with macros. Macros are hard, and syntax is always changing. My current working idea is to use a code walker and inspect the environment as we go.

The Kythe Lisp plugin only works for SBCL. It would be nice to get it to work for every Common Lisp, or at least the major implementations of Common Lisp. Since the code stopped trying to be generic a while back, this would take a little bit of effort.

Kythe itself is an open source system, as are several language plugins such as those for C++ and Java. We plan to open source the Common Lisp plugin.
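To make the code-walker idea concrete, the sketch below expands a macro call once with macroexpand-1 and collects the symbols bound when the expansion is a let or let* form. It is a toy under loud assumptions, not the planned implementation: it does not descend into bodies, track the environment, or record source locations. The names bound-variables and binding-name are hypothetical, and acquire-lock and release-lock are hypothetical stand-ins for the SBCL mutex calls so that the example stays portable.

```lisp
(defvar *process-data-mutex* :mutex-placeholder)

;; Hypothetical stand-ins for sb-thread:get-mutex and sb-thread:release-mutex.
(defun acquire-lock (lock) lock)
(defun release-lock (lock) lock)

(defmacro with-data-mutex ((mutex) &body body)
  `(let ((,mutex *process-data-mutex*))
     (acquire-lock ,mutex)
     ,@body
     (release-lock ,mutex)))

(defun binding-name (binding)
  "Return the variable named by one LET binding: either VAR or (VAR INIT)."
  (if (consp binding) (first binding) binding))

(defun bound-variables (form)
  "Expand FORM once; if the result is LET or LET*, return the symbols it binds."
  (let ((expansion (macroexpand-1 form)))
    (if (and (consp expansion)
             (member (first expansion) '(let let*)))
        (mapcar #'binding-name (second expansion))
        '())))

;; Usage: discover that WITH-DATA-MUTEX binds DATA-MUTEX without a
;; hand-written parser for that macro.
(bound-variables '(with-data-mutex (data-mutex) (print data-mutex)))
```

A usable walker would also have to recurse through the expansion, handle the other binding forms, and map each discovered symbol back to its position in the original source, which is the hard part for an indexer.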