Some good readings

And a brief overview of each, written by myself.

Also take a look at the MIT Probabilistic Computing Project's Reading List



Introduction to Mathematical Logic
by Alonzo Church

Formal logic is the foundation of most modern scientific breakthroughs. This book provides a strong and thorough stepping stone into mathematics.

Set Theory and Its Logic
by Willard Van Orman Quine

This book delves into the mathematical theory of classes. I'll leave a quote by Zermelo, which Quine mentions in the introduction: "Set theory is that branch of mathematics whose task is to investigate mathematically the fundamental notions of 'number', 'order', and 'function' taking them in their pristine, simple form, and to develop thereby the logical foundations of all of arithmetic and analysis."

The Calculi of Lambda-Conversion
by Alonzo Church

The content of this work has heavily served for advancement in the field of computation. At its core lies functions and abstraction as concepts for a formal system. Also, it's extremely well written.

Introduction to Probability
by Dimitri P. Bertsekas & John N. Tsitsiklis

This book effectively communicates basic concepts in probability theory and thoroughly describes them by both motivation and formal expression.

Markov Chains and Mixing Times [pdf]
by David A. Levin & Yuval Peres

This is a comprehensive text on Markov chains, focused on determining the rate of convergence to the stationary distribution.


Introduction to the Theory of Computation
by Michael Sipser

This book covers the foundations of computation: automata & languages, computability theory, and complexity theory.

Introduction to Algorithms
by Thomas H. Cormen, Ronald L. Rivest, Charles E. Leiserson, & Clifford Stien

This book covers algorithms and algorithmic design techniques that are crucial to efficiently solving large classes of mathematical problems.

The Structure and Interpretation of Computer Programs
by Harold Abelson, Gerald Sussman, & Julie Sussman

This book teaches principles of computer programming. Its focus is on abstraction, recursion, modularity, and interpreters. It can be appreciated more after reading The Calculi of Lambda-Conversion by Alonzo Church (see above).

Principles of Computer System Design
by Jerome H. Saltzer & M. Franz Kaashoek

This book analyzes fundamental principles and abstractions in computer system design. It teaches how to analyze computer systems and how to design good computer systems.

Artifical Intelligence: A Modern Approach
by Stuart Russel & Peter Norvig

This book is standard for understanding what AI typically refers to, and covers ideas for solving problems, including the roles of learning, reasoning, planning, and predicting in solving goal-oriented tasks.

Elements of Statistical Learning: Data Mining, Inference, and Prediction
by Trevor Hastie, Robert Tibshirani, & Jerome Friedman

This book covers the field of statistics as it applies to learning from data. It explains concepts for learning in a statistical framework as a mean for solving variants of a basic learning problem. It covers topics including regression, classification, neural networks, inference, and graphical models.

Machine Learning: A Probabilistic Perspective
by Kevin P. Murphy

This book applies the tools of probability theory to machine learning, where the goal of machine learning is "to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest." It covers topics including generative models, Bayesian variants of regressors and classifiers, Gaussian processes, {exact, variational, & Monte Carlo} inference, clustering, structure learning, and deep learning.

Reinforcement Learning: An Introduction [pdf]
by Richard S. Sutton & Andrew G. Barto

This book covers ideas and algorithms of reinforcement learning (as both planning and learning). It discusses dynamic programming for perfect models, Monte Carlo methods, and temporal-difference learning methods such as Q-learning. It expands on these methods for approximation in large or infinite state spaces. For more reinforcement learning, check out Deep Reinforcement Learning: An Overview by Yuxi Li (on arXiv).

Deep Learning in Neural Networks: An Overview [arXiv]
by Jürgen Schmidhuber

This preprint of an overview of deep learning covers the historical progression of artificial neural networks and their applications, primarily focused in supervised an unsupervised learning settings with some coverage of reinforcement learning. From data handling to backpropagation, feedforward networks to recurrent networks, and shallow to deep networks, this work serves as a handbook for deep learning, particularly in an historical context.

Natural Intelligence

Society of Mind
by Marvin Minky

This classic book models intelligence as a well-organized collection of simple interacting agents. It develops theories for learning, memory, action, and the processing of language within this system. It is extremely well written, and a must-read for any study of intelligence.

Gödel, Escher, Bach: An Eternal Golden Braid
by Douglas R. Hofstadter

This classic book discusses themes in mathematics, art, and music as metaphors for (a theory of) the emergence of cognition. Its focus on the philosophy of mind is maintained primarily as an undertone, though the analogies become apparent as the book progresses. It is extremely well written.

The Origin of Concepts
by Susan Carey

Overview TODO

Research Papers

Artificial Intelligence

Combining Q-Learning and Search with Amortized Value Estimates [arXiv]
(2019) Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Pfaff, T., Weber, T., Buesing, L., & Battaglia, P.W.

Overview TODO

Write, Execute, Assess: Program Synthesis with a REPL [arXiv]
(2019) Ellis, K., Nye, M.I., Pu, Y., Sosa, F., Tenenbaum, J.B., & Solar-Lezama, A.

Overview TODO

Synthetic Datasets for Neural Program Synthesis [pdf]
(2019) Shin, R., Kant, N., Gupta, K., Bender, C., Trabucco, B., Singh, R., & Song, D.X.

Overview TODO

Relational inductive biases, deep learning, and graph networks [arXiv]
(2018) Battaglia, P.W., Hamrick, J.B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V.F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gülçehre, Ç., Song, H.F., Ballard, A.J., Gilmer, J., Dahl, G.E., Vaswani, A., Allen, K.R., Nash, C., Langston, V., Dyer, C., Heess, N.M., Wierstra, D., Kohli, P., Botvinick, M.M., Vinyals, O., Li, Y., & Pascanu, R.

Graph networks provide a general framework for encoding relational inductive biases in structured representations. "Blocks" that compute on graphs are comprised of three update functions (on edges, vertices, and a global attribute) and three aggregation functions (on edges incident each vertex, all edges, and all vertices). The authors demonstrated that this framework with neural nets for the update functions could e.g. learn to find the shortest path on graphs and could learn to perform physical simulations with transfer to systems that had different numbers of entities. Notable gaps on GNs are that control flow and recursion are difficult to represent and require seperate components for interpretation.

Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis [arXiv]
(2018) Bunel, R., Hausknecht, M.J., Devlin, J., Singh, R., & Kohli, P.

Overview TODO.. Introduces the Karel dataset..

Tree-to-tree Neural Networks for Program Translation [arXiv]
(2018) Chen, X., Liu, C., & Song, D.
Learning Explanatory Rules from Noisy Data [arXiv]
(2017) Evans, R., & Grefenstette, E.

Differentiable inductive logic programming by treating ILP problems as binary classification. Robust to noise and capable of working with raw perceptual input (e.g. pixels). It restricts a lot of structure (e.g. all predicates have arity less than three and (WLOG) are defined with exactly two clauses. There are other explicit restrictions), allowing for a neural architecture which generically operates on logic programming. See the classification likelihood function in Section 4.2 for a high-level understanding of the architecture. It takes a "world" (language and background assumptions), rule templates (which specify ranges of clauses that can be generated), and positive and negative examples to produce a target classifier.

RobustFill: Neural Program Learning under Noisy I/O [arXiv]
(2017) Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A. R., & Kohli, P.

Input/output examples are inputs to a double-attentional LSTM that produces hidden state. Each I/O example for a program has its own such layers with shared weights, and the hidden states of those layers are pooled at each timestep before sent to a softmax yielding a real vector in the number of tokens in the program vocabulary. This architecture is trained for 256 million random programs with 4 I/O examples each. Finally, the output of the softmax is passed into a beam search decoder that outputs the highest-scoring program. The researchers also experiment with an approach that does not decode a program and instead gets the output directly from the network architecture via latent program representation.

DeepCoder: Learning to Write Programs [arXiv]
(2017) Balog, M., Gaunt, A.L., Brockschmidt, M., Nowozin, S., & Tarlow, D.

Uses an encoder/decoder to learn a distribution over programs conditioned on input/output examples. The encoder maps examples to a latent real-valued vector, and the decoder maps that vector to logits of each primitive DSL function appearing in the source code. These logits are used to guide program enumeration favoring likely programs based on presence of the primitives.

Mastering the game of Go with deep neural networks and tree search [link]
(2016) Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Driessche, G.V., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T.P., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D.

Learns to play Go with reinforcement learning. Actions are sampled using a policy network and evaluated using a separate value network. Monte Carlo tree search is used to select actions that maximize value with some bias for exploration. The policy network's state is initialized using the result of supervised learning on expert data and subsequently learned via self-play. The later iteration of this work, AlphaGo Zero, is completely self-play based where domain knowledge is limited to rules (used during tree search), scoring, grid representation (19x19 is baked into the NN architecture), and board symmetry (for training invariance and action sampling).

Example-Directed Synthesis: A Type-Theoretic Interpretation [pdf]
(2016) Frankle, J., Osera, P., Walker, D., & Zdancewic, S.

The authors formalize examples as refinement types, and describe a synthesis procedure for synthesizing programs that satisfy such examples based on theorem-proving. Their formalism allows for rich specifications in the type system that don't necessitate concrete values and permits negative constraints. Synthesis of programs converts a single type problem to potentially many subproblems by either manipulating the environment ("left-rule") or by constructing an expression ("right-rule").

The computational origin of representation and conceptual change [pdf]
(2016) Piantadosi, S.T.

Overview TODO

Neuro-Symbolic Program Synthesis [arXiv]
(2016) Parisotto, E., Mohamed, A., Singh, R., Li, L., Zhou, D., & Kohli, P.

This introduces a technique for synthesizing programs in a domain-specific language from input-output examples. It maps the examples to a continuous space using a selection of encoders with deep recurrent networks, and sends those encoded examples to neural networks that generate and encode partial solution trees.

Probabilistic data analysis with probabilistic programming [arXiv]
(2016) Saad, F., & Mansinghka, V.

This is a thorough introduction and overview of probabilistic programming and its applications with the conceptual and mathematical formalism for Composable Generative Population Models (CGPMs). Figure 8 (p. 23) has a really useful example and a diagram describing the flow of logic at a high level.

Building Machines that Learn and Think Like People [arXiv]
(2016) Lake, B. M., Ullman, T. D., Tenenbaum. J. B., & Gershman, S. J.

This reviews progress in cognitive science and AI research, and argues for future direction to get close to building machines that learn and think like people. This primarily involves the building of explanatory and causal models rather than pattern-recognizers, initializing systems with intuitive theories of physics and psychology, and build machings that support learning-to-learn to accelerate the learning process for related tasks.

Do you see what I mean? Visual resolution of linguistic ambiguities [arXiv]
(2016) Berzak, Y., Barbu, A., Harari, D., Katz, B., & Ullman, S.

This describes a model that uses visual context to disambiguate a sentence. It constructs an HMM for each predicate interpreted from a query sentence, which are connected to variable trackers that are themselves HMMs for potential bounding boxes. A sort-of linguistic encoder then constrains the system to resolve the ambiguities.

Understanding visual concepts with continuation learning [arXiv]
(2016) Whitney, W. F., Chang, M., Kulkarni, T., & Tenenbaum, J. B.

This describes a model for disentangling visual concepts by learning the transformations of input over time, yielding a symbolic representation for objects and motion. It uses a gating mechanism which receives from two consecutive frames. The goal is essentially to learn concepts like lighting and azimuth, which were given in a supervised manner in Kulkarni's deepconv paper (see below).

Neural Programmer-Interpreters [arXiv]
(2015) Reed, S. E., & de Freitas, N.

Learns to represent and execute programs by training on execution traces. The traces are used per-subroutine, permitting composition with previously learned subroutines. Equations 1-2 are critical to understanding the system's operation. The system is parameterized by domain-specific differentiable state encoders, typically multilayer perceptrons, which effectively allow use of intermediate computation. The execution traces used for training are tuples of the environment, arguments (both inputs to the state encoder) and program-ID for input, and the next program-ID with a return value if the program should terminate for output.

Human-level concept learning through probabilistic program induction [pdf]
(2015) Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B.

This describes a model for program induction that aims to perform well in one-shot learning scenarios — to successfully generalize from a single example. It introduces Bayesian program learning. In the context of handwritten characters (the focus of the paper), it achieves human-level performance and outperforms deep learning.

Deep convolutional inverse graphics network [arXiv]
(2015) Kulkarni, T. D., Whitney, W. F., Kohli, P., & Tenenbaum, J. B.

This describes a system centered on performing analysis by synthesis. It uses a hybrid encoder-decoder model to learn a disentangled representation for transformations like lighting at azimuthal rotation. After the encoding step, it uses a generic Stochastic Gradient Variational Bayes estimator to backpropagate reconstruction error when decoding. Clamping is used for (and against) training on each transformation to disentangle the representation.

Concepts in a probabilistic language of thought [pdf]
(2014) Goodman, N. D., Tenenbaum, J. B., & Gerstenberg, T.

This describes knowledge of concepts in a probabilistic language of thought, formally represented as functions in an enriched stochastic lambda calculus. Primary features are compositionality and symbolic scaffolding for productive thought, and probabilistic inference for successful reasoning under uncertainty.

Bootstrap learning via modular concept discovery [pdf]
(2013) Dechter, E., Malmaud, J., Adams, R. P., & Tenenbaum, J. B.

This describes a model for concept learning as a problem of program induction. It represents programs as a subset of simply-typed lambda calculus restricted to primitive combinators, and defines a stochastic grammer over programs. The frontier of promising programs are split into partial solutions which are preferentially enumerated to decide on modular subtrees which yield the most compressive set of solutions. The experiments are in symbolic regression and Boolean function learning.

Structure Discovery in Nonparametric Regression through Compositional Kernel Search [pdf]
(2013) Duvenaud, D. K., Lloyd, J. R., Grosse, R. B., Tenenbaum, J. B., & Ghahramani, Z.

This presents an automatic statistician, focused on regression problems. It uses kernel families and their attributes, and compositions of those kernels, as an expressive language of models. It does stochastic search over an implementation of Bayesian Occam's Razor on the model language. Also check out the follow-up paper and source code for more info.

Bootstrapping in a language of thought: A formal model of numerical concept learning [link]
(2012) Piantadosi S. T., Goodman, N. D., & Tenenbaum, J. B.

This describes a model for making elegant inductive leaps on a logical system. It is built on a lambda calculus with a collection of sort-of cognitive primitives, and inference is done with a Markov-chain Monte-Carlo algorithm to find the highest probability lexicon.

Theory learning as stochastic search in a language of thought [pdf]
(2012) Ullman, T. D., Goodman, N. D., & Tenenbaum, J. B.

This describes a hierarchical Bayesian framework for theory acquisition, aimed at being an algorithmic model for the development of theories in children. It uses a construction of statements with probabilistic Horn clause grammars starting with foundational predicates, then with theories as simple laws relating them, then with concrete models to extend the predicates to objects in the given domain, then finally tying to the observations to yield constraints for inference.

Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis [pdf]
(2011) McDermott, J. H., & Simoncelli, E. P.

This describes a model for processing real-world sound textures. Figure 1 shows an overview of the process for analyzing waveforms to obtain values of statistical representations for the sounds. Sound is then synthesized by applying an iterative procedure to random knows which converges to meeting the same original statistics. There are discrepancies in realism when sound perception involves pitch, rhythm, and reverberation. Introducing feedback mechanisms to perpetuate perceptions would certainly improve the system, and also aligns with typical musical recognition.

Church: a language for generative models [pdf]
(2008) Goodman, N. D., Mansinghka, V. K., Roy, D. M., Bonawitz, K., & Tenenbaum, J. B.

This introduces a universal language for describing stochastic generative processes. The most remarkable features are its querying capabilities with conditional sampling, stochastic memoization, and generic schemes for inference. The language, Church, is a generalization of Scheme which introduces probabilistic computation to the language. The most useful resource for probabilistic models of cognition using Church is at

A rational analysis of rule-based concept learning [pdf]
(2008) Goodman, N. D., Tenenbaum J. B., Feldman J., & Griffiths, T. L.

This describes a model for concept learning with rational analysis. It starts with a context-free grammar comprising of a set of feature predicates and logical connectives, and defines a syntactic prior over formulae of the language to capture simplicity bias and a likelihood corresponding to the explanatory completeness of a formula. It goes on to define the Rational Rules model which joins these priors and likelihoods to yield generalized inference consistent with the standard practices of Bayesian modeling.

The rational basis of representativeness [pdf]
(2001) Tenenbaum, J. B. & Griffiths, T. L.

This defines representativeness in the context of Bayesian inference. It analyzes how Bayesian inference stands compared to statistical and inductive methods in addressing the task of categorization. It features the famous example of coin flips and addressing whether a coin is biased.

Distributed Systems

No compromises: distributed transactions with consistency, availability, and performance [pdf]
(2015) Dragojević, A., Narayanan, D., Nightingale, E. B., Renzelmann, M., ... & Castro, M.
Large-scale cluster management at Google with Borg [pdf]
(2015) Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., & Wilkes, J.
Wormhole: reliable pub-sub to support geo-replicated internet services [pdf]
(2015) Sharma, Y., Ajoux, P., Ang, P., Callies, D., Choudhary, A., ... & Kumar, S.
In search of an understandable consensus algorithm [pdf]
(2014) Ongaro, D., & Ousterhout, J.
The tail at scale [pdf]
(2013) Dean, J., & Barroso, L. A.
Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing [pdf]
(2012) Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., ... & Stoica, I.
ZooKeeper: wait-free coordination for Internet-scale systems [pdf]
(2010) Hunt, P., Konar, M., Junqueira, F. P., & Reed, B.
PNUTS: Yahoo!'s hosted data serving platform [pdf]
(2008) Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., ... & Yerneni, R.
Dynamo: amazon's highly available key-value store [pdf]
(2007) DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., ... & Vogels, W.
Chord: A scalable peer-to-peer lookup service for internet applications [pdf]
(2001) Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H.

Copyright (c) 2016-2023 Lore Anaya Pozo