\pdfoutput=1
\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage{EMNLP2023}
\usepackage{times}
\usepackage{latexsym}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{microtype}
\usepackage{inconsolata}

\title{Symbolic Logic and Syntax in a Generative Model of Text}

\author{Greg Coppola \\
  {\em Apocalypse Like Right Now} \\
  \texttt{greg@coppola.ai}}

\begin{document}
\maketitle

\section{A Symbolic Probability Database}

\subsection{The Concept of a Discrete Probabilistic Database}

We refer generally to a {\em discrete logical database} as a set of pairs $\left\{(z, p)\right\} \subseteq \ell \times [0, 1]$, where each $z$ is a statement in the logical language $\ell$ and each $p$ is its associated probability.
One example of such a database would be a {\em Bayesian Network} \citep{koller:propabilistic-graphical-models:2009}.
A Bayesian Network could satisfy all of the properties we need for evaluating and learning from sentences in $\ell$, except for the problem of {\em universal quantification}, i.e., sentences of the form $\forall x\, \left(f(x) \rightarrow g(x)\right)$.
More generally, we can look to various theories of {\em logic programming} \citep{deraedt:probabilistic-inductive-logic-programming:2008}, which do attempt to address the quantification issue.
So far, however, there seems to be no unity or standardization in theories of logical inference comparable to what exists for neural network inference.

\subsection{Motivation for a Discrete Probabilistic Database}

We propose that making the knowledge database $K$ discrete will help avoid the problem of hallucinations, because the database makes two things explicit:
\begin{itemize}
\item discrete database membership: for a given sentence $z$, either $(z, p) \in K$ for some $p$, or it is not in $K$ at all.
We can thus detect whether a fact has actually been seen, and incorporated, before.
\item ability to evaluate likelihood: for a given sentence $z$ that is in $K$, we can evaluate whether its associated probability $p$ is high or low.
\end{itemize}
If this information is correctly constructed, hallucination can be avoided, because facts that have never been seen are simply not in the database.

\subsection{Requirements of a Logical Database}

In order to incorporate a discrete probabilistic model into our generative story, we must assume a system with the following properties:
\begin{itemize}
\item given a model $\Phi$, the system can efficiently assign a probability $P_\Phi(z)$ to any $z \in \ell$
\item given a database of logical facts $\left\{z_n\right\}_{n \in W}$, where $W$ indexes the facts observed from the universe of facts $U$, the system can be retrained by calling $\Phi = \textsc{ReTrain}(\left\{z_n\right\}_{n \in W})$
\end{itemize}
Presumably, we would want to be able to make ``on-line'' updates to the model, since it will not be practical to re-estimate all parameters all of the time.
However, we want to leave $\Phi$ an abstract object, and so we set aside the question of how updates could be made online, in order not to constrain the model class prematurely; we leave such optimizations for future work.

\section{Syntax and Semantics}

Let $z_n$ be a logical form, $x_n$ a paragraph of surface tokens, and $y_n$ a parse that mediates between them.
For a given paragraph $x_n$, the set of candidate parses $C(x_n)$ will grow more than exponentially in the length of $x_n$.
However, in practice this space can be pruned, if necessary using $k$-best pruning from an existing parsing model \citep{charniak:coarse-to-fine:2005,huang:forest-reranking:2008}.
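To make this pruning concrete, the following sketch (in Python) shows one way the pruned candidate set $C(x_n)$ might be represented; the helpers \texttt{kbest\_parse} and \texttt{logical\_form} are hypothetical stand-ins for an off-the-shelf $k$-best parser and for the parse-to-logical-form mapping assumed just below, so this is an illustration under assumptions rather than a commitment to a particular parser.
\begin{verbatim}
# Sketch: representing the pruned candidate set C(x_n).
# Assumes a hypothetical k-best parser `kbest_parse(text, k)` returning
# (parse, score) pairs, and a hypothetical mapping `logical_form(parse)`
# from a parse to the logical form it determines (an assumption made
# explicit just below).

from dataclasses import dataclass
from typing import Any

@dataclass
class Candidate:
    parse: Any           # a syntactic analysis y that yields the text x
    logical_form: Any    # the logical form z determined by y
    parser_score: float  # discriminative score, used only for pruning

def candidate_set(x_n: str, k: int = 50) -> list:
    """Approximate C(x_n) by k-best pruning with an existing parser."""
    candidates = []
    for parse, score in kbest_parse(x_n, k):  # hypothetical parser API
        z = logical_form(parse)               # y uniquely determines z
        candidates.append(Candidate(parse, z, score))
    return candidates
\end{verbatim}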
We will assume that a parse $y$ uniquely determines a logical form $z$, so that the set of candidate parses for a sentence $x$, written $C(x)$, can be taken to be a set of pairs $(y, z)$, where $y$ is a parse that yields $x$ and $z$ is the unique logical form determined by $y$.

\section{Generating Text with Latent Logical Forms}

\subsection{ChatGPT's Generative Story}

Let ${\bf x} = \left[x_n\right]_{n=1}^N$ be a document of $N$ paragraphs, each of length $W$ tokens.
A ChatGPT-style language model \citep{radford:improving-language-understanding-by-generative:2018} models the probability of a document as:
\[
p({\bf x}) = \prod_{n=1}^{N} p(x_n \mid x_{n-1})
\]
That is, in the generative story, each paragraph is created based only on the previous one, with no hypothesized latent state of any kind.

\subsection{High-Level Flow of a Latent Variable Generative Story}

We want to create a generative story, which we call {\em SymbolicGPT}, in which, to create $x_n$ based on $x_{n-1}$, we:
\begin{itemize}
\item with probability $p(z_n \mid x_{n-1}; \theta, \Phi)$, choose a logical form $z_n$, based on (encoding vectors from) the previous paragraph $x_{n-1}$, the neural parameters $\theta$, and the prior likelihood of articulating the logical concept $z_n$ according to the logical database $\Phi$
\item with probability $p(y_n \mid x_{n-1}, z_n; \theta)$, choose a syntactic analysis $y_n$, based on $z_n$, (encoding vectors from) the previous paragraph $x_{n-1}$, and the neural parameters $\theta$
\item read off the output text $x_n$, which is determined uniquely by $y_n$
\end{itemize}

\subsection{Factorization of Logical Forms}

We suppose that the $x_n$ are {\em paragraphs}, which is to say sequences of sentences.
We assume that the document or corpus arrives segmented into paragraphs $x_n$, and that each $x_n$ contains within it logically separable sentences $[s_1, \ldots, s_{S_{x_n}}]$; tokenization within $x_n$, however, is handled by the model we are describing.
The idea is that, since each $x_n$ represents a sequence of logically separable sentences, we can use the distribution over interpretations of one sentence within $x_n$ to help inform the interpretation of another sentence within $x_n$.
Accordingly, the generative story includes a distribution $p(z_n \mid x_{n-1})$, rather than $p(z_n \mid z_{n-1}, x_{n-1})$; that is, $z_n$ does {\em not} depend on $z_{n-1}$, in order to avoid a blow-up in the complexity, both conceptual and run-time, of the computational graph.

\subsection{Generating the Logical Form}

First, we choose a logical form $z_n$, based on $x_{n-1}$, $\theta$, and $\Phi$:
\[
p(z_n \mid x_{n-1}, \Phi, \theta)
\]
One straightforward way to factor this is as a (renormalized) product of the syntactic likelihood of producing $z_n$ (as a syntactic object) in the given context:
\[
p(z_n \mid x_{n-1}, \theta)
\]
and the logical likelihood of saying $z_n$ at all, according to $\Phi$:
\[
p(z_n \mid \Phi)
\]

\subsection{Generating the Parse}

We next choose the parse $y_n$ in the normal transformer way:
\[
p(y_n \mid z_n, x_{n-1}; \theta)
\]
There are a number of different parsing formalisms (CFG parsing, unlabeled dependency parsing, labeled dependency parsing), and we are not currently proposing to choose between them.
The only requirement of the parsing formalism is that it must map a surface form $x_n$ to some $z_n$ that we can reason with.
One can use a discriminative parser to constrain the space, and this would create a parse in a bottom-up fashion, as sketched below.
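As a rough illustration of how this bottom-up pruning and the generative factors above might fit together, the following sketch (in Python) scores each pruned candidate $(y_n, z_n)$ for a paragraph $x_n$; the functions \texttt{p\_z\_given\_context}, \texttt{p\_z\_given\_phi}, and \texttt{p\_y\_given\_z\_context} are hypothetical stand-ins for $p(z_n \mid x_{n-1}; \theta)$, $p(z_n \mid \Phi)$, and $p(y_n \mid z_n, x_{n-1}; \theta)$, and \texttt{candidate\_set} is the pruning sketch given earlier.
\begin{verbatim}
# Sketch: scoring pruned candidates with the generative factors.
# `candidate_set` is the pruning routine sketched earlier; the three
# probability functions are hypothetical interfaces to the neural
# parameters (theta) and the logical database (Phi) described above.

import math

def score_candidates(x_prev: str, x_n: str, theta, phi, k: int = 50):
    """Return (candidate, unnormalized log-score) pairs for x_n."""
    scored = []
    for cand in candidate_set(x_n, k):  # bottom-up k-best pruning
        log_p = (
            math.log(p_z_given_context(cand.logical_form, x_prev, theta))
            + math.log(p_z_given_phi(cand.logical_form, phi))
            + math.log(p_y_given_z_context(cand.parse, cand.logical_form,
                                           x_prev, theta))
        )
        scored.append((cand, log_p))
    return scored
\end{verbatim}
Renormalizing these scores over the pruned set would yield an approximate posterior over $(y_n, z_n)$, of the kind needed by the expectation step discussed under parameter estimation below.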
Once the space is pruned and structured, the generative story can be told in a top-down fashion, as in the original generative parsing models \citep{collins:a-new-statistical-parser:1996,charniak:statistical-parsing-with-a-cfg:1997}.

\subsection{Generating the Text}

The text $x_n$, we reiterate, is determined uniquely as a part of the parse $y_n$.

\section{Parameter Estimation}

The problem of parameter estimation in a ChatGPT-style GPT model is the problem of estimating what we are calling $\theta$, the continuously valued parameters of the neural network part of the model.
In our case, we must also estimate $\Phi$, the discrete logical model, and we must hypothesize latent syntactic/logical parses.

\subsection{Expectation Maximization}

The first problem to be solved is that the logical forms are {\em not} part of the given data, and so must be modeled as {\em latent variables}.
The ChatGPT transformer does not use latent variables in the way that they were understood in the context of part-of-speech tagging or syntactic parsing, and this greatly simplifies its training.
However, in order to realize the traditional vision of syntax as mapping logical forms to surface forms \citep{montague:universal-grammar:1970,chomsky:minimalist-program:2014,steedman:syntactic-process:2000}, we will have to involve a logical representation, which is not present in the data, and therefore is latent.
In order to train with latent variables, we can use {\em expectation maximization} \citep{dempster:maximum-likelihood:1977}.
In the {\em expectation} step, we assign distributions over possible parses $y_n$, each of which implies an associated logical representation $z_n$.
In the {\em maximization} step, we optimize $\theta$ and $\Phi$.
Bootstrapping this process will perhaps be difficult.
One option is to start by bootstrapping from a high-resource language, e.g., English.
A syntactic parser can be discriminatively trained on top of a GPT generative model, the way that few-shot learning is done today.
If logical world knowledge can be encoded based on English, then other languages can be mapped onto this logical space.

\subsection{Heterogeneous Training}

The neural network part of the model can be trained using back-propagation.
It is not clear exactly how the logical model would be trained, or whether back-propagation would be appropriate.
We have specified that we require the abstract interface $\Phi = \textsc{ReTrain}(\left\{z_n\right\}_{n \in W})$ to retrain $\Phi$, but how this would be done is difficult future work.
However, in order to allow the most generality, we can train the neural network part heterogeneously from the logical model: the predictions of the logical model can be treated as exogenous features, and back-propagation can work through the exclusively neural part.

\subsection{The Logical Model}

The fundamental difficulty in estimating a model over symbolic logic is the question of what the symbols are.
In other words, the problem for the Bayesian network formulation is that we must identify both a graphical structure $G$, which is not known a priori, and a probability model $\Phi$ over that structure.
The question of how this would work must be left to future work.
However, for a fixed directed Bayesian network with fully observed facts, the problem of parameter estimation is trivial, and just involves counting.

\bibliography{anthology,custom}
\bibliographystyle{acl_natbib}

\end{document}