
Using Artificial Intelligence to Augment Human Intelligence



What are computers for?

Historically, different answers to this question – that is,
different visions of computing – have helped inspire and
determine the computing systems humanity has ultimately
built. Consider the early electronic computers. ENIAC, the
world’s first general-purpose electronic computer, was
commissioned to compute artillery firing tables for the United
States Army. Other early computers were also used to solve
numerical problems, such as simulating nuclear explosions,
predicting the weather, and planning the motion of rockets. The
machines operated in a batch mode, using crude input and output
devices, and without any real-time interaction. It was a vision
of computers as number-crunching machines, used to speed up
calculations that would formerly have taken weeks, months, or more
for a team of humans.

In the 1950s a different vision of what computers are for began to
develop. That vision was crystallized in 1962, when Douglas
Engelbart proposed that computers could be used as a way
of augmenting human intellect. In this view, computers weren’t primarily
tools for solving number-crunching problems. Rather, they were
real-time interactive systems, with rich inputs and outputs, that
humans could work with to support and expand their own
problem-solving process. This vision of intelligence augmentation
(IA) deeply influenced many others, including researchers such as
Alan Kay at Xerox PARC, entrepreneurs such as Steve Jobs at Apple,
and led to many of the key ideas of modern computing systems. Its
ideas have also deeply influenced digital art and music, and
fields such as interaction design, data visualization,
computational creativity, and human-computer interaction.

Research on IA has often been in competition with research on
artificial intelligence (AI): competition for funding, competition
for the interest of talented researchers. Although there has
always been overlap between the fields, IA has typically focused
on building systems which put humans and machines to work
together, while AI has focused on complete outsourcing of
intellectual tasks to machines. In particular, problems in AI are
often framed in terms of matching or surpassing human performance:
beating humans at chess or Go; learning to recognize speech and
images or translating language as well as humans; and so on.

This essay describes a new field, emerging today out of a
synthesis of AI and IA. For this field, we suggest the
name artificial intelligence augmentation (AIA): the use
of AI systems to help develop new methods for intelligence
augmentation. This new field introduces important new fundamental
questions, questions not associated with either parent field. We
believe the principles and systems of AIA will be radically
different to most existing systems.

Our essay begins with a survey of recent technical work hinting at
artificial intelligence augmentation, including work
on generative interfaces – that is, interfaces
which can be used to explore and visualize generative machine
learning models. Such interfaces develop a kind of cartography of
generative models, ways for humans to explore and make meaning
from those models, and to incorporate what those models
“know” into their creative work.

Our essay is not just a survey of technical work. We believe now
is a good time to identify some of the broad, fundamental
questions at the foundation of this emerging field. To what
extent are these new tools enabling creativity? Can they be used
to generate ideas which are truly surprising and new, or are the
ideas cliches, based on trivial recombinations of existing ideas?
Can such systems be used to develop fundamental new interface
primitives? How will those new primitives change and expand the
way humans think?

Using generative models to invent meaningful creative operations

Let’s look at an example where a machine learning model makes a
new type of interface possible. To understand the interface,
imagine you’re a type designer, working on creating a new
font. (We shall egregiously abuse the distinction between a font and a
typeface. Apologies to any type designers who may be
reading.) After sketching some initial designs, you
wish to experiment with bold, italic, and condensed variations.
Let’s examine a tool to generate and explore such variations, from
any initial design. For reasons that will soon be explained, the
quality of the results is quite crude; please bear with us.


Of course, varying the bolding (i.e., the weight), italicization
and width are just three ways you can vary a font. Imagine that
instead of building specialized tools, users could build their own
tool merely by choosing examples of existing fonts. For instance,
suppose you wanted to vary the degree of serifing on a font. In
the following, please select 5 to 10 sans-serif fonts from the top
box, and drag them to the box on the left. Select 5 to 10 serif
fonts and drag them to the box on the right. As you do this, a
machine learning model running in your browser will automatically
infer from these examples how to interpolate your starting font in
either the serif or sans-serif direction:

In fact, we used this same technique to build the earlier bolding,
italicization, and condensing tool. To do so, we used the
following examples of bold and non-bold fonts, of italic and
non-italic fonts, and of condensed and non-condensed fonts:

To build these tools, we used what’s called a generative model; the
particular model we use was trained by James Wexler. To
understand generative models, consider that a priori
describing a font appears to require a lot of data. For
instance, if the font is 64 by 64 pixels, then we’d expect
to need 64 × 64 = 4,096 parameters to describe a single
glyph. But we can use a generative model to find a much simpler
description.

We do this by building a neural network which takes a small number
of input variables, called latent variables, and produces
as output the entire glyph. For the particular model we use, we
have 40 latent space dimensions, and map that into the
4,096-dimensional space describing all the pixels in the glyph.
In other words, the idea is to map a low-dimensional space into a
higher-dimensional space:
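
To make this mapping concrete, here is a minimal sketch of such a decoder network, written in PyTorch. The layer widths and activations are illustrative assumptions on our part, not the architecture of the actual model used in this essay:

```python
import torch
import torch.nn as nn

class FontDecoder(nn.Module):
    """Maps a 40-dimensional latent vector to a 64x64 = 4,096-pixel glyph."""
    def __init__(self, latent_dim=40, glyph_pixels=64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, glyph_pixels),
            nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, z):
        return self.net(z)

decoder = FontDecoder()
z = torch.randn(1, 40)            # one point in the latent space
glyph = decoder(z).view(64, 64)   # the corresponding glyph image
```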

The generative model we use is a type of neural network known as
a variational autoencoder (VAE). For our purposes, the details of the generative
model aren’t so important. The important thing is that by
changing the latent variables used as input, it’s possible to get
different fonts as output. So one choice of latent variables will
give one font, while another choice will give a different font:

You can think of the latent variables as a compact, high-level
representation of the font. The neural network takes that
high-level representation and converts it into the full pixel
data. It’s remarkable that just 40 numbers can capture the
apparent complexity in a glyph, which originally required 4,096
variables.

The generative model we use is learnt from a training set of more
than 50 thousand fonts that Bernhardsson
scraped from the open web. During training, the weights and
biases in the network are adjusted so that the network can output
a close approximation to any desired font from the training set,
provided a suitable choice of latent variables is made. In some
sense, the model is learning a highly compressed representation of
all the training fonts.
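
As a rough sketch of what that training involves, the following shows a VAE objective in PyTorch. The encoder shape, hyperparameters, and loss weighting here are illustrative assumptions, not the setup used to train the essay’s actual font model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FontVAE(nn.Module):
    def __init__(self, latent_dim=40, glyph_pixels=64 * 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(glyph_pixels, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, glyph_pixels), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent vector while
        # keeping the computation differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(reconstruction, x, mu, logvar):
    # Reconstruction term: the output should closely approximate the glyph.
    recon = F.binary_cross_entropy(reconstruction, x, reduction="sum")
    # KL term: keeps the latent codes close to a standard Gaussian,
    # which is what makes the latent space smooth enough to explore.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```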

In fact, the model doesn’t just reproduce the training fonts. It
can also generalize, producing fonts not seen in training. By
being forced to find a compact description of the training
examples, the neural net learns an abstract, higher-level model of
what a font is. That higher-level model makes it possible to
generalize beyond the training examples already seen, to produce
realistic-looking fonts.

Ideally, a good generative model would be exposed to a relatively
small number of training examples, and use that exposure to
generalize to the space of all possible human-readable fonts.
That is, for any conceivable font – whether existing or
perhaps even imagined in the future – it would be possible
to find latent variables corresponding exactly to that font. Of
course, the model we’re using falls far short of this ideal
– a particularly egregious failure is that many fonts
generated by the model omit the tail on the capital
“Q” (you can see this in the examples above). Still,
it’s useful to keep in mind what an ideal generative model would
do.

Such generative models are similar in some ways to how scientific
theories work. Scientific theories often greatly simplify the
description of what appear to be complex phenomena, reducing large
numbers of variables to just a few variables from which many
aspects of system behavior can be deduced. Furthermore, good
scientific theories sometimes enable us to generalize to discover
new phenomena.

As an example, consider ordinary material objects. Such objects
have what physicists call a phase – they may be a
liquid, a solid, a gas, or perhaps something more exotic, like a
superconductor or Bose-Einstein condensate. A priori, such systems
seem immensely complex, involving perhaps 10²³ molecules. But the
theory of phases gives a far simpler description: a handful of
variables, such as temperature and pressure, suffice to determine
much of a system’s behavior. And that simpler description can
generalize: the Bose-Einstein condensate was predicted
theoretically decades before it was first observed in the
laboratory.

Returning to the nuts and bolts of generative models, how can we
use such models to do example-based reasoning like that in the
tool shown above? Let’s consider the case of the bolding tool. In
that instance, we take the average of all the latent vectors for
the user-specified bold fonts, and the average for all the
user-specified non-bold fonts. We then compute the difference
between these two average vectors:

We’ll refer to this as the bolding vector. To make some
given font bolder, we simply add a little of the bolding vector to
the corresponding latent vector, with the amount of bolding vector
added controlling the boldness of the result. (In
practice, sometimes a slightly different procedure is used. In
some generative models the latent vectors satisfy some constraints
– for instance, they may all be of the same length. When
that’s the case, as in our model, a more sophisticated
“adding” operation must be used, to ensure the length
remains the same. But conceptually, the picture of adding the
bolding vector is the right way to think.)
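
In code, the construction is only a few lines. The sketch below uses hypothetical names, and includes the length-preserving variant of “adding” mentioned in the parenthetical note above:

```python
import numpy as np

def attribute_vector(positive_latents, negative_latents):
    """Difference of the average latent vectors for the two example sets.

    positive_latents / negative_latents: arrays of shape (n, 40) holding
    the latent vectors the model assigns to the user's example fonts
    (e.g. bold and non-bold).
    """
    return positive_latents.mean(axis=0) - negative_latents.mean(axis=0)

def apply_attribute(latent, direction, amount, keep_length=True):
    """Move a font's latent vector `amount` of the way along `direction`."""
    shifted = latent + amount * direction
    if keep_length:
        # If the model constrains latent vectors to a fixed length
        # (as ours does), rescale rather than adding naively.
        shifted *= np.linalg.norm(latent) / np.linalg.norm(shifted)
    return shifted
```

Decoding `apply_attribute(z, bolding_vector, 0.5)` would then yield a somewhat bolder version of the font described by `z`.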

This technique was introduced
by Larsen et al, and
vectors like the bolding vector are sometimes called
attribute vectors. The same idea is used to implement all
the tools we’ve shown. That is, we use example fonts to create
a bolding vector, an italicizing vector, a condensing vector, and
a user-defined serif vector. The interface thus provides a way of
exploring the latent space in those four directions.

The tools we’ve shown have many drawbacks. Consider the following
example, where we start with an example glyph, in the middle, and
either increase or decrease the bolding (on the right and left,
respectively):

Examining the glyphs on the left and right we see many unfortunate
artifacts. Particularly for the rightmost glyph, the edges start to get
rough, and the serifs begin to disappear. A better generative
model would reduce those artifacts. That’s a good long-term
research program, posing many intriguing problems. But even with
the model we have, there are also some striking benefits to the
use of the generative model.

To understand these benefits, consider a naive approach to
bolding, in which we simply add some extra pixels around a glyph’s
edges, thickening it up. While this thickening perhaps matches a
non-expert’s way of thinking about type design, an expert does
something much more involved. In the following we show the
results of this naive thickening procedure versus what is actually
done, for Georgia and Helvetica:

As you can see, the naive bolding procedure produces quite
different results, in both cases. For example, in Georgia, the
left stroke is only changed slightly by bolding, while the right
stroke is greatly enlarged, but only on one side. In both
fonts, bolding doesn’t change the height of the font, while the
naive approach does.

As these examples show, good bolding is not a trivial
process of thickening up a font. Expert type designers have many
heuristics for bolding, heuristics inferred from much previous
experimentation, and careful study of historical
examples. Capturing all those heuristics in a conventional program
would involve immense work. The benefit of using the generative
model is that it automatically learns many such heuristics.

For example, a naive bolding tool would rapidly fill in the
enclosed negative space in the enclosed upper region of the letter
“A”. The font tool doesn’t do this. Instead, it goes
to some trouble to preserve the enclosed negative space, moving
the A’s bar down, and filling out the interior strokes more slowly
than the exterior. This principle is evident in the examples
shown above, especially Helvetica, and it can also be seen in the
operation of the font tool:

The heuristic of preserving enclosed negative space is not a
priori obvious. However, it’s done in many professionally
designed fonts. If you examine examples like those shown above
it’s easy to see why: it improves legibility. During training,
our generative model has automatically inferred this principle
from the examples it’s seen. And our bolding interface then makes
this available to the user.

In fact, the model captures many other heuristics. For instance,
in the above examples the heights of the fonts are (roughly)
preserved, which is the norm in professional font design. Again,
what’s going on isn’t just a thickening of the font, but rather
the application of a more subtle heuristic inferred by the
generative model. Such heuristics can be used to create fonts
with properties which would otherwise be unlikely to occur to
users. Thus, the tool expands ordinary people’s ability to
explore the space of meaningful fonts.

The font tool is an example of a kind of cognitive technology. In
particular, the primitive operations it contains can be
internalized as part of how a user thinks. In this it resembles a
program such as Photoshop or a spreadsheet or 3D graphics
programs. Each provides a novel set of interface primitives,
primitives which can be internalized by the user as fundamental
new elements in their thinking. This act of internalization of new
primitives is fundamental to much work on intelligence
augmentation.

The ideas shown in the font tool can be extended to other domains.
Using the same interface, we can use a generative model to
manipulate images of human faces using qualities such as
expression, gender, or hair color. Or to manipulate sentences
using length, sarcasm, or tone. Or to manipulate molecules using
chemical properties:

Images from Sampling Generative Networks by White.

Sentence from Pride and Prejudice by Jane Austen. Interpolated by the authors. Inspired by experiments done by the novelist Robin Sloan.

Images from Automatic chemical design using a data-driven continuous representation of molecules by Gómez-Bombarelli et al.

Such generative interfaces provide a kind of cartography of
generative models, ways for humans to explore and make meaning
using those models.

We saw earlier that the font model automatically infers relatively
deep principles about font design, and makes them available to
users. While it’s great that such deep principles can be
inferred, sometimes such models infer other things that are wrong,
or undesirable. For example, White points out that the addition of
a smile vector in some face models will make faces not just smile
more, but also appear more
feminine. Why? Because in the training data more women than men
were smiling. So these models may not just learn deep facts about
the world, they may also internalize prejudices or erroneous
beliefs. Once such a bias is known, it is often possible to make
corrections. But to find those biases requires careful auditing
of the models, and it is not yet clear how we can ensure such
audits are exhaustive.

More broadly, we can ask: why do attribute vectors work, when do
they work, and when do they fail? At the moment, the answers to these
questions are poorly understood.

For the attribute vector to work, it must be possible to take any
starting font and construct the corresponding bold version by
adding the same vector in the latent space. However, a priori
there is no reason displacing by a single constant vector should
work. It may be that we should displace in many
different ways. For instance, the heuristics used to bold serif
and sans-serif fonts are quite different, and so it seems likely
that very different displacements would be involved:

Of course, we could do something more sophisticated than using a
single constant attribute vector. Given pairs of example fonts
(unbold, bold) we could train a machine learning algorithm to take
as input the latent vector for the unbolded version and output the
latent vector for the bolded version. With additional training
data about font weights, the machine learning algorithm could
learn to generate fonts of arbitrary weight. Attribute vectors
are just an extremely simple approach to doing these kinds of
operations.
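
As a sketch of that more sophisticated alternative, here is a small learned map from unbold latent vectors to bold ones, in PyTorch; the architecture and training details are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small network mapping the latent vector of an unbold font to the
# latent vector of its bold counterpart.
bolding_map = nn.Sequential(
    nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 40))

optimizer = torch.optim.Adam(bolding_map.parameters(), lr=1e-3)

def train_step(unbold_z, bold_z):
    # unbold_z, bold_z: batches of latent vectors for paired
    # (unbold, bold) example fonts, shape (batch, 40).
    optimizer.zero_grad()
    loss = F.mse_loss(bolding_map(unbold_z), bold_z)
    loss.backward()
    optimizer.step()
    return loss.item()
```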

For these reasons, it seems unlikely that attribute vectors will
last as an approach to manipulating high-level features. Over the
next few years much better approaches will be developed. However,
we can still expect interfaces offering operations broadly similar
to those sketched above, allowing access to high-level and
potentially user-defined concepts. That interface pattern doesn’t
depend on the technical details of attribute vectors.

Interactive Generative Adversarial Models

Let’s look at another example using machine learning models to
augment human creativity. It’s the interactive generative
adversarial networks, or iGANs, introduced
by Zhu et al in 2016.

One of the examples of Zhu et al is the use of iGANs in
an interface to generate images of consumer products such as
shoes. Conventionally, such an interface would require the
programmer to write a program containing a great deal of knowledge
about shoes: soles, laces, heels, and so on. Instead of doing
this, Zhu et al train a generative model using 50
thousand images of shoes, downloaded from Zappos. They then use
that generative model to build an interface that lets a user
roughly sketch the shape of a shoe, the sole, the laces, and so
on:

The visual quality is low, in part because the generative model
Zhu et al used is outdated by modern (2017) standards
– with more modern models, the visual quality would be much
higher.

But the visual quality is not the point. Many interesting things
are going on in this prototype. For instance, notice how the
overall shape of the shoe changes considerably when the sole is
filled in – it becomes narrower and sleeker. Many small
details are filled in, like the black piping on the top of the
white sole, and the red coloring filled in everywhere on the
shoe’s upper. These and other facts are automatically deduced
from the underlying generative model, in a way we’ll describe
shortly.

The same interface may be used to sketch landscapes. The only
difference is that the underlying generative model has been
trained on landscape images rather than images of shoes. In this
case it becomes possible to sketch in just the colors associated
with a landscape. For example, here’s a user sketching in some green
grass, the outline of a mountain, some blue sky, and snow on the
mountain:

The generative models used in these interfaces are different from
the one used for our font model. Rather than using variational
autoencoders, they’re based on generative adversarial networks
(GANs). But the underlying idea is
still to find a low-dimensional latent space which can be used to
represent (say) all landscape images, and map that latent space to
a corresponding image. Again, we can think of points in the
latent space as a compact way of describing landscape images.

Roughly speaking, the way the iGANs works is as follows. Whatever
the current image is, it corresponds to some point in the latent
space:

Suppose, as happened in the earlier video, the user now sketches
in a stroke outlining the mountain shape. We can think of the
stroke as a constraint on the image, picking out a subspace of the
latent space, consisting of all points in the latent space whose
image matches that outline:

The interface works by finding a point in the latent space
which is near the current image, so the image is not changed
too much, while also coming close to satisfying the imposed
constraints. This is done by optimizing an objective function
which combines the distance to each of the imposed constraints, as
well as the distance moved from the current point. If there’s
just a single constraint, say, corresponding to the mountain
stroke, this looks something like the following:

We can think of this, then, as a way of applying constraints to
the latent space to move the image around in meaningful ways.
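
Here is a sketch of how that optimization might look in code. The generator, the constraint mask, and the weighting are illustrative assumptions, not Zhu et al’s exact objective:

```python
import torch

def edit(generator, z_current, target, mask, weight=1.0, steps=200):
    """Find a latent point near z_current whose image satisfies the
    user's constraints.

    target: an image holding the user's strokes; mask: 1 where the
    user has drawn, 0 elsewhere, so only constrained pixels count.
    """
    z = z_current.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        image = generator(z)
        # Distance to the imposed constraints.
        constraint_term = ((image - target) * mask).pow(2).sum()
        # Distance moved from the current point, so the image
        # doesn't change too much.
        proximity_term = (z - z_current).pow(2).sum()
        loss = constraint_term + weight * proximity_term
        loss.backward()
        opt.step()
    return z.detach()
```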

The iGANs have much in common with the font tool we showed
earlier. Both make available operations that encode much subtle
knowledge about the world, whether it be learning to understand
what a mountain looks like, or inferring that enclosed negative
space should be preserved when bolding a font. Both the iGANs and
the font tool provide ways of understanding and navigating a
high-dimensional space, keeping us on the natural space of fonts
or shoes or landscapes. As Zhu et al remark:

[F]or most of us, even a simple image manipulation in Photoshop
presents insurmountable difficulties… any less-than-perfect
edit immediately makes the image look completely unrealistic. To
put another way, classic visual manipulation paradigm does not
prevent the user from “falling off” the manifold of
natural images.

Like the font tool, the iGANs is a cognitive technology. Users
can internalize the interface operations as new primitive elements
in their thinking. In the case of shoes, for example, they can
learn to think in terms of the difference they want to apply,
adding a heel, or a higher top, or a special highlight. This is
richer than the traditional way non-experts think about shoes
(“Size 11, black” etc). To the extent that
non-experts do think in more sophisticated ways –
“make the top a little higher and sleeker” –
they get little practice in thinking this way, or seeing the
consequences of their choices. Having an interface like this
enables easier exploration, the ability to develop idioms and the
ability to plan, to swap ideas with friends, and so on.

Two models of computation

Let’s revisit the question we began the essay with, the question
of what computers are for, and how this relates to intelligence
augmentation.

One common conception of computers is that they’re problem-solving
machines: “computer, what is the result of firing this
artillery shell in such-and-such a wind [and so on]?”;
“computer, what will the maximum temperature in Tokyo be in
5 days?”; “computer, what is the best move to take
when the Go board is in this position?”; “computer,
how should this image be classified?”; and so on.

This is a conception common to both the early view of computers as
number-crunchers, and also in much work on AI, both historically
and today. It’s a model of a computer as a way of outsourcing
cognition. In speculative depictions of possible future AI,
this cognitive outsourcing model often shows up in the
view of an AI as an oracle, able to solve some large class of
problems with better-than-human performance.

But a very different conception of what computers are for is
possible, a conception much more congruent with work on
intelligence augmentation.

To understand this alternate view, consider our subjective
experience of thought. For many people, that experience is verbal:
they think using language, forming chains of words in their heads,
similar to sentences in speech or written on a page. For other
people, thinking is a more visual experience, incorporating
representations such as graphs and maps. Still other people mix
mathematics into their thinking, using algebraic expressions or
diagrammatic techniques, such as Feynman diagrams and Penrose
diagrams.

In each case, we’re thinking using representations invented by
other people: words, graphs, maps, algebra, mathematical diagrams,
and so on. We internalize these cognitive technologies as we grow
up, and come to use them as a kind of substrate for our thinking.

For most of history, the range of available cognitive technologies
has changed slowly and incrementally. A new word will be
introduced, or a new mathematical symbol. More rarely, a radical
new cognitive technology will be developed. For example, in 1637
Descartes published his “Discourse on Method”,
explaining how to represent geometric ideas using algebra, and
vice versa:

This enabled a radical change and expansion in how we think about
both geometry and algebra.

Historically, lasting cognitive technologies have been invented
only rarely. But modern computers are a meta-medium enabling the
rapid invention of many new cognitive technologies. Consider a
relatively banal example, such
as Photoshop. Adept Photoshop users routinely
have formerly impossible thoughts such as: “let’s apply the
clone stamp to the such-and-such layer.” That’s an
instance of a more general class of thought: “computer, [new
type of action] this [new type of representation for a newly
imagined class of object]”. When that happens, we’re using
computers to expand the range of thoughts we can think.

It’s this kind of cognitive transformation model which
underlies much of the deepest work on intelligence augmentation.
Rather than outsourcing cognition, it’s about changing the
operations and representations we use to think; it’s about
changing the substrate of thought itself. And so while cognitive
outsourcing is important, this cognitive transformation view
offers a much more profound model of intelligence augmentation.
It’s a view in which computers are a means to change and expand
human thought itself.

Historically, cognitive technologies were developed by human
inventors, ranging from the invention of writing in Sumeria and
Mesoamerica, to the modern interfaces of designers such as Douglas
Engelbart, Alan Kay, and others.

Examples such as those described in this essay suggest that AI
systems can enable the creation of new cognitive technologies.
Things like the font tool aren’t just oracles to be consulted when
you want a new font. Rather, they can be used to explore and
discover, to provide new representations and operations, which can
be internalized as part of the user’s own thinking. And while
these examples are in their early stages, they suggest AI is not
just about cognitive outsourcing. A different view of AI is
possible, one where it helps us invent new cognitive technologies
which transform the way we think.

In this essay we’ve focused on a small number of examples, mostly
involving exploration of the latent space. There are many other
examples of artificial intelligence augmentation. To give some
flavor, without being comprehensive:
the sketch-rnn system, for neural
network assisted drawing;
the Wekinator, which enables
users to rapidly build new musical instruments and artistic
systems; TopoSketch, for developing
animations by exploring latent spaces; machine learning models for
designing overall typographic layout; and a generative model which
enables interpolation between musical
phrases. In each case, the systems use machine learning
to enable new primitives which can be integrated into the user’s
thinking. More broadly, artificial intelligence augmentation will
draw on fields such as computational creativity and interactive
machine learning.

Finding powerful new primitives of thought

We’ve argued that machine learning systems can help create
representations and operations which serve as new primitives in
human thought. What properties should we look for in such new
primitives? This is too large a question to be answered
comprehensively in a short essay. But we will explore it briefly.

Historically, important new media forms often seem strange when
introduced. Many such stories have passed into popular culture:
the near riot at the premiere of Stravinsky and Nijinsky’s
“Rite of Spring”; the consternation caused by the
early cubist paintings, leading The New York Times to
comment: “What do they mean? Have those
responsible for them taken leave of their senses? Is it art or
madness? Who knows?”

Another example comes from physics. In the 1940s, different
formulations of the theory of quantum electrodynamics were
developed independently by the physicists Julian Schwinger,
Shin’ichirō Tomonaga, and Richard Feynman. In their work,
Schwinger and Tomonaga used a conventional algebraic approach,
along lines similar to the rest of physics. Feynman used a more
radical approach, based on what are now known as Feynman diagrams,
for depicting the interaction of light and matter:

Image by Joel Holdsworth, licensed under a Creative Commons
Attribution-Share Alike 3.0 Unported license.

Initially, the Schwinger-Tomonaga approach was easier for other
physicists to understand. When Feynman and Schwinger presented
their work at a 1948 workshop, Schwinger was immediately
acclaimed. By contrast, Feynman left his audience mystified. As
James Gleick put it in his biography of Feynman:

It struck Feynman that everyone had a favorite principle or
theorem and he was violating them all… Feynman knew he had
failed. At the time, he was in anguish. Later he said simply:
“I had too much stuff. My machines came from too far
away.”

Of course, strangeness for strangeness’s sake alone is not
useful. But these examples suggest that breakthroughs in
representation often appear strange at first. Is there any
underlying reason why that should be true?

Part of the reason is that if some representation is truly new,
then it will appear different from anything you’ve ever seen
before. Feynman’s diagrams, Picasso’s paintings, Stravinsky’s
music: all revealed genuinely new ways of making meaning. Good
representations sharpen up such insights, eliding the familiar to
show that which is new as vividly as possible. But because of
that emphasis on unfamiliarity, the representation will seem
strange: it shows relationships you’ve never seen before. In some
sense, the task of the designer is to identify that core
strangeness, and to amplify it as much as possible.

Strange representations are often difficult to understand. At
first, physicists preferred Schwinger-Tomonaga to Feynman. But as
Feynman’s approach was slowly understood by physicists, they
realized that although Schwinger-Tomonaga and Feynman were
mathematically equivalent, Feynman was more powerful. As Gleick
puts it:

Schwinger’s students at Harvard were put at a competitive
disadvantage, or so it seemed to their fellows elsewhere, who
suspected them of surreptitiously using the diagrams anyway. This
was sometimes true… Murray Gell-Mann later spent a semester
staying in Schwinger’s house and loved to say afterward that he
had searched everywhere for the Feynman diagrams. He had not
found any, but one room had been locked…

These ideas are true not just of historical representations, but
also of computer interfaces. However, our advocacy of strangeness
in representation contradicts much conventional wisdom about
interfaces, especially the widely-held belief that they should be
“user friendly”, i.e., simple and immediately useable
by novices. That most often means the interface is cliched, built
from conventional elements combined in standard ways. But while
using a cliched interface may be easy and fun, it’s an ease
similar to reading a formulaic romance novel. It means the
interface does not reveal anything truly surprising about its
subject area. And so it will do little to deepen the user’s
understanding, or to change the way they think. For mundane tasks
that is fine, but for deeper tasks, and for the longer term, you
want a better interface.

Ideally, an interface will surface the deepest principles
underlying a subject, revealing a new world to the user. When you
learn such an interface, you internalize those principles, giving
you more powerful ways of reasoning about that world. Those
principles are the diffs in your understanding. They’re all you
really want to see, everything else is at best support, at worst
unimportant dross. The purpose of the best interfaces isn’t to be
user-friendly in some shallow sense. It’s to be user-friendly in
a much stronger sense, reifying deep
principles
about the world, making them the working
conditions in which users live and create. At that point what once
appeared strange can instead become comfortable and familiar,
part of the pattern of thought. (A powerful instance of
these ideas is when an interface reifies general-purpose
principles. An example is an interface one of us developed
based on the principle of conservation of energy. Such
general-purpose principles generate multiple unexpected
relationships between the entities of a subject, and so are a
particularly rich source of insights when reified in an
interface.)

What does this mean for the use of AI models for intelligence
augmentation?

Aspirationally, as we’ve seen, our machine learning models will
help us build interfaces which reify deep principles in ways
meaningful to the user. For that to happen, the models have to
discover deep principles about the world, recognize those
principles, and then surface them as vividly as possible in an
interface, in a way comprehensible by the user.

Of course, this is a tall order! The examples we’ve shown are just
barely beginning to do this. It’s true that our models do
sometimes discover relatively deep principles, like the
preservation of enclosed negative space when bolding a font. But
this is merely implicit in the model. And while we’ve built a tool
which takes advantage of such principles, it’d be better if the
model automatically identified the important principles it has learned, and
found ways of explicitly surfacing them through the interface.
(Encouraging progress toward this has been made
by InfoGANs, which use
information-theoretic ideas to find structure in the latent
space.) Ideally, such models would start to get at true
explanations, not just in a static form, but in a dynamic form,
manipulable by the user. But we’re a long way from that point.

Do these interfaces inhibit creativity?

It’s tempting to be skeptical of the expressiveness of the
interfaces we’ve described. If an interface constrains us to
explore only the natural space of images, does that mean we’re
merely doing the expected? Does it mean these interfaces can only
be used to generate visual cliches? Does it prevent us from
generating anything truly new, from doing truly creative work?

To answer these questions, it’s helpful to identify two different
modes of creativity. This two-mode model is over-simplified:
creativity doesn’t fit so neatly into two distinct categories. Yet
the model nonetheless clarifies the role of new interfaces in
creative work.

The first mode of creativity is the everyday creativity of a
craftsperson engaged in their craft. Much of the work of a font
designer, for example, consists of competent recombination of the
best existing practices. Such work typically involves many
creative choices to meet the intended design goals, but not
developing key new underlying principles.

For such work, the generative interfaces we’ve been discussing are
promising. While they currently have many limitations, future
research will identify and fix many deficiencies. This is
happening rapidly with GANs: the original
GANs had many limitations,
but models soon appeared that were better adapted to
images, improved the resolution, reduced artifacts (so much work
has been done on improving resolution and reducing artifacts that
it seems unfair to single out any small set of papers, and to omit
the many others), and so on. With enough iterations it’s
plausible these generative interfaces will become powerful tools
for craft work.

The second mode of creativity aims toward developing new
principles that fundamentally change the range of creative
expression. One sees this in the work of artists such as Picasso
or Monet, who violated existing principles of painting, developing
new principles which enabled people to see in new ways.

Is it possible to do such creative work, while using a generative
interface? Don’t such interfaces constrain us to the space of
natural images, or natural fonts, and thus actively prevent us
from exploring the most interesting new directions in creative
work?

The situation is more complex than this.

In part, this is a question about the power of our generative
models. In some cases, the model can only generate recombinations
of existing ideas. This is a limitation of an ideal GAN, since a
perfectly trained GAN generator will reproduce the training
distribution. Such a model can’t directly generate an image based
on new fundamental principles, because such an image wouldn’t look
like anything it has seen in its training data.

Artists such as Mario
Klingemann
and Mike
Tyka
are now using GANs to create interesting
artwork. They’re doing that using “imperfect” GAN
models, which they seem to be able to use to explore interesting
new principles; it’s perhaps the case that bad GANs may be more
artistically interesting than ideal GANs. Furthermore, nothing
says an interface must only help us explore the latent space.
Perhaps operations can be added which deliberately take us out
of the latent space, or to less probable (and so more
surprising) parts of the space of natural images.

Of course, GANs are not the only generative models. In a
sufficiently powerful generative model, the generalizations
discovered by the model may contain ideas going beyond what humans
have discovered. In that case, exploration of the latent space may
enable us to discover new fundamental principles. The model would
have discovered stronger abstractions than human experts. Imagine
a generative model trained on paintings up until just before the
time of the cubists; might it be that by exploring that model it
would be possible to discover cubism? It would be an analogue to
something like the prediction of Bose-Einstein condensation, as
discussed earlier in the essay. Such invention is beyond today’s
generative models, but seems a worthwhile aspiration for future
models.

Our examples so far have all been based on generative models. But
there are some illuminating examples which are not based on
generative models. Consider the pix2pix system developed
by Isola et al. This
system is trained on pairs of images, e.g., pairs showing the
edges of a cat, and the actual corresponding cat. Once trained,
it can be shown a set of edges and asked to generate an image for
an actual corresponding cat. It often does this quite well:

When supplied with unusual constraints, pix2pix can produce
striking images:


Spiral cat

This is perhaps not high creativity of a Picasso-esque level. But
it is still surprising. It’s certainly unlike images most of us
have ever seen before. How do pix2pix and its human user achieve
this kind of result?

Unlike our earlier examples, pix2pix is not a generative model.
This means it does not have a latent space or a corresponding
space of natural images. Instead, there is a neural network,
called, confusingly, a generator – this is not meant in the
same sense as our earlier generative models – that takes as
input the constraint image, and produces as output the filled-in
image.

The generator is trained adversarially against a discriminator
network, whose job is to distinguish between pairs of images
generated from real data, and pairs of images generated by the
generator.
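
As a sketch, the two training objectives might look as follows. The names and the pixel-distance term are assumptions in the spirit of Isola et al’s conditional setup, not their exact code:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, constraint, real_image, fake_image):
    # The discriminator sees (constraint, image) pairs and must tell
    # real pairs from generated ones.
    real_logits = D(constraint, real_image)
    fake_logits = D(constraint, fake_image.detach())
    return (F.binary_cross_entropy_with_logits(
                real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(
                fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(D, constraint, real_image, fake_image, l1_weight=100.0):
    # The generator tries to fool the discriminator...
    fake_logits = D(constraint, fake_image)
    adversarial = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))
    # ...while also staying close, pixel by pixel, to the paired
    # real image from the training data.
    return adversarial + l1_weight * F.l1_loss(fake_image, real_image)
```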

While this sounds similar to a conventional GAN, there is a
crucial difference: there is no latent vector input to the
generator. (Actually, Isola et al experimented with adding such a
latent vector to the generator, but found it made little
difference to the resulting images.) Rather, there is simply an input
constraint. When a human inputs a constraint unlike anything seen
in training, the network is forced to improvise, doing the best it
can to interpret that constraint according to the rules it has
previously learned. The creativity is the result of a forced
merger of knowledge inferred from the training data, together with
novel constraints provided by the user. As a result, even
relatively simple ideas – like the bread- and beholder-cats
– can result in striking new types of images, images not
within what we would previously have considered the space of
natural images.

Conclusion

It is conventional wisdom that AI will change how we interact with
computers. Unfortunately, many in the AI community greatly
underestimate the depth of interface design, often regarding it as
a simple problem, mostly about making things pretty or
easy-to-use. In this view, interface design is a problem to be
handed off to others, while the hard work is to train some machine
learning system.

This view is incorrect. At its deepest, interface design means
developing the fundamental primitives human beings think and
create with. This is a problem whose intellectual genesis goes
back to the inventors of the alphabet, of cartography, and of
musical notation, as well as modern giants such as Descartes,
Playfair, Feynman, Engelbart, and Kay. It is one of the hardest,
most important and most fundamental problems humanity grapples
with.

As discussed earlier, in one common view of AI our computers will
continue to get better at solving problems, but human beings will
remain largely unchanged. In a second common view, human beings
will be modified at the hardware level, perhaps directly through
neural interfaces, or indirectly through whole brain emulation.

We’ve described a third view, in which AIs actually change
humanity, helping us invent new cognitive technologies, which
expand the range of human thought. Perhaps one day those
cognitive technologies will, in turn, speed up the development of
AI, in a virtuous feedback cycle:


It would not be a Singularity in machines. Rather, it would be a
Singularity in humanity’s range of thought. Of course, this loop
is at present extremely speculative. The systems we’ve described
can help develop more powerful ways of thinking, but there’s at
most an indirect sense in which those ways of thinking are being
used in turn to develop new AI systems.

Of course, over the long run it’s possible that machines will
exceed humans on all or most cognitive tasks. Even if that’s the
case, cognitive transformation will still be a valuable end, worth
pursuing in its own right. There is pleasure and value involved
in learning to play chess or Go well, even if machines do it
better. And in activities such as story-telling the benefit often
isn’t so much the artifact produced as the process of construction
itself, and the relationships forged. There is intrinsic value in
personal change and growth, apart from instrumental benefits.

The interface-oriented work we’ve discussed is outside the
narrative used to judge most existing work in artificial
intelligence. It doesn’t involve beating some benchmark for a
classification or regression problem. It doesn’t involve
impressive feats like beating human champions at games such as
Go. Rather, it involves a much more subjective and
difficult-to-measure criterion: is it helping humans think and
create in new ways?

This creates difficulties for doing this kind of work,
particularly in a research setting. Where should one publish?
What community does one belong to? What standards should be
applied to judge such work? What distinguishes good work from
bad?

We believe that over the next few years a community will emerge
which answers these questions. It will run workshops and
conferences. It will publish work in venues such as Distill. Its
standards will draw from many different communities: from the
artistic and design and musical communities; from the mathematical
community’s taste in abstraction and good definition; as well as
from the existing AI and IA communities, including work on
computational creativity and human-computer interaction. The
long-term test of success will be the development of tools which
are widely used by creators. Are artists using these tools to
develop remarkable new styles? Are scientists in other fields
using them to develop understanding in ways not otherwise
possible? These are great aspirations, and require an approach
that builds on conventional AI work, but also incorporates very
different norms.


