Error Risk


In most machine learning discussions I have with people, I find that the notion of error risk is new to them. Here’s the basic idea: you have a trained machine learning model that’s processing incoming data, and it naturally has an error rate. Let’s say the error rate is 5%, or conversely, the model is correct about 95% of the time. Error risk, then, is the set of possible negative consequences from incorrect predictions or decisions.

In an extreme example, the error risk for a 747’s autopilot system is perilously high. For a model predicting user shopping behavior on an e-commerce site, the risk is rather low –  maybe it recommends the wrong product once or twice, but nobody gets hurt.

The depth of the model’s integration and the speed at which it makes decisions are both correlated with the amount of error risk. If the program in question is running some analytics off to the side, merely supplying supplementary information to a human decision-maker, the risk is almost zero. However, if the program is itself making decisions, such as how much to bank right in a 45mph crosswind or how much of a certain inventory to order from a supplier, the risk increases substantially.

I’ve taken to quantifying error risk by asking the following questions:

  1. Is the program or system making autonomous decisions? If yes, what happens when the wrong decision is made?
  2. If it is making decisions, what is the cycle time / how quickly are those decisions being made?
  3. If it is not making decisions, is the information it’s providing critical or supplementary? (Critical information could be things like cancer diagnostics, whereas supplementary information could be providing simple reports to a digital marketing team.)

Other questions come up in these situations, but the above are the most important.
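To make this a little more concrete, here is a minimal scoring sketch based on those three questions. The weights, categories, and thresholds are my own assumptions for illustration, not a validated methodology:

```python
def error_risk_score(autonomous: bool,
                     decisions_per_hour: float = 0.0,
                     information_role: str = "supplementary",
                     consequence_severity: float = 0.5) -> float:
    """Rough, illustrative error-risk score in [0, 1].

    autonomous: does the system make decisions on its own? (question 1)
    decisions_per_hour: how quickly decisions are made (question 2)
    information_role: "critical" or "supplementary", used when not autonomous (question 3)
    consequence_severity: 0 (harmless) to 1 (catastrophic) when a prediction is wrong
    """
    if autonomous:
        # Faster decision cycles leave less room for human review.
        speed_factor = min(decisions_per_hour / 60.0, 1.0)
        return min(1.0, consequence_severity * (0.5 + 0.5 * speed_factor))
    # Advisory-only systems: risk depends on how critical the information is.
    return consequence_severity * (0.6 if information_role == "critical" else 0.1)

# An autopilot-style system versus a product recommender:
print(error_risk_score(True, decisions_per_hour=3600, consequence_severity=1.0))  # ~1.0
print(error_risk_score(False, information_role="supplementary",
                       consequence_severity=0.1))                                 # ~0.01
```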

Optimal use of machine learning in applications means gaining maximal benefit at minimum risk wherever possible. To get as close as possible to “pure upside” in implementing machine learning, what’s required is some strategic thinking around where the opportunities lie and what the error tolerance might be in those applications. Even state-of-the-art machine learning systems have intrinsic error. Therefore error risk must always be accounted for, even if the error size is tiny.

My ideas about optimal implementation of machine learning borrow heavily from ideas in portfolio optimization, especially the efficient frontier. This is the “sweet spot” in the tradeoff between rewards and risks. As machine learning makes its way into more applications, it’s worth taking the time to consider both the upside and the downside. Measures of optimality can only help to make more informed decisions about how to apply the latest technology.

From Paper to Production: Shortening the Ramp

One of the things that strikes me about the current state of machine learning is how long it still takes to get a new algorithm or model into production. From the time that 1) a paper is published to when 2) its contents are evaluated by those doing machine learning in industry and 3) they subsequently commit to developing it, years have passed. It does not need to be this way.

Those doing machine learning are understandably wary of newer methods, and I can see why they might opt to give them time for hidden problems to be discovered before committing. The long-term viability of a model is often judged by its ability to remain on the scene after many years, which, though vaguely tautological, remains a valid test. Models that survive the process are deemed essential, timeless; the rest are disregarded.

There are two sides from which this can be viewed: the business risk side, and the development side. These are deeply intertwined.

The Risk Side

Product managers and tech leadership coming to grips with the pervasiveness of machine learning have an increasing number of considerations, many of them in technology areas outside their expertise. There is a silver lining, however: the discussion of which machine learning technology makes its way into new products is fundamentally one of risk. To the extent that they can work with people inside their organization (or its allies) who have the expertise to accurately assess newer methods, they can quantify how much is at stake when things go wrong.

For instance, if a new method can drop the error rate of a certain type of prediction down to 3% (as opposed to a previous 5%), how does that affect the risk statistics of the business? Does it enable broader distribution or reach into a higher market segment? Does it enable new products entirely? These questions must be answered.
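As a back-of-the-envelope illustration, the difference between a 5% and a 3% error rate compounds quickly with prediction volume. All of the volumes and costs below are invented for the example:

```python
daily_predictions = 200_000                # assumed volume, for illustration only
old_rate, new_rate = 0.05, 0.03

old_errors = daily_predictions * old_rate  # 10,000 bad predictions per day
new_errors = daily_predictions * new_rate  #  6,000 bad predictions per day

# If each bad prediction carries an expected cost (a refund, a churned user,
# a support ticket...), the risk reduction is directly proportional:
cost_per_error = 2.50                      # hypothetical average cost in dollars
print(f"Expected daily cost: ${old_errors * cost_per_error:,.0f} -> "
      f"${new_errors * cost_per_error:,.0f}")
# Expected daily cost: $25,000 -> $15,000
```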

New qualitative capabilities may seem more difficult to judge, but that is not necessarily true. For instance, some newer capabilities involve an AI system describing what’s in a photograph, using completely natural language with understandable sentences. If a product manager or CTO is considering using this capability in a new product or feature, the error rate of the method can still be used to assess the risk to the business that the new feature exposes them to. The degree of risk will vary widely by the specific industry and application, but the process remains the same.

The Development Side

Even if all parties can agree that a new method is tempting enough to use in a feature, somebody still has to code the thing. This is where progress is sluggish. Developing new, unfamiliar models and validating them is a nontrivial effort, even for experienced ML programmers. Assuming you’re doing the first implementation in a given language or environment, it requires a degree of getting into the thought process of the researchers. Often, direct correspondence is needed to clarify details.

While many papers include pseudocode that can be readily translated into a programming language, just as many do not. From there, you are left to develop a deep understanding of the model’s description and translate its mathematical definition and data structures into a complete implementation. It’s hard work.

This is the part where things can slow down: without a clear understanding of the model and its behavior, it is not possible for a tech lead, data scientist, or ML developer to make accurate judgements about the level of risk or the likelihood of bugs or other surprise behavior. Beyond the error rate, one has to assume that the resulting implementation will have its own quirks and bugs. To assume otherwise would be both unrealistic and foolish.

Many companies may be slower to adopt “bleeding edge” methods, then, because it is simply too difficult to enumerate their implied capabilities and to quantify the risk they impose. How can this be solved?

Shorten the Ramp

Consider the situation where some new deep learning model X has appeared and a company really wants to use it in their products, but may not have a good way of thinking through the consequences of doing so. We can point out the main issues:

  • It can be a challenge to arrive at an exact error rate for the specific application before an implementation has been made. The paper will use test datasets, but the model will almost surely behave differently with the data specific to a feature.
  • There is often a break in the communication between those gaining understanding of the model and those assessing how it may affect the business overall. It could be anything from a smash hit to total disaster.
  • Even when a model is finished, it will need to land in an environment in which to run. Engineers should keep the infrastructure requirements in mind from the beginning.
  • In a waterfall or waterfall-like process, it is of course not possible to create requirements in the absence of understanding of the capabilities involved. This stalls progress.
  • Agile development is out the window, due to the high sensitivity of the relationship between model performance and feature risk or cost. These aren’t really the kinds of things you can just “ship first, iterate later”. Much needs to be worked out before it goes into the hands of feature developers.

All of this points toward two ways to shorten the ramp to deployment:

  1. If a company is genuinely interested in adopting new algorithms and models in their offerings, they need to provide representative data as soon as possible.
  2. Their engineering and/or data science team(s) need to have tools and infrastructure to support rapid prototyping of new models.

Only with datasets that are representative of what will occur in a production setting can a team judge their implementation and profile its performance. In the paper, a model may boast 90-something percent accuracy, but you may find that for your problem it is a little less, thereby affecting how risky the investment in developing the model is.

This can happen before the implementation phase, by looking at the test data used in a paper. A talented developer or data scientist can format the internal dataset to be similar to that used in a test dataset from the paper, thereby reducing opportunity for errors to arise from differences in data formatting.
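For example, if the paper evaluates on a CSV of labeled examples, a first pass might simply reshape the internal data to mirror that schema, so the prototype can be evaluated exactly the way the paper evaluates it. The file names and columns below are hypothetical:

```python
import pandas as pd

# Internal data, in whatever shape the business systems produce it (hypothetical schema).
internal = pd.read_csv("internal_records.csv")

# Reshape it to mirror the schema of the paper's benchmark dataset.
benchmark_format = pd.DataFrame({
    "text": internal["customer_comment"].fillna(""),
    "label": internal["was_flagged"].astype(int),
})
benchmark_format.to_csv("internal_as_benchmark.csv", index=False)
```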

An example process, then, might look like this:

  1. Product manager decides they need X new capability in their next phase of features.
  2. Their first order of business, then, is to gather and build datasets that are close to the actual problem. At the very least, they should ensure that whoever can build that dataset has access to all the tools and data sources needed to complete it quickly.
  3. Product manager hands over dataset, high-level requirements, and asks data science and/or engineering team(s) to begin investigating models.
  4. Technical team either begins profiling models they already know about or scouting for models that are known to enable the required capabilities.
  5. Development / prototyping begins with the selected model(s) and the datasets provided.
  6. Throughout development process, error rates and other important metrics are reported back to the product manager (or whomever is overseeing the process).
  7. Risk calculations are adjusted as this information flows in. For example, if it’s a photo auto-tagging feature in question, one can determine how many users are likely to experience incorrectly tagged photos and how often, based on the volume of photos and the error rate of the model. From that, one can determine how much of a risk it is to the business – are the users likely to leave if they experience the error, or is it not a deal-breaker? (A rough calculation of this kind is sketched after this list.)
  8. Once all models have been profiled and tested, a decision can be reached about whether or not to proceed with the feature.
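For the photo auto-tagging example in step 7, the exposure calculation might look something like this. Every number below is invented for illustration; only the structure of the calculation matters:

```python
monthly_active_users = 500_000        # hypothetical
photos_per_user_per_month = 40        # hypothetical
model_error_rate = 0.04               # measured by profiling on representative data

expected_bad_tags = monthly_active_users * photos_per_user_per_month * model_error_rate

# Probability that a given user sees at least one mis-tagged photo in a month,
# assuming tagging errors are independent.
share_of_users_affected = 1 - (1 - model_error_rate) ** photos_per_user_per_month

print(f"Expected mis-tagged photos per month: {expected_bad_tags:,.0f}")
print(f"Chance a given user sees at least one mis-tag: {share_of_users_affected:.0%}")
```

From numbers like these, whoever is overseeing the process can judge whether the error exposure is a deal-breaker for the feature or an acceptable cost.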

Of course not every company has a product manager or entire teams for data science and engineering, but the overall structure can be applied by those filling the roles – even if it’s all the same person.

In summary, the best way I see to shorten the time from a published paper to a viable production implementation is to 1) provide data as early as possible and 2) ensure engineering has tools to quickly prototype and test models. Both are difficult, but both will pay off considerably for those willing to put in the effort.

I hope this has been helpful to you.



Concepts as Programs

The awesomeness that is Bayesian Program Learning

In December 2015, this nifty paper showed up on the scene, titled “Human-level concept learning through probabilistic program induction”. I read it, then read it again, then read it 20 more times. (Not even kidding here.) It’s difficult to relate why this is so important, but I’ll try my best to explain.

First, the best way for you to get introduced to this is to check out this short video:

You can grab the accompanying Matlab code here.

They call their approach Bayesian Program Learning, or BPL. It’s a huge step forward, and here are some reasons why:

  1. The current machine learning approach that’s all the rage, Deep Learning, still requires hundreds or even thousands of examples in most cases. BPL, in this context, requires as few as one example.
  2. The model is “learning to learn”: “Learning proceeds by constructing programs that best explain the observations under a Bayesian criterion, and the model “learns to learn” (23, 24) by developing hierarchical priors that allow previous experience with related concepts to ease learning of new concepts (25, 26). These priors represent a learned inductive bias (27) that abstracts the key regularities and dimensions of variation holding across both types of concepts and across instances (or tokens) of a concept in a given domain.” In short, its past experience helps inform its learning process. While some Deep Learning models have demonstrated this, they do so in a far more limited way.
  3. Explicitly representing concepts as compositionally built programs starts to look a lot like the dreams of old functional AI realized, though achieved through probabilistic methods.
  4. The approach of learning nested programs is highly generalizable, and could even be made hierarchical a la Deep Belief Nets.


What immediately stood out to me about BPL is that the “primitives” could be anything, and the images could likewise be anything. In the BPL paper, the primitives are pen strokes, and the images are handwritten characters. There is no reason I’m aware of that the primitives couldn’t be partials of some other function, and that the images couldn’t be an input to another system – possibly even another BPL node. Instead of the model learning to compose handwritten characters from pen strokes, it could be learning to compose solutions to puzzles based on available moves in context.
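To make the “primitives could be anything” point concrete, here is a toy sketch of concepts as compositions of sampled primitives. This is not the BPL algorithm itself – there is no Bayesian inference here – it only illustrates the compositional structure, and every name in it is invented for the example:

```python
import random
from functools import partial

# Primitives need not be pen strokes; here they are simple numeric transforms.
PRIMITIVES = [
    partial(lambda x, k: x + k, k=1),
    partial(lambda x, k: x * k, k=2),
    partial(lambda x, k: x - k, k=3),
]

def sample_concept(n_parts: int = 3):
    """A 'concept' is a small program: a composition of sampled primitives."""
    parts = [random.choice(PRIMITIVES) for _ in range(n_parts)]
    def concept(x):
        for p in parts:
            x = p(x)
        return x
    return concept

# Each sampled concept is a reusable little program. In BPL, rendering a token of a
# concept additionally adds motor noise, so instances of the same character differ.
c = sample_concept()
print([c(x) for x in range(5)])
```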

In order to generalize BPL beyond handwriting, we’ll need to drop some of the domain-specific constraints (a sketch of how these might become configuration options follows the list):

  • In the algorithm, they sample noise/variance in the motor programs. This is a natural thing to do in order to mimic human-like variability in the output of the learned motor program, but may not be valid in other contexts. This step in the algorithm can be made optional.
  • The spatial trajectories are all smooth and, generally, connected. In a more abstract domain, such as learning a sequence of actions to take in order to solve a puzzle, the resulting ‘image’ will not necessarily be smooth at all, but could end up looking more like a QR code. There is no reason that I’m aware of to require smooth trajectories in the general case; this should be optional. (If the issue stems from the random walk algorithm used, it may be possible to swap that for a different process in order to get the required samples.)
  • Not every application will require continuous distributions. There may be cases where discrete distributions are more apt. It is worth testing this in the future.
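One way I imagine exposing those choices in a generalized implementation is as plain configuration options. This is purely speculative API design on my part, not anything taken from the paper or its Matlab code:

```python
from dataclasses import dataclass

@dataclass
class GeneralizedBPLConfig:
    # Whether to inject motor-style noise when rendering tokens of a concept.
    add_token_noise: bool = True
    # Whether sampled trajectories must be smooth and connected (true for handwriting,
    # likely false for abstract domains such as puzzle-solving action sequences).
    require_smooth_trajectories: bool = True
    # "continuous" or "discrete" distributions over primitive parameters.
    parameter_distribution: str = "continuous"

handwriting = GeneralizedBPLConfig()
puzzle_solving = GeneralizedBPLConfig(add_token_noise=False,
                                      require_smooth_trajectories=False,
                                      parameter_distribution="discrete")
```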

To start exploring more general forms of the BPL algorithm and try them in real-world, production use cases, I’ve started a Python implementation. My hope is to further explore BPL and build an ecosystem around it, to encourage developers to make their own variants and test them on specific use cases. As the project matures and I do my own tests, I’ll blog about them here.

Representing concepts as programs is not a completely new idea, but BPL pulls it off unlike anything else prior. The algorithm offers a solid base to build from, and a new way to start thinking about general learning capabilities in the machine.

The Octet System: A Way to Think About AI

You see countless headlines about AI these days, littered with references to “deep learning”, “neural networks”, “bots”, “Q&A systems”, “virtual assistant”, and all manner of other proxy terms. What’s missing from this entire discussion is a way to gauge what each system is really capable of.

In the spirit of the Kardashev Scale, I’ve put together my own ranking system for AIs, which we’ll be using at Machine Colony.

(Note: I’ll try to provide as much background and example information as I can without this post reading like a cog-sci textbook. For those savvy in AI these examples will no doubt seem pedestrian, however I do try to illustrate concepts as much as possible, in a perhaps ill-fated effort to refrain from being too esoteric.)

Introducing the Octet System

I’m not much for fancy names, but in this case it was fitting: list the qualitative capabilities of an information system and break them down into eight distinct ranks, or “classes”. They are as follows:

Class Null

The zeroth class is something which does not qualify as an intelligent system whatsoever. While this can cover any manner of programs – be they in software or manifested in processes emerging from hardware – I choose to focus on software for this example.

Programs that fall into class null have the following characteristics:

  • They are only able to follow explicit, predetermined / deterministic logic.
  • They follow simple rules – “if this then that” – with no capacity to ever learn anything.
  • They have no capacity to make nuanced decisions, i.e. based on probability and/or data.
  • They have no internal model of the world (this is related to learning).
  • They do not have their own agency.

This is by far the broadest class, as it covers the vast majority of our software systems today. Most of the world’s software is programmed for a specific task and does not really need to be a learning, decision-making system.

Class I

Programs of this class have the following characteristics that are different from Class Null:

  • They have the ability to make rudimentary decisions based on data, based on some trained model.
  • They have the ability to learn from the outcomes of their decisions, and thus to update their core model.
  • As such, their behavior may vary over time, as the data changes and their model changes.
  • They are trained for a small number of narrow tasks, and do not have the capability to go outside those tasks.

This covers things like fraud detection agents, decent spam detection, basic crawling bots (assuming they’re at least using decision trees or something similar). The decision could be a classification action – marking something as spam, for instance – or it could be deciding how well a website ranks in the universe of websites.
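As a toy illustration of the “learn from the outcomes of decisions” property, here is a minimal online-learning loop. The spam-flagging framing and every name in it are invented for the example; a real Class I system would be considerably more involved:

```python
import numpy as np

# A Class I-style agent: one narrow task, one model, updated as outcomes arrive.
weights = np.zeros(3)

def decide_and_learn(features, true_label, lr=0.1):
    """Make a decision, then nudge the model toward the observed outcome."""
    x = np.asarray(features, dtype=float)
    decision = int(weights @ x > 0)                   # e.g. 1 = flag as spam
    weights[:] += lr * (true_label - decision) * x    # perceptron-style update
    return decision
```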

Class II

Classes I and II are the most similar on this scale, because the distinction is subtle: a Class II program has all of the capabilities of a Class I, but may have more than one core model and more than one domain of action. For instance, the new Google Translate app has natural language and vision capabilities, with different models for each. These models are linked and ‘cooperate’ to translate the words in the field of view of your smartphone’s camera.

Class Is, by contrast, only have one area that they’re focused on, and make decisions only based on the model relevant to that domain.

Class III

Programs of Class III have two main distinctions from Class IIs:

  • They have a basic memory mechanism, and the ability to learn from their history in those memories. This is more advanced than simply referring to data; these programs are actually building up heuristics from their own behavioral patterns.
  • They persist some form of internal model of the world. This assists in creation of memories and new heuristics in their repertoire.

Class IV

This class starts to loosely resemble the intelligence level of insects. Class IVs not only have some kind of internal model of the world, but they gain an ability that was essential to the evolution of all complex life on earth: collaboration.

Thus, their characteristics are:

  • They have the ability to collaborate with other agents/programs. That is, they have mechanisms with which to become aware of other agents, and a medium through which to communicate signals. (Think of ants and bees leaving chemical traces, for instance.)
  • Like Class IIIs, they have an internal model of the world. However, a Class IV’s model is more closely linked to its goal structure, and not merely ad hoc / bound in one model. Its internal model may be distributed across several subsystems / mathematical models; representations of complex phenomena or experiences are encoded across various components in its cognitive architecture (vision systems, memory components, tactile systems, etc).
  • They have the ability to perform rudimentary planning, driven by fairly rigid heuristics but with a little flexibility for learning.
  • They have the ability to form basic concepts, schema, and prototypes.

While it is not a prerequisite that Class IVs have multiple distinct sensory modalities – optic, auditory, tactile, olfactory systems – that serves as a good example of the complexity level these programs start to achieve. In an AI setting, a program could have hundreds of different types of inputs, each with their own data type and respective subsystem for processing the input. The key distinction is that in Class IV programs, these subsystems have a high degree of connectivity, and thus generate more complex behavior.

Many robotics software systems could also be placed in Class IV.

Class V

The capabilities of Class Vs begin to resemble more complex animal behavior, such as rats (but not as intelligent as many apes). The primary characteristics of note are:

  • They have the ability to reflect on their own ‘thoughts’. In an AI program setting, this would mean it has the ability to optimize its own metaheuristics. In rats, for example, this is manifested as a basic form of metacognition.
  • They have the ability to perform complex planning, especially in which they are able to simulate the world and themselves in it. Which leads to:
  • They have the ability, to some degree, to simulate the world in their minds. That is, they can perturb their internal model of the world without actually taking action, and play out the results of hypothetical actions. They can imagine scenarios based on their knowledge of the world, which is intimately related to their memories (recall the memory capability from Class III).
  • Related to their planning and internal simulation capabilities, they have the ability to set their own goals and take steps to achieve them. For instance, a rat may see two different pieces of food, decide that it likes the looks of one of them better than the other, set its goal to acquire the better-looking morsel, and subsequently plan a path to get it. The planning part relies on actions it knows it can do – how fast it can run, how far it can jump – and the terrain ahead of it, as well as memories of how it may have conquered that type of terrain before. Thus goal-setting and planning rely heavily on memory and the internal model.
  • They have a rudimentary awareness of their own agency in the environment. That is, when they are planning, they treat themselves as a factor in the environment they are simulating.

By this time, you have a program which is able to reflect, plan, collaborate with other agents, set goals, learn new behaviors and strategies for achieving its goals, and simulate hypothetical scenarios.

Class VI

You’re a class VI. So am I. Almost every human being is a Class VI – ‘almost’ because, well, this designation is questionable when applied to some politicians.

Programs of this class will start to resemble human-level intelligence and capability, though not necessarily human-like in nature. While in humans a major difference is more complex emotions, this scale does not consider emotions directly.

Artificial intelligence has not yet reached this level, and there are varying predictions as to when it will. The good news is that the expert consensus is clear on the idea that it will happen; it’s just that no one knows exactly when.

Key components of Class VI agents/AIs are:

  • Full self-awareness. The agent is fully aware of itself, its history, where the environment ends and it begins, and so on. This is related to consciousness, though Class VIs need not necessarily be conscious in a strict sense.
  • They have the ability to plan in the extremely long term, thinking ahead in ways that more basic systems cannot. Specific timescales are relative to its natural domain: for a person, decades; for an AI program, perhaps, seconds or hours.
  • Class VIs are able to invent new behaviors, processes, and even create other ‘programs’. In the case of a human, this is obviously an inventor creating a new way of solving a problem, or a software developer programming AIs somewhere in Brooklyn…

Class VII

This is what might well be referred to as ‘superintelligence’. While some AI experts are skeptical that this can be achieved, there does seem to be broad agreement that it is imminent. Nick Bostrom writes elegantly about the subject in his book of the same name.

While nobody knows exactly what this may look like, there are two major distinguishing factors which would almost certainly be present:

  • They have the ability to systematically control their own evolution.
  • They have the ability to recursively improve themselves, perhaps even at alarmingly minuscule timescales.

Their first ability is perhaps their most profound. While we humans do in some sense control our own fate, we do not yet have fine-grained control over the evolution of our brains and hence our cognitive abilities (though CRISPR may soon change that). With an artificial superintelligence, many limitations are removed. They can arbitrarily copy-paste themselves, ad infinitum, and perform risk-free simulations of their new versions. They also will be essentially immortal, so long as their hardware persists and has a supply of energy.

With respect to the second ability, one might imagine an ASI (artificial superintelligence) making multiple clones of itself, each clone independently applying a self-improving strategy, and then each one in turn performing a set of benchmark tests to determine which one improved the most from the original copy. Whichever agent performed the best would become the new master copy, while the others would be taken out of the running.

This is a supercharged evolutionary algorithm, essentially. The tests would be agreed upon in advance, and even perhaps written as a cryptographically secure contract (blockchain-based or otherwise) to prevent cheating. In doing so, the agent would keep improving up to hardware limits or some theoretical asymptote.
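A toy version of that clone-benchmark-select loop might look like the following. The improve and benchmark functions are placeholders for capabilities nobody knows how to build yet, so treat this as a thought experiment in code rather than a design:

```python
import copy
import random

def improve(agent):
    """Placeholder: each clone applies its own self-modification strategy."""
    clone = copy.deepcopy(agent)
    clone["skill"] += random.gauss(0.0, 1.0)   # stand-in for a real improvement attempt
    return clone

def benchmark(agent):
    """Placeholder: the pre-agreed, tamper-proof test suite."""
    return agent["skill"]

master = {"skill": 0.0}
for generation in range(100):
    clones = [improve(master) for _ in range(8)]
    best = max(clones, key=benchmark)
    if benchmark(best) > benchmark(master):
        master = best            # the best-scoring clone becomes the new master copy

print(master["skill"])
```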

The kind of scenario above is not at all unlikely in the near future.



AI capabilities currently exist somewhere between the Class IV and Class V marks, but are quickly marching toward Class VI. DeepMind and Facebook are leading the way in this direction, though other notable players are making important contributions. Certainly the brand-new OpenAI will have some interesting insights as well.

My hope is that this type of classification system, and others like it, will help bring some structure to the conversation around fast-emerging AI. With deeper clarity in our common language, we can have more meaningful and productive conversations about how we wish for this technology to advance and how it ought to be used. We owe it to ourselves to have the linguistic tools to accurately describe our progress.

A World Inside the Mind

Short post today, but a few things occurred to me as I was reading the paper on Bayesian Program Learning:

  • This form of recursive program induction starts to look suspiciously like simulation – something we do in our minds all the time.
  • Simulation may be a better framing for concept formation than via the classification route.
  • Mapping the ‘inner world’ to the ‘outer world’ seems a more sensible approach to understanding what’s going on. If you look at the paper, you also see some thought-provoking examples of new concept generation, such as the single-wheel motorbike example (in the images). This is the most exciting point of all.

A final design?

Combine all the elements together, along with ideas from my last post, and you get something that:

  1. Simulates an internal version of the world
  2. Is able to synthesize concepts and simulate the results, or literally ‘imagine’ the results – much like we do
  3. Is able to learn concepts from few examples
  4. Has memories of events in its lifetime / runtime, and can reference those events to recall the specific context of what else was happening at that time. That is, memories have deep linkage to one another.
  5. Is able to act of its own volition, i.e. in the absence of external stimulus. It may choose to kick off imagination routines – ‘dreaming’, if you will – optimize its internal connections, or do some other maintenance work in its downtime. Again, similar to how our brains do it while we sleep.

This starts to look like a pretty solid recipe for a complete cognitive architecture. Every requirement has been covered in some way or another, though in different models and in different situations. To really put the pieces together into a robust architecture will require many years of work, but it is worth exploring multi-model cognitive approaches.

If it results in a useful AI, then I’m all in.


What Computers Think About

It’s anybody’s guess, really.

That said, we do have some clues about what kinds of things go on inside the ‘minds’ of these little silicon monsters – under one specific condition.

There’s nothing more thrilling than seeing the spark of intelligence pop up in an AI system. You truly feel as though there’s something peeking out at you from behind all those numbers – a prototype of some bigger life form, curious and eager to grow as its learning algorithms wander the deep inner space of its mind. Yet all of this magic only takes place when there is active input, that is, when data is flowing into the system. This could mean it’s being fed images, text, audio, or raw numbers such as time series.

What happens when there’s no data flowing in?

Nothing. Nada. Zilch. Nichts.

The unexciting reality is that as soon as the flow of data is turned off, most of these things just go to sleep, so to speak. They stop. Nothing is happening in there, save for maybe a handful of residual calculations.

Spoiler alert: This is the condition mentioned at the top of the post. Computers ‘think’ about nothing when there is no data actively being fed to them.

I’ve written a bit about this before, but I reiterate that this stands in stark contrast to humans and other animals. Even in our sleep, our brains display a massive symphony of activity, repairing, adjusting, and moving memories around. Our most complex organ has a remarkable ability to reorganize itself, with or without the presence of sensory inputs. This should be a hint to us as to how we might build true AIs in the future.

So…what do computers think about, exactly?

As far as anybody can tell, nothing. Not yet anyway. For me, this is the most exciting part: creating things that persist in thinking even when there is no immediate sensory information flowing in.

Imagine if your laptop kept doing work even when you were away, and would notify you via your smartphone or smartwatch whenever it did/learned something particularly interesting or important.

Imagine if AIs of the future were trained as scientists, absorbing knowledge from human experts, and even when they ‘stalled’ would simply try new thought experiments. This is completely different from just finding patterns in data, and stopping when the data stops flowing.

Imagine if your house had a persistent AI that could think about cost-effective ways to improve itself, its property value on the market, or even how to organize a party within its halls. This requires a persistence that is not seen in today’s machine learning systems.

Who else is thinking about this?

I’ve referenced him before: Dr. Stephen Thaler has done some interesting work on this subject. Some others have mentioned ‘persistent AI’ in passing, but few seem to be focusing on it as a qualitative shift from passive machine learning systems like we use now. Even Siri is passive: it doesn’t do anything until you ask it a question or give it a command.

DeepDream and all of the related work got many people thinking about what AIs ‘see’ when they see the world, which is a similar idea to inner thought and persistence. This work shows some of what goes on inside, under the condition that the network is being actively stimulated from external sources.

To explore these ideas, I’ve been toying with simple AI models that ‘think’ about their past experiences. They have a long-term memory bank, and a way of referencing past experiences through various measures of context. The choice of context metric is extremely important, which I’ll expand upon in a later post. This was partially inspired by Facebook’s Memory Network architecture, which showed a big shift in how we think about cognitive AI systems.
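The core of the memory component in those toy models is roughly this shape. The context vectors and the similarity measure here are simplifications of what I’m actually experimenting with, and all of the names are illustrative:

```python
import numpy as np

class MemoryBank:
    """Long-term store of past experiences, retrievable by context similarity."""

    def __init__(self):
        self.contexts = []   # one context vector per remembered experience
        self.events = []     # the experience itself (text, state snapshot, etc.)

    def store(self, context_vector, event):
        self.contexts.append(np.asarray(context_vector, dtype=float))
        self.events.append(event)

    def recall(self, context_vector, k=3):
        """Return the k stored events whose context most resembles the query."""
        q = np.asarray(context_vector, dtype=float)
        sims = [q @ c / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-9)
                for c in self.contexts]
        top = np.argsort(sims)[::-1][:k]
        return [self.events[i] for i in top]

bank = MemoryBank()
bank.store([1.0, 0.0], "read a segment of text in an essay")
bank.store([0.0, 1.0], "learned a new melody from a song")
print(bank.recall([0.9, 0.1], k=1))   # -> the essay memory
```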

A past experience could be as simple as when it read a certain segment of text in an essay, or when it learned a new type of melody from a song. Our memories tend to be quite long, often entailing many sequences of intertwining sounds, sights, smells, and recollection of emotional states: “I was at the beach this last weekend, it felt amazing to just lie under the sun and relax.” In this example you’re recalling the time (or perception of it), the place, the feeling of the warm sun on your skin, and the emotions you had at the time – repose, tranquility.

AIs are different, especially the toy models I’m working on now. They don’t have high-level concepts yet, and it will be many years before they truly do. However, they can be enabled to have simple memories that are much smaller: a flashback to its particular state when it learned something new or performed some action, such as a prediction that it got correct. They also have a unique feature that we humans do not: their memories can be made essentially perfect. They can recall a scene with arbitrary accuracy, provided sufficient space in storage and/or memory.

The memory component is a necessary ingredient in useful persistence, as you would not want persistent AIs that forgot every interaction, everything they ever learned or did.

Imagine if Siri started remembering conversations it had had with you in the past? Admittedly, this could be dangerous for some.

What comes next?

My new AI startup, Machine Colony, will be taking up most of my time these days. However, as part of my work with Machine Colony and more generally, I’ll continue to investigate these working memory and long-term memory components in AI architectures. If I get really ambitious I may even attempt to publish something on it, be it a white paper or a full-on academic paper. At the very least, you can expect more blog posts and the occasional code snippet, likely in Python. I remain noncommittal because of time constraints, of course.

My sincere hope is that you finish reading this not necessarily with an immediate answer to a question in hand, but more that it provokes thought in the direction of “what if our devices and systems were truly intelligent and persistent?” It is worth thinking about how this may affect your life, because one thing is certain: it’s not a matter of if persistent AIs will emerge, but when.


Technology and the Limits of Convenience

I need an Apple Watch. Badly. I need it because the distance between my wrist and my coat pocket is simply too much, because I need to save that extra second when checking my phone for notifications. I need it because I need one more device to monitor my health.

While I’m at it, I also need an app that saves me a few seconds booking a table, finding the right bar for my Saturday night, and so on. Hell, anything that can save me those precious seconds throughout my hectic day will have my dollar.

The obvious facetiousness aside (I don’t need any of those things), I’m growing weary of seeing so many startups without missions. Don’t get me wrong, I’m all for creating amazing products. There is, however, an eventual lack of authenticity in the endless striving for greater convenience. These pure-convenience plays face continually diminishing returns.

It is easy to mistake one-off convenience for recurring utility. Entrepreneurs have gotten all too good at tricking themselves – and investors – that their ad hoc gimmick will scale to epic proportions, and keep compounding on its original value. Yet only the most central and important of product functions will see this happen. Along the peripherals, most utility is exhausted almost immediately.

It’s well known that the vast majority of startups fail. You can’t make them all into winners. You can, however, stress the importance of real, lasting, and growing utility. The more apps I see, and the more pitches I hear for the latest and greatest fad ever to hit the App Store, the more I feel like we need to focus on using technology to help us be better people. Beyond convenience, and beyond mere utility, there lies a realm of innovation wherein products are not actually products at all, but catalysts for social movements. Those same movements can help us become better citizens of humanity.

The job of the entrepreneur in all times before has been to find and capture economic opportunity. Now, however, a higher calling is in order: entrepreneurs need to rise to the challenge of taking the higher-level principle of creating things that bring about positive social change and finding specific opportunities to execute on, building toward a greater goal. Building a business still takes as much savvy and boldness as ever, but with the new requirements of relevance to social context and mission-driven offerings. It may be the hardest problem of all, but it will turn out to be the most worthwhile.