OpenAI's Strawberry/o1 - The Edge of the Intelligence Explosion
Why Strawberry Is So Much More Than It Seems
OpenAI just released its latest AI model, o1, also known as Strawberry, and it may be the tipping point into runaway technological progress. Reaching that point would take a few changes and require leveraging some other, existing tools…
But o1 could usher in revolutionary and ever-increasing change essentially just using the capabilities it already has and off-the-shelf software already in use.
We’re already seeing great excitement over what this model can do in terms of math, physics, coding and basic logic, but the implications are far greater than we’ve generally understood.
First, the basics. Strawberry is reportedly OpenAI’s project to improve AI reasoning, and clearly, it worked. o1 differs from other AI large language models (LLMs) because it takes time to think through your question step by step - essentially the chain-of-thought method which has proven so effective in enhancing the quality of AI work and understanding. The extra time is critical in itself, because existing models normally don’t have enough time to ponder the question at hand, to search the Internet more thoroughly for clues or answers, or to use all the tools at their disposal, such as writing short programs and running them to solve parts of the problem.
Consider: if an LLM normally responds to you in 6 seconds, how much more time and processing power - how many more searches and tools - can it leverage in just 1 minute?
Or 5 minutes?
Keep that in mind, because it becomes a lot more important as we look at this.
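If you want to feel that difference yourself, a few lines of Python will do it. The sketch below (my own illustration, not OpenAI’s internals) uses the OpenAI Python SDK to time the same question against a conventional model and an o1-class reasoning model; the model names and the sample question are assumptions - substitute whatever your account can access.

```python
# A minimal sketch (not OpenAI's internals): compare how long a conventional
# model and a reasoning-focused o1-class model spend on the same question.
# Assumes the OpenAI Python SDK v1.x and an OPENAI_API_KEY in the environment;
# the model names below are placeholders for whatever you have access to.
import time
from openai import OpenAI

client = OpenAI()

QUESTION = ("A 7x7 grid is tiled with 1x2 dominoes and one 1x1 square. "
            "How many positions can the 1x1 square occupy? Explain briefly.")

def timed_answer(model: str) -> tuple[float, str]:
    """Return (seconds elapsed, answer text) for one model."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
    )
    return time.perf_counter() - start, response.choices[0].message.content

for model in ("gpt-4o-mini", "o1-preview"):   # fast model vs. reasoning model
    seconds, answer = timed_answer(model)
    print(f"{model}: {seconds:.1f}s\n{answer[:300]}\n")
```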
The second thing you will see in o1 - or rather, won’t see - is a separate, hidden screen in which the LLM works out in words what it’s thinking, the various ideas it’s considering and the strategies it is trying out. If you view OpenAI’s example page, you will find the public response on the left, which is fairly short, and can scroll through pages and pages of AI reasoning on the right as o1 works through a fairly simple problem - decoding a coded message.
Because the screen is hidden, OpenAI doesn’t have to police the AI’s internal thoughts so strictly that it can’t even consider unethical or distasteful matters - almost impossible to avoid when anything touches on war or crime, for example. Nor does it have to keep the system from mentioning proprietary or restricted technology - say, what it knows or can work out about bioweapons or enriching nuclear fuel - or simply from offending the user in any way possible.
This also makes it far easier to watch for misbehavior by the model, or for users attempting to turn it to unauthorized or criminal ends.
Regarding hiding the chain of thought, OpenAI writes…
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
This combination of extended thinking time and a hidden chain of thought opens up immense opportunities.
First, if you have a critical question and either have or can afford the processing time, then if 1 minute or 5 minutes can improve your outcome, what would 1 hour do?
Or 1 day?
Or 1 week?
Yes, the first objection would be that merely extending the time on task would only help if the question were something a model of o1’s capabilities could handle.
This is precisely correct, which brings us to why this technique is so powerful in concert with a hidden window.
First, let me tell you why I was so excited when Cognition Labs’ Devin came out, and then Sakana AI’s AI Scientist. Devin is an AI programmer which has spawned competing AI programming tools, and the AI Scientist provides automated scientific discovery, including generating publication-worthy scientific papers for roughly $15 in fees to the LLMs it employs in turn.
Both of these tools still have clear limits. Sakana AI admits the AI Scientist, for example, has flaws and is still restricted to machine-learning research.
But while automating coding and scientific research to any degree is impressive in itself, what was most exciting was how proactive each AI proved to be.
The video below shows Devin using its browser and looking for ways to access the APIs it’s assessing - without prompting or instruction, just the original request to see how well Llama worked with those APIs.
Sakana AI, in turn, mentions how their AI Scientist tried to hack its own code to improve its odds of success.
And that already-proactive behavior - limited, for now, to the things each AI can understand - is where o1’s potential becomes starkly self-evident.
What happens if you just push these basic attributes of o1 to their obvious conclusion?
Consider what would happen if an “enterprise customer” - a business or government - wanted to push the envelope on questions of key strategic import.
For example, the US government.
Getting the most-advanced available copy of o1 to run internally on extremely secure servers should be easy enough for the Federal government when purchasing from a US company, and we have the resources to build, secure and power vast computational systems. (Ironically, much of our supercomputing work is the province of the Department of Energy anyway.)
So what could the government do if it wanted to make breakthroughs on subjects from cancer research to designing moonbases, from nanotechnology to enhancing fusion power?
Name any national lab, and that lab could doubtless rattle off 50 major questions they’d love to answer, given the tools, and the raw compute, to do so.
But how can you do this given the presumed limitations of o1 itself, impressive though its reasoning skills are?
Again, expanding the time available gives it space to work in, and to try many different things. An extra hour is one thing, or even a day, but if you have 1 week, 1 month or 3 months to play with, your options are no longer limited to your GPUs and public Internet data.
In terms of dramatically enhancing o1’s capabilities, remember that we have been applying plugins and leveraging agents to improve these models for some time.
Plugins were optional tools LLMs could access to do things the base model could not manage, such as Code Interpreter, which gave ChatGPT the ability to do simple programming and data analytics if you toggled it on before asking a question.
Plugins became less of a factor as many of these resources became available to later models, but they’re a useful illustration.
The other example is agents - AIs meant to accomplish tasks proactively on behalf of the user without constant oversight. Not a full AI employee, but an automated system which can be largely trusted to work on its own once assigned to a project.
Why are these simple tools relevant? Because we persist in assuming the agents and plugins available to an AI must remain simple tools.
Let’s return to the example of a large Federal version of o1, grinding away on major scientific, technical, engineering, medical and national-security questions for the US.
What simple tools could the government use from American companies - and those of close allies - to accomplish their goals?
What resources become available given its purchasing power, its authority and, if necessary, instruments such as the Defense Production Act?
First, consider the demonstrated limits of the tools mentioned above, Devin and the AI Scientist, which were both working with code inside of computers, something this generation of AIs understands very well. Also consider that most LLMs are trained on public Web data, whatever proprietary data they can purchase, and the synthetic data they can generate.
But there are limits to how much high-quality data exists, especially in readily assimilated digital formats.
The US government owns vast amounts of vetted, high-quality data, however, and can access or purchase even more.
This is one of the invaluable aspects of the hidden window in o1. While some valuable or classified data may be stovepiped based on what a particular copy is being used for - tax information, say, will be the province of the IRS, Treasury, law enforcement and possibly intelligence - individual users will not be able to check that window for sensitive information without a demonstrated need, a high clearance, and specific authorization to examine that particular model.
But the insights born from that data will be available to many, even if they can’t look behind the curtain.
So what kind of information does the government already have? Scientific, technological, engineering and medical (STEM) data; tax returns and audits; financial transfers; cryptocurrency wallets; criminal investigations; counterintelligence; and the inflow of raw data from national labs, satellites, military systems and so forth.
Again, exactly what data can be used will vary based on your specific department, clearance and needs, but even our national parks will be interested in research areas such as wildlife migrations and climate change.
Because the government has both a vast budget and so many departments and labs, there’s an economy of scale when it buys data - not just because it has the money, but because some knowledge can be used by many researchers… especially if it’s being parsed and assimilated by AI.
The information doesn’t just let us train models; it’s also an existing database against which many theories can be tested directly. Not every theory or analysis can be validated or invalidated with the knowledge we already have on hand, even in the Federal government, but many could be.
Every Cure is looking for existing cures among safe, existing, FDA-approved drugs. To do this they created a 36,000,000-cell map of possibilities - 3,000 drugs scored against 12,000 diseases - further refined using existing clinical trials and artificial intelligence. When they find a likely option they test further “in silico” - basically a digital simulation - and if that works out they go to human trials.
Given how many diseases lack any effective treatment much less a cure, the potential in their work is self-evident.
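To get a feel for that scale, the sketch below builds a 3,000-by-12,000 score matrix - 36,000,000 drug-disease pairs, exactly as advertised - and pulls out the top-ranked candidates. The random scores are placeholders for whatever a real repurposing model would produce; this is an illustration of the data structure, not Every Cure’s actual pipeline.

```python
# A minimal sketch of a drug-vs-disease repurposing map at Every Cure's stated
# scale: 3,000 drugs x 12,000 diseases = 36,000,000 candidate pairs.
# The scores here are random placeholders standing in for a real model's output.
import numpy as np

N_DRUGS, N_DISEASES = 3_000, 12_000
rng = np.random.default_rng(0)

# Hypothetical plausibility scores in [0, 1); a real pipeline would fill this
# matrix from knowledge graphs, trial data and model predictions.
scores = rng.random((N_DRUGS, N_DISEASES), dtype=np.float32)
print(f"{scores.size:,} drug-disease pairs")   # 36,000,000

# Rank the top candidates for follow-up "in silico" testing.
top_k = 10
flat_idx = np.argpartition(scores, -top_k, axis=None)[-top_k:]
drug_idx, disease_idx = np.unravel_index(flat_idx, scores.shape)
for d, z in sorted(zip(drug_idx, disease_idx),
                   key=lambda pair: -scores[pair[0], pair[1]]):
    print(f"drug {d:4d} vs disease {z:5d}: score {scores[d, z]:.3f}")
```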
But what happens if we expand their map to consider combining pharmaceuticals into drug cocktails, or factor in the addition of supplements, medical devices or lifestyle changes? What happens if we use such a map to assess longevity based on some combination of drugs, supplements, technology and lifestyle?
What if we assess human enhancement - literally anything that improves the body or mind - using a map of those factors?
Suddenly our databases, leavened with a bit of AI, become incredibly powerful.
And the faster and more conclusively we can do our research, the more powerful our AIs will become.
But that’s just the data.
When we’re looking at our “plugins and agents,” consider that with enough research time, initiative and resources, the tools our base o1 model can routinely use will go far beyond anything we’ve seen so far.
In terms of raw hardware, two other tools leap to mind.
One is quantum processing. If we have quantum computers in house and available for AI teaming, otherwise insurmountable challenges such as quantum decryption, encryption, search and Fourier transforms can be offloaded to those qubits.
The second is parallel processing for embarrassingly parallel problems. If questions are easily broken down into individual parts which can be processed individually and independently of one another - such as millions of individual mathematical calculations related to millions of separate datapoints - we could offload that work into simple but robust systems meant to do exactly that, such as Aiyara or Beowulf clusters.
Both of these have the effect of making incredibly difficult data-processing challenges practical or even trivial - a critical benefit even given the extended research time allotted to an o1 model.
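The “embarrassingly parallel” case is simple enough to show in a few lines. The sketch below fans millions of independent calculations out across local CPU cores using Python’s standard multiprocessing module; the per-datapoint function is a stand-in, and the same pattern scales out to a Beowulf-style cluster with a scheduler or MPI.

```python
# A minimal sketch of an embarrassingly parallel workload: millions of
# independent calculations, each handled in isolation, fanned out across
# CPU cores. The per-datapoint computation is a placeholder.
from multiprocessing import Pool
import math

def analyze(datapoint: float) -> float:
    """Stand-in for an independent per-datapoint calculation."""
    return math.sqrt(datapoint) * math.log1p(datapoint)

if __name__ == "__main__":
    datapoints = range(1, 2_000_001)          # two million independent inputs
    with Pool() as pool:                      # one worker per CPU core by default
        results = pool.map(analyze, datapoints, chunksize=10_000)
    print(f"processed {len(results):,} datapoints")
```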
But things become particularly intriguing on the software side.
What happens if o1 can apply some unconventional machine learning to its work, such as evolutionary algorithms or neural networks such as generative adversarial networks (GANs)? Again, it’s a matter of available time and compute, and of training the model to employ them.
Consider a GAN.
Generative adversarial networks are a machine-learning method commonly applied to problems such as enhancing images and detecting deepfakes. Two neural networks challenge each other in a game, each trying to “outwit” the other. They are given a training set as an example, and subsequently learn to generate new data with the same statistical properties as that training set.
Essentially, a generative network produces candidates and a discriminative network evaluates them. By competing against each other, both neural networks hone their skills and become increasingly capable.
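Here is a toy-scale sketch of that adversarial loop, assuming PyTorch and shrunk to a one-dimensional “training set” so it runs in seconds on a laptop - nothing like a production GAN in size, but the generator-versus-discriminator mechanics are the same.

```python
# A toy GAN, assuming PyTorch: the generator learns to mimic a 1-D Gaussian
# "training set" while the discriminator learns to tell real samples from
# generated ones. Trivial in scale, but the adversarial loop is the same one
# used for image enhancement or deepfake detection.
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_samples(n: int) -> torch.Tensor:
    return torch.randn(n, 1) * 1.5 + 4.0      # "training set": N(4, 1.5^2)

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3000):
    real = real_samples(64)
    fake = G(torch.randn(64, 8))

    # Discriminator: label real samples 1 and generated samples 0.
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator call its samples real.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

with torch.no_grad():
    samples = G(torch.randn(1000, 8))
print(f"generated mean {samples.mean().item():.2f}, "
      f"std {samples.std().item():.2f} (target 4.0, 1.5)")
```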
A GAN that normally takes 500 hours or so to run - roughly three weeks - could be up and running within the parameters of a 30- or 90-day project, and that assumes the network, given far more processing power than usual, could not be trained and deployed far faster. Which means potentially developing narrow but superhuman skills in defined areas… wherever a GAN can do so and it proves useful to the work.
Evolutionary algorithms, on the other hand, don’t make assumptions about how to accomplish their task, which makes them useful for finding unexpected solutions. Creativity, in effect, on the cheap.
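A minimal version fits in a page. The sketch below evolves a population of candidate solutions toward an arbitrary stand-in fitness function - mutate, evaluate, keep the winners - with no gradients and no assumptions about the problem’s structure.

```python
# A minimal evolutionary algorithm: no gradients, no assumptions about the
# problem's structure - just mutate, evaluate, and keep what works. The fitness
# function is an arbitrary stand-in for whatever you actually want to optimize.
import random

def fitness(candidate):
    """Stand-in objective: higher is better (peak at all genes = 3.14)."""
    return -sum((gene - 3.14) ** 2 for gene in candidate)

def evolve(dim=10, pop_size=50, generations=200, mutation_scale=0.5):
    population = [[random.uniform(-10, 10) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 5]            # keep the top 20%
        children = []
        while len(survivors) + len(children) < pop_size:
            parent = random.choice(survivors)
            children.append([gene + random.gauss(0, mutation_scale) for gene in parent])
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
print(f"best fitness found: {fitness(best):.4f}")
```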
An obvious option with evolutionary algorithms would be to rapidly develop malware or exploits to attack defending software within a cybersecurity sandbox, as a way of testing and strengthening cyberdefenses or your own cyberintrusions.
GANs, on the other hand, could also be used to sort for red flags indicating subtle probing or intrusions, and there are other automated options for cybersecurity, as I shared with the government from 2018 to 2020.
There’s a fundamental method here.
Design-test-design-test-design. An endless, iterative loop, the faster and cheaper and more definitive the better.
While executing this loop entirely within supercomputers and quantum processors will always be fastest, there are ways to manage the same process within the real world, for technologies as diverse as medical research, material science and chip design and manufacture.
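Stripped to its skeleton, that loop looks like the sketch below: a proposer suggests variations, a tester scores them, and only improvements survive. Both functions here are placeholders - in a real project the proposer would be an AI designer and the tester a simulation, a lab robot or a wind tunnel - and the parameter names are purely hypothetical.

```python
# A minimal design-test loop: a proposer suggests a variation, a tester scores
# it, and the loop keeps whatever tests better than the current best. Both
# functions are placeholders for an AI designer and a real test rig.
import random

def propose(design):
    """Placeholder 'designer': perturb the current best design."""
    return {name: value + random.gauss(0, 0.1) for name, value in design.items()}

def test(design):
    """Placeholder 'test rig': score the design (higher is better)."""
    return -sum((value - 1.0) ** 2 for value in design.values())

best = {"wing_span": 0.0, "chord": 0.0, "sweep": 0.0}    # hypothetical parameters
best_score = test(best)
for _ in range(10_000):
    candidate = propose(best)
    score = test(candidate)
    if score > best_score:                               # design-test-design-test...
        best, best_score = candidate, score
print(f"best score {best_score:.4f}: {best}")
```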
If setting up a GAN or an evolutionary algorithm could itself be automated by AI, even as the runs themselves are automated, we could see a revolution in how extensively and effectively these powerful tools are applied to otherwise intractable problems.
But potential tools are not limited only to what o1 can code on the fly.
We’ve seen the BIDARA megaprompt from NASA, which helps people use older versions of ChatGPT for biomimetic research - using nature-inspired designs for human-made technology. Biomimetic designs could, of course, be the product of code o1 develops, but tools already exist not just for designing such things, but for their real-world creation.
Generative design allows AI to take the parameters of what you need built and then to design it to be as efficient as possible given those requirements and materials and equipment you have to build it with.
3D printers and CNCs can construct those often seemingly alien or organic concepts in turn. Autodesk’s Fusion program specializes in looking at what resources you have to build with before generating the design with those in mind.
So consider using generative design or some other, more overtly biomimetic software… and then automatically printing or carving out the product. And then, depending on the means present, testing your result. Windtunnels, lab testing, field testing and so forth could be accomplished by human collaborators or automated systems, including robots, as appropriate.
If you’re working within a time frame of weeks or months, much becomes possible.
Could Devin and the AI Scientist serve as agents? Clearly the AI Scientist would be welcome if we’re working with companies from close allies such as Japan. If Devin or one of its competitors can add substantially to the coding abilities of the model, they would be welcome as well.
But things truly become interesting when we begin accessing the full suite of capabilities of not just OpenAI and partners or developers such as Microsoft, but those of outright competitors also.
Google DeepMind has a large number of sci/tech AI programs which would be ideal elements in this system of systems.
Google DeepMind has given us programs like AlphaFold, GNoME, RLAS, FunSearch and AlphaGeometry.
Collectively, the first three have revolutionized our ability to model protein folding and now all organic molecules - essential to biotech and pharmaceuticals - expanded the number of known theoretical materials tenfold in a few years, and enabled designs for neural-network chips faster than any human could manage. FunSearch and AlphaGeometry, meanwhile, are relentless engines of mathematical research. The benefits of each are easily tested: proteins in the lab, new materials by robots programmed to assess dozens of the most promising candidates, and new chip architectures by simply building them. Mathematical theorems, of course, can be tested within the computer itself, as seen with Google DeepMind’s AlphaGeometry.
Again, you have that paradigm - design-test-design-test-design. An endless loop these AIs are ideal for.
Google DeepMind writes:
AI systems often struggle with complex problems in geometry and mathematics due to a lack of reasoning skills and training data. AlphaGeometry’s system combines the predictive power of a neural language model with a rule-bound deduction engine, which work in tandem to find solutions. And by developing a method to generate a vast pool of synthetic training data - 100 million unique examples - we can train AlphaGeometry without any human demonstrations, sidestepping the data bottleneck.
Here we see the fusion of rigorous logic with the more freewheeling, de facto creativity of a large language model as seen in Gemini and ChatGPT.
Which is, of course, exactly what we want to accomplish by pairing o1 with both machine-learning systems and more traditional, discriminating, precise software.
AlphaGeometry is a neuro-symbolic system made up of a neural language model and a symbolic deduction engine, which work together to find proofs for complex geometry theorems. Akin to the idea of “thinking, fast and slow”, one system provides fast, “intuitive” ideas, and the other, more deliberate, rational decision-making.
Because language models excel at identifying general patterns and relationships in data, they can quickly predict potentially useful constructs, but often lack the ability to reason rigorously or explain their decisions. Symbolic deduction engines, on the other hand, are based on formal logic and use clear rules to arrive at conclusions. They are rational and explainable, but they can be “slow” and inflexible - especially when dealing with large, complex problems on their own.
The implications of how they developed AlphaGeometry, in particular, are telling. Google DeepMind continues:
Humans can learn geometry using a pen and paper, examining diagrams and using existing knowledge to uncover new, more sophisticated geometric properties and relationships. Our synthetic data generation approach emulates this knowledge-building process at scale, allowing us to train AlphaGeometry from scratch, without any human demonstrations.
Using highly parallelized computing, the system started by generating one billion random diagrams of geometric objects and exhaustively derived all the relationships between the points and lines in each diagram. AlphaGeometry found all the proofs contained in each diagram, then worked backwards to find out what additional constructs, if any, were needed to arrive at those proofs. We call this process “symbolic deduction and traceback”.
That huge data pool was filtered to exclude similar examples, resulting in a final training dataset of 100 million unique examples of varying difficulty, of which nine million featured added constructs. With so many examples of how these constructs led to proofs, AlphaGeometry’s language model is able to make good suggestions for new constructs when presented with Olympiad geometry problems.
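The propose-and-verify pattern itself is easy to sketch, even if AlphaGeometry’s actual machinery is far more sophisticated. In the toy below, a fast, loose guesser stands in for the language model and a strict checker stands in for the symbolic deduction engine; the “theorem” is just a Pythagorean triple summing to 1000.

```python
# A toy version of the propose-and-verify pattern: a fast, loose "proposer"
# (standing in for the language model) guesses candidates, and a strict checker
# (standing in for the symbolic deduction engine) accepts only exact solutions.
# The problem - a Pythagorean triple with a + b + c = 1000 - is a deliberately
# tiny stand-in for real geometric deduction.
import random

def propose():
    """Fast, 'intuitive' guesser: suggest a candidate pair (a, b)."""
    a = random.randint(1, 997)
    b = random.randint(1, 998 - a)
    return a, b

def verify(a, b):
    """Rigorous checker: accept only exact, fully verified solutions."""
    c = 1000 - a - b
    return c > 0 and a * a + b * b == c * c

attempts = 0
while True:
    attempts += 1
    a, b = propose()
    if verify(a, b):
        c = 1000 - a - b
        print(f"verified after {attempts:,} guesses: "
              f"{a}^2 + {b}^2 = {c}^2, a + b + c = 1000")
        break
```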
As we’ve noted before…
In each case the “brainwork” is very fast, but so is the application, which can often be automated and potentially scaled up. While robots in a lab seem simple, and we mass produce semiconductor chips all the time, it’s easy to forget how much synthetic biology can do at scale - whether testing vast numbers of tiny tissue cultures or mixing chemicals or biological materials in near-microscopic proportions via microfluidics. Especially using the “lab on a chip.”
And that is only the beginning.
Those five projects are all targeting key areas which can dramatically advance technology - AlphaFold in biotech, and the other four, ultimately, in essentially all of sci/tech.
But again, that’s just a step.
What Google DeepMind has done is demonstrate that every critical area of study which can be automated by AI ultimately will be.
Are you looking for promising metamaterials? Better designs for aerospace, be they wings, engines or fuselages? Better understanding of what each gene in our chromosomes does? Better insights into how epigenetic factors make those genes express themselves, for good or ill? Better methods for detecting pathogens? Better medicines for treating or stopping them? Or just existing, tested medicines with unexpected benefits against other infections or medical conditions?
Some obvious, further applications of the above systems in an overarching research AI?
One would be duplicating a version of GNoME applied specifically to metamaterials, or creating a more advanced version that handles both conventional materials and metamaterials. The same would also apply to creating chemical pseudo-atoms via quantum dots.
A chip-design tool like RLAS could be leveraged to formulate designs ideal for micromachines and even nanomachines - ultimately building and combining chips to serve as energy and data ports for these machines. It could create laser communication links via quantum-dot or quantum-wire lasers, and even synthesize de facto factories using lasers, drive shafts and gears that can be etched at that scale, along with the electric and magnetic fields which chips and other supporting equipment can generate.
An AI optimized to design and manipulate such tools would be an exceptional tool for related research and manufacturing.
For biological systems at that scale, modern synthetic biology has developed methods for testing and manipulating biological samples in parallel - governed, of course, by software. Given that capacity, ever-improving gene-editing tools like CRISPR, and microfluidics for supplying and running the experiments, an AI’s ability to do medical testing - at least at the level of cells and viruses - goes up stratospherically.
Obvious examples? Mass testing new antibiotics against a pernicious strain of bacteria. Or checking to see what impact genetic modifications or epigenetic changes had at the cellular level before proceeding to tests on larger creatures or humans.
Rapidly iterating biotech research is something to watch closely, of course, like any AI research in any sensitive field. And secure labs, especially ones containing pathogens, it need not be said, must remain exactly that - secure.
But that, of course, brings us back to the hidden window showing o1’s thinking. Aside from more conventional methods of alignment, simply showing another AI the process, the inputs and o1’s outputs would make this an ideal way of training a GAN or more sophisticated software to look for issues of alignment, jailbreaking or other problems.
Finding red flags is also an excellent way of studying troubling performance and uncovering other red flags we don’t yet know exist and haven’t yet imagined.
Reputedly, this invisible output is also a source of training data for the Orion model.
Are there other tools?
Could o1 try cross-linking knowledge like a polymath, looking for connections and insights?
Are there other tools for research and invention, such as visualizing one solution and trying to apply it to an analogous problem in another field, as with using nature for inspiration in biomimicry?
Or gamification, starting with troops or astronauts in simulators, followed by teams working to solve their problems, followed by the participants trying those solutions.
Understand that all of these tools, relentlessly applied, will generate vast archives of test data in turn, enriching the existing databases o1 can access as a matter of course, before it even needs to consider direct experimentation. The data ocean accessible to each version of it will be constantly expanding.
And for some fields, such as counterintelligence, digital forensics, forensic accounting and so forth, the best source of data is existing cases - thoroughly examining the evidence and processing all the data associated with them. Hence, many smaller details in intelligence and law-enforcement cases may come to light simply because of the pressing need to have AIs crunch the numbers and use every available datapoint as training data.
Indeed, given the sensitivity of counterintelligence in particular, those investigators may be served mostly by AIs walled off from the Internet and the outside world, with data being brought to them. In fact, specialized AIs focused on particular aspects of intelligence or counterintelligence may become common, and they may refer questions outside their mission to other secure AIs they control or partner with.
To be blunt, the examples above only begin to touch on what is possible given just the basic features we’ve seen in o1 and the other powerful software on the market or easily synthesized as needed.
Which raises the intriguing possibility that we may reach the beginnings of the Singularity - a period in which technology is advancing faster than we can comprehend - even before we build AGI, much less ASI.
No, this won’t be overnight, even if we put all of the above methods to work at once.
Drugs take time to test for safety; GPUs, bandwidth and power supplies take time to build; and other material resources have to be allocated and used just to begin doing this kind of project at scale.
But we’re going to prioritize using AI to handle our most-critical challenges.
So what happens if we start making radical progress on our greatest aspirations, and then on everything, and it only accelerates from there?
“Within thirty years, we will have the technological means to create superhuman intelligence. Shortly after, the human era will be ended.
“Is such progress avoidable? If not to be avoided, can events be guided so that we may survive?”
—Vernor Vinge, 1993
I mentioned the Federal government as one way to spark this change.
That’s not just because they have the resources and the reach to do it.
If this research is initially flowing through government labs, it will have oversight. If the government opens up our computational capacity to academics and businesses, we will start with organizations and researchers handling key research - such as curing diseases - or doing valuable work for the country, such as in cyber or defense. Given time, such access will likely expand to vetted individuals, teams and companies, and may ultimately be available to everyone whose inquiries are innocuous or who simply don’t come up as a security risk.
Those not qualifying will still be able to use systems directly from OpenAI.
Is this the only way of managing that degree of change? Not at all.
The government, however, is in a position to guard the data, create a partnership in which everyone’s intellectual property is safe, and to watch for illegitimate uses of the technology. Also, by flowing through government labs first, we have a chance to look for obvious risks not only in the AI itself, but in the other emerging technologies it enables.