AI Investing Bots: Claude, GPT and LLMs in the Markets

Disclaimer

This article is for informational purposes only and does not constitute financial or investment advice. Past performance of any trading strategy, AI-powered or otherwise, is not indicative of future results. All investment activity carries risk of loss.

In early 2023, a researcher at the University of Florida published a paper demonstrating that ChatGPT could predict short-term stock price movements with statistically significant accuracy by analysing news headlines — outperforming a selection of established sentiment analysis tools on the same task. The paper was widely shared, widely misunderstood, and almost certainly responsible for a wave of retail traders building their first GPT-powered investing bots.

Two years on, the landscape is substantially more sophisticated, considerably more cautious, and a great deal more interesting. Large language models — Claude, GPT-4, Gemini, and the open-source variants that have proliferated since — are now being used across the full spectrum of investing activity: from earnings call analysis and regulatory filing summarisation to real-time sentiment scoring, portfolio construction, and fully autonomous trading systems. The results are mixed in exactly the way that anyone with genuine experience in financial markets would predict. And the lessons being drawn from those mixed results are beginning to matter.

What LLMs Are Actually Good At in Investing

The most consistent finding across the research literature and practitioner experience is that large language models excel at tasks that require processing large volumes of unstructured text quickly and extracting structured insights from it. In financial markets, this is an extremely valuable capability — because financial markets generate an extraordinary volume of text that moves prices, and traditional quantitative approaches struggle to process it at the required speed and nuance.

Earnings Call and SEC Filing Analysis

Perhaps the most well-documented legitimate use case for LLMs in investing is the analysis of earnings calls, analyst day transcripts, and regulatory filings. A skilled human analyst can read an earnings call transcript and identify tone shifts, hedging language, changes in management's forward-looking statements, and subtle divergences from prior quarter commentary. An LLM can do the same across hundreds of transcripts simultaneously, flag the ones that deviate significantly from prior patterns, and produce structured summaries that allow an analyst to prioritise their attention. Hedge funds and asset managers who previously employed teams of analysts for this work are now deploying LLM pipelines that complete equivalent coverage in a fraction of the time.

Claude in particular has become a favoured tool for this application among practitioners who cite its longer context window — allowing an entire 10-K annual report or multi-quarter earnings history to be processed in a single prompt — and what users describe as a more careful, qualified approach to financial analysis than some competing models. Where GPT-4 might produce a confident-sounding assessment of a company's prospects, Claude tends to flag the assumptions embedded in that assessment and note where the evidence is ambiguous. For financial analysis, this epistemic caution is a feature, not a limitation.

Practitioner Example

Earnings Sentiment Pipeline — Quantitative Equity Fund

A mid-size quantitative equity fund documented a pipeline using Claude to analyse earnings call transcripts for 300 S&P 500 companies per quarter. The model scored each call on twelve dimensions including management tone, guidance specificity, analyst question deflection patterns, and language entropy relative to prior quarters. The resulting scores were used as a factor input alongside conventional quant signals. The fund reported statistically significant improvement in next-quarter return prediction for the subset of stocks where the LLM sentiment score diverged sharply from analyst consensus — suggesting the model was capturing information that conventional analysis was missing.

News Sentiment and Real-Time Signal Generation

The University of Florida paper that sparked widespread interest in LLM-based trading was specifically studying news sentiment — the ability of GPT to classify whether a news headline about a company was likely to have a positive or negative short-term price impact. This is a task where LLMs have genuine advantages over earlier natural language processing approaches: they understand context, they handle ambiguity, and they can reason about second-order effects in a way that keyword-based sentiment tools cannot.

The research found that GPT's sentiment classifications showed a statistically significant relationship with subsequent one-day and three-day stock returns — particularly for stocks with lower analyst coverage, where information was less efficiently priced. This finding has been replicated in several subsequent academic studies and is now being used in production by a growing number of quant funds. The caveat, which the original research was careful to note, is that the signal degrades rapidly as more market participants exploit it — a classic example of the adaptive market hypothesis in action.

Regulatory and Legal Document Processing

Investment decisions frequently require rapid processing of dense regulatory documents — merger agreements, prospectuses, material contract disclosures, and government agency rulings. LLMs have proven highly capable at extracting the key terms, flagging unusual provisions, and summarising the material implications of these documents for a specific investment position. Law firms and M&A advisory teams have been early adopters of this capability, and its extension to investment due diligence processes is now routine at many major financial institutions.

How People Are Building AI Investing Bots

The architecture of an LLM-based investing system varies significantly depending on the use case, but several common patterns have emerged in the practitioner community. Understanding these patterns is useful context for evaluating both the successes and the failures.

The Research Assistant Architecture

The simplest and most widely used pattern is the LLM-as-research-assistant: a human investor uses Claude or GPT to process documents, summarise information, generate hypotheses, and pressure-test investment theses through Socratic dialogue. The LLM has no direct access to trading systems and no autonomous decision-making authority. All execution decisions remain with the human. This pattern has the lowest risk and, according to practitioner accounts, delivers genuine productivity improvements — experienced investors report being able to cover significantly more names with the same research depth when using LLM assistance.

The Signal Generation Architecture

More sophisticated systems use LLMs to generate quantitative signals that feed into conventional algorithmic trading frameworks. In this architecture, the LLM processes text data — news, filings, social media, earnings transcripts — and outputs structured scores or classifications. These scores are then used as inputs to a quant model that makes trading decisions based on a combination of the LLM signal and traditional factors. The LLM is part of the signal generation stack, not the decision-making layer. This architecture allows for conventional risk management and position sizing frameworks to remain in control of actual trading activity.

The Autonomous Agent Architecture

The most ambitious — and most problematic — architecture gives the LLM direct access to trading APIs and allows it to make autonomous buy and sell decisions. In these systems, the LLM functions as the investment manager: it formulates hypotheses, decides on positions, sizes trades, and manages risk. Several prominent examples of this architecture have been built and documented publicly, and the results have been instructive in ways their creators did not always intend.

"The most common mistake in AI trading bot construction is giving the model authority it hasn't earned trust for. Start with research. Graduate to signals. Treat autonomy as a destination, not a starting point."

Documented Successes: What Has Actually Worked

Separating genuine successes from survivorship bias is the central challenge in evaluating AI investing systems. Practitioners who have made money with LLM-assisted approaches are more likely to write about it than those who have lost money, and the financial media has a structural incentive to amplify the most dramatic success stories. With that caveat noted, the following categories of success have been documented with sufficient rigour to merit attention.

Academic Research: Validated LLM Outperformance

Several peer-reviewed academic studies have documented statistically significant outperformance from LLM-based strategies in controlled backtesting environments. The University of Florida sentiment study mentioned above was followed by work from researchers at the University of Chicago and MIT's Sloan School of Management demonstrating that LLM-generated earnings call sentiment scores predict subsequent returns beyond what conventional factors can explain. A 2024 paper from researchers at Hong Kong University documented a long-short strategy based on GPT-4 earnings sentiment that generated Sharpe ratios above 1.5 in out-of-sample testing — a level of risk-adjusted performance that, if it persists in live trading, would represent genuine alpha.

The consistent pattern in this research is that LLMs add most value in information environments where text is information-dense and where conventional quantitative approaches struggle — particularly for smaller-cap stocks with lower analyst coverage and in the immediate aftermath of unexpected corporate events where human attention is insufficient to process all available information simultaneously.

Practitioner Successes: Institutional Adoption

Several institutional asset managers — including representatives of major hedge funds who have spoken at academic and practitioner conferences without attribution — have described successful deployment of LLM-based research tools that have materially improved their investment process. The most commonly cited benefits are not direct alpha generation from LLM trading signals, but rather efficiency gains that allow human analysts to cover more ground, and consistency improvements that reduce the random variation in research quality that comes from human cognitive load and attention constraints.

Renaissance Technologies, Two Sigma, and other quantitative powerhouses have not disclosed their use of LLMs specifically — but both firms have hired extensively from the natural language processing research community in recent years, which is suggestive. Citadel's Securities division has been more explicit about its use of large language models for regulatory document analysis and market microstructure research.

Individual Developers: Open Source Frameworks

The open source community has produced several frameworks for LLM-based investing that have accumulated significant usage and community validation. FinGPT — an open-source financial LLM developed by researchers at Columbia University and published on GitHub — has been fine-tuned specifically for financial text and has demonstrated competitive performance on financial NLP benchmarks. BloombergGPT, trained exclusively on financial data, demonstrated strong performance on financial tasks in its original evaluation, though it remains proprietary. The TradingGPT and FinAgent frameworks on GitHub have attracted thousands of contributors and represent the most accessible entry points for developers building LLM-based investing systems.

Documented Failures: What Has Gone Wrong

The failures of AI investing systems are at least as instructive as the successes, and they cluster around a consistent set of failure modes that practitioners have now identified with some precision.

Hallucination in Financial Contexts

All major LLMs are capable of generating plausible-sounding but factually incorrect information — a phenomenon known as hallucination. In financial contexts, this failure mode is particularly dangerous. An LLM that confidently states incorrect financial figures, misattributes a quote to a company executive, or fabricates a regulatory development that did not occur can generate investment decisions based on information that does not exist. Several documented cases involve LLM systems that generated detailed and specific-sounding analysis of company financials that, when checked against source documents, were partially or wholly fabricated.

Claude has a reputation among practitioners for being relatively conservative about financial claims — more likely to say "I'm not certain of this figure" or "you should verify this against the original source" than to confabulate a specific number. This epistemic caution is valuable in financial applications, though it is not a complete safeguard. Any LLM output used in investment decision-making should be verified against primary sources.

Cautionary Example

The Confident Balance Sheet Error

A retail investor using a GPT-based portfolio analysis tool asked the model to calculate the net cash position of a mid-cap technology company. The model returned a specific, confident figure derived from what appeared to be a detailed reading of the balance sheet. The figure was wrong — the model had confused cash equivalents with total current assets and compounded the error with an incorrect treatment of a convertible note. The investor did not verify the figure against the actual filing. The position was sized based on the incorrect analysis. When the actual cash position was reported in the next quarter, it was significantly lower than the LLM had stated.

Prompt Injection and Adversarial Manipulation

LLM-based trading systems that ingest external text data — news feeds, social media, earnings call transcripts — are potentially vulnerable to prompt injection attacks, where adversarial content embedded in the text data manipulates the model's behaviour. A sufficiently sophisticated actor could, in theory, craft a press release, social media post, or earnings call addendum designed to trigger specific responses from LLM-based trading systems. While documented cases of intentional financial prompt injection remain rare, the theoretical vulnerability is significant enough that serious practitioners treat it as a genuine risk in system design.

Overfitting to Training Data and Market Regime Changes

LLMs were trained on historical data that reflects the market regimes of the past. Strategies that work well in backtesting against historical data — even when the backtest is conducted carefully, with proper out-of-sample validation — can fail catastrophically when market regimes change in ways that are not represented in the training data. This is not a problem unique to LLMs — it is the fundamental challenge of all data-driven investing — but it is amplified by the fact that LLMs can generate highly confident-sounding analyses of situations they have not genuinely encountered.

The Autonomous Bot Disaster Cases

The most dramatic failures have occurred in fully autonomous systems — LLM agents given direct access to trading APIs with minimal human oversight. Several such systems have been documented in online communities, and the failure modes are surprisingly consistent: the model constructs an internally coherent but fundamentally flawed investment thesis; it sizes positions aggressively based on its own confidence; it misinterprets its risk management instructions in edge cases; and it continues executing its strategy in deteriorating conditions because it lacks the intuitive "something is wrong here" signal that experienced human traders develop.

One widely discussed example involved a developer who built a Claude-based trading agent designed to identify and trade mean-reversion opportunities in small-cap stocks. The agent performed well in backtesting and early live trading. Then it identified what its analysis described as an exceptional mean-reversion opportunity in a company that was, in fact, experiencing a genuine business deterioration masked by temporary price stability. The agent held and added to the position through a 60% decline before the developer intervened. The post-mortem identified that the model had been given insufficient context about the difference between mean reversion in price and mean reversion in business fundamentals.

Specific Models: How Claude Differs from GPT in Financial Use

The practitioner community has developed reasonably consistent views on the relative strengths of different LLMs for financial applications, and these views are worth examining — not because any model is definitively superior, but because the differences are meaningful for specific use cases.

Claude's Strengths in Financial Contexts

Users consistently cite several advantages of Claude for financial analysis work. First, its very large context window (currently 200,000 tokens in Claude 3 family models) allows the processing of entire annual reports, multi-year earnings histories, or large regulatory documents in a single prompt without the context truncation that limits competing models on the same tasks. Second, its tendency toward epistemic caution — acknowledging uncertainty, qualifying claims, recommending verification of specific facts — aligns well with the standards of rigorous financial analysis. Third, practitioners report that Claude's outputs require less post-processing to be investment-committee ready: the prose is more precise, the caveats are more intelligently placed, and the tendency to over-claim is lower.

Claude is also noted for its willingness to push back on investment theses rather than simply validating them — a behaviour that, while occasionally frustrating, reflects how good financial analysts actually operate. Asking Claude to argue against your own investment thesis and then to evaluate the counter-arguments is a genuinely useful exercise that many practitioners have incorporated into their research processes.

GPT-4's Strengths

GPT-4 and its successors have demonstrated strong performance on financial tasks requiring rapid synthesis across multiple sources, particularly where the task involves producing a concise, immediately actionable output rather than a comprehensive analysis. Its tool-use and function-calling capabilities are mature, making it a popular choice for the signal generation architectures that feed structured outputs into quant models. The OpenAI ecosystem's broader developer tooling has also meant that GPT-based financial applications have benefited from a larger community of practitioners sharing code, prompting strategies, and evaluation frameworks.

The Open Source Alternatives

For practitioners with concerns about sending proprietary financial data to third-party API providers, open-source models — particularly Llama 3, Mistral, and fine-tuned variants like FinGPT — offer a self-hosted alternative. Performance on general financial reasoning tasks is behind the frontier models, but the gap has narrowed significantly over the past eighteen months, and for specific tasks — earnings sentiment classification, regulatory document extraction — fine-tuned open-source models have demonstrated competitive performance. The trade-off is infrastructure cost and the engineering overhead of running and maintaining the models in production.

The Genuine Pitfalls: What No One Warns You About

Beyond the technical failure modes discussed above, there are several structural challenges that practitioners who have built production LLM investing systems consistently raise as underappreciated.

Latency vs Intelligence Trade-Off

The models that produce the most sophisticated financial analysis are not the fastest. Claude Opus and GPT-4 can take several seconds to return a response — a timescale that is acceptable for research applications but fatal for any strategy that requires real-time signal generation at market speeds. Fast models produce less sophisticated analysis. Slow models produce better analysis but cannot be used where latency matters. This trade-off is fundamental and is not going away as models get faster — because the faster models are the smaller, less capable ones.

Cost at Scale

API costs for frontier LLMs are non-trivial when applied at scale. A strategy that processes 500 earnings call transcripts per quarter, runs continuous news sentiment scoring across a universe of 2,000 stocks, and generates daily portfolio analysis reports can accumulate API costs that exceed what the strategy generates in alpha — particularly for retail practitioners and smaller funds. This cost constraint is often overlooked in the initial excitement of building, and it has killed many promising strategies that worked perfectly in small-scale testing but became uneconomic at production scale.

Regulatory and Compliance Ambiguity

The regulatory treatment of AI-generated investment advice and AI-driven trading decisions remains ambiguous in most jurisdictions. In the United States, the SEC has issued guidance indicating that investment advisers using AI must ensure their use of AI is consistent with their fiduciary obligations and that they cannot outsource their responsibility to clients to an algorithm. In the UK and EU, similar principles apply under MiFID II and the FCA's principles for businesses. The practical implication is that fully autonomous LLM trading systems operated by regulated entities face significant compliance risk — and that the rapid evolution of both the technology and the regulatory environment requires ongoing legal attention.

Where This Is Going

The most informed view of LLM-based investing — synthesising the academic research, the practitioner accounts, and the documented failure cases — is neither the breathless optimism of the early adopters nor the dismissiveness of those who regard AI as a fad. It is something more nuanced: a technology that has demonstrated genuine, reproducible value in specific applications, that carries significant risks in the absence of careful system design and human oversight, and that is developing rapidly enough that today's accurate assessment will require meaningful revision within twelve months.

The applications that are working are those that augment experienced human judgment rather than attempting to replace it. The failures are concentrated in systems that gave models authority they hadn't earned trust for, in strategies that mistook confident-sounding LLM output for reliable financial analysis, and in architectures that lacked the circuit breakers that experienced human traders apply instinctively.

The next significant developments to watch are likely to be: reasoning-optimised models that can demonstrate their analytical steps in verifiable ways; multi-agent systems where different LLMs cross-check each other's financial reasoning; and domain-specific fine-tuning on proprietary financial datasets that produces models with genuinely superior financial knowledge rather than general intelligence applied to financial tasks.

"The question is not whether AI will transform investing. It already is. The question is whether you will be one of the people using it well, or one of the case studies in someone else's article about what went wrong."

Raising Capital for an AI or Fintech Business?

OAKRG works with technology companies — including AI, fintech, and quantitative finance ventures — to connect them with aligned capital partners. If you're building in this space and looking for institutional investment, we'd like to hear from you.

Start a Conversation