Can LLMs like ChatGPT fit a straight line to noisy data?

Large Language Models (LLMs) like OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude and Microsoft’s Copilot are very topical nowadays, and they can all write Python code. Up until the last academic year*, I had a coursework element in my biological physics teaching that was basically to choose variables correctly and then to fit a straight line to noisy data**. I did this partly because data fitting is such a useful skill that I thought using coursework to push students into practising it would help them. I am not sure the coursework was that popular, but I contend it taught useful skills.

The students’ performance was mixed, and I was curious to see if ChatGPT et al could do better, given a good prompt. ChatGPT itself couldn’t (at first go), but Gemini’s, Claude’s and Copilot’s code was correct.

The prompt was the same for all four LLMs:

write a python code to input a set of noisy data x and y values, fit a straight line to the data, output best fit value of intercept and best fit value of slope, and uncertainties in both the best fit values of the slope and the intercept and plot both the data and the bestfit straight line

There are several Python functions that actually do this for you, two of which are: scipy.stats.linregress, which Claude and ChatGPT went for, and scipy.optimize.curve_fit, which Gemini and Copilot went for. linregress is especially easy to use: if x and y are arrays with your noisy data, you only need

fit_results = stats.linregress(x, y)
print(f"Best fit slope: {fit_results.slope:.4f} ± {fit_results.stderr:.4f}")
print(f"Best fit slope: {fit_results.intercept:.4f} ± {fit_results.intercept_stderr:.4f}")

Just three lines: one to pass the arrays x and y to linregress, which returns the object I have called fit_results containing all four numbers you need, i.e., the best-fit slope and intercept plus uncertainty estimates for both, and two lines to print them.
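To make that concrete, here is a minimal, self-contained sketch of my own (not any of the LLMs’ code), with made-up true values and noise level, that generates synthetic noisy data and fits it this way:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)  # true slope 2, intercept 1, plus Gaussian noise

fit_results = stats.linregress(x, y)
print(f"Best fit slope:     {fit_results.slope:.4f} ± {fit_results.stderr:.4f}")
print(f"Best fit intercept: {fit_results.intercept:.4f} ± {fit_results.intercept_stderr:.4f}")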

Weirdly, ChatGPT’s code (see here for my dialogue with ChatGPT) called linregress, which would have given it everything it needed, but then separately calculated the uncertainty estimates and got them wrong. All four LLMs got the best-fit values right; it was the uncertainty estimates that proved tougher for the LLMs, as they were for the students answering my coursework.

Claude also called linregress, but calculated its own uncertainty estimates rather than using the ones linregress returns, with the code:

# here n is the number of data points and slope is the best-fit slope
x_mean = np.mean(x)
y_mean = np.mean(y)
ss_xx = np.sum((x - x_mean)**2)
ss_yy = np.sum((y - y_mean)**2)
ss_xy = np.sum((x - x_mean) * (y - y_mean))

slope_uncertainty = np.sqrt((ss_yy - slope**2 * ss_xx) / ((n - 2) * ss_xx))
intercept_uncertainty = slope_uncertainty * np.sqrt(np.sum(x**2) / n)

which I think is correct; it certainly gives the same answers as linregress does. It is impressive that Claude gets this right, but the fact that it does not know to use the uncertainties linregress already returns is interesting.
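If you want to check that agreement yourself, a quick comparison along these lines (my own sketch, not Claude’s code, with made-up data) should print True twice:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 1.5 * x + 0.5 + rng.normal(0, 0.8, size=x.size)

fit = stats.linregress(x, y)
n = x.size
slope = fit.slope

x_mean, y_mean = np.mean(x), np.mean(y)
ss_xx = np.sum((x - x_mean)**2)
ss_yy = np.sum((y - y_mean)**2)

slope_uncertainty = np.sqrt((ss_yy - slope**2 * ss_xx) / ((n - 2) * ss_xx))
intercept_uncertainty = slope_uncertainty * np.sqrt(np.sum(x**2) / n)

print(np.isclose(slope_uncertainty, fit.stderr))
print(np.isclose(intercept_uncertainty, fit.intercept_stderr))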

Gemini and Copilot used curve_fit. Gemini defined the nice function:

def linear_function(x, m, c):
    # the straight line being fitted (Gemini defined this alongside fit_line)
    return m * x + c

def fit_line(x, y):
    popt, pcov = curve_fit(linear_function, x, y)
    m, c = popt
    sigma_m, sigma_c = np.sqrt(np.diag(pcov))
    return m, c, sigma_m, sigma_c

which does the job nicely: the diagonal elements of pcov are the variances of the fitted parameters, and taking their square roots gives the uncertainty standard deviations, the same uncertainties that linregress returns.
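As a quick usage sketch of my own (the true slope, intercept and noise level below are just made up for illustration), assuming the functions above and the usual numpy and scipy imports:

import numpy as np
from scipy.optimize import curve_fit  # needed by fit_line above

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = 3.0 * x - 2.0 + rng.normal(0, 1.0, size=x.size)  # true slope 3, intercept -2, plus noise

m, c, sigma_m, sigma_c = fit_line(x, y)
print(f"slope     = {m:.4f} ± {sigma_m:.4f}")
print(f"intercept = {c:.4f} ± {sigma_c:.4f}")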

I guess the lesson here is not to trust LLMs to always give the right answer, even if you give them what I hope was a pretty clear and direct prompt. ChatGPT did give me code that was correct the second time around, but I had to tell it its first code was wrong, and I could only do that because I knew what the correct code should look like.

I guess the conclusion is that LLMs are quick and easy to use, and a time saver whether or not you know what you are doing. If you don't know what you are doing, i.e., what the right answer looks like, they may improve your chances of getting it right, but they certainly don't guarantee it.

If you want to improve the chances of success, you can always change the prompt. The statistics formulas that linregress and curve_fit use to estimate uncertainties are very standard but not very intuitive unless you are a statistician. But there are other ways to estimate uncertainties; a more intuitive (at least for me) and very general one is what is called the bootstrap method. For this you want the prompt


write a python code to input a set of noisy data x and y values, fit a straight line to the data, output best fit value of intercept and best fit value of slope, and uncertainties in both the best fit values of the slope and the intercept, the uncertainties should be calculated using the bootstrap method of resampling. It should then plot both the data and the bestfit straight line.

and with this prompt all four LLMs gave code that I think is correct. A bootstrap code is pretty generic and mostly quite hard to mess up, so perhaps it is not surprising that Gemini, ChatGPT, Claude and Copilot all got it right. The conclusion here is that it is better to ask LLMs easy questions than hard ones.
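For what it is worth, a minimal bootstrap sketch of my own (not any of the LLMs’ code) looks something like the following: resample the (x, y) pairs with replacement many times, refit each resample, and take the spread of the refitted slopes and intercepts as the uncertainties.

import numpy as np
from scipy import stats

def bootstrap_fit(x, y, n_boot=1000, seed=0):
    # straight-line fit with bootstrap uncertainties from resampling (x, y) pairs
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    slopes = np.empty(n_boot)
    intercepts = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample indices with replacement
        resample_fit = stats.linregress(x[idx], y[idx])
        slopes[i] = resample_fit.slope
        intercepts[i] = resample_fit.intercept
    # best-fit values come from the full data set; the uncertainties are the
    # standard deviations of the fits to the resampled data sets
    full_fit = stats.linregress(x, y)
    return full_fit.slope, np.std(slopes), full_fit.intercept, np.std(intercepts)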

* For this coming year, there has been some reorganisation, so biological physics has moved to another module and will be assessed in a different way.

** I actually made it a bit harder than the example considered in the post above, as the data was for diffusion starting from the origin, so the best-fit line has to go through the origin. Gemini and Copilot got this one right too, I think because they use curve_fit rather than linregress (why they chose it, I don't know), and curve_fit is very flexible: you can just change the fitting function to

def linear_function(x, m):
    # straight line through the origin: the only fit parameter is the slope m
    return m * x

def fit_line(x, y):
    popt, pcov = curve_fit(linear_function, x, y)
    m = popt[0]
    sigma_m = np.sqrt(pcov[0, 0])
    return m, sigma_m

which is a one-parameter fit, with the only fit parameter being the slope. All good. Claude again used linregress, which is not right, as it fits both slope and intercept, i.e., it fits the intercept instead of fixing it at zero; it also used numpy's polyfit function, which makes the same mistake, fitting the intercept instead of forcing it to be zero. So for this, less common, data-fitting problem, only Google's Gemini and Microsoft's Copilot gave the right answer.
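A small illustration of the mistake (my own sketch, with made-up data that should pass through the origin): the one-parameter curve_fit forces the line through the origin, while linregress insists on fitting an intercept.

import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 0.7 * x + rng.normal(0, 0.5, size=x.size)  # data that should pass through the origin

popt, pcov = curve_fit(lambda x, m: m * x, x, y)  # one-parameter fit: intercept fixed at zero
free_fit = stats.linregress(x, y)                 # two-parameter fit: fits an unwanted intercept
print(f"through-origin slope: {popt[0]:.4f} ± {np.sqrt(pcov[0, 0]):.4f}")
print(f"linregress slope: {free_fit.slope:.4f}, intercept: {free_fit.intercept:.4f} (should be fixed at 0)")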
