There is a nice paper out about this subject, although it looks at the matter from a rather advanced mathematical viewpoint. If the equation below looks like your idea of a least-squares straight-line fit, then you can certainly read the paper, but I will warn you that the matrices are not the only challenge; it flips back and forth between Bayesian and frequentist statistics rapidly.[1]
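For concreteness, the sort of matrix equation I have in mind (not necessarily the paper's exact notation) is the standard weighted least-squares solution for a straight line y = b + mx:

$$
\begin{bmatrix} b \\ m \end{bmatrix}
= \left( A^{\mathsf T} C^{-1} A \right)^{-1} A^{\mathsf T} C^{-1} Y,
\qquad
A = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix},
\qquad
C = \mathrm{diag}\!\left( \sigma_{y_1}^{2}, \ldots, \sigma_{y_N}^{2} \right),
$$

where Y is the column vector of the measured y-values and the sigma_y,i are their uncertainties.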

Also note that it is 38 pages long, has 41 notes that occupy another 16 pages (geesh, I thought I was bad with all the notes) and then just a short page of references - 55 pages total.
Here are some of the highlights that I found useful. The paper targets least-squares fitting of a straight line, but most of the comments would apply to fitting other equations as well.
- LS fits make the assumption that the uncertainty is all in the y-values and that the x-values have little or no uncertainty. This can be the case if x is time, but in other cases, such as when x is temperature, pressure, etc., that assumption is worth questioning. [2]
- If you do an LS fit, the slope and intercept are what most people will latch onto, and they will likely ignore the data and its scatter. This is akin to what happens when a distribution is condensed to a mean, median, or other single value. If the uncertainty of the average is huge because the data is widely scattered, communicating that is likely more important than communicating the average itself. (A short numerical sketch after this list shows how to report the parameter uncertainties and the scatter along with the fit.)
- The distribution of the uncertainty in the y-data needs to be Gaussian in order for the LS fit to be proper.
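Here is a minimal sketch of that second point, using plain numpy and made-up data: report the parameter uncertainties and the scatter about the line, not just the slope and intercept.

```python
import numpy as np

# Made-up example data: a noisy line with deliberately large scatter.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=5.0, size=x.size)

# Ordinary least-squares fit of y = m*x + b, with the parameter covariance matrix.
(m, b), cov = np.polyfit(x, y, 1, cov=True)
m_err, b_err = np.sqrt(np.diag(cov))

# The scatter of the data about the line - just as important to report.
residuals = y - (m * x + b)
rms_scatter = np.sqrt(np.mean(residuals**2))

print(f"slope     = {m:.2f} +/- {m_err:.2f}")
print(f"intercept = {b:.2f} +/- {b_err:.2f}")
print(f"RMS scatter about the line = {rms_scatter:.2f}")
```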
The paper has extensive sections on how to deal with outliers [3] and non-Gaussian distributed data if you want to get into those subjects further - they are not easily handled.
The first bullet point above is quite important, as an LS fit minimizes the sum of the squared distances between the points and the line, with the distance measured only in the vertical direction. 200 years ago those calculations could be long and tedious, but the procedure was quite clear. With computers nowadays, it is possible to minimize the sum of the squared distances measured perpendicular to the line, taking the x-distance into account as well - this is sometimes called orthogonal or total least squares; a small sketch follows below. That was never discussed in the paper and I wish it had been.
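For the curious, here is a minimal numpy sketch (again with made-up data) comparing the ordinary vertical-distance fit to the perpendicular-distance version; the latter falls straight out of the leading principal component of the centered data.

```python
import numpy as np

# Made-up data with comparable uncertainty in both x and y.
rng = np.random.default_rng(1)
true_x = np.linspace(0.0, 10.0, 30)
x = true_x + rng.normal(scale=0.5, size=true_x.size)
y = 2.0 * true_x + 1.0 + rng.normal(scale=0.5, size=true_x.size)

# Ordinary least squares: minimizes squared *vertical* distances only.
m_ols, b_ols = np.polyfit(x, y, 1)

# Orthogonal (total) least squares: minimizes squared *perpendicular* distances.
# The best-fit direction is the leading principal component of the centered data.
xm, ym = x.mean(), y.mean()
pts = np.column_stack([x - xm, y - ym])
_, _, vt = np.linalg.svd(pts, full_matrices=False)
dx, dy = vt[0]                 # direction of largest variance
m_tls = dy / dx
b_tls = ym - m_tls * xm

print(f"vertical-distance fit:      slope = {m_ols:.3f}, intercept = {b_ols:.3f}")
print(f"perpendicular-distance fit: slope = {m_tls:.3f}, intercept = {b_tls:.3f}")
```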
[1] That statistics has such controversies at such a fundamental level is something I've always found quite amusing and something that most people aren't even aware of. But then, who wants to air their dirty laundry in public?
[2] Sure, you can measure temperature to numerous sig figs, but is your sample really thermally homogeneous? If so, you are lucky. Really lucky. As in, look over the situation again; you are almost certainly wrong. So how does the inhomogeneity affect your results?
[3] If you toss an outlier, the quality of the fit improves. If you keep tossing outliers, the quality of the fit keeps improving. You can keep tossing outliers and keep getting better fits until you have just 2 data points left and a perfect fit. In this system there is no penalty for tossing an outlier - in fact, you get rewarded. (The little sketch below makes the point.)
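A tiny demonstration of that reward loop, with made-up data: toss the point with the worst residual, refit, repeat, and watch the residual sum of squares march toward zero.

```python
import numpy as np

# Made-up data: a noisy line with a couple of genuine outliers thrown in.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 15)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
y[3] += 10.0
y[11] -= 8.0

# Repeatedly toss the point with the largest residual and refit.
# The residual sum of squares can only go down, ending in a "perfect"
# fit once just 2 points remain.
while x.size >= 2:
    m, b = np.polyfit(x, y, 1)
    resid = y - (m * x + b)
    print(f"n = {x.size:2d}   RSS = {np.sum(resid**2):8.3f}")
    worst = np.argmax(np.abs(resid))
    x = np.delete(x, worst)
    y = np.delete(y, worst)
```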