Functions

runif

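runif(n, min = 0, max = 1) draws random numbers from the uniform distribution between min and max. n is the number of values to generate.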
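A minimal sketch of a runif call; the seed and the bounds 2 and 10 are arbitrary choices for illustration.

```r
set.seed(1)                        # arbitrary seed, only so the run can be repeated
u <- runif(5, min = 2, max = 10)   # five values drawn uniformly between 2 and 10
length(u)                          # number of values generated
range(u)                           # smallest and largest draw
```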

sample

sample(x, size, replace = FALSE) x is the vector to draw from and size is the number of values to draw. When size is bigger than the length of x, you must set replace = TRUE.

x <- sample(2:10, 100, replace = TRUE)
# 100 draws from only 9 values, so replace = TRUE is required
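The effect of replace can be checked directly. This is a small sketch; the variable names are made up for illustration.

```r
set.seed(1)                       # arbitrary seed
perm <- sample(2:10)              # no size given: a random permutation of 2:10
few  <- sample(2:10, 3)           # three distinct values, drawn without replacement

# Asking for more values than x contains fails unless replace = TRUE
too_many <- tryCatch(sample(2:10, 100), error = function(e) "error")
many <- sample(2:10, 100, replace = TRUE)

too_many        # "error"
length(many)    # 100
```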

rnorm

dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

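rnorm: draws random numbers from a normal distribution. n: the number of values to generate. mean: the center of the distribution. sd: the standard deviation, which controls how far values spread around the center.

For example, rnorm(10, 100, 20) draws ten numbers from a normal distribution with mean 100 and standard deviation 20. The values are not limited to a range: about 68% of draws fall between 80 (100 - 20) and 120 (100 + 20), and the rest fall outside.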
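A quick simulation shows how the four functions relate and that rnorm values are not bounded by mean ± sd. The sample size of 10000 and the seed are arbitrary.

```r
set.seed(42)                              # arbitrary seed
x <- rnorm(10000, mean = 100, sd = 20)

mean(x)                                   # close to 100
sd(x)                                     # close to 20
mean(x >= 80 & x <= 120)                  # share within one sd, roughly 0.68
any(x < 80 | x > 120)                     # TRUE: some draws fall outside

pnorm(120, 100, 20) - pnorm(80, 100, 20)  # theoretical share within one sd
qnorm(0.975)                              # quantile of N(0, 1), about 1.96
dnorm(0)                                  # density of N(0, 1) at 0, about 0.399
```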

set.seed()

Sets the seed of the random number generator so that random results are reproducible.
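Resetting the same seed reproduces the same random numbers. The seed value 123 is arbitrary.

```r
set.seed(123)
a <- rnorm(3)

set.seed(123)      # same seed again
b <- rnorm(3)

identical(a, b)    # TRUE
```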

hist() Frequency Histogram

hist receives a numeric vector and draws a histogram that shows the frequency of the data.

hist(x, freq = NULL, probability = !freq,
    right = TRUE, density = NULL, angle = 45, col = "lightgray",
    main = paste("Histogram of", xname),
    xlim = range(breaks), ylim = NULL,
    xlab = xname, axes = TRUE, plot = TRUE, labels = FALSE, ...)

freq: if TRUE, the bars show counts (frequencies); if FALSE, they show densities, so the total area is 1. The default is TRUE when the breaks are equally spaced.
probability: an alias for !freq, for S compatibility.
right: if TRUE, the bins are right-closed, so a value equal to the right edge of a bin belongs to that bin.
density: the density of shading lines, in lines per inch; the default NULL draws no shading lines.
angle: the angle of the shading lines, in degrees
col: the fill color of the bars
main: the title of the histogram
xlim: the range of the x axis; the default range(breaks) covers all bins.
ylim: the range of the y axis
xlab: the label of the x axis
ylab: the label of the y axis
axes: if TRUE (the default), draw the x and y axes
labels: default is FALSE; if TRUE, the count is printed above each bar
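The difference between counts and densities can be inspected without drawing anything by setting plot = FALSE. This is a sketch on simulated data; the seed and sample size are arbitrary.

```r
set.seed(1)                                # arbitrary seed
x <- rnorm(200)

h <- hist(x, plot = FALSE)                 # compute the bins only, draw nothing
h$breaks                                   # bin edges
h$counts                                   # what freq = TRUE would plot
sum(h$counts)                              # 200: every value lands in a bin
sum(h$density * diff(h$breaks))            # 1: densities integrate to one

# Draw the density version with a few of the arguments described above
hist(x, freq = FALSE, col = "lightgray", main = "Histogram of x", xlab = "x")
```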

plot Scatter Plot

The plot function takes two arguments, x and y, and by default draws points (a scatter plot).

plot(x, y = NULL, type = "p", xlim = NULL, ylim = NULL, log = "",
    main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
    ann = par("ann"), axes = TRUE, frame.plot = axes, panel.first = NULL,
    panel.last = NULL, bty = "o", ...)

type: "p" for points, "l" for lines, "b" for both (lines and points), "c" for the lines part alone of "b", "o" for both overplotted, "h" for histogram-like (high-density) vertical lines, "s" and "S" for stair steps, "n" for no plotting
cex: size of the points, relative to the default of 1
col: color
pch: the shape of the points; 19 is a solid circle
xlim/ylim: xlim = c(min_x, max_x), sets the range of the x/y axis
log: "x", "y", or "xy": use a logarithmic scale on the x axis, the y axis, or both
main: title
sub: subtitle, drawn below the plot
bty: the type of the box around the plot; "o" (the default) is a full box, "l" draws only the left and bottom sides, "c" draws the top, left, and bottom sides, "n" draws no box
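A small sketch that combines several of these arguments; the data and labels are made up for illustration.

```r
x <- 1:10
y <- x^2

# Points joined by lines, solid blue symbols, left and bottom frame only
plot(x, y, type = "b", pch = 19, cex = 1.2, col = "blue",
     xlim = c(0, 11), ylim = c(0, 110),
     main = "y = x^2", sub = "points joined by lines",
     xlab = "x", ylab = "y", bty = "l")
```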

text

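Adds text labels to an existing plot, for example next to points or on the nodes of a decision tree.

text(x, y, labels, pos = 3, offset = 0.5, cex = 1.5, srt = 45)

labels: the vector of labels to draw
cex: size of the label, default is 1
pos: the position of the label relative to the point: 1 below, 2 left, 3 above, 4 right
offset: the distance between the label and the point, in character widths (used together with pos)
srt: the rotation angle of the label, in degrees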
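A minimal sketch of labelling points; the coordinates and label names are hypothetical.

```r
x <- c(1, 2, 3)
y <- c(2, 4, 3)
point_names <- c("a", "b", "c")   # hypothetical labels, one per point

plot(x, y, pch = 19, xlim = c(0, 4), ylim = c(1, 5))
# pos = 3 places each label above its point, half a character width away
text(x, y, labels = point_names, pos = 3, offset = 0.5, cex = 1.2)
```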

qq Plot (Quantile-Quantile Plot)

geom_qq (ggplot2), qqplot, and qqnorm produce similar results. qqnorm compares the sample quantiles with those of a normal distribution.

qqnorm(All$mark)
qqline(All$mark, col = 2)  # Add a theoretical normal distribution reference line, color set to red
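A self-contained version of the same idea, using simulated marks in place of All$mark (the vector marks is hypothetical). For normally distributed data the points lie close to the reference line.

```r
set.seed(1)                               # arbitrary seed
marks <- rnorm(100, mean = 60, sd = 10)   # hypothetical data

qq <- qqnorm(marks)      # draws the plot and returns the plotted coordinates
qqline(marks, col = 2)   # reference line through the first and third quartiles

cor(qq$x, qq$y)          # close to 1 when the sample is roughly normal
```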

abline

abline(a = NULL, b = NULL, h = NULL, v = NULL, reg = NULL, coef = NULL,
    untf = FALSE, lty = 1, lwd = 1, col = "black", ...)

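a,b: the intercept and the slope of the line y = a + b*x
h,v: the y values of horizontal lines and the x values of vertical lines
reg: a fitted model object, such as the result of lm; its regression line is drawn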
lty: line type
lwd: line width
col: colour

Fitting example:

   data <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
   plot(data$x, data$y)  # Plot data points
   model <- lm(y ~ x, data)  # Linear fitting on data
   abline(model, col = "blue")  # Add fitted line

as.factor

Usage: xxx <- as.factor(xxx). It converts a vector to a factor. A factor stores categorical data: each value is one of a fixed set of levels.

# Create a simple factor
fruit <- c("apple", "banana", "apple", "orange", "banana", "apple")
fruit_factor <- factor(fruit)

# Display factor structure
fruit_factor

# Specify levels and labels
fruit_factor_labeled <- factor(fruit, levels = c("apple", "banana", "orange"),
                               labels = c("Apple", "Banana", "Orange"))
fruit_factor_labeled
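Internally a factor is a vector of integer codes plus a set of levels. A short sketch with the same fruit vector:

```r
fruit <- c("apple", "banana", "apple", "orange", "banana", "apple")
f <- as.factor(fruit)

levels(f)       # "apple" "banana" "orange" (sorted alphabetically)
as.integer(f)   # 1 2 1 3 2 1: the position of each value in levels(f)
table(f)        # counts per level: apple 3, banana 2, orange 1
```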

summary

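Summarizes an object. For a fitted model it reports the estimated coefficients, their standard errors and p-values, and fit statistics such as R-squared.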

colMeans

average_marks <- colMeans(df2, na.rm = TRUE)

colSums

sum_marks <- colSums(df2, na.rm = TRUE)

apply()

# calculate the median of each column
median_marks <- apply(df2, 2, median, na.rm = TRUE)
# calculate the sum of each row of a matrix
row_sums <- apply(mat, 1, sum)
# calculate the max and min of each column
max_marks <- apply(df2, 2, max, na.rm = TRUE)
min_marks <- apply(df2, 2, min, na.rm = TRUE)
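The calls above assume that df2 and mat already exist. A self-contained sketch with small made-up data (the column names and values are hypothetical):

```r
df2 <- data.frame(math = c(80, 90, NA), physics = c(70, 60, 50))
mat <- matrix(1:6, nrow = 2)   # rows are 1 3 5 and 2 4 6

col_means   <- colMeans(df2, na.rm = TRUE)            # math 85, physics 60
col_sums    <- colSums(df2, na.rm = TRUE)             # math 170, physics 180
col_medians <- apply(df2, 2, median, na.rm = TRUE)    # 2 = apply over columns
row_sums    <- apply(mat, 1, sum)                     # 1 = apply over rows: 9 12
```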

tapply

a <- c(2, 4, 1, 7, 6, 8, 3, 4)
c <- c("A", "A", "B", "B", "A", "A", "B", "B")
result <- tapply(a, c, mean)

d <- c("X", "Y", "X", "Y", "X", "Y", "X", "Y")
result <- tapply(a, list(c, d), mean)

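tapply(X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)

In the first example, each element of c says which group the matching element of a belongs to, and tapply calculates the mean of each group (A and B). In the second example, the groups are the combinations of c and d, so the result is a table of means with one row per level of c and one column per level of d.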
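The results of both examples can be worked out by hand. In this sketch the grouping vectors are renamed g1 and g2 (hypothetical names) to avoid reusing the name of the function c.

```r
a  <- c(2, 4, 1, 7, 6, 8, 3, 4)
g1 <- c("A", "A", "B", "B", "A", "A", "B", "B")
g2 <- c("X", "Y", "X", "Y", "X", "Y", "X", "Y")

by_group <- tapply(a, g1, mean)
by_group   # A = mean(2, 4, 6, 8) = 5, B = mean(1, 7, 3, 4) = 3.75

two_way <- tapply(a, list(g1, g2), mean)
two_way    # rows A, B; columns X, Y: A/X 4, A/Y 6, B/X 2, B/Y 5.5
```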

rep

rep(x, times = 1, length.out = NA, each = 1)
rep(1, times = 5)
# repeat the number 1 five times: 1 1 1 1 1
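The three arguments behave differently on a vector. A short sketch:

```r
r_times <- rep(1:2, times = 3)        # whole vector repeated: 1 2 1 2 1 2
r_each  <- rep(1:2, each = 3)         # each element repeated: 1 1 1 2 2 2
r_len   <- rep(1:2, length.out = 5)   # repeated until length 5: 1 2 1 2 1
```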

lm

lm() fits a linear model, glm() a generalized linear model, and lme() (nlme package) a mixed effects model. A simple linear fit is lm(y ~ x).
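A minimal sketch of fitting and inspecting a linear model on simulated data. The true intercept 3 and slope 2 are assumptions chosen for the simulation.

```r
set.seed(1)                            # arbitrary seed
d <- data.frame(x = 1:20)
d$y <- 3 + 2 * d$x + rnorm(20)         # simulated: intercept 3, slope 2, noise sd 1

fit <- lm(y ~ x, data = d)
coef(fit)                              # estimates close to 3 and 2

s <- summary(fit)
s$coefficients                         # estimates, standard errors, t and p values
s$r.squared                            # share of variance explained
```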

lme

glm

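The response is assumed to follow a distribution from the exponential family (such as normal, binomial, or Poisson), which is chosen with the family argument.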
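A sketch of a Poisson regression on simulated counts. The true coefficients 0.5 and 1 are assumptions of the simulation, not values from any real data set.

```r
set.seed(1)                                     # arbitrary seed
x <- runif(200, 0, 2)
counts <- rpois(200, lambda = exp(0.5 + 1 * x)) # log link: log(mean) = 0.5 + 1 * x

fit <- glm(counts ~ x, family = poisson)
coef(fit)      # estimates on the log scale, close to 0.5 and 1
AIC(fit)       # can be used to compare this fit with alternative models
```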

gam

cooks.distance

boxplot Box Plot

x: The data for the box plot; it can be a numeric vector, a matrix, a data frame, or a list. A formula such as y ~ group can be given instead, where y holds the values and group holds the categories.
names: Group labels printed under each box.
main: Set the main title of the graph.
xlab and ylab: Set the X-axis and Y-axis labels respectively.
las: Control the direction of axis labels: 0 means parallel to the axis, 1 means always horizontal, 2 means perpendicular to the axis, 3 means always vertical.
col: Set the fill color of the boxes.
border: Set the border color of the box plot.
outcol and outpch: Set the color and shape of outlier points.
horizontal: If set to TRUE, draw a horizontal box plot.
varwidth: If set to TRUE, the width of each box is proportional to the square root of its sample size, otherwise all box plots have the same width.
notch: If set to TRUE, each box has a notch around the median that roughly represents a 95% confidence interval. If the notches of two boxes do not overlap, the medians are likely to differ.
range: Control the length of whiskers, default is 1.5 times the interquartile range (IQR).
par: Other graphical parameters can be passed through ...; layout parameters such as mar (margins) and oma (outer margins) are set beforehand with par().

How to read a box plot: Box plots are mainly used to describe the distribution of data, and its key elements include:
Box: Spans the lower quartile (Q1) to the upper quartile (Q3) of the data, that is, the middle 50% of the data.
Median Line: The line inside the box represents the median of the data.
Whiskers: Lines extending from both ends of the box to the smallest and largest values that are not outliers, that is, no lower than Q1 minus 1.5 times IQR and no higher than Q3 plus 1.5 times IQR.
Outliers: Data points outside the whiskers, representing values far from the main data set.
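These elements can be read as numbers by calling boxplot with plot = FALSE. A sketch on a small made-up vector with one extreme value:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

b <- boxplot(x, plot = FALSE)   # compute the statistics without drawing
b$stats   # lower whisker 1, Q1 3, median 5.5, Q3 8, upper whisker 9
b$out     # 30: beyond Q3 + 1.5 * IQR = 8 + 7.5 = 15.5, so it is an outlier
```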

bwplot

AIC

fligner.test

Tests homogeneity of variance across groups. The null hypothesis is that all groups have equal variances; if the p-value is less than 0.05, reject the null hypothesis.

shapiro.test

Tests whether data follows a normal distribution. If the p-value is greater than 0.05, normality cannot be rejected, and the data is usually treated as normally distributed.
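Both tests return an object whose p.value element can be read directly. This sketch uses simulated data chosen so that the assumptions are clearly violated; the variable names are hypothetical.

```r
set.seed(1)                                        # arbitrary seed

# Shapiro-Wilk: exponential data is strongly skewed, so normality is rejected
skewed_x <- rexp(100)
p_shapiro <- shapiro.test(skewed_x)$p.value
p_shapiro                                          # far below 0.05

# Fligner-Killeen: two groups with very different spread
y <- c(rnorm(50, sd = 1), rnorm(50, sd = 5))
g <- factor(rep(c("a", "b"), each = 50))
p_fligner <- fligner.test(y, g)$p.value
p_fligner                                          # far below 0.05
```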

cor.test

durbinWatsonTest

The Durbin-Watson statistic is a number between 0 and 4, used to evaluate whether the model’s residual sequence has autocorrelation. A value close to 2 usually indicates no significant autocorrelation between residuals, while a value far from 2 may indicate the presence of autocorrelation.

dwtest()

dwtest(formula, order.by = NULL, alternative = c("greater", "two.sided", "less"), data = list())

dwtest() comes from the lmtest package. The output includes the Durbin-Watson statistic and the corresponding p-value. If the p-value is greater than the significance level (usually 0.05), the null hypothesis cannot be rejected, that is, there is no evidence of autocorrelation in the residuals. Conversely, if the p-value is less than the significance level, the model may need adjustment to eliminate autocorrelation in the residuals.
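The statistic itself is simple to compute in base R: the sum of squared differences between successive residuals divided by the sum of squared residuals. The helper dw() below is a hypothetical name written for this sketch, not a function from car or lmtest, and it returns only the statistic, not a p-value.

```r
# Hypothetical helper: Durbin-Watson statistic of a fitted model's residuals
dw <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

set.seed(1)                                   # arbitrary seed
n <- 200
x <- seq_len(n)
e_indep <- rnorm(n)                           # independent errors
e_ar <- as.numeric(stats::filter(rnorm(n), 0.8, method = "recursive"))  # AR(1) errors

y_indep <- 1 + 0.5 * x + e_indep
y_ar    <- 1 + 0.5 * x + e_ar

dw_indep <- dw(lm(y_indep ~ x))               # near 2: no autocorrelation
dw_ar    <- dw(lm(y_ar ~ x))                  # well below 2: positive autocorrelation
```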

kruskal.test

aov

anova

par(mfrow = c(nrow, ncol), mar = c(1, 1, 2, 2))

mfrow defines the grid of plots (number of rows and columns, filled row by row). mar defines the bottom, left, top, and right margins, in lines of text.
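A sketch of drawing two plots side by side and then restoring the previous settings. par() returns the old values of the parameters it changes, which makes the reset easy; the margin values are arbitrary.

```r
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))  # 1 row, 2 columns; save old settings

hist(rnorm(100), main = "Histogram")              # goes into the left panel
plot(1:10, main = "Scatter")                      # goes into the right panel

par(old)                                          # back to a single plot per page
```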

dotchart()

pairs()

panel.cor()

tree()

library(tree)
my_tree <- tree(response ~ ., data = dataset,
                control = tree.control(nobs = nrow(dataset), mincut = 5, minsize = 10, mindev = 0.01))
plot(my_tree)
text(my_tree)

response: The response variable for classification or regression problems; for classification problems it should be a factor, for regression problems it should be a numeric vector.
.: Represents all other variables, that is, using all independent variables to predict the response variable.
data: A data frame containing the response and independent variables.
control: A tree.control object that sets the parameters for growing the tree: nobs (the number of observations), mincut and minsize (minimum node sizes), and mindev (the minimum deviance a node needs in order to be split).

Parametric Tests

T-test:

One-sample T-test: Test whether the mean of a sample is significantly different from a known population mean.
Independent samples T-test (two-sample T-test): Compare whether the mean differences between two independent samples are significant.
Paired samples T-test: Compare the mean differences between paired data (such as measurements before and after treatment on the same subject).

ANOVA (Analysis of Variance):

One-way ANOVA: Used to compare whether there are significant differences in means among three or more independent samples.
Multivariate Analysis of Variance (MANOVA): When there is more than one dependent variable, used to analyze mean differences of multiple dependent variables simultaneously.

Chi-square test (χ² test):

Goodness of fit test: Test whether there are significant differences between observed and expected frequencies.
Independence test: Analyze whether two categorical variables are independent.
Homogeneity test: Test whether the frequency distributions of multiple samples are the same.

F-test:

Homogeneity of variance test: Before ANOVA, test whether the variances of each group are equal.
F-test in ANOVA: Used in ANOVA to determine whether group differences are significant.
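Most of these tests have a base R function. The sketch below runs them on data sets that ship with R (sleep, PlantGrowth, HairEyeColor); the choice of data sets is only for illustration.

```r
# One-sample t-test: is the mean of extra different from 0?
t_one <- t.test(sleep$extra, mu = 0)

# Two-sample t-test: compare the two groups as independent samples
t_two <- t.test(extra ~ group, data = sleep)

# Paired t-test: the same ten subjects were measured under both conditions
t_paired <- with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))

# One-way ANOVA: does mean plant weight differ among the three groups?
fit_aov <- aov(weight ~ group, data = PlantGrowth)
p_aov <- summary(fit_aov)[[1]][["Pr(>F)"]][1]

# Chi-square independence test on a hair colour by eye colour table
chi <- chisq.test(margin.table(HairEyeColor, c(1, 2)))

# F-test for equal variances of two groups
f_var <- var.test(extra ~ group, data = sleep)

c(paired = t_paired$p.value, anova = p_aov, chisq = chi$p.value)
```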

interaction.plot

interaction.plot(x.factor = mydata$Age.Group,
                 trace.factor = mydata$Gender,
                 response = mydata$Salary,
                 type = "b",
                 col = c("red", "blue"),
                 legend = TRUE, trace.label = "Gender")

x.factor: The factor whose levels form the X-axis.
trace.factor: The factor whose levels form the traces; one line is drawn for each level.
response: The response variable, a numeric vector. It is summarized (by its mean by default) for each combination of the two factors.
data: interaction.plot has no data argument; pass columns as mydata$column or wrap the call in with().
type: The type of plot:

"l" lines
"b" lines and points
"p" points
"o" points overplotted on lines
"c" the lines part alone of "b"
legend: Logical, whether to draw the legend; the default is TRUE.
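A runnable sketch using the built-in ToothGrowth data in place of mydata. The plotted values are the cell means, which can also be computed with tapply.

```r
# Mean tooth length for each dose, one trace per supplement type
with(ToothGrowth,
     interaction.plot(x.factor = factor(dose), trace.factor = supp,
                      response = len, fun = mean, type = "b",
                      col = c("red", "blue"), pch = c(1, 19),
                      trace.label = "supp", xlab = "dose", ylab = "mean of len"))

# The same cell means as a table: rows are doses, columns are supplements
cell_means <- with(ToothGrowth, tapply(len, list(dose, supp), mean))
cell_means
```

Lines that are not parallel suggest an interaction between the two factors.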

groupedData()

Creates a grouped data object (nlme package) for repeated measurement data, where the observations are grouped by subject.

radarchart() Radar Chart (fmsb package)

radarchart/radarchartcirc(df, axistype, seg, pty, pcol, plty, plwd, pdensity, pangle, pfcol,cglty, cglwd, cglcol, axislabcol, title, maxmin, na.itp, centerzero, vlabels, vlcex, caxislabels, calcex, paxislabels, palcex, ...)

axislabcol: Color of axis labels and numbers, default is "blue".
axistype: Type of axis labels, a value between 0 and 5. 0 means no axis labels, 1 means center axis labels only, 2 means axis labels around the chart only, 3 means both center and peripheral axis labels, 4 is like 1 with labels in *.** format, 5 is like 3 with labels in *.** format. Default is 0.
seg: Number of segments, determines the number of divisions on each axis.
pty: A vector specifying point symbols, default is 16 (closed circle). If not drawing data points, it should be set to 32.
pcol: Color of polygon edges, a vector of color codes for drawing data; default is 1:8, recycled if there are more series.
plty: Line type of polygon edges, a vector of line types for drawing data; default is 1:6, recycled if there are more series.
plwd: Line width of polygon edges.
pdensity: Density of the shading lines used to fill the polygons; default is NULL, which draws no shading lines.
pangle: Angle of the shading lines used to fill the polygons, in degrees.
pfcol: Fill color inside the polygon, for example pfcol = scales::alpha("#00AFBB", 0.5). The second parameter of the alpha() function is a value between 0 and 1 indicating the opacity of the color: 0 means completely transparent, 1 means completely opaque. In this example, 0.5 gives a semi-transparent blue-green fill.
cglty: Line type of the radar grid, default is 3 (dashed line). For radarchartcirc(), default is 1 (solid line).
cglwd: Line width of the radar grid, default is 1, the thinnest line width.
cglcol: Color of the radar grid, default is "navy".
title: Title of the chart.
maxmin: Logical. If TRUE, the first two rows of df hold the maximum and minimum of each variable; if FALSE, they are calculated from the data.
na.itp: Logical, how to handle NA values. If TRUE, NA values are interpolated from the nearest neighbors.
centerzero: Logical, whether the scale starts from 0 at the chart center.
vlabels: Labels of variables; by default the column names of the data frame.
vlcex: Font size of variable labels.
caxislabels: Labels for the center axis.
calcex: Font size of the center axis labels.
paxislabels: Labels for the peripheral (around-the-chart) axis.
palcex: Font size of the peripheral axis labels.

legend Legend

legend(x, y, legend, fill = NULL, col = "black", bg = "white",
       lty = 1, lwd = 1, pch = 1, angle = 45, density = NULL, bty = "o",
       box.lwd = 1, box.col = "black", text.col = "black",
       title = NULL, cex = 1, pt.bg = NA, pt.cex = 1,
       xjust = 0, yjust = 1, xpd = NA,
       ncol = 1, horiz = FALSE, title.adj = c(0.5, 0.5), ...)
