New Phys.: Sae Mulli 2024; 74: 678-687
Published online July 31, 2024 https://doi.org/10.3938/NPSM.74.678
Copyright © New Physics: Sae Mulli.
Eunhye Shin1, Jinseop Jang2, Junghyo Jo2*
1Bucheon Technical High School, Bucheon 14733, Korea
2Department of Physics Education, Seoul National University, Seoul 08826, Korea
Correspondence to:*jojunghyo@snu.ac.kr
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License(http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This study explores the use of symbolic regression (SR) in physics education, aiming to gauge its effectiveness and educational value. SR involves deriving mathematical models from empirical data by finding symbolic representations that fit the data. We evaluate two SR algorithms, AI-Feynman and Φ-SO, using position data from objects in parabolic motion and damped oscillations. Our analysis demonstrates that SR algorithms can produce concise formulas to describe object motion. Integrating SR into physics education allows students to build on their prior knowledge of physics to formulate hypothetical symbolic terms and enhance their explanations of physical phenomena. Subsequently, students iteratively derive mathematical expressions from data, thereby nurturing a process of data-driven discovery. Furthermore, students can recognize the impact of technological advancements on scientific problem-solving. However, effective pedagogical strategies are necessary to guide students beyond mere derivation of mathematical expressions, encouraging them to interpret and elucidate models in meaningful scientific inquiries.
Keywords: Physics education, Symbolic regression, Machine learning, AI-Feynman, Φ-SO
Physics describes natural phenomena by constructing mathematical models. Extensive literature supports the view that mathematical modeling is integral to the field of physics. Hestenes notes that a physics model is inherently a mathematical model, where physical properties are represented by quantitative variables[1]. Deriving mathematical expressions that encapsulate observations or experimental measurements attempts to explain nature[2] and reason about underlying phenomena[3]. Some studies argue that a central challenge in physics is finding a symbolic expression that provides a simple yet accurate fit to a given data set[4], thereby elucidating the relationship between empirical observations and the system being studied[5].
Physics education researchers have highlighted the crucial role of students' effective use of mathematics and their comprehension of the physical meaning behind mathematical expressions in solving physics problems[6]. Sherin notes that students develop a conceptual understanding of a physical situation and then express this understanding through equations. Additionally, they can interpret an equation as a specific representation of a physical system[7]. Similarly, Hestenes emphasizes that the underlying models within physical principles and equations should be clearly articulated to students, as mathematical modeling is vital for grasping and explaining the link between scientific theories and empirical phenomena[1]. Brahmia et al. propose that mastering symbolic forms should be a learning objective in introductory physics courses, rather than merely reflecting the typical thinking patterns of physics students[8]. Mashood et al. argue that solving complex physics problems requires students to employ imagination to create systematic connections between real-world and mathematical structures, and to comprehend high-level modeling that integrates these relationships into coherent mathematical models[9].
Koza[5] coined the term ‘symbolic regression (SR)’ to describe the task of discovering a function in symbolic form that accurately fits a given finite dataset. An early example of this process is Kepler’s discovery of Mars' elliptical orbit after about 40 failed attempts over four years, which marked a scientific revolution[4]. Today, we can combine symbolic regression with machine learning techniques to extract mathematical equations directly from data. While traditional regression methods fit parameters to predetermined equation forms, SR employs genetic programming[10], a technique inspired by Darwinian evolution, to explore both parameter values and equation structures simultaneously[5]. Although genetic programming-based methods achieve high prediction accuracy, they do not scale well to high-dimensional data sets and are sensitive to hyperparameters[11]. Recently, deep learning-based methods have also been applied to SR[2]. The search for a parsimonious and elegant form of the unknown equation[12] allows scientists to gain deeper insights into the underlying phenomena[13]. SR has been successfully applied in various fields, including particle kinematics[14], astrophysics[15], and chemistry[16].
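The contrast between fitting parameters and searching over equation structures can be illustrated with a toy example. The sketch below is not genetic programming and is not taken from any of the cited packages: it simply tries every entry of a small, hypothetical expression library, fits one scale constant by least squares, and ranks the structures by RMSE on synthetic data.

```python
import numpy as np

# Tiny, hypothetical expression library: each entry is one candidate structure f(x).
LIBRARY = {
    "x": lambda x: x,
    "x**2": lambda x: x ** 2,
    "sin(x)": np.sin,
    "sin(x)**2": lambda x: np.sin(x) ** 2,
    "exp(x)": np.exp,
}

def search(x, y):
    """Try every structure, fit y ~ c * f(x) by least squares, rank by RMSE."""
    results = []
    for name, f in LIBRARY.items():
        fx = f(x)
        c = np.dot(fx, y) / np.dot(fx, fx)            # closed-form least-squares scale
        rmse = np.sqrt(np.mean((y - c * fx) ** 2))
        results.append((rmse, name, c))
    return sorted(results)

# Synthetic, noisy data (assumed for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0.3, 1.3, 30)
y = 0.2 * np.sin(x) ** 2 + rng.normal(0.0, 0.002, x.size)

best_rmse, best_name, best_c = search(x, y)[0]
print(f"best structure: {best_c:.3f} * {best_name} (RMSE {best_rmse:.4f})")
```

Real SR methods replace the exhaustive loop with a guided search (evolutionary or neural), because the number of candidate structures grows very quickly with expression length.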
Current physics teaching primarily follows a model-driven approach, where theories are learned deductively based on a priori knowledge. However, real-world science often employs a data-driven approach, where rules are discovered inductively from observational data. With the increasing ease of collecting large amounts of real-time data through various sensors and physical computing, it is becoming more feasible to use this data-driven approach in the classroom. Given the growing interest in integrating artificial intelligence into science education, we suggest incorporating SR into physics education, considering its success in the natural sciences. In particular, the model selection process through SR can be highly effective in finding rules hidden in observed phenomena. A data-driven approach to physics education using SR is expected to complement traditional model-centered physics education.
As students engage in the generation, evaluation, and modification of scientific models, they develop a more profound comprehension of scientific principles and problem-solving abilities[17]. In a physics class using SR, when students build a model to explain an unfamiliar phenomenon, they derive hypotheses from observations and data about the phenomenon. The coherence of the student's hypothesis with the phenomenon can be verified by the size of the error in the derived equation. As students revise their hypotheses, they update the code, resulting in a revised model and error. Students take the initiative to iteratively modify their hypotheses and code to reduce error and improve the fit between the model and the phenomenon. This process is similar to the philosophy that scientific knowledge is advanced by ‘epistemic iteration,’ a process of building and refining knowledge by reexamining and revisiting premises at each step, rather than simply maintaining and repeating initially held beliefs[18].
Before SR, as used in physics research, can give students an indirect experience of how scientific knowledge progresses through iterative model elaboration, it is essential first to study whether SR is feasible in an educational context. This should be followed by a discussion of the implications of introducing SR in physics education. We focus on two algorithms, AI-Feynman[4] and Φ-SO[2], which are notable for their foundations in physics principles. AI-Feynman is known for its capability to identify the 100 equations presented in Feynman's physics lectures[4], while Φ-SO is a physical symbolic optimization method that recovers analytic functions from physical data[2]. Previous studies have demonstrated that both packages perform successfully on computer-generated data. However, to investigate the feasibility of SR in physics education, it is necessary to examine whether SR also works on noisy data collected by students in the laboratory. Therefore, we apply both SR packages to real data collected in the laboratory, which distinguishes our work from previous studies[2, 4].
To investigate the suitability of the SR algorithms for experimental classes, we used them to derive a mathematical model for the peak of parabolic motion. The input for this model was the launch angle of an object in parabolic motion. The peak of parabolic motion can be obtained from the equations for velocity and position in the perpendicular direction, based on the launch angle. However, due to the involvement of trigonometric functions, experimental measurement has limitations. The parabolic motion data was collected using a parabolic launcher and the Tracker, as shown in Fig. 1. The experiment was repeated multiple times for each angle to account for potential errors caused by slight position changes resulting from the launch impact. The average of the peak values, excluding outliers, was used as the peak value for each angle. The data shown in Fig. 2 was a result of this process. We predicted the maximum height using the theoretical equation with the mean initial velocity of 2.06 m/s and the acceleration due to gravity of 9.807 m/s².
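The theoretical prediction mentioned above can be reproduced in a few lines. This is a minimal sketch that assumes the standard ideal-projectile result H = (v₀ sin θ)² / (2g) and uses the initial speed and gravitational acceleration quoted in the text; the launch angles are chosen purely for illustration and are not the angles used in the experiment.

```python
import numpy as np

V0 = 2.06   # mean initial speed quoted in the text, m/s
G = 9.807   # gravitational acceleration quoted in the text, m/s^2

def peak_height(theta_deg, v0=V0, g=G):
    """Peak height of ideal projectile motion: H = (v0 sin(theta))^2 / (2 g)."""
    theta = np.radians(theta_deg)
    return (v0 * np.sin(theta)) ** 2 / (2.0 * g)

# Illustrative launch angles (assumed; not the angles used in the study).
angles = np.array([30.0, 45.0, 60.0])
heights = peak_height(angles)
for a, h in zip(angles, heights):
    print(f"theta = {a:4.1f} deg -> H = {h:.4f} m")
```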
This study aimed to derive a mathematical model of damped oscillations by SR. To collect experimental data, a spring pendulum was constructed by hanging a weight on a spring, as shown in Fig. 3, and oscillated in a graduated cylinder filled with water. During the motion of the pendulum, we were careful not to let the spring touch the water, because contact with the water would affect the motion of the pendulum. To analyze the video, we set the point of connection between the pendulum and the ring as the point of capture in the Tracker. We collected over 500 decay position data points for the X and Y axes of the damped oscillator and used 50 of them for our analysis. The data shown in Fig. 4 was a result of this process. The RMSE between the fitted curve corresponding to the general solution of damped harmonic oscillation and the collected data was
AI-Feynman is an SR algorithm known for its physics-inspired approach. According to Udrescu et al.[4], AI-Feynman's core principles involve discovering simple patterns in data and recursively finding exact equations. The authors argue that functions in physics and many other scientific applications often exhibit simplifying properties that facilitate their discovery: ① Units: the function and its variables have known physical units; ② Low-order polynomial: the function (or part of it) is a low-degree polynomial; ③ Compositionality: the function is composed of a small set of elementary functions, each with no more than two arguments; ④ Smoothness: the function is continuous and possibly analytic in its domain; ⑤ Symmetry: the function shows translational, rotational, or scaling symmetry with respect to some of its variables; ⑥ Separability: the function can be expressed as a sum or product of parts with distinct variables.
AI-Feynman discovers equations through a sequential process. First, it conducts dimensional analysis on the given data to simplify the problem and reduce the number of variables to dimensionless ones. Second, it uses polynomial fitting, which is a standard method for solving a system of linear equations, to determine the best-fit polynomial coefficients. Third, in the brute force step, all possible combinations of mathematical symbols, functions, and variables are tested to generate an equation that fits the given data set. Since there are
Petersen et al.[11] introduced the reinforcement learning-based deep SR framework which is the new standard for exact symbolic function recovery, particularly in the presence of noise. Building on this, Tenachi et al.[2] presented a Physical Symbolic Optimization (Φ-SO) framework for recovering analytical symbolic expressions from physics data using deep reinforcement learning techniques by learning units constraints.
Tenachi et al.[2] regard symbolic expressions as binary trees where each node represents a symbol of the expression in the library of available symbols, i.e., an input variable (e.g.,
Φ-SO uses a neural network to generate a categorical distribution over symbols and optimizes its parameters according to fit quality and physical unit constraints. The training of the network relies on a reinforcement learning strategy. In this approach, a set of trial symbolic functions is generated and a scalar reward for each function is computed by comparing it with the data. Following Petersen et al.[11], a risk-seeking policy is adopted, which aims to maximize the reward of the few best-performing candidates rather than the average reward. This enables efficient exploration of the search space at the expense of average performance. The software package was obtained from the GitHub repository published by Tenachi et al.[2], available at https://github.com/WassimTenachi/PhySO.
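The risk-seeking idea can be sketched independently of any neural network. In the illustration below, the reward form 1/(1 + NRMSE), the fraction ε = 0.05, and the random batch of rewards are assumptions made for the example rather than settings taken from the study; only candidates above the (1 − ε) reward quantile receive a non-zero weight in the update.

```python
import numpy as np

def reward(y_true, y_pred):
    """Squashed reward in (0, 1]: 1 / (1 + NRMSE). Form assumed for illustration."""
    nrmse = np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.std(y_true)
    return 1.0 / (1.0 + nrmse)

def risk_seeking_weights(rewards, epsilon=0.05):
    """Weight only the top-epsilon fraction of candidates.

    Returns the (1 - epsilon) reward quantile and per-candidate weights
    (reward - quantile), set to zero for every candidate below the quantile.
    """
    rewards = np.asarray(rewards, dtype=float)
    threshold = np.quantile(rewards, 1.0 - epsilon)
    weights = np.where(rewards >= threshold, rewards - threshold, 0.0)
    return threshold, weights

rng = np.random.default_rng(0)
batch_rewards = rng.uniform(0.0, 1.0, size=1000)   # stand-in for one batch of trial expressions
threshold, weights = risk_seeking_weights(batch_rewards, epsilon=0.05)
print("candidates contributing to the update:", int(np.count_nonzero(weights)))
```

Because the average candidate contributes nothing, the policy is pushed toward producing a few excellent expressions rather than many mediocre ones.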
The AI-Feynman algorithm can be executed in a Python environment using the ‘run_aifeynman’ module in the AI-Feynman package. The main variables used in the module are, in order and as shown in Fig. 5, ‘pathdir’, ‘filename’, ‘BF_try_time’, ‘BF_ops_file_type’, ‘polyfit_deg’, ‘NN_epochs’, and ‘vars_name’. Here ‘pathdir’ is the path to the directory containing the data file; ‘filename’ is the name of the file containing the data; ‘BF_try_time’ is the time limit for each brute force call; ‘BF_ops_file_type’ is the file containing the symbols to be used in the brute force code, for which the package provides four default sets; and ‘polyfit_deg’ is the maximum degree of the polynomial tried by the polynomial fit routine, so appropriate selection of this variable is crucial. Additionally, ‘NN_epochs’ is the number of epochs for the training, and ‘vars_name’ gives the names of the variables appearing in the equation. It is important to note that the order and number of variables must match the data file. Dimensional analysis can be performed when the ‘vars_name’ variables are specified with their dimensions. After dimensional analysis, the data used in the algorithm becomes dimensionless. For example, in the case of parabolic motion where the dimensional array is [length, time, mass], gravitational acceleration g is [1, -2, 0], initial velocity v is [1, -1, 0], and height y is [1, 0, 0]. To make them dimensionless with a dimension array of [0, 0, 0], polynomial fitting and the brute force step are conducted with a new target of
The Φ-SO algorithm can be run using the module ‘physo.SR’ in the Python environment. As shown in Fig. 6, ‘X_array’ is the input data and ‘y_array’ is the target data. The variables ‘X_names’, ‘X_units’, ‘y_name’, and ‘y_units’ can be used to specify their names and dimensions. Free constants and their dimensions can be entered in the ‘free_consts_names’ and ‘free_consts_units’ variables. The dimensions can be set as length, time, mass, or other relevant dimensions, in any order. However, it is important to ensure that the order of the dimension array for all variables (e.g., X and y) and constants (e.g.,
Table 1 shows the SR results as mathematical models for the peak of parabolic motion. Figure 7 is a visualization of the regressed equation and theoretical calculations using the ‘matplotlib.pyplot’ package. The theoretical peak equation of parabolic motion is
Table 1. Results of symbolic regression for parabolic motion.
Method | Algorithm | Equation | RMSE | Remarks |
---|---|---|---|---|
Theory | - | | | |
Symbolic Regression | AI-Feynman | | | |
Symbolic Regression | Φ-SO | | | Normalized input data |
The Φ-SO algorithm produces relatively similar results to the theoretical formula. However, the denominator of the algorithm's output model was g, while the denominator of the theoretical formula was 2g. This discrepancy suggests that proper tuning of the algorithm parameters is necessary to achieve a fully consistent formula, and careful interpretation is required depending on the context. The RMSE between the observed and predicted values, estimated by the Φ-SO algorithm, after the normalization was
Table 2 shows the SR results as mathematical models for damped oscillation. Figure 8 is a visualization of the regressed formula and fitted curve on the experimental data using the ‘matplotlib.pyplot’ package. The formula for the curve fitted with the general solution of damped oscillation is
Table 2. Results of symbolic regression for damped oscillation.
Method | Algorithm | Equation | RMSE | Remarks |
---|---|---|---|---|
Curve Fitting (General solution) | - | | | Obtained from ‘scipy.optimize.curve_fit’ module |
Symbolic Regression | AI-Feynman | | | |
Symbolic Regression | Φ-SO | | | Normalized input data |
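The curve-fitting baseline in Table 2 relies on ‘scipy.optimize.curve_fit’. The following is a minimal sketch of such a fit, assuming the underdamped form A·exp(−αt)·cos(ωt + φ); the synthetic data, the parameter values, and the initial guess are all placeholders and do not reproduce the measured data or the fitted constants of the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def damped(t, A, alpha, omega, phi):
    """Underdamped oscillation: A * exp(-alpha t) * cos(omega t + phi)."""
    return A * np.exp(-alpha * t) * np.cos(omega * t + phi)

# Synthetic stand-in for the tracked positions (all values assumed, not measured).
rng = np.random.default_rng(1)
t = np.linspace(0.0, 5.0, 50)
true_params = (0.05, 0.4, 9.0, 0.3)
y = damped(t, *true_params) + rng.normal(0.0, 0.001, t.size)

# A rough initial guess matters: the cosine makes the cost surface multimodal.
p0 = (0.04, 0.5, 9.1, 0.0)
popt, _ = curve_fit(damped, t, y, p0=p0)

rmse = np.sqrt(np.mean((y - damped(t, *popt)) ** 2))
print("fitted (A, alpha, omega, phi):", np.round(popt, 3))
print(f"RMSE: {rmse:.2e}")
```

Unlike SR, this fit presupposes the functional form; only the four constants are learned from the data.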
Since the target value of the experimental data has a dimension of [1, 0, 0], the dimensions of the constants are entered as [0, 1, 0], the dimensions of the constants ω and α are entered as [1, 0, 0], and the dimension of φ is entered as [0, 0, 0]. We found that after dimensional analysis, the input data has been reduced to three dimensionless variables:
From the experimental data of damped oscillation, the Φ-SO algorithm derived the regression formula in the basic form of the general solution of damped oscillation. The formula for the part where the amplitude decays exponentially was expressed as
The results of our study indicate that symbolic regression (SR) can effectively derive equations from real data. In our direct application of SR, idealized data generated by parabolic and damped oscillation formulas yielded satisfactory regressions with minimal effort. However, real data collected from experiments contained noise, leading to long learning times and more complex regression equations. Recognizing that perfect denoising is impractical in a lab setting, we used our prior knowledge of kinematics to infer the shape of the data and selectively excluded unnecessary functions from the options for creating the SR model. In our analysis of parabolic motion data, we leveraged the understanding that the initial velocity of the object can be decomposed along Cartesian axes, allowing the vertical component of the velocity to be expressed using the sine function. Additionally, we applied the knowledge that the peak of parabolic motion is related to the vertical velocity component. For the damped oscillation, we used our knowledge of the general solution of the differential equation for damped oscillation, in which the amplitude decreases exponentially while the frequency remains constant despite damping. This informed our approach of regressing the motion as a combination of cosine functions with maximal initial values and exponential functions. Accordingly, we adjusted hyperparameters, selectively excluding functions other than our target set of sine, cosine, and exponential functions, to minimize errors.
Moreover, while attempting the regression, we encountered many cases in which functions similar in form to cosine, such as logarithms and arcsines, appeared. These functions were removed to reduce learning time and the complexity of the equations. To restrict the functions in the AI-Feynman algorithm, a new operator code was entered into the text file specified in the ‘BF_ops_file_type’ variable of the ‘run_aifeynman’ module. In the Φ-SO algorithm, a list containing the names of the allowed functions was passed to the ‘op_names’ variable of the SR module. This enabled us to obtain the expression relatively quickly. However, when students apply SR to an unfamiliar phenomenon, they must carefully observe the phenomenon, adjust terms in the equation, identify errors, and iteratively converge toward a well-fitting model. In this way, SR is expected to serve as a collaborative tool for students to model unfamiliar phenomena.
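For Φ-SO, the restriction of the function library might be set up as sketched below. The argument names follow those mentioned in the text; the operator tokens, the free-constant names, the data, and the layout of the input array (variables along the first axis) reflect our reading of the PhySO documentation and should be treated as assumptions. The helper ‘run_search’ is hypothetical, and the search itself is skipped when the package is not installed.

```python
import numpy as np

# Illustrative data: launch angle (rad) -> peak height (m); not the study's measurements.
theta = np.radians(np.linspace(20.0, 70.0, 11))
height = (2.06 * np.sin(theta)) ** 2 / (2 * 9.807)

sr_kwargs = dict(
    X_names=["theta"],
    X_units=[[0, 0, 0]],                 # angle is dimensionless; order is [length, time, mass]
    y_name="H",
    y_units=[1, 0, 0],                   # length
    free_consts_names=["v0", "g"],       # hypothetical names for the free constants
    free_consts_units=[[1, -1, 0], [1, -2, 0]],
    # Restricted library: arithmetic plus the trigonometric functions we expect
    # (token names assumed; logarithm and arcsine are deliberately left out).
    op_names=["add", "sub", "mul", "div", "n2", "sin", "cos"],
)

# Every unit vector must use the same dimension order and length.
unit_vectors = sr_kwargs["X_units"] + [sr_kwargs["y_units"]] + sr_kwargs["free_consts_units"]
assert all(len(u) == 3 for u in unit_vectors)

try:
    import physo  # optional dependency; the search is skipped if it is missing
except ImportError:
    physo = None

def run_search(epochs=5):
    """Hypothetical helper: launch the search with the restricted operator list."""
    if physo is None:
        raise RuntimeError("physo is not installed")
    X = theta.reshape(1, -1)             # assumed layout: (n_variables, n_samples)
    return physo.SR(X, height, epochs=epochs, **sr_kwargs)
```

Calling ‘run_search()’ with a small number of epochs is enough to check that the unit vectors and operator names are accepted before committing to a long run.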
Physics classes integrating SR are expected to offer an innovative approach to overcome the limitations of previously proposed physics education programs incorporating artificial intelligence. In previous studies, deep neural network (DNN) algorithms have been proposed for analyzing and predicting the motion of objects[19-21]. Students in these studies are guided to create a model using experimental data, make predictions with test data, and verify them using validation data, while also learning about the models’ loss or error based on actual and predicted values. However, the mere predictive success of an ML model may not suffice for explanatory understanding unless the tool offers some explanatory information[22]. In contrast, mathematical models derived from SR facilitate the comprehension of the relationships between variables and system behavior, enhancing students’ explanation of physical phenomena.
The problem-solving approach of both algorithms, AI-Feynman and Φ-SO, is significant in terms of physics education. Both involve dimensional analysis, which is powerful when applied to a complete mathematical model in algebraic (differential and/or integral) form. It is remarkably productive even when a complete model is unknown or unwieldy and the analysis must be applied to a simple list of the relevant variables[23]. Because dimensionless variables and scaling are powerful tools for emphasizing the universality of natural laws[24], they are taught in higher physics education. However, in the AI-Feynman and Φ-SO algorithms, dimensional analysis is performed automatically and is not visible to students. Therefore, instructors can explicitly point out the dimensional analysis in both algorithms and teach the importance and role of dimensionless quantities in physics problem solving. It is expected that students will be able to identify the relationships between variables through dimensional analysis and learn that such a process contributes to simplifying complex systems.
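To make this hidden step visible in class, the dimensional analysis can be carried out by hand in a few lines. The sketch below uses the dimension vectors given earlier for parabolic motion, in the order [length, time, mass], and finds the exponents that make a product of g, v, and y dimensionless; it illustrates the idea only and is not the internal implementation of either package.

```python
import numpy as np
from scipy.linalg import null_space

# Dimension vectors in the order [length, time, mass], as given in the text.
dims = {
    "g": [1, -2, 0],   # gravitational acceleration
    "v": [1, -1, 0],   # initial velocity
    "y": [1, 0, 0],    # height
}
names = list(dims)
D = np.array([dims[n] for n in names], dtype=float).T   # rows: dimensions, columns: variables

# Exponent vectors p with D @ p = 0 define dimensionless products prod(x_i ** p_i).
basis = null_space(D)
p = basis[:, 0] / basis[names.index("y"), 0]   # scale so that y appears to the first power
exponents = dict(zip(names, np.round(p, 6)))
print(exponents)
```

The single null-space vector has exponents (1, −2, 1) for (g, v, y), that is, the dimensionless group gy/v², which is why the problem reduces to a relation between dimensionless quantities.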
Further discussion is needed to determine whether deriving mathematical equations from data using SR can be considered authentic scientific modeling. However, we identified the possibility of improving the efficiency of the modeling process with SR. Hestenes[1] proposed a scientific modeling strategy consisting of description, formulation, ramification, and validation, in the order of their implementation. In this study, we derived formulas for parabolic motion and damped oscillation using SR through a process of data collection, machine learning, formulation, and validation. Specifically, SR demonstrated time savings in the data-driven formulation. Realizing that it took Kepler four years to discover the elliptical orbit of Mars, whereas SR greatly reduces the time needed to derive Kepler's laws, gives students insight into the impact of technology on the scientific problem-solving process.
To successfully implement SR in physics classes, it is essential to first explore appropriate instructional strategies. If students merely derive mathematical equations from data and cease further inquiry, the activity may be perceived as a mere description of phenomena. The scientific method of Whewell[25] consists of observing a phenomenon and then discovering an idea that connects the observations by introducing ‘some general conception,’ which is given, not by the phenomena, but by the mind. While Kepler introduced an idea that connects Brahe’s observations, it was Newton who delved into the underlying mechanisms behind these patterns and provided insightful explanations of why planets move in such patterns[26]. To offer students an opportunity to engage with scientists' discoveries, physics lessons using SR should go beyond deriving equations and attempt to explain the phenomena hidden in the data with physics theory.
In answer to the question of whether SR will diminish the role of scientists in the future, Schmidt & Lipson[10] argue that it may actually help scientists focus on interesting phenomena more rapidly and interpret their meaning. To engage students with captivating phenomena, it is imperative to offer ample opportunities for exploration and discovery within the physics classroom. While this study showcases the promise of SR as a tool to assist students in mathematical modeling, further research is necessary to gauge its efficacy at the cognitive level of students. Therefore, further studies that integrate SR into actual physics classes and assess its influence on students' scientific problem-solving processes would be highly beneficial.
This work was supported by the Creative-Pioneering Researchers Program through Seoul National University.