DYNAMIC AIR TICKET PRICING USING REINFORCEMENT LEARNING METHOD

Abstract. This paper studies a dynamic air ticket pricing problem in a market where strategic and myopic passengers co-exist. Both strategic and myopic passengers can be further divided into high-valuation and low-valuation groups according to how they evaluate their purchases, and strategic passengers have different strategic levels. When the airline sets a ticket price, every passenger makes a purchase decision according to his or her type and strategic level, or may choose to "wait" or "leave" the market. The paper first proposes a dynamic pricing algorithm in which the utilities of both the airline and the passengers are considered. Reinforcement learning (RL) is employed to handle the sequential decision-making framework: the dynamic pricing problem is formulated as a discrete finite Markov decision process (MDP) and Q-learning is adopted to solve it. With this method, the airline can adaptively set the ticket price based on passengers' strategic behaviors and the time-varying demand. The effects of the passenger type proportions and strategic levels are analyzed. The computational results show that the higher the proportion of strategic passengers is, the smaller the price increase the airline can adopt, and, at the same strategic level, the higher the proportion of high-valuation strategic passengers is, the larger the price increase the airline can apply. If the proportion of low-valuation strategic passengers is higher, a price-increase strategy should be gentle and gradual, and a price-cut strategy should use small adjustments. In addition, high-valuation passengers mainly affect high-price periods while low-valuation passengers mainly affect low-price periods. When the proportion of strategic passengers is fixed, the lower the passengers' strategic level is, the steeper the price path is.
These findings can provide references for the airline to make more precise and flexible pricing decisions.


Introduction
In recent years, as GDP growth has slowed, the growth of air transport demand has weakened. However, the growth of overall air transport capacity has not decreased, resulting in fierce competition between airlines. In such a situation, airlines are eager to find new approaches to identify passenger demand more accurately and respond to it in a more timely manner.
Revenue management (RM), a method to improve revenue, was first introduced by American Airlines (AA) in the US in the 1980s in response to fierce airline competition. The common approach for an airline to conduct RM is seat inventory control. The airline sets several prices, or classes, for the seats on a flight, then controls the number sold in each class according to market demand. Obviously, the more classes it sets, the more revenue it can obtain. This led to dynamic ticket pricing in the 2000s. Dynamic pricing is a method in which the airline dynamically changes the ticket price according to demand. There are two approaches for the airline to conduct dynamic ticket pricing: one does not consider the game between the airline and the passenger, and the other does. Game theory is mainly used to study the rational decision-making behaviors of interdependent and mutually influencing participants and the equilibrium results of these decisions. A game model can be expressed by players, actions, strategies, information and utilities. Under the game scenario, the airline needs to consider not only its own inventory levels, costs and the impact of competitors, but also the possible choices made by passengers. The passenger compares different products (tickets) vertically and horizontally, that is, flights at different departure times within one airline and flights of different airlines. Levin et al. [21] described the equilibrium at each moment as a Stackelberg equilibrium between the airline and the passenger. The passenger's goal is to maximize his or her utility, while the airline wants to capture as much of the passenger's surplus as possible to improve its total revenue.
Current studies of dynamic pricing mainly consider two or a few time periods within the whole selling horizon. Such methods cannot respond to demand in a timely manner. Meanwhile, most studies do not adequately consider passenger behaviors, which are increasingly important to ticket pricing in today's market. In this paper, we differentiate passengers into several categories, such as strategic and myopic passengers. A strategic passenger, i.e., one who evaluates the present and future purchase utility in order to maximize his or her benefit, may forgo an immediate purchase and wait for a lower price in the future; a myopic passenger purchases immediately when the price is less than or equal to his or her valuation (willingness to pay). There is no doubt that the existence of strategic passengers in the market makes it difficult for airlines to make pricing decisions. Shen and Su [32] pointed out that passenger choice behavior is mainly reflected in two aspects: when to buy and which to buy. Gonsch et al. [14] reviewed the related studies on dynamic pricing with strategic passengers. They found that strategic passengers can continuously predict the airline's expected future price in an e-commerce environment, and that the airline should adopt a corresponding pricing policy based on the passenger categories and their purchasing behaviors in the market. Thus, passengers' classification and their purchase behaviors are the main focuses of this paper.
The rest of this paper is organized as follows. In Section 2, the literature review is presented. Section 3 gives the problem description and assumptions. In Section 4, mathematical formulations for passengers and the airline are presented. In Section 5, the reinforcement learning (RL) algorithm is presented, which includes the simulation of passenger behaviors and the adoption of Q-learning to solve the decision-making problem. Section 6 contains the computational experiments and analysis. Finally, conclusions are presented.

Literature review
In studies of dynamic pricing considering strategic customers, Mersereau and Zhang [28] assumed that the airline knows the total demand curve, but that the proportion of strategic customers is unknown. They established a robust pricing model which is independent of the true proportion value. Su [33] and Kremer et al. [18] divided the sale period into two periods and studied pricing strategies to analyze the impact of different types of customers. Su [33] pointed out that customers are heterogeneous along two dimensions: their valuations for the product and their degree of patience. According to these two dimensions, customers were categorized into four types, and the different influences on pricing policy were analyzed. Kremer et al. [18] suggested that the proportion of strategic customers influences the optimal pricing policy. Zhang and Zhang [37] considered a perishable product in a two-period model in which a retailer decides the order quantity and price at the beginning of the first period. Customers pay full price in the first period and a marked-down price in the second period. Dong and Wu [3] examined the impact of strategic and heterogeneous consumers on pricing and inventory decisions in a two-period model. They found that strategic consumers may yield more revenue in specific scenarios. Li et al. [24] established two-period models to study a platform's discount pricing strategies with strategic consumers. The results show that a large discount will reduce the total demand under the instant strategy, that the fraction of strategic consumers affects the platform's strategy choices, and that the existence of strategic consumers will increase the product prices in both periods. Guan and Ren [15] divided the entire sale period into a normal sale period and a clearance period; they assumed that the price strategy relates to the proportion of strategic customers and the dependence degree of the reference price. Correa et al. [8] proposed a class of preannounced pricing policies in which the price path corresponds to a price menu contingent on the available inventory. Some other studies established two-period dynamic pricing models considering both myopic and strategic customers, and discussed the impact of different strategic customer ratios on pricing strategies [10, 16, 34, 36, 38]. All the above studies only consider dynamic pricing in two periods.
Most studies of multi-period dynamic pricing only consider strategic customers. Levin et al. [20, 21] established pricing models for oligopolistic and monopoly companies, respectively, and proved that the monotonicity of pricing strategies relates to the degree of customer rationality. Through learning from the customer arrival rate and reservation price, Levina et al. [22] proposed a strategic waiting factor which was used in the customer choice model. Liu and Zhang [25] considered two companies which provide two vertically differentiated products. Customers choose a product according to its quality. When customers become more strategic, the companies' revenues decrease, and the company selling low-quality products suffers a greater loss than the company selling high-quality products. Chen and Farias [4] assumed that the customer is forward-looking and that his or her valuation of the product decreases with time. A robust pricing strategy can be obtained when the discount factor and the cost distribution are unknown to the customer. Li et al. [23] used the Bayesian posterior probability to update the arrival rate based on past sales experience and the number of passenger arrivals. From these studies, we can see that most assume there are only strategic passengers in the market, and that all passengers are homogeneous with the same valuation, or the same value distribution, for the same product. There are few studies considering strategic and myopic customers over multiple periods [17].
With the continuous development of artificial intelligence technology, more and more scholars have tried to use intelligent methods to solve the dynamic pricing problem. Reinforcement learning (RL) is one of the most widely used methods [1, 5-7, 9, 11, 12, 29-31, 35]. Among them, Collins and Thomas [6] incorporated a variety of customer demand models within a simple airline pricing game to gauge the usefulness of three different RL approaches as a game-theoretic solving mechanism. The results show that applying RL to the game is beneficial, both for solving games that are unsolvable using traditional methods and for giving an extra dimension of insight into the game. In addition, Collins and Thomas [7] and Dogan and Gner [11] used a Markov decision process (MDP) framework and the Q-learning algorithm to study the dynamic pricing problem, but customer behaviors were not considered in the learning process. In smart grids, RL has also been applied to dynamic pricing, where pricing strategies are learned in a customer simulation environment [26, 27].
Differing from the above-mentioned studies, this paper analyzes passenger purchase behaviors more accurately, considers the variation of passenger arrival rates and reservation prices, and conducts dynamic pricing over multiple time periods with co-existing myopic and strategic passengers. The heterogeneity of passengers is considered along two dimensions, the valuation and the strategic level, which jointly classify passengers into four categories or types. A pricing model and an RL algorithm are established to obtain the optimal dynamic pricing strategies. In the RL algorithm, passengers' behaviors are simulated as the learning environment and the airline is set as the agent. Further, for the different passenger types, the impacts of their strategic levels and proportions on the optimal price policies are analyzed.
J. GAO ET AL.

Problem description
The airline sells a certain number of air tickets within a finite time horizon. This number N is the total number of seats on the flight that can be sold. The objective is to maximize the total revenue of the flight. Owing to the special property of air tickets, the residual value of unsold tickets is zero after the sale time ends. The entire sale time is divided into T time periods, t ∈ {1, 2, . . ., T}. Each time period should be small enough to guarantee that at most one passenger arrives in it. p_t denotes the ticket price in time period t. Assume there are M passengers in the market, and n denotes the number of passengers who have purchased tickets. The airline and the passengers are rational and try to maximize their own utilities. Let U(t, n) and V(t, n) denote the total utility of the passengers and of the airline, respectively, from time period t to the end of the sale time, given that n passengers have purchased tickets.
There are two basic types of passengers in the market, strategic and myopic, and each type has its own arrival rate. After a passenger arrives, he or she decides whether to buy the ticket at the price given by the airline. Myopic passengers buy if the price is less than or equal to their valuation; otherwise, they leave the market. Strategic passengers compare the current utility with the future utility, and then decide whether to buy immediately, wait for a lower price in the future, or leave the market. If the current utility is greater than the future utility, they buy; otherwise, they remain in a "wait-and-see" state. The price, the remaining tickets and the waiting time are all considered by strategic passengers in their decisions.
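The decision rule just described can be sketched as a small function. The linear utility u(x) = x and all argument names here are illustrative assumptions, not the paper's exact formulation:

```python
def passenger_decision(p, valuation, strategic, eta=0.0, future_utility=0.0):
    """Decide 'buy', 'wait', or 'leave' for an arriving passenger.

    p: current ticket price; valuation: willingness to pay;
    strategic: whether the passenger is strategic; eta: strategic level;
    future_utility: expected utility of a later purchase (U(t+1, n)).
    A linear utility u(x) = x is assumed for illustration.
    """
    current = valuation - p                   # immediate surplus u(v - p)
    if not strategic:
        return "buy" if current >= 0 else "leave"
    discounted_future = eta * future_utility  # present value of waiting
    if current >= discounted_future:
        return "buy"
    if current >= 0:   # positive surplus now, but waiting looks better
        return "wait"
    return "leave"
```

A myopic passenger ignores `future_utility` entirely, while a strategic passenger with eta = 0 behaves myopically, matching the strategic-level interpretation used later in the paper.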
(1) The impact of the remaining tickets on strategic passengers. Strategic passengers usually have a strong willingness to buy tickets, represented by the purchase probability, but they are unwilling to buy if the price is higher than their valuation. In addition to the price, the seat inventory also plays a role in their decision-making. When the inventory is sufficient, they face no risk in postponing the purchase and will wait until the price is equal to or lower than their valuation. When the inventory becomes scarce, their purchase behaviors depend closely on their purchase willingness. If the willingness is strong, they buy immediately; otherwise, they keep waiting for a possible price cut.
(2) The impact of the waiting time on strategic passengers. The waiting time also affects the purchase behavior of strategic passengers. If a strategic passenger has a weak willingness to buy and the price does not fall below his or her valuation after a long wait, he or she leaves the market. How long a strategic passenger can wait depends on his or her willingness: if the willingness is strong, the waiting time is relatively long. In addition, some passengers may worry about a price increase after a long wait, so they choose to buy as soon as possible.
A strategic passenger's behavior can be described by a purchase probability density function. Figure 1 illustrates the relationship between the price p, the purchase probability density f(p) and the purchase probability F(p). The reservation price reflects the fact that different passengers have different willingness to pay. At a particular price p, the airline can only capture the passengers whose reservation prices (willingness to pay) are greater than or equal to p. The area under the density curve from p to infinity is the purchase probability F(p) in the whole market, and Φ(p) is the integral of f(p) from zero to the price p. Obviously, F(p) = 1 − Φ(p).
Curve 2 is obtained by shifting curve 1 to the right to match the changed situation. It can be used to illustrate the effect of the remaining seats or the remaining selling periods. A decrease in the remaining seats or selling periods leads to an increase in the willingness to pay (the passenger's reservation price). At a given price p, the overall purchase probability increases (the area under curve 2 is larger than the area under curve 1) as the remaining seats or selling periods decrease.
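The relation F(p) = 1 − Φ(p) and the curve-shift effect can be checked numerically. A normal reservation-price distribution is an illustrative assumption here; the paper only requires some distribution Φ:

```python
import math

def purchase_probability(p, mean, std):
    """F(p) = Pr(reservation price >= p), assuming a normal reservation-price
    distribution with the given mean and std (illustrative assumption)."""
    phi = 0.5 * (1.0 + math.erf((p - mean) / (std * math.sqrt(2.0))))  # CDF value Phi(p)
    return 1.0 - phi

# Shifting the distribution to the right (curve 1 -> curve 2, e.g. fewer
# remaining seats or periods) raises F(p) at any fixed price p:
f1 = purchase_probability(20.0, mean=18.0, std=4.0)  # curve 1
f2 = purchase_probability(20.0, mean=22.0, std=4.0)  # curve 2
```

The check f2 > f1 mirrors the statement that the area under curve 2 beyond p exceeds the area under curve 1.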

The assumptions
Assumption 3.1. Passengers are heterogeneous along two dimensions, the ticket valuation and the strategic level. In the valuation dimension, we consider two types, high valuation and low valuation. Let v_H and v_L denote the mean valuations of the high-valuation and the low-valuation passengers, respectively. In the strategic-level dimension, we divide passengers into "strategic" and "myopic". Combining the valuation and the strategic dimensions, we have four types of passengers: the high-valuation strategic passengers (HS), the high-valuation myopic passengers (HM), the low-valuation strategic passengers (LS) and the low-valuation myopic passengers (LM). Let i denote the passenger type, so i ∈ Θ = {HS, HM, LS, LM}.
Assumption 3.2. The total number of passengers, the proportion of each type and the strategic level can be deduced from historical data.
Assumption 3.3. Passengers of type i arrive according to a non-stationary Poisson process with arrival rate λ_i(t), where Λ_i(t) = ∫_0^t λ_i(s) ds is the mean function of the non-stationary Poisson process.
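A non-stationary Poisson arrival stream of this kind can be sampled with Lewis-Shedler thinning. The rate function and its bound below are illustrative inputs; the assumption only requires that such a process governs arrivals:

```python
import random

def simulate_arrivals(rate_fn, horizon, lam_max, rng=None):
    """Sample arrival times on (0, horizon] from a non-stationary Poisson
    process with intensity rate_fn(t) <= lam_max, by thinning a homogeneous
    process of rate lam_max (Lewis-Shedler algorithm)."""
    rng = rng or random.Random(0)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(lam_max)             # next candidate arrival
        if t > horizon:
            return arrivals
        if rng.random() <= rate_fn(t) / lam_max:  # keep with prob lambda(t)/lam_max
            arrivals.append(t)
```

With a linearly increasing rate, later periods receive proportionally more arrivals, which is how time-varying demand enters the simulation environment.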
Assumption 3.4. The airline and the passengers have perfect knowledge of all market information, including the remaining capacity of the flight and the distributional and parametric characteristics of the passengers' reservation prices. In addition, the reservation price distributions of all passenger types are independent.
The above assumptions about the passenger types and the passenger arrivals are common in this research field. The perfect information assumption is also reasonable. Nowadays, the airline has many ways to obtain market information. First, through GDSs such as TravelSky, Worldspan and Galileo, the airline can observe the selling progress of all airlines in the market. Second, through cooperating OTAs and its own e-commerce website, the airline can observe passenger behavior (such as how many times a passenger visited the relevant platforms, what his or her interests are, what his or her focus is, etc.), which contains many features that are very valuable for the airline's decisions. The passenger also has ways to sense the market through various platforms or personal channels. In addition, perfect information is an important limiting case with significant potential for management insight regarding pricing and other policies.

The passenger utility model
A fraction α of the total passengers are strategic passengers, and the remaining 1 − α are myopic passengers. Among strategic passengers, the proportion of high-valuation passengers is β_s and the proportion of low-valuation passengers is 1 − β_s. Similarly, among myopic passengers, the proportion of high-valuation passengers is β_m and the proportion of low-valuation passengers is 1 − β_m. θ_i denotes the proportion of type i in the total passenger population, so the proportions of the four passenger types are {θ_HS, θ_HM, θ_LS, θ_LM} = {α β_s, (1 − α) β_m, α (1 − β_s), (1 − α)(1 − β_m)}. The strategic level of passengers of type i is η_i. It reflects how much the passenger values a future purchase: for a type-i passenger, the utility of buying a ticket in the future is discounted by η_i ∈ [0, 1]. η_i = 0 means that the passenger completely disregards the possibility of a future purchase, and η_i = 1 means that the passenger values the current purchase the same as a purchase at any point in the future. Intermediate values η_i ∈ (0, 1) determine how long passengers can postpone their purchases without excessive loss of utility. Obviously, 0 < η_HS, η_LS ≤ 1 and η_HM = η_LM = 0.
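The four type proportions are simple products of the two splitting fractions. A minimal sketch, with the symbol names α, β_s, β_m taken as assumptions about the garbled notation:

```python
def type_proportions(alpha, beta_s, beta_m):
    """Proportions of the four passenger types.
    alpha: fraction of strategic passengers; beta_s / beta_m: fraction of
    high-valuation passengers among strategic / myopic passengers."""
    return {
        "HS": alpha * beta_s,              # high-valuation strategic
        "HM": (1 - alpha) * beta_m,        # high-valuation myopic
        "LS": alpha * (1 - beta_s),        # low-valuation strategic
        "LM": (1 - alpha) * (1 - beta_m),  # low-valuation myopic
    }
```

The four proportions always sum to one, so they form a valid categorical distribution over passenger types.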
The passenger utility function is expressed as u(·). For a particular ticket, the passenger utility is denoted as u(v'_i − p_t), where v'_i represents the valuation of a passenger of type i and p_t represents the ticket price in time period t, as mentioned in Section 3.1. u(·) is a strictly increasing function, with inverse function u^{-1}(·). No matter whether the passenger is myopic or strategic, if he or she buys the ticket immediately upon arriving in time period t, his or her expected utility is u(v'_i − p_t). The future passenger utility is expressed as U(t + 1, n); for a strategic passenger, the goal of the model is to capture this intertemporal behavior, and the present value of the future passenger utility is η_i U(t + 1, n). λ_{i,t} denotes the arrival rate of passengers of type i in time period t; its range is [0, λ̄], where λ̄ is the maximum arrival intensity. According to Levina et al. [22], the purchase probability is related to the number of remaining seats (or the number of passengers who have purchased tickets), the current time period and the ticket price, and is expressed as Λ_i(t, n, p_t). Then Λ(t, n, p_t) = Σ_{i∈Θ} θ_i Λ_i(t, n, p_t), that is, Λ(t, n, p_t) equals the expected purchase probability over all passenger types. U_i(t, n, p_t, v'_i) represents the present value of the expected utility of a passenger with price p_t and valuation v'_i in time period t with n tickets already purchased. The passenger utility is a function of λ_{i,t}.
The first term represents the utility if the passenger buys the ticket immediately. The second and the third terms are both present values of the future utility; the difference between them is whether the passenger chooses to wait or to leave. By merging similar terms, the expression can be rewritten accordingly. Because λ_{i,t} ≥ 0, the condition u(v'_i − p_t) ≥ η_i U(t + 1, n) guarantees the passenger's instant purchase: if the immediate utility is greater than or equal to the present value of the expected future utility, the passenger maximizes his or her own utility by buying now. Because the inverse of the utility function exists, this condition can be transformed accordingly. Taking the expectation of the passenger utility over all possible valuations gives the passenger utility in state (t, n):
U(t, n) = E_{v'}[U_i(t, n, p_t, v')], n = 0, . . ., N; t ∈ {1, . . ., T}. (4.3)
The termination conditions (4.4) and (4.5) are:
U(t, N) = 0, t ∈ {1, . . ., T}, (4.4)
U(T, n) = 0, n ∈ {0, . . ., N}. (4.5)
Equations (4.4) and (4.5) mean that when there is no seat left or the sale time ends, the passenger utility is zero.
In reality, the following three purchase behaviors or cases occur under the condition u(v'_i − p_t) ≥ η_i U(t + 1, n). Let P^k_{i,t} denote the probability of the corresponding choice by an arriving passenger of type i in case k (k = 1, 2, 3) in time period t.
Case 1. Myopic passengers buy tickets immediately when the current utility is greater than or equal to zero.
The purchase probability of an arriving myopic passenger is P^1_{i,t} = Pr{u(v'_i − p_t) ≥ 0}.
Case 2. Strategic passengers buy tickets only if the current utility is greater than or equal to the present value of the utility that they may obtain in the future. The purchase probability of an arriving strategic passenger is P^2_{i,t} = Pr{u(v'_i − p_t) ≥ η_i U(t + 1, n)}.
Case 3. When the utility is greater than or equal to zero but less than the present value of the future expected utility, the arriving strategic passenger chooses to wait. The probability that an arriving strategic passenger chooses to wait is P^3_{i,t} = Pr{0 ≤ u(v'_i − p_t) < η_i U(t + 1, n)}.
Let B_t = θ_HM P^1_{HM,t} + θ_LM P^1_{LM,t} + θ_HS P^2_{HS,t} + θ_LS P^2_{LS,t}, and let λ_t be the average arrival intensity in time period t. Then the probability that a myopic or strategic passenger arrives and buys a ticket in time period t is λ_t B_t; the probability that a strategic passenger arrives and chooses to wait is λ_t (θ_HS P^3_{HS,t} + θ_LS P^3_{LS,t}); and the probability that no passenger arrives, or that an arriving passenger chooses not to buy, is 1 − λ_t B_t. Obviously, the buying probability depends on the expected future utility U(t + 1, n), and the expected future utility depends on the passenger's rationality. According to Levin et al. [20], if a passenger is completely rational, he or she adopts a solution balancing the passenger utility and the airline utility: p_t in U_i(t, n, p_t) should be an equilibrium price p*(t, n) obtained through the game, and the passenger utility can then be represented as U_i(t, n, p*(t, n)). However, passengers are not completely rational in reality, and they cannot know the complete pricing information. Hence, we suppose passengers are partially rational and use their past purchase experience to estimate the expected value of the future utility.

The airline's pricing model
We use a Markov decision process (MDP) to construct a dynamic programming model for the airline's pricing behavior. An MDP is a discrete-time state transition process. It consists of five elements (S, A, R, P, γ). S denotes the set of states. A denotes the set of actions the airline can select. R is the set of expected immediate rewards. P is the state transition probability given by P(s', r | s, a), the probability that the state changes from s to s' with reward r under action a, a ∈ A. γ ∈ [0, 1] is the discount factor, which represents the relative importance of future versus present rewards. The state transitions of an MDP must satisfy the Markov property: the next state s' depends only on the current state s and the decision-maker's action a. That is, given s and a, the next state is conditionally independent of all previous states and actions, which can be written as P(s', r | s, a) = Pr{S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a}. In other words, the behavior in time period t is related only to the state of time period t − 1. For the dynamic pricing problem, the state is n, the number of seats that have been sold; it can be written as S_t = n, meaning n seats have been sold by the current time period. The selectable actions in this dynamic pricing problem form the set of prices A, and the price in time period t is p_t ∈ A.
V(t, n) denotes the total revenue from the current state to the end of the sale time. Figure 2 explains the state transitions of the MDP: it contains the state nodes S_t and the action nodes A_t, and after implementing action A_t according to a certain strategy, the state changes from S_t to S_{t+1}. The passenger's choices here can be purchase, wait or leave. Figure 3 explains the ticket sale process. Under each state-action pair, there are two possible results: if a ticket is sold, the reward equals the current price; if not, the reward is zero. When a ticket is purchased by a passenger, the number of passengers that have purchased tickets increases by 1; otherwise, it stays the same. In Figure 3, the right branch illustrates the case where a passenger buys a ticket priced at 30 dollars by the airline: the revenue increases by 30 dollars, the purchase count increases by 1, and the time enters the next period. That is, the state changes from S_{t=3} = 2 to S_{t=4} = 3.
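The transition just described, where a sale yields a reward equal to the price and advances the state by one, can be sketched as a single MDP step. The function and argument names are illustrative:

```python
import random

def step(n, price, buy_prob, rng):
    """One MDP transition of the pricing problem. State n is the number of
    seats sold; the action is the chosen price. If a ticket is sold, the
    reward equals the price and n increases by 1; otherwise the reward is 0."""
    if rng.random() < buy_prob:
        return n + 1, price  # ticket sold at the posted price
    return n, 0.0            # no sale this period

# With a certain sale, the state moves 2 -> 3 and the reward is the price,
# mirroring the 30-dollar branch in Figure 3:
next_n, reward = step(2, 30, buy_prob=1.0, rng=random.Random(0))
```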
Considering the presence of all four passenger types, if the current state is n in time period t, there are also three behaviors or cases.
Case 1. When a myopic or strategic passenger arrives with a certain probability and buys a ticket, the revenue state of the next time period is transferred to V(t + 1, n + 1) and the gain obtained is r_t = p_t. The revenue in this scenario is:
V_1(t, n) = λ_t B_t * (p_t + V(t + 1, n + 1)). (4.15)
Case 2. When a strategic passenger arrives and chooses to wait, the revenue state of the next time period is V(t + 1, n) and the return r_t is zero. The revenue in this scenario is:
V_2(t, n) = λ_t (θ_HS P^3_{HS,t} + θ_LS P^3_{LS,t}) * V(t + 1, n). (4.16)
Figure 3. The state-action tree.
Case 3. When no passenger arrives, or a passenger arrives and then leaves the market, the revenue state of the next time period is also V(t + 1, n) and the return is zero. The revenue in this scenario is:
V_3(t, n) = [1 − λ_t B_t − λ_t (θ_HS P^3_{HS,t} + θ_LS P^3_{LS,t})] * V(t + 1, n). (4.17)
In equations (4.16)-(4.17), V(t + 1, n) is numerically the same as V(t, n) if we consider only the present revenue or income: no sale, no income. However, the purchase probability relates to the seats sold (or the remaining seats) and the current time period, and passengers' valuations of the ticket relate to them as well. If we consider the future influence of passengers' purchase behaviors, the total revenues to the end of the sale time V(t, n) and V(t + 1, n) are different.
So, the total revenue under all three scenarios in state S_t = n is:
V(t, n) = λ_t B_t (p_t + V(t + 1, n + 1)) + (1 − λ_t B_t) V(t + 1, n). (4.18)
Since the reservation-price distribution F_{i,t}(p) is monotonically decreasing in p, and the purchase probability is therefore also monotonically decreasing, equation (4.18) can be expressed in the form of an MDP.
where n = 0, . . ., N; t ∈ {1, . . ., T}. The termination conditions are V(T, n) = 0 and V(t, N) = 0, which mean that when the sale time ends or all the tickets are sold out, the sale process is over and the residual value is zero.
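The recursion (4.18) together with its termination conditions can be solved by backward induction when the purchase probability is known. The sketch below assumes a supplied model buy_prob(t, n, p), an illustrative stand-in for λ_t B_t, and returns both the value table and a greedy price policy:

```python
def optimal_revenue(T, N, prices, buy_prob):
    """Backward induction on the recursion of equation (4.18).

    buy_prob(t, n, p): probability a ticket is sold in period t at price p
    with n seats already sold (a supplied demand model, here an argument).
    Returns the value table V and the greedy price policy. A sketch, not
    the paper's exact procedure.
    """
    V = [[0.0] * (N + 1) for _ in range(T + 2)]    # V[T+1][.] = 0 and V[.][N] = 0
    policy = [[None] * (N + 1) for _ in range(T + 1)]
    for t in range(T, 0, -1):
        for n in range(N - 1, -1, -1):             # V[t][N] stays 0: sold out
            best, best_p = 0.0, None
            for p in prices:
                q = buy_prob(t, n, p)
                val = q * (p + V[t + 1][n + 1]) + (1 - q) * V[t + 1][n]
                if val > best:
                    best, best_p = val, p
            V[t][n], policy[t][n] = best, best_p
    return V, policy
```

For a toy linear demand model, the induction picks the revenue-maximizing price in each state, which is the benchmark the RL algorithm later approximates without an explicit demand model.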

Algorithm design
Due to the large scale of possible states, it is difficult to solve the pricing model directly with an optimizer, so we designed an RL algorithm. RL learns a sequence of decisions that maximizes the total future reward through a certain sequence of action choices, using a trial-and-error approach. This algorithm contains two important parts: the simulation of passenger behaviors (the environment) and the learning agent.
(4) Determination of passenger purchase behavior. A two-dimensional array W = [(w_1, c_1), (w_2, c_2), . . ., (w_j, c_j), . . .] is added to store every strategic passenger who is waiting. Each (w_j, c_j) contains the passenger's waiting time and the number of remaining seats in the current time period. In each time period, when the airline interacts with the environment, the purchase behavior is determined by the current valuation distribution. If a passenger chooses to wait, he or she is added to W. Meanwhile, if a passenger in W buys the ticket or leaves the market, the corresponding (w_j, c_j) is deleted from W. W is dynamically updated as time elapses.
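The bookkeeping for the waiting list W can be sketched as a per-period update. The function name and the per-passenger decision labels are illustrative:

```python
def update_waiting_list(W, decisions, remaining_seats):
    """Update W = [(waiting_time, seats_when_observed), ...] at period end.

    decisions[j]: choice ('buy', 'wait', 'leave') of the j-th waiting
    strategic passenger this period. Waiting passengers age by one period
    and record the current remaining seats; buyers and leavers are removed.
    """
    kept = []
    for (w, _), d in zip(W, decisions):
        if d == "wait":                        # still waiting: age +1, refresh seats
            kept.append((w + 1, remaining_seats))
        # 'buy' and 'leave' entries are dropped from W
    return kept

W = [(1, 10), (3, 10), (2, 10)]
W = update_waiting_list(W, ["wait", "buy", "leave"], remaining_seats=9)
```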

The Q(λ) algorithm design
S represents the set of all possible states, with s = S_t and s' = S_{t+1}. The pricing policy or strategy is expressed as π, which maps states to actions, π : S → A; every such policy is under consideration. π(a | s) is the probability of performing action a in state s. P(s' | s, a) is the probability that the state changes from s to s' when the ticket price is set to a. This probability is given by λ_t B_t and 1 − λ_t B_t, depending on the passengers' choices: λ_t B_t is the probability that one ticket is sold, and 1 − λ_t B_t is the probability that no ticket is sold. V_π(s) is the total revenue that can be obtained from state s to the end of the sale time under policy π. Rewriting equation (4.18) as a state value function gives:
V_π(s) = Σ_a π(a | s) Σ_{s'} P(s' | s, a) [r + γ V_π(s')]. (5.1)
Equation (5.1) shows that the expected revenue from state s equals the current reward plus the expected revenue from the next state s', thus constructing a recurrence relation. Here, π(p | s) is used in place of π(a | s), since the action here is the price.
To learn the optimal pricing policy across time periods, the current state s and the price a must be given. The action-value function under policy π is defined as Q_π(s, a):
Q_π(s, a) = Σ_{s'} P(s' | s, a) [r + γ V_π(s')]. (5.2)
Equation (5.2) gives the expected revenue obtained when action a is selected under strategy π in the current state s.
According to equations (5.1)-(5.2), we can obtain the relationship between the state value function and the action-value function: V_π(s) = Σ_a π(a | s) Q_π(s, a).
The Q-learning algorithm is an off-policy learning algorithm. Its basic idea is that the individual learns from another policy even though it has a strategy of its own. This behavior policy can be a previous policy, or some mature policy, such as a human strategy. By observing behaviors based on such strategies, we obtain rewards, and these rewards are used to update the action-value function. The strategy for updating the action-value function thus differs from the strategy for choosing the action. A Q-table is used to store the results: it is a two-dimensional matrix whose dimensions are the actions and the states, and a Q-value in the Q-table is also called a "quality". The greedy strategy is adopted in the update: the action-value function is updated using only the maximum Q-value over all possible next actions. The update formula for Q-learning is:
Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)],
where α denotes the learning rate.
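The standard Q-learning update can be sketched over a dictionary-based Q-table. Using γ = 1 here mirrors the paper's later assumption that future reward is as important as present reward; the names are illustrative:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """One Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict keyed by (state, action); missing entries default to 0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q

Q = {}
q_update(Q, s=0, a=20, r=20, s_next=1, actions=[10, 20, 30], alpha=0.5)
```

After one sale at price 20 from an empty table, the entry Q[(0, 20)] moves halfway toward the observed reward, as the learning rate alpha = 0.5 dictates.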
In particular, the ε-greedy strategy is used to choose the action. The goal of the ε-greedy strategy is to guarantee that each possible action has a non-zero probability of being chosen in each time period. It exploits by choosing the best action in the current state with probability 1 − ε, and explores by choosing other actions with probability ε.
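The ε-greedy choice over the price set can be sketched as follows; the function name and Q-table layout are illustrative:

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng):
    """Choose a price by the epsilon-greedy rule: with probability 1 - eps
    exploit the best-known action; with probability eps explore uniformly,
    so every action keeps a non-zero selection probability."""
    if rng.random() < eps:
        return rng.choice(actions)                         # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit

Q = {(0, 10): 1.0, (0, 20): 5.0, (0, 30): 2.0}
a = epsilon_greedy(Q, 0, [10, 20, 30], eps=0.0, rng=random.Random(0))  # pure exploitation
```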
π(a | s) = 1 − ε + ε/|A|, if a = argmax_{a′} Q(s, a′);
π(a | s) = ε/|A|, otherwise.

Here |A| denotes the number of actions; in our setting, it is the number of prices that can be set.
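The ε-greedy rule above amounts to a few lines of code. The Q-values in this snippet are made-up numbers for illustration; with probability 1 − ε the current best-Q price is exploited, otherwise a price is drawn uniformly (so the greedy price is still chosen with the extra ε/|A| probability, matching the formula).

```python
import random

def epsilon_greedy(Q_row, eps, rng=random):
    """Pick a price given Q_row, a dict mapping price -> Q-value for the current state."""
    if rng.random() < eps:
        return rng.choice(sorted(Q_row))          # explore: uniform over all prices
    return max(Q_row, key=Q_row.get)              # exploit: best-Q price

Q_row = {10: 5.0, 12: 7.5, 14: 6.0}               # illustrative Q-values
greedy_choice = epsilon_greedy(Q_row, eps=0.0)    # eps = 0 -> always exploit
```

With ε = 0 the rule always returns the maximum-Q price (here 12); with ε = 1 it samples uniformly, so every price retains a non-zero probability.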
Proposition 5.1. For any given strategy π, the strategy can be improved when using the ε-greedy strategy to evaluate it.
Proof. Define a* = argmax_{a∈A} Q_π(s, a).

The Q(λ) algorithm is an extension of the Q-learning algorithm. In this algorithm, the concept of the eligibility trace is introduced. The eligibility trace is an additional attribute maintained in each episode, which determines the degree to which the current update applies to each visited state-action pair.
The idea of the eligibility trace is simple. When a state-action pair is selected in an episode, a short-term memory (called a trace) is assigned to it, and the trace decays as time passes. The trace determines the size of each state-action pair's eligibility for learning, and the eligibility traces can accelerate the learning process.
The update process of the eligibility trace is

e(s, a) ← γλ e(s, a) + 1, if (s, a) is the currently visited pair;
e(s, a) ← γλ e(s, a), otherwise.

That is, an element of the eligibility trace matrix increases by 1 only when the state-action pair is the pair visited currently. We assume the future reward is as important as the present reward, so γ = 1. The eligibility traces are updated in two different ways: if a greedy action is executed, that is, the action with the maximum "quality" in the current state is selected, all of the eligibility traces decay by the parameter λ; if an exploratory action is taken, that is, the action is selected randomly, then the eligibility traces are set to zero. The update of the quality values then depends on the values of the eligibility traces.
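The two-way trace update described above can be sketched as follows. The state-action pairs, λ value, and call sequence are illustrative; γ = 1 as in the text, and the traces are stored sparsely in a dict.

```python
# Watkins-style trace update for Q(lambda): decay (or reset) all traces,
# then increment the trace of the currently visited state-action pair.
def update_traces(E, visited, greedy, lam=0.9):
    """E: dict mapping (state, action) -> trace value."""
    if greedy:
        for key in E:            # greedy action: every trace decays by lambda
            E[key] *= lam
    else:
        for key in E:            # exploratory action: all traces are reset
            E[key] = 0.0
    E[visited] = E.get(visited, 0.0) + 1.0   # +1 only for the visited pair
    return E

E = {}
update_traces(E, visited=(0, 10), greedy=True, lam=0.5)
update_traces(E, visited=(1, 12), greedy=True, lam=0.5)
```

After the second greedy step the earlier pair's trace has decayed to 0.5 while the freshly visited pair sits at 1.0; the quality update for each pair would then be weighted by these trace values.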

Numerical test and analysis
In this section, we use numerical examples to verify the effectiveness of the proposed algorithm. The pricing problem is solved in Python 3.6, using Spyder as the development environment.
Suppose that there are 18 seats to be sold in the remaining sale time. The set of prices that can be selected in each time period is P = {10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}. Assume the number of passengers is 15. According to the simulation of passenger behaviors in Section 5.1, we generate the passenger arrival sequences shown in Figure 5, which is obtained from 100 runs. The sale time is divided into 20 periods according to the arrival sequences. From Figure 5, we can see the purchase behaviors of the two passenger types: myopic passengers show a kind of "rush in" behavior in the early stage of the sale period, while strategic passengers show more stable purchase behavior.
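The setting above can be mirrored in a toy simulator. This is a simplified stand-in, not the paper's passenger-behavior model: each period at most one passenger arrives and buys if the posted price is below a uniformly drawn reservation price, which replaces the strategic/myopic choice logic.

```python
import random

# Toy sale-season simulator: 18 seats, the price set P, 20 periods.
PRICES = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
SEATS, PERIODS = 18, 20

def run_episode(policy, rng):
    """Simulate one sale season; policy(t) returns the posted price for period t."""
    seats, revenue = SEATS, 0.0
    for t in range(PERIODS):
        if seats == 0:                         # sold out
            break
        price = policy(t)
        reservation = rng.uniform(10, 30)      # hypothetical valuation draw
        if price <= reservation:               # passenger buys one ticket
            seats -= 1
            revenue += price
    return revenue

rng = random.Random(0)
rev = run_episode(lambda t: PRICES[0], rng)    # always post the lowest price
```

Posting the lowest price sells a ticket every period until the 18 seats run out, so this policy leaves revenue on the table; a learned policy would trade sale probability against price, as the experiments below explore.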
Considering pricing policies in various environments, we set the proportion of strategic passengers and the strategic levels of the two passenger groups. The passenger utility functions and strategies follow past studies [19,39]. The Q(λ) algorithm is used to learn the dynamic pricing strategy in the passenger behavior simulation environment. In the simulation, we suppose the proportions are known, and each of the three parameters is set to 0.5. The arrival intensities of the different passenger types differ. The iterative convergence of the algorithm is shown in Figure 6; clearly, the algorithm converges well. After a certain number of iterations, an effective Q-table is obtained, from which the optimal pricing policy can be derived. Applying this pricing policy in the simulation environment, we obtain the expected revenue. Figure 7 shows the average results of 100 runs. From Figure 7 we can see that, when the high-valuation passengers' arrival intensity is lower than the low-valuation passengers' in the early periods but higher in the later periods, the pricing policy is an increasing policy. This is because the high-valuation passengers have higher arrival intensities in the later periods, and hence the airline chooses to increase the price in the later periods to capture the surplus value.

The proportion impact analysis
In this section, we test the effects of different passenger proportions on the pricing policy. A strategic proportion of 0 means all the passengers are myopic, and a proportion of 1 means all the passengers are strategic. Assume passengers of the same type have the same strategic level: the strategic levels of both groups are fixed at 0.5, the remaining parameters are set to 0, and the proportion parameters are varied. Based on 100 runs, we obtain the results shown in Figures 8-10.
From Figures 8-10, we find that different proportions lead to different pricing policies when the strategic level is fixed. The higher the proportion of strategic passengers is, the smaller the price increase can be. Comparing Figure 9 with Figure 8, the increase is gentler and more stepwise; comparing Figure 10 with Figure 9, the increase is relatively larger and faster. The higher the proportion of high-valuation strategic passengers is, the larger the price increase can be. When a price cut is adopted, a policy with fewer continuous decreases is used. This curbs strategic waiting, since a strategic passenger always calculates the expected future utility before making a decision.

The strategic level impact analysis
Similarly, we change the strategic level to analyze its effects on the pricing policy. With the proportion parameters set to 0.6, 0.4 and 0.4 (and the remaining parameters set to 0), we change the strategic level from 0.4 to 0.8 and obtain the average results of 100 runs, as shown in Figures 11-13.
From Figures 11-13, we find that the strategic level also has an impact on the pricing policy. High-valuation passengers mainly affect the high-price periods, and low-valuation passengers mainly affect the low-price periods. In the high-price periods, the lower the strategic level of the high-valuation passengers is, the larger the slope of the pricing policy is. Accordingly, in the low-price periods, the lower the strategic level of the low-valuation passengers is, the larger the slope is. The strategic level thus affects the degree of the price increase: the higher the level, the smaller the increase. This is because when the strategic level is higher, passengers place greater weight on their valuation of future utility; if the price increases too fast, these strategic passengers choose to leave the market. In addition, when high-strategic-level passengers dominate, the purchase behavior becomes more complex. Waiting or leaving behavior generates additional gains or losses in demand, and hence causes fluctuations in the general price-increase trend, as shown in Figure 13.

Conclusion
This paper studies a dynamic air ticket pricing algorithm for airline RM in a market where strategic and myopic passengers co-exist, wherein the airline can adaptively decide the ticket price by using RL according to the passengers' arrival intensity, valuation distribution and strategic behaviors. We first formulate the dynamic pricing problem as a finite discrete MDP, and then employ Q-learning to solve it. By using RL, we separate the airline's behavior from the passengers' behaviors through a transformation of the airline's utility model: the passengers' behaviors are simulated as the learning environment, and the airline acts as the agent. Through trial-and-error interactions between the environment and the agent, the optimal pricing strategy of the airline can be reached. This approach avoids directly solving a large-scale dynamic programming problem, which is intractable due to the large number of states and the existence of passengers' strategic behaviors.
A series of computations was conducted for various percentages of strategic passengers and high-valuation passengers, and for different strategic levels. The computational results show that the pricing policy depends on the passenger composition, and the airline should be more prudent because of the existence of strategic passengers. The higher the proportion of strategic passengers is, the smaller the price increase can be. Price increases should be adopted step by step, and the size of each increase should be consistent with the percentage of high-valuation (or low-valuation) strategic passengers. The higher the percentage of high-valuation strategic passengers is, the larger the price increase can be. If a price cut is taken, the change should be small. High-valuation passengers mainly affect the high-price (later) periods, and low-valuation passengers mainly affect the low-price (early) periods. The strategic level also plays an important role in how the airline can change the price: the lower the strategic level of high-valuation passengers is, the bigger the price increase that can be made in the later periods, and the lower the strategic level of low-valuation passengers is, the bigger the price increase that can be made in the early periods. These findings are helpful for airlines to set up pricing strategies in different scenarios.

Figure 1.The reservation price distribution.

Figure 2. The state transition of MDP.