THE COMPARISON OF PRICING METHODS IN THE CARBON AUCTION MARKET VIA MULTI-AGENT Q-LEARNING

In this paper, the uniform price and discriminative price methods are compared in the carbon auction market using multi-agent Q-learning. The government and different firms are considered as agents. The government, as auctioneer, allocates initial permits in the carbon auction market, and the firms, as bidders, compete with each other to obtain a larger share of the auction. The carbon trading market, penalty, reserve price, and bidding volume limitation are considered. The simulation analysis demonstrates that bidders behave differently under the two pricing methods for different amounts of carbon permits. Under the uniform price method, the bidding volume, firms' profit, and trading volume for low permit amounts, and the government revenue, clearing price, trading price, and auction efficiency for high permit amounts, are greater than those under the discriminative price method. Bidding prices have a higher dispersion under the uniform price method than under the discriminative price method for different amounts of carbon permits.

There are several types of auctions that are mainly divided into two groups: static (sealed) and dynamic (clock) [28]. In a static auction, two market pricing methods, i.e., uniform pricing and discriminatory pricing, have been developed. In the uniform price method, each winner pays the market clearing price (the price at which the aggregate demand curve intersects the supply curve), while in the discriminatory price method, they pay their bid prices [12,13].
Selecting the right pricing method for a carbon market is still hotly debated. For example, Cong and Wei [11] indicate that when carbon allowances are relatively low, the discriminative price method is more efficient than the uniform price method, while small participants benefit more under the other method. Cong and Wei [12] show that when the number of participants is relatively high and there is no communication between them, the English clock auction is more effective than uniform and discriminatory pricing. But when the number of bidders is relatively low and they can communicate, the discriminatory price auction is better than the two other methods and prevents collusion. Tang et al. [28] demonstrate that the uniform price method has a smaller effect on economic damage and emission reduction than the discriminatory price method.
Many studies, such as Santos et al. [7], Hattori and Takahashi [16], Sugiyarto [27], Akbari-Dibavar et al. [1], and Matthäus [20], have compared these two pricing methods in different areas, but their results are not consistent. Moreover, few studies have compared these methods in the carbon auction market. In this area, various methods have been used to analyze the behavior of participants. For instance, Jiang et al. [19] investigate how market power can influence the auction price using mathematical models and analytical tools such as derivatives. They consider two allocation patterns, mixed allocation and single auction, and examine their effect on compliance costs and welfare. Cong and Wei [12] use an experimental method to compare several carbon auction methods. Dormady [14] conducts experiments to investigate a carbon market and an energy market simultaneously under real-world market characteristics.
Due to the lack of auction data and the nature of its gameplay, individual behavior in this system is complex [12]. Equations cannot model interactions in these systems [5], and experiments are expensive. One of the powerful tools in studying complex systems is the multi-agent-based model [31]. For instance, Cong and Wei [11] use this tool to compare some carbon auction methods. They consider the government as the auctioneer and two types of plants as the bidders. Tang et al. [28] design a carbon allowance auction market using the multi-agent-based model. They consider two agents: the government as the regulator of emission trading scheme and different firms in all parts of China. Yu et al. [21] propose a multi-agent-based model to simulate an emission trading scheme consisting of some firms in China.
Our work extends the research conducted by Cong and Wei [11]. Their study was limited to the carbon auction market, whereas the carbon trading market and the carbon trading price can affect the carbon auction and vice versa. They used the Roth-Erev reinforcement learning algorithm to adjust the bidding strategies. Their algorithm derives a bidding strategy from past results, while studies show that agents become more profitable if they can anticipate the long-term outcomes of their current bidding strategy rather than optimizing their immediate rewards [29]. In addition, because their study did not use a reserve price, each agent could submit a bid price close to zero, so the permits might have been sold below their value. The reserve price is the minimum price that the government expects to be paid for a permit unit in the carbon auction. It can guarantee that the permits are not purchased below their real value [9]. This paper compares the uniform pricing method and the discriminative pricing method using a multi-agent-based model in which each adaptive agent represents a firm that takes part in the carbon auction in a cap-and-trade scheme and determines its bid based on Q-learning. Watkins proposed the Q-learning algorithm to solve Markov decision problems with incomplete information. This algorithm has some features that make it appropriate for repeated games against unknown opponents. First, it does not need a model of the environment. Second, it can be used online to discover the optimal strategy from the experience gained by interacting directly with the environment [29,32]. Third, an agent can predict the long-term outcomes of its actions and the actions of other agents, and can therefore correctly model the other agents and achieve the optimal bidding strategy [29].
This algorithm has received increasing attention in the electricity auction market and has become a major tool for solving this problem [23,24,32,33]. As mentioned, the carbon permit auction market is a complex economic system owing to a lack of data. In this system, each agent's behavior depends strongly on the behavior of the other agents and on market conditions. Besides, each agent lacks knowledge about its competitors. Under such circumstances, building a model of the economic system is difficult, and model-free algorithms such as Q-learning are well suited.
The next section formulates the presented multi-agent-based model. Section 3 describes the Q-Learning algorithm and proposes agents' bidding strategies according to it. Section 4 presents the experimental results. Section 5 concludes the paper and offers direction for future research.

Problem definition
In this section, we present our multi-agent-based model to compare two pricing methods, namely the uniform pricing method and the discriminative pricing method, in a carbon auction market. We consider two kinds of agents: the government and the firms. The government, as auctioneer, allocates the initial carbon permits to the bidders and determines the actual price (the price that each firm should pay). It also regulates the carbon trading price to prevent the exercise of market power and price manipulation. The firms, as bidders, simultaneously submit demand schedules (their prices and the quantities that they are willing to buy at those prices) to the auctioneer. The government determines the clearing price, the carbon permits gained by each firm, and the actual carbon price by forming the aggregate demand and supply curves. The government adds up the demand schedules to build the aggregate demand curve. The price at which the aggregate demand curve and the supply curve intersect is the clearing price. Demands above the clearing price are met in full, those at the clearing price are rationed, and those below the clearing price are rejected. By comparing the carbon permits required during production with the carbon permits gained in the auction, each firm determines its supply or demand in the carbon trading market. Given the amount of supply and demand in this market, the government determines the carbon trading price. If the required carbon permits exceed the carbon permits gained from the two markets, the firm is fined. Table 1 defines the main parameters used throughout this paper.

Agents' decisions
As mentioned in the previous section, the government, as auctioneer, first determines the initial carbon permit TotalPermit. Suppose the minimum carbon permit required for all firms to support their productions is Totalemission. To decrease total emissions, the government implements a reduction policy and controls the initial carbon permit; it therefore supplies only a percentage φ of Totalemission. Consequently, TotalPermit is calculated based on equation (2.1):
$$\mathit{TotalPermit} = \mathit{Totalemission} \times \varphi \tag{2.1}$$
The firms are the bidders in our model. Each firm i needs $e_i$ permits to produce one unit of product. Suppose the production cost and selling price per unit product for firm i are $c_i$ and $p_i$, respectively. Therefore, the value of a permit to firm i can be calculated as equation (2.2):
$$v_i = \frac{p_i - c_i}{e_i} \tag{2.2}$$
In the equation above, $v_i$ is the private value (the maximum price that each firm is willing to pay for a permit unit).
In the carbon auction market, each firm submits a bid price $bp_i$ that lies between the reserve price and $v_i$, and a bid volume $bv_i$ (the carbon permits required to support its production). The government ranks firms in descending order of their bid prices to form the aggregate demand curve. With firms indexed in that ranked order, the equilibrium price is equal to the bid price of the firm k that satisfies the following inequalities:
$$\sum_{i=1}^{k-1} bv_i < \mathit{TotalPermit} \tag{2.3}$$
$$\sum_{i=1}^{k} bv_i \ge \mathit{TotalPermit} \tag{2.4}$$

Therefore, the equilibrium price ep is equal to $bp_k$, i.e., $ep = bp_k$.
Table 1. Main parameters.
TotalPermit: The initial carbon permit
Totalemission: The minimum carbon permit required for all firms to support their productions
φ: The percentage of carbon permit supply
$e_i$: The required carbon permit for producing one unit product by firm i
$c_i$: Production cost per unit product for firm i
$p_i$: Selling price per unit product for firm i
$v_i$: Private value for firm i
$bp_i$: The bid price for firm i
$bv_i$: Bid volume for firm i
ep: The equilibrium price in the carbon auction market
$gv_i$: Carbon permit gained in auction for firm i
cp: Market clearing price
$ap_i$: The price paid for each carbon permit unit by firm i
$rv_i$: The total carbon permit required by firm i
$nv_i$: Surplus or insufficient carbon permits of firm i
PermitSupply: Total permit supply in the carbon trading market
PermitDemand: Total permit demand in the carbon trading market
$nv_i$: The actual carbon permit traded by firm i in the carbon trading market
ctp: Carbon trading price
$Srevenue_i$: Sales revenue for firm i
$Ccost_i$: The costs of carbon permits in the auction and carbon trading market for firm i
$Pun_i$: The penalty paid by firm i for exceeding its emissions from the gained permit
$Q_i$: The product sales amount by firm i
$Temission_i$: Total emissions generated by firm i
$Gpermit_i$: Total permits obtained by firm i in the carbon auction and carbon trading market
Algorithm parameters
$S = \{s_1, s_2, \ldots, s_n\}$: The set of the environment's states
$A = \{a_1, a_2, \ldots, a_m\}$: The set of the agent's actions
$Q(s, a)$: Q value for each permissible pair (s, a)
$\pi^*(s)$: The optimal policy
$p_{ss'}(a)$: Probability of changing the environment's state from s to s' by selecting action a
$R^a_{ss'}$: The immediate reward of the agent for taking action a in state s and changing the environment's state to s'
$T = \{0, 1, \ldots\}$: The steps that the algorithm repeats
$s_t$: The environment's state in step t
$a_t$: The agent's action in step t
$r_t$: The immediate reward obtained by the agent in step t
γ: Discount factor
α: Learning rate
ε: A small probability
The carbon permits gained by firm i ($gv_i$) in the carbon auction market under the two pricing methods are calculated according to equation (2.5). In the uniform price method, every winner pays the market clearing price cp, which is equal to the equilibrium price ep. Therefore, the actual carbon price $ap_i$ (the price that firm i should pay) in the uniform pricing method is calculated according to equation (2.6):
$$ap_i = cp = ep \tag{2.6}$$
In the discriminative price method, the winners pay their bid prices. So, the actual carbon price $ap_i$ under the discriminative pricing rule is calculated according to equation (2.7):
$$ap_i = bp_i \tag{2.7}$$
The clearing price cp in the discriminative price method is calculated based on equation (2.8).
After the firms obtain the initial carbon permits in the carbon auction market, they can trade their carbon permits in the carbon trading market. If the carbon permits obtained by firm i in the carbon auction, $gv_i$, are greater than its required carbon permits $rv_i$, the firm has surplus carbon permits, i.e., $nv_i = rv_i - gv_i < 0$, and can sell them in the carbon trading market. Similarly, if $gv_i$ is less than $rv_i$, the firm has insufficient permits, i.e., $nv_i = rv_i - gv_i \ge 0$, so it should purchase the required carbon permits in the carbon trading market. The total permit supply and total permit demand in the carbon trading market, denoted by PermitSupply and PermitDemand, respectively, are calculated according to equations (2.9) and (2.10). When the total permit supply is at least the total permit demand, i.e., PermitSupply ≥ PermitDemand, the carbon trading price decreases and the actual permit traded by firm i, $nv_i$, is obtained by equation (2.11). Otherwise, the carbon trading price increases and $nv_i$ is calculated according to equation (2.12).
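The clearing and payment rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation. It assumes that the marginal bidder receives whatever supply remains (one simple rationing rule), that ties in bid price are broken by submission order, and that the equilibrium price is the lowest bid that receives any permits. The names `Bid` and `clear_auction` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    firm: int       # firm index i
    price: float    # bid price bp_i
    volume: float   # bid volume bv_i


def clear_auction(bids, total_permit, rule="uniform"):
    """Return (ep, gained, paid) for one sealed-bid auction round.

    gained maps firm -> gv_i; paid maps each winning firm -> ap_i.
    """
    # Rank bids in descending order of price (aggregate demand curve).
    ranked = sorted(bids, key=lambda b: b.price, reverse=True)
    remaining = total_permit
    gained = {}
    ep = None
    for b in ranked:
        # Assumption: the marginal bidder is rationed to the leftover supply.
        gv = min(b.volume, remaining)
        gained[b.firm] = gv
        remaining -= gv
        if gv > 0:
            ep = b.price  # lowest accepted bid so far
    paid = {}
    for b in ranked:
        if gained[b.firm] > 0:
            # Uniform: every winner pays ep. Discriminative: pay own bid.
            paid[b.firm] = ep if rule == "uniform" else b.price
    return ep, gained, paid
```

With three bids of 40 permits each at prices 3.0, 2.0, and 1.0 and a supply of 60, the second firm is the marginal bidder, so both winners pay 2.0 under the uniform rule, while the first firm pays 3.0 under the discriminative rule.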

Determining carbon price in the carbon trading market
The carbon trading price has a significant effect on the carbon auction. If the carbon trading price is high, the bidders might attempt to buy more permits in the auction to sell them in the carbon trading market at a higher price. If the carbon trading price is low, the competition between the bidders weakens, and consequently, the clearing price decreases. To maintain stability and avoid manipulation in the carbon market, the government sets the carbon trading price ctp around the market clearing price, within a range given by β ∼ U(0, 0.3).
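One way to read this rule is sketched below in Python. The text does not give the exact formula, so the multiplicative form used here is an assumption: the price is moved below the clearing price by a factor β when supply is at least demand, and above it otherwise. The function name `carbon_trading_price` and its parameters are hypothetical.

```python
import random


def carbon_trading_price(cp, permit_supply, permit_demand, rng=random):
    """Draw a trading price near the clearing price cp.

    Assumption: ctp = cp * (1 - beta) when supply >= demand (price falls),
    and ctp = cp * (1 + beta) otherwise (price rises), with beta ~ U(0, 0.3).
    """
    beta = rng.uniform(0.0, 0.3)
    if permit_supply >= permit_demand:
        return cp * (1.0 - beta)
    return cp * (1.0 + beta)
```

Under this assumption the trading price never moves more than 30% away from the clearing price, which matches the stated aim of keeping the secondary market stable.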

Agents' bidding strategy
In this section, we present a profitable bidding strategy based on the Q-learning algorithm. First, we describe the Q-learning algorithm and then present the proposed bidding strategy.

Q-learning algorithm
Suppose an agent interacts with its environment at discrete time steps t = 0, 1, 2, . . ., where $S = \{s_1, s_2, \ldots, s_n\}$ is a finite set of states that the environment can adopt and $A = \{a_1, a_2, \ldots, a_m\}$ is a finite set of actions that the agent can select. At each time step t, the agent observes the present state of the environment $s_t = s \in S$ and then takes an action $a_t = a \in A$. Consequently, the agent receives an immediate reward $r_{t+1}$, and the environment moves to a new state $s_{t+1} = s' \in S$ according to the transition probability $p_{ss'}(a)$.
In the Q-learning algorithm, there is a lookup table containing a Q value for each permissible pair (s, a). First, the table is initialized either randomly or according to the agent's knowledge. The purpose of the agent is to find the optimal policy $\pi^*(s) \in A$ that maximizes the Q value of each state over the long run by using the Bellman optimality equation:
$$Q^*(s, a) = \sum_{s' \in S} p_{ss'}(a) \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right] \tag{3.1}$$
In equation (3.1), γ (0 ≤ γ ≤ 1) is the discount factor and determines how important future rewards are for the agents. $R^a_{ss'}$ is the immediate reward that the agent receives for taking action a in state s and changing the environment's state from s to s'. Without knowing $p_{ss'}(a)$, the Q-learning algorithm is capable of finding the optimal policy for each state by estimating Q(s, a) online in a recursive manner from the data $s_t$, $a_t$, $s_{t+1}$ and $r_{t+1}$. The updating equation is as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
where α (0 < α < 1) is the learning rate and represents the degree to which new data affect the update of the estimated Q values. In other words, α represents how much weight the agents give to recent information. By the Q-learning convergence theorem, the Q value converges to the optimal value if each pair (s, a) is visited infinitely often and α decreases appropriately. In the single-agent case, the environment is stationary and Markovian, but our problem is a multi-agent case in which an agent's optimal policy depends on the other agents' strategies. Therefore, each agent creates a non-stationary and non-Markovian environment for the other agents, and under these conditions convergence to the optimal policy is not guaranteed. Some progress has been made in this area [17,18,26], but the results cover only special cases and seem inappropriate for practical problems. Despite this, among reinforcement learning algorithms, Q-learning is applied in many studies, mainly because of its simplicity.
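The tabular update described above is short enough to show directly. The following Python sketch applies one standard Q-learning update to a lookup table stored as a list of rows, one row per state. The function name `q_update` is hypothetical, and the default values of `alpha` and `gamma` are the ones reported later in the parameter setting.

```python
def q_update(q, s, a, reward, s_next, alpha=0.9, gamma=0.1):
    """Apply one Q-learning update in place and return the new Q(s, a).

    q is a list of lists: q[state][action] holds the current estimate.
    """
    # Best value attainable from the next state under current estimates.
    best_next = max(q[s_next])
    # Move Q(s, a) toward the bootstrapped target by a step of size alpha.
    q[s][a] += alpha * (reward + gamma * best_next - q[s][a])
    return q[s][a]
```

For example, starting from Q(s, a) = 0 with a reward of 1, a best next-state value of 2, and α = γ = 0.5, the update gives 0 + 0.5 × (1 + 0.5 × 2 − 0) = 1.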

Q-learning based bidding strategy
In auctioning carbon permits, firms try to determine their bid prices and bid volumes to maximize their profits. Based on the Q-learning algorithm, every firm learns from its bidding experience during runs to bid profitably. To determine the profitable bidding strategy for firms using the Q-learning algorithm, the states of the environment, actions, and rewards must first be specified.
The market clearing price is considered as the state of the environment. It ranges between the reserve price and the maximum price that the firms can afford to pay per unit of permit. To avoid "the curse of dimensionality", the state space is equally discretized into 16 states.
Deciding on the bidding price and volume is the action of each agent. The bidding price for firm i lies between the reserve price and $v_i$, and the bidding volume lies between the minimum permits that the firm requires to support its production and the maximum permits that the firm is allowed to bid. The price space and volume space are equally discretized into 6 prices and 16 volumes, respectively. Therefore, the action space contains 6 × 16 = 96 actions.
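The discretization can be illustrated with a short Python sketch. The paper does not specify how the grid points are placed, so two assumptions are made here: states are equal-width bins over the clearing-price range, and bid prices and volumes are evenly spaced grid points that include both endpoints. The names `linspace`, `state_index`, and `build_actions` are hypothetical.

```python
def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi, endpoints included."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def state_index(cp, reserve_price, max_price, n_states=16):
    """Map a clearing price to one of n_states equal-width bins."""
    if max_price <= reserve_price:
        return 0
    frac = (cp - reserve_price) / (max_price - reserve_price)
    # Clamp so that cp == max_price falls in the last bin.
    return min(n_states - 1, max(0, int(frac * n_states)))


def build_actions(reserve_price, v_i, min_volume, max_volume,
                  n_prices=6, n_volumes=16):
    """Enumerate all (bid price, bid volume) pairs available to a firm."""
    prices = linspace(reserve_price, v_i, n_prices)
    volumes = linspace(min_volume, max_volume, n_volumes)
    return [(bp, bv) for bp in prices for bv in volumes]
```

Each firm's Q table then has 16 rows (states) and 96 columns (actions), which keeps the lookup table small enough to be explored within a modest number of auction rounds.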
The reward of firm i for participating in the carbon permit auction is its benefit function, i.e., its sales revenue less its permit costs and its penalty, calculated according to equation (3.5). Here $Srevenue_i = (p_i - c_i) \times Q_i$ is the sales revenue, where $c_i$, $p_i$ and $Q_i$ denote the production cost per unit product, the selling price per unit product, and the sales amount of firm i, respectively. $Ccost_i$ denotes the costs of carbon permits in the auction and the carbon trading market and is calculated according to equation (3.6). $Pun_i$ denotes the penalty paid by firm i. If the total emissions generated by firm i ($Temission_i$) exceed the total permits obtained from the carbon auction and the carbon trading market ($Gpermit_i = gv_i + nv_i$), the firm pays the penalty for its non-compliant emissions.
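A minimal Python sketch of this reward is given below. The forms of the cost and penalty terms are assumptions, since their equations are not reproduced here: permit cost is taken as the auction payment plus the trading-market payment (a negative traded volume means the firm sells and earns revenue), and the penalty is taken as a constant rate per unit of uncovered emissions. The function name `firm_reward` and the parameter `penalty_rate` are hypothetical.

```python
def firm_reward(p, c, q_sold, ap, gv, ctp, nv_traded, emissions, penalty_rate):
    """Benefit of one firm: sales revenue minus permit costs minus penalty."""
    # Sales revenue: Srevenue_i = (p_i - c_i) * Q_i
    s_revenue = (p - c) * q_sold
    # Assumption: Ccost_i = ap_i * gv_i + ctp * nv_i
    c_cost = ap * gv + ctp * nv_traded
    # Total permits held after both markets: Gpermit_i = gv_i + nv_i
    g_permit = gv + nv_traded
    # Assumption: linear penalty on emissions not covered by permits.
    shortfall = max(0.0, emissions - g_permit)
    penalty = penalty_rate * shortfall
    return s_revenue - c_cost - penalty
```

As an example, with a unit margin of 4 on 100 units sold, 80 permits bought at 2.0 in the auction, 10 more bought at 2.5 in the trading market, 100 units of emissions, and a penalty rate of 5, the reward is 400 − 185 − 50 = 165.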

Algorithm implementation
According to Figure 1, the steps of firms' learning and bidding are as follows:
(1) Initialization: first, the input parameters of the algorithm are initialized. Small random numbers or 0 are assigned to all state-action combinations for each firm. Let Maxiter be the maximum number of iterations for which the algorithm runs. It is the termination condition; the following steps are repeated until it is reached.
(2) State identification: in each iteration, the agents use the market clearing price of the previous step as the current state. In the first iteration, the reserve price (the lowest possible value) is taken as the environment's state.
(3) Action selection: after identifying the environment's state, the agents select their actions (bidding decisions) according to the Q-values saved in the Q-value lookup tables for each state-action pair. To choose an action, the agents use the ε-greedy method to balance exploitation and exploration. According to this method, the agents select the action with the maximum Q-value in state s with high probability 1 − ε and a random action from all admissible actions with a small probability ε.
(4) Q-value update: after the government declares the market clearing price and the carbon permits allocated to each firm, the firms calculate their rewards according to equation (3.5) and update the Q-values according to the current rewards and the next state, which is the market clearing price of the current iteration, using equation (
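The four steps above can be sketched as a single learning loop for one firm. This Python sketch is an illustration under stated assumptions, not the authors' C++ code: the market is abstracted into a hypothetical callback `step(state, action)` that returns the reward and the next state index, Q-values start at 0, and state 0 stands for the reserve-price bin used in the first iteration. The names `epsilon_greedy` and `run_learning` are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))


def run_learning(step, n_states, n_actions, max_iter,
                 alpha=0.9, gamma=0.1, epsilon=0.5, seed=0):
    """Run tabular Q-learning for one agent and return its Q table."""
    rng = random.Random(seed)
    # (1) Initialization: all Q-values set to 0.
    q = [[0.0] * n_actions for _ in range(n_states)]
    # (2) First state: the bin that contains the reserve price.
    state = 0
    for _ in range(max_iter):
        # (3) Action selection with the epsilon-greedy rule.
        action = epsilon_greedy(q[state], epsilon, rng)
        # The market (hypothetical callback) returns reward and next state.
        reward, next_state = step(state, action)
        # (4) Q-value update toward reward plus discounted best next value.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q
```

In a full simulation, `step` would clear the auction with every firm's chosen bid, compute the trading price and each firm's reward, and return the bin of the new clearing price.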

Experimental analysis
In this section, based on the multi-agent Q-learning, the uniform price auction and discriminative auction are compared. Therefore, the algorithm parameters, data, and the results of the simulation are presented in the next two sub-sections.

Parameter setting
To compare the two auction formats, we suppose that five adaptive agents compete in the carbon auction market, trade their permits in the carbon trading market, and explore their bidding strategies. The parameters of the agents are randomly generated from the intervals shown in Table 2.
As can be seen from Table 2, the lower and upper bounds of each parameter are close to each other; therefore, the market is perfectly competitive, and no agent possesses market power. The Q-learning parameters α, γ and ε are set to 0.9, 0.1 and 0.5, respectively. These values play an essential role in the exploration and convergence of the Q-learning algorithm. Therefore, they are selected based on previous studies in this context, such as Sadr et al. [25] and Poursalimi Jaghargh and Mashhadi [22], so that a balance between exploration and exploitation is obtained. To ensure that permits are not sold below their value, we set the reserve price at 0.5.

Experiment results
In this section, we compare the uniform price and discriminative price methods in terms of bidding price, bidding volume, firms' profit, government revenue, clearing price, carbon trading price, the total amount of permits traded in the carbon trading market, the firms' benefit reduction in the carbon auction relative to free allocation, and emission reduction. Finally, we compare the performance of the Q-learning algorithm with the Roth-Erev algorithm used in Cong and Wei [11] in terms of firms' benefit. The algorithm was coded in C++ and compiled with Microsoft Visual Studio 2012.
The φ value can influence the results; therefore, we compare the two pricing methods for different φ values, varying φ from 80% to 100% to cover low to high values. According to equation (2.1), when the value of φ is small, the carbon supply quantity is small, and vice versa. Figures 2 and 3 give the equilibrium price and the bid prices of all agents under the uniform and discriminative pricing rules, respectively. As the two figures show, the agents' bid prices are much higher than the equilibrium price in the uniform price auction, while in the discriminative price auction they bid as close to the equilibrium price as possible. That is because, in the discriminative price auction, every winner pays its bid. Hence, the bidders attempt to predict the equilibrium price and bid close to it. In the uniform price auction, since every winner pays the equilibrium price, forecasting the equilibrium is less important [13]. As a result, the bid prices in the uniform auction have a higher dispersion [2,3,30]. For example, in our experiment, the average variances of bid prices under the uniform and discriminative price methods are 0.423 and 0.05 price units, respectively. As shown in the two figures, the bid price of agent 2 is almost equal to the equilibrium price for all values of φ.

Impact of auction form on bidding price
All firms bid below their private values in both methods, as can be seen in Figure 4. Note that as φ increases, the bid prices decrease. This is because φ affects the supply quantity. When the supply quantity is low, the competition between firms increases; therefore, they bid higher prices, close to their private values. But when the supply quantity increases, the competition decreases, so the firms bid as close to the reserve price as possible. The bid prices fall further in the discriminative price method, where they are almost equal to 0.5 (the reserve price) for φ = 90% to 100% for all firms. As previously mentioned, the bid price of firm 2 is equal to the equilibrium price for φ = 80% to 98%: because this firm wins only part of its bid, bidding a higher price is not economically worthwhile. From φ = 98% to 100%, firm 2 raises its price in the uniform price method because the other firms lower their bid prices, which allows firm 2 to bid higher and win more than before.

Impact of auction form on bidding volume
As shown in Figure 5, the bid volumes fluctuate around a fixed value in the uniform price method because, in this method, the firms try to influence the equilibrium price and keep it down. In the discriminative pricing method, by contrast, bid volumes increase with φ. There are two reasons for this: first, in this method, firms pay their bids; second, as φ increases, the bid prices decrease (Figure 4), so the firms can increase their bid volumes to raise their benefits. Therefore, for high values of φ, bid shading in the discriminative price method is less than in the uniform price method.

Impact of auction form on firms' profit
The benefit of firms is influenced by φ, as shown in Figure 6. For φ less than 90%, the benefit of firms under uniform pricing is greater than under discriminative pricing, because the firms pay the equilibrium price in the uniform auction. But for φ greater than 90%, the benefit of firms under discriminative pricing is equal to or greater than that under uniform pricing because, as shown in Figure 4, the bid prices are very low and almost equal to the equilibrium price for φ greater than 90%, so the benefit of firms increases. It can be concluded that when supply is scarce, the uniform price auction is more beneficial to bidders than the discriminative price auction, and vice versa.
Figure 7 shows that when carbon permits are scarce, the government's revenue under the discriminative price method is greater than under the other method. Therefore, when the carbon permit supply is large, the government should use the uniform price method, and when it is scarce, the government should use the discriminative price method [11].
Figure 8 shows that the supply quantity can influence the clearing price. In the uniform price method, the clearing price is almost stable. But when the value of φ is less than 90%, the clearing price in the discriminative price method is greater than that in the uniform price method, and for φ larger than 90%, it is lower. This is because, as φ increases in the discriminative price method, the bid prices decrease and fall near the reserve price, so the clearing price falls. In the uniform price method, the clearing price is equal to the equilibrium price, and as Figure 2 shows, the equilibrium price is almost stable.

Impact of auction form on the carbon trading price
As mentioned before, the government determines the carbon trading price using the clearing price. Therefore, the relationship between the carbon trading price and supply quantity in the two pricing methods is similar to the relationship between the clearing price and supply quantity, as shown in Figure 9.

Impact of auction form on the carbon permit traded in the carbon trading market
The volume of carbon permits traded in the carbon trading market differs between the two pricing methods. As is evident from Figure 10, as the supply quantity increases, the trading volume increases in the discriminative price method, while it is almost stable in the uniform price method. As mentioned before, the bidders in the uniform price method keep their bid volumes as low as possible to influence the equilibrium price, but in the discriminative price method, as the supply quantity increases, the bidders increase their bid volumes to resell permits at the secondary-market price and raise their revenues.

Impact of auction form on the auction efficiency
In this section, we consider the auction efficiency as a performance measure. Following [12], it is calculated from two quantities: $EOG_i$, the actual earning of the government from the auction and penalties, and $TBF_i$, the total benefit of the firms, where i represents the type of auction. As shown in Figure 11, we cannot say definitively that one method is more efficient than the other; it depends on the supply quantity. When the supply quantity is low, the discriminative pricing method is more efficient than the uniform pricing method, but when the supply quantity is high, the uniform pricing method is more efficient.

Impact of auction form on firms' benefit and emission reduction
In this section, we compare the two pricing methods in terms of emission reduction and the reduction in firms' benefit relative to the situation in which initial permits are allocated for free. As shown in Figure 12, the values of firms' benefit reduction are positive for both methods; in other words, the carbon auction decreases the firms' benefit. When φ is less than 90%, the benefit reduction in the discriminative price method is larger than in the uniform price method, and vice versa. This is because, in the uniform price method, the firms earn more benefit than in the other method for φ less than 90% (see Fig. 6).
As is evident from Figure 12, lowering the value of φ results in greater benefit reduction and emission reduction. The best environmental performance is obtained at φ = 80%, i.e., when emission reduction is 20%. At this point, the largest reduction in firms' benefit occurs. The smallest values of firms' benefit reduction in the uniform and discriminative price methods occur at φ = 96% and φ = 100%, respectively. In other words, at these points, the best economic performance is obtained for each pricing method. The results of this section can help policymakers determine the value of φ that balances economic and environmental goals.

The comparison of Q-learning algorithm with Roth-Erev algorithm
As mentioned in Section 1, the Roth-Erev algorithm used by Cong and Wei [11] determines the bidding strategy from past information only, whereas agents become more profitable if they can predict the long-term results of their current decisions. As stated, the Q-learning algorithm can address this problem. In this section, we examine this claim by testing the Roth-Erev algorithm and comparing its results with those obtained from the Q-learning algorithm under the uniform and discriminative price methods for all values of φ. As shown in Figure 13, the average firms' profit under the Q-learning algorithm is higher than that under the Roth-Erev algorithm for all φ values and both pricing methods.
Figure 13. Comparison of average firms' profits for the two algorithms in the two pricing methods.

Conclusions
In this paper, we present a multi-agent-based model to compare the uniform price and the discriminative price methods in the carbon auction. Agents represent the firms participating in the auction, and they develop their bids using the Q-learning algorithm. Our findings show that the agents' bidding behaviors differ between the two approaches. We cannot say which method is more advantageous, but the results are remarkably close to the theoretical predictions. Our main conclusions are as follows:
(1) The bid prices in the uniform price method have higher dispersion, whereas in the discriminative price method they are as close to the equilibrium price as possible. Our results also show that bid prices in the uniform price method are almost always greater than those in the other method.
(2) In the uniform price method, the bidders bid below their maximum volume to lower the equilibrium price, but in the discriminative price method, as the supply quantity increases, the bidders bid their maximum volume.
(3) When the supply quantity is low, the government gets more revenue under the discriminative price method, and the firms earn more revenue under the uniform price method. When the supply quantity is large, the advisable method for the government is the uniform price method, and for the firms it is the other method.
(4) For low supply quantity, the clearing price, and consequently the carbon trading price, in the discriminative price method is much higher than in the uniform price method, but when supply is abundant, the clearing price and carbon trading price are almost equal in the two methods.
(5) The amount of carbon permits traded in the carbon trading market differs between the two pricing methods. Because of bid shading, it is more stable in the uniform price method than in the discriminative price method. Therefore, when the supply quantity in the auction is scarce, the tradable carbon permits are low for both methods, but as the supply quantity increases, they increase in the discriminative price method and remain almost unchanged in the other method.
(6) The auction efficiency of the two pricing methods depends on the supply quantity. When the supply quantity is low, the discriminative price method is more efficient; when it is high, the uniform price method is more efficient.
(7) The amounts of emission reduction and firms' benefit reduction, relative to the situation in which carbon permits are allocated for free, depend on the carbon supply quantity. A decrease in carbon supply quantity results in a further increase in these values. When the carbon supply quantity is low, the firms' benefit reduction in the discriminative price method is greater than in the uniform price method, and vice versa. Therefore, if the government wants to decrease emissions as much as possible without harming firms economically, the carbon permits should be auctioned using the uniform price method.
Finally, we compare the Q-learning algorithm with the Roth-Erev algorithm. The results demonstrate our method outperforms the Roth-Erev algorithm, and the agents obtain more benefits by using this method.
In summary, the performance of the two pricing methods depends on the amount of carbon permits allocated by the government in the auction market. Incorporating abatement activities and their costs into the presented model is a direction for future research.