Preparing Data for a Mere Simulation is Harder Than You’d Think

Balancer Simulations is an open-source cadCAD model of a Balancer V1 AMM smart contract. This is useful for exploring “what if” questions involving an AMM pool e.g. “as an arbitrageur, how could I have made the most profit with this particular AMM pool (given historical data)?” or “how high should I set the pool fees to provide a decent ROI to liquidity providers?”. As a fundamental DeFi composing block, AMMs will be increasingly used to encourage certain behaviour – therefore it is important to find the correct parameters.


The simulation is coded to be identical to the original smart contract deployed on Ethereum, so that one can “play” with it and derive some insights. But we still needed a way to feed in user actions to the simulation, reconstructed from what actually happened on Ethereum.

The simulation doesn’t know about transactions or gas fees. Instead, it feeds on user Actions – a user creates the pool; another user joins (deposits tokens into) the pool, providing liquidity; another user swaps token A for token B, paying a fee to the pool for doing so; another user exits, taking out all the tokens he deposited in the pool + a proportion of the pool’s collected fees. These “events”, or what we called “Actions” in the data pulling script, had to be reconstructed from Ethereum log events.

Getting Ethereum log Events

It’s not so easy to get historical data from Ethereum. You need access to an archive node, and an archive node takes at least 4TB of flash storage and a week or two to sync. That’s assuming everything goes well. Plus, when you finally get data out of it, it’s not organized.

Thankfully some people have been putting historical Balancer pool events on a traditional SQL database, which can be queried easily:

select * from blockchain-etl.ethereum_balancer.BFactory_event_LOG_NEW_POOL where pool="0x..."give me all LOG_NEW_POOL events for the pool 0x….. (since a pool is only created once, there can only be one row from this SQL query)
select * from blockchain-etl.ethereum_balancer.BPool_event_LOG_JOIN where contract_address="0x..." order by block_numbergive me all the events where a user added token liquidity to the pool, and sort them by the block in which it happened
select * from blockchain-etl.ethereum_balancer.BPool_event_LOG_SWAP where contract_address="0x..." order by block_numbergive me all the events where a user swapped a token out for another using this pool 0x…, sorted by the block in which it happened

We do this for LOG_NEW_POOL, LOG_SWAP, LOG_JOIN, LOG_EXIT, event_Transfer, fee changes and weight changes (which were too complex to fit into one line of SQL). And afterwards we smush them all together into one sorted list of Events.

“Why didn’t you just get all the Events sorted by block number, you dummy?”
It just wasn’t possible the way the SQL data was organized. Besides, fee changes and weight changes had to be derived, they weren’t actual log Events.

Great, so we don’t need an Ethereum archive node after all, right? Wrong – there is a special anonymous Event emitted in addition to LOG_JOIN, LOG_SWAP, LOG_EXIT that has important information that can affect the simulation accuracy.

Because Ethereum transactions may or may not go through, and tokens in the pool might change in the meantime, you might not get back exactly the amount of tokens you were expecting.

For example, if you swap 1 TOKEN-A for 500 TOKEN-B, you might get 499 or 501 TOKEN-B. Fortunately there are variants of JOIN/SWAP/EXIT methods which let the user decide if he wants to spend exactly this much TOKEN-A, or if getting back exactly 500 TOKEN-B is more important to him.

Unfortunately, this important information was not included in the SQL database, so we needed an Ethereum archive node after all, and fellow developer Raul spent at least 2 nights deciphering this important information from the anonymous Event.

Put all the Events together into a list and group by txhash

events = []
events.extend(turn_events_into_actions(new_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(join_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(swap_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(exit_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(transfer_events, fees_dict, denorms_results))

events_grouped_by_txhash = {}
for i, action in enumerate(events):
    tx_hash = events[i].tx_hash
    if events_grouped_by_txhash.get(tx_hash) is None:
        events_grouped_by_txhash[tx_hash] = []
# save_pickle(events_grouped_by_txhash, f'{args.pool_address}/events_grouped_by_txhash.pickle')
# events_grouped_by_txhash = load_pickle(f'{args.pool_address}/events_grouped_by_txhash.pickle')


Multiple log Events could actually have been emitted by a single Ethereum transaction. So now, given NEW, JOIN, SWAP, EXIT, and Transfer (pool shares, not the tokens) Events, we want to reconstruct the transactions that were relevant to this particular pool.

The above code simply smushes the events together into a long, unsorted list, and groups them by txhash.

Dirty detail: It says turn_events_into_actions() and it even uses the Action class in data/, but actually they are not yet real Actions, they are still individual Events. That’s because when I wrote the code, I intended to make them Actions, but many other problems came up and I quickly forgot my original intention.

Exception: irregularities in the data caused by 1inch aggregated swaps

We were getting token swaps that didn’t make sense. Which BAL got turned into WBTC, and which into WETH? It is not clear.

"action": {
    "type": "swap",
    "tokens_in": [
            "amount": "447.23532971026",
            "symbol": "BAL"
            "amount": "157.26956649152",
            "symbol": "BAL"
    "tokens_out": [
            "amount": "7279711",
            "symbol": "WBTC"
            "amount": "6.450635831831913964",
            "symbol": "WETH"

As it turns out, this is the work of smart hacks working at to save on gas fees and aggregate swaps into a single transaction. So we had to modify our data parsing script to recognize transactions like these and emit two Actions instead of one.

Turn Events into simulation-relevant Actions

# Remove pool share transfers
grouped_events = list(filter(lambda acts: not (len(acts) == 1 and acts[0].action_type == 'transfer'), grouped_events))

actions = stage3_merge_actions(args.pool_address, grouped_events)

# save_pickle(actions, f"{args.pool_address}/actions.pickle")
# actions = load_pickle(f"{args.pool_address}/actions.pickle")

We remove pool share transfers because they are irrelevant to the simulation (the number of pool shares is a consequence of the inputs, not something we should feed into the simulation).

stage3_merge_actions is where we take Events and merge them into Actions.

  • Yes, it should be called stage3_merge_events_into_actions. Naming is hard.
  • stage3_merge_actions() doesn’t even use the class, which I originally intended to be used here. Oh well.
  • stage3_merge_actions() is also where we ask the Ethereum archive node for the anonymous Event, decipher it and add its data into the “Action”. This should actually belong in section 1, where we get the different Event types from the SQL database, but the code is the way it is.

Interleave hourly prices between the Actions

Since I wrote the part that gets prices from Coingecko API, I’ll explain that here.
As long as you only request 3 months of data, Coingecko gives historical hourly prices. For free. However, this only goes a ways back – you won’t get hourly pricing data for 2018 even if you request 1 day at a time. This inconsistency is conveniently not mentioned on the Coingecko API page.

The hourly pricing data that Coingecko returns is not regular either – you might get

2021-01-01 00:47:33 1000
2021-01-01 01:33:21 1001
2021-01-01 02:17:05 999

which is worrysome. Now I have to round the timestamps to the nearest hour, and 01:33:21 rounds to 02:00:00, but 02:17:05 also rounds to 02:00:00! So I’ll have to throw something away.

Then again, who’s to say other pricing data services like Tradingview aren’t doing this in the background either?

Lesson Learned

Franz Kafka was known to find faults in anything that came from his own pen, except for a chosen few, amongst them The Judgment, which he supposedly wrote in a single 8 hour cohesive sitting, and was the first short story he was truly proud of.

That was also how I wrote the data ingestion script for the simulation. It was a beautiful, cohesive solution that fit very well to the problem.

But the problem changed. The 1inch aggregated swap problem came up. Prices had to be added. The archive node had to be queried for additional information, and nobody had time to rewrite everything. Over time, it became a jumbled mess, far from the elegant solution I had envisioned.

Kafka knew what he wanted to express, and it stayed the same. But as a programmer, we think we know the problem, but we don’t. It changes, or time will show us our understanding was wrong, or incomplete. Hence:

As a programmer, prefer the flexible solution, not the most beautiful/elegant solution.

me, who else

Leave a Reply

Your email address will not be published. Required fields are marked *