Preparing Data for a Mere Simulation is Harder Than You’d Think

Balancer Simulations is an open-source cadCAD model of a Balancer V1 AMM smart contract. This is useful for exploring “what if” questions involving an AMM pool e.g. “as an arbitrageur, how could I have made the most profit with this particular AMM pool (given historical data)?” or “how high should I set the pool fees to provide a decent ROI to liquidity providers?”. As a fundamental DeFi composing block, AMMs will be increasingly used to encourage certain behaviour – therefore it is important to find the correct parameters.


The simulation is coded to be identical to the original smart contract deployed on Ethereum, so that one can “play” with it and derive some insights. But we still needed a way to feed in user actions to the simulation, reconstructed from what actually happened on Ethereum.

The simulation doesn’t know about transactions or gas fees. Instead, it feeds on user Actions – a user creates the pool; another user joins (deposits tokens into) the pool, providing liquidity; another user swaps token A for token B, paying a fee to the pool for doing so; another user exits, taking out all the tokens he deposited in the pool + a proportion of the pool’s collected fees. These “events”, or what we called “Actions” in the data pulling script, had to be reconstructed from Ethereum log events.

Getting Ethereum log Events

It’s not so easy to get historical data from Ethereum. You need access to an archive node, and an archive node takes at least 4TB of flash storage and a week or two to sync. That’s assuming everything goes well. Plus, when you finally get data out of it, it’s not organized.

Thankfully some people have been putting historical Balancer pool events on a traditional SQL database, which can be queried easily:

select * from blockchain-etl.ethereum_balancer.BFactory_event_LOG_NEW_POOL where pool="0x..."give me all LOG_NEW_POOL events for the pool 0x….. (since a pool is only created once, there can only be one row from this SQL query)
select * from blockchain-etl.ethereum_balancer.BPool_event_LOG_JOIN where contract_address="0x..." order by block_numbergive me all the events where a user added token liquidity to the pool, and sort them by the block in which it happened
select * from blockchain-etl.ethereum_balancer.BPool_event_LOG_SWAP where contract_address="0x..." order by block_numbergive me all the events where a user swapped a token out for another using this pool 0x…, sorted by the block in which it happened

We do this for LOG_NEW_POOL, LOG_SWAP, LOG_JOIN, LOG_EXIT, event_Transfer, fee changes and weight changes (which were too complex to fit into one line of SQL). And afterwards we smush them all together into one sorted list of Events.

“Why didn’t you just get all the Events sorted by block number, you dummy?”
It just wasn’t possible the way the SQL data was organized. Besides, fee changes and weight changes had to be derived, they weren’t actual log Events.

Great, so we don’t need an Ethereum archive node after all, right? Wrong – there is a special anonymous Event emitted in addition to LOG_JOIN, LOG_SWAP, LOG_EXIT that has important information that can affect the simulation accuracy.

Because Ethereum transactions may or may not go through, and tokens in the pool might change in the meantime, you might not get back exactly the amount of tokens you were expecting.

For example, if you swap 1 TOKEN-A for 500 TOKEN-B, you might get 499 or 501 TOKEN-B. Fortunately there are variants of JOIN/SWAP/EXIT methods which let the user decide if he wants to spend exactly this much TOKEN-A, or if getting back exactly 500 TOKEN-B is more important to him.

Unfortunately, this important information was not included in the SQL database, so we needed an Ethereum archive node after all, and fellow developer Raul spent at least 2 nights deciphering this important information from the anonymous Event.

Put all the Events together into a list and group by txhash

events = []
events.extend(turn_events_into_actions(new_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(join_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(swap_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(exit_events, fees_dict, denorms_results))
events.extend(turn_events_into_actions(transfer_events, fees_dict, denorms_results))

events_grouped_by_txhash = {}
for i, action in enumerate(events):
    tx_hash = events[i].tx_hash
    if events_grouped_by_txhash.get(tx_hash) is None:
        events_grouped_by_txhash[tx_hash] = []
# save_pickle(events_grouped_by_txhash, f'{args.pool_address}/events_grouped_by_txhash.pickle')
# events_grouped_by_txhash = load_pickle(f'{args.pool_address}/events_grouped_by_txhash.pickle')


Multiple log Events could actually have been emitted by a single Ethereum transaction. So now, given NEW, JOIN, SWAP, EXIT, and Transfer (pool shares, not the tokens) Events, we want to reconstruct the transactions that were relevant to this particular pool.

The above code simply smushes the events together into a long, unsorted list, and groups them by txhash.

Dirty detail: It says turn_events_into_actions() and it even uses the Action class in data/, but actually they are not yet real Actions, they are still individual Events. That’s because when I wrote the code, I intended to make them Actions, but many other problems came up and I quickly forgot my original intention.

Exception: irregularities in the data caused by 1inch aggregated swaps

We were getting token swaps that didn’t make sense. Which BAL got turned into WBTC, and which into WETH? It is not clear.

"action": {
    "type": "swap",
    "tokens_in": [
            "amount": "447.23532971026",
            "symbol": "BAL"
            "amount": "157.26956649152",
            "symbol": "BAL"
    "tokens_out": [
            "amount": "7279711",
            "symbol": "WBTC"
            "amount": "6.450635831831913964",
            "symbol": "WETH"

As it turns out, this is the work of smart hacks working at to save on gas fees and aggregate swaps into a single transaction. So we had to modify our data parsing script to recognize transactions like these and emit two Actions instead of one.

Turn Events into simulation-relevant Actions

# Remove pool share transfers
grouped_events = list(filter(lambda acts: not (len(acts) == 1 and acts[0].action_type == 'transfer'), grouped_events))

actions = stage3_merge_actions(args.pool_address, grouped_events)

# save_pickle(actions, f"{args.pool_address}/actions.pickle")
# actions = load_pickle(f"{args.pool_address}/actions.pickle")

We remove pool share transfers because they are irrelevant to the simulation (the number of pool shares is a consequence of the inputs, not something we should feed into the simulation).

stage3_merge_actions is where we take Events and merge them into Actions.

  • Yes, it should be called stage3_merge_events_into_actions. Naming is hard.
  • stage3_merge_actions() doesn’t even use the class, which I originally intended to be used here. Oh well.
  • stage3_merge_actions() is also where we ask the Ethereum archive node for the anonymous Event, decipher it and add its data into the “Action”. This should actually belong in section 1, where we get the different Event types from the SQL database, but the code is the way it is.

Interleave hourly prices between the Actions

Since I wrote the part that gets prices from Coingecko API, I’ll explain that here.
As long as you only request 3 months of data, Coingecko gives historical hourly prices. For free. However, this only goes a ways back – you won’t get hourly pricing data for 2018 even if you request 1 day at a time. This inconsistency is conveniently not mentioned on the Coingecko API page.

The hourly pricing data that Coingecko returns is not regular either – you might get

2021-01-01 00:47:33 1000
2021-01-01 01:33:21 1001
2021-01-01 02:17:05 999

which is worrysome. Now I have to round the timestamps to the nearest hour, and 01:33:21 rounds to 02:00:00, but 02:17:05 also rounds to 02:00:00! So I’ll have to throw something away.

Then again, who’s to say other pricing data services like Tradingview aren’t doing this in the background either?

Lesson Learned

Franz Kafka was known to find faults in anything that came from his own pen, except for a chosen few, amongst them The Judgment, which he supposedly wrote in a single 8 hour cohesive sitting, and was the first short story he was truly proud of.

That was also how I wrote the data ingestion script for the simulation. It was a beautiful, cohesive solution that fit very well to the problem.

But the problem changed. The 1inch aggregated swap problem came up. Prices had to be added. The archive node had to be queried for additional information, and nobody had time to rewrite everything. Over time, it became a jumbled mess, far from the elegant solution I had envisioned.

Kafka knew what he wanted to express, and it stayed the same. But as a programmer, we think we know the problem, but we don’t. It changes, or time will show us our understanding was wrong, or incomplete. Hence:

As a programmer, prefer the flexible solution, not the most beautiful/elegant solution.

me, who else

cadCAD can’t simulate humans

Now and then, when people hear I code economic simulations with cadCAD (and I just heard about its rust cousin radCAD), they want me to write one for them. And it usually involves asking how humans would behave in a specific situation.

here’s the thing

I don’t know.

I can’t simulate human psychology.

What would people do in this or that case? This is not what a simulation does. A simulation says “given these conditions, and behaviours, this will happen (most/some of the time)”. One must be very clear about what these behaviours are and formulate the question in such a way that we can answer it without having to program mini-humans. Mini-humans are impossible to verify anyway.

Bad question: “What would humans do if I raised taxes?”
Better question: “How many people would stay in my system given a group of humans (with income distribution A, differing tolerances for tax raises B) and if I raised taxes above a certain threshold?”

You get the idea.

Math helps to set your intentions in stone. If you can express every interaction and change as an equation, everyone can verify that the simulation is working as intended. After all, if I write a complex system that I claim simulates how humans would behave, how is anybody going to verify that? I was probably just guessing; and it’s hard to verify if the code does what I intended. It’s far easier to test if an equation computed the right result.

A Blockchain Use Case for Dancers

Hellen the dancer

wants to make a name for herself, so she uploads a video to Youtube, err, I mean, Odysee/, which are just the same thing in the background, the LBRY blockchain.

Uploading the video and reserving the channel name @ZoukBerlin costs some LBRY Credits (LBC).

The end result is that the website doesn’t have to show ads, annoying everyone, to sustain itself. It earns from the very act of people uploading and curating videos.

People can choose to tip her in LBC, or support her video (again with LBC) which helps it move up in the search results.

Imagine if a normal company did this with their own credits system. They’d make it impossible for people to cash out their custom credits, locking people in! Those travel companies are doing it right now as we speak.

As people tip her LBC, Hellen could use those to boost her videos, or cash out by converting them to Bitcoin, Ethereum, or anything else. It’s a balancing act.

The Video In Question

Youtube is famous for demonetizing videos for whatever reason. Since Hellen doesn’t usually dance to silence, inevitably some record company will step up and say:

Using our song in your video is a privilege. All proceeds belong to us now, thanks!

Hellen doesn’t think so, naturally. Anselmo Ralph should be honoured that she wanted to dance to his song! Indeed, if you asked the artist himself, he might even say:

Hey, thank you for dancing to my song! (nice ass)

Suffice to say this doesn’t happen on blockchain platforms. If LBRY ever considers demonetizing videos based on their audio, as a LBC token holder, Hellen and many other content creators can even vote on the issue.