How do I handle billions of objects and assets?

Hi there,

I’m planning to use Pimcore 5 for a project where the amount of objects and assets will reach into the billions.

Pimcore will be running in a multi-environment setup (dedicated database, multiple frontends, NFS, S3, …) and the objects will be organized into 10-20 classes – of course with some relations to each other (and to the assets).

Apart from optimizing the database and using appropriate hardware, are there any suggestions for a setup with this huge amount of objects and assets? How should I start so that horizontal scaling later takes as little effort as possible?

Are there any projects out there running with that amount of objects and assets without problems? If yes, how are they set up? Does it make sense to separate the frontends (with user and session data) from the other data and work with a kind of API between the instances? Or is it better (at least for the beginning) to keep everything in one database/cluster?

Another question: are there any suggestions for the object and asset folder structure – e.g. splitting things across many folders to take advantage of lazy loading? Will the internal backend search handle billions of objects and millions of assets at acceptable speed? Of course, it should still be possible to work comfortably with those objects and assets in the backend.

I would be really happy to get some answers I can work with – and of course I’ll keep you up to date on this. :wink:

Thanks in advance,
Daniel

Hi Daniel,
great to hear that you are planning to use Pimcore for such a big amount of data!

I think setting up and configuring a proper database will be key to that project. Getting this right will have the biggest impact on performance. Since Pimcore uses MySQL quite directly, you can take advantage of all the performance tweaks available there (caches, InnoDB disk options, table partitioning, etc.).
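As a starting point, the usual InnoDB knobs apply. The values below are purely illustrative, not recommendations – size them to your hardware and workload:

```ini
# my.cnf – illustrative InnoDB tuning, adjust to your hardware
[mysqld]
# Keep as much of the working set in memory as possible
innodb_buffer_pool_size        = 64G
innodb_buffer_pool_instances   = 16
# Relax durability only if losing up to ~1s of writes is acceptable
innodb_flush_log_at_trx_commit = 2
innodb_flush_method            = O_DIRECT
innodb_log_file_size           = 4G
# Smaller minimum token size helps short search terms in full-text indexes
innodb_ft_min_token_size       = 2
```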

Also make sure you have set up the Redis cache properly.
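For reference, enabling the Redis cache pool in Pimcore 5 looks roughly like this – I’m writing the key names from memory of the docs, so please verify them (the hostname is of course just a placeholder):

```yaml
# app/config/config.yml – Redis cache pool (key names from memory, verify!)
pimcore:
    cache:
        pools:
            redis:
                enabled: true
                connection:
                    server: redis.internal   # placeholder host
                    database: 14
```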

The preferred way to go is to keep all data in one system, because this makes things less complicated and more flexible. But of course this depends on the actual use case.

In terms of folder structure, I would optimize for user experience, navigation and the ability to find data. So choose a folder structure that is logical to follow, not too flat and also not too deep.
But of course, millions of elements in one folder will not improve performance :wink:.
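One common trick for keeping folders balanced – not Pimcore-specific, just a generic sketch – is to derive the folder path from a hash of the object key, so no single folder ever grows out of proportion:

```python
import hashlib

def sharded_path(key: str, levels: int = 2, width: int = 2) -> str:
    """Derive a balanced folder path from an object key.

    Hashes the key and uses the first `levels` groups of `width` hex
    characters as intermediate folders, e.g. 'ab/cd/<key>'.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(parts + [key])

# Two levels of 256 folders each -> ~65k leaf folders, so even a
# billion objects average only ~15k entries per folder.
print(sharded_path("product-000042"))
```

Whether the keys are product numbers, SKUs or asset filenames doesn’t matter, as long as the hash input is stable.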

The internal search is based 1:1 on an InnoDB full-text search index – so again, the better the database is set up, the better the performance will be.
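If I remember the schema correctly, that index lives in the `search_backend_data` table, and a backend search boils down to a query along these lines (table and column names from memory – verify against your installation):

```sql
-- Sketch of what the backend search does internally
-- (table/column names from memory, verify against your schema)
SELECT id, fullpath
FROM search_backend_data
WHERE MATCH(data, properties) AGAINST('+sneaker*' IN BOOLEAN MODE)
LIMIT 50;
```

Running queries like this directly is a good way to benchmark search performance independently of the backend UI.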

In terms of setting up Pimcore on Amazon AWS, have a look at this documentation page: https://pimcore.com/docs/4.6.x/Development_Documentation/Installation_and_Upgrade/System_Setup_and_Hosting/Amazon_AWS_Setup/index.html
It’s for Pimcore 4 (for Pimcore 5 we haven’t had the time to update it yet), but the basic principles should be the same – only the paths and events are a bit different.

I hope this helps a bit further and we would love to be kept up-to-date on this.

Thanks…

Christian


Hi Christian,

thanks a lot for your extensive answer to my question – I appreciate it.

You’re right, one system for all data would be the easiest way, but I’m really thinking about splitting into frontends and data providers – with that kind of setup I’m able to spread both load and data. Do you really suggest having billions of objects accessible through one Pimcore backend (in clustered mode, of course)? Do you know of setups like this running smoothly?

Also many thanks for the hints on folder structure and the AWS setup – I will follow them. Running Pimcore on AWS works very well – I know it from version 4. This project will be hosted by a smaller hoster who manages all the setup for me.

Thanks a lot for your time. I’ll definitely keep this thread alive and share new experiences with everyone interested.

Cheers,
Daniel

Hi,
I don’t think we have billions of objects anywhere yet. Millions of objects we definitely have.

Well, as I said before, it really depends on the use case. When I need to put all the data in relation with each other, then one instance would be easier. When I have separate data buckets that I just need to query and display at runtime, the data-provider approach might be better…
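In the data-provider variant, the frontend can stay very thin: it just fetches JSON from the provider instance and renders it, never touching the provider’s database. Everything below (the endpoint, the payload shape) is hypothetical, just to illustrate the split:

```python
import json
from urllib.request import urlopen

# Hypothetical data-provider instance; endpoint shape is made up
DATA_PROVIDER = "https://data.example.com"

def fetch_object(object_id: int) -> dict:
    """Fetch one object from the data-provider instance (sketch)."""
    with urlopen(f"{DATA_PROVIDER}/api/objects/{object_id}") as resp:
        return parse_object(resp.read().decode("utf-8"))

def parse_object(payload: str) -> dict:
    """Keep only the fields the frontend actually renders."""
    obj = json.loads(payload)
    return {"id": obj["id"],
            "name": obj.get("name", ""),
            "assets": obj.get("assets", [])}

# The frontend only ever sees the provider's JSON contract:
print(parse_object('{"id": 7, "name": "Sneaker", "assets": ["/img/7.jpg"]}'))
# -> {'id': 7, 'name': 'Sneaker', 'assets': ['/img/7.jpg']}
```

The nice property is that the contract between the instances is just JSON over HTTP, so each side can be scaled and cached independently.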

Cheers…


Hi Daniel,

We are also planning something similar. We plan to store around 70 million data objects; a single table can contain around 20 million records. We then want to give admins access to browse the data and change it if needed.
Will the dashboard be able to load these objects without lag or getting stuck?
What other challenges do you think I might face?

The database setup has certain limitations as well; it will not scale beyond a given point, although of course that depends on the use case too.
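For a single table in the 20-million-row range, classic MySQL partitioning can at least keep individual partitions manageable. A generic sketch – the table and columns are made up for illustration, not a Pimcore table:

```sql
-- Illustrative only: hash-partition a large custom table by id
-- (invented table/columns; adapt names to your own schema)
CREATE TABLE product_data (
    id BIGINT UNSIGNED NOT NULL,
    name VARCHAR(255),
    payload JSON,
    PRIMARY KEY (id)
)
ENGINE = InnoDB
PARTITION BY HASH(id)
PARTITIONS 32;
```

Note that MySQL requires the partition key to be part of every unique key, which constrains how you can partition existing tables.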

I am new to Pimcore and I see a lot of challenges in scaling and managing it with such a huge volume of data.
Please share any ideas you think could help here.

Thanks
Sohit