Best practice for a performant large CSV import (from folder)

Hello guys,
I installed Pimcore 6.2.1 as a standalone setup a few days ago on a native Debian Buster PC for testing purposes.
Now I am searching for the best way to import huge data sets. The frontend CSV import takes 1.3 secs per object (containing 15 fields, just strings and numbers). My CSV file contains 430k products, which works out to about 6 days of import time.

At the moment I am thinking of creating a CLI import script or a plugin bundle, but I don't know where to start or how to design the code to be as performant as possible.
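
For orientation, a hand-rolled CLI import command does not need much code. The following is only a minimal sketch under assumptions: an AppBundle command, a DataObject class called Product with made-up fields (name, sku, price), a semicolon-separated CSV, and a /products target folder. The main levers for bulk speed are disabling versioning, skipping the mandatory-field check, and collecting garbage periodically.

<?php
// src/AppBundle/Command/ProductImportCommand.php -- illustrative sketch only.
// The class name "Product" and the setName/setSku/setPrice setters are
// assumptions based on a made-up class definition; adjust to your data model.

namespace AppBundle\Command;

use Pimcore\Console\AbstractCommand;
use Pimcore\Model\DataObject;
use Pimcore\Model\Element;
use Pimcore\Model\Version;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class ProductImportCommand extends AbstractCommand
{
    protected function configure()
    {
        $this->setName('app:import-products')
            ->addArgument('file', InputArgument::REQUIRED, 'Path to the CSV file');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        // Versions are a big write amplifier during bulk imports.
        Version::disable();

        $folder = DataObject\Service::createFolderByPath('/products');
        $handle = fopen($input->getArgument('file'), 'r');
        $header = fgetcsv($handle, 0, ';'); // first line = column names
        $count  = 0;

        while (($row = fgetcsv($handle, 0, ';')) !== false) {
            $data = array_combine($header, $row);

            $product = new DataObject\Product();
            $product->setParent($folder);
            $product->setKey(Element\Service::getValidKey($data['sku'], 'object'));
            $product->setPublished(true);
            $product->setOmitMandatoryCheck(true); // skip mandatory-field validation

            // hypothetical setters generated from the class definition
            $product->setName($data['name']);
            $product->setSku($data['sku']);
            $product->setPrice($data['price']);

            $product->save(); // note: a duplicate key under /products would throw here

            if (++$count % 500 === 0) {
                \Pimcore::collectGarbage(); // free Pimcore's internal registries
                $output->writeln($count . ' objects imported');
            }
        }

        fclose($handle);
        Version::enable();

        return 0;
    }
}

Even a plain loop like this is usually much faster than the admin-UI import, since it skips versioning and the HTTP layer; the per-object save remains the dominant cost, though.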

Software Stack:
Pimcore 6.2.1
Debian 10 Buster
PHP 7.3.11 (memory_limit 8192M)
10.3.18-MariaDB

Hardware Stack:
i5-9600k
Samsung 970 Evo Plus M.2 SSD
16 GB RAM

kind regards,
lukas

Hi @lukashwm

Have a look at https://github.com/w-vision/DataDefinitions. It’s a bundle for importing and exporting huge amounts of data and should fit your use case.

Hope this helps!


Hi @aarongerig

thank you very much for the recommendation. That extension really seems to fit my needs. The installation was fairly easy, but now I’m stuck at “mapping”: the only option is to set the “from” column to “custom”. Of course, that’s because the import definition doesn’t know my .csv file and its headers, but there should be a way to register them, or am I misunderstanding how it works?
Another thing I am not sure about is the path field. Is this where I have to define where my objects should be stored, with something like “%Produkt(id,params)”?

Edit: Never mind, I couldn’t see the wood for the trees. I just had to put the CSV headers into the “CSV example” field.

Update: I am now importing my objects in combination with process-manager, which is a huge performance enhancement. My 430k products are now imported in about 5–6 hrs. But are there any further performance enhancements to be made, or does someone have an idea of the best way to look for bottlenecks?

Parallelize the import… that could work with ImportDefinitions, but nothing has been implemented for it yet. 6 hours for 450k objects is pretty good, though. The rest of the performance loss comes down to Pimcore.

Thanks for your suggestions @dpfaffenbauer

6 hours was the estimated time; in the end it took 25 hrs.
What was most interesting is that the import became very slow towards the end. For example, the last 5% took about 5 hrs. (While the import was running, the progress calculation was always correct: 50% = 215k articles imported, and so on.) I think the bottleneck was the SSD controller, because the partition had only 5 GB left, and as far as I remember most SSDs become quite slow when they run out of space.

Aren’t the imports parallelized? I was sure that process-manager does the parallelizing, because all six of my cores were running at about 30–50% while the import was active, which was different from the Pimcore built-in import, where only one core was active.

Nevertheless, I don’t think I’m capable of writing those routines myself :confused:

But maybe someone has experience with different hardware stacks and could tell me how, for example, a Ryzen 9 3950X + SSD RAID system would impact the performance.

To be honest, at the moment I don’t even know if I really need the performance. But before using something in a production setup I want to have the best performance and stability possible, as well as a tested backup setup, especially because it causes a ton of problems if you have to change something fundamental in a running environment…

Btw: the performance after the import has finished is really impressive. Filtering 10k articles of a specific manufacturer happens basically in real time.
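
For anyone reading along, that kind of filter maps to a plain object listing query; a minimal sketch (again assuming a Product class with a manufacturer field) would be:

<?php
// Illustrative listing query -- the "Product" class and "manufacturer" field
// are assumptions. Run inside a bootstrapped Pimcore context (e.g. a console
// command or controller).

use Pimcore\Model\DataObject;

$list = new DataObject\Product\Listing();
$list->setCondition('manufacturer = ?', ['ExampleManufacturer']);
$list->setLimit(100);

foreach ($list as $product) {
    echo $product->getId() . PHP_EOL;
}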

The reason things get slower towards the end is a memory leak in PDO… it takes up more and more memory and it doesn’t get released.

By parallelizing, I mean having multiple workers where each one imports just a part of the whole 450k-line file…

You could also import the file in chunks: run a process for lines 0–1000, after that the next one for 1001–2000, and so on. This releases memory and increases performance. Data Definitions already supports that for CSV files.

Cheers


Coming back to your last thoughts: how can I prepare my import definition to use chunks?

The CSV provider supports that out of the box; this is how I do it:

#!/bin/bash
# Runs the Data Definitions import in chunks of $LIMIT rows per console call.
# Each iteration starts a fresh PHP process, which is what releases the memory
# between chunks. 250 iterations x 100 rows covers 25,000 rows -- adjust the
# loop count and LIMIT to the size of your file.

COUNTER=0
OFFSET=0
LIMIT=100

while [ $COUNTER -lt 250 ]; do
    echo Running Offset $OFFSET with Limit $LIMIT with loop $COUNTER
    php bin/console data-definitions:import -d 3 -p "{\"file\":\"data.csv\", \"offset\": $OFFSET, \"limit\": $LIMIT}"
    let COUNTER=COUNTER+1
    let OFFSET=COUNTER*LIMIT
done
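
The loop above runs the chunks strictly one after another. To get closer to the “multiple workers” idea from earlier in the thread, a driver script could start several of those chunked console calls concurrently, for example with Symfony’s Process component (which ships with Pimcore). The sketch below is only an illustration: the chunk size, worker count and total row count are made up, and you would have to make sure your definition can safely run in parallel (unique keys/paths, no shared counters).

<?php
// run-parallel-import.php -- illustrative only, not a Data Definitions feature.
// Keeps up to $maxWorkers "data-definitions:import" processes running, each
// handling one offset/limit chunk of the CSV file.

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Process\Process;

$definitionId = 3;        // same definition id as in the shell loop above
$file         = 'data.csv';
$chunkSize    = 1000;     // rows per worker run (assumption)
$totalRows    = 430000;   // adjust to your file
$maxWorkers   = 4;        // parallel processes (assumption)

$running = [];
$offset  = 0;

while ($offset < $totalRows || count($running) > 0) {
    // top up the worker pool
    while ($offset < $totalRows && count($running) < $maxWorkers) {
        $params  = json_encode(['file' => $file, 'offset' => $offset, 'limit' => $chunkSize]);
        $process = new Process(['php', 'bin/console', 'data-definitions:import', '-d', (string) $definitionId, '-p', $params]);
        $process->start();
        $running[] = $process;
        $offset   += $chunkSize;
    }

    // drop finished workers from the pool
    foreach ($running as $i => $process) {
        if (!$process->isRunning()) {
            unset($running[$i]);
        }
    }

    usleep(200000); // avoid busy-waiting
}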