0

I have a .csv file containing 1.7 million records. Some of these records are already in my database collection (2.2 million records), others are not. I want to add the records which are not yet in the collection without overwriting or updating the existing records. Each record has an enterprise number as unique property.

So far I could only found solutions which are first creating all the operations and than use the model.bulkWrite() method to perform these operations. Since these operations use updateOne() the already existing records will be updated.

So I'm looking for a way to add new records in bulk without updating or overwriting existing records.

Code

// Read data
const data = await readCSV('data/companies.csv');

// Format data
const formatted_data = data.map(record => ({
    name: record.name,
    enterprise_number: record.enterprise_number,
    vat_number: record.vat_number,
    legal_form: record.juridical_form || null,
    activity_codes: (!record.nace_codes || record.nace_codes === '') ? [] : JSON.parse(record.nace_codes.replace(/'/g, '"')),
    address: {
        street: record.street,
        house_number: record.house_number,
        additional: record.additional,
        postal_code: record.postal_code,
        city: record.city,
        country: record.country
    },
    phone: record.tel || null,
    mobile: record.mobile || null,
    email: record.email || null,
    establishment_date: record.start_date,
    status: record.is_active === '1' ? 'active' : 'inactive',
}));

const batchSize = 10000;
    
// Create batches of 10.000 records
for (let i = 0; i < formatted_data.length; i += batchSize) {
    let batch = formatted_data.slice(i, i + batchSize);

    let operations = batch.map((row) => ({
        updateOne: {
            filter: { enterprise_number: row.enterprise_number },
            update: row,
            upsert: true
        }
    }));

    try {
        let result = await Company.bulkWrite(operations);
        
        console.log(`Batch ${i} done: ${result.modifiedCount} records modified, ${result.upsertedCount} records upserted`);
    } catch (error) {
        console.error(`Error updating batch ${i}: ${error}`);
    }
}

console.log('All done');

Based on the answer of @Fraction which was a link to this topic Mongodb how to insert ONLY if does not exists (no update if exist)? I updated my code to the following

for (let i = 0; i < formatted_data.length; i += batch_size) {
    let batch = formatted_data.slice(i, i + batch_size);

    let operations = batch.map((row) => ({
        updateOne: {
            filter: { enterprise_number: row.enterprise_number },
            update: { $setOnInsert: row },
            upsert: true
        }
    }));

    try {
        let result = await Kbo.bulkWrite(operations);
        console.log(`Batch ${i} done: ${result.result.nUpserted} records inserted`);
    } catch (error) {
        console.error(`Error inserting batch ${i}: ${error}`);
    }
}

This is kind of working. It's upserting new records and it does not update the properties of a existing record BUT it's updating the updated_at property of all matching records.

Thore
  • 1,918
  • 2
  • 25
  • 50
  • Does this answer your question? [Mongodb how to insert ONLY if does not exists (no update if exist)?](https://stackoverflow.com/questions/48220677/mongodb-how-to-insert-only-if-does-not-exists-no-update-if-exist) – Fraction Feb 16 '23 at 10:06
  • @Fraction I tried the solution in that question and it's weird. The properties of the matching record are not updating but the updated_at property is updating to the current timestamp. Also the result response of the `bulkWrite` method contains a field with the number of modified rows and every row has been modified. To keep the data correct, the updated_at property may not be updated for existing records. – Thore Feb 16 '23 at 10:48
  • Why do you re-invent the wheel? Have a look at [mongoimport](https://www.mongodb.com/docs/database-tools/mongoimport/), it provides `--mode=` – Wernfried Domscheit Feb 16 '23 at 11:04
  • @Thore see this [comment](https://github.com/Automattic/mongoose/issues/6861#issuecomment-414118221), you can try to set [`timestamps`](https://mongoosejs.com/docs/guide.html#timestamps) to `false` and add `created_at` manually – Fraction Feb 16 '23 at 11:41
  • @WernfriedDomscheit `--mode=upsert` will replace existing documents – Fraction Feb 16 '23 at 11:42
  • @WernfriedDomscheit Because mongoimport is not able to format the code and I don't want to create another script to format the code – Thore Feb 16 '23 at 12:02

0 Answers0