I have a .csv file containing 1.7 million records. Some of these records are already in my database collection (2.2 million records), others are not. I want to add the records which are not yet in the collection without overwriting or updating the existing records. Each record has an enterprise number as unique property.
So far I could only found solutions which are first creating all the operations and than use the model.bulkWrite()
method to perform these operations. Since these operations use updateOne()
the already existing records will be updated.
So I'm looking for a way to add new records in bulk without updating or overwriting existing records.
Code
// Read data
const data = await readCSV('data/companies.csv');
// Format data
const formatted_data = data.map(record => ({
name: record.name,
enterprise_number: record.enterprise_number,
vat_number: record.vat_number,
legal_form: record.juridical_form || null,
activity_codes: (!record.nace_codes || record.nace_codes === '') ? [] : JSON.parse(record.nace_codes.replace(/'/g, '"')),
address: {
street: record.street,
house_number: record.house_number,
additional: record.additional,
postal_code: record.postal_code,
city: record.city,
country: record.country
},
phone: record.tel || null,
mobile: record.mobile || null,
email: record.email || null,
establishment_date: record.start_date,
status: record.is_active === '1' ? 'active' : 'inactive',
}));
const batchSize = 10000;
// Create batches of 10.000 records
for (let i = 0; i < formatted_data.length; i += batchSize) {
let batch = formatted_data.slice(i, i + batchSize);
let operations = batch.map((row) => ({
updateOne: {
filter: { enterprise_number: row.enterprise_number },
update: row,
upsert: true
}
}));
try {
let result = await Company.bulkWrite(operations);
console.log(`Batch ${i} done: ${result.modifiedCount} records modified, ${result.upsertedCount} records upserted`);
} catch (error) {
console.error(`Error updating batch ${i}: ${error}`);
}
}
console.log('All done');
Based on the answer of @Fraction which was a link to this topic Mongodb how to insert ONLY if does not exists (no update if exist)? I updated my code to the following
for (let i = 0; i < formatted_data.length; i += batch_size) {
let batch = formatted_data.slice(i, i + batch_size);
let operations = batch.map((row) => ({
updateOne: {
filter: { enterprise_number: row.enterprise_number },
update: { $setOnInsert: row },
upsert: true
}
}));
try {
let result = await Kbo.bulkWrite(operations);
console.log(`Batch ${i} done: ${result.result.nUpserted} records inserted`);
} catch (error) {
console.error(`Error inserting batch ${i}: ${error}`);
}
}
This is kind of working. It's upserting new records and it does not update the properties of a existing record BUT it's updating the updated_at
property of all matching records.