
I'm having a problem and need some advice. I am normally a developer, but with recent staff changes at my company I am now the sole IT person, so I'm having to branch into a lot of unknown areas and really need some help.

We are running Postgres 8.3. The database is trying to run autovacuum on the large object table (pg_catalog.pg_largeobject) to prevent transaction ID wrap-around. I think I understand the basics of what that means. The problem is, this table is 750GB with 452 million rows. The autovacuum is writing to disk a lot and eating up disk space (it consumed the last 250GB of our 1TB volume yesterday). After an emergency outage, we are back up and running with 1100GB of space and 100GB free. However, once Postgres was back up, it kicked off the autovacuum process again. If I kill the transaction (which I'm sure is not recommended), it just restarts.
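I believe a query along these lines against pg_stat_activity should confirm that the autovacuum worker is what's running (I'm assuming the 8.3-era column names, procpid and current_query, here):

    -- Show any running autovacuum workers and what they are working on
    -- (assumes PostgreSQL 8.3's pg_stat_activity column names).
    SELECT procpid, datname, query_start, current_query
    FROM pg_stat_activity
    WHERE current_query LIKE 'autovacuum:%';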

So here are my questions:

1) For that table, how much space would it need to finish the autovacuum process? How do I determine this?

2) Is there a better way to configure the server to handle this situation so it doesn't require ridiculous amounts of disk space whenever it needs to do this?

3) If the answer to 2 is no, how do you propose this problem be fixed?

I am not a DBA and do not have Linux server administration experience; I'm just a developer being asked to wear many hats. I'm trying to get a DBA consultant in to help resolve the problem, but the company is pushing back. They don't seem to understand the severity of the problem, despite my best efforts.

Suggestions? Comments? Any advice or guidance you can provide would be greatly appreciated. If you need more info, let me know.

framauro13
  • What makes you think autovacuum is consuming your disk space? Do you mean it is swapping memory to disk? Is logging enabled and filling up your disk with log noise? – Chris Farmiloe Apr 12 '13 at 12:26
  • The logs indicate that it was writing files and ran out of disk space. Yesterday we had roughly 250GB of free space. The autovacuum started around 7:30 AM and ran all day until the system came down at 10:30 PM. The thing is, this is my best guess right now, as I'm not too familiar with how the process actually works. I tend to have the pgAdmin database status screen up just to monitor things (as we've had issues in the past with blocking), and this is the first time I've seen autovacuum run on that table (which is the largest of our tables). I'll try to gather more information if I can. – framauro13 Apr 12 '13 at 12:40
  • 1
    Unless it's swapping lots, it should not be using lots of disk. If [`maintainance_work_mem`](http://www.postgresql.org/docs/8.3/static/runtime-config-resource.html) is set high, then it might be using lots of ram and swapping but I suspect the 'out of space' log is a red herring and your disk space issues are probably elsewhere (my guess is on a bloated index on that huge table). – Chris Farmiloe Apr 12 '13 at 12:43
  • Since this is my best guess, I'll ask this: how would I verify that this is or is not the problem? I understand why it needs to do the autovacuum, but I just don't have the experience to determine the ramifications of performing it on a large table. – framauro13 Apr 12 '13 at 13:02
  • Good to know Chris, thanks. I'll look into it some more. I'm hoping to get in touch with a DBA today to hopefully shed some light on this. I appreciate the ideas and assistance. – framauro13 Apr 12 '13 at 13:03
  • 1
    Just how big is this big table? `SELECT pg_relation_size('tablename'), pg_total_relation_size('tablename');` . You should also check how far away from forced emergency txid wraparound prevention shutdown you are, see http://dba.stackexchange.com/q/6395/7788 – Craig Ringer Apr 13 '13 at 02:15
  • The table is 748G. It is a large object table that stores medical record documents in an EMR. I am letting it run after upgrading RAM on the system. A DBA with our hosting site has become involved, and set up a process to move the transaction logs off to another machine to make sure the empty space is not eaten up by logs. The RAM increase is allowing the site to perform while running the vacuum, so now we are just waiting it out. Thanks for all of the info. – framauro13 Apr 16 '13 at 17:22
  • 1
    @framauro13 Ouch... yes, it's difficult when business level decisions are made that cause major technical problems. I see that quite a bit, especially with extremely foolish policies that require even minor patch releases to go through an extended approval process that takes weeks or months and wastes lots of time and effort. The inevitable outcome is that patch releases are simply not applied - until the bug(s) they fixed bite the operator, and then they beg for a *backport of just that fix* because they're not allowed to install a "new version" ... but a custom homebrew patch is ok. Er... – Craig Ringer Apr 16 '13 at 23:52

2 Answers


If you do not resolve this problem fairly quickly, your database will go into an emergency shutdown to prevent data corruption and will refuse to start back up until the txid wraparound vacuum completes. Check the logs to see how close to this point you are; you'll see messages like:

WARNING:  database "mydb" must be vacuumed within 177009986 transactions 
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in "mydb". 
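You can also ask the server directly how far along you are. A minimal sketch, runnable from any database in the cluster:

    -- Transaction ID age per database. Anti-wraparound vacuums start once the
    -- age exceeds autovacuum_freeze_max_age (200 million by default); the
    -- forced shutdown happens at roughly 2 billion.
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY age(datfrozenxid) DESC;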

Do not just kill the vacuum and put the problem off. You really, really need to resolve this unless you can afford some unplanned downtime.

The reason it's consuming tons of disk space is probably that you're on an old version that doesn't have auto-managed free space map settings, and you've likely exceeded max_fsm_pages and/or max_fsm_relations. Check the log; you may well see messages about these.

Unfortunately you can't just raise these params after the fact. This old PostgreSQL install has lost the knowledge about what space in the table is free. Proper cleanup and recovery will require a CLUSTER of the table, which needs at least as much free space as the table+index sizes and requires an exclusive lock on the table for the duration of the run.

Most of the less intrusive mitigating options like pg_reorg are no longer open to you now that you're approaching forced txid wraparound prevention. Your best bet is quite probably to give autovacuum the space it needs to complete the job - or to deal with the downtime and CLUSTER then VACUUM FREEZE the table to get the process over and done with faster.
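To make the downtime route concrete, it looks roughly like the following. This is only a sketch: the table and index names are made up, and a system catalog like pg_largeobject cannot be treated quite like an ordinary user table.

    -- Rewrite the table in index order, discarding dead space. Needs roughly
    -- table + index size in free disk and holds an exclusive lock throughout.
    CLUSTER some_big_table USING some_big_table_pkey;

    -- Then freeze the old tuples so the anti-wraparound requirement is satisfied.
    VACUUM FREEZE VERBOSE some_big_table;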

Once you've recovered, I would recommend greatly increasing max_fsm_pages and making sure max_fsm_relations is big enough. There is lots of tuning advice around for these old versions; search for it.
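A quick way to see what you're running with now (both are postgresql.conf settings on 8.3 and, as far as I recall, need a server restart to change):

    -- Current free space map limits. A database-wide VACUUM VERBOSE reports
    -- how much of the FSM is actually in use, which shows how far short these fall.
    SHOW max_fsm_pages;
    SHOW max_fsm_relations;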

Plan an upgrade to 9.2, which auto-manages the free space map (as does any version from 8.4 on) and has all sorts of autovacuum enhancements to help stop you getting into these pickles in the first place.

If the situation is desperate, consider getting in touch with a professional PostgreSQL support provider. (Full disclosure: I work for 2ndQuadrant, one of the listed providers.)

Craig Ringer
  • The table itself is 748G. We upgraded the RAM on the machine to help balance user volume and this process. Having said that, we had to restart the machine, which means the autovacuum process needs to start over again. The problem was that it was generating a TON of transaction logs which ate up disk space, which in turn caused the server to shut down. After that was dealt with, performance took a huge hit, so we upgraded the RAM and started it again. I hope we have this straightened out now so all we have to do is wait for it to finish. I appreciate the detailed response. – framauro13 Apr 16 '13 at 17:27
  • I should also note that we are currently migrating off this system, so after this transaction ID wrap-around vacuum has been performed, we will not need to do it again within the life of the system (I hope). We had planned upgrades earlier this year, but the business put a stop to that due to development time and costs. So now we're here. – framauro13 Apr 16 '13 at 17:51

The real-time support on Freenode's #postgresql (IRC) channel is amazing. There are frequently knowledgeable people awake and available to talk through DBA/development details. I can't recommend it enough.

matthudson