This project is archived and is in readonly mode.

#55 ✓resolved
ronin-150602 (at lighthouseapp)

Deadlock in 2.4.0.

Reported by ronin-150602 (at lighthouseapp) | May 25th, 2011 @ 09:25 AM

rel-2.4.0
os-linux

In the particular setup implemented in the attached file, psycopg ends up deadlocking.

The setup: all the threads share a single connection. Cursors are not shared across threads. Eventually, psycopg ends up in a deadlock. It may only happen after several hundred thousands SELECTs, though.

It does seem that at least three threads are required to trigger the bug. With intermittent concurrency problems it's never easy to tell for sure.

This is with psycopg 2.4.0 on Linux x86 using Python 2.5.4.

Thank you for your help.

Comments and changes to this ticket

  • ronin-150602 (at lighthouseapp)

    ronin-150602 (at lighthouseapp) May 25th, 2011 @ 09:33 AM

    Side note: sometimes psycopg seems to raise a SystemError exception instead of deadlocking, hinting at other underlying concurrency issues:

    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
        self.run()
      File "pg_thread.py", line 20, in run
        CONNECTION.cursor().execute("SELECT 1")
    SystemError: null argument to internal routine
    
  • Daniele Varrazzo

    Daniele Varrazzo May 27th, 2011 @ 09:22 PM

    Thank you for the report and for the test case: will definitely investigate!

  • Daniele Varrazzo

    Daniele Varrazzo May 27th, 2011 @ 11:35 PM

    • State changed from “new” to “open”

    I can reproduce the deadlock, even if not consistently, with the current devel too. Sometimes instead of deadlocks I have had loss of sync with the server and also a segfault - unfortunately I don't have a core.

    Will try to look more into the issue, thanks.

  • Daniele Varrazzo

    Daniele Varrazzo May 28th, 2011 @ 08:54 AM

    Stack trace of the three threads in a deadlock:

    Thread 2:

    [#0](/projects/62710/tickets/0 "Ticket #0")  0x00110416 in __kernel_vsyscall ()
    [#1](/projects/62710/tickets/1 "Ticket #1")  0x00191169 in __lll_lock_wait () from /lib/libpthread.so.0
    [#2](/projects/62710/tickets/2 "Ticket #2")  0x0018c5cb in _L_lock_748 () from /lib/libpthread.so.0
    [#3](/projects/62710/tickets/3 "Ticket #3")  0x0018c3f1 in pthread_mutex_lock () from /lib/libpthread.so.0
    [#4](/projects/62710/tickets/4 "Ticket #4")  0x00eec9d0 in pq_execute (curs=0x9257034, query=0x918569c "SELECT 1", 
        async=0) at psycopg/pqpath.c:764
    [#5](/projects/62710/tickets/5 "Ticket #5")  0x00ef568b in _psyco_curs_execute (self=0x9257034, operation=0x0, vars=
        0x0, async=0) at psycopg/cursor_type.c:419
    [#6](/projects/62710/tickets/6 "Ticket #6")  0x00ef59db in psyco_curs_execute (self=0x9257034, args=('SELECT 1',), 
        kwargs=0x0) at psycopg/cursor_type.c:475
    [#7](/projects/62710/tickets/7 "Ticket #7")  0x081b4a03 in PyCFunction_Call (func=
        <built-in method execute of psycopg2._psycopg.cursor object at remote 0x9257034>, arg=('SELECT 1',), kw=0x0) at ../Objects/methodobject.c:85
    [#8](/projects/62710/tickets/8 "Ticket #8")  0x080fb2ce in call_function (pp_stack=0xb5e80928, oparg=1)
        at ../Python/ceval.c:3750
    

    Thread 3:

    [#0](/projects/62710/tickets/0 "Ticket #0")  0x00110416 in __kernel_vsyscall ()
    [#1](/projects/62710/tickets/1 "Ticket #1")  0x00c65df6 in poll () from /lib/libc.so.6
    [#2](/projects/62710/tickets/2 "Ticket #2")  0x007dc556 in ?? () from /usr/lib/libpq.so.5
    [#3](/projects/62710/tickets/3 "Ticket #3")  0x007dc661 in ?? () from /usr/lib/libpq.so.5
    [#4](/projects/62710/tickets/4 "Ticket #4")  0x007dc6e3 in ?? () from /usr/lib/libpq.so.5
    [#5](/projects/62710/tickets/5 "Ticket #5")  0x007db7b2 in PQgetResult () from /usr/lib/libpq.so.5
    [#6](/projects/62710/tickets/6 "Ticket #6")  0x007dbaa8 in ?? () from /usr/lib/libpq.so.5
    [#7](/projects/62710/tickets/7 "Ticket #7")  0x00eecaf4 in pq_execute (curs=0xb7844db4, query=0x918569c "SELECT 1", 
        async=0) at psycopg/pqpath.c:778
    [#8](/projects/62710/tickets/8 "Ticket #8")  0x00ef568b in _psyco_curs_execute (self=0xb7844db4, operation=0x0, vars=
        0x0, async=0) at psycopg/cursor_type.c:419
    [#9](/projects/62710/tickets/9 "Ticket #9")  0x00ef59db in psyco_curs_execute (self=0xb7844db4, args=('SELECT 1',), 
        kwargs=0x0) at psycopg/cursor_type.c:475
    [#10](/projects/62710/tickets/10 "Ticket #10") 0x081b4a03 in PyCFunction_Call (func=
        <built-in method execute of psycopg2._psycopg.cursor object at remote 0xb7844db4>, arg=('SELECT 1',), kw=0x0) at ../Objects/methodobject.c:85
    

    Thread 4:

    [#0](/projects/62710/tickets/0 "Ticket #0")  0x00110416 in __kernel_vsyscall ()
    [#1](/projects/62710/tickets/1 "Ticket #1")  0x00191169 in __lll_lock_wait () from /lib/libpthread.so.0
    [#2](/projects/62710/tickets/2 "Ticket #2")  0x0018c5cb in _L_lock_748 () from /lib/libpthread.so.0
    [#3](/projects/62710/tickets/3 "Ticket #3")  0x0018c3f1 in pthread_mutex_lock () from /lib/libpthread.so.0
    [#4](/projects/62710/tickets/4 "Ticket #4")  0x00eec9d0 in pq_execute (curs=0xb7844d24, query=0x918569c "SELECT 1", 
        async=0) at psycopg/pqpath.c:764
    [#5](/projects/62710/tickets/5 "Ticket #5")  0x00ef568b in _psyco_curs_execute (self=0xb7844d24, operation=0x0, vars=
        0x0, async=0) at psycopg/cursor_type.c:419
    [#6](/projects/62710/tickets/6 "Ticket #6")  0x00ef59db in psyco_curs_execute (self=0xb7844d24, args=('SELECT 1',), 
        kwargs=0x0) at psycopg/cursor_type.c:475
    [#7](/projects/62710/tickets/7 "Ticket #7")  0x081b4a03 in PyCFunction_Call (func=
        <built-in method execute of psycopg2._psycopg.cursor object at remote 0xb7844d24>, arg=('SELECT 1',), kw=0x0) at ../Objects/methodobject.c:85
    [#8](/projects/62710/tickets/8 "Ticket #8")  0x080fb2ce in call_function (pp_stack=0xb7625928, oparg=1)
        at ../Python/ceval.c:3750
    
  • Daniele Varrazzo

    Daniele Varrazzo May 29th, 2011 @ 09:30 AM

    The above traceback doesn't seem a deadlock: thread 3 is stuck in a poll.

  • ronin-150602 (at lighthouseapp)

    ronin-150602 (at lighthouseapp) May 29th, 2011 @ 09:54 AM

    Hi Daniele,

    Thank you for your help. :)

    If it helps, we have noticed that at the time the poll() call occurs, the PostgreSQL server thinks that all its client connections are closed. Meaning the poll() call may be waiting on a socket that's not connected anymore, although I have no idea how that is even possible.

  • Daniele Varrazzo

    Daniele Varrazzo May 29th, 2011 @ 10:23 PM

    In my test the connection to the server was still open: with a pg_terminate_backend() the client is released (with an exception of course). So it looks like another manifestation of loss of sync with the protocol, but I guess this can only be verified debugging with wireshark. Doesn't sound easy :-/

  • ronin-150602 (at lighthouseapp)

    ronin-150602 (at lighthouseapp) May 30th, 2011 @ 07:47 AM

    Aw, darn.

    Given the other symptoms (traceback, etc) that sometimes occur instead of the deadlock, I was hoping for a 'simple' concurrency issue, perhaps a missing lock somewhere, with the inconsistent connection state being merely a side effect. That would presumably be easier to fix (for an admittedly loose definition thereof) than a loss of sync at the protocol level...

    If there is something I can do to help, please let me know.

  • Daniele Varrazzo

    Daniele Varrazzo June 4th, 2011 @ 12:58 AM

    • Tag set to multithread
    • State changed from “open” to “resolved”

    The problem should be fixed.

    We were processing notifies after releasing the connection lock, together with the result process. But, while the result is in a read-only structure, processing notifies implies parsing the connection, possibly disrupting the communication for other threads.

    Attached a patch that has been useful to follow the debug trace in many threads. Not committing it as not portable.

  • ronin-150602 (at lighthouseapp)

    ronin-150602 (at lighthouseapp) June 6th, 2011 @ 09:18 AM

    Wonderful! Thank you, Daniele. Is there a branch I can pull to test the fix here?

  • ronin-150602 (at lighthouseapp)

    ronin-150602 (at lighthouseapp) June 6th, 2011 @ 09:36 AM

    Wait, I found it, sorry for the noise.

    I've compiled devel from git and run my test script on it, and can confirm that the bug no longer occurs here. :)

  • Garrett Cooper

    Garrett Cooper January 21st, 2014 @ 11:10 PM

    I wasn't able to reproduce this issue with python 2.7.3 / psycopg2 on Debian Lenny 2.6 at least. Was this resolved on/after v2.4.5?

    Here's my test program (it's a more complicated version based on what was provided above).

    $ python -c 'import psycopg2 as pg; print(pg.__version__)'
    2.4.5 (dt dec pq3 ext)
    $ cat test_issue_55.py
    #!/usr/bin/env python
    
    #from multiprocessing import Process
    from multiprocessing import Queue, TimeoutError
    try:
        from queue import Empty
    except ImportError:
        from Queue import Empty
    import sys
    from threading import Thread
    import psycopg2
    
    CONNECTION = None
    COUNT = 1000000
    
    def done(i):
        return "%d: here's a bone!" % (i, )
    
    def pat(i):
        return '%d: pat pat' % (i, )
    
    def db_select_loop(watchdog_queue, i):
    
        for j in xrange(COUNT):
            CONNECTION.cursor().execute("SELECT 1")
            watchdog_queue.put(pat(i))
        watchdog_queue.put(done(i))
    
    
    def watchdog_loop(watchdog_queue, i, dproc):
    
        epat_msgs = {
            'done': done(i),
            'pat': pat(i),
        }
        j = 0
        while True:
            try:
                pat_msg = watchdog_queue.get(timeout=60)
                if pat_msg not in epat_msgs.values():
                    raise AssertionError('The pat message did not match one of '
                                         'the expected messages: `%s` != `%s`'
                                         % (pat_msg,
                                            ','.join(['"%s"' % (m, ) for m in
                                                                     epat_msgs]), ))
                if pat_msg == epat_msgs['done']:
                    sys.stderr.write('DB worker %d is done\n' % (i, ))
                    return
                elif j == 1000:
                    sys.stderr.write('DB worker %d gave me a few pats on the head '
                                     ';)..\n' % (i, ))
                    j = 0
                else:
                    j += 1
            except (AssertionError, Empty, TimeoutError) as e:
                sys.stderr.write('Watchdog (%s) timed out waiting for pat: %s\n'
                                 % (dproc.name, str(e), ))
                while dproc.is_alive():
                    if hasattr(dproc, 'terminate'):
                        dproc.terminate()
                    else:
                        dproc.join()
                break
    
    
    def main():
    
        global CONNECTION
    
        conn_dict = dict(
    host='XXXXXX',
    port='5432',
    user='XXXXXX',
    password='XXXXXX',
    database='XXXXXX',
        )
    
        CONNECTION = psycopg2.connect(**conn_dict)
    
        # Won't work with multiprocessing.Process by design ;)..
        #cls = Process
        cls = Thread
    
        procs = []
        for i in xrange(150):
            queue = Queue()
    
            dproc = cls(target=db_select_loop, args=(queue, i, ))
            dproc.start()
            procs.append(dproc)
    
            wproc = cls(target=watchdog_loop, args=(queue, i, dproc, ))
            wproc.start()
            procs.append(wproc)
    
    
        while procs:
            for proc in procs:
                if proc.is_alive():
                    proc.join(10)
                if not proc.is_alive():
                    procs.remove(proc)
    
    
    if __name__ == '__main__':
        main()
    
  • Garrett Cooper

    Garrett Cooper January 22nd, 2014 @ 02:37 AM

    Yeah... 2.2.2 is definitely bugged in an obvious way (severe database contention, poor performance). Thanks for the bug report PVaret and the fix Daniele!

Create your profile

Help contribute to this project by taking a few moments to create your personal profile. Create your profile ยป

<b>WARNING:</b> the informations in this tracker are archived. Please submit new tickets or comments to <a href="https://github.com/psycopg/psycopg2/issues">the new tracker</a>.
<br/>
Psycopg is the most used PostgreSQL adapter for the Python programming language. At the core it fully implements the Python DB API 2.0 specifications. Several extensions allow access to many of the features offered by PostgreSQL.

Shared Ticket Bins

People watching this ticket

Attachments

Pages