2010JulAug EP log: Difference between revisions

From Monnier Group Research Wiki

Revision as of 02:23, 29 July 2010

When running ./RTscheduler 60000, the system hangs at a random time after the exposure starts. No event is reported in the logs when the machine hangs; however, in /var/log/kern.log I find two messages. The first is:

Jul 25 09:37:24 mirkwood kernel: ( astropci_check_reply_flags ) status: 0x0

Printed by this function:

/******************************************************************************

FUNCTION: ASTROPCI_CHECK_REPLY_FLAGS

PURPOSE:  Check the current PCI DSP status. Uses HSTR HTF bits 3,4,5.

RETURNS:  Returns DON if HTF bits are a 1 and command successfully completed.
          Returns RDR if HTF bits are a 2 and a reply needs to be read.
          Returns ERR if HTF bits are a 3 and command failed.
          Returns SYR if HTF bits are a 4 and a system reset occurred.

NOTES:    This function must be called after sending a command to the PCI
          board or controller.
******************************************************************************/

static int astropci_check_reply_flags( int devnum )
{
    uint32_t status = 0;
    int reply = TIMEOUT;

    do
    {
        astropci_printf( "( astropci_check_reply_flags ) status: 0x%X\n", status ); // sds - Oct 23, 2008

        status = astropci_wait_for_condition( devnum, CHECK_REPLY );

        if ( status == DONE_STATUS )
            reply = DON;

        else if ( status == READ_REPLY_STATUS )
            reply = RDR;

        else if ( status == ERROR_STATUS )
            reply = ERR;

        else if ( status == SYSTEM_RESET_STATUS )
            reply = SYR;

        else if ( status == READOUT_STATUS )
            reply = READOUT;

        // Clear the status bits if not in READOUT
        if ( reply != READOUT )
            Write_HCVR( devnum, ( uint32_t )CLEAR_REPLY_FLAGS );

    } while ( status == BUSY_STATUS );

    return reply;
}

The returned code 0x0 corresponds to TIMEOUT_STATUS:

enum {
    TIMEOUT_STATUS = 0,
    DONE_STATUS,
    READ_REPLY_STATUS,
    ERROR_STATUS,
    SYSTEM_RESET_STATUS,
    READOUT_STATUS,
    BUSY_STATUS
};

I also get this message:

Jul 25 09:37:25 mirkwood kernel: (Write_HCVR): HCVR not ready. Count: 0 Value: 0x8073

in:

/******************************************************************************

FUNCTION: WRITE_HCVR

PURPOSE:  Writes a 32-bit value to the HCVR. Checks that the HCVR register
          bit 1 is not set, otherwise a command is still in the register.
          Calls WriteRegister_32.

RETURNS:  None
******************************************************************************/

static int Write_HCVR( int devnum, unsigned int regVal )
{
    unsigned int currentHcvrValue = 0;
    int i, status = -EIO;

    for ( i=0; i<100; i++ )
    {
        currentHcvrValue = ReadRegister_32( devices[ devnum ].ioaddr + HCVR );

        if ( ( currentHcvrValue & ( unsigned int )0x1 ) == 0 )
        {
            status = 0;
            break;
        }

        astropci_printf( "(Write_HCVR): HCVR not ready. Count: %d Value: 0x%X\n", i, currentHcvrValue );
    }

    if ( status == 0 )
        WriteRegister_32( regVal, devices[ devnum ].ioaddr + HCVR );

    return status;
}
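
To see why the loop printed "HCVR not ready" for the value in the log line above, the same mask test can be applied to 0x8073 outside the driver. This is only a minimal user-space sketch: `hcvr_is_busy` and `report_hcvr` are hypothetical helpers, not driver functions, and the mask 0x1 is copied from the listing (the header comment calls it "bit 1", which suggests the bits are counted from one).

```c
#include <stdint.h>
#include <stdio.h>

/* Mask copied from Write_HCVR above: the driver treats a set low bit
 * as "a command is still in the register". */
#define HCVR_BUSY_MASK 0x1u

/* Hypothetical helper (not part of astropci): returns 1 when the
 * driver's readiness test would fail for this HCVR value. */
int hcvr_is_busy(uint32_t hcvr_value)
{
    return (hcvr_value & HCVR_BUSY_MASK) != 0;
}

/* Hypothetical helper: print the verdict for one register value. */
void report_hcvr(uint32_t hcvr_value)
{
    printf("HCVR 0x%X: %s\n", (unsigned)hcvr_value,
           hcvr_is_busy(hcvr_value) ? "not ready" : "ready");
}
```

Calling `report_hcvr(0x8073)` reports the register as not ready, which is consistent with the kern.log entry.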

I found this (useful?) comment:

 * 30-Aug-2005 sds 1.7 Added Read/Write register functions, which include
 * delays before reading/writing the PCI DSP registers
 * (HCTR, HSTR, etc). Also includes checking bit 1 of
 * the HCVR, if it's set, do not write to the HCVR register.
 * Also did general cleanup, including re-writing the
 * astropci_wait_for_condition function. Updated for current
 * kernel PCI API.

And this other:

/******************************************************************************

FUNCTION: ASTROPCI_IOCTL()

PURPOSE:  Entry point. Control a character device.

RETURNS:  Returns 0 for success, or the appropriate error number.

NOTES:    The spinlocks have been removed because they shouldn't be used
          here since the functions used here can sleep. This will cause a
          processor to spin forever and deadlock when two PCI boards are
          active simultaneously. This is because the spin lock is global.
          A mutex (semaphore) can be used here, but it causes the load/unload
          process to result in WriteHCVR failure for some reason! Frankly,
          I don't think any locking is needed since each instance of the
          driver accesses different hardware and each instance is only
          opened by one program at a time.
******************************************************************************/

The log files of the old mirc software do not have any of these entries. The CHAMP log file has:

Oct 15 20:11:38 champ [<f8901392>] astropci_check_reply_flags+0x2e/0x57 [astropci]

This is similar but not the same (the printk function in the kernel seems to be different, which may indicate a modified kernel?)

27/07/10

Moved the driver to my local directory (kernel) to test it. Removed astropci0 from /lib/udev/devices/ (to avoid loading the driver at boot time).

( astropci_check_reply_flags ) status: 0x0 was not an error message. The printed value was always zero because of a mistake in the driver.
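
One reading of the listing above that would explain this: the printf sits at the top of the loop, before astropci_wait_for_condition assigns `status`, so the first pass always logs the initial value 0. The sketch below reproduces only that ordering in user space. `fake_wait_for_condition` is a hypothetical stand-in for the driver call, and the two `logged_*` functions are illustrative names, not driver code.

```c
#include <stdint.h>

/* Status codes copied from the driver listing above. */
enum { TIMEOUT_STATUS = 0, DONE_STATUS, READ_REPLY_STATUS, ERROR_STATUS,
       SYSTEM_RESET_STATUS, READOUT_STATUS, BUSY_STATUS };

/* Hypothetical stand-in for astropci_wait_for_condition(): always
 * reports that the command completed. */
uint32_t fake_wait_for_condition(void)
{
    return DONE_STATUS;
}

/* Same ordering as the listing: the value is sampled for logging first,
 * and the real status is fetched afterwards. Returns the value that
 * would have been logged on the first pass. */
uint32_t logged_before_assignment(void)
{
    uint32_t status = 0;
    uint32_t logged = status;            /* what the printf sees: 0 */
    status = fake_wait_for_condition();  /* real status arrives too late */
    (void)status;
    return logged;
}

/* Alternative ordering: fetch the status first, then log it. */
uint32_t logged_after_assignment(void)
{
    uint32_t status = fake_wait_for_condition();
    return status;
}
```

With this ordering the first logged value equals TIMEOUT_STATUS regardless of what the hardware reports, so a 0x0 in kern.log would not by itself indicate a timeout.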

28/07/10

Commented out the ioctl in RTscheduler.c:

//ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);

It seems as if this line is causing the "HCVR not ready" error.

Mirkwood has not crashed a single time since the ioctl was commented out. A possible cause of the problem is that ioctl is a call into the Linux kernel and switches the task to secondary mode (soft real-time). Conversely, rt_task_sleep() is a call into the real-time kernel (scheduler) and switches the task to primary mode (hard real-time). The rapid switching between the two modes may be causing the problem. See:

while (astropci_reply == astropci_reply_prev){

   rt_task_sleep(2000); // .01 musec 
   astropci_reply++;
   //ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);

}

Experiment: put the ioctl back in and comment out the rt_task_sleep:

while (astropci_reply == astropci_reply_prev){

   //rt_task_sleep(2000); // .01 musec 
   ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);

}

It has not crashed for about 15 minutes now!!! Update: it did not crash, but eventually it hung, as it did when using the interrupt in user space.

I also got this message:

Message from syslogd@mirkwood at Jul 28 19:06:40 ...
kernel:Disabling IRQ #20
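
Both polling loops above spin until the frame counter changes, with no upper bound, so a counter that never advances keeps the task spinning indefinitely. As a hedged illustration only (plain C, no Xenomai calls, not tested on mirkwood), a bounded version of the wait might look like this. `wait_for_new_frame`, `frame_reader_fn`, and the demo counter are hypothetical names and are not part of RTscheduler.c or the astropci driver.

```c
/* Hypothetical callback type: returns the current frame counter.
 * In RTscheduler.c this role is played by the
 * ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, ...) call. */
typedef int (*frame_reader_fn)(void *ctx);

/* Poll until the counter differs from prev or max_polls is reached.
 * Returns 0 and stores the new counter in *out on a change,
 * or -1 if the counter never changed within max_polls reads. */
int wait_for_new_frame(frame_reader_fn read_frames, void *ctx,
                       int prev, long max_polls, int *out)
{
    long i;
    for (i = 0; i < max_polls; i++) {
        int now = read_frames(ctx);
        if (now != prev) {
            *out = now;
            return 0;
        }
    }
    return -1;
}

/* Demo reader: the counter advances after a fixed number of calls. */
struct demo_counter { int calls; int advance_after; };

int demo_read_frames(void *ctx)
{
    struct demo_counter *c = ctx;
    c->calls++;
    return (c->calls >= c->advance_after) ? 1 : 0;
}
```

A caller that gets -1 back can log the timeout and stop the exposure instead of spinning; whether that would avoid the hang seen here is untested.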