2010JulAug EP log

When running ./RTscheduler 60000, the system hangs at a random time after start exposure. Nothing is logged at the moment the machine hangs; however, in /var/log/kern.log I find two messages:

Jul 25 09:37:24 mirkwood kernel: ( astropci_check_reply_flags ) status: 0x0

Printed by this function:

/******************************************************************************

FUNCTION: ASTROPCI_CHECK_REPLY_FLAGS

PURPOSE:  Check the current PCI DSP status. Uses HSTR HTF bits 3,4,5.

RETURNS:  Returns DON if HTF bits are a 1 and command successfully completed.
          Returns RDR if HTF bits are a 2 and a reply needs to be read.
          Returns ERR if HTF bits are a 3 and command failed.
          Returns SYR if HTF bits are a 4 and a system reset occurred.

NOTES:    This function must be called after sending a command to the PCI
          board or controller.

******************************************************************************/

static int astropci_check_reply_flags( int devnum )
{
	uint32_t status = 0;
	int reply = TIMEOUT;

	do {
		astropci_printf( "( astropci_check_reply_flags ) status: 0x%X\n", status ); // sds - Oct 23, 2008

		status = astropci_wait_for_condition( devnum, CHECK_REPLY );

		if ( status == DONE_STATUS )
			reply = DON;
		else if ( status == READ_REPLY_STATUS )
			reply = RDR;
		else if ( status == ERROR_STATUS )
			reply = ERR;
		else if ( status == SYSTEM_RESET_STATUS )
			reply = SYR;
		else if ( status == READOUT_STATUS )
			reply = READOUT;

		// Clear the status bits if not in READOUT
		if ( reply != READOUT )
			Write_HCVR( devnum, ( uint32_t )CLEAR_REPLY_FLAGS );

	} while ( status == BUSY_STATUS );

	return reply;
}

The returned code 0x0 corresponds to TIMEOUT_STATUS:

enum { TIMEOUT_STATUS = 0, DONE_STATUS, READ_REPLY_STATUS, ERROR_STATUS, SYSTEM_RESET_STATUS, READOUT_STATUS, BUSY_STATUS };
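The header comment above says the reply is encoded in HTF bits 3, 4 and 5 of the HSTR. A minimal user-space sketch of that decoding follows. The shift and mask are assumptions derived from that comment only (bit 3 is taken as the least significant bit of the field), and htf_field and htf_name are hypothetical helper names, not driver functions.

```c
#include <stdint.h>

/* Assumption: HTF occupies HSTR bits 3..5, with bit 3 as the LSB. */
#define HTF_SHIFT 3u
#define HTF_MASK  0x7u

/* Extract the 3-bit HTF field from a raw HSTR value. */
static unsigned int htf_field(uint32_t hstr)
{
    return (hstr >> HTF_SHIFT) & HTF_MASK;
}

/* Map an HTF value to the reply name listed in the driver comment. */
static const char *htf_name(unsigned int htf)
{
    switch (htf) {
    case 1:  return "DON";      /* command completed */
    case 2:  return "RDR";      /* a reply needs to be read */
    case 3:  return "ERR";      /* command failed */
    case 4:  return "SYR";      /* system reset occurred */
    default: return "unlisted"; /* not described in the comment */
    }
}
```

With these assumptions, htf_name(htf_field(0x18)) yields "ERR", since 0x18 has bits 3 and 4 set and the field value is 3.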

I also get this message:

Jul 25 09:37:25 mirkwood kernel: (Write_HCVR): HCVR not ready. Count: 0 Value: 0x8073

in:

/******************************************************************************

FUNCTION: WRITE_HCVR

PURPOSE:  Writes a 32-bit value to the HCVR. Checks that the HCVR register
          bit 1 is not set, otherwise a command is still in the register.
          Calls WriteRegister_32.

RETURNS:  None

******************************************************************************/

static int Write_HCVR( int devnum, unsigned int regVal )
{
	unsigned int currentHcvrValue = 0;
	int i, status = -EIO;

	for ( i=0; i<100; i++ ) {
		currentHcvrValue = ReadRegister_32( devices[ devnum ].ioaddr + HCVR );

		if ( ( currentHcvrValue & ( unsigned int )0x1 ) == 0 ) {
			status = 0;
			break;
		}

		astropci_printf( "(Write_HCVR): HCVR not ready. Count: %d Value: 0x%X\n", i, currentHcvrValue );
	}

	if ( status == 0 )
		WriteRegister_32( regVal, devices[ devnum ].ioaddr + HCVR );

	return status;
}
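As a worked check of the logged value: Write_HCVR counts a read as "not ready" when the mask 0x1 matches, and 0x8073 & 0x1 is 1, so that read was busy. (The header comment says "bit 1", while the code tests the least significant bit.) The sketch below mirrors the test and the retry loop in plain user-space C. hcvr_busy, hcvr_wait_ready and the fake register are hypothetical names, and 0x8072 is an invented "ready" value used only for illustration.

```c
#include <stdint.h>

/* Same test as in Write_HCVR: non-zero when the lowest HCVR bit is set. */
static int hcvr_busy(uint32_t hcvr)
{
    return (hcvr & 0x1u) != 0;
}

/* Hypothetical stand-in for ReadRegister_32 on the HCVR. */
typedef uint32_t (*hcvr_read_fn)(void *ctx);

/* Poll up to max_tries times; return the index of the first ready read,
 * or -1 if the register never became ready. */
static int hcvr_wait_ready(hcvr_read_fn read_hcvr, void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (!hcvr_busy(read_hcvr(ctx)))
            return i;
    }
    return -1;
}

/* Fake register that replays a fixed sequence of values. */
struct fake_hcvr {
    const uint32_t *values;
    int count;
    int next;
};

static uint32_t fake_hcvr_read(void *ctx)
{
    struct fake_hcvr *f = ctx;
    if (f->next < f->count)
        return f->values[f->next++];
    return f->values[f->count - 1]; /* keep repeating the last value */
}
```

Replaying the sequence 0x8073, 0x8072 through hcvr_wait_ready returns 1: one busy read (the "Count: 0" message), then a ready one.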

I found this (useful?) comment:

30-Aug-2005 sds 1.7 Added Read/Write register functions, which include
delays before reading/writing the PCI DSP registers
(HCTR, HSTR, etc). Also includes checking bit 1 of
the HCVR, if it's set, do not write to the HCVR register.
Also did general cleanup, including re-writing the
astropci_wait_for_condition function. Updated for current
kernel PCI API.

And this one:

/******************************************************************************

FUNCTION: ASTROPCI_IOCTL()

PURPOSE:  Entry point. Control a character device.

RETURNS:  Returns 0 for success, or the appropriate error number.

NOTES:    The spinlocks have been removed because they shouldn't be used
          here since the functions used here can sleep. This will cause a
          processor to spin forever and deadlock when two PCI boards are
          active simultaneously. This is because the spin lock is global.
          A mutex (semaphore) can be used here, but it causes the load/unload
          process to result in WriteHCVR failure for some reason! Frankly,
          I don't think any locking is needed since each instance of the
          driver accesses different hardware and each instance is only
          opened by one program at a time.

******************************************************************************/

The log files of the old mirc software do not have any of these entries. The CHAMP log file has:

Oct 15 20:11:38 champ [<f8901392>] astropci_check_reply_flags+0x2e/0x57 [astropci]

This is similar but not the same (the printk output in the kernel seems different; does this indicate a modified kernel?)
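The CHAMP line appears to have the shape of a kernel stack-trace entry (address in brackets, then symbol+offset/size, then the module name) rather than the text printed by astropci_printf. A small sketch that splits such a line into its fields is shown below. parse_trace_entry and struct trace_entry are hypothetical names, and the field layout is an assumption based on this one line.

```c
#include <stdio.h>
#include <string.h>

/* Fields of one "[<addr>] symbol+0xoff/0xsize [module]" entry. */
struct trace_entry {
    unsigned long addr;   /* address printed inside [< >] */
    char symbol[64];      /* function name */
    unsigned int offset;  /* offset into the function */
    unsigned int size;    /* size printed after the slash */
    char module[32];      /* module name in the trailing brackets */
};

/* Returns 1 if the line contains a complete entry, 0 otherwise. */
static int parse_trace_entry(const char *line, struct trace_entry *out)
{
    const char *start = strstr(line, "[<");
    if (start == NULL)
        return 0;
    /* %x accepts the 0x prefix; %[^]] reads up to the closing bracket. */
    return sscanf(start, "[<%lx>] %63[^+]+%x/%x [%31[^]]]",
                  &out->addr, out->symbol, &out->offset,
                  &out->size, out->module) == 5;
}
```

Applied to the CHAMP line, this gives symbol astropci_check_reply_flags, offset 0x2e, size 0x57 and module astropci.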

27/07/10

Moved the driver to my local directory (kernel) for testing. Removed astropci0 from /lib/udev/devices/ (to avoid loading the driver at boot time).

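"( astropci_check_reply_flags ) status: 0x0" was not an error message. It was always zero because of a mistake in the driver: in the listing of astropci_check_reply_flags, the status is printed before astropci_wait_for_condition() assigns it, so the first pass through the loop always prints the initial value 0.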
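One reading of the driver listing that explains the constant 0x0 is that the status is formatted before it has been read. The user-space sketch below contrasts the two orderings. stub_wait is a hypothetical stand-in for astropci_wait_for_condition() that always reports DONE_STATUS, and snprintf replaces astropci_printf so the message can be inspected.

```c
#include <stdint.h>
#include <stdio.h>

enum { TIMEOUT_STATUS = 0, DONE_STATUS, READ_REPLY_STATUS, ERROR_STATUS,
       SYSTEM_RESET_STATUS, READOUT_STATUS, BUSY_STATUS };

/* Hypothetical stand-in for astropci_wait_for_condition(). */
static uint32_t stub_wait(void)
{
    return DONE_STATUS;
}

/* Ordering as in the listing: format the message, then read the status. */
static uint32_t logged_before(char *buf, size_t len)
{
    uint32_t status = 0;
    snprintf(buf, len, "status: 0x%X", (unsigned int)status);
    status = stub_wait();
    return status;
}

/* Reordered: read the status first, then format it. */
static uint32_t logged_after(char *buf, size_t len)
{
    uint32_t status = stub_wait();
    snprintf(buf, len, "status: 0x%X", (unsigned int)status);
    return status;
}
```

Both functions return DONE_STATUS, but logged_before writes "status: 0x0" while logged_after writes "status: 0x1".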

28/07/10

Commented out the ioctl in RTscheduler.c:

//ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);

It seems as if this line is causing the "HCVR not ready" error.

Mirkwood has not crashed a single time since commenting out the ioctl. A possible cause of the problem is that ioctl is a call to the Linux kernel and switches the task to secondary mode (soft real time). Conversely, rt_task_sleep() is a call to the real-time kernel (scheduler) and switches the task to primary mode (hard real time). The rapid switching between the two modes may be causing the problem. See:

while (astropci_reply == astropci_reply_prev){

   rt_task_sleep(2000); // .01 musec 
   astropci_reply++;
   //ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);

}

Experiment: put the ioctl back in and comment out the rt_task_sleep:

while (astropci_reply == astropci_reply_prev){

   //rt_task_sleep(2000); // .01 musec 
   ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);

}

It has not crashed for about 15 minutes now!!! It did not crash, but eventually it hung, as it did when using the interrupt in user space.

I also got this message:

Message from syslogd@mirkwood at Jul 28 19:06:40 ...
kernel:Disabling IRQ #20

29/07/10

The hang was caused by the context switching between ioctl and rt_task_sleep. Now the program does not crash, but it seems to stop at the ioctl, so I will add print statements to test this possibility.
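One way to check whether the program is stuck in the polling loop is to bound the loop and count the polls, so that a print after the loop can report how it ended. The sketch below is a plain user-space stand-in: poll_fn represents the ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, ...) call, and poll_until_change, max_polls and the fake board are hypothetical names introduced for illustration.

```c
/* Hypothetical stand-in for the ioctl: fills *frames_read, returns 0 on success. */
typedef int (*poll_fn)(void *ctx, int *frames_read);

/* Poll until the value differs from prev.
 * Returns the number of polls used, -1 if max_polls was reached
 * without a change, or -2 if the poll call itself failed. */
static long poll_until_change(poll_fn poll, void *ctx, int prev,
                              long max_polls, int *latest)
{
    for (long n = 1; n <= max_polls; n++) {
        if (poll(ctx, latest) != 0)
            return -2;
        if (*latest != prev)
            return n;
    }
    return -1;
}

/* Fake board whose counter changes from 0 to 1 on the given call. */
struct fake_board {
    int calls;
    int change_after;
};

static int fake_poll(void *ctx, int *frames_read)
{
    struct fake_board *b = ctx;
    b->calls++;
    *frames_read = (b->calls >= b->change_after) ? 1 : 0;
    return 0;
}
```

Printing the return value after each call would show whether the loop exits normally (a positive count), gives up (-1), or never returns at all, which would point at the ioctl itself.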

Lost all of today's wiki edits during an auto-logout from the wiki page.

30/07/10

Test whether the HCVR message is relevant to the crash:

Add a user-space ISR handler.

/***********************************************************************
*     Task to transfer data from astropci on interrupt
***********************************************************************/
void interrupt_task(void *cookie)
{
  int err;

  while (!exc.endProg) {
    // blocking interrupt handler
    err = rt_intr_wait(&intr_desc, TM_INFINITE);
    // rt_intr_wait returns the number of pending interrupts on success,
    // so a value <= 0 means the wait failed
    if (0 >= err) {
      rt_printf("Timeout on data interrupt!!!\n");
      break;
    } //else rt_printf("Interrupt OK\n");
    rt_sem_v(&switch_sem);
  }
}

Any IRQ 20 raised by the astropci board will signal the semaphore.

I put the semaphore just before the ioctl while loop:

// blocking semaphore from interrupt handler
err = rt_sem_p(&switch_sem, timeout);

astropci_reply_prev=astropci_reply;

// poll astropci (some latency here.. will need to average)
while (astropci_reply == astropci_reply_prev){
  ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);
}

The "HCVR not ready" error no longer shows up in the kernel messages. After a while I get a kernel crash, probably due to context switching from Xenomai to the Linux kernel.

31/07/10

The program seems to crash after no interrupt is received (the semaphore times out after 10 seconds). Does this have anything to do with the "kernel:Disabling IRQ #20" message?

I commented out the spin_lock calls in the driver as a test, since they may conflict with Xenomai.
