2010JulAug EP log
Latest revision as of 21:41, 4 August 2010
When running ./RTscheduler 60000 the system hangs at a random time after the start of the exposure. No event is reported in the logs when the machine hangs; however, in /var/log/kern.log I find two messages:
Jul 25 09:37:24 mirkwood kernel: ( astropci_check_reply_flags ) status: 0x0
Printed by this function:
/******************************************************************************
 FUNCTION: ASTROPCI_CHECK_REPLY_FLAGS
 PURPOSE:  Check the current PCI DSP status. Uses HSTR HTF bits 3,4,5.
 RETURNS:  Returns DON if HTF bits are a 1 and command successfully completed.
           Returns RDR if HTF bits are a 2 and a reply needs to be read.
           Returns ERR if HTF bits are a 3 and command failed.
           Returns SYR if HTF bits are a 4 and a system reset occurred.
 NOTES:    This function must be called after sending a command to the PCI
           board or controller.
*/
static int astropci_check_reply_flags( int devnum )
{
    uint32_t status = 0;
    int reply = TIMEOUT;

    do {
        astropci_printf( "( astropci_check_reply_flags ) status: 0x%X\n", status ); // sds - Oct 23, 2008
        status = astropci_wait_for_condition( devnum, CHECK_REPLY );

        if ( status == DONE_STATUS ) reply = DON;
        else if ( status == READ_REPLY_STATUS ) reply = RDR;
        else if ( status == ERROR_STATUS ) reply = ERR;
        else if ( status == SYSTEM_RESET_STATUS ) reply = SYR;
        else if ( status == READOUT_STATUS ) reply = READOUT;

        // Clear the status bits if not in READOUT
        if ( reply != READOUT )
            Write_HCVR( devnum, ( uint32_t )CLEAR_REPLY_FLAGS );
    } while ( status == BUSY_STATUS );

    return reply;
}
The returned code 0x0 corresponds to TIMEOUT_STATUS:
enum {
    TIMEOUT_STATUS = 0,
    DONE_STATUS,
    READ_REPLY_STATUS,
    ERROR_STATUS,
    SYSTEM_RESET_STATUS,
    READOUT_STATUS,
    BUSY_STATUS
};
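To make the raw number in the kernel log easier to read, the enum can be mapped back to its names. This is a minimal sketch, assuming the enum values are exactly as listed above; status_name is a hypothetical helper and not part of the astropci driver.

```c
/* Status codes as listed in the driver enum. */
enum {
    TIMEOUT_STATUS = 0,
    DONE_STATUS,
    READ_REPLY_STATUS,
    ERROR_STATUS,
    SYSTEM_RESET_STATUS,
    READOUT_STATUS,
    BUSY_STATUS
};

/* Hypothetical helper: translate a logged status code into its enum name. */
static const char *status_name(unsigned int status)
{
    static const char *const names[] = {
        "TIMEOUT_STATUS", "DONE_STATUS", "READ_REPLY_STATUS", "ERROR_STATUS",
        "SYSTEM_RESET_STATUS", "READOUT_STATUS", "BUSY_STATUS"
    };

    if (status < sizeof names / sizeof names[0])
        return names[status];
    return "UNKNOWN";
}
```

Calling status_name(0x0) gives the name for the value seen in kern.log; any value outside the enum is reported as "UNKNOWN" rather than indexing past the table.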
I also get this message:
Jul 25 09:37:25 mirkwood kernel: (Write_HCVR): HCVR not ready. Count: 0 Value: 0x8073
in:
/******************************************************************************
 FUNCTION: WRITE_HCVR
 PURPOSE:  Writes a 32-bit value to the HCVR. Checks that the HCVR register
           bit 1 is not set, otherwise a command is still in the register.
           Calls WriteRegister_32.
 RETURNS:  None
*/
static int Write_HCVR( int devnum, unsigned int regVal )
{
    unsigned int currentHcvrValue = 0;
    int i, status = -EIO;

    for ( i=0; i<100; i++ ) {
        currentHcvrValue = ReadRegister_32( devices[ devnum ].ioaddr + HCVR );

        if ( ( currentHcvrValue & ( unsigned int )0x1 ) == 0 ) {
            status = 0;
            break;
        }

        astropci_printf( "(Write_HCVR): HCVR not ready. Count: %d Value: 0x%X\n", i, currentHcvrValue );
    }

    if ( status == 0 )
        WriteRegister_32( regVal, devices[ devnum ].ioaddr + HCVR );

    return status;
}
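The retry logic in Write_HCVR can be tried outside the kernel with a simulated register. This is only a sketch under stated assumptions: fake_hcvr, fake_read and wait_hcvr_ready are hypothetical names, and the fake register simply reports busy for a fixed number of reads. The mask 0x1 and the limit of 100 tries are taken from the driver code above. Note that the logged value 0x8073 has the 0x1 bit set, which is the condition the driver reports as "not ready".

```c
#include <stdint.h>

#define HCVR_BUSY_BIT  0x1u  /* same mask the driver tests */
#define HCVR_MAX_TRIES 100   /* same retry count as the driver loop */

/* Hypothetical stand-in for the hardware register: it reports busy for
 * busy_reads_left reads and ready afterwards. */
struct fake_hcvr {
    uint32_t value;
    int busy_reads_left;
};

static uint32_t fake_read(struct fake_hcvr *r)
{
    if (r->busy_reads_left > 0) {
        r->busy_reads_left--;
        return r->value | HCVR_BUSY_BIT;
    }
    return r->value & ~HCVR_BUSY_BIT;
}

/* Returns the number of reads needed until the register was ready,
 * or -1 if it was still busy after HCVR_MAX_TRIES reads. */
static int wait_hcvr_ready(struct fake_hcvr *r)
{
    for (int i = 0; i < HCVR_MAX_TRIES; i++) {
        if ((fake_read(r) & HCVR_BUSY_BIT) == 0)
            return i + 1;
    }
    return -1;
}
```

The driver loop has no delay between reads, so 100 tries can elapse very quickly; the sketch keeps that behaviour rather than adding a delay the original does not have.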
I found this (useful?) comment:
* 30-Aug-2005 sds 1.7 Added Read/Write register functions, which include
*                     delays before reading/writing the PCI DSP registers
*                     (HCTR, HSTR, etc). Also includes checking bit 1 of
*                     the HCVR, if it's set, do not write to the HCVR register.
*                     Also did general cleanup, including re-writing the
*                     astropci_wait_for_condition function. Updated for current
*                     kernel PCI API.
And this other:
/******************************************************************************
 FUNCTION: ASTROPCI_IOCTL()
 PURPOSE:  Entry point. Control a character device.
 RETURNS:  Returns 0 for success, or the appropriate error number.
 NOTES:    The spinlocks have been removed because they shouldn't be used here
           since the functions used here can sleep. This will cause a processor
           to spin forever and deadlock when two PCI boards are active
           simultaneously. This is because the spin lock is global. A mutex
           (semaphore) can be used here, but it causes the load/unload process
           to result in WriteHCVR failure for some reason! Frankly, I don't
           think any locking is needed since each instance of the driver
           accesses different hardware and each instance is only opened by one
           program at a time.
*/
The log files of the old mirc software do not have any of these entries. The CHAMP log file has:
Oct 15 20:11:38 champ [<f8901392>] astropci_check_reply_flags+0x2e/0x57 [astropci]
This is similar but not the same (the printk function in the kernel seems different; does this indicate a kernel modification?).
27/07/10
Moved driver to my local directory (kernel) to do testings on it. Removed astropci0 from /lib/udev/devices/ (to avoid loading the driver at boot time).
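"( astropci_check_reply_flags ) status: 0x0" was not an error message. It was always at zero because there was a mistake in the driver (in the code quoted above, the message is printed before status is assigned, so the first print shows the initial value 0).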
28/07/10
Commented out the ioctl in RTscheduler.c:
//ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);
It seems as if this line is causing the "HCVR not ready" error.
Mirkwood has not crashed a single time since commenting out the ioctl. A possible cause of the problem is that ioctl is a call to the Linux kernel and will switch the task to secondary mode (soft realtime). Conversely, rt_task_sleep() is a call to the realtime kernel (scheduler) and will switch it to primary mode (hard realtime). The rapid switching between the two contexts may be causing the problem. See:
while (astropci_reply == astropci_reply_prev){
    rt_task_sleep(2000); // .01 musec
    astropci_reply++;
    //ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);
}
Experiment: put the ioctl back in and comment out the rt_task_sleep:
while (astropci_reply == astropci_reply_prev){
    //rt_task_sleep(2000); // .01 musec
    ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);
}
It has not crashed for about 15 minutes now!!! It did not crash, but eventually it hung, as it does when using the interrupt in user space.
I also got this message:
Message from syslogd@mirkwood at Jul 28 19:06:40 ...
kernel:Disabling IRQ #20
29/07/10
The hang was caused by the context switching between ioctl and rt_task_sleep. Now the program does not crash, but it seems to stop at the ioctl, so I will put in print statements to test this possibility.
Lost all of today's wiki edits during an auto-logout from the wiki page.
30/07/10
Test whether the HCVR message is relevant to the crash:
Add a user-space ISR handler.
/***********************************************************************
 * Task to transfer data from astropci on interrupt
 ***********************************************************************/
void interrupt_task(void *cookie) {
    int err;
    while(!exc.endProg){
        // blocking interrupt handler
        err = rt_intr_wait(&intr_desc, TM_INFINITE);
        if (0 >= err) {
            rt_printf("Timeout on data interrupt!!!\n");
            break;
        }
        //else rt_printf("Interrupt OK\n");
        rt_sem_v(&switch_sem);
    }
}
Any IRQ20 sent from the astropci board will trigger a semaphore.
I put the semaphore just before the ioctl while loop:
// blocking semaphore from interrupt handler
err = rt_sem_p(&switch_sem, timeout);
astropci_reply_prev=astropci_reply;
// poll astropci (some latency here.. will need to average
while (astropci_reply == astropci_reply_prev){
    ioctl(pci_fd, ASTROPCI_GET_FRAMES_READ, &astropci_reply);
}
The "HCVR not ready" error no longer shows up in the kernel messages. After a while I get a kernel crash, probably due to context switching from Xenomai to the Linux kernel.
31/07/10
The program seems to crash after no interrupt is received (the semaphore times out after 10 seconds). Does this have anything to do with the "kernel:Disabling IRQ #20" message?
I commented out the spin_lock in the driver as a test, since it may conflict with Xenomai. In practice the program crashes with either the interrupt method or the ioctl method.
01/08/10
Maybe converting the old driver to kernel 2.6.32 is actually easier than debugging the new driver, so I try to compile the old driver. I copy the kernel dir in my home to kernel.old and copy the CHAMP kernel dir to my home dir. Now copy the new Makefile from kernel to kernel_old and compile the driver:
make -C /lib/modules/2.6.32.11-xenomai-2.5.3/build SUBDIRS=/home/ep41/kernel modules -lnative -lrtdm
make[1]: Entering directory `/usr/src/linux-2.6.32.11'
CC [M] /home/ep41/kernel/astropci.o
In file included from /home/ep41/kernel/astropci.c:65:
/usr/include/xenomai/native/task.h: In function ‘rt_task_spawn’:
/usr/include/xenomai/native/task.h:317: warning: ‘rt_task_create’ is deprecated (declared at /usr/include/xenomai/native/task.h:250)
/home/ep41/kernel/astropci.c: In function ‘astropci_init’:
/home/ep41/kernel/astropci.c:193: error: implicit declaration of function ‘pci_module_init’
/home/ep41/kernel/astropci.c: In function ‘astropci_exit’:
/home/ep41/kernel/astropci.c:290: error: void value not ignored as it ought to be
/home/ep41/kernel/astropci.c: In function ‘__astropci_isr’:
/home/ep41/kernel/astropci.c:829: warning: initialization makes integer from pointer without a cast
make[2]: *** [/home/ep41/kernel/astropci.o] Error 1
make[1]: *** [_module_/home/ep41/kernel] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.32.11'
make: *** [default] Error 2
OK, pci_module_init has been replaced by pci_register_driver, so I just need to change the name. At line 290 I simply remove the variable int ret and all references to it, since unregister_chrdev now returns void.
- ret = unregister_chrdev(major[i], board[i]);
+ unregister_chrdev(major[i], board[i]);
02/08/10
I want RTscheduler to be executable by the user "observe" instead of root. To do this I need to pass the parameter xeno_nucleus.xenomai_gid=1002 to the kernel. This needs to be added to the kernel line in /boot/grub/menu.lst:
kernel /boot/vmlinuz-2.6.32.11-xenomai-2.5.3 root=LABEL=ROOTFS-XENO ro noht mem=900M memmap=1080M$900M xeno_nucleus.xenomai_gid=1002
I also need to add observe to the xenomai group:
usermod -G xenomai observe
Now observe can execute RTscheduler 60000.
I am sick of this wiki. I lost again part of my log that I thought I saved. It seems as if saving is not enough. You have to close the complete session!
The modifications to the kernel driver were successful. I also had to use the rt_sem again since the ioctl crashed the machine. The driver still needs to be modified, since it uses the plain kernel framework, which is now strongly deprecated (need to use RTDM).
Another problem is the lack of communication between RTscheduler and spooler, which I investigate now.
Added wrappers to rt_pipe_create and rt_pipe_write in order to display meaningful error messages. Running RTscheduler I find that the error returned by rt_pipe_write is "out of memory". This means that the spooler is not emptying the pipe.
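The wrapper idea can be sketched in plain C. This is not the code used in RTscheduler; rt_error_text and report_rt_call are hypothetical names, and the sketch assumes the wrapped call reports failure as a negative errno value (for example -ENOMEM for "out of memory"). It does not call Xenomai itself, so it can be tried on any Linux machine.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn a negative-errno return value into text. */
static const char *rt_error_text(int ret)
{
    return ret < 0 ? strerror(-ret) : "success";
}

/* Hypothetical wrapper: log a failed call on stderr and pass the
 * return value through unchanged so the caller can still test it. */
static int report_rt_call(const char *what, int ret)
{
    if (ret < 0)
        fprintf(stderr, "%s failed: %s (%d)\n", what, rt_error_text(ret), ret);
    return ret;
}
```

The return value of a pipe call would be passed as the second argument, with the call name as the first. Passing the value through unchanged means existing error checks in the caller keep working.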
03/08/10
Modified the spooler code to open /dev/rtpXX instead of /proc/xenomai/register/pipe/pipeXX (which is the recommended way). Spooler cannot open the device. I try as root. Bingo! Permission problem. I need to add spooler to the xenomai group:
usermod -G xenomai spooler
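A permission problem like this is easier to spot if the open call reports errno explicitly. This is a minimal sketch, not the spooler code: open_device is a hypothetical helper, and the hint about group membership is only an assumption based on the fix above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open a device node read-only. On failure, print
 * the reason on stderr and return -errno; on success, return the fd. */
static int open_device(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        int err = errno;  /* save errno before calling other functions */
        fprintf(stderr, "cannot open %s: %s%s\n", path, strerror(err),
                err == EACCES ? " (check group membership)" : "");
        return -err;
    }
    return fd;
}
```

With a check like this, a failure caused by missing access rights shows up as EACCES ("Permission denied") instead of a generic failure to open the device.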
Writing data to disc now works.
2010 Aug 04
This is the output of top after a fresh reboot on the new xenomai mirkwood. Notice how little free memory there is.
Tasks: 63 total, 1 running, 62 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 905100k total, 787496k used, 117604k free, 729684k buffers
Swap: 2650684k total, 0k used, 2650684k free, 16140k cached
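One common way to read these numbers is to add buffers and cache to the free figure, since the kernel can usually reclaim them on demand. Whether that applies on this machine is an assumption, not something checked in this log. A small sketch of the arithmetic, with the values copied from the top output above (mem_kb and reclaimable_kb are hypothetical names):

```c
/* Memory figures in kB, in the order they appear in the top output. */
struct mem_kb {
    long total;
    long used;
    long free;
    long buffers;
    long cached;
};

/* Rule-of-thumb estimate of memory available to applications:
 * free plus buffers plus page cache. Not an exact figure. */
static long reclaimable_kb(const struct mem_kb *m)
{
    return m->free + m->buffers + m->cached;
}
```

With total = 905100, used = 787496, free = 117604, buffers = 729684 and cached = 16140, the estimate is 863428k.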