
SDK "streamon" format

biometrics

Hi everyone,

I'm new to Tello and the SDK. I've gotten to the point where I can control the drone via my Android app. Now I want to display the video stream. After issuing the "streamon" command I'm receiving data on port 11111.

From what I've read it's raw H.264. The one Python script I've seen that uses the "streamon" command appends each packet to a buffer until it receives a packet whose size is not 1460 bytes, then passes the accumulated data to an H.264 decoder to extract the frames.
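In Python terms, what that script does is roughly this (a sketch of my understanding, untested; handle_frame is a hypothetical stand-in for whatever the decoder entry point is):

Code:
import socket

# After sending "streamon", the Tello pushes raw H.264 to UDP port 11111.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('', 11111))

frame = bytearray()
while True:
    packet, _ = sock.recvfrom(2048)
    frame += packet
    if len(packet) != 1460:          # a short packet ends the current frame
        handle_frame(bytes(frame))   # hypothetical: hand off to a decoder
        frame = bytearray()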

I tried that and tested it with ffmpeg to extract the frames, but it didn't find any (ffmpeg -i "frames.h264" -c:v copy -f mp4 "frames.mp4").

I don't want to use FFmpeg/MPlayer/VLC or similar video players; I want to do it myself.

So my question is, what is the format of the data I'm receiving?
 
I got a Xiaomi Mi WiFi Extender 2, so now I can connect both my PC and my phone to the drone via the extender and use the debugger on my PC.

I've read through the "What's possible?" thread (Tello. Whats possible?) and looked at the various posted code examples and have made some changes to my code. One example looked for the first four bytes of a packet to be the start code 0x00000001 and the fifth byte's NAL unit type to be 0x07 (an SPS), then set an "SPS received" flag and from then on wrote the packet data to a file. (Note: I am using the SDK, so I don't need to strip off the first two bytes.)
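In code, that check would look something like this (my Python sketch of the example's logic; the & 0x1F masks the NAL header byte down to its 5-bit unit type, type 7 being the SPS):

Code:
sps_received = False

def on_packet(packet, out_file):
    # Wait for the first SPS NAL unit before recording anything,
    # since a decoder can't start mid-stream without it.
    global sps_received
    if (not sps_received and packet[:4] == b'\x00\x00\x00\x01'
            and (packet[4] & 0x1F) == 7):
        sps_received = True
    if sps_received:
        out_file.write(packet)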

I tried that as well. Using the same FFmpeg command to extract an .mp4 (ffmpeg -i "frames.h264" -c:v copy -f mp4 "frames.mp4") I get 1 second of video, but only the top 10% of the image.

Any help would be appreciated.
 
@hellowill89 @biometrics it looks like we are all trying to figure this out at the same time. Maybe we can help each other with our findings and specific needs. I'm having a really hard time grokking this whole video-streaming thing. Do any of you guys actually understand video encoding and decoding at a low level? I'm trying to package the H.264 feed into an MP4 format that is compatible with web browsers' Media Source Extensions API (ISO BMFF), and I'm reading stuff about moov, moof and mdat and it's all Greek to me. Do you guys understand this stuff?
 
@Neoflash, my situation is different since I'm making a mobile app and using Android Java instead of Python. It was extremely difficult to get working, but ultimately it worked for me. Here's the basic idea with the video:

Video comes in packets, which are slices of a frame. A packet with a data length of less than 1460 bytes indicates the last packet of a frame. Once a whole frame is collected, it can be passed to a decoder (an H.264 decoder in Python, or MediaCodec on Android), and the output is a single decoded YUV420 frame. In my case, I then convert the YUV420 frame to a Bitmap (an Android concept) to preview and process it.
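On Android the decoder is MediaCodec; in Python something like PyAV can play the same role. A rough, untested sketch just to show the shape of the hand-off (one assembled frame in, zero or more decoded YUV420 frames out):

Code:
import av  # PyAV: ships an H.264 decoder, analogous to MediaCodec

codec = av.CodecContext.create('h264', 'r')

def decode(frame_data):
    # parse() splits the raw byte stream into packets,
    # decode() turns each packet into raw video frames.
    for packet in codec.parse(frame_data):
        for frame in codec.decode(packet):
            yield frame.to_ndarray(format='yuv420p')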
 
Very interesting. So in reality you are splitting the stream into single images, am I understanding this correctly? And what the heck is YUV420?
 
A single frame is decoded at a time, yes. The frame was composed of the data from a few packets. Also note there is no header information in the packet, just data. YUV420 is a common intermediate format for cameras, which allows for fast processing in theory. For example, it is a format which works well with preview surfaces (an Android thing). In my case, I actually convert it to a Bitmap (a larger image format) so that I can do some processing on it.
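To make YUV420 concrete: the planar variant (YUV420p) is a full-resolution Y (luma) plane followed by two quarter-resolution U and V (chroma) planes, 1.5 bytes per pixel total. Getting RGB out is a linear transform per pixel. A numpy sketch (assuming BT.601 coefficients; on Android the platform does this for me):

Code:
import numpy as np

def yuv420p_to_rgb(buf, w, h):
    # Planar layout: w*h bytes of Y, then w*h/4 bytes each of U and V.
    y = np.frombuffer(buf, np.uint8, w*h).reshape(h, w).astype(np.float32)
    u = np.frombuffer(buf, np.uint8, w*h//4, w*h).reshape(h//2, w//2)
    v = np.frombuffer(buf, np.uint8, w*h//4, w*h + w*h//4).reshape(h//2, w//2)
    # Upsample chroma to full resolution and center it around zero.
    u = u.repeat(2, 0).repeat(2, 1).astype(np.float32) - 128
    v = v.repeat(2, 0).repeat(2, 1).astype(np.float32) - 128
    r = y + 1.402 * v
    g = y - 0.344 * u - 0.714 * v
    b = y + 1.772 * u
    return np.clip(np.dstack([r, g, b]), 0, 255).astype(np.uint8)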
 
Ok, ok, this is starting to make some sense. I think I might just drop the idea of trying to feed a <video> element an MP4 using the Media Source Extensions API and simply use a <canvas> element instead and feed it a stream of decoded frames. It means I'll have to code my own player, but that shouldn't be too complicated and it gives me more control. I guess I have to devise some kind of mechanism to deal with frame rate, drift and the like to keep latency as low as possible. Man, this is turning out to be a real hassle. Thanks for the help though. Mind if I hit you up if I have further questions?
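The pacing mechanism I have in mind is basically a one-slot buffer: the renderer always takes the newest decoded frame and anything it didn't get to is dropped. A Python sketch of the idea (my real version would be JS feeding the canvas, but the structure is the same):

Code:
import threading

class LatestFrame:
    # One-slot buffer: the display always gets the newest frame;
    # frames that were never displayed are silently overwritten.
    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None

    def put(self, frame):                 # called by the decoder
        with self._cond:
            self._frame = frame
            self._cond.notify()

    def get(self):                        # called by the render loop
        with self._cond:
            while self._frame is None:
                self._cond.wait()
            frame, self._frame = self._frame, None
            return frame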
 
Yes, this was very difficult. I'm not looking forward to getting photo capture to work... Yes, you can message me with questions. Also, this link I shared above: dji-sdk/Tello-Python takes you to the specific line where frame segmentation occurs.
 
Yeah, I had seen it, I just couldn't really understand why they were doing that. Your explanation made it clear. Here's a question for you: I have the option of decoding the individual H.264 frames on the server (Node.js) or directly on the client (web browser). Where would you do it? If I do it on the client it will have to be a JavaScript decoder, which I'm told works quite well; if I do it on the server it could be any decoder implementation really, since I can just run it in a child process if it's not a JS decoder.
 
After decoding, a frame is much larger in memory, so take that into account. If you are decoding on the server and then sending the decoded frames to the client, that is probably the wrong approach. The client can decode.
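To put rough numbers on it: assuming the Tello's 960x720 stream, a decoded YUV420 frame is 960 × 720 × 1.5 ≈ 1 MB, so at 30 fps that's on the order of 30 MB/s of raw pixels going to the client, versus the compressed H.264 stream at a small fraction of that.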
 
I have been down most of these paths. I was trying to get the raw video to play in a browser window but never managed to make it work. No matter what I did I always came back to needing a server to take the video and do something with it.

It seems you are looking into how to decode the video in JavaScript. If I remember right there are two issues with that. The first is that it is way too slow, and the second is that most JavaScript decoders only support the H.264 baseline profile, and I believe (though I'm not certain) that the Tello uses more than baseline.

The only approach I think I didn't take was using JavaScript to munge the incoming H.264 data and feeding it into the Media Source Extensions API for playback. That might work, but it would require knowing more about both video encoding and MSE than I was up for.
 
@Krag AMEN brother! That's what I have been trying to do for the past two weeks, and I've never been this frustrated in my life. I've tried everything I know of and can understand, and I can't find anyone with enough MSE and video encoding/decoding knowledge to help me out. So I'm officially giving up on the MSE option and going to try using a canvas element with decoded frames instead.

As for the performance and baseline-profile problems, have you tried this decoder? It seems to work with the main profile: Decode and play your H264 videos in JavaScript
 
I looked at maybe half a dozen different options; I don't remember trying that one. It doesn't seem to work for me in Chrome. The issue with Emscripten approaches was usually speed.

Because browsers don't do raw UDP (WebSockets are TCP-based), you can't connect directly to the Tello from the browser anyway. So at the end of the day you are going to have to have something serving the video data. I figured that was a better place to process it than trying to do it in the browser.
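The server can be pretty minimal, though: it just owns the UDP socket and pipes the bytes on. A sketch in Python with the websockets package (mine was in Node, but it had the same shape; recent websockets versions take a single-argument handler like this):

Code:
import asyncio, socket
import websockets  # pip install websockets

async def relay(ws):
    # A browser connects here; we forward the Tello's video bytes to it.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('', 11111))
    sock.setblocking(False)
    loop = asyncio.get_running_loop()
    while True:
        packet = await loop.sock_recv(sock, 2048)
        await ws.send(packet)  # one binary message per UDP packet

async def main():
    async with websockets.serve(relay, 'localhost', 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())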
 
You are right about that. Where I'm hoping to squeeze out a few ms of latency is by using RTCDataChannels instead of WebSockets to get the data from the server to the client browser. RTCDataChannels essentially run on UDP, instead of WebSockets' TCP.

I'll do some tests with both client-side and server-side decoding and see which gives me the best performance. Either way, if there is a bottleneck (frames coming in from the camera much faster than the browser/server can decode them), I'm thinking I could try splitting the stream amongst several workers (threads). That might not work all that well on older machines and low-end mobile devices, but really, who the hell owns a drone but doesn't have a modern computer or phone?

Right now, all I want is to get a #$%% live-ish video feed showing up in the browser, even if I get 1 or 2 seconds of latency. The best I was able to do with the god-forsaken MSE solution was around 10-15 seconds of latency, and I had to use ffmpeg to convert the stream to webm/vp9 with a tiny resolution and bitrate because I never got MSE to play a goddamn MP4.
 
I just have to say: generally, try to allocate as few objects as possible when doing these operations, as they are intensive. Always reuse packet and frame buffers (byte[]). I know that in JavaScript everything is an object, but if you can stick to primitives in these hot paths, so much the better.
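In Python terms that means recv_into with preallocated buffers, instead of letting recvfrom hand you a fresh bytes object per packet. A sketch (handle_frame again being a hypothetical consumer):

Code:
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('', 11111))

packet_buf = bytearray(2048)       # reused for every packet
frame_buf = bytearray(256 * 1024)  # reused for every frame
frame_len = 0

while True:
    n = sock.recv_into(packet_buf)  # fills the buffer, no allocation
    frame_buf[frame_len:frame_len + n] = memoryview(packet_buf)[:n]
    frame_len += n
    if n != 1460:                   # end of frame
        handle_frame(memoryview(frame_buf)[:frame_len])
        frame_len = 0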
 
Low latency is a whole other story. You'll find that nothing happens as soon as possible; it's all designed to provide smooth playback, not real time. If you want low latency you need to use WebRTC. And that will also make your head hurt.
 
I am using WebRTC, but not the built-in media transport (MediaStreams); I'm using the generic data transport (DataChannels), which is technically SCTP but, when configured properly, ends up behaving essentially like UDP. It doesn't get better than this for real-time live video streaming; this really is the best transport option at the moment. Now I just have to find the most efficient way to decode the H.264 frames.
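Concretely, the two options that make a DataChannel behave like UDP are unordered delivery and zero retransmits. Shown here with Python's aiortc, since that's the language of the Tello examples; in the browser it's the same two fields passed to createDataChannel:

Code:
from aiortc import RTCPeerConnection

pc = RTCPeerConnection()
# ordered=False + maxRetransmits=0: no head-of-line blocking and no
# resends, so a late or lost frame is simply dropped - UDP semantics.
channel = pc.createDataChannel('video', ordered=False, maxRetransmits=0)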
 
I would try to use the media streams. They deal with the other part of the problem, which is getting the frames displayed as quickly as possible; the browser's media player isn't designed for that. Unless you really are able to decode to a canvas. Even then, make sure the decoder supports delivering a frame immediately or skipping frames.
 
