For 3D VR videos, this would be useful for adjusting IPD for every person, rather than using the static IPD of the camera setup. Also, allowing just a little bit of head movement would really increase the immersiveness. I don't need to travel long distances inside the video. If the video is already filmed with a static stereo setup, it would be even easier to reconstruct an accurate 4D video limited to short travel distances without glaring errors.
https://augmentedperception.github.io/deepviewvideo/
We've been waiting 4 years. I just don't understand what is taking so long.
Even at a low resolution, the difference is night and day. Even with a very small window, this is a leap forward for VR immersion. Why in the hell is no one using it?!?
>Why in the hell is no one using it?!?
As they state, it requires a 46-camera rig to record, immense computational power to compress the data into something usable, and a 1 Gb/s data stream, which means there is nearly zero commercial use, and certainly not widespread use, at those specs.
Oh, and it's covered in a patent minefield. You can check the author names to see. Pretty much every paper from Google gets covered in them, so be wary of implementing anything you read in their papers without a serious patent search. This one took me 10 seconds.
Not every whizzbang thing you see in a paper or demo is missing from consumer products because of ignorance. There is always a reason, usually one you have not discovered yet.
Is there demand?
I can't say I've had any good experiences with VR video... it's incredibly natural for gaming, but for scripted video having my head control the camera dilutes the experience for me: you either lose a key part of the language of cinema, or you make people sick by periodically taking control away from them.
The absolute best application of VR video I can imagine is immersive theatre, and I think videogame engines are accomplishing that very well with modern performance capture.
Is there an experience you'd recommend that you think could change my mind?
For the most part, it's called porn, and you already know whether or not you are interested.
Beyond that, though, there is a lot of potential just past the horizon. Check out the demo for the paper I linked. Sure, you can't move around very far, but you can move enough to be part of the scene, unlike the usual tripod video setup.
I didn't read the paper, but am familiar with: https://mpeg-miv.org/
One challenge here is dedicated hardware support: running any sort of video processing/decoding on the CPU is unfeasible (less so on the GPU), particularly when the context is a battery-powered device (Quest 3, Apple Vision Pro) with high resolution and framerate (6K-8K, 60fps).
That doesn't even get into the logistics of capturing, serving, and providing software support for this video data, either.
It's not just a paper. They have a working demo. IDK what its hardware requirements are, but they are certainly lower than the specs of a decent VR rig.
Even with its relatively low resolution, the demo blew me away. Given a choice between current 8K@120 content and something equivalent to this 4 year old demo, I would choose the latter hands down.
Possibly because it is hard to make a video streaming service that requires over 300 Mbit/s of bandwidth commercially viable.
A breakthrough in insect-eye-like cameras.
One thing I liked about Team Ico (the studio behind the video games Shadow of the Colossus, Ico, and The Last Guardian) was how the player can move the camera just a little bit during automated sequences.
Getting that kind of look around in a video scene would be really engaging. A bit different from VR or watching in The Sphere, with the engagement being that there are still things just out of view that you have to pan the camera for.
> Getting that kind of look around in a video scene would be really engaging.
It might be interesting for one or two movies specifically built around the feature, but otherwise it would be a gimmick no one would care for. For games, sure, but movies are a different experience.
Maybe, but for the last half decade nearly my entire social circle cannot sit still for movies and just won't go out of their way to see one anymore. I'm big into cinema, but they are not. Treating this like the fidget-spinning surrogate that a large portion of the population relies upon could potentially make it a hit for some viewing experiences. It's a thesis I would pursue, for money, at least.
This would make it unbearable to watch a movie with anyone else, so it doesn’t really solve the social issue. But even if you’re watching it alone, it only really makes sense if the movie itself takes advantage of it in some interesting way, which starts to get into game territory. It wouldn’t even work for the most part: how do you deal with cuts and changes of scenery? It makes no sense in the context of a movie; what you’re looking for is a game and we can already do that.
Maybe you could have it work as a documentary (good luck getting a bored social group to go for that) or a virtual tour, but we already have 3D interactions of those too.
We’ve had tons of movie viewing experiments and ultimately always go back to the tried and true 2D screen, with the bolder ideas being relegated mostly to the domain of theme park gimmicks. Which are interesting in their own right, but don’t survive on their own.
Yes, my main use case would be solo fidgeters, just like those Team Ico games.
> how do you deal with cuts and changes of scenery?
The same way the games did it: by doing nothing special at all and retaining the same functionality. It really depends on how this 4D reconstruction works before I could say it uniquely and adversely affects the experience.
For the most part, what's interesting to me is that the overhead costs seem low enough not to care about the random things big studios did at great expense with no way to justify the market appeal. It's either a portfolio piece or 1,000 monthly users supporting my lifestyle indefinitely.
Haven't played the other games but Ico was incredible. It gave me the same feeling as Another World which was maybe 10 years prior.
I agree. I think that this is similar to the appeal of old-school stereoscopy.
Our children will be so weirded out by Blade Runner. Not by the zoom into the picture, but by the fact that the guy believes in hallucinated data.
Who says that the recording medium simply didn't have 1 petapixel resolution? Or its analog analog, to stick with the movie.
Whenever I play a video game (Monster Hunter World comes to mind immediately) and see an establishing shot with a moving camera (like the ones demoed on their web site), I think the game really wants to run in a VR headset where you can walk around and see different angles.
(Funny, there is a VR mod for Monster Hunter Rise, which makes me think just how fun Monster Hunter VR would be.)
General rule in video games: Assume anything you can’t see isn’t there and anything you can see is fake. Any cutscene where you don’t control the camera is almost certainly a janky mess (or at least not polished) from other perspectives.
There are exceptions but not many. It takes a ton more work to make a scene look good from any arbitrary viewpoint.
I was wondering how they were getting depth from a video where the camera is still.
> we utilize a comprehensive set of data-driven priors, including monocular depth maps
> Our method relies on off-the-shelf methods, e.g., mono-depth estimation, which can be incorrect.
Actually, of the examples they showed, all but one clip featured both camera and in-camera motion. Granted, not a lot of the former, but in my non-expert opinion, maybe enough to construct a disparity map.
I imagine having stereo video would also help generate a depth map from disparity?
Yeah, I got my terms confused. My bad. Disparity is, I believe, only possible to get from a stereo pair. TFA presents something closer to a camera track, albeit a very short one. From a camera track it is a short hop to extracting a point cloud. What makes the authors' approach possibly unique is that the foreground object is in motion as well as the camera. In standard VFX, foreground-object motion is usually avoided.
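For reference, a minimal sketch of the textbook depth-from-disparity relation for a rectified stereo pair (the focal length and baseline numbers below are made up for illustration, not from TFA):

    import numpy as np

    def depth_from_disparity(disparity_px, focal_px, baseline_m):
        # Textbook pinhole relation for a rectified stereo pair: depth = f * B / d.
        d = np.asarray(disparity_px, dtype=np.float64)
        depth = np.full_like(d, np.inf)   # zero disparity -> point at infinity
        depth[d > 0] = focal_px * baseline_m / d[d > 0]
        return depth

    # Made-up numbers: 1000 px focal length, 65 mm baseline (roughly human IPD).
    print(depth_from_disparity([20.0, 5.0, 0.0], focal_px=1000.0, baseline_m=0.065))
    # -> [ 3.25  13.   inf]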
This reminds me of the description of "Disneys" (future movies) in Cloud Atlas. The movie had a good visualization; this feels like that.
I liked Cloud Atlas, I should watch it again. It was weird and ambitious.
Out of curiosity, what is the difference between 4D and 6DoF (six degrees of freedom)? Sounds a lot like the 6DoF work that Lytro did back in 2012, although this is obviously coming at the problem from the other direction, generating it rather than capturing it.
Lytro added 2 angular dimensions of info to 2D image capture: the angles the light was traveling at when it entered the camera. They could simulate the image with different camera parameters, which was good for changing depth of field after the fact, but the occlusion information was limited by the diameter of the aperture. They tried to make depth maps, but that extra data was not a silver bullet. As far as I could tell, they were still fundamentally COLMAPing; they just had extra hints to guess with.
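To make the "changing depth of field after the fact" part concrete, here is a toy shift-and-sum refocus over the sub-aperture views of a light field. It's a sketch of the general idea, not Lytro's actual pipeline, and it rounds the per-view shifts to whole pixels:

    import numpy as np

    def refocus(subaperture, alpha):
        # subaperture: (U, V, H, W) array, one view per lens position (u, v).
        # alpha: relative focus depth; alpha = 1.0 keeps the original focal plane.
        U, V, H, W = subaperture.shape
        out = np.zeros((H, W))
        for u in range(U):
            for v in range(V):
                # Views further from the aperture centre get proportionally larger shifts.
                du = int(round((u - U // 2) * (1.0 - 1.0 / alpha)))
                dv = int(round((v - V // 2) * (1.0 - 1.0 / alpha)))
                out += np.roll(subaperture[u, v], shift=(du, dv), axis=(0, 1))
        return out / (U * V)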
This is spot-on. Note that the aperture on the camera was quite large, I want to say on the order of 100mm? They sourced really exotic hardware for that cinema camera.
They also had the "Immerge," which was a ~1m diameter, hexagonal array of 2D cameras. They got the 4D data from having a 2D (spatially distributed) array of 2D samples (each camera's views). It's undersampled, because they threw out most of the light, but using 3D as a prior for reconstructing missing rays is generally pretty effective.
But I also understand a lot of what they demoed at first was smoke and mirrors, plus a lot of traditional 3D VFX workflows. Still impressive as hell at the time, it's just that the tech has progressed significantly since ~2018.
I got a Lytro Illum off eBay at a reasonable price, but it is a bit of a white elephant. I was hoping to shoot stereograms, but I haven't been able to do it with the stock software (I just get two images that look the same, with no clear disparity).
I've seen open-source software for plenoptic images which might be able to generate a point cloud, but I've only gotten one good shot out of the Lytro, which was similar to a shot I took with this crazy lens:
https://7artisans.store/products/7artisans-50mm-f0-95-large-...
The scene itself moves over time, hence the 4D. Vanilla Gaussian splatting already gives you 6 degrees of freedom, since you have a full 3D scene.
Move in 3D space + rotate in 3D space, I think.
But w/ time should it be 7?
The results are impressive, but what makes this 4D? Where’s the extra dimension and how is it relevant to 3D human beings?
The reconstruction is a 3-dimensional scene that has animation contained in it.
You can move a virtual camera 3-dimensionally within the scene at any individual frame (x, y, z), and also move the scene through its animation to play the animation forwards and backwards (in other words, you move the camera through the 'time' axis).
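A hypothetical interface sketch of what that buys you over a normal video player (all of the names here are made up for illustration; none come from the paper):

    from dataclasses import dataclass

    @dataclass
    class ViewRequest:
        position: tuple      # virtual camera position (x, y, z), free to change per frame
        orientation: tuple   # virtual camera rotation (yaw, pitch, roll)
        t: float             # the fourth axis: seconds into the captured animation

    def render(scene, req: ViewRequest):
        # A plain video player is parameterised only by t; a 4D reconstruction
        # is queried with a full camera pose *and* t.
        geometry = scene.at_time(req.t)   # assumed method: scene state at time t
        return geometry.rasterize(req.position, req.orientation)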
> in other words, you move the camera through the 'time' axis
So, like the scrubber in any video? Doesn’t feel like that warrants the 4D moniker. Which is not to say you’re not right, I think you are and that’s what they mean, but that being the case it feels more like a buzzword than anything.
Yes, it’s like a video scrubber. But it’s perhaps more like the timeline in Blender/Maya, and maybe even Cinema4D ;)
You are correct though, they both serve the same function.
However, I’ve yet to see a video player that lets you reposition the camera as if you were using photo mode in a video game. That’s (essentially) what this thing offers.
I think it means that, given a normal flat 2D video, you get back that video but as a 3D scene, meaning you can move and pan the camera around as the 3D video plays. And I guess they call it 4D since you had a flat 2D video + time dimension, so 3D video + time dimension = 4 dimensions.
This work is about taking an input with 2 spatial dimensions plus 1 time dimension, and synthesizing a (limited) model with 3 spatial dimensions plus 1 time dimension.
3D over time is colloquially called "4D;" though we don't call video "3D" by analogy as the term binds strongly to its purely spatial use.
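A shape-level illustration of that mapping (a dense voxel grid is just the simplest stand-in for "3 spatial dims + time"; real methods use radiance fields or Gaussians rather than voxels, and the shapes here are arbitrary):

    import numpy as np

    # Input: an ordinary video -- 2 spatial dimensions + time (+ colour channels).
    video = np.zeros((90, 270, 480, 3), dtype=np.uint8)                # (T, H, W, RGB)

    # Output (conceptually): something queryable at any 3D point and any time.
    volume_over_time = np.zeros((90, 32, 32, 32, 4), dtype=np.float16) # (T, X, Y, Z, RGBA)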
> 3D over time is colloquially called "4D;"
Colloquially, meaning “used in or characteristic of familiar and informal conversation”: 4D films have a definition, and that ain’t it. 3D over time is colloquially still referred to as 3D, as evidenced by decades of 3D blockbusters.
https://en.wikipedia.org/wiki/3D_film
https://en.wikipedia.org/wiki/4D_film
Re: relevance, one of the prospective uses of work like this is in conversion of "flat" conventional video into "spatial" video, e.g. as is popular on the Apple Vision Pro.
I've been interested in the state of the art in that domain myself, having thousands of 2D videos I've shot which I would love to see "spatialized" well, someday.
Time
By that logic all videos would be (at least) 3D. But no one would take you seriously if you said that.
Videos are already 4D.
There's the 2D frame and the time dimension. Then there's the structural information conveyed by motion, parallax, scene composition, camera movement, etc.
That's why there's the 180 rule, amongst other things.
Algorithms can take a video and turn it into a 4D volume. As can our brains.
The input videos already have that dimension, so that can't be the answer.
I agree that it shouldn't be, but it is apparent that it (redundantly) is.
We are all 4-dimensional beings on this fine day.
We're 3+1 dimensional beings. Time doesn't have the same metric as spatial dimensions, so you can't add them together. You can't rotate a temporal object along the xt-plane, for example, nor can you speak about an object's length along the t-axis. The three spatial dimensions are interchangeable, but time is special, so calling it 4D is incorrect.
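To put an equation behind "doesn't have the same metric": in special relativity the interval between events treats time with the opposite sign (and a factor of c), which is exactly why you can't freely rotate between x and t the way you can between x and y.

    % Euclidean distance treats x, y, z symmetrically:
    \[ d\ell^2 = dx^2 + dy^2 + dz^2 \]
    % The spacetime interval does not -- time enters with opposite sign:
    \[ ds^2 = -c^2\,dt^2 + dx^2 + dy^2 + dz^2 \]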
the first HyperNeRF cat video is quite interesting-looking and surreal!
He was having a meowt of body experience.
Looks flat to me!